Data Products
Every record is morphologically decomposed, tonally annotated, and rule-traced. This is not web-scraped text; it is engineered linguistic infrastructure produced by 39 generators across the BTS engine, tonal pipeline, Amina speaker platform, and SENTGEN system.
What data exists today
Phonological Inventories — 664 Languages
T0: The complete phonological scaffold for every registered Bantu language, something no current LLM can produce without errors. Ask any frontier model to list the complete vowel inventory, consonant digraphs, or valid syllable structures for Bemba, Tonga, or Lozi, and it will hallucinate. We have the verified ground truth.
Morphological Training Data — 250M+ Records
T1–T2: Full 9-slot verb decomposition with morpheme labels, slot positions, phonological rules applied, constraint checks, and provenance tracking. Every record carries 8 AI training task types. For Bemba, 151 morphological constraints are verified per form.
Specialized Generator Datasets (Bemba Production)
T2: Beyond verb morphology, the production Bemba cartridge has 26 active generators covering the full grammar.
D-Series — Structured Training Data Generators
T2: 15 generators that transform raw morphological data into structured pedagogical training sequences for AI models.
Tonal Ground Truth — 23,000+ Pairs
T2: Declarative/interrogative acoustic minimal pairs with mora-level H/L tone assignments, F₀ onset perturbation parameters, and embedded testable theses. MRS-certified: 132 stress-test records with 100% tonal accuracy (Gemini 2.5 Flash audit).
SENTGEN — Sentence-Level Training Data
T2: Language-agnostic sentence generator producing bilingual training data with configurable code-switching. 84 templates across 7 categories, 17 lexicon pools (1,059 items), 11 constraint rules. Outputs in 6 AI training formats. Current code-switching setting: 40% Bemba native / 60% English loan on occupation terms.
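A 40/60 split of this kind can be pictured as a weighted draw each time a template slot is filled. The following is only a minimal sketch under stated assumptions, not SENTGEN's actual implementation: the function names, the `{occupation}` placeholder, and the `<bem:...>` / `<eng:...>` stand-in tokens are all hypothetical.

```python
import random
from typing import Optional

NATIVE_RATIO = 0.4  # 40% native / 60% loan, the split quoted above


def pick_term(native: str, loan: str, rng: random.Random,
              native_ratio: float = NATIVE_RATIO) -> str:
    """Return the native term with probability native_ratio, else the loan."""
    return native if rng.random() < native_ratio else loan


def fill_template(template: str, native: str, loan: str,
                  rng: Optional[random.Random] = None,
                  native_ratio: float = NATIVE_RATIO) -> str:
    """Fill a hypothetical {occupation} placeholder with one sampled term."""
    rng = rng or random.Random()
    return template.format(occupation=pick_term(native, loan, rng, native_ratio))


# Stand-in tokens rather than real lexicon entries (assumption).
rng = random.Random(0)  # seeded so a run is reproducible
for _ in range(3):
    print(fill_template("... {occupation} ...", "<bem:teacher>", "<eng:teacher>", rng))
```

Passing a seeded `random.Random` keeps generated batches reproducible, which matters if the ratio is later audited against the output.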
Amina — Speaker-Validated Audio
T1–T2: 48 kHz mono recordings from native speakers across 25+ countries. Every recording validated against engine predictions. Speaker calibration profiles built from isolated syllable recordings. Validation Passports certify which linguistic rules each recording confirms.
What data is coming
Code-Switching Recordings
Naturalistic multi-language recordings capturing switching between Bantu languages and English/French/Portuguese. SENTGEN already supports configurable code-switching ratios (currently 40/60 Bemba/English); the next step is expanding to full naturalistic speech.
Conversations & Stories
Multi-turn dialogues, folktales, personal narratives, and how-to instructions with tonal annotation. SENTGEN templates are expanding to conversational and narrative categories.
Register & Formality Variants
D08 (Register Generator) and D14 (Idiom Generator) — formal, informal, colloquial, and idiomatic variants. Currently blocked pending cultural sensitivity review.
Discourse Sequences
D12 (Discourse Generator) — multi-sentence sequences with topic continuity, pronoun resolution, and tonal discourse markers. Cultural review required before activation.
POC → Production for 4+ Languages
Tonga, Nyanja, Lozi, and Zulu are candidates for T1→T2 promotion. Requires human linguist review, extended root inventories, and NS verification passes.
HFST Transducers for All T1+ Languages
Compiling the first all-Bantu-language HFST system. Offline O(n) morphological analysis and generation. Enterprise clients get compiled binaries for local deployment.
Record Schema
Every record carries full morpheme decomposition, tonal analysis, rule provenance, and 8 AI training task types.
{
  "surface_form": "baleebomba",
  "language": "bem",
  "bts_version": "3.2.1",
  "slots": {
    "subject": {"morph": "ba", "gloss": "3PL", "tone": "H"},
    "tense": {"morph": "ale", "gloss": "PROG", "tone": "L"},
    "object": {"morph": "e", "gloss": "REFL", "tone": "L"},
    "root": {"morph": "bomb", "gloss": "work", "tone": "L"},
    "final_vowel": {"morph": "a", "gloss": "IND", "tone": "L"}
  },
  "tonal_analysis": {
    "surface_tones": ["H", "H", "L", "L", "L"],
    "rules_applied": ["lexical", "binary_spreading"],
    "meeussen_triggered": false
  },
  "constraints_checked": 151,
  "constraints_passed": 151,
  "task_types": ["decomposition", "composition", "translation",
                 "slot_predict", "harmony_predict", "error_detect",
                 "error_correct", "cross_lang_key"],
  "english": "they are working (for themselves)"
}
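To make the `binary_spreading` entry in `rules_applied` concrete, here is a deliberately simplified sketch of bounded rightward H spreading. It assumes one tone per slot position, taken in the order shown in the record, and ignores mora alignment, Meeussen's rule, and any other rule the engine may apply. The function name is hypothetical and this is not the BTS engine's implementation.

```python
from typing import List


def binary_spread(tones: List[str]) -> List[str]:
    """Copy each underlying H exactly one position to the right."""
    surface = list(tones)
    for i, tone in enumerate(tones):
        # Read from the underlying sequence so a spread H does not spread again.
        if tone == "H" and i + 1 < len(tones):
            surface[i + 1] = "H"
    return surface


# Slot tones from the sample record, in slot order (simplifying assumption).
underlying = ["H", "L", "L", "L", "L"]
print(binary_spread(underlying))  # ['H', 'H', 'L', 'L', 'L']
```

Under these assumptions the result coincides with the record's `surface_tones`. A real analysis would operate over moras rather than slots.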
8 AI Training Task Types Per Record
Every morphological record is usable for 8 distinct training objectives simultaneously.
decomposition: Parse surface form into morpheme slots
composition: Combine slots into correct surface form
translation: Translate to/from English with slot alignment
slot_predict: Predict missing morpheme slot
harmony_predict: Predict vowel harmony output
error_detect: Identify grammatical violations
error_correct: Fix malformed words with rule citation
cross_lang_key: Canonical form for cross-language comparison
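As an illustration of how one record can feed several objectives, the sketch below derives input/target pairs for four of the task types named in the sample record's `task_types` field. The pair layout, the hyphen-joined segmentation, the `___` mask, and the `make_examples` helper are assumptions made for this example. The actual export formats are not specified here.

```python
from typing import Dict, List

# Slot order as it appears in the sample record (only 5 of the 9 slots are shown there).
SLOT_ORDER = ["subject", "tense", "object", "root", "final_vowel"]

RECORD = {
    "surface_form": "baleebomba",
    "slots": {
        "subject": {"morph": "ba", "gloss": "3PL"},
        "tense": {"morph": "ale", "gloss": "PROG"},
        "object": {"morph": "e", "gloss": "REFL"},
        "root": {"morph": "bomb", "gloss": "work"},
        "final_vowel": {"morph": "a", "gloss": "IND"},
    },
    "task_types": ["decomposition", "composition", "translation", "slot_predict"],
    "english": "they are working (for themselves)",
}


def make_examples(record: Dict) -> List[Dict[str, str]]:
    """Build hypothetical input/target pairs for the tasks a record lists."""
    slots = record["slots"]
    present = [name for name in SLOT_ORDER if name in slots]
    segmented = "-".join(slots[n]["morph"] for n in present)
    glosses = "-".join(slots[n]["gloss"] for n in present)
    tasks = record["task_types"]
    out = []
    if "decomposition" in tasks:
        out.append({"task": "decomposition", "input": record["surface_form"],
                    "target": f"{segmented} ({glosses})"})
    if "composition" in tasks:
        out.append({"task": "composition", "input": segmented,
                    "target": record["surface_form"]})
    if "translation" in tasks:
        out.append({"task": "translation", "input": record["surface_form"],
                    "target": record["english"]})
    if "slot_predict" in tasks and "root" in slots:
        # Mask the root and ask for it back.
        masked = "-".join("___" if n == "root" else slots[n]["morph"] for n in present)
        out.append({"task": "slot_predict", "input": masked,
                    "target": slots["root"]["morph"]})
    return out


for example in make_examples(RECORD):
    print(example)
```

The remaining task types (harmony prediction, error detection and correction, cross-language keys) would need rule and constraint data that the sample record does not expose, so they are left out.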
Download Sample Data
Free JSONL samples to evaluate data quality, schema structure, and tonal annotations. No account required.
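A quick way to evaluate a sample file is to scan it line by line and flag records that lack expected fields or did not pass every constraint. The sketch below assumes the samples use the field names from the record schema shown on this page. The `REQUIRED_KEYS` set, the helper names, and the in-memory stand-in file are hypothetical.

```python
import io
import json
from typing import Dict, Iterator, List, TextIO, Tuple

# Field names taken from the sample record; assumed to hold for the sample files.
REQUIRED_KEYS = {"surface_form", "language", "slots", "tonal_analysis",
                 "constraints_checked", "constraints_passed", "task_types"}


def check_record(record: Dict) -> List[str]:
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    missing = REQUIRED_KEYS - set(record)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if record.get("constraints_passed") != record.get("constraints_checked"):
        problems.append("not all constraints passed")
    return problems


def scan_jsonl(stream: TextIO) -> Iterator[Tuple[int, List[str]]]:
    """Yield (line_number, problems) for each non-blank JSONL line."""
    for line_no, line in enumerate(stream, start=1):
        if line.strip():
            yield line_no, check_record(json.loads(line))


# In-memory stand-in for a downloaded file: one clean and one faulty record.
good = {"surface_form": "baleebomba", "language": "bem", "slots": {},
        "tonal_analysis": {}, "constraints_checked": 151,
        "constraints_passed": 151, "task_types": ["decomposition"]}
bad = dict(good, constraints_passed=150)
del bad["tonal_analysis"]
SAMPLE = json.dumps(good) + "\n\n" + json.dumps(bad) + "\n"

for line_no, problems in scan_jsonl(io.StringIO(SAMPLE)):
    print(line_no, problems or "ok")
```

For a real file, replace the `io.StringIO` stand-in with `open(path, encoding="utf-8")`.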
Output Formats
SENTGEN exports to 6 AI training formats.
Need custom data, specific language coverage, or enterprise access?