Roadmap
From proof-of-concept to production across 664 languages. Each tier represents increasing data completeness and validation depth.
Tier Progression: T0 → T5
Alphabet — Phonological Foundation
COMPLETE — 664 languagesComplete phonological inventory for every registered Bantu language — vowels, consonants, digraphs, syllable structures, tone system classification, morphophonemic rules. 8 standardized JSON files per language. The foundation no frontier LLM can reproduce without errors. Amina syllable inventories seeded from phonological scaffold; ready for speaker recruitment.
Proof of Concept
COMPLETE — 664 languages · 1M+ recordsFull-stack BantuNomics–Amina language data infrastructure. Language-agnostic engines generating morphological data from AI-assembled cartridges — 50 anchor verb roots, 50-84 nouns, complete morpheme inventories. E1, NUMBERS, and SYSLEARN generators active. 40-50 files per language.
Amina integration begins here. Three recording streams seeded: syllable grids (140+ syllables/language), tonal word pairs (minimal pairs with H/L tone patterns), and tonal sentence pairs (6 prosodic moods). Speaker onboarding active with explicit consent, 48kHz mono WAV capture, acoustic quality validation. All recordings documented, consent-based, provenance-tracked from first capture.
Production Level I
1 language (Bemba) · 250M+ recordsHuman-reviewed, natively verified production cartridge. 1,100+ verb roots (Hoch dictionary), 26 active generators. Full D-series training generators, SENTGEN operational (84 templates, 7 categories, code-switching). MRS certified — 132 stress-test records, 100% tonal accuracy.
Deepened Amina hybrid integration. Speaker calibration profiles built from syllable recordings — per-speaker F₀ baselines, vowel formants, onset perturbation factors. Tone pair recordings validated through 3 lines of defense: per-recording acoustic analysis, cross-variant pair validation, multi-rater human consensus. Validation Passports issued. Recordings feed back into BTS for thesis validation. All data documented, consent-based, provenance-tracked. Commercial license available: full corpus with speaker metadata.
Production Level II — Enhanced
TARGET · 10B+ recordsAll core generators battle-tested. NS review ≥ 50%. All D-series training generators enabled and verified. Cultural sensitivity review complete. Multi-register coverage (formal, informal, colloquial). Expanded SENTGEN with domain-specific templates.
Amina at scale. ≥ 25% of generators acoustically verified by native speakers. Multi-speaker corpus per language with cross-speaker consistency validation. Provenance chains extend from BTS generation through speaker recording to validated output. Commercial license: full corpus with speaker metadata and complete provenance documentation.
Production Level III — Mastery
FUTURE · 500B+ recordsNS review ≥ 90%. Multi-dialect coverage with morphological rules exhaustively documented. Academic-grade linguistic documentation. Full MRS compositionality testing. SENTGEN with conversations, stories, naturalistic code-switching.
Amina full dialect coverage. Multi-speaker corpus with dialectal variation — every record consent-documented, provenance-tracked, speaker-attributed. Exclusive partnership tier: priority access to new language releases and custom data collection runs.
Reference Implementation — Untouchable
FUTURE · 50T+ recordsComplete linguistic mastery. NS review 100%. Published academic verification. BTS Language Council approval. HFST finite-state transducers compiled and benchmarked. The definitive reference implementation for the language family.
Amina complete corpus. Full speaker corpus — naturalistic speech, domain-specific recordings, complete provenance chain, consent-based, commercially licensed with full speaker metadata. Exclusive partnership: priority access to new language releases, custom data collection runs, and co-development of new language cartridges.
Planned Data Streams
Code-Switching
Recordings of natural language switching between Bantu languages and colonial languages (English, French, Portuguese). Critical for real-world AI deployment.
Conversations
Multi-turn dialogues between native speakers. Tonally annotated with turn-taking prosody and discourse-level tonal patterns.
Stories & Narratives
Extended monologue recordings — folktales, personal narratives, how-to instructions. Captures natural prosodic contours and storytelling register.
Slang & Urban Register
Contemporary urban speech capturing neologisms, borrowed vocabulary, and register variation not found in formal language resources.
Language Expansion Timeline
Production: Bemba (Icibemba)
1,100+ curated roots, 250M+ morphological records, 23,000+ tonal pairs, MRS certified, 100% tonal accuracy.
T1 → T2: Nyanja, Tonga, Lozi, Zulu
POC cartridges complete with pedigree ratings. Speaker validation in progress. Target: human-reviewed production cartridges (T2) by end of year.
T2+ for Top 20 Bantu Languages
Swahili, Shona, Kinyarwanda, Sesotho, Setswana, Xhosa, and others. Generation active with tonal pipeline configured per language.
Registry: All 664 Bantu Language Varieties
Continuous expansion of cartridge coverage. Community-driven cartridge seeding for lower-resource languages.