Development Roadmap

Roadmap

From proof-of-concept to production across 664 languages. Each tier represents increasing data completeness and validation depth.

Tier Progression: T0 → T5

Alphabet — Phonological Foundation

COMPLETE — 664 languages

Complete phonological inventory for every registered Bantu language — vowels, consonants, digraphs, syllable structures, tone system classification, morphophonemic rules. 8 standardized JSON files per language. The foundation no frontier LLM can reproduce without errors. Amina syllable inventories seeded from phonological scaffold; ready for speaker recruitment.

Proof of Concept

COMPLETE — 664 languages · 1M+ records

Full-stack BantuNomics–Amina language data infrastructure. Language-agnostic engines generating morphological data from AI-assembled cartridges — 50 anchor verb roots, 50-84 nouns, complete morpheme inventories. E1, NUMBERS, and SYSLEARN generators active. 40-50 files per language.

Amina integration begins here. Three recording streams seeded: syllable grids (140+ syllables/language), tonal word pairs (minimal pairs with H/L tone patterns), and tonal sentence pairs (6 prosodic moods). Speaker onboarding active with explicit consent, 48kHz mono WAV capture, acoustic quality validation. All recordings documented, consent-based, provenance-tracked from first capture.

Production Level I

1 language (Bemba) · 250M+ records

Human-reviewed, natively verified production cartridge. 1,100+ verb roots (Hoch dictionary), 26 active generators. Full D-series training generators, SENTGEN operational (84 templates, 7 categories, code-switching). MRS certified — 132 stress-test records, 100% tonal accuracy.

Deepened Amina hybrid integration. Speaker calibration profiles built from syllable recordings — per-speaker F₀ baselines, vowel formants, onset perturbation factors. Tone pair recordings validated through 3 lines of defense: per-recording acoustic analysis, cross-variant pair validation, multi-rater human consensus. Validation Passports issued. Recordings feed back into BTS for thesis validation. All data documented, consent-based, provenance-tracked. Commercial license available: full corpus with speaker metadata.

Production Level II — Enhanced

TARGET · 10B+ records

All core generators battle-tested. NS review ≥ 50%. All D-series training generators enabled and verified. Cultural sensitivity review complete. Multi-register coverage (formal, informal, colloquial). Expanded SENTGEN with domain-specific templates.

Amina at scale. ≥ 25% of generators acoustically verified by native speakers. Multi-speaker corpus per language with cross-speaker consistency validation. Provenance chains extend from BTS generation through speaker recording to validated output. Commercial license: full corpus with speaker metadata and complete provenance documentation.

Production Level III — Mastery

FUTURE · 500B+ records

NS review ≥ 90%. Multi-dialect coverage with morphological rules exhaustively documented. Academic-grade linguistic documentation. Full MRS compositionality testing. SENTGEN with conversations, stories, naturalistic code-switching.

Amina full dialect coverage. Multi-speaker corpus with dialectal variation — every record consent-documented, provenance-tracked, speaker-attributed. Exclusive partnership tier: priority access to new language releases and custom data collection runs.

Reference Implementation — Untouchable

FUTURE · 50T+ records

Complete linguistic mastery. NS review 100%. Published academic verification. BTS Language Council approval. HFST finite-state transducers compiled and benchmarked. The definitive reference implementation for the language family.

Amina complete corpus. Full speaker corpus — naturalistic speech, domain-specific recordings, complete provenance chain, consent-based, commercially licensed with full speaker metadata. Exclusive partnership: priority access to new language releases, custom data collection runs, and co-development of new language cartridges.

Planned Data Streams

PLANNED

Code-Switching

Recordings of natural language switching between Bantu languages and colonial languages (English, French, Portuguese). Critical for real-world AI deployment.

PLANNED

Conversations

Multi-turn dialogues between native speakers. Tonally annotated with turn-taking prosody and discourse-level tonal patterns.

PLANNED

Stories & Narratives

Extended monologue recordings — folktales, personal narratives, how-to instructions. Captures natural prosodic contours and storytelling register.

PLANNED

Slang & Urban Register

Contemporary urban speech capturing neologisms, borrowed vocabulary, and register variation not found in formal language resources.

Language Expansion Timeline

NOW

Production: Bemba (Icibemba)

1,100+ curated roots, 250M+ morphological records, 23,000+ tonal pairs, MRS certified, 100% tonal accuracy.

2026 H2

T1 → T2: Nyanja, Tonga, Lozi, Zulu

POC cartridges complete with pedigree ratings. Speaker validation in progress. Target: human-reviewed production cartridges (T2) by end of year.

2027

T2+ for Top 20 Bantu Languages

Swahili, Shona, Kinyarwanda, Sesotho, Setswana, Xhosa, and others. Generation active with tonal pipeline configured per language.

ONGOING

Registry: All 664 Bantu Language Varieties

Continuous expansion of cartridge coverage. Community-driven cartridge seeding for lower-resource languages.