Platform Overview
A five-layer production system that generates, validates, and certifies structured training data for any Bantu language. One engine. 664 cartridges. Same architecture.
Architecture Pipeline
The Bantu Technical Standard (BTS) Engine
A language-agnostic engine with 39 registered generators and 16 language-agnostic generators (LAGs). Zero language-specific code — all linguistic data lives in cartridges.
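The cartridge idea can be illustrated with a minimal sketch in which generator code contains no language-specific strings and everything linguistic arrives as data. The cartridge layout, the `register` decorator, and the `noun_number_pairs` generator below are hypothetical names chosen for illustration; they are not the BTS cartridge schema or one of its registered generators.

```python
# Hypothetical cartridges: plain data, one per language.
# The class 1/2 prefixes and the "person" stems are standard textbook examples.
CARTRIDGES = {
    "swahili": {
        "noun_prefixes": {"1": "m", "2": "wa"},
        "noun_stems": [{"stem": "tu", "gloss": "person", "sg": "1", "pl": "2"}],
    },
    "bemba": {
        "noun_prefixes": {"1": "umu", "2": "aba"},
        "noun_stems": [{"stem": "ntu", "gloss": "person", "sg": "1", "pl": "2"}],
    },
}

# Registry of generators, keyed by name (assumed design, not the BTS API).
GENERATORS = {}


def register(name):
    """Decorator that adds a generator function to the registry."""
    def decorator(fn):
        GENERATORS[name] = fn
        return fn
    return decorator


@register("noun_number_pairs")
def noun_number_pairs(cartridge):
    """Yield singular/plural pairs using only data found in the cartridge."""
    prefixes = cartridge["noun_prefixes"]
    for noun in cartridge["noun_stems"]:
        yield {
            "gloss": noun["gloss"],
            "singular": prefixes[noun["sg"]] + noun["stem"],
            "plural": prefixes[noun["pl"]] + noun["stem"],
        }


def run(language, generator_name):
    """Run one registered generator against one language cartridge."""
    return list(GENERATORS[generator_name](CARTRIDGES[language]))


for language in CARTRIDGES:
    print(language, run(language, "noun_number_pairs"))
```

The same function produces mtu/watu for the Swahili data and umuntu/abantu for the Bemba data; supporting another language in this sketch means adding a dictionary entry, not code.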
9-Slot Verb Template
~480K slot combinations per root → ~110K valid forms (22% pass constraint filter) → full morphological decomposition with provenance tracking for every record.
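The combine-then-filter step can be sketched at toy scale. The example below uses four slots instead of nine, a handful of Swahili-like morphemes, and two deliberately simplified constraints; the slot names, the constraint functions, and the record layout are assumptions made for illustration, not the BTS template or its constraint set.

```python
from itertools import product

# Hypothetical mini-template: 4 slots, not the 9 slots of the real engine.
SLOT_ORDER = ["SM", "TAM", "ROOT", "FV"]
SLOT_FILLERS = {
    "SM": ["ni", "u", "a"],         # subject markers
    "TAM": ["", "na", "li", "ta"],  # tense markers ("" = slot left empty)
    "ROOT": ["som"],                # verb root "read"
    "FV": ["a", "e"],               # final vowel: indicative / subjunctive
}


# Simplified toy constraints (real grammars have more cases than this).
def subjunctive_has_no_tense(form):
    return not (form["FV"] == "e" and form["TAM"] != "")


def indicative_needs_tense(form):
    return not (form["FV"] == "a" and form["TAM"] == "")


CONSTRAINTS = [subjunctive_has_no_tense, indicative_needs_tense]


def all_combinations():
    """Enumerate every slot combination, valid or not."""
    for fillers in product(*(SLOT_FILLERS[s] for s in SLOT_ORDER)):
        yield dict(zip(SLOT_ORDER, fillers))


def valid_records():
    """Keep only combinations that pass every constraint, with provenance."""
    for form in all_combinations():
        if all(rule(form) for rule in CONSTRAINTS):
            yield {
                "surface": "".join(form[s] for s in SLOT_ORDER),
                "decomposition": form,
                "provenance": {
                    "constraints_passed": [r.__name__ for r in CONSTRAINTS],
                },
            }


total = sum(1 for _ in all_combinations())
records = list(valid_records())
print(f"{len(records)} of {total} combinations pass ({len(records) / total:.0%})")
# prints "12 of 24 combinations pass (50%)"
```

Each surviving record keeps its slot-by-slot decomposition alongside the surface form, so a form such as `ninasoma` can be traced back to the fillers and rules that produced it.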
Engine Statistics
Language Maturity
The Tier System: T0 → T5
Each tier integrates BTS engine capability, tonal pipeline coverage, Amina speaker validation, and SENTGEN sentence-level data into a single maturity progression.
Alphabet — Phonological Foundation
COMPLETE · 664 languages
The foundation no LLM currently has. Complete phonological inventory for every registered Bantu language: vowels, consonants, digraphs, syllable structures, tone system classification, and morphophonemic rules. 8 standardized JSON files per language. Amina syllable inventories seeded from phonological scaffold.
Proof of Concept
COMPLETE · 664 languages · 1M+ records
Full-stack BantuNomics–Amina language data infrastructure. Language-agnostic engines generating morphological data from AI-assembled cartridges: 50 anchor verb roots, 50–84 nouns, complete morpheme inventories. E1, NUMBERS, and SYSLEARN generators active. 40–50 files per language. Amina integration begins here: three recording streams seeded (syllable grids, tonal word pairs, tonal sentence pairs with 6 prosodic moods). Speaker onboarding with explicit consent, 48 kHz mono WAV, acoustic quality validation. All recordings documented, consent-based, provenance-tracked from first capture.
Production Level I
1 language (Bemba) · 250M+ records
Human-reviewed, natively verified production cartridge. 1,100+ verb roots (Hoch dictionary), 26 active generators, 250M+ records. Full D-series training generators. SENTGEN operational (84 templates, 7 categories, code-switching). MRS certified: 100% tonal accuracy. Deepened Amina hybrid integration: speaker calibration profiles (per-speaker F₀, vowel formants, onset perturbation). 3 lines of defense: per-recording acoustic analysis, cross-variant pair validation, multi-rater human consensus. Validation Passports issued. Recordings feed back into BTS for thesis validation. Documented, consent-based, provenance-tracked. Commercial license: full corpus with speaker metadata.
Production Level II — Enhanced
TARGET · 10B+ records
All core generators battle-tested. NS review ≥ 50%. All D-series generators enabled and verified. Cultural sensitivity review complete. Multi-register coverage. Expanded SENTGEN. Amina at scale: ≥ 25% of generators acoustically verified by native speakers. Multi-speaker corpus with cross-speaker consistency validation. Provenance chains from BTS generation through speaker recording to validated output. Commercial license: full corpus with speaker metadata and complete provenance documentation.
Production Level III — Mastery
FUTURE · 500B+ records
NS review ≥ 90%. Multi-dialect coverage. Academic-grade documentation. Full MRS compositionality testing. SENTGEN with conversations, stories, code-switching. Amina full dialect coverage: multi-speaker corpus with dialectal variation, every record consent-documented, provenance-tracked, speaker-attributed. Exclusive partnership: priority access to new language releases and custom data collection runs.
Reference Implementation — Untouchable
FUTURE · 50T+ records
Complete linguistic mastery. NS review 100%. Published academic verification. BTS Language Council approval. HFST transducers compiled and benchmarked. The definitive reference implementation. Amina complete corpus: full speaker corpus covering naturalistic speech and domain-specific recordings, with a complete provenance chain, consent-based, commercially licensed with full speaker metadata. Exclusive partnership: priority access, custom data collection runs, co-development of new language cartridges.
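The recording format named in the Proof of Concept tier (48 kHz mono WAV) is the kind of requirement that can be checked automatically at intake. The sketch below uses only Python's standard `wave` module; the function name and the wording of the problem messages are hypothetical, and it covers the container format only, not the acoustic quality validation the tiers describe.

```python
import wave

REQUIRED_RATE_HZ = 48_000  # sample rate stated in the tier description
REQUIRED_CHANNELS = 1      # mono


def check_wav_format(path):
    """Return a list of format problems; an empty list means the file conforms."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != REQUIRED_RATE_HZ:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {REQUIRED_RATE_HZ} Hz"
            )
        if wav.getnchannels() != REQUIRED_CHANNELS:
            problems.append(
                f"{wav.getnchannels()} channels found, expected {REQUIRED_CHANNELS}"
            )
        if wav.getnframes() == 0:
            problems.append("file contains no audio frames")
    return problems
```

A recording would pass this gate only when the returned list is empty; later checks such as per-recording acoustic analysis would run after it.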
The First All-Bantu-Language HFST
We are building the first comprehensive HFST system for the entire Bantu family — compiling cartridges into finite-state transducers for offline O(n) morphological analysis and generation.
Language cartridges compile into HFST transducers — mathematical models encoding 151+ morphological constraints.
Given any surface form, decompose it into morphemes with slot labels in O(n) time. No API needed.
Given a morpheme specification, produce the correct surface form. Bidirectional by construction.
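What "bidirectional by construction" means can be shown with a toy model: one relation of (analysis, surface form) pairs is built once and then queried in either direction. The sketch below is plain Python over a tiny Swahili-like fragment. It does not use HFST, is not a finite-state transducer, and does not demonstrate the O(n) behavior; the slot labels, glosses, and function names are assumptions for illustration.

```python
from itertools import product

# Toy fragment: each slot maps morphemes to glosses (illustrative data only).
SLOTS = [
    ("SM", {"ni": "1SG", "tu": "1PL", "a": "3SG"}),
    ("TAM", {"na": "PRS", "li": "PST", "ta": "FUT"}),
    ("ROOT", {"som": "read"}),
    ("FV", {"a": "IND"}),
]


def build_relation():
    """Enumerate (analysis, surface) pairs: the single source of truth."""
    pairs = []
    for combo in product(*(fillers.items() for _, fillers in SLOTS)):
        surface = "".join(morpheme for morpheme, _ in combo)
        analysis = tuple(
            (slot, morpheme, gloss)
            for (slot, _), (morpheme, gloss) in zip(SLOTS, combo)
        )
        pairs.append((analysis, surface))
    return pairs


RELATION = build_relation()

# Both lookup directions are derived from the same relation.
_BY_SURFACE = {}
_BY_SPEC = {}
for _analysis, _surface in RELATION:
    _BY_SURFACE.setdefault(_surface, []).append(_analysis)
    _BY_SPEC[tuple(gloss for _, _, gloss in _analysis)] = _surface


def analyze(surface):
    """Surface form -> list of slot-labeled decompositions (empty if unknown)."""
    return _BY_SURFACE.get(surface, [])


def generate(spec):
    """Tuple of glosses, one per slot -> surface form (None if not licensed)."""
    return _BY_SPEC.get(tuple(spec))


print(generate(("1PL", "FUT", "read", "IND")))  # prints "tutasoma"
print(analyze("alisoma"))
```

Because analysis and generation read from the same relation, every form that can be generated can also be analyzed back to the specification that produced it, which is the property a compiled transducer provides at scale.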