System Architecture

Platform Overview

A five-layer production system that generates, validates, and certifies structured training data for any Bantu language. One engine. 664 cartridges. Same architecture.

Architecture Pipeline

Layer 1: BTS Engine (39 generators, 664 cartridges)
Layer 2: Tonal Pipeline (5-step tone assignment)
Layer 3: Amina (speaker validation + audio)
Layer 4: SENTGEN (sentence-level data + code-switching)
Layer 5: Defense (MRS + LDR certification)

The Bantu Technical Standard (BTS) Engine

A language-agnostic engine with 39 registered generators and 16 language-agnostic generators (LAGs). Zero language-specific code — all linguistic data lives in cartridges.

9-Slot Verb Template

// Every Bantu verb follows this structure
[PreInit] [Subject] [Tense] [Object] [Root] [Ext₁] [Ext₂] [FinalV] [Loc]

Each root yields ~480K slot combinations, of which ~110K (22%) pass the constraint filter as valid forms. Every record carries a full morphological decomposition with provenance tracking.
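The enumerate-then-filter step described above can be sketched as follows. This is a minimal illustration, not the BTS engine itself: the slot order comes from the 9-slot template, but the inventory labels (`SM1`, `APPL`, and so on), the two constraint functions, and the record layout are hypothetical placeholders rather than real cartridge data.

```python
from itertools import product

# Slot order taken from the 9-slot verb template.
SLOTS = ["PreInit", "Subject", "Tense", "Object", "Root",
         "Ext1", "Ext2", "FinalV", "Loc"]

# Hypothetical toy inventory: abstract labels, not real morphemes.
# An empty string means the slot is left unfilled.
INVENTORY = {
    "PreInit": ["", "NEG"],
    "Subject": ["SM1", "SM2"],
    "Tense":   ["PST", "FUT"],
    "Object":  ["", "OM1"],
    "Root":    ["ROOT"],
    "Ext1":    ["", "APPL", "CAUS"],
    "Ext2":    ["", "PASS"],
    "FinalV":  ["FV.a", "FV.e"],
    "Loc":     ["", "LOC"],
}

# Hypothetical constraints, standing in for a cartridge's rule set.
def ext2_requires_ext1(form):
    """A second extension may only appear if the first one is filled."""
    return not form["Ext2"] or bool(form["Ext1"])

def neg_requires_fv_e(form):
    """In this toy grammar, negation forces the final vowel FV.e."""
    return form["PreInit"] != "NEG" or form["FinalV"] == "FV.e"

CONSTRAINTS = [ext2_requires_ext1, neg_requires_fv_e]

def generate_forms(inventory, constraints):
    """Yield every slot combination that satisfies all constraints."""
    for combo in product(*(inventory[slot] for slot in SLOTS)):
        form = dict(zip(SLOTS, combo))
        if all(rule(form) for rule in constraints):
            # Keep the decomposition and the rules it was checked against.
            yield {"slots": form,
                   "provenance": [rule.__name__ for rule in constraints]}

total = 1
for slot in SLOTS:
    total *= len(INVENTORY[slot])

valid = list(generate_forms(INVENTORY, CONSTRAINTS))
print(f"{len(valid)} of {total} combinations pass ({len(valid) / total:.1%})")
```

With this toy inventory, 240 of 384 combinations survive. The ~480K and ~110K figures quoted above come from a production cartridge with far larger inventories and many more rules.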

Engine Statistics

Total records: 250M+
Registered generators: 39
Language-agnostic generators (LAGs): 16
POC cartridges: 664
Production cartridge (Bemba): 1,100+ roots, 26 active generators
Morphological constraints (Bemba): 151 rules

Language Maturity

The Tier System: T0 → T5

Each tier integrates BTS engine capability, tonal pipeline coverage, Amina speaker validation, and SENTGEN sentence-level data into a single maturity progression.

T0

Alphabet — Phonological Foundation

COMPLETE — 664 languages

The foundation no LLM currently has. Complete phonological inventory for every registered Bantu language — vowels, consonants, digraphs, syllable structures, tone system classification, morphophonemic rules. 8 standardized JSON files per language. Amina syllable inventories seeded from phonological scaffold.

BTS Engine: 8 files/language, covering alphabet, constraints, exceptions, morphophonemics, tone system, syllabary (core + complex)
Tonal: Tone system classified (2-tone, 3-tone, tonal/non-tonal)
Amina: Syllable inventory seeded from phonological scaffold; ready for speaker recruitment
SENTGEN: Not active (requires T1 cartridge)
T1

Proof of Concept

COMPLETE — 664 languages · 1M+ records

Full-stack BantuNomics–Amina language data infrastructure. Language-agnostic engines generating morphological data from AI-assembled cartridges — 50 anchor verb roots, 50-84 nouns, complete morpheme inventories. E1, NUMBERS, and SYSLEARN generators active. 40-50 files per language. Amina integration begins here: three recording streams seeded (syllable grids, tonal word pairs, tonal sentence pairs with 6 prosodic moods). Speaker onboarding with explicit consent, 48kHz mono WAV, acoustic quality validation. All recordings documented, consent-based, provenance-tracked from first capture.

BTS Engine: 40-50 files/language. E1, NUMBERS, SYSLEARN. Language-agnostic architecture. 1M+ records across 664 languages
Tonal: Tonal pipeline parameters configured per cartridge. H/L assignment active
Amina: 3 recording streams. Documented consent. 48kHz mono WAV. Provenance-tracked from first capture
SENTGEN: Ready once sentence templates and lexicon pools are populated
Rated POC: Bemba, Tonga, Nyanja, Lozi, Zulu, Swahili, Shona, Kinyarwanda, Lingala
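The 48kHz mono WAV requirement for recordings can be checked from the file header before any acoustic analysis runs. The sketch below uses only Python's standard `wave` module. The function name `check_wav_format` and the returned list of problems are assumptions made for illustration, not part of the Amina pipeline.

```python
import wave

def check_wav_format(path, expected_rate=48_000, expected_channels=1):
    """Return a list of format problems; an empty list means the header matches.

    Only the WAV header is inspected. Acoustic quality (noise, clipping)
    would need a separate analysis step that is not sketched here.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        if rate != expected_rate:
            problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
        if channels != expected_channels:
            problems.append(f"{channels} channel(s), expected {expected_channels}")
        if wav.getnframes() == 0:
            problems.append("file contains no audio frames")
    return problems
```

Calling `check_wav_format("recording.wav")` on a hypothetical file returns `[]` when the recording is 48 kHz mono and non-empty, and otherwise a human-readable reason for each mismatch.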
T2

Production Level I

1 language (Bemba) · 250M+ records

Human-reviewed, natively verified production cartridge. 1,100+ verb roots (Hoch dictionary), 26 active generators, 250M+ records. Full D-series training generators. SENTGEN operational (84 templates, 7 categories, code-switching). MRS certified — 100% tonal accuracy. Deepened Amina hybrid integration: speaker calibration profiles (per-speaker F₀, vowel formants, onset perturbation). 3 lines of defense — per-recording acoustic analysis, cross-variant pair validation, multi-rater human consensus. Validation Passports issued. Recordings feed back into BTS for thesis validation. Documented, consent-based, provenance-tracked. Commercial license: full corpus with speaker metadata.

BTS Engine: 184 files. 36 generators (26 active). 12+ word categories. 250M+ records
Tonal: 23,000+ tonal verb pairs. MRS: 132 stress-tests, 100% accuracy
Amina: Speaker calibration profiles. 3 lines of defense. Validation Passports. Consent-based, provenance-tracked
SENTGEN: 84 templates, 7 categories, 17 lexicon pools. ~11.8B combinations. Code-switching active
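A combination count such as the ~11.8B quoted for SENTGEN plausibly comes from multiplying lexicon pool sizes within each template and summing across templates. The sketch below shows that arithmetic on toy data. The pool names, template names, and the one-pool-per-slot assumption are hypothetical; the real SENTGEN templates may apply agreement or other filters that reduce the count.

```python
from itertools import product
from math import prod

# Hypothetical lexicon pools (placeholder entries, not real vocabulary).
POOLS = {
    "subject": ["S1", "S2", "S3"],
    "verb":    ["V1", "V2"],
    "object":  ["O1", "O2", "O3", "O4"],
    "time":    ["T1", "T2"],
}

# Hypothetical templates: each is an ordered list of pool names.
TEMPLATES = {
    "statement":       ["subject", "verb", "object"],
    "timed_statement": ["time", "subject", "verb", "object"],
    "intransitive":    ["subject", "verb"],
}

def count_combinations(templates, pools):
    """Sum, over templates, the product of the sizes of the pools each one uses."""
    return sum(prod(len(pools[name]) for name in slots)
               for slots in templates.values())

def expand(templates, pools):
    """Enumerate every (template, filled slots) pair, for small inputs only."""
    for template_name, slots in templates.items():
        for choice in product(*(pools[name] for name in slots)):
            yield template_name, choice

total = count_combinations(TEMPLATES, POOLS)
print(total)
```

Counting by multiplication avoids materialising the sentences, which matters once the total reaches billions; `expand` is only there to confirm the count on small inputs.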
T3

Production Level II — Enhanced

TARGET · 10B+ records

All core generators battle-tested. Native-speaker (NS) review ≥ 50%. All D-series generators enabled and verified. Cultural sensitivity review complete. Multi-register coverage. Expanded SENTGEN. Amina at scale: ≥ 25% of generators acoustically verified by native speakers. Multi-speaker corpus with cross-speaker consistency validation. Provenance chains from BTS generation through speaker recording to validated output. Commercial license: full corpus with speaker metadata and complete provenance documentation.

T4

Production Level III — Mastery

FUTURE · 500B+ records

NS review ≥ 90%. Multi-dialect coverage. Academic-grade documentation. Full MRS compositionality testing. SENTGEN with conversations, stories, code-switching. Amina full dialect coverage: multi-speaker corpus with dialectal variation — every record consent-documented, provenance-tracked, speaker-attributed. Exclusive partnership: priority access to new language releases and custom data collection runs.

T5

Reference Implementation — Untouchable

FUTURE · 50T+ records

Complete linguistic mastery. NS review 100%. Published academic verification. BTS Language Council approval. HFST transducers compiled and benchmarked. The definitive reference implementation. Amina complete corpus: full speaker corpus — naturalistic speech, domain-specific recordings, complete provenance chain, consent-based, commercially licensed with full speaker metadata. Exclusive partnership: priority access, custom data collection runs, co-development of new language cartridges.
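The NS review thresholds stated for T3, T4, and T5 can be written down as a small lookup. This sketch covers only that one criterion; every tier also carries other requirements (dialect coverage, MRS testing, council approval, and so on) that are not modelled here. The names `NS_REVIEW_MIN` and `tiers_met_by_ns_review` are hypothetical.

```python
# NS review minimums (percent) as stated for each tier above.
NS_REVIEW_MIN = {"T3": 50.0, "T4": 90.0, "T5": 100.0}

def tiers_met_by_ns_review(ns_review_pct):
    """Return the tiers whose NS review threshold is met.

    Meeting the threshold is necessary for a tier, not sufficient:
    the other tier requirements are outside the scope of this sketch.
    """
    return [tier for tier, minimum in NS_REVIEW_MIN.items()
            if ns_review_pct >= minimum]

print(tiers_met_by_ns_review(92.5))
```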

The First All-Bantu-Language HFST

We are building the first comprehensive HFST (Helsinki Finite-State Technology) system for the entire Bantu family, compiling cartridges into finite-state transducers for offline O(n) morphological analysis and generation.

Compile

Language cartridges compile into HFST transducers — mathematical models encoding 151+ morphological constraints.

Analyze

Given any surface form, decompose it into morphemes with slot labels in O(n) time. No API needed.

Generate

Given a morpheme specification, produce the correct surface form. Bidirectional by construction.
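The analyze and generate directions can be illustrated with a toy in pure Python. It does not use the HFST toolkit and makes no attempt to match a compiled transducer's performance; it only shows how one morpheme table can serve both directions. The four-slot lexicon uses a handful of well-known Swahili morphemes chosen for illustration and is not taken from any project cartridge.

```python
# Ordered slots, each mapping a surface string to its gloss label.
# Illustrative Swahili morphemes; a real cartridge would supply these.
TOY_LEXICON = [
    ("Subject", {"ni": "1SG", "u": "2SG", "a": "3SG"}),
    ("Tense",   {"na": "PRS", "li": "PST", "ta": "FUT"}),
    ("Root",    {"som": "read"}),
    ("FinalV",  {"a": "FV"}),
]

def generate(spec):
    """Map a {slot: label} specification to its surface form."""
    parts = []
    for slot, morphemes in TOY_LEXICON:
        by_label = {label: surface for surface, label in morphemes.items()}
        parts.append(by_label[spec[slot]])
    return "".join(parts)

def analyze(surface, slot_index=0):
    """Return every decomposition of `surface` as (slot, morpheme, label) lists."""
    if slot_index == len(TOY_LEXICON):
        # Accept only if the whole input has been consumed.
        return [[]] if surface == "" else []
    slot, morphemes = TOY_LEXICON[slot_index]
    analyses = []
    for morph, label in morphemes.items():
        if surface.startswith(morph):
            for rest in analyze(surface[len(morph):], slot_index + 1):
                analyses.append([(slot, morph, label)] + rest)
    return analyses

spec = {"Subject": "3SG", "Tense": "PST", "Root": "read", "FinalV": "FV"}
form = generate(spec)            # "alisoma"
print(form, analyze(form))
```

Because both functions read the same table, any form produced by `generate` is recovered by `analyze`, which is the property the "bidirectional by construction" claim refers to. The recursive search here is for clarity only.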