Tonal Ground Truth
Standard Bantu orthography does not mark tone. The same written word can mean completely different things depending on pitch. Every AI model trained on Bantu text is tonally blind. We fix that.
Why Tone Matters
In Bantu languages, tone is not accent or emphasis — it is meaning. A shift of just 12 Hz in fundamental frequency can separate "to work" from "to be wet." Both words share the same spelling.
Standard NLP pipelines cannot distinguish them. Training on flat text is like training an English model on text where uppercase and lowercase are removed — you lose structural information that changes meaning.
Across 664 Bantu languages spoken by 400 million people, tone operates at the lexical level (word meaning), grammatical level (tense, mood), and discourse level (questions vs statements). Current AI misses all three.
Transcriptions look correct but miss meaning — tonal contrast was never modelled.
Intelligible but unnatural speech — pitch treated as post-processing, not linguistic structure.
Distinct meanings collapsed into one token sequence — training text hides the contrast entirely.
The 5-Step Tonal Pipeline
Deterministic assignment of High (H) or Low (L) tones to every mora in every generated form.
Lexical Assignment
Each root, subject marker, and tense marker has a documented underlying tone. These are read from the language cartridge — no hardcoded values.
Meeussen's Rule
Resolves H+H tone collisions in the prefix domain. When two adjacent High tones meet, one is systematically lowered. The domain and behavior are language-specific.
Melodic Overlay
Certain tense markers impose grammatical tone patterns that override root tones. The melodic patterns are configured per language in the cartridge.
Binary Spreading
High tones spread exactly one mora rightward. The spreading type (binary, ternary, unbounded) varies by language and dialect.
OCP Cleanup
The Obligatory Contour Principle resolves any remaining adjacent High-tone violations, producing the final surface tone pattern.
Syllabic Ground Truth
Every Bantu language has a finite, enumerable set of valid syllables. Each is an acoustic anchor point.
140+ Syllables Per Language
Bemba has 29 onset groups × 5 vowels = 140+ unique syllables. Each has a unique acoustic fingerprint determined by its consonant and vowel.
Calibration Loop
When speakers record isolated syllables at 48kHz, measurements calibrate predictions to each individual speaker's vocal tract.
Thesis-Validation Architecture
Every generated record embeds testable linguistic claims. Recordings validate or invalidate them.
Each tonal record carries a set of theses — predictions about what the acoustic signal should look like if the tonal rules were applied correctly. When a native speaker records the form, the acoustic measurements are compared against these predictions. A thesis passes if the physical measurements confirm the rule; it fails if they don't.
Recording confirms predicted tonal pattern — record certified
Recording contradicts prediction — re-prompted with corrective guidance
Systematic failures trigger cartridge review and rule refinement
Acoustic Minimal Pairs
Same words recorded as both statement and question — exposing the tonal contrast that orthography hides.
8 Syntactic Frame Categories
Each verb form is placed in controlled sentence frames that isolate specific tonal behaviors:
What This Reveals
Minimal pairs expose tonal contrasts invisible in text:
- ● Statement F₀ contour falls at boundary; question rises
- ● Penultimate lengthening shifts in question context
- ● Tonal spreading behavior changes across syntactic frames
- ● AI models trained on these pairs learn to distinguish meaning by pitch
The Defense Layer
Scientific certification that the data is internally consistent and linguistically correct.
Minimum Representative Sample
~130-300 carefully selected records exercise every boundary case. If the MRS is correct, bulk data correctness follows by construction.
Linguistic Delta Report
Mathematical proof of internal consistency. Variance (Δ) between MRS and bulk data. Δ < 1% = AUTHENTIC certification.
Per-Record Certification
Every record carries a machine-readable certificate listing which linguistic rules were verified by acoustic physics.