Tonal Ground Truth — BantuNomics

Why Tone Matters

In Bantu languages, tone is not accent or emphasis — it is meaning. A shift of just 12 Hz in fundamental frequency can separate "to work" from "to be wet." Both words share the same spelling.

Standard NLP pipelines cannot distinguish them. Training on flat text is like training an English model on text with all capitalization removed: you lose structural information that changes meaning.

Across 664 Bantu languages spoken by 400 million people, tone operates at the lexical level (word meaning), grammatical level (tense, mood), and discourse level (questions vs statements). Current AI misses all three.

ASR (Speech Recognition)

Transcriptions look correct but miss meaning — tonal contrast was never modelled.

TTS (Text-to-Speech)

Intelligible but unnatural speech — pitch treated as post-processing, not linguistic structure.

LLMs (Language Models)

Distinct meanings collapsed into one token sequence — training text hides the contrast entirely.
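The collapse described above can be illustrated with a minimal sketch. The two forms below are invented placeholders, not real lexical items, and the acute accent marking a High tone is an assumed transcription convention. Removing the tone marks, as ordinary orthography does, leaves two identical strings.

```python
import unicodedata

def strip_tone_marks(form: str) -> str:
    """Remove combining diacritics, mimicking tone-unmarked orthography."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical tone-marked forms (acute = High); placeholders only.
form_a = "bála"  # High tone on the first mora
form_b = "balá"  # High tone on the second mora

flat_a = strip_tone_marks(form_a)
flat_b = strip_tone_marks(form_b)

# Once tone is removed, the two forms are indistinguishable to a text model.
collapsed = flat_a == flat_b
```

A model trained only on `flat_a` and `flat_b` sees one string for two meanings, which is the loss the three cards above describe.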

The 5-Step Tonal Pipeline

Deterministic assignment of High (H) or Low (L) tones to every mora in every generated form.

1. Lexical Assignment

Each root, subject marker, and tense marker has a documented underlying tone. These are read from the language cartridge — no hardcoded values.

2. Meeussen's Rule

Resolves H+H tone collisions in the prefix domain. When two adjacent High tones meet, one is systematically lowered. The domain and behavior are language-specific.

3. Melodic Overlay

Certain tense markers impose grammatical tone patterns that override root tones. The melodic patterns are configured per language in the cartridge.

4. Binary Spreading

In binary spreading, each High tone spreads exactly one mora rightward. Other languages and dialects use ternary or unbounded spreading, so the spreading type is language-specific.

5. OCP Cleanup

The Obligatory Contour Principle resolves any remaining adjacent High-tone violations, producing the final surface tone pattern.
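The five steps above can be sketched as a small deterministic function. Everything in the cartridge below (key names, morpheme labels, tone values, the melody) is hypothetical. The sketch also makes three assumptions the text does not confirm: Meeussen's rule lowers the second of two clashing Highs (the common textbook formulation), OCP cleanup applies the same lowering across the whole word, and a High created by spreading shares its sponsor's identity and therefore does not count as a clash.

```python
from dataclasses import dataclass

@dataclass
class Mora:
    tone: str     # "H" or "L"
    source: int   # id of the morpheme (or melody) that sponsors this tone
    prefix: bool  # True when the mora sits in the prefix domain

# Hypothetical cartridge: every key and value is invented for illustration.
CARTRIDGE = {
    "lexical_tones": {"SM": "H", "TM": "H", "ROOT": "LLL"},
    "prefix_morphemes": {"SM", "TM"},
    "melody": {-1: "H"},  # stem position -> imposed tone (final stem mora)
    "spread_span": 1,     # 1 = binary, 2 = ternary, None = unbounded
}
MELODY_SOURCE = -1  # source id reserved for melodic tones

def lexical_assignment(morphemes, cartridge):
    moras = []
    for idx, name in enumerate(morphemes):
        in_prefix = name in cartridge["prefix_morphemes"]
        for tone in cartridge["lexical_tones"][name]:
            moras.append(Mora(tone, idx, in_prefix))
    return moras

def lower_second_of_clashing_highs(moras, prefix_only):
    # Adjacent Highs from different sponsors clash; the second is lowered.
    for left, right in zip(moras, moras[1:]):
        if prefix_only and not (left.prefix and right.prefix):
            continue
        if left.tone == "H" and right.tone == "H" and left.source != right.source:
            right.tone = "L"
    return moras

def melodic_overlay(moras, melody):
    stem = [m for m in moras if not m.prefix]
    for position, tone in melody.items():
        stem[position].tone = tone
        stem[position].source = MELODY_SOURCE
    return moras

def spread_right(moras, span):
    limit = len(moras) if span is None else span
    before = [m.tone for m in moras]  # snapshot: only original Highs spread
    sponsors = [i for i, tone in enumerate(before) if tone == "H"]
    for i in sponsors:
        for j in range(i + 1, min(i + 1 + limit, len(moras))):
            if before[j] == "H":
                break  # stop at the next underlying High
            moras[j].tone = "H"
            moras[j].source = moras[i].source
    return moras

def surface_tones(morphemes, cartridge=CARTRIDGE):
    moras = lexical_assignment(morphemes, cartridge)                  # step 1
    moras = lower_second_of_clashing_highs(moras, prefix_only=True)   # step 2
    moras = melodic_overlay(moras, cartridge["melody"])               # step 3
    moras = spread_right(moras, cartridge["spread_span"])             # step 4
    moras = lower_second_of_clashing_highs(moras, prefix_only=False)  # step 5
    return "".join(m.tone for m in moras)

pattern = surface_tones(["SM", "TM", "ROOT"])
unbounded = surface_tones(["SM", "TM", "ROOT"], dict(CARTRIDGE, spread_span=None))
```

With the hypothetical cartridge, the underlying sequence H H L L L becomes H L L L L after step 2, H L L L H after the melody, and H H L L H after binary spreading, with nothing left for step 5 to repair. Switching `spread_span` to `None` makes the first High spread up to the melodic High, which step 5 then lowers.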

Syllabic Ground Truth

Every Bantu language has a finite, enumerable set of valid syllables. Each is an acoustic anchor point.

140+ Syllables Per Language

Bemba has 29 onset groups × 5 vowels, giving 140+ unique syllables (29 × 5 = 145 possible combinations). Each has a unique acoustic fingerprint determined by its consonant and vowel.

Pure vowels: a, e, i, o, u

Plain CV: ba, ka, la, ma...

Labialized: bwa, fwa, twa...

Prenasalized: mba, nda, nga...
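Because the inventory is a product of onsets and vowels, it can be enumerated mechanically. The sketch below uses only the onsets visible in the examples above, so it is an illustrative subset rather than the full Bemba inventory. The category names and dictionary layout are assumptions.

```python
from itertools import product

VOWELS = ["a", "e", "i", "o", "u"]

# Illustrative subset of onset groups, taken from the examples in the text.
# A real cartridge would list every onset group for the language.
ONSETS = {
    "pure_vowel": [""],
    "plain": ["b", "k", "l", "m"],
    "labialized": ["bw", "fw", "tw"],
    "prenasalized": ["mb", "nd", "ng"],
}

def enumerate_syllables(onsets, vowels):
    """Return (category, syllable) pairs for every onset-vowel combination."""
    return [
        (category, onset + vowel)
        for category, group in onsets.items()
        for onset, vowel in product(group, vowels)
    ]

syllables = enumerate_syllables(ONSETS, VOWELS)
full_inventory_size = 29 * 5  # onset groups x vowels, as stated for Bemba
```

The subset yields 11 onsets × 5 vowels = 55 syllables; the same function over 29 onset groups would yield 145.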

Calibration Loop

When speakers record isolated syllables at 48 kHz, the measurements calibrate predictions to each individual speaker's vocal tract.

● F₀ onset perturbation: voiced plosives pull pitch down ~15%

● Intrinsic vowel pitch: /i/, /u/ naturally ~10% higher than /a/

● Duration baselines: short ~122 ms, long ~245 ms, prenasalized ~164 ms

● Formant anchors: F₁/F₂ values verify vowel identity
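The first three calibration factors above can be sketched as a simple prediction model. The percentages and durations come from the list; combining the two pitch effects multiplicatively, the membership of the voiced-plosive set, and the function names are assumptions made for illustration.

```python
VOICED_PLOSIVES = {"b", "d", "g"}  # assumed set; the text names no members
HIGH_VOWELS = {"i", "u"}

# Approximate duration baselines in milliseconds, from the list above.
DURATION_MS = {"short": 122, "long": 245, "prenasalized": 164}

def predicted_onset_f0(speaker_base_hz, onset, vowel):
    """Predict F0 at vowel onset from a speaker's base pitch on /a/."""
    f0 = speaker_base_hz
    if vowel in HIGH_VOWELS:
        f0 *= 1.10  # intrinsic vowel pitch: about 10% higher than /a/
    if onset in VOICED_PLOSIVES:
        f0 *= 0.85  # onset perturbation: about 15% lower
    return f0

def calibrate_base_f0(measurements):
    """Estimate base F0 from (onset, vowel, measured_hz) triples.

    Each measurement is divided by its predicted effect, then averaged.
    """
    normalised = [
        hz / predicted_onset_f0(1.0, onset, vowel)
        for onset, vowel, hz in measurements
    ]
    return sum(normalised) / len(normalised)

recordings = [("m", "a", 200.0), ("b", "a", 170.0), ("k", "i", 220.0)]
base_hz = calibrate_base_f0(recordings)
```

The recordings here are synthetic values chosen to agree with the model, so the estimate comes back at roughly 200 Hz. Real recordings would scatter around the prediction, and that scatter is what the calibration loop would absorb.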

Thesis-Validation Architecture

Every generated record embeds testable linguistic claims. Recordings validate or invalidate them.

Each tonal record carries a set of theses — predictions about what the acoustic signal should look like if the tonal rules were applied correctly. When a native speaker records the form, the acoustic measurements are compared against these predictions. A thesis passes if the physical measurements confirm the rule; it fails if they don't.

Pass

Recording confirms predicted tonal pattern — record certified

Fail

Recording contradicts prediction: the speaker is re-prompted with corrective guidance

Revise

Systematic failures trigger cartridge review and rule refinement
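The pass, fail, and revise outcomes above can be sketched as follows. The thesis names, predicted values, tolerances, and the 50% threshold for calling a failure systematic are all hypothetical; the text does not specify how a thesis is encoded or when a review is triggered.

```python
from dataclasses import dataclass

@dataclass
class Thesis:
    rule: str         # name of the linguistic rule being tested
    predicted: float  # acoustic value the rule predicts
    tolerance: float  # allowed deviation before the thesis fails

    def passes(self, measured: float) -> bool:
        return abs(measured - self.predicted) <= self.tolerance

def evaluate_record(theses, measurements):
    """Return ('pass' or 'fail', list of failing rule names)."""
    failed = [t.rule for t in theses if not t.passes(measurements[t.rule])]
    return ("pass" if not failed else "fail", failed)

def rules_needing_revision(outcomes, threshold=0.5):
    """Flag rules that fail in more than `threshold` of the records."""
    counts = {}
    for _, failed in outcomes:
        for rule in failed:
            counts[rule] = counts.get(rule, 0) + 1
    return sorted(r for r, n in counts.items() if n / len(outcomes) > threshold)

# Hypothetical theses for one record.
theses = [
    Thesis("spread_mora2_f0_hz", predicted=180.0, tolerance=10.0),
    Thesis("penult_duration_ms", predicted=245.0, tolerance=30.0),
]

good = evaluate_record(theses, {"spread_mora2_f0_hz": 176.0, "penult_duration_ms": 250.0})
bad_1 = evaluate_record(theses, {"spread_mora2_f0_hz": 150.0, "penult_duration_ms": 240.0})
bad_2 = evaluate_record(theses, {"spread_mora2_f0_hz": 152.0, "penult_duration_ms": 251.0})

to_review = rules_needing_revision([good, bad_1, bad_2])
```

A single failing record leads to a re-prompt, while a rule that fails across most recordings points at the cartridge rather than at the speaker.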

Acoustic Minimal Pairs

Same words recorded as both statement and question — exposing the tonal contrast that orthography hides.

8 Syntactic Frame Categories

Each verb form is placed in controlled sentence frames that isolate specific tonal behaviors:

Simple declarative vs. yes/no question

Focus constructions (subject, object, verb)

Relative clauses

Negation contexts

What This Reveals

Minimal pairs expose tonal contrasts invisible in text:

● Statement F₀ contour falls at boundary; question rises
● Penultimate lengthening shifts in question context
● Tonal spreading behavior changes across syntactic frames
● AI models trained on these pairs learn to distinguish meaning by pitch
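The first contrast in the list, a falling versus rising F₀ contour at the boundary, can be checked with a simple slope estimate. This is a minimal sketch: the contours are synthetic, the 25% tail window is an arbitrary assumption, and a real system would first need pitch tracking and handling of unvoiced frames.

```python
def final_f0_slope(f0_hz, tail_fraction=0.25):
    """Least-squares slope (Hz per frame) over the final part of a contour."""
    tail_len = max(2, int(len(f0_hz) * tail_fraction))
    tail = f0_hz[-tail_len:]
    n = len(tail)
    mean_x = (n - 1) / 2
    mean_y = sum(tail) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(tail))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def classify_boundary(f0_hz):
    """Label a contour by the direction of its final pitch movement."""
    return "question" if final_f0_slope(f0_hz) > 0 else "statement"

# Synthetic F0 tracks (Hz per frame) for the same words in two frames.
statement_f0 = [180, 178, 176, 175, 170, 160, 150, 140]
question_f0 = [180, 178, 176, 175, 176, 185, 195, 205]

labels = (classify_boundary(statement_f0), classify_boundary(question_f0))
```

The two contours share their first four frames and differ only near the boundary, which mirrors how a minimal pair isolates the contrast.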

The Defense Layer

Scientific certification that the data is internally consistent and linguistically correct.

MRS Stress-Testing

Minimum Representative Sample

~130-300 carefully selected records exercise every boundary case. If the MRS is correct, bulk data correctness follows by construction.

LDR Certification

Linguistic Delta Report

A mathematical consistency check. The report computes the variance (Δ) between the MRS and the bulk data; Δ < 1% earns AUTHENTIC certification.
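The text does not define how Δ is computed, so the sketch below shows one possible reading: Δ as the share of bulk records whose tone pattern disagrees with the MRS-verified pattern for the same boundary case. The record layout, the case names, and the "NOT CERTIFIED" label are hypothetical; only the 1% threshold and the AUTHENTIC label come from the text.

```python
def linguistic_delta(mrs_expected, bulk_records):
    """Share of bulk records that disagree with the MRS pattern for their case."""
    checked = [r for r in bulk_records if r["case"] in mrs_expected]
    if not checked:
        raise ValueError("no bulk record matches an MRS case")
    mismatches = sum(r["tones"] != mrs_expected[r["case"]] for r in checked)
    return mismatches / len(checked)

def certify(delta, threshold=0.01):
    """Apply the threshold: delta below 1% earns AUTHENTIC."""
    return "AUTHENTIC" if delta < threshold else "NOT CERTIFIED"

# Hypothetical data: one MRS-verified case and 200 bulk records.
mrs_expected = {"prefix_clash": "HHLLH"}
bulk = [{"case": "prefix_clash", "tones": "HHLLH"}] * 199
bulk = bulk + [{"case": "prefix_clash", "tones": "HLLLH"}]

delta = linguistic_delta(mrs_expected, bulk)
status = certify(delta)
```

Here one mismatch in 200 records gives Δ = 0.5%, which falls under the threshold.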

Validation Passport

Per-Record Certification

Every record carries a machine-readable certificate listing which linguistic rules were verified by acoustic physics.
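A machine-readable certificate of this kind might look like the sketch below. The field names, the record identifier, and the JSON layout are assumptions, since the text does not publish a schema. The language code `bem` is the ISO 639-3 code for Bemba.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class ValidationPassport:
    record_id: str
    language: str
    verified_rules: list = field(default_factory=list)
    failed_rules: list = field(default_factory=list)

    @property
    def certified(self) -> bool:
        # Certified only when at least one rule was verified and none failed.
        return bool(self.verified_rules) and not self.failed_rules

    def to_json(self) -> str:
        payload = asdict(self)
        payload["certified"] = self.certified
        return json.dumps(payload, sort_keys=True)

# Hypothetical passport for one record.
passport = ValidationPassport(
    record_id="example-0001",
    language="bem",
    verified_rules=["meeussens_rule", "binary_spreading"],
)
decoded = json.loads(passport.to_json())
```

Serializing with sorted keys keeps the certificate stable across runs, which makes it easier to diff or hash.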
