The Tonal Frontier: Automating Ground Truth for Bantu Language AI

Conti Cintu et al. • 3MegaLabs • April 2026

Abstract

We present a deterministic five-step tonal pipeline that assigns a High (H) or Low (L) tone to every mora of every computationally generated Bantu verb form. The five steps (Lexical Assignment, Meeussen's Rule, Melodic Overlay, Binary Spreading, and OCP Cleanup) are fully parameterized by language-specific cartridge files, which makes the pipeline language-agnostic by design. We demonstrate the system on Bemba (Icibemba), producing 23,000+ tonal verb pairs with embedded testable theses. An independent audit by Gemini 2.5 Flash found 100% tonal accuracy across 132 stress-test records, with per-category results of Meeussen's Rule 48/48, Binary Spreading 82/82, Nasal Harmony 22/22, and Melodic Override 26/26. The category counts sum to 178 rather than 132, which implies that some records are counted under more than one category. We also introduce a thesis-validation architecture in which every generated record carries predictions that native-speaker recordings can confirm or refute, creating a self-correcting loop between computational generation and acoustic measurement.

1. Introduction

Standard orthography for Bantu languages does not mark tone. The Bemba word ukubomba is spelled the same way whether it means "to work" or is spoken with a different tonal realization. Every AI model trained on Bantu text is therefore tonally blind: it has never seen the dimension of the language that carries lexical, grammatical, and discourse meaning.

This paper describes BantuNomics' approach to solving this problem: the Bantu Technical Standard (BTS) engine powers a deterministic pipeline that computes the correct tonal pattern for any morphologically generated verb form, based on rules documented in the Bantu linguistics literature and parameterized through language-specific cartridge files.

2. The 5-Step Tonal Pipeline

Every generated verb form passes through five deterministic steps, each reading its parameters from the language cartridge:

Step 1: Lexical Assignment

Each morpheme (root, subject marker, tense marker, object marker) carries an underlying tone — High (H) or Low (L) — documented in the cartridge. This step reads those values and assigns them to each mora in the generated form. No rules are applied yet; this is the lexical baseline.
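As a rough illustration of this step, the following Python sketch copies underlying tones onto moras and records provenance. The `Morpheme` class, the slot labels, and the tone values are all assumptions made for the example; the source does not show the cartridge format or any real Bemba entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    role: str   # slot in the verb template, e.g. "SM", "TAM", "ROOT", "FV"
    tones: str  # underlying tones, one "H" or "L" character per mora


def lexical_assignment(morphemes):
    """Step 1 sketch: copy each morpheme's underlying tones onto its moras."""
    moras = []
    for morpheme in morphemes:
        for tone in morpheme.tones:
            if tone not in ("H", "L"):
                raise ValueError(f"unknown tone {tone!r} in {morpheme.role}")
            # Provenance: remember which morpheme and which rule set this tone.
            moras.append({"tone": tone, "source": morpheme.role, "rule": "lexical"})
    return moras


# Invented tone values for illustration only; these are not real Bemba entries.
verb = [Morpheme("SM", "H"), Morpheme("TAM", "L"),
        Morpheme("ROOT", "HL"), Morpheme("FV", "L")]
baseline = lexical_assignment(verb)
print("".join(m["tone"] for m in baseline))  # HLHLL
```

Keeping a `rule` field on every mora from the start means later steps can overwrite a tone and its provenance together.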

Step 2: Meeussen's Rule

When two adjacent High tones meet in the prefix domain, Meeussen's Rule resolves the collision by lowering one of them (H+H → H+L). The domain and direction of application vary by language and are configured per cartridge. In Bemba, the rule applies within the macrostem when a High subject marker meets a High tense marker.

In our Bemba corpus, 48 stress-test records specifically exercise Meeussen's Rule boundary cases. All 48 were judged correct by independent audit.
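A minimal sketch of the H+H → H+L repair is given below. It assumes the domain can be expressed as a mora index (`domain_end`) and that the direction is a two-way switch (`lower`); both are hypothetical stand-ins for whatever the cartridge actually stores.

```python
def meeussen(tones, domain_end, lower="second"):
    """Step 2 sketch: repair adjacent H H pairs inside tones[:domain_end].

    lower="second" turns H H into H L; lower="first" turns it into L H.
    Both options and the index-based domain are assumptions of this sketch.
    """
    if lower not in ("first", "second"):
        raise ValueError("lower must be 'first' or 'second'")
    out = list(tones)
    end = min(domain_end, len(out))
    for i in range(end - 1):
        if out[i] == "H" and out[i + 1] == "H":
            out[i + 1 if lower == "second" else i] = "L"
    return out


print("".join(meeussen("HHLL", domain_end=2)))  # HLLL
print("".join(meeussen("LHHL", domain_end=2)))  # LHHL (pair straddles the domain edge)
```

The second call shows why boundary cases matter: the H H pair is left alone because only one of its members lies inside the domain.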

Step 3: Melodic Overlay

Certain tense-aspect-mood (TAM) markers impose a grammatical tone pattern, a "melodic template", that overrides lexical tones on the stem. The melodic patterns (P1 through P4 in the Bemba cartridge) are language-specific and configured per TAM marker. All 26 stress-test records that exercise melodic overrides passed the audit.
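The override can be sketched as a lookup from TAM marker to template, followed by replacement of the stem tones. Only the labels P1 through P4 come from the text; the template contents, the marker name `tam_a`, and the `stem_start` index below are invented for illustration.

```python
# Hypothetical template contents: only the labels P1..P4 come from the text.
MELODIES = {
    "P1": lambda n: ["L"] * n,                # all-Low stem
    "P2": lambda n: ["L"] * (n - 1) + ["H"],  # H on the final stem mora
}
TAM_TO_MELODY = {"tam_a": "P2"}  # hypothetical TAM marker name


def melodic_overlay(tones, stem_start, tam):
    """Step 3 sketch: replace stem tones with the template selected by the TAM marker."""
    pattern = TAM_TO_MELODY.get(tam)
    stem_len = len(tones) - stem_start
    if pattern is None or stem_len <= 0:
        return list(tones)  # this marker imposes no melody
    # Prefix tones are kept; lexical stem tones are discarded.
    return list(tones[:stem_start]) + MELODIES[pattern](stem_len)


print("".join(melodic_overlay("HLHLL", stem_start=2, tam="tam_a")))    # HLLLH
print("".join(melodic_overlay("HLHLL", stem_start=2, tam="tam_none")))  # HLHLL
```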

Step 4: Binary Spreading

After lexical and grammatical tones are assigned, High tones spread rightward. In Bemba, spreading is binary: an H tone spreads exactly one mora to the right. Other languages may use ternary or unbounded spreading; the type is a cartridge parameter. All 82 stress-test records that target spreading behavior passed.
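One way to sketch bounded rightward spreading is shown below. The `reach` parameter is an assumed encoding of the spreading type: 1 for binary, 2 for ternary (on the assumption that ternary means one mora further than binary), and `None` for unbounded. Stopping at the next underlying H is also an assumption.

```python
def spread_right(tones, reach=1):
    """Step 4 sketch: each underlying H spreads up to `reach` moras rightward.

    Sources are read from the input, not the output, so a spread H never
    spreads again. Spreading stops early at the next underlying H.
    """
    src = list(tones)
    out = list(src)
    for i, tone in enumerate(src):
        if tone != "H":
            continue
        stop = len(src) if reach is None else min(i + 1 + reach, len(src))
        for j in range(i + 1, stop):
            if src[j] == "H":
                break
            out[j] = "H"
    return out


print("".join(spread_right("HLLL")))              # HHLL (binary)
print("".join(spread_right("HLLL", reach=2)))     # HHHL (ternary)
print("".join(spread_right("HLLL", reach=None)))  # HHHH (unbounded)
```

Reading sources from the input is what keeps binary spreading binary. A loop that read from its own output would let a spread H spread again and turn every case into unbounded spreading.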

Step 5: OCP Cleanup

The Obligatory Contour Principle (OCP) states that adjacent identical tones on the tonal tier should be simplified. After spreading, any remaining H+H adjacency violations are resolved, producing the final surface tone pattern.
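The sketch below shows one possible cleanup pass. It assumes each mora carries the identity of the tone linked to it, so that a single H spread over two moras is not treated as a violation, and it assumes the repair lowers the first mora of the later tone. The source does not state which repair the pipeline uses, so both choices are illustrative only.

```python
def ocp_cleanup(moras):
    """Step 5 sketch: moras is a list of (tone, tone_id); tone_id is None for L.

    Two adjacent H moras violate the OCP only when they belong to different
    tones. The repair used here (lower the later mora) is an assumption.
    """
    out = list(moras)
    for i in range(len(out) - 1):
        (left, left_id), (right, right_id) = out[i], out[i + 1]
        if left == "H" and right == "H" and left_id != right_id:
            out[i + 1] = ("L", None)
    return out


# Tone 1 was spread over moras 0-1; tone 2 is a separate H on mora 2.
after_spreading = [("H", 1), ("H", 1), ("H", 2), ("L", None)]
surface = ocp_cleanup(after_spreading)
print("".join(tone for tone, _ in surface))  # HHLL
```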

3. Thesis-Validation Architecture

Every generated tonal record embeds a set of testable theses — predictions about what the acoustic signal should look like if the tonal rules were applied correctly:

If Meeussen's Rule was applied, the thesis predicts that the lowered mora will show a measurably lower F₀ than the preceding High
If Binary Spreading occurred, the spread target should show elevated F₀ relative to an unspread Low
If a melodic pattern was imposed, the pitch contour should match the melodic template, not the lexical tones

When native speakers record these forms on the Amina platform, acoustic measurements are compared against thesis predictions. A thesis passes if the physical measurements confirm the rule; it fails if they contradict it. Systematic failures trigger cartridge review.

This creates a self-correcting loop: the engine generates predictions, speakers provide physical evidence, and discrepancies refine the cartridge parameters.
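The comparison between prediction and measurement can be sketched as follows. Both the Meeussen and the spreading theses reduce to "mora A should have a higher F₀ than mora B", so one check covers both. The `Thesis` fields, the 5 Hz threshold, and the F₀ values are assumptions for the example, not parameters of the Amina platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thesis:
    rule: str         # pipeline rule that made the prediction
    higher_mora: int  # mora predicted to have the higher F0
    lower_mora: int   # mora predicted to have the lower F0


def thesis_passes(thesis, f0_hz, min_gap_hz=5.0):
    """True if the measured F0 gap confirms the prediction (threshold is hypothetical)."""
    return f0_hz[thesis.higher_mora] - f0_hz[thesis.lower_mora] >= min_gap_hz


def failure_rate(theses, f0_hz):
    """Share of theses contradicted by one recording."""
    results = [thesis_passes(t, f0_hz) for t in theses]
    return results.count(False) / len(results)


f0 = [182.0, 151.0, 149.0]  # invented per-mora F0 values in Hz
theses = [Thesis("meeussen", higher_mora=0, lower_mora=1),
          Thesis("binary_spreading", higher_mora=1, lower_mora=2)]
print([thesis_passes(t, f0) for t in theses])  # [True, False]
print(failure_rate(theses, f0))                # 0.5
```

A failure rate aggregated over many speakers and forms, rather than a single recording, would be the natural trigger for cartridge review.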

4. Results: Bemba Proof of Concept

The Bemba tonal pipeline currently produces:

23,000+ tonal verb pairs (projected to scale to ~500,000 with the production root inventory)
Mora-level H/L tone assignments in each record, with provenance (which rule assigned each tone)
F₀ onset perturbation parameters per syllable
Embedded testable theses for acoustic validation
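To make the record contents concrete, here is a hypothetical shape for one record. Every field name is invented, and the perturbation values are left as `None` because the source does not give them; only the kinds of information (mora-level tones, provenance, per-syllable F₀ parameters, theses) come from the list above.

```python
# Hypothetical record layout; field names are not taken from the BTS engine.
record = {
    "form_id": "example-0001",
    "moras": [
        {"tone": "H", "rule": "lexical"},
        {"tone": "H", "rule": "binary_spreading"},
        {"tone": "L", "rule": "lexical"},
    ],
    "f0_onset_perturbation": [
        {"syllable": 0, "value": None},  # values not given in the source
        {"syllable": 1, "value": None},
        {"syllable": 2, "value": None},
    ],
    "theses": [
        {"rule": "binary_spreading", "higher_mora": 1, "lower_mora": 2},
    ],
}


def surface_tones(rec):
    """Final tone string, one character per mora."""
    return "".join(m["tone"] for m in rec["moras"])


def moras_set_by(rec, rule):
    """Indices of the moras whose tone was assigned by the given rule."""
    return [i for i, m in enumerate(rec["moras"]) if m["rule"] == rule]


print(surface_tones(record))                     # HHL
print(moras_set_by(record, "binary_spreading"))  # [1]
```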

Independent Audit (MRS)

132 stress-test records were selected to exercise every boundary case in the tonal pipeline. Gemini 2.5 Flash independently reviewed all 132 records and found zero tonal logic errors:

Meeussen's Rule: 48/48
Binary Spreading: 82/82
Nasal Harmony: 22/22
Melodic Override: 26/26

5. Scalability

The pipeline contains zero language-specific code. All linguistic parameters live in cartridge files. To generate tonal data for a new Bantu language, the process is:

Populate the cartridge with the language's morpheme inventory and underlying tones
Configure the tonal parameters (spreading type, Meeussen's domain, melodic patterns)
Run the same pipeline
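The steps above can be sketched as loading and validating a cartridge before the shared pipeline runs. The JSON layout, the field names, and the mapping from spreading type to a numeric reach are assumptions; the source does not describe the actual cartridge file format.

```python
import json
from dataclasses import dataclass

# Assumed encoding: how many moras an H spreads to the right (None = unbounded).
SPREAD_REACH = {"binary": 1, "ternary": 2, "unbounded": None}
REQUIRED = ("language", "spreading", "meeussen_domain", "melodies")


@dataclass(frozen=True)
class Cartridge:
    language: str
    spreading: str        # "binary", "ternary", or "unbounded"
    meeussen_domain: str  # label for the domain in which Meeussen's Rule applies
    melodies: dict        # TAM marker -> melodic pattern label

    @property
    def spread_reach(self):
        return SPREAD_REACH[self.spreading]


def load_cartridge(text):
    """Parse a cartridge and reject missing fields or unknown spreading types."""
    raw = json.loads(text)
    missing = [key for key in REQUIRED if key not in raw]
    if missing:
        raise ValueError(f"cartridge is missing fields: {missing}")
    if raw["spreading"] not in SPREAD_REACH:
        raise ValueError(f"unknown spreading type: {raw['spreading']!r}")
    return Cartridge(**{key: raw[key] for key in REQUIRED})


# Hypothetical cartridge for an unnamed language.
EXAMPLE = """
{"language": "example", "spreading": "binary",
 "meeussen_domain": "prefix", "melodies": {"tam_a": "P2"}}
"""
cartridge = load_cartridge(EXAMPLE)
print(cartridge.spread_reach)  # 1
```

Validating at load time keeps language-specific mistakes out of the pipeline code, which is the property the zero-language-specific-code claim depends on.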

The BTS registry contains 664 Bantu language varieties. Nine have proof-of-concept cartridges, and one (Bemba) is at production level with 1,100+ curated verb roots.

6. Conclusion

The tonal frontier is not a data collection problem; it is an engineering problem. Bantu tones obey strict, deterministic rules, which makes them engineerable. By formalizing these rules in parameterized cartridges and embedding testable predictions in every generated record, we create a system that is both scalable across languages and verifiable by acoustic physics. The 100% accuracy result on the Bemba MRS audit indicates that the approach works on the 132 stress-test records; the cartridge architecture is what allows it to scale.

Citation

@techreport{cintu2026tonalfrontier,
  title     = {The Tonal Frontier: Automating Ground Truth for Bantu Language AI},
  author    = {Cintu, Conti and others},
  year      = {2026},
  institution = {3MegaLabs},
  url       = {https://bantunomics.com/research/tonal-frontier}
}
