Structured Training Data

Data Products

Every record is morphologically decomposed, tonally annotated, and rule-traced. This is not web-scraped text; it is engineered linguistic infrastructure produced by 39 generators across the BTS engine, tonal pipeline, Amina speaker platform, and SENTGEN system.

Available Now

What data exists today

Phonological Inventories — 664 Languages

T0

The complete phonological scaffold for every registered Bantu language — something no current LLM can produce without errors. Ask any frontier model to list the complete vowel inventory, consonant digraphs, or valid syllable structures for Bemba, Tonga, or Lozi, and it will hallucinate. We have the verified ground truth.

Languages: 664
Files per language: 8
Total files: 5,300+
Syllables per language: 140+
Includes: vowels (short + long), consonants (single + digraphs), syllabary (core + complex), tone system, morphophonemic rules, phonological constraints, exceptions
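One way a syllabary file could be used is to check whether a word can be split entirely into known syllables. The sketch below is only an illustration of that idea: the `TOY_SYLLABARY` set and the `segment` function are hypothetical, and the six syllables are invented for the demo rather than taken from any product file.

```python
from __future__ import annotations

from functools import lru_cache

# Toy syllabary invented for illustration; NOT taken from the product files.
TOY_SYLLABARY = frozenset({"a", "e", "ba", "le", "bo", "mba"})


def segment(word: str, syllabary: frozenset[str] = TOY_SYLLABARY) -> list[str] | None:
    """Split `word` into syllables from `syllabary`, or return None if impossible."""
    longest = max(map(len, syllabary))

    @lru_cache(maxsize=None)
    def solve(start: int):
        if start == len(word):
            return ()
        # Try longer syllables first, backtracking when a choice dead-ends.
        for size in range(min(longest, len(word) - start), 0, -1):
            piece = word[start:start + size]
            if piece in syllabary:
                rest = solve(start + size)
                if rest is not None:
                    return (piece, *rest)
        return None

    result = solve(0)
    return None if result is None else list(result)


print(segment("baleebomba"))  # ['ba', 'le', 'e', 'bo', 'mba']
print(segment("balx"))        # None
```

A word that returns `None` contains at least one stretch that no listed syllable covers, which is the kind of error a verified inventory is meant to catch.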

Morphological Training Data — 250M+ Records

T1–T2

Full 9-slot verb decomposition with morpheme labels, slot positions, phonological rules applied, constraint checks, and provenance tracking. Every record carries 8 AI training task types. 151 morphological constraints verified per form for Bemba.

Bemba (prod): 127M+ records
9 POC languages: ~1.2M records
Bemba roots: 1,100+
AI task types: 8
POC languages: 9
POC languages: Bemba, Tonga, Nyanja, Lozi, Zulu, Swahili, Shona, Kinyarwanda, Lingala (350M+ speakers combined)

Specialized Generator Datasets (Bemba Production)

T2

Beyond verb morphology, the production Bemba cartridge has 26 active generators covering the full grammar:

E1/E2 — Verb conjugation: ~127M records
ADJ — Adjective agreement: 3,400 records, 100% verified
REL — Relative clauses: ~309K records, 75% verified
NUMBERS — Numeral system: ~184K records, 85% verified
EXIST — Existential constructions: ~115K records, 80% verified
COP — Copular system: ~50K records, 70% verified
Q — Question formation: ~23K records, 50% verified
POSS — Possessive concord: ~20K records
UMUBILI — Body parts domain: ~12K records, 43% verified
SYSLEARN — Meta-linguistic knowledge: 160 rules

D-Series — Structured Training Data Generators

T2

15 generators that transform raw morphological data into structured pedagogical training sequences for AI models:

D01 Ladder — progressive difficulty (2→9 slots)
D02 Cascade — noun class transformations (~776K)
D03 Minimal Pair — single-difference pairs
D04 Derivation Chain — extension sequences
D05 Semantic Validation — valid/invalid args
D06 Ambiguity — multiple-parse forms
D07 Cross-Module — multi-word compositions
D09 QA Pair — question-answer pairs
D10 Error Correction — error/fix pairs
D11 Enumeration — systematic list generation
D13 Parallel Translation — cross-language
D15 Context — contextual variation sets
D08 (Register), D12 (Discourse), D14 (Idiom) — blocked pending cultural sensitivity review
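
To make the D01 Ladder idea concrete, here is a minimal sketch that orders records from few filled slots to many. It assumes each record has a `slots` mapping like the one in the Record Schema section. The `build_ladder` function, the `id` field, and the placeholder slot values are hypothetical, and the real generator may define difficulty differently.

```python
from collections import defaultdict


def build_ladder(records, min_slots=2, max_slots=9):
    """Group records by number of filled slots and return the groups easiest-first."""
    rungs = defaultdict(list)
    for record in records:
        filled = sum(1 for slot in record["slots"].values() if slot)
        if min_slots <= filled <= max_slots:
            rungs[filled].append(record)
    return [(count, rungs[count]) for count in sorted(rungs)]


# Placeholder records: slot values are labels, not real morphemes.
records = [
    {"id": "r3", "slots": {"subject": "S", "tense": "T", "object": "O",
                           "root": "R", "final_vowel": "F"}},
    {"id": "r1", "slots": {"root": "R", "final_vowel": "F"}},
    {"id": "r2", "slots": {"subject": "S", "root": "R", "final_vowel": "F"}},
]

for count, rung in build_ladder(records):
    print(count, [r["id"] for r in rung])
# 2 ['r1']
# 3 ['r2']
# 5 ['r3']
```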

Tonal Ground Truth — 23,000+ Pairs

T2

Declarative/interrogative acoustic minimal pairs with mora-level H/L tone assignments, F₀ onset perturbation parameters, and embedded testable theses. MRS-certified: 132 stress-test records with 100% tonal accuracy (Gemini 2.5 Flash audit).

Bemba pairs: 23,000+
Target (full roots): ~500K
Syntactic frames: 8
Pipeline steps: 5
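
The example record in the Record Schema section names two tonal rules, `binary_spreading` and Meeussen's rule. The sketch below shows textbook simplifications of both: a high tone copied one mora to the right, and the second of two adjacent high tones lowered. It is an illustration only, not the pipeline's rule set. The function names, the order of application, and treating the record's per-slot tones (H L L L L) as the underlying mora sequence are all assumptions.

```python
def binary_spread(underlying):
    """Copy each underlying H onto the next mora if that mora is L (one step only)."""
    surface = list(underlying)
    for i, tone in enumerate(underlying[:-1]):
        if tone == "H" and underlying[i + 1] == "L":
            surface[i + 1] = "H"
    return surface


def meeussen(tones):
    """Lower the second of two adjacent H tones; also report whether the rule fired."""
    out = list(tones)
    triggered = False
    for i in range(1, len(out)):
        if out[i - 1] == "H" and out[i] == "H":
            out[i] = "L"
            triggered = True
    return out, triggered


# Assumed underlying sequence, read off the slot tones of the example record.
underlying = ["H", "L", "L", "L", "L"]
lowered, triggered = meeussen(underlying)
print(binary_spread(lowered), triggered)
# ['H', 'H', 'L', 'L', 'L'] False
```

Under these assumptions the output agrees with the record's `surface_tones` and `meeussen_triggered` fields.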

SENTGEN — Sentence-Level Training Data

T2

A language-agnostic sentence generator that produces bilingual training data with configurable code-switching. It draws on 84 templates across 7 categories, 17 lexicon pools (1,059 items), and 11 constraint rules, and outputs 6 AI training formats. The current code-switching ratio on occupation terms is 40% Bemba native to 60% English loan.

Templates: 84
Categories: 7
Output formats: 6
Possible combinations: ~11.8B

Categories: Introduction, Family, Education, Narrative, Greeting, Geopolitical, Negation
Formats: ASR, TTS, MT, LLM fine-tuning, Preference/DPO, Grammar correction
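
A configurable code-switching ratio can be read as a per-slot probability of choosing the native term over the loan. The sketch below illustrates that reading with the 40/60 split quoted above. The `fill_template` function, the template string, and the bracketed placeholder terms are hypothetical and stand in for real SENTGEN templates and lexicon entries.

```python
import random


def fill_template(template, native_term, loan_term, native_ratio, rng):
    """Fill the {occupation} slot, choosing the native term with probability native_ratio."""
    use_native = rng.random() < native_ratio
    term = native_term if use_native else loan_term
    return template.format(occupation=term), use_native


rng = random.Random(0)  # fixed seed so runs are repeatable
template = "<subject> <copula> {occupation}"  # placeholder, not a SENTGEN template

draws = [
    fill_template(template, "<bem:occupation>", "<eng:occupation>", 0.40, rng)
    for _ in range(10_000)
]
native_share = sum(is_native for _, is_native in draws) / len(draws)
print(f"native share over {len(draws)} sentences: {native_share:.3f}")
```

Over many sentences the observed share settles near the configured ratio, while any individual sentence uses either the native or the loan term.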

Amina — Speaker-Validated Audio

T1–T2

48 kHz mono recordings from native speakers across 25+ countries. Every recording is validated against engine predictions. Speaker calibration profiles are built from isolated syllable recordings. Validation Passports certify which linguistic rules each recording confirms.
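
A recording's basic format can be checked before any linguistic validation. The sketch below uses Python's standard `wave` module to confirm sample rate and channel count. The `check_wav` and `make_silent_wav` helpers are hypothetical, and the 16-bit sample width used for the demo audio is an assumption, since the bit depth is not stated here.

```python
import io
import wave


def check_wav(stream, expected_rate=48_000, expected_channels=1):
    """Return (ok, details) for a WAV stream's sample rate and channel count."""
    with wave.open(stream, "rb") as wav:
        rate, channels = wav.getframerate(), wav.getnchannels()
    ok = rate == expected_rate and channels == expected_channels
    return ok, {"rate": rate, "channels": channels}


def make_silent_wav(rate, channels, seconds=0.1):
    """Build an in-memory 16-bit PCM WAV of silence for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample (assumed bit depth)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))
    buffer.seek(0)
    return buffer


print(check_wav(make_silent_wav(48_000, 1)))  # (True, {'rate': 48000, 'channels': 1})
print(check_wav(make_silent_wav(44_100, 2)))  # (False, {'rate': 44100, 'channels': 2})
```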

Audio quality: 48 kHz
Countries: 25+
Languages onboarded: 21
Acoustic dimensions: 7

Coming to the Platform

What data is coming

PLANNED

Code-Switching Recordings

Naturalistic multi-language recordings capturing switching between Bantu languages and English/French/Portuguese. SENTGEN already supports configurable code-switching ratios (currently 40/60 Bemba/English) — expanding to full naturalistic speech.

PLANNED

Conversations & Stories

Multi-turn dialogues, folktales, personal narratives, and how-to instructions with tonal annotation. SENTGEN templates expanding to conversational and narrative categories.

PLANNED

Register & Formality Variants

D08 (Register Generator) and D14 (Idiom Generator) — formal, informal, colloquial, and idiomatic variants. Currently blocked pending cultural sensitivity review.

PLANNED

Discourse Sequences

D12 (Discourse Generator) — multi-sentence sequences with topic continuity, pronoun resolution, and tonal discourse markers. Cultural review required before activation.

SCALING

POC → Production for 4+ Languages

Tonga, Nyanja, Lozi, and Zulu are candidates for T1→T2 promotion. Requires human linguist review, extended root inventories, and NS verification passes.

SCALING

HFST Transducers for All T1+ Languages

Compiling the first all-Bantu-language HFST system. Offline O(n) morphological analysis and generation. Enterprise clients get compiled binaries for local deployment.

Record Schema

Every record carries full morpheme decomposition, tonal analysis, rule provenance, and 8 AI training task types.

Example morphological record with tonal data (simplified):
{
  "surface_form": "baleebomba",
  "language": "bem",
  "bts_version": "3.2.1",
  "slots": {
    "subject": {"morph": "ba", "gloss": "3PL", "tone": "H"},
    "tense": {"morph": "ale", "gloss": "PROG", "tone": "L"},
    "object": {"morph": "e", "gloss": "REFL", "tone": "L"},
    "root": {"morph": "bomb", "gloss": "work", "tone": "L"},
    "final_vowel": {"morph": "a", "gloss": "IND", "tone": "L"}
  },
  "tonal_analysis": {
    "surface_tones": ["H", "H", "L", "L", "L"],
    "rules_applied": ["lexical", "binary_spreading"],
    "meeussen_triggered": false
  },
  "constraints_checked": 151,
  "constraints_passed": 151,
  "task_types": ["decomposition", "composition", "translation",
                 "slot_predict", "harmony_predict", "error_detect",
                 "error_correct", "cross_lang_key"],
  "english": "they are working (for themselves)"
}
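
A consumer of these records might run a few structural checks before training on them. The sketch below loads the example record above and applies some basic ones. The `find_problems` function and the specific checks are assumptions chosen for illustration; they are not the 151 engine constraints.

```python
import json

# The example record from this section, compacted onto fewer lines.
RECORD = json.loads("""
{
  "surface_form": "baleebomba", "language": "bem", "bts_version": "3.2.1",
  "slots": {
    "subject": {"morph": "ba", "gloss": "3PL", "tone": "H"},
    "tense": {"morph": "ale", "gloss": "PROG", "tone": "L"},
    "object": {"morph": "e", "gloss": "REFL", "tone": "L"},
    "root": {"morph": "bomb", "gloss": "work", "tone": "L"},
    "final_vowel": {"morph": "a", "gloss": "IND", "tone": "L"}
  },
  "tonal_analysis": {
    "surface_tones": ["H", "H", "L", "L", "L"],
    "rules_applied": ["lexical", "binary_spreading"],
    "meeussen_triggered": false
  },
  "constraints_checked": 151, "constraints_passed": 151,
  "task_types": ["decomposition", "composition", "translation", "slot_predict",
                 "harmony_predict", "error_detect", "error_correct", "cross_lang_key"],
  "english": "they are working (for themselves)"
}
""")

TONES = {"H", "L"}


def find_problems(record):
    """Return a list of structural problems; an empty list means none were found."""
    problems = []
    for name, slot in record["slots"].items():
        missing = {"morph", "gloss", "tone"} - set(slot)
        if missing:
            problems.append(f"slot {name}: missing {sorted(missing)}")
        elif slot["tone"] not in TONES:
            problems.append(f"slot {name}: unknown tone {slot['tone']}")
    if not set(record["tonal_analysis"]["surface_tones"]) <= TONES:
        problems.append("surface_tones: unknown tone label")
    if record["constraints_passed"] > record["constraints_checked"]:
        problems.append("more constraints passed than checked")
    if len(set(record["task_types"])) != len(record["task_types"]):
        problems.append("duplicate task types")
    return problems


print(find_problems(RECORD))  # []
```

The checks deliberately do not compare the joined morphs with `surface_form`: concatenating the slots gives `baaleebomba`, while the recorded surface form is `baleebomba`, so the phonological rules have to be applied before the two can be compared.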

8 AI Training Task Types Per Record

Every morphological record is usable for 8 distinct training objectives simultaneously.

Decomposition: Parse surface form into morpheme slots
Composition: Combine slots into correct surface form
Translation: Translate to/from English with slot alignment
Slot Prediction: Predict missing morpheme slot
Harmony Prediction: Predict vowel harmony output
Error Detection: Identify grammatical violations
Error Correction: Fix malformed words with rule citation
Cross-Language Key: Canonical form for cross-language comparison
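
As an illustration of how one record can serve several objectives, the sketch below derives input/target pairs for four of the eight task types from the example record's fields. The `to_examples` and `label` helpers and the `slot=morph` text format are hypothetical. The remaining four task types are omitted because they need information that is not in the simplified record.

```python
# Trimmed copy of the example record from the Record Schema section.
RECORD = {
    "surface_form": "baleebomba",
    "english": "they are working (for themselves)",
    "slots": {
        "subject": {"morph": "ba"},
        "tense": {"morph": "ale"},
        "object": {"morph": "e"},
        "root": {"morph": "bomb"},
        "final_vowel": {"morph": "a"},
    },
}


def label(slots, hide=None):
    """Render slots as 'name=morph' pairs, masking the slot named by `hide`."""
    return " ".join(
        f"{name}={'?' if name == hide else slot['morph']}"
        for name, slot in slots.items()
    )


def to_examples(record, hidden_slot="root"):
    """Build one input/target pair per supported task type (format is an assumption)."""
    slots = record["slots"]
    full = label(slots)
    return {
        "decomposition": {"input": record["surface_form"], "target": full},
        "composition": {"input": full, "target": record["surface_form"]},
        "translation": {"input": record["surface_form"], "target": record["english"]},
        "slot_predict": {
            "input": label(slots, hide=hidden_slot),
            "target": slots[hidden_slot]["morph"],
        },
    }


for task, example in to_examples(RECORD).items():
    print(task, example)
```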

Download Sample Data

Free JSONL samples to evaluate data quality, schema structure, and tonal annotations. No account required.
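
JSONL stores one JSON object per line, so a sample can be streamed without loading the whole file. The sketch below is a minimal reader. The `read_jsonl` helper is hypothetical, and the two inline lines are trimmed stand-ins built from the example record, not actual sample data.

```python
import io
import json
from collections import Counter


def read_jsonl(stream):
    """Yield one dict per non-empty line; report the line number on invalid JSON."""
    for number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {number}: {err.msg}") from err


# Stand-in for an opened sample file; a blank line is included on purpose.
sample = io.StringIO(
    '{"surface_form": "baleebomba", "language": "bem"}\n'
    "\n"
    '{"surface_form": "baleebomba", "language": "bem", "constraints_checked": 151}\n'
)

records = list(read_jsonl(sample))
print(len(records), Counter(r["language"] for r in records))
# 2 Counter({'bem': 2})
```

With a downloaded file, the same function can be given an open file handle in place of the `io.StringIO` object.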

Output Formats

SENTGEN exports to 6 AI training formats:

ASR: Speech recognition training
TTS: Text-to-speech training
MT: Machine translation pairs
LLM: Language model fine-tuning
DPO: Preference / RLHF training
Grammar: Grammatical annotation & correction
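
To give a feel for how one sentence pair might be serialized for two of these formats, here is a hedged sketch. The field names are assumptions, not SENTGEN's export schema: the MT layout is invented for illustration, and the DPO layout follows the common prompt/chosen/rejected convention. The rejected answer is a made-up wrong translation used only to complete the example.

```python
import json


def to_mt(source, target, src_lang="bem", tgt_lang="en"):
    """One machine-translation pair (field names are assumed)."""
    return {"src_lang": src_lang, "tgt_lang": tgt_lang,
            "source": source, "target": target}


def to_dpo(prompt, chosen, rejected):
    """One preference record in the common prompt/chosen/rejected layout."""
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


surface = "baleebomba"                          # from the example record
english = "they are working (for themselves)"   # from the example record
wrong = "they worked"                           # hypothetical rejected answer

lines = [
    json.dumps(to_mt(surface, english), ensure_ascii=False),
    json.dumps(to_dpo(f"Translate to English: {surface}", english, wrong),
               ensure_ascii=False),
]
print("\n".join(lines))
```

Each call to `json.dumps` produces a single line, so the records can be appended directly to a JSONL file.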

Need custom data, specific language coverage, or enterprise access?