Structured Training Data

Data Products

Every record is morphologically decomposed, tonally annotated, and rule-traced. This is not web-scraped text; it is engineered linguistic infrastructure produced by 39 generators across the BTS engine, tonal pipeline, Amina speaker platform, and SENTGEN system.

Available Now

What data exists today

Phonological Inventories — 664 Languages

The complete phonological scaffold for every registered Bantu language — something no current LLM can produce without errors. Ask any frontier model to list the complete vowel inventory, consonant digraphs, or valid syllable structures for Bemba, Tonga, or Lozi, and it will hallucinate. We have the verified ground truth.

Languages: 664

Files / language

Total files: 5,300+

Syllables / language: 140+

Includes: vowels (short + long), consonants (single + digraphs), syllabary (core + complex), tone system, morphophonemic rules, phonological constraints, exceptions

Morphological Training Data — 250M+ Records

T1–T2

Full 9-slot verb decomposition with morpheme labels, slot positions, phonological rules applied, constraint checks, and provenance tracking. Every record carries 8 AI training task types. 151 morphological constraints verified per form for Bemba.

Bemba (prod): 127M+

9 POC languages: ~1.2M

Bemba roots: 1,100+

AI task types: 8

POC languages: 9

POC languages: Bemba, Tonga, Nyanja, Lozi, Zulu, Swahili, Shona, Kinyarwanda, Lingala (350M+ speakers combined)

Specialized Generator Datasets (Bemba Production)

Beyond verb morphology, the production Bemba cartridge has 26 active generators covering the full grammar:

E1/E2 (Verb conjugation): ~127M records

ADJ (Adjective agreement): 3,400 (100% verified)

REL (Relative clauses): ~309K (75%)

NUMBERS (Numeral system): ~184K (85%)

EXIST (Existential constructions): ~115K (80%)

COP (Copular system): ~50K (70%)

Q (Question formation): ~23K (50%)

POSS (Possessive concord): ~20K

UMUBILI (Body parts domain): ~12K (43%)

SYSLEARN (Meta-linguistic knowledge): 160 rules

D-Series — Structured Training Data Generators

15 generators that transform raw morphological data into structured pedagogical training sequences for AI models:

D01 Ladder — progressive difficulty (2→9 slots)

D02 Cascade — noun class transformations (~776K)

D03 Minimal Pair — single-difference pairs

D04 Derivation Chain — extension sequences

D05 Semantic Validation — valid/invalid args

D06 Ambiguity — multiple-parse forms

D07 Cross-Module — multi-word compositions

D09 QA Pair — question-answer pairs

D10 Error Correction — error/fix pairs

D11 Enumeration — systematic list generation

D13 Parallel Translation — cross-language

D15 Context — contextual variation sets

D08 (Register), D12 (Discourse), D14 (Idiom) — blocked pending cultural sensitivity review
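To make the D01 Ladder idea concrete, here is a minimal sketch of ordering records by the number of filled slots, from 2 up to 9. The `ladder` function, the toy records, and their placeholder slot values are assumptions made for illustration; they do not reproduce the actual generator.

```python
from collections import defaultdict

def ladder(records, min_slots=2, max_slots=9):
    """Order records from fewest to most filled slots (stable within a rung)."""
    rungs = defaultdict(list)
    for record in records:
        filled = len(record["slots"])
        if min_slots <= filled <= max_slots:
            rungs[filled].append(record)
    return [r for n in range(min_slots, max_slots + 1) for r in rungs[n]]

# Toy records with placeholder slot values (hypothetical, not real data).
toy_records = [
    {"id": "r3", "slots": {"subject": "S", "tense": "T", "root": "R", "final_vowel": "FV"}},
    {"id": "r1", "slots": {"root": "R", "final_vowel": "FV"}},
    {"id": "r2", "slots": {"subject": "S", "root": "R", "final_vowel": "FV"}},
    {"id": "r0", "slots": {"root": "R"}},  # below the 2-slot floor, skipped
]

curriculum = [r["id"] for r in ladder(toy_records)]
print(curriculum)  # ['r1', 'r2', 'r3']
```

Grouping into rungs rather than sorting keeps the original order inside each difficulty level, which may matter if upstream ordering carries meaning.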

Tonal Ground Truth — 23,000+ Pairs

Declarative/interrogative acoustic minimal pairs with mora-level H/L tone assignments, F₀ onset perturbation parameters, and embedded testable theses. MRS-certified: 132 stress-test records with 100% tonal accuracy (Gemini 2.5 Flash audit).

Bemba pairs: 23,000+

Target (full roots): ~500K

Syntactic frames

Pipeline steps
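One way to work with mora-level H/L assignments in a declarative/interrogative pair is to locate the moras where the two members differ. The sketch below assumes each member is stored as a list of "H"/"L" labels of equal length; the `tone_diff` helper, the field names, and the interrogative tone pattern are invented for illustration and are not taken from the dataset.

```python
def tone_diff(declarative, interrogative):
    """Return the mora indices at which the two tone sequences differ."""
    if len(declarative) != len(interrogative):
        raise ValueError("pair members must have the same number of moras")
    return [
        i
        for i, (d, q) in enumerate(zip(declarative, interrogative))
        if d != q
    ]

pair = {
    "declarative": ["H", "H", "L", "L", "L"],
    # Hypothetical interrogative pattern, made up for this example.
    "interrogative": ["H", "H", "L", "L", "H"],
}

changed = tone_diff(pair["declarative"], pair["interrogative"])
print(changed)  # [4]
```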

SENTGEN — Sentence-Level Training Data

Language-agnostic sentence generator producing bilingual training data with configurable code-switching. 84 templates across 7 categories, 17 lexicon pools (1,059 items), 11 constraint rules. Outputs in 6 AI training formats. The current code-switching configuration is 40% Bemba native / 60% English loan on occupation terms.

Templates: 84
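A configurable code-switching ratio can be pictured as a weighted choice between a native form and a loan form each time a template slot is filled. The following is a rough sketch under that assumption only; the template string, the pool layout, the placeholder tokens, and the function names are hypothetical and do not reflect SENTGEN's actual templates or lexicon.

```python
import random

# Hypothetical lexicon pool: placeholder tokens stand in for real entries.
OCCUPATIONS = [
    {"concept": "teacher", "native": "<bem-native:teacher>", "loan": "<en-loan:teacher>"},
    {"concept": "nurse", "native": "<bem-native:nurse>", "loan": "<en-loan:nurse>"},
]

# Hypothetical template with a single switchable slot.
TEMPLATE = "<subject> <copula> {occupation}"

def pick_occupation(rng, native_ratio=0.4):
    """Choose an entry, then pick its native or loan form by the given ratio."""
    entry = rng.choice(OCCUPATIONS)
    variant = "native" if rng.random() < native_ratio else "loan"
    return entry[variant], variant

def generate(n, native_ratio=0.4, seed=0):
    """Produce n filled templates; seeding keeps the output reproducible."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        form, variant = pick_occupation(rng, native_ratio)
        out.append({"text": TEMPLATE.format(occupation=form), "variant": variant})
    return out

samples = generate(10_000)
native_share = sum(s["variant"] == "native" for s in samples) / len(samples)
print(f"native share: {native_share:.3f}")  # close to 0.4 for large n
```

Because the choice is made per slot, the realized share only approximates the configured ratio on small batches.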

Amina — Speaker-Validated Audio

T1–T2

48kHz mono recordings from native speakers across 25+ countries. Every recording validated against engine predictions. Speaker calibration profiles built from isolated syllable recordings. Validation Passports certify which linguistic rules each recording confirms.

Audio quality: 48 kHz

Countries: 25+

Languages onboarded

Acoustic dimensions

Coming to the Platform

What data is coming

PLANNED

Code-Switching Recordings

Naturalistic multi-language recordings capturing switching between Bantu languages and English/French/Portuguese. SENTGEN already supports configurable code-switching ratios (currently 40/60 Bemba/English) — expanding to full naturalistic speech.

PLANNED

Conversations & Stories

Multi-turn dialogues, folktales, personal narratives, and how-to instructions with tonal annotation. SENTGEN templates expanding to conversational and narrative categories.

PLANNED

Register & Formality Variants

D08 (Register Generator) and D14 (Idiom Generator) — formal, informal, colloquial, and idiomatic variants. Currently blocked pending cultural sensitivity review.

PLANNED

Discourse Sequences

D12 (Discourse Generator) — multi-sentence sequences with topic continuity, pronoun resolution, and tonal discourse markers. Cultural review required before activation.

SCALING

POC → Production for 4+ Languages

Tonga, Nyanja, Lozi, and Zulu are candidates for T1→T2 promotion. Promotion requires human linguist review, extended root inventories, and native-speaker (NS) verification passes.

SCALING

HFST Transducers for All T1+ Languages

Compiling the first all-Bantu-language HFST system. Offline O(n) morphological analysis and generation. Enterprise clients get compiled binaries for local deployment.

Record Schema

Every record carries full morpheme decomposition, tonal analysis, rule provenance, and 8 AI training task types.

// Example morphological record with tonal data (simplified)

{
  "surface_form": "baleebomba",
  "language": "bem",
  "bts_version": "3.2.1",
  "slots": {
    "subject": {"morph": "ba", "gloss": "3PL", "tone": "H"},
    "tense": {"morph": "ale", "gloss": "PROG", "tone": "L"},
    "object": {"morph": "e", "gloss": "REFL", "tone": "L"},
    "root": {"morph": "bomb", "gloss": "work", "tone": "L"},
    "final_vowel": {"morph": "a", "gloss": "IND", "tone": "L"}
  },
  "tonal_analysis": {
    "surface_tones": ["H", "H", "L", "L", "L"],
    "rules_applied": ["lexical", "binary_spreading"],
    "meeussen_triggered": false
  },
  "constraints_checked": 151,
  "constraints_passed": 151,
  "task_types": ["decomposition", "composition", "translation",
                 "slot_predict", "harmony_predict", "error_detect",
                 "error_correct", "cross_lang_key"],
  "english": "they are working (for themselves)"
}
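A record of this shape can be sanity-checked before it is used for training. The sketch below assumes the field names shown in the simplified example; the `check_record` helper and the `SLOT_ORDER` list are illustrative assumptions, not part of the published schema, and the check does not model any phonological rules.

```python
import json

# Slot order as it appears in the simplified example (an assumption; the
# full schema has nine slots, only five of which are shown there).
SLOT_ORDER = ["subject", "tense", "object", "root", "final_vowel"]

RECORD_JSON = """
{
  "surface_form": "baleebomba",
  "language": "bem",
  "slots": {
    "subject": {"morph": "ba", "gloss": "3PL", "tone": "H"},
    "tense": {"morph": "ale", "gloss": "PROG", "tone": "L"},
    "object": {"morph": "e", "gloss": "REFL", "tone": "L"},
    "root": {"morph": "bomb", "gloss": "work", "tone": "L"},
    "final_vowel": {"morph": "a", "gloss": "IND", "tone": "L"}
  },
  "constraints_checked": 151,
  "constraints_passed": 151
}
"""

def check_record(record: dict) -> dict:
    """Build a small report from one record."""
    slots = record["slots"]
    present = [name for name in SLOT_ORDER if name in slots]
    morphs = [slots[name]["morph"] for name in present]
    return {
        "segmentation": "-".join(morphs),
        "gloss": "-".join(slots[name]["gloss"] for name in present),
        "all_constraints_passed": (
            record["constraints_passed"] == record["constraints_checked"]
        ),
        # Plain concatenation can differ from the surface form whenever
        # phonological rules apply, so False here is not an error by itself.
        "concatenation_matches_surface": "".join(morphs) == record["surface_form"],
    }

report = check_record(json.loads(RECORD_JSON))
print(report)
```

For the example record, the morphs concatenate to `baaleebomba` while the surface form is `baleebomba`, which is why the last field is reported rather than enforced.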

8 AI Training Task Types Per Record

Every morphological record is usable for 8 distinct training objectives simultaneously.

Decomposition: Parse surface form into morpheme slots

Composition: Combine slots into correct surface form

Translation: Translate to/from English with slot alignment

Slot Prediction: Predict missing morpheme slot

Harmony Prediction: Predict vowel harmony output

Error Detection: Identify grammatical violations

Error Correction: Fix malformed words with rule citation

Cross-Language Key: Canonical form for cross-language comparison
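To illustrate how one record might feed several objectives at once, the sketch below derives input/target pairs for four of the eight task types (decomposition, composition, translation, slot prediction) from the simplified example record. The `make_tasks` function, the `[MASK]` token, and the output field names are assumptions made for this example, not the platform's export format.

```python
RECORD = {
    "surface_form": "baleebomba",
    "slots": {
        "subject": {"morph": "ba"},
        "tense": {"morph": "ale"},
        "object": {"morph": "e"},
        "root": {"morph": "bomb"},
        "final_vowel": {"morph": "a"},
    },
    "english": "they are working (for themselves)",
}

def make_tasks(record: dict) -> list[dict]:
    """Derive training pairs from one record (dict order = slot order)."""
    slots = record["slots"]
    segmented = "-".join(s["morph"] for s in slots.values())
    tasks = [
        {"task": "decomposition", "input": record["surface_form"], "target": segmented},
        {"task": "composition", "input": segmented, "target": record["surface_form"]},
        {"task": "translation", "input": record["surface_form"], "target": record["english"]},
    ]
    # Slot prediction: mask each slot in turn and ask for the missing morph.
    for name in slots:
        masked = "-".join(
            "[MASK]" if other == name else s["morph"] for other, s in slots.items()
        )
        tasks.append(
            {"task": "slot_predict", "slot": name, "input": masked, "target": slots[name]["morph"]}
        )
    return tasks

tasks = make_tasks(RECORD)
for t in tasks[:4]:
    print(t)
```

The remaining task types (harmony prediction, error detection, error correction, cross-language key) need information that the simplified record does not show, so they are left out here.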

Download Sample Data

Free JSONL samples to evaluate data quality, schema structure, and tonal annotations. No account required.

Morphological Sample (.jsonl)

Tonal Pairs Sample (.jsonl)
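JSONL files hold one JSON object per line, so a sample can be inspected with the standard library alone. This is a minimal sketch: the `read_jsonl` helper is hypothetical, and an in-memory string with two abbreviated records stands in for a downloaded file, since the actual file names are not given here.

```python
import io
import json

def read_jsonl(stream):
    """Yield one parsed object per non-empty line, reporting bad lines by number."""
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: invalid JSON") from exc

# Stand-in for an opened sample file; records are abbreviated for the example.
sample = io.StringIO(
    '{"surface_form": "baleebomba", "language": "bem"}\n'
    "\n"
    '{"surface_form": "baleebomba", "language": "bem", '
    '"constraints_checked": 151, "constraints_passed": 151}\n'
)

records = list(read_jsonl(sample))
print(len(records), sorted(records[1].keys()))
```

With a real download, the same function can be applied to a file opened with `open(path, encoding="utf-8")`.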

Output Formats

SENTGEN exports to 6 AI training formats:

ASR: Speech recognition training

TTS: Text-to-speech training

MT: Machine translation pairs

LLM: Language model fine-tuning

DPO: Preference / RLHF training

Grammar: Grammatical annotation & correction

Need custom data, specific language coverage, or enterprise access?

See pricing plans → Contact sales →