Scaling Zero-Resource Vocabulary: A Data Pipeline for Sango

Community Article Published May 22, 2026

TL;DR: SangoAI launched with 583 native-speaker-verified Sango words. That is not enough for a serious language learning platform. This post documents the multi-source vocabulary intelligence pipeline we built and ran on 2026-05-22. It draws on the NLLB corpus (2.8 million aligned sentence pairs in total; this run processed the 588,039 French–Sango pairs in full and sampled 200K rows of the English–Sango set), 813 Sango pages from FR Wiktionary, and native-speaker input, all cross-validated and enriched by a Foundation LLM. The run produced 986 new candidates in a single morning. We also ran the first Swadesh coverage analysis for Sango, establishing a reproducible baseline for tracking core vocabulary completeness. Alongside the pipeline, we shipped MEYNG/nllb-sango-finetuned-600m: NLLB-200-distilled-600M fine-tuned on 500K+ Sango-French pairs, achieving BLEU 19.49 / chrF 36.00 on French→Sango (+5.70 BLEU over the baseline NLLB model). The pipeline methodology, the benchmark results, and the transferable recipe are documented here.

Author: Michel WENEZOUI — Founder, MEYNG · sangoai.sbs · meyng.com

Companion post: Vocabulary-Augmented Prompting for Sango — Production African Language AI Without a Parallel Corpus (the original post covers the translation system; this post covers the data infrastructure underneath it)


1. Why 583 words is not enough

When we launched SangoAI, the vocabulary database had 583 verified entries. Each entry had been manually reviewed: Sango word, French translation, English translation, semantic category, difficulty tier, example sentence. The data quality was high. The coverage was not.

The Swadesh list is a standard reference list of basic concepts in comparative linguistics. The common version has 207 items covering meanings that nearly every language expresses: pronouns, basic verbs, body parts, numbers, natural phenomena. It was designed for comparing languages rather than teaching them, but it works well as a proxy for core vocabulary: a language learning tool that is missing Swadesh words is likely to fail learners in their first week.

We ran the Swadesh analysis against our 583-word database. Result: 33 of 96 tested Swadesh words were present in the verified vocabulary. That is a 34% core coverage rate. Learners using SangoAI could not look up "walk", "burn", "year", "hot", "cold", "smoke", "tooth", "claw", "feather" — foundational vocabulary, not edge cases.
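The coverage figure is a set-membership count. A minimal sketch, assuming each tested Swadesh concept has already been paired with the Sango form a native speaker expects; the function name and the toy checklist are illustrative, not the project's code.

```python
def swadesh_coverage(swadesh_to_sango, verified_words):
    """Return (covered_concepts, tested_count, coverage_rate).

    swadesh_to_sango maps a Swadesh concept to its expected Sango form;
    verified_words is the set of verified vocabulary entries.
    """
    verified = {w.lower() for w in verified_words}
    covered = sorted(c for c, s in swadesh_to_sango.items() if s.lower() in verified)
    tested = len(swadesh_to_sango)
    rate = len(covered) / tested if tested else 0.0
    return covered, tested, rate


# Toy data only: three concepts, one of which is already verified.
checklist = {"water": "ngou", "year": "ngu", "walk": "tamboura"}
covered, tested, rate = swadesh_coverage(checklist, {"ngou", "nga"})
print(covered, tested, f"{rate:.0%}")

# The figure reported above: 33 of 96 tested words.
print(f"{33 / 96:.0%}")
```

Tracking the same function output after each pipeline run gives the reproducible baseline the analysis is meant to provide.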

The gap was not a linguistics problem. It was a data engineering problem. We knew what words were missing. We had access to large Sango text corpora. What we lacked was a repeatable pipeline to mine, validate, and import vocabulary at scale.

This post describes what we built.


2. The data sources

Three sources feed the pipeline. They are complementary — each covers vocabulary the others miss.

2a. NLLB parallel corpus (primary source)

Meta's No Language Left Behind project produced large-scale parallel corpora for low-resource languages. For Sango:

  • French–Sango: 588,039 aligned sentence pairs (fra_Latn-sag_Latn)
  • English–Sango: 2,219,119 aligned sentence pairs (eng_Latn-sag_Latn)

Both corpora include LASER alignment scores — a semantic similarity metric between source and target sentences, calibrated to identify high-quality translations. We use LASER as the primary quality gate: only sentence pairs with LASER score ≥ 1.05 enter the mining stage. That threshold filters out low-confidence machine-generated alignments while retaining statistically well-supported pairs.
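The quality gate can be sketched as a streaming filter over CSV rows. This assumes a column named laser_score; the column names and the sample rows are hypothetical, since the corpus file layout is not described here.

```python
import csv
import io

LASER_MIN = 1.05  # threshold stated in the article


def filter_pairs(rows, laser_min=LASER_MIN):
    """Yield rows whose LASER score meets the threshold.

    Rows with a missing or non-numeric score are dropped.
    """
    for row in rows:
        try:
            score = float(row["laser_score"])
        except (KeyError, TypeError, ValueError):
            continue
        if score >= laser_min:
            yield row


# Toy rows using words mentioned in this post; column names are assumptions.
sample_csv = io.StringIO(
    "laser_score,french,sango\n"
    "1.12,eau,ngou\n"
    "1.01,qui,wa\n"
    "n/a,année,ngu\n"
)
kept = list(filter_pairs(csv.DictReader(sample_csv)))
print(len(kept), kept[0]["sango"])
```

Because the filter is a generator, it never holds the full corpus in memory, which matters at the scale of the 2.2M-row English set.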

The corpus is domain-mixed. Religious texts dominate (there is substantial digitized Sango Bible content), but the 2.2M English corpus draws more heavily from web and news sources — different register coverage than the French corpus. Running both fills different vocabulary gaps.

2b. FR Wiktionary (validation source)

The French Wiktionary has 813 pages in the Sango language category, accessible via the MediaWiki API. The coverage is uneven — some entries have full phonological data, others only a bare translation — but it provides an independent signal to cross-validate corpus-derived candidates. A word that appears in both the NLLB corpus and Wiktionary is significantly more trustworthy than one that appears in only one source.

The miner uses the MediaWiki extracts API with the explaintext parameter set (plain-text output), respects a 0.5-second inter-request delay, and parses French definitions from the first sentence of each page extract.
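A minimal sketch of the request and parsing steps, using only the standard library. The helper names, the User-Agent string, and the first-sentence heuristic are assumptions; real Wiktionary extracts are messier than the one-line example here.

```python
import json
import time
import urllib.parse
import urllib.request

API_URL = "https://fr.wiktionary.org/w/api.php"
REQUEST_DELAY_S = 0.5  # inter-request delay stated in the article


def extract_params(title):
    """Query parameters for a plain-text extract of one page."""
    return {
        "action": "query",
        "prop": "extracts",
        "explaintext": "1",
        "titles": title,
        "format": "json",
    }


def first_sentence(extract):
    """Return the first non-empty line of an extract, cut at the first period."""
    for line in extract.splitlines():
        line = line.strip()
        if line:
            return line.split(". ")[0].rstrip(".")
    return ""


def fetch_extract(title):
    """Fetch one page extract (network call; not exercised in the example)."""
    url = API_URL + "?" + urllib.parse.urlencode(extract_params(title))
    request = urllib.request.Request(url, headers={"User-Agent": "vocab-miner-sketch/0.1"})
    with urllib.request.urlopen(request, timeout=10) as resp:
        pages = json.load(resp)["query"]["pages"]
    time.sleep(REQUEST_DELAY_S)  # stay polite between requests
    return next(iter(pages.values())).get("extract", "")


print(first_sentence("\n  Eau. Liquide transparent.\nAutre ligne"))
```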

2c. Native speaker input (highest-confidence tier)

Native speakers can provide words that no corpus or dictionary contains — culturally embedded vocabulary, register distinctions, words that appear in oral speech but not in digitized text. These bypass the pipeline's automated confidence scoring and are staged directly at confidence 0.95, the highest tier.

During this run, four Swadesh gaps were filled by native-speaker input:

  • tamboura = marcher / to walk
  • wa = chaud / hot, warm (distinct tone from wa = qui / who)
  • ngu = année / year (distinct word from ngou = eau / water)
  • gbî = brûler / to burn

The tonal distinction between wa (hot) and wa (who) — same spelling, different tone — illustrates why native speaker verification is irreplaceable. A corpus miner cannot distinguish these; the NLLB corpus records both as wa. Only a speaker knows which is which from context.


3. Pipeline architecture

Four stages: Mine → Stage → Enrich → Import

NLLB corpus (CSV)  ──┐
FR Wiktionary      ──┼──► SQLite staging DB ──► Foundation LLM enrichment ──► DynamoDB
Native speaker     ──┘        (dedup + conf)          (validate + translate)    (pending)

Stage 1 — Mine

The corpus miner reads each CSV row, applies the LASER threshold filter, tokenizes the Sango sentence using a diacritic-aware regex (Sango uses tonal diacritics: â ê ë î ö ô û ü), and counts token frequency across the full corpus. Only tokens appearing ≥ 3 times pass to the candidate stage, because tokens seen only once or twice are overwhelmingly OCR errors, proper names, or code-switching.

For each surviving token, the miner records:

  • Best LASER score across all sentence pairs containing it
  • Representative example sentence (the pair with the highest LASER score)
  • Top 3 co-occurring French/English words (used as translation hints)
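The mining pass described above could look roughly like this. It is a sketch under assumptions: the regex, the tuple layout of the input, and the output field names are illustrative, and the toy "sentences" are just word sequences built from vocabulary mentioned in this post.

```python
import re
import unicodedata
from collections import Counter

LETTERS = "a-zâêëîöôûü"  # ASCII letters plus the diacritics listed above
TOKEN_RE = re.compile(rf"[{LETTERS}]+")
MIN_FREQ = 3


def tokenize(sentence):
    """Lowercased tokens; NFC keeps accented letters as single characters."""
    return TOKEN_RE.findall(unicodedata.normalize("NFC", sentence).lower())


def mine(pairs, min_freq=MIN_FREQ):
    """pairs: iterable of (sango_sentence, source_sentence, laser_score)."""
    freq = Counter()
    best = {}   # token -> (laser_score, sango_sentence, source_sentence)
    hints = {}  # token -> Counter of co-occurring source-language words
    for sango, source, laser in pairs:
        tokens = tokenize(sango)
        freq.update(tokens)
        source_words = [w.lower() for w in re.findall(r"\w+", source)]
        for token in set(tokens):
            if token not in best or laser > best[token][0]:
                best[token] = (laser, sango, source)
            hints.setdefault(token, Counter()).update(source_words)
    return {
        t: {
            "freq": n,
            "best_laser": best[t][0],
            "example": best[t][1:],
            "hints": [w for w, _ in hints[t].most_common(3)],
        }
        for t, n in freq.items()
        if n >= min_freq
    }


pairs = [
    ("ngou ngou wa", "eau eau chaud", 1.10),
    ("ngou", "eau", 1.20),
    ("wa gbî", "qui brûler", 1.06),
]
result = mine(pairs)
print(result)
```

In the toy run only ngou survives the frequency cut-off; wa and gbî fall below three occurrences.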

Linguistic validation happens at this stage via is_valid_sango_token():

  • Minimum 2 characters
  • Primarily alphabetic (no digits, no punctuation fragments)
  • Contains at least one vowel
  • Falls within plausible Sango token length range

Proper noun detection (looks_like_proper_noun()) checks capitalization patterns and known name suffixes — proper nouns are downweighted in confidence but not excluded, since place names and person names are legitimate vocabulary.
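The two checks are named above but their bodies are not shown, so the following is one plausible reading rather than the project's implementation. MAX_LEN, the 0.8 ratio, and the input shape of looks_like_proper_noun() are assumptions, and the name-suffix check is left out because the suffix list is not given.

```python
import re

VOWELS = set("aeiouâêëîöôûü")
LETTERS_RE = re.compile(r"^[a-zâêëîöôûü]+(?:['-][a-zâêëîöôûü]+)*$", re.IGNORECASE)
MIN_LEN, MAX_LEN = 2, 20  # MAX_LEN is an assumed upper bound


def is_valid_sango_token(token):
    """Length, alphabet, and vowel checks from the list above."""
    if not (MIN_LEN <= len(token) <= MAX_LEN):
        return False
    if not LETTERS_RE.match(token):
        return False
    return any(ch in VOWELS for ch in token.lower())


def looks_like_proper_noun(occurrences, min_ratio=0.8):
    """occurrences: list of (surface_form, is_sentence_initial) tuples.

    Sentence-initial occurrences are ignored because they are capitalized
    regardless of word class. min_ratio is an assumed cut-off.
    """
    mid_sentence = [s for s, initial in occurrences if not initial]
    if not mid_sentence:
        return False
    capitalized = sum(1 for s in mid_sentence if s[:1].isupper())
    return capitalized / len(mid_sentence) >= min_ratio


print(is_valid_sango_token("gbî"), is_valid_sango_token("ab12"))
print(looks_like_proper_noun([("Bangui", False), ("Bangui", True)]))
```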

Stage 2 — Stage and deduplicate

Every candidate is written to a SQLite staging database with a UNIQUE(sango, french) constraint. When the same (sango, french) pair appears across multiple sources, the constraint triggers an upsert that increments cross_source_count rather than creating a duplicate row. This cross-source count becomes a quality signal: a word confirmed by both the French and English corpora, plus Wiktionary, has cross_source_count = 3 and is scored higher in the enrichment queue.
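A sketch of the upsert using SQLite's ON CONFLICT clause (available from SQLite 3.24). The table and column names are assumptions. The sources column and the WHERE guard are additions for this sketch, so that re-inserting a pair from a source already seen does not inflate the count.

```python
import sqlite3

SCHEMA = """
CREATE TABLE candidates (
    id INTEGER PRIMARY KEY,
    sango TEXT NOT NULL,
    french TEXT NOT NULL,
    sources TEXT NOT NULL,
    cross_source_count INTEGER NOT NULL DEFAULT 1,
    UNIQUE (sango, french)
)
"""

# On a key clash, bump the counter only if this source is not yet recorded.
UPSERT = """
INSERT INTO candidates (sango, french, sources) VALUES (?, ?, ?)
ON CONFLICT (sango, french) DO UPDATE SET
    cross_source_count = cross_source_count + 1,
    sources = sources || ',' || excluded.sources
WHERE instr(',' || sources || ',', ',' || excluded.sources || ',') = 0
"""


def stage(conn, sango, french, source):
    conn.execute(UPSERT, (sango, french, source))


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
for source in ("nllb_fr", "wiktionary", "nllb_fr", "nllb_en"):
    stage(conn, "ngou", "eau", source)

rows = conn.execute("SELECT sango, sources, cross_source_count FROM candidates").fetchall()
print(rows)
```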

Fuzzy deduplication against existing vocabulary uses rapidfuzz with NFKD normalization (handling diacritics correctly) and a similarity threshold of 88. A candidate token that is at least 88% similar to an already-verified word is assumed to be a morphological variant, misspelling, or corpus noise, and is skipped. This prevents the pipeline from importing ngou, ngoo, ngo' as separate entries when only ngou is the verified canonical form.
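The idea can be illustrated with the standard library. Note the swap: this sketch uses difflib.SequenceMatcher as a stand-in for rapidfuzz so that it runs without extra dependencies, and the scorer the pipeline actually uses is not stated, so scores for a given pair may differ from production.

```python
import unicodedata
from difflib import SequenceMatcher

THRESHOLD = 88  # similarity threshold stated in the article


def normalize(word):
    """Lowercase and NFKD-normalize so composed and decomposed accents match."""
    return unicodedata.normalize("NFKD", word.lower())


def similarity(a, b):
    """0-100 similarity (difflib stand-in for a rapidfuzz scorer)."""
    return 100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_duplicate(candidate, verified_words, threshold=THRESHOLD):
    """True if the candidate is close enough to any verified word."""
    return any(similarity(candidate, w) >= threshold for w in verified_words)


# A spelling variant of a verified word is flagged...
print(is_duplicate("tambura", {"tamboura"}))
# ...as is the same word typed with a combining accent instead of a precomposed one.
print(is_duplicate("gbi\u0302", {"gb\u00ee"}))
# Short words that differ by one letter can fall below the threshold.
print(is_duplicate("ngu", {"ngou"}))
```

With short tokens a single-letter difference moves the score a lot, so the threshold behaves differently for three-letter words than for eight-letter ones. That is worth checking against whichever scorer is used in practice.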

Confidence scoring combines three signals:

confidence = base + (corpus_freq / 500) × 0.20 + (laser_score - 1.0) × 0.10

Capped at 0.72 for corpus-derived words. Native-speaker input: 0.95. Swadesh core words added via gap analysis: 0.80.
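The formula translates directly into a small function. The value of base is not given above, so the 0.50 used in the example call is purely a placeholder; the cap and the fixed tier values are the ones stated.

```python
CORPUS_CAP = 0.72           # cap for corpus-derived words
NATIVE_SPEAKER_CONF = 0.95  # fixed score for native-speaker input
SWADESH_CONF = 0.80         # fixed score for Swadesh gap-analysis words


def corpus_confidence(base, corpus_freq, laser_score):
    """Confidence for a corpus-derived candidate, per the formula above."""
    raw = base + (corpus_freq / 500) * 0.20 + (laser_score - 1.0) * 0.10
    return min(raw, CORPUS_CAP)


# base=0.50 is a hypothetical value chosen only for illustration.
print(round(corpus_confidence(0.50, corpus_freq=250, laser_score=1.10), 2))
print(corpus_confidence(0.50, corpus_freq=5000, laser_score=1.20))
```

Because the frequency term is unbounded, any very frequent token reaches the 0.72 cap regardless of its LASER score, so the cap does most of the work at the top of the range.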

Stage 3 — LLM enrichment

Pending candidates above confidence 0.50 are sent to a Foundation LLM in batches of 10. The enrichment prompt asks the model to:

  1. Confirm whether the token is a genuine Sango word (not OCR garbage, French loanword, or proper name fragment)
  2. Provide a precise French translation (1–5 words, replacing the rough corpus co-occurrence hint)
  3. Provide an English translation
  4. Assign a semantic category from a controlled vocabulary: greeting, number, family, body, food, animal, color, nature, place, time, verb_motion, verb_action, verb_state, adjective, adverb, pronoun, particle, health, profession, object, emotion, phrase
  5. Assign a difficulty tier: beginner, intermediate, advanced
  6. Generate a short, natural example sentence in Sango (5–10 words)
  7. Translate that example sentence to French

The model returns a JSON array — one object per input word. If the model marks "valid": false, the row is set to status='rejected' in the staging database. The rejection rate was approximately 15% in this run, primarily catching corpus noise and French proper names.
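Handling the response might look like the sketch below. Only the "valid" flag and the 'rejected' status come from the description above; the other JSON field names and the 'enriched' and 'needs_retry' statuses are assumptions made for illustration.

```python
import json

CATEGORIES = {
    "greeting", "number", "family", "body", "food", "animal", "color", "nature",
    "place", "time", "verb_motion", "verb_action", "verb_state", "adjective",
    "adverb", "pronoun", "particle", "health", "profession", "object",
    "emotion", "phrase",
}
TIERS = {"beginner", "intermediate", "advanced"}
FIELDS = ("french", "english", "category", "difficulty")


def apply_enrichment(batch, response_text):
    """Map a model response onto staged rows.

    batch: list of dicts with at least an "id" key, in the order sent.
    Returns a list of (row_id, status, fields) tuples.
    """
    items = json.loads(response_text)
    if not isinstance(items, list) or len(items) != len(batch):
        raise ValueError("expected one JSON object per input word")
    results = []
    for row, item in zip(batch, items):
        if not item.get("valid", False):
            results.append((row["id"], "rejected", {}))
        elif item.get("category") not in CATEGORIES or item.get("difficulty") not in TIERS:
            # Out-of-vocabulary labels are not trusted; queue the row again.
            results.append((row["id"], "needs_retry", {}))
        else:
            fields = {k: item[k] for k in FIELDS if k in item}
            results.append((row["id"], "enriched", fields))
    return results


response = json.dumps([
    {"valid": True, "french": "eau", "english": "water",
     "category": "nature", "difficulty": "beginner"},
    {"valid": False},
])
results = apply_enrichment([{"id": 1}, {"id": 2}], response)
print(results)
```

Validating the category and difficulty against the controlled lists keeps a model that invents a new label from writing it into the staging database.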

A critical implementation lesson: the staging database uses UNIQUE(sango, french) as the uniqueness key, but enrichment may change the french column from a rough corpus hint ("les, est, que") to a refined translation ("avec"). On re-runs, the original row with the raw hint is re-inserted by the miner, then enrichment tries to set french = "avec" on it — but (sango="nga", french="avec") already exists from a prior run. This triggers a constraint violation. Fix: update_enrichment() takes the integer primary key (id) as its WHERE predicate, not (sango, french). Before updating french, it checks for a conflicting row on a different id; if one exists, the current row is marked status='merged' and skipped. 305 rows were merged in this run, reflecting clean deduplication rather than data loss.
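The fix can be sketched against an in-memory SQLite table. The function name update_enrichment() and the 'merged' status come from the description above; the table layout and the 'enriched' status are assumptions.

```python
import sqlite3


def update_enrichment(conn, row_id, sango, french):
    """Set the refined translation on one row, keyed by its integer id.

    If another row already holds (sango, french), mark this row as merged
    instead of violating the UNIQUE(sango, french) constraint.
    """
    clash = conn.execute(
        "SELECT id FROM candidates WHERE sango = ? AND french = ? AND id != ?",
        (sango, french, row_id),
    ).fetchone()
    if clash is not None:
        conn.execute("UPDATE candidates SET status = 'merged' WHERE id = ?", (row_id,))
        return "merged"
    conn.execute(
        "UPDATE candidates SET french = ?, status = 'enriched' WHERE id = ?",
        (french, row_id),
    )
    return "enriched"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE candidates (id INTEGER PRIMARY KEY, sango TEXT, french TEXT,"
    " status TEXT DEFAULT 'pending', UNIQUE (sango, french))"
)
# Row 1 was enriched in a prior run; row 2 is the raw hint re-inserted by the miner.
conn.execute("INSERT INTO candidates VALUES (1, 'nga', 'avec', 'enriched')")
conn.execute("INSERT INTO candidates VALUES (2, 'nga', 'les, est, que', 'pending')")
outcome = update_enrichment(conn, 2, "nga", "avec")
print(outcome)
```

Checking for the clash before the UPDATE, rather than catching the IntegrityError afterwards, keeps the merge decision explicit and leaves the raw hint on the merged row for later inspection.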

Stage 4 — Import to DynamoDB

Enriched candidates are exported as JSONL, each entry receiving a UUID word_id. They land in the production vocabulary table with status='pending', invisible to end users until a human reviewer promotes them to status='verified' via the admin dashboard.

The conditional put uses attribute_not_exists(word_id) to prevent duplicate imports on pipeline re-runs. Since word_id is a fresh UUID on each export, this protects against the same export file being imported twice, not against the same word being exported across different runs — for that, the staging database's is_duplicate() check is the primary guard.
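A sketch of the export and import steps. The table argument is assumed to expose put_item(**kwargs) the way a boto3 DynamoDB Table resource does; the helper names and the RecordingTable stand-in are hypothetical and exist only so the example runs without AWS access.

```python
import uuid

CONDITION = "attribute_not_exists(word_id)"


def export_entry(entry):
    """Attach a fresh UUID and pending status at export time."""
    return {**entry, "word_id": str(uuid.uuid4()), "status": "pending"}


def put_request(exported):
    """Keyword arguments for a conditional put of one exported entry."""
    return {"Item": exported, "ConditionExpression": CONDITION}


def import_entries(table, exported_entries):
    """Write each exported entry; a real table raises if word_id already exists."""
    for exported in exported_entries:
        table.put_item(**put_request(exported))
    return len(exported_entries)


class RecordingTable:
    """Stand-in for a real table: records calls instead of writing anywhere."""

    def __init__(self):
        self.calls = []

    def put_item(self, **kwargs):
        self.calls.append(kwargs)


table = RecordingTable()
batch = [export_entry({"sango": "ngu", "french": "année"})]
imported = import_entries(table, batch)
print(imported, table.calls[0]["ConditionExpression"])
```

Because the UUID is assigned in export_entry() and not at import time, importing the same export file twice presents the same word_id twice, which is the case the condition guards against.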


4. Results from the 2026-05-22 run

| Metric | Value |
|---|---|
| NLLB French corpus rows processed | 588,039 |
| Sentence pairs kept (LASER ≥ 1.05) | 296,563 (50.4%) |
| Unique Sango tokens found | 15,823 |
| Candidates staged (after freq + dedup filter) | 6,275 |
| FR Wiktionary pages found | 813 |
| Novel Wiktionary candidates | 395 |
| Enrichment batches run | 3 (50 + 200 + 200 words) |
| Total enriched in staging DB | 1,213 |
| Words exported (confidence ≥ 0.65) | 986 |
| Words imported to DynamoDB | 986 (0 errors) |
| DynamoDB total (verified + pending) | 1,992 |
| Swadesh core words now covered | 37/96 (+4 from native speaker input) |

The 37 Swadesh words now covered include the 4 native-speaker corrections from this session. Before admin review, the verified count remains 583. After review, we expect most of the 986 pending entries to be promoted. The high-confidence tier (conf ≥ 0.78, 26 words) contains exclusively core Sango vocabulary, so we expect it to clear review quickly.


5. What the Swadesh analysis revealed

The Swadesh gap analysis is worth examining in detail because it shows both what the pipeline catches and where human judgment is irreplaceable.

False positives in the existing vocabulary — words that appeared to cover a Swadesh entry but were wrong:

  • toto appeared as a candidate for "walk" and "warm/hot" because it co-occurs with motion and temperature vocabulary in corpus contexts. In fact, toto = cry. The corpus frequency signal cannot detect semantic mismatch.
  • ngou appeared as a candidate for "year" because Sango corpus texts sometimes use ngou in time-related contexts. ngou = water. ngu (without the o) = year. The two spellings differ by a single letter and the meanings are completely different, yet the difference is invisible to a frequency-based miner.
  • so appeared as "burn" because it co-occurs in contexts describing destruction. so = this/that (demonstrative pronoun) or body. Not burn.

All three were caught because a native speaker reviewed the Swadesh output. Without native speaker verification, the pipeline would have promoted the wrong word for four core concepts (walk, hot, year, burn). The Swadesh list functions as a structured test case: a forcing function for quality review of the most critical vocabulary.

Genuine gaps that the pipeline filled:

  • tamboura (walk), gbî (burn): not in the NLLB corpus at sufficient frequency to pass the miner's threshold, but present in native-speaker knowledge
  • ngu (year), wa (hot): present in the corpus but masked by look-alike spellings, namely ngou (water) and the identically spelled wa (who)

This suggests a two-stage approach for Swadesh gaps: (1) run the automated pipeline to identify what corpus evidence exists, (2) present all Swadesh entries with their highest-confidence corpus candidate to a native speaker for validation or correction.
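Step (2) amounts to building a review sheet with one row per Swadesh concept. A minimal sketch, assuming candidates have already been linked to concepts; the field names and the confidence values in the toy data are invented for illustration.

```python
def review_sheet(swadesh_concepts, candidates):
    """Pair every Swadesh concept with its best corpus candidate, if any.

    candidates: list of dicts with "concept", "sango", "confidence".
    Concepts without corpus evidence still get a row for the reviewer.
    """
    best = {}
    for cand in candidates:
        current = best.get(cand["concept"])
        if current is None or cand["confidence"] > current["confidence"]:
            best[cand["concept"]] = cand
    return [
        {
            "concept": concept,
            "candidate": best[concept]["sango"] if concept in best else None,
            "confidence": best[concept]["confidence"] if concept in best else None,
            "native_speaker_verdict": None,  # filled in during review
        }
        for concept in swadesh_concepts
    ]


# Toy data: the corpus favours the wrong word for "year" and has nothing for "walk".
sheet = review_sheet(
    ["year", "walk"],
    [
        {"concept": "year", "sango": "ngou", "confidence": 0.61},
        {"concept": "year", "sango": "ngu", "confidence": 0.40},
    ],
)
for row in sheet:
    print(row)
```

Listing every concept, including those with no candidate, is what lets the reviewer both correct a wrong top candidate (ngou for "year") and supply a missing word (tamboura for "walk").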


6. The transferable recipe

The pipeline is language-agnostic at the architecture level. Any language with:

  • An NLLB corpus entry (NLLB covers 200 languages)
  • A Wiktionary presence (even minimal)
  • At least one native speaker willing to contribute

...can use this exact stack. The language-specific components are:

  • is_valid_sango_token() — adjust the character set for the target language's script
  • LASER_MIN threshold — may need calibration per corpus quality
  • The Swadesh list translations — must be done by a native speaker
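One way to isolate these language-specific pieces is a small configuration object. This is a hypothetical layout, not the project's code: the class, its field names, and the derived token pattern are all assumptions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageConfig:
    """Per-language settings; everything else in the pipeline stays shared."""

    code: str                # NLLB language code
    letters: str             # characters allowed inside a token
    vowels: str              # characters counted as vowels by the validator
    laser_min: float = 1.05  # may need calibration per corpus

    def token_pattern(self):
        return re.compile(f"[{re.escape(self.letters)}]+", re.IGNORECASE)


SANGO = LanguageConfig(
    code="sag_Latn",
    letters="abcdefghijklmnopqrstuvwxyzâêëîöôûü",
    vowels="aeiouâêëîöôûü",
)

print(SANGO.token_pattern().findall("gbî wa 42"))
```

Adding a language then means writing one more LanguageConfig instance and translating the Swadesh checklist, with the mining, staging, and enrichment code left untouched.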

The enrichment prompt is language-agnostic: it asks a frontier LLM to validate and translate. Model capability varies across NLLB-covered languages, which is one reason the native-speaker review step stays in the loop.

Estimated setup time for a new language with an NLLB corpus: 3–5 days, most of which is the Swadesh list translation and initial native-speaker review of the first enriched batch.


7. What's next

Immediate (next 2 weeks):

  • Admin review of 986 pending words — each reviewed against Sango dictionaries and native-speaker judgment before promoting to verified
  • Full scan of the 2.2M English-Sango corpus (this run sampled 200K rows; the full corpus likely yields 2–3× more novel candidates)
  • Complete Swadesh gap analysis: 59 of the 96 tested entries still have uncertain translations flagged as needing native-speaker verification

Also shipped this week:

MEYNG/nllb-sango-finetuned-600m is live. BENCH-001 full results (N=200 sentence pairs, quality-filtered NLLB held-out set, LASER≥1.0):

| Direction | Model | BLEU | chrF |
|---|---|---|---|
| FR → Sango | Baseline NLLB-200-600M | 13.79 | 32.85 |
| FR → Sango | MEYNG fine-tuned | 19.49 | 36.00 |
| FR → Sango | Foundation LLM reference | 2.92 | 26.45 |
| Sango → FR | Baseline NLLB-200-600M | 9.53 | 30.88 |
| Sango → FR | MEYNG fine-tuned | 18.63 | 35.78 |

Deltas over baseline: +5.70 BLEU / +3.15 chrF (FR→Sango), +9.10 BLEU / +4.90 chrF (Sango→FR). The Foundation LLM reference shows that zero-shot prompting of a general-purpose model substantially underperforms a domain-specific fine-tune on this low-resource pair — vocabulary-augmented prompting (the approach described in the companion post) is the production path, not a replacement for fine-tuning. A 3-way benchmark against Google Translate is in progress.
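The deltas follow directly from the BENCH-001 scores; a short check, with the dictionary layout chosen only for this example:

```python
# (BLEU, chrF) per direction and model, copied from the BENCH-001 results.
SCORES = {
    ("fr-sango", "baseline"): (13.79, 32.85),
    ("fr-sango", "fine-tuned"): (19.49, 36.00),
    ("sango-fr", "baseline"): (9.53, 30.88),
    ("sango-fr", "fine-tuned"): (18.63, 35.78),
}


def delta(direction):
    """Fine-tuned minus baseline, rounded to two decimals."""
    base_bleu, base_chrf = SCORES[(direction, "baseline")]
    tuned_bleu, tuned_chrf = SCORES[(direction, "fine-tuned")]
    return round(tuned_bleu - base_bleu, 2), round(tuned_chrf - base_chrf, 2)


for direction in ("fr-sango", "sango-fr"):
    bleu, chrf = delta(direction)
    print(f"{direction}: +{bleu:.2f} BLEU / +{chrf:.2f} chrF")
```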

Medium term (Q3 2026):

  • 3-way benchmark: fine-tuned NLLB vs Google Translate vs vocabulary-augmented system — results to be published
  • Native audio recordings for the top-300 vocabulary entries (Phase 1: 12 words recorded, 288 remaining)

Language expansion (2027+):

  • Ewondo (Cameroon, ~2M speakers) — same pipeline, Q1 2027
  • Lingala (DRC/Congo, ~25M speakers) — Q3 2027

Every new language reuses the pipeline architecture unchanged. The per-language work is the token validator, the LASER threshold calibration, and the native-speaker layer.


Try it / contribute

  • Live platform: sangoai.sbs — learn Sango, translate, chat
  • WhatsApp bot (no account needed): +237 658 763 678 — send APPRENDRE to start
  • HuggingFace dataset: MEYNG/sango-vocabulary — the verified vocabulary dataset, CC-licensed
  • npm package: npm install @meyng/sango-nlp — tokenizer, language detection, Sango-aware stemmer

Native speakers: if you speak Sango and want to contribute vocabulary corrections or audio recordings, open an issue on the dataset repository or contact us at contact@meyng.com. Every correction directly improves translation quality for 5 million speakers.

Researchers: if you use the MEYNG/sango-vocabulary dataset, please cite:

WENEZOUI, M. (2026). MEYNG/sango-vocabulary: first verified Sango vocabulary dataset for NLP. HuggingFace Datasets. https://huggingface.co/datasets/MEYNG/sango-vocabulary


Michel WENEZOUI is the founder of MEYNG, a Central African AI infrastructure company building language technology for African languages starting with Sango. MEYNG operates SangoAI, eNdara (SMS-based education), and Obêtrack (food waste reduction).
