MICWEN PRO

cesear64

AI & ML interests

None yet

Recent Activity

replied to their post about 1 hour ago

published an article about 2 hours ago

Scaling Zero-Resource Vocabulary: A Data Pipeline for Sango

updated a model 1 day ago

MEYNG/nllb-sango-finetuned-600m

replied to their post about 1 hour ago

Exactly right on the bootstrapping loop — that's precisely the progression we're running.

One clarification on the mechanism: the model has seen some Sango during pretraining (it appears in Common Crawl), but not enough to produce coherent translations cold. The vocabulary injection doesn't teach the language from scratch; it gives the model enough anchoring signal to activate what it weakly learned. The grammar rules and orthography notes handle the parts pretraining didn't cover reliably (tonal distinctions, diacritics, Sango-specific syntax).
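To make the injection step concrete, here is a minimal sketch of how a vocabulary-augmented prompt could be assembled. It is an illustration under assumptions, not the code from the blog post: the row fields (`french`, `sango`), the helper names `select_entries` and `build_prompt`, the prompt wording, and the note text are all hypothetical, and the Sango side of each row is a placeholder rather than a real translation.

```python
import re

# Hypothetical lexicon rows. The published dataset's schema is not shown in this
# thread, and the Sango values below are placeholders, not real translations.
LEXICON = [
    {"french": "eau", "sango": "<sango term for 'eau'>"},
    {"french": "maison", "sango": "<sango term for 'maison'>"},
    {"french": "merci", "sango": "<sango term for 'merci'>"},
]

# Hypothetical orthography note; the real grammar and orthography notes are not quoted here.
NOTES = ["Preserve diacritics exactly as written in the glossary."]


def select_entries(source, lexicon):
    """Return lexicon rows whose French headword occurs in the source sentence."""
    words = set(re.findall(r"\w+", source.lower()))
    return [row for row in lexicon if row["french"].lower() in words]


def build_prompt(source, lexicon, notes):
    """Assemble a translation prompt containing only the relevant glossary rows."""
    entries = select_entries(source, lexicon)
    glossary = "\n".join(f"- {r['french']} = {r['sango']}" for r in entries)
    rules = "\n".join(f"- {n}" for n in notes)
    return (
        "Translate the French sentence into Sango.\n"
        f"Glossary (verified by native speakers):\n{glossary}\n"
        f"Notes:\n{rules}\n"
        f"French: {source}\n"
        "Sango:"
    )


prompt = build_prompt("L'eau de la maison est froide.", LEXICON, NOTES)
```

Selecting only the rows that match the source sentence keeps the prompt short; with a lexicon of a few hundred entries, passing the whole glossary every time would also be a reasonable choice.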

And yes, the loop you're describing is live: the vocabulary-augmented outputs → native-speaker verification → parallel corpus → fine-tuned NMT model. We just published BENCH-001 results on the fine-tune: +5.70 BLEU over baseline on French→Sango, +9.10 on Sango→French. The vocabulary-augmented prompting approach (BLEU 2.92 on the same task, zero fine-tuning) is the floor; the fine-tune is what you get once the dataset is big enough.
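The verification-to-corpus step of that loop can be sketched roughly as follows. This is an assumption-laden illustration rather than the actual pipeline: the `Candidate` record, its fields, and the `fr`/`sg` keys in the output are hypothetical names chosen for the example.

```python
import json
from dataclasses import dataclass


@dataclass
class Candidate:
    french: str
    sango: str            # model output from the vocabulary-augmented prompt
    verified: bool        # True once a native speaker has approved the pair
    corrected: str = ""   # optional reviewer correction; replaces the model output


def to_parallel_corpus(candidates):
    """Keep only verified pairs, prefer reviewer corrections, and drop exact duplicates."""
    pairs = []
    seen = set()
    for c in candidates:
        if not c.verified:
            continue
        target = c.corrected or c.sango
        key = (c.french.strip(), target.strip())
        if key in seen:
            continue
        seen.add(key)
        pairs.append({"fr": key[0], "sg": key[1]})
    return pairs


def to_jsonl(pairs):
    """Serialize pairs one JSON object per line, keeping diacritics unescaped."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)
```

Only pairs that pass native-speaker review reach the training set, so errors in the prompted outputs are not fed back into the fine-tune.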

The data pipeline post documenting that second step just went up here: https://huggingface.co/blog/MEYNG/sango-vocabulary-pipeline

The interesting open question is where the ceiling is for a 600M-parameter model on a language with ~5M speakers and sparse digitized text. We're nowhere near it yet.

posted an update 8 days ago

Just published: how we built production Sango (Central African Republic) translation without fine-tuning, parallel corpus, or training compute.

The method — vocabulary-augmented prompting with a 581-entry native-speaker-verified lexicon — generalizes to any of the ~2,000 African languages at the same data-poverty level. Recipe, dataset, and code template all included.

📄 Blog: https://huggingface.co/blog/MEYNG/sangoai
📦 Dataset: https://huggingface.co/datasets/MEYNG/sango-vocabulary

Would especially value feedback from anyone working on other low-resource African languages — Ewondo, Lingala, Wolof next on our roadmap.

2 replies
