MEYNG/nllb-sango-finetuned-600m

First open fine-tuned translation model for Sango — the national language of the Central African Republic (5M+ speakers).

Fine-tuned from facebook/nllb-200-distilled-600M on 500,000+ Sango-French sentence pairs.

What is Sango?

Sango (ISO 639-1: sg, ISO 639-3: sag) is the official national language of the Central African Republic, spoken as a primary lingua franca by over 5 million people. It is a creole based on Ngbandi, a Ubangian language, and is written in a Latin orthography with tonal diacritics (â, ê, ë, î, ö, ô, û, ü).

Sango is not supported by Google Translate, DeepL, or Microsoft Translator.

This model is the first openly released neural machine translation model fine-tuned specifically for Sango.

Model description

| Property | Value |
| --- | --- |
| Base model | facebook/nllb-200-distilled-600M |
| Training data | 500,000+ Sango-French sentence pairs |
| Data source | NLLB parallel corpus (quality-filtered) |
| Fine-tune type | Full fine-tune (all parameters, not LoRA) |
| Epochs | 3 |
| Direction | Bidirectional: fra_Latn ↔ sag_Latn |
| Precision | bfloat16 (A100 native) |
| Batch size | 12 (effective batch 132 with grad accum 11) |
| Hardware | A100-SXM4-40GB (42.4 GB VRAM) |
| Training time | 7.25 hours |

Benchmarks

Note on BLEU for Sango: BLEU penalises fluent paraphrases. chrF is more informative for Sango's morphological structure. Both are reported for completeness.
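
To make the note above concrete, here is a minimal pure-Python sketch of a character n-gram F-score in the spirit of chrF. It is a simplified, sentence-level illustration, not the scorer used for the numbers in this card, and it will not reproduce sacreBLEU's values exactly. The function names `char_ngrams` and `simple_chrf` are made up for this example, and dropping spaces before counting n-grams is a simplifying assumption.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Average n-gram precision and recall over n = 1..max_n, then combine as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # sentence too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


# The hypothesis only drops punctuation. Whitespace-split tokens such as
# "mingi," and "mingi" no longer match exactly, but most character n-grams do.
reference = "Singila mingi, balao."
hypothesis = "Singila mingi balao"
print(f"simple chrF: {simple_chrf(hypothesis, reference):.1f}")
```

Because the score is built from character n-grams rather than whole tokens, small surface differences (punctuation, a diacritic, an affix) reduce it gradually instead of zeroing out a word match, which is the property the note relies on.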

BENCH-001 — In-distribution (NLLB val.jsonl, N=200)

Evaluated on a held-out sample of the quality-filtered NLLB Sango-French corpus (N=200 sentences, LASER score ≥ 1.0, target language ID confidence ≥ 0.9). Same distribution as training data — measures specialised quality on the task the model was trained for.

French → Sango

| Model | BLEU | chrF |
| --- | --- | --- |
| Vanilla NLLB-200-distilled-600M (baseline) | 13.79 | 32.85 |
| MEYNG/nllb-sango-finetuned-600m (ours) | 19.49 | 36.00 |
| Foundation LLM (cloud inference, reference) | 2.92 | 26.45 |
| Delta (ours vs baseline) | +5.70 | +3.15 |

Sango → French

| Model | BLEU | chrF |
| --- | --- | --- |
| Vanilla NLLB-200-distilled-600M (baseline) | 9.53 | 30.88 |
| MEYNG/nllb-sango-finetuned-600m (ours) | 18.63 | 35.78 |
| Delta (ours vs baseline) | +9.10 | +4.90 |

Full results: scripts/training/results/bench001_threeway.json


BENCH-002 — Out-of-domain (FLORES-200 devtest, N=200) — Added 2026-05-22

Evaluated on FLORES-200 devtest (Wikipedia-style sentences, bri25yu Parquet mirror). Out-of-distribution relative to NLLB training corpus — measures generalisation to formal/encyclopaedic text.

French → Sango

| Model | BLEU | chrF |
| --- | --- | --- |
| Vanilla NLLB-200-distilled-600M (baseline) | 6.11 | 30.23 |
| MEYNG/nllb-sango-finetuned-600m (ours) | 3.39 | 22.96 |
| Delta (ours vs baseline) | −2.72 | −7.27 |

Sango → French

| Model | BLEU | chrF |
| --- | --- | --- |
| Vanilla NLLB-200-distilled-600M (baseline) | 6.62 | 26.57 |
| MEYNG/nllb-sango-finetuned-600m (ours) | 6.68 | 25.30 |
| Delta (ours vs baseline) | +0.06 | −1.27 |

Finding: The model overfit to the NLLB corpus style. On Wikipedia-style formal text it regresses below the baseline for French → Sango (−2.72 BLEU, −7.27 chrF) and is roughly level with it for Sango → French (+0.06 BLEU, −1.27 chrF), and it hallucinates on long, complex sentences. Use this model for text stylistically similar to its training domain (news / conversational parallel text). For general-purpose Sango translation, the vanilla NLLB-200 baseline is safer on out-of-domain inputs.

This finding drives the next priority: a diverse multi-domain Sango corpus for v2 training.
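
As a quick cross-check, the delta rows of both benchmarks can be recomputed from the reported scores. The sketch below copies the BLEU and chrF values from this card; the dictionary layout and the `delta` helper are only illustrative.

```python
# (BLEU, chrF) pairs copied from the benchmark results reported in this card.
scores = {
    ("BENCH-001", "fr->sg"): {"baseline": (13.79, 32.85), "ours": (19.49, 36.00)},
    ("BENCH-001", "sg->fr"): {"baseline": (9.53, 30.88), "ours": (18.63, 35.78)},
    ("BENCH-002", "fr->sg"): {"baseline": (6.11, 30.23), "ours": (3.39, 22.96)},
    ("BENCH-002", "sg->fr"): {"baseline": (6.62, 26.57), "ours": (6.68, 25.30)},
}


def delta(entry):
    """Return (BLEU delta, chrF delta) of the fine-tuned model over the baseline."""
    (base_bleu, base_chrf), (our_bleu, our_chrf) = entry["baseline"], entry["ours"]
    return round(our_bleu - base_bleu, 2), round(our_chrf - base_chrf, 2)


for (bench, direction), entry in scores.items():
    bleu_delta, chrf_delta = delta(entry)
    print(f"{bench} {direction}: BLEU {bleu_delta:+.2f}, chrF {chrf_delta:+.2f}")
```

Both in-distribution deltas are positive on both metrics, while the out-of-domain chrF deltas are negative in both directions, which is the pattern the finding describes.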

Full results: scripts/training/results/bench002_flores200.json

Usage

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model_id = "MEYNG/nllb-sango-finetuned-600m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)  # same precision the model was trained in

def translate(text, src_lang, tgt_lang, max_length=128):
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=max_length)
    forced_bos_token_id = tokenizer.convert_tokens_to_ids(tgt_lang)
    with torch.no_grad():
        generated = model.generate(
            **inputs,
            forced_bos_token_id=forced_bos_token_id,
            max_length=max_length,
            num_beams=4,
        )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

# French → Sango
print(translate("Bonjour, comment allez-vous?", "fra_Latn", "sag_Latn"))

# Sango → French
print(translate("Singila mingi, balao.", "sag_Latn", "fra_Latn"))

Training data

Training data was sourced from the NLLB-200 parallel corpus (publicly available at storage.googleapis.com/allennlp-data-bucket/nllb/) and quality-filtered with two thresholds:

  • LASER score ≥ 1.0 (semantic alignment quality)
  • Target language identification confidence ≥ 0.9 (ensures the target is actually Sango)

Kept pairs: 500,000+ (out of ~600,000 raw pairs in the NLLB Sango-French file).
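
The two thresholds amount to a simple conjunctive filter. The sketch below is a minimal illustration, assuming each mined pair already carries a LASER score and a language-ID confidence. The `ScoredPair` record, its field names, and the sample scores are hypothetical and do not reflect the actual corpus file format or the training script.

```python
from dataclasses import dataclass


@dataclass
class ScoredPair:
    french: str
    sango: str
    laser_score: float     # alignment score attached to the mined pair (assumed field)
    lid_confidence: float  # confidence that the target side is Sango (assumed field)


MIN_LASER = 1.0  # threshold stated in this card
MIN_LID = 0.9    # threshold stated in this card


def keep_pair(pair: ScoredPair) -> bool:
    """Keep a pair only if it passes both quality thresholds."""
    return pair.laser_score >= MIN_LASER and pair.lid_confidence >= MIN_LID


# Placeholder sentences and made-up scores, for illustration only.
candidates = [
    ScoredPair("<French sentence A>", "<Sango sentence A>", 1.12, 0.97),
    ScoredPair("<French sentence B>", "<Sango sentence B>", 0.94, 0.99),  # weak alignment
    ScoredPair("<French sentence C>", "<Sango sentence C>", 1.30, 0.55),  # target may not be Sango
]
kept = [pair for pair in candidates if keep_pair(pair)]
print(f"kept {len(kept)} of {len(candidates)} pairs")
```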

Training was bidirectional: each sentence pair was included twice (FR→SAG and SAG→FR), giving ~1.1M training examples.
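
A minimal sketch of that expansion, assuming the filtered data is available as (French, Sango) tuples. The dictionary keys are illustrative and may differ from the record format used in the actual training script.

```python
def make_bidirectional(pairs):
    """Turn (french, sango) pairs into one training example per direction."""
    examples = []
    for french, sango in pairs:
        examples.append(
            {"src_lang": "fra_Latn", "tgt_lang": "sag_Latn", "source": french, "target": sango}
        )
        examples.append(
            {"src_lang": "sag_Latn", "tgt_lang": "fra_Latn", "source": sango, "target": french}
        )
    return examples


# Placeholder sentences; real pairs come from the filtered corpus.
pairs = [
    ("<French sentence 1>", "<Sango sentence 1>"),
    ("<French sentence 2>", "<Sango sentence 2>"),
]
examples = make_bidirectional(pairs)
print(len(examples))  # 4: two directions per input pair
```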

Training configuration

Seq2SeqTrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=12,
    gradient_accumulation_steps=11,    # effective batch = 132
    learning_rate=5e-5,
    warmup_steps=500,
    bf16=True,                         # A100 native bfloat16
    predict_with_generate=False,       # bidirectional val set
    metric_for_best_model="eval_loss",
    hub_strategy="checkpoint",         # push to HF every 500 steps
)

Full training script: scripts/training/train_on_azure.py
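
As a rough sanity check on the configuration, the sketch below derives the effective batch size and an approximate optimizer step count from the figures in this card. It assumes a single GPU and treats "~1.1M" as exactly 1,100,000 examples, so the step counts are estimates; the Trainer's own count can differ slightly depending on how it handles the last partial batch.

```python
import math

train_examples = 1_100_000  # "~1.1M" from this card, so approximate
per_device_batch = 12
grad_accum = 11
epochs = 3
warmup_steps = 500
training_hours = 7.25

# One optimizer step consumes per_device_batch * grad_accum examples (single GPU assumed).
effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(f"effective batch: {effective_batch}")
print(f"steps per epoch: ~{steps_per_epoch}, total: ~{total_steps}")
print(f"warmup share of training: {warmup_steps / total_steps:.1%}")
print(f"wall-clock per optimizer step: ~{training_hours * 3600 / total_steps:.2f} s")
```

Under these assumptions the run comes to roughly 25,000 optimizer steps, with warmup covering about 2% of them. The per-step time is an upper bound on pure compute, since the reported training time may also include evaluation and checkpoint uploads.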

Known limitations

  • Domain specialisation (primary limitation): The model was trained exclusively on NLLB corpus text (web-crawled news and conversational parallel sentences). BENCH-002 (FLORES-200 devtest, 2026-05-22) shows it regresses below the vanilla baseline on Wikipedia-style formal text (FR→SAG: −2.72 BLEU). It also exhibits hallucination and repetition on long, complex out-of-domain sentences. Best suited for: news, conversational, and general informal French-Sango translation. Not recommended for: formal, legal, medical, or encyclopaedic text.
  • Coverage: The NLLB Sango-French corpus is the only source of training data. Translation quality reflects the distribution of topics in that corpus.
  • Sango orthography: Sango has some inconsistency in diacritic use across documents. The model inherits this from the training data.
  • Low-resource baseline: Sango is a very low-resource language. Even after fine-tuning, expect lower absolute BLEU than high-resource pairs. chrF is a more reliable quality indicator for Sango.
  • SAG→EN: This model was not trained on Sango-English pairs. For Sango→English, use a two-step pipeline: sag_Latn → fra_Latn → eng_Latn.
  • Evaluation: BENCH-001 uses a held-out split of the training corpus (in-distribution). BENCH-002 uses FLORES-200 devtest (out-of-domain). See the Benchmarks section for both.
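
The two-step Sango → English pipeline mentioned above can be sketched as a small chaining helper. This is an illustration under stated assumptions: `pivot_translate` and `fake_hop` are hypothetical names, and the stand-in translator exists only so the example runs without downloading a model.

```python
from typing import Callable

# Same call shape as the translate(text, src_lang, tgt_lang) helper in the Usage section.
TranslateFn = Callable[[str, str, str], str]


def pivot_translate(
    text: str,
    first_hop: TranslateFn,
    second_hop: TranslateFn,
    src: str = "sag_Latn",
    pivot: str = "fra_Latn",
    tgt: str = "eng_Latn",
) -> str:
    """Translate src -> pivot -> tgt by chaining two translation functions."""
    intermediate = first_hop(text, src, pivot)
    return second_hop(intermediate, pivot, tgt)


def fake_hop(text: str, src_lang: str, tgt_lang: str) -> str:
    """Stand-in translator that only tags the text with the requested direction."""
    return f"[{src_lang}->{tgt_lang}] {text}"


print(pivot_translate("<Sango sentence>", fake_hop, fake_hop))
```

In practice `first_hop` would be the Usage helper backed by this model. This card does not say which model should handle the French → English hop; one plausible choice, stated here as an assumption, is the base facebook/nllb-200-distilled-600M. Errors from the first hop carry into the second, so pivoted output is likely to be noisier than a single-hop translation.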

Related resources

| Resource | Link |
| --- | --- |
| Sango vocabulary dataset | MEYNG/sango-vocabulary |
| SangoAI platform | sangoai.sbs |
| Training code | meyng-hub/sangoai scripts/training/ |

Citation

@misc{wenezoui2026nllb-sango,
  author = {WENEZOUI, Michel},
  title = {MEYNG/nllb-sango-finetuned-600m: First open fine-tuned translation model for Sango},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/MEYNG/nllb-sango-finetuned-600m}
}

License

Apache 2.0. Free for commercial and research use.

About MEYNG

MEYNG builds AI infrastructure for African languages. This model is part of the first systematic effort to build open NLP resources for Sango, including:

  • The only structured Sango vocabulary dataset (769+ verified entries, CC-BY-SA)
  • An open-source Sango NLP toolkit (@meyng/sango-nlp on npm)
  • A deployed translation and learning platform at sangoai.sbs
  • A WhatsApp-based Sango learning bot (eNdara) active in CAR and the diaspora