# twi-en-nllb-v2 — Twi → English Translation

A fine-tuned version of facebook/nllb-200-distilled-600M for translating Twi (Akan, twi_Latn) into English (eng_Latn). This model improves substantially over the v1 MarianMT baseline and is designed for low-resource Ghanaian language translation.

## Model Details

| Property | Value |
|---|---|
| Base model | facebook/nllb-200-distilled-600M |
| Task | Sequence-to-sequence translation |
| Direction | twi_Latn → eng_Latn |
| Parameters | 615.1M |
| License | CC BY-NC 4.0 |

## Training Details

| Hyperparameter | Value |
|---|---|
| Epochs | 10 |
| Batch size | 4 (effective: 32 with gradient accumulation) |
| Gradient accumulation steps | 8 |
| Learning rate | 3e-5 |
| LR scheduler | Cosine |
| Warmup ratio | 0.06 |
| Weight decay | 0.01 |
| Label smoothing | 0.1 |
| Max sequence length | 128 |
| Precision | fp16 |
| Best model metric | chrF |
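The effective batch size and warmup length above follow from simple arithmetic over the table; a minimal sketch (assuming a single GPU and the 7,732-pair train split — the trainer's exact rounding may differ slightly):

```python
import math

# Values from the training table above.
per_device_batch = 4
grad_accum_steps = 8
epochs = 10
warmup_ratio = 0.06
train_pairs = 7_732  # train split size from the Dataset section

# Effective batch size = per-device batch x gradient accumulation steps.
effective_batch = per_device_batch * grad_accum_steps
print(effective_batch)  # 32

# Optimizer steps per epoch and in total.
steps_per_epoch = math.ceil(train_pairs / effective_batch)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 242 2420

# A warmup ratio of 0.06 translates to roughly this many warmup steps.
warmup_steps = round(warmup_ratio * total_steps)
print(warmup_steps)  # 145
```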

## Dataset

Training data sourced from the GhanaNLP TWI-ENGLISH Parallel Text dataset.

| Split | Pairs |
|---|---|
| Train | 7,732 |
| Validation | 967 |
| Test | 967 |
| Total | 9,666 |
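The split sizes correspond to an 80/10/10 partition of the 9,666 pairs. A sketch of how such a split could be reproduced (the seed and exact method here are illustrative assumptions, not the dataset's published recipe):

```python
import random

total = 9_666
pairs = list(range(total))  # stand-ins for the (twi, english) pairs

# Deterministic shuffle; seed 42 is an arbitrary choice for illustration.
random.Random(42).shuffle(pairs)

n_train = int(0.8 * total)       # 7,732
n_val = (total - n_train) // 2   # 967
train = pairs[:n_train]
val = pairs[n_train:n_train + n_val]
test = pairs[n_train + n_val:]
print(len(train), len(val), len(test))  # 7732 967 967
```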

## Results

Evaluated on 967 held-out test pairs:

| Metric | v2 (this model) | v1 MarianMT baseline | Δ |
|---|---|---|---|
| BLEU | 27.30 | 10.50 | +16.80 |
| chrF | 51.03 | 32.30 | +18.73 |
| TER ↓ | 61.48 | 87.64 | −26.16 |
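TER counts the edits needed to turn the hypothesis into the reference, divided by reference length. A minimal word-level sketch using plain Levenshtein distance; note that real TER (e.g. as computed by sacrebleu) also handles shift operations and tokenization, so this is illustrative only:

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

# Example pair from the Translation Examples section below.
hyp = "Children need to eat healthy foods to grow well.".split()
ref = "Children need to feed well in order to grow well.".split()
edits = edit_distance(hyp, ref)
print(edits / len(ref))  # 4 edits over 10 reference words = 0.4
```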

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "ninte/twi-en-nllb-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model.eval()

def translate(text, src_lang="twi_Latn", tgt_lang="eng_Latn"):
    tokenizer.src_lang = src_lang
    tgt_token_id = tokenizer.convert_tokens_to_ids(tgt_lang)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            forced_bos_token_id=tgt_token_id,
            num_beams=4,
            max_length=128,
            no_repeat_ngram_size=3,
            length_penalty=1.0,
            early_stopping=True,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(translate("Ɛsɛ sɛ mmofra di aduan pa na ama wɔanyin yiye."))
# → "Children need to eat healthy foods to grow well."
```

## Translation Examples

| Twi | Prediction | Reference |
|---|---|---|
| Ɛsɛ sɛ mmofra di aduan pa na ama wɔanyin yiye. | Children need to eat healthy foods to grow well. | Children need to feed well in order to grow well. |
| Mpanimfoɔ a wɔda mansini no ano... | The elders at the border of the district have teamed up with the loggers... | The district officials connive with loggers to make money out of trees. |

## Limitations

- Trained on ~9,600 sentence pairs; performance on out-of-domain text may vary.
- Max sequence length of 128 tokens — longer inputs are truncated.
- Best suited for standard/formal Twi; dialectal variation is not explicitly handled.
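One way to work around the 128-token limit is to split long passages into sentence-sized chunks and translate each chunk separately. A naive sketch (the regex splitter and word-count budget below are illustrative assumptions; Twi punctuation conventions may call for a smarter segmenter):

```python
import re

def split_into_chunks(text, max_words=60):
    """Split text on sentence-final punctuation, then greedily pack
    sentences into chunks that stay comfortably under the token limit."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sent in sentences:
        # Flush the current chunk if adding this sentence would exceed the budget.
        if current and len(" ".join(current + [sent]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Each chunk can then be passed to translate() from the Usage section.
text = "Ɛsɛ sɛ mmofra di aduan pa na ama wɔanyin yiye. Mpanimfoɔ no kasa."
print(split_into_chunks(text, max_words=8))
```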

## Citation

If you use this model, please cite the base NLLB model and the training dataset:

```bibtex
@article{nllb2022,
  title={No Language Left Behind: Scaling Human-Centered Machine Translation},
  author={NLLB Team et al.},
  journal={arXiv preprint arXiv:2207.04672},
  year={2022}
}
```

```bibtex
@dataset{ghananlp_twi_english_2023,
  author    = {GhanaNLP},
  title     = {GhanaNLP TWI-ENGLISH Parallel Text},
  year      = {2023},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/Ghana-NLP/TWI_ENGLISH_PARALLEL_TEXT}
}
```
