# twi-en-nllb-v2 — Twi → English Translation

A fine-tuned version of facebook/nllb-200-distilled-600M for translating Twi (Akan, twi_Latn) into English (eng_Latn). This model improves substantially over the v1 MarianMT baseline and is designed for low-resource Ghanaian language translation.

## Model Details

| Property | Value |
|---|---|
| Base model | facebook/nllb-200-distilled-600M |
| Task | Sequence-to-sequence translation |
| Direction | twi_Latn → eng_Latn |
| Parameters | 615.1M |
| License | CC BY-NC 4.0 |

## Training Details

| Hyperparameter | Value |
|---|---|
| Epochs | 10 |
| Batch size | 4 (effective: 32 with gradient accumulation) |
| Gradient accumulation steps | 8 |
| Learning rate | 3e-5 |
| LR scheduler | Cosine |
| Warmup ratio | 0.06 |
| Weight decay | 0.01 |
| Label smoothing | 0.1 |
| Max sequence length | 128 |
| Precision | fp16 |
| Best model metric | chrF |
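The effective batch size and warmup length above follow from simple arithmetic over the table; a minimal sketch (assuming a single GPU and the 7,732-pair train split — the trainer's exact rounding may differ slightly):

```python
import math

# Values from the training table above.
per_device_batch = 4
grad_accum_steps = 8
epochs = 10
warmup_ratio = 0.06
train_pairs = 7_732  # train split size from the Dataset section

# Effective batch size = per-device batch x gradient accumulation steps.
effective_batch = per_device_batch * grad_accum_steps
print(effective_batch)  # 32

# Optimizer steps per epoch and in total.
steps_per_epoch = math.ceil(train_pairs / effective_batch)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 242 2420

# A warmup ratio of 0.06 translates to roughly this many warmup steps.
warmup_steps = round(warmup_ratio * total_steps)
print(warmup_steps)  # 145
```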

## Dataset

Training data sourced from the GhanaNLP TWI-ENGLISH Parallel Text dataset.

| Split | Pairs |
|---|---|
| Train | 7,732 |
| Validation | 967 |
| Test | 967 |
| Total | 9,666 |
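The split sizes correspond to an 80/10/10 partition of the 9,666 pairs. A sketch of how such a split could be reproduced (the seed and exact method here are illustrative assumptions, not the dataset's published recipe):

```python
import random

total = 9_666
pairs = list(range(total))  # stand-ins for the (twi, english) pairs

# Deterministic shuffle; seed 42 is an arbitrary choice for illustration.
random.Random(42).shuffle(pairs)

n_train = int(0.8 * total)       # 7,732
n_val = (total - n_train) // 2   # 967
train = pairs[:n_train]
val = pairs[n_train:n_train + n_val]
test = pairs[n_train + n_val:]
print(len(train), len(val), len(test))  # 7732 967 967
```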

## Results

Evaluated on 967 held-out test pairs:

| Metric | v2 (this model) | v1 MarianMT baseline | Δ |
|---|---|---|---|
| BLEU | 27.30 | 10.50 | +16.80 |
| chrF | 51.03 | 32.30 | +18.73 |
| TER ↓ | 61.48 | 87.64 | −26.16 |
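TER counts the edits needed to turn the hypothesis into the reference, divided by reference length. A minimal word-level sketch using plain Levenshtein distance; note that real TER (e.g. as computed by sacrebleu) also handles shift operations and tokenization, so this is illustrative only:

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

# Example pair from the Translation Examples section below.
hyp = "Children need to eat healthy foods to grow well.".split()
ref = "Children need to feed well in order to grow well.".split()
edits = edit_distance(hyp, ref)
print(edits / len(ref))  # 4 edits over 10 reference words = 0.4
```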

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "ninte/twi-en-nllb-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model.eval()

def translate(text, src_lang="twi_Latn", tgt_lang="eng_Latn"):
    tokenizer.src_lang = src_lang
    tgt_token_id = tokenizer.convert_tokens_to_ids(tgt_lang)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            forced_bos_token_id=tgt_token_id,
            num_beams=4,
            max_length=128,
            no_repeat_ngram_size=3,
            length_penalty=1.0,
            early_stopping=True,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(translate("Ɛsɛ sɛ mmofra di aduan pa na ama wɔanyin yiye."))
# → "Children need to eat healthy foods to grow well."
```

## Translation Examples

| Twi | Prediction | Reference |
|---|---|---|
| Ɛsɛ sɛ mmofra di aduan pa na ama wɔanyin yiye. | Children need to eat healthy foods to grow well. | Children need to feed well in order to grow well. |
| Mpanimfoɔ a wɔda mansini no ano... | The elders at the border of the district have teamed up with the loggers... | The district officials connive with loggers to make money out of trees. |

## Limitations

- Trained on ~9,600 sentence pairs; performance on out-of-domain text may vary.
- Max sequence length of 128 tokens — longer inputs are truncated.
- Best suited for standard/formal Twi; dialectal variation is not explicitly handled.
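One way to work around the 128-token limit is to split long passages into sentence-sized chunks and translate each chunk separately. A naive sketch (the regex splitter and word-count budget below are illustrative assumptions; Twi punctuation conventions may call for a smarter segmenter):

```python
import re

def split_into_chunks(text, max_words=60):
    """Split text on sentence-final punctuation, then greedily pack
    sentences into chunks that stay comfortably under the token limit."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sent in sentences:
        # Flush the current chunk if adding this sentence would exceed the budget.
        if current and len(" ".join(current + [sent]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Each chunk can then be passed to translate() from the Usage section.
text = "Ɛsɛ sɛ mmofra di aduan pa na ama wɔanyin yiye. Mpanimfoɔ no kasa."
print(split_into_chunks(text, max_words=8))
```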

## Citation

If you use this model, please cite the base NLLB model and the training dataset:

```bibtex
@article{nllb2022,
  title={No Language Left Behind: Scaling Human-Centered Machine Translation},
  author={NLLB Team et al.},
  journal={arXiv preprint arXiv:2207.04672},
  year={2022}
}
```

```bibtex
@dataset{ghananlp_twi_english_2023,
  author    = {GhanaNLP},
  title     = {GhanaNLP TWI-ENGLISH Parallel Text},
  year      = {2023},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/Ghana-NLP/TWI_ENGLISH_PARALLEL_TEXT}
}
```
