# NepTam20k-NLLB-200-600M

## Model Overview
NepTam20k-NLLB-200-600M is a fine-tuned version of Meta’s NLLB-200 (600M) model for Nepali ↔ Tamang machine translation.
The model was trained by the Information and Language Processing Research Lab (ILPRL) at Kathmandu University using the NepTam20K dataset.
Tamang, a low-resource Sino-Tibetan language, lacks representation in current multilingual MT vocabularies.
To overcome this, Hindi was used as a language proxy due to its shared Devanagari script, enabling effective tokenization and cross-lingual transfer.
## 🧩 Language Proxy Strategy
Because Tamang is absent from multilingual vocabularies, Hindi embeddings were used to represent Tamang text.
This approach exploits orthographic and phonetic similarities within Indic languages, leveraging shared embeddings to improve generalization across unseen languages.
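The shared-script premise can be checked directly: Devanagari codepoints occupy the Unicode block U+0900–U+097F for Hindi, Nepali, and Tamang alike, which is what lets the tokenizer's `hin_Deva` subword inventory cover Tamang input. A minimal, stdlib-only illustration (not part of the model's pipeline):

```python
# Check that Nepali/Tamang text written in Devanagari falls inside the
# same Unicode block (U+0900-U+097F) that Hindi occupies.
DEVANAGARI_RANGE = range(0x0900, 0x0980)

def devanagari_coverage(text: str) -> float:
    """Fraction of non-space characters inside the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_block = sum(ord(c) in DEVANAGARI_RANGE for c in chars)
    return in_block / len(chars)

print(devanagari_coverage("पुर्खाको योगदान"))  # Nepali sample → 1.0
print(devanagari_coverage("हिन्दी भाषा"))       # Hindi sample → 1.0
```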
## ⚙️ Training Details
| Parameter | Setting |
|---|---|
| Base Model | NLLB-200 (600M) |
| Dataset | NepTam20K (15K train / 5K test pairs) |
| Learning Rate | 5e-4 |
| Batch Size | 16 |
| Dropout | 0.3 |
| Epochs | 5 |
| Optimizer | Adam |
| Scheduler | Inverse square root |
| Framework | Hugging Face Transformers |
| Tokenizer | NLLB-200 tokenizer |
| Proxy Language | Hindi (hin_Deva) |
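The hyperparameters above map onto a Hugging Face `Seq2SeqTrainingArguments` configuration roughly as follows. This is a sketch, not the lab's actual training script; argument names assume a recent Transformers release, and dropout would be set on the model config rather than here:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the fine-tuning configuration from the table above.
# Dropout (0.3) is a model-config setting: model.config.dropout = 0.3
training_args = Seq2SeqTrainingArguments(
    output_dir="neptam20k-nllb-600m",   # hypothetical output path
    learning_rate=5e-4,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    lr_scheduler_type="inverse_sqrt",   # inverse square root schedule
    optim="adamw_torch",                # Adam-family optimizer
    predict_with_generate=True,
)
```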
## 📊 Evaluation Results
| Model | Direction | BLEU | chrF | chrF++ | METEOR | COMET |
|---|---|---|---|---|---|---|
| NepTam_NLLB | ne→tam | 40.92 | 73.98 | 69.94 | 61.44 | 75.89 |
| NepTam_NLLB | tam→ne | 45.26 | 73.71 | 70.30 | 62.31 | 80.19 |
Evaluation used BLEU (via sacreBLEU), chrF, chrF++, METEOR, and COMET to give a balanced view of quality in both translation directions.
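For intuition on what the chrF column measures, here is a minimal pure-Python sentence-level character n-gram F-score (β=2). It is a simplification for illustration, not the sacreBLEU implementation that produced the scores above:

```python
from collections import Counter

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF: mean char n-gram F-beta, scaled to 0-100."""
    def ngrams(text: str, n: int) -> Counter:
        text = text.replace(" ", "")  # chrF ignores whitespace by default
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    f_scores = []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())     # clipped n-gram matches
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            f_scores.append(0.0)
            continue
        f_scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0

print(round(chrf("the cat sat", "the cat sat"), 1))  # 100.0
```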
## 🚀 Inference
```bash
pip install transformers sentencepiece torch protobuf
```
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch
import re

# Load the fine-tuned NLLB model
model_name = "ilprl-docse/NepTam20k-NLLB-200-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(
    "cuda" if torch.cuda.is_available() else "cpu"
)

# Map language names to NLLB language codes
LANG_MAP = {
    "Nepali": "npi_Deva",
    "Tamang": "hin_Deva",  # Hindi used as proxy
}

def translate(text, source_lang="Nepali", target_lang="Tamang"):
    tokenizer.src_lang = LANG_MAP[source_lang]
    encoded = tokenizer(text, return_tensors="pt", padding=True).to(model.device)
    generated_tokens = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(LANG_MAP[target_lang]),
        max_length=512,
        num_beams=5,
        early_stopping=True,
    )
    translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    # Strip stray language-tag prefixes (e.g. "hi", "ne") occasionally emitted
    translation = re.sub(r'^(hi|ne)\s*[,.]?\s*', '', translation, flags=re.IGNORECASE).strip()
    return translation

# Example translation
text = "पुर्खाको योगदानमाथिको ठुलो अवमूल्यन होइन र यो?"
print("Input (Nepali):", text)
print("Translation (Tamang):", translate(text, "Nepali", "Tamang"))
```
Base model: facebook/nllb-200-distilled-600M