# NepTam20K-mBART50

๐Ÿ“ Model Overview

NepTam20K-mBART50 is a fine-tuned multilingual neural machine translation model based on `facebook/mbart-large-50`, developed for Nepali ↔ Tamang translation.

The model was trained by the Information and Language Processing Research Lab (ILPRL) at Kathmandu University, as part of a low-resource language revitalization effort.

Tamang has no language code in the original mBART-50 tokenizer vocabulary.
To address this, the Hindi code (`hi_IN`) is used as a Devanagari proxy token, enabling shared lexical transfer within the multilingual vocabulary and improved adaptation.
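
For illustration, here is a minimal sketch of how a Nepali-Tamang pair could be tokenized through the proxy code, assuming the standard `MBart50TokenizerFast` API; the Tamang string is a placeholder, not real corpus data:

```python
from transformers import MBart50TokenizerFast

# Base mBART-50 tokenizer: Tamang has no language code of its own,
# so Tamang text is tagged with the Hindi code "hi_IN".
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")

tokenizer.src_lang = "ne_NP"  # Nepali source
tokenizer.tgt_lang = "hi_IN"  # Tamang target, routed through the Hindi proxy code

batch = tokenizer(
    text="म आज धेरै खुशि छु।",           # Nepali source sentence
    text_target="<Tamang translation>",  # placeholder Tamang reference
    return_tensors="pt",
)
print(batch["input_ids"].shape, batch["labels"].shape)
```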


๐ŸŒ Supported Languages

Source → Target:

- Nepali → Tamang
- Tamang → Nepali

Token code mapping:

| Intended language | Tokenized as |
|---|---|
| Nepali | `ne_NP` |
| Tamang | `hi_IN` (proxy) |

โš™๏ธ Training Setup

### 📦 Dataset

NepTam20K parallel corpus:

- 15K training pairs
- 5K test pairs
- Nepali ↔ Tamang direction
- Devanagari script
- Cleaned & normalized

Dataset: ilprl-docse/NepTam-A-Nepali-Tamang-Parallel-Corpus
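
A minimal loading sketch with the `datasets` library (split and column names are assumptions; check the dataset card for the exact schema):

```python
from datasets import load_dataset

# Pull the NepTam20K parallel corpus from the Hugging Face Hub.
corpus = load_dataset("ilprl-docse/NepTam-A-Nepali-Tamang-Parallel-Corpus")

print(corpus)              # available splits and columns
print(corpus["train"][0])  # first sentence pair (split name assumed)
```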

### 🔧 Training Config

| Parameter | Value |
|---|---|
| Base model | mBART Large 50 |
| Epochs | 7 |
| Effective batch size | 8 |
| Gradient accumulation | 2 |
| Learning rate | 7e-5 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Tokenizer | mBART-50 tokenizer |
| Proxy language | Hindi |
| Framework | Hugging Face Transformers |
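
As a rough guide, the configuration above maps onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is a sketch, not the actual training script; the per-device batch size of 4 is inferred from the effective batch size of 8 with gradient accumulation of 2, and the output directory is hypothetical.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="neptam20k-mbart50",  # hypothetical output directory
    num_train_epochs=7,
    per_device_train_batch_size=4,   # 4 x grad-accum 2 = effective batch size 8 (assumed split)
    gradient_accumulation_steps=2,
    learning_rate=7e-5,
    weight_decay=0.01,
    optim="adamw_torch",             # AdamW optimizer
    predict_with_generate=True,
)
```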

## 📊 Evaluation

All models were trained on 15K and tested on 5K parallel sentences.

Evaluated using the following metrics (a scoring sketch follows the list):

- SacreBLEU
- chrF
- chrF++
- METEOR
- COMET
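
For reference, SacreBLEU, chrF, and chrF++ can be computed with the Hugging Face `evaluate` library as sketched below; the actual scoring setup, including the METEOR and COMET configurations, is not specified in this card.

```python
import evaluate

# Toy one-sentence example; the reported scores use the 5K-sentence test split.
predictions = ["hypothesis translation"]   # model outputs (placeholder)
references = [["reference translation"]]   # gold references (placeholder)

bleu   = evaluate.load("sacrebleu").compute(predictions=predictions, references=references)
chrf   = evaluate.load("chrf").compute(predictions=predictions, references=references)
chrfpp = evaluate.load("chrf").compute(predictions=predictions, references=references, word_order=2)

print(bleu["score"], chrf["score"], chrfpp["score"])
```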

Performance (NepTam20K test set):

| Model | Param size | Direction | BLEU | chrF | chrF++ | METEOR | COMET |
|---|---|---|---|---|---|---|---|
| NepTam_mBART | 610.9M | ne→tam | 40.14 | 72.91 | 68.92 | 60.46 | 75.60 |
| NepTam_mBART | 610.9M | tam→ne | 42.96 | 71.65 | 68.25 | 60.43 | 79.04 |

## 🚀 Usage: Inference

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ilprl-docse/NepTam20k-mBART-50-610.9M"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name).to(device)

LANG = {
    "Nepali": "ne_NP",
    "Tamang": "hi_IN" # Hindi = proxy
}

def translate(text, src="Nepali", tgt="Tamang"):
    tokenizer.src_lang = LANG[src]

    batch = tokenizer(text, return_tensors="pt").to(device)

    generated = model.generate(
        **batch,
        forced_bos_token_id=tokenizer.lang_code_to_id[LANG[tgt]],
        num_beams=5,
        max_length=128
    )
    
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

# Example
text = "เคฎ เค†เคœ เคงเฅ‡เคฐเฅˆ เค–เฅเคถเคฟ เค›เฅเฅค"
print("Input (Nepali):", text)
print("Translation (Tamang):", translate(text, src="Nepali",tgt="Tamang"))