# NepTam20K-mBART50

## Model Overview

NepTam20K-mBART50 is a fine-tuned multilingual neural machine translation model based on facebook/mbart-large-50, developed for Nepali ↔ Tamang translation.

The model was trained by the Information and Language Processing Research Lab (ILPRL) at Kathmandu University as part of a low-resource language revitalization effort.
Tamang is not present in the original mBART-50 tokenizer vocabulary. To address this, the Hindi language code (hi_IN) was used as a Devanagari proxy for Tamang: since both languages are written in Devanagari, Tamang text can reuse Hindi's shared subword vocabulary and embeddings, improving lexical transfer and adaptation.
## Supported Languages

Source ↔ Target:

- Nepali → Tamang
- Tamang → Nepali
Token code mapping:
| Intended language | Tokenized as |
|---|---|
| Nepali | ne_NP |
| Tamang | hi_IN (proxy) |
## Training Setup

### Dataset

NepTam20K – Parallel Corpus:
- 15K training pairs
- 5K test pairs
- Nepali ↔ Tamang
- Devanagari script
- Cleaned & normalized
Dataset: ilprl-docse/NepTam-A-Nepali-Tamang-Parallel-Corpus
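The card does not publish the exact cleaning pipeline, but "cleaned & normalized" Devanagari corpora typically go through Unicode NFC normalization and whitespace collapsing. A minimal sketch of such a step (field names in the example record are illustrative, not the dataset's actual schema):

```python
import unicodedata

def normalize_devanagari(text: str) -> str:
    """Apply Unicode NFC and collapse runs of whitespace -- a typical
    cleaning step for Devanagari parallel corpora (the exact ILPRL
    pipeline is not published)."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())

# An illustrative parallel record after cleaning:
pair = {
    "ne": normalize_devanagari("म  आज  धेरै खुशि छु।"),
    "tam": normalize_devanagari("..."),  # Tamang side, also in Devanagari
}
```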
### Training Config
| Parameter | Value |
|---|---|
| Base model | mBART Large 50 |
| Epochs | 7 |
| Effective batch size | 8 |
| Gradient Accumulation | 2 |
| Learning Rate | 7e-5 |
| Weight Decay | 0.01 |
| Optimizer | AdamW |
| Tokenizer | MBART-50 tokenizer |
| Proxy language | Hindi |
| Framework | HuggingFace Transformers |
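The hyperparameters above map onto Hugging Face `Seq2SeqTrainingArguments` roughly as sketched below; this is not the exact ILPRL training script, and the output path is a placeholder. Note that an effective batch size of 8 with gradient accumulation of 2 implies a per-device batch size of 4.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="neptam20k-mbart50",    # placeholder path
    num_train_epochs=7,
    per_device_train_batch_size=4,     # 4 x 2 accumulation = effective batch of 8
    gradient_accumulation_steps=2,
    learning_rate=7e-5,
    weight_decay=0.01,                 # AdamW is the Trainer's default optimizer
    predict_with_generate=True,
)
```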
## Evaluation

All models were trained on the 15K training pairs and tested on the 5K test pairs.

Evaluated using:
- SacreBLEU
- chrF
- chrF++
- METEOR
- COMET
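chrF scores a hypothesis against a reference by an F-score over character n-grams (β = 2, n = 1..6 by default), and chrF++ additionally mixes in word n-grams. A minimal single-sentence sketch of the character-n-gram part (the scores reported here were computed with standard tooling such as sacreBLEU, not this snippet):

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    s = text.replace(" ", "")  # chrF ignores spaces by default
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        if hyp or ref:
            precisions.append(overlap / max(sum(hyp.values()), 1))
            recalls.append(overlap / max(sum(ref.values()), 1))
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```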
Performance (NepTam20K test set):
| Model | Param Size | Direction | BLEU | chrF | chrF++ | METEOR | COMET |
|---|---|---|---|---|---|---|---|
| NepTam_mBART | 610.9M | ne → tam | 40.14 | 72.91 | 68.92 | 60.46 | 75.60 |
| NepTam_mBART | 610.9M | tam → ne | 42.96 | 71.65 | 68.25 | 60.43 | 79.04 |
## Usage: Inference
```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ilprl-docse/NepTam20k-mBART-50-610.9M"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name).to(device)

LANG = {
    "Nepali": "ne_NP",
    "Tamang": "hi_IN",  # Hindi used as Devanagari proxy for Tamang
}

def translate(text, src="Nepali", tgt="Tamang"):
    tokenizer.src_lang = LANG[src]
    batch = tokenizer(text, return_tensors="pt").to(device)
    generated = model.generate(
        **batch,
        forced_bos_token_id=tokenizer.lang_code_to_id[LANG[tgt]],
        num_beams=5,
        max_length=128,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

# Example
text = "म आज धेरै खुशि छु।"
print("Input (Nepali):", text)
print("Translation (Tamang):", translate(text, src="Nepali", tgt="Tamang"))
```