# NepTam20K-mBART50

๐Ÿ“ Model Overview

NepTam20K-mBART50 is a fine-tuned multilingual neural machine translation model based on `facebook/mbart-large-50`, developed for Nepali ↔ Tamang translation.

The model was trained by the Information and Language Processing Research Lab (ILPRL) at Kathmandu University, as part of a low-resource language revitalization effort.

Tamang has no language code in the original mBART-50 tokenizer vocabulary.
To address this, the Hindi code (`hi_IN`) is used as a Devanagari proxy token, enabling shared lexical transfer within the multilingual vocabulary and improved adaptation.
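
For illustration, here is a minimal sketch of how a Nepali-Tamang pair could be tokenized through the proxy code, assuming the standard `MBart50TokenizerFast` API; the Tamang string is a placeholder, not real corpus data:

```python
from transformers import MBart50TokenizerFast

# Base mBART-50 tokenizer: Tamang has no language code of its own,
# so Tamang text is tagged with the Hindi code "hi_IN".
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")

tokenizer.src_lang = "ne_NP"  # Nepali source
tokenizer.tgt_lang = "hi_IN"  # Tamang target, routed through the Hindi proxy code

batch = tokenizer(
    text="म आज धेरै खुशि छु।",           # Nepali source sentence
    text_target="<Tamang translation>",  # placeholder Tamang reference
    return_tensors="pt",
)
print(batch["input_ids"].shape, batch["labels"].shape)
```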


๐ŸŒ Supported Languages

Source → Target:

- Nepali → Tamang
- Tamang → Nepali

Token code mapping:

| Intended language | Tokenized as |
|---|---|
| Nepali | `ne_NP` |
| Tamang | `hi_IN` (proxy) |

โš™๏ธ Training Setup

### 📦 Dataset

NepTam20K parallel corpus:

- 15K training pairs
- 5K test pairs
- Nepali ↔ Tamang direction
- Devanagari script
- Cleaned & normalized

Dataset: ilprl-docse/NepTam-A-Nepali-Tamang-Parallel-Corpus
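
A minimal loading sketch with the `datasets` library (split and column names are assumptions; check the dataset card for the exact schema):

```python
from datasets import load_dataset

# Pull the NepTam20K parallel corpus from the Hugging Face Hub.
corpus = load_dataset("ilprl-docse/NepTam-A-Nepali-Tamang-Parallel-Corpus")

print(corpus)              # available splits and columns
print(corpus["train"][0])  # first sentence pair (split name assumed)
```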

### 🔧 Training Config

| Parameter | Value |
|---|---|
| Base model | mBART Large 50 |
| Epochs | 7 |
| Effective batch size | 8 |
| Gradient accumulation | 2 |
| Learning rate | 7e-5 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Tokenizer | mBART-50 tokenizer |
| Proxy language | Hindi |
| Framework | Hugging Face Transformers |
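
As a rough guide, the configuration above maps onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is a sketch, not the actual training script; the per-device batch size of 4 is inferred from the effective batch size of 8 with gradient accumulation of 2, and the output directory is hypothetical.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="neptam20k-mbart50",  # hypothetical output directory
    num_train_epochs=7,
    per_device_train_batch_size=4,   # 4 x grad-accum 2 = effective batch size 8 (assumed split)
    gradient_accumulation_steps=2,
    learning_rate=7e-5,
    weight_decay=0.01,
    optim="adamw_torch",             # AdamW optimizer
    predict_with_generate=True,
)
```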

## 📊 Evaluation

All models were trained on 15K and tested on 5K parallel sentences.

Evaluated using the following metrics (a scoring sketch follows the list):

- SacreBLEU
- chrF
- chrF++
- METEOR
- COMET
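
For reference, SacreBLEU, chrF, and chrF++ can be computed with the Hugging Face `evaluate` library as sketched below; the actual scoring setup, including the METEOR and COMET configurations, is not specified in this card.

```python
import evaluate

# Toy one-sentence example; the reported scores use the 5K-sentence test split.
predictions = ["hypothesis translation"]   # model outputs (placeholder)
references = [["reference translation"]]   # gold references (placeholder)

bleu   = evaluate.load("sacrebleu").compute(predictions=predictions, references=references)
chrf   = evaluate.load("chrf").compute(predictions=predictions, references=references)
chrfpp = evaluate.load("chrf").compute(predictions=predictions, references=references, word_order=2)

print(bleu["score"], chrf["score"], chrfpp["score"])
```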

Performance (NepTam20K test set):

| Model | Param size | Direction | BLEU | chrF | chrF++ | METEOR | COMET |
|---|---|---|---|---|---|---|---|
| NepTam_mBART | 610.9M | ne→tam | 40.14 | 72.91 | 68.92 | 60.46 | 75.60 |
| NepTam_mBART | 610.9M | tam→ne | 42.96 | 71.65 | 68.25 | 60.43 | 79.04 |

## 🚀 Usage: Inference

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ilprl-docse/NepTam20k-mBART-50-610.9M"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name).to(device)

LANG = {
    "Nepali": "ne_NP",
    "Tamang": "hi_IN" # Hindi = proxy
}

def translate(text, src="Nepali", tgt="Tamang"):
    tokenizer.src_lang = LANG[src]

    batch = tokenizer(text, return_tensors="pt").to(device)

    generated = model.generate(
        **batch,
        forced_bos_token_id=tokenizer.lang_code_to_id[LANG[tgt]],
        num_beams=5,
        max_length=128
    )
    
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

# Example
text = "เคฎ เค†เคœ เคงเฅ‡เคฐเฅˆ เค–เฅเคถเคฟ เค›เฅเฅค"
print("Input (Nepali):", text)
print("Translation (Tamang):", translate(text, src="Nepali",tgt="Tamang"))