
NepTam20k-NLLB-200-600M

Model Overview

NepTam20k-NLLB-200-600M is a fine-tuned version of Meta’s NLLB-200 (600M) model for Nepali ↔ Tamang machine translation.
The model was trained by the Information and Language Processing Research Lab (ILPRL) at Kathmandu University using the NepTam20K dataset.

Tamang, a low-resource Sino-Tibetan language, lacks representation in current multilingual MT vocabularies.
To overcome this, Hindi was used as a language proxy due to its shared Devanagari script, enabling effective tokenization and cross-lingual transfer.


🧩 Language Proxy Strategy

Because Tamang is absent from multilingual vocabularies, Tamang text is routed through the Hindi language code (hin_Deva), so the model reuses Hindi's language token and embeddings.
This approach exploits orthographic and phonetic similarities within Indic languages, leveraging shared Devanagari-script embeddings to improve generalization to the unseen language.
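The proxy hinges on Tamang and Nepali sharing the Devanagari script with Hindi. The illustrative check below (not part of the training pipeline) measures what fraction of a string's characters fall in the Devanagari Unicode block (U+0900–U+097F), which is what makes the hin_Deva vocabulary usable for Tamang:

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of non-space characters in the Devanagari Unicode block
    (U+0900..U+097F). A high ratio is what makes the hin_Deva proxy viable."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    deva = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return deva / len(chars)

# The Nepali example sentence from this card scores close to 1.0
# (only the punctuation falls outside the Devanagari block).
ratio = devanagari_ratio("पुर्खाको योगदानमाथिको ठुलो अवमूल्यन होइन र यो?")
```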


⚙️ Training Details

| Parameter | Setting |
| --- | --- |
| Base Model | NLLB-200 (600M) |
| Dataset | NepTam20K (15K train / 5K test pairs) |
| Learning Rate | 5e-4 |
| Batch Size | 16 |
| Dropout | 0.3 |
| Epochs | 5 |
| Optimizer | Adam |
| Scheduler | Inverse square root |
| Framework | Hugging Face Transformers |
| Tokenizer | NLLB-200 tokenizer |
| Proxy Language | Hindi (hin_Deva) |
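For reference, the hyper-parameters above map onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is a hedged sketch, not the authors' released script: `output_dir` and `warmup_steps` are illustrative assumptions, and dropout is set on the model config rather than the training arguments.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the table's hyper-parameters; output_dir and warmup_steps
# are illustrative assumptions, not values stated in this card.
training_args = Seq2SeqTrainingArguments(
    output_dir="neptam20k-nllb",       # assumption: any local path
    learning_rate=5e-4,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    lr_scheduler_type="inverse_sqrt",  # inverse square root schedule
    warmup_steps=500,                  # assumption: not stated in the card
    predict_with_generate=True,
)
```

Dropout (0.3) would be applied via the model rather than the trainer, e.g. `model.config.dropout = 0.3` before fine-tuning.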

📊 Evaluation Results

| Model | Direction | BLEU | chrF | chrF++ | METEOR | COMET |
| --- | --- | --- | --- | --- | --- | --- |
| NepTam_NLLB | ne→tam | 40.92 | 73.98 | 69.94 | 61.44 | 75.89 |
| NepTam_NLLB | tam→ne | 45.26 | 73.71 | 70.30 | 62.31 | 80.19 |

Scores were computed with sacreBLEU (BLEU, chrF, chrF++), METEOR, and COMET to give a balanced view of quality in both translation directions.
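For intuition about the character-level metrics in the table, here is a minimal pure-Python sketch of the chrF idea: an average of character n-gram F-beta scores (beta = 2, orders 1–6). It is a simplification, not the sacreBLEU implementation, which additionally supports word n-grams (chrF++) and multiple references:

```python
from collections import Counter

def chrf(hypothesis: str, reference: str, max_order: int = 6, beta: float = 2.0) -> float:
    """Minimal chrF sketch: mean character n-gram F-beta score (0-100).

    Whitespace is removed so scoring is purely character-based; a
    simplification of sacreBLEU's chrF for illustration only.
    """
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    f_scores = []
    for n in range(1, max_order + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not hyp_ngrams or not ref_ngrams:
            continue  # string too short for this n-gram order
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        precision = overlap / sum(hyp_ngrams.values())
        recall = overlap / sum(ref_ngrams.values())
        if precision + recall == 0:
            f_scores.append(0.0)
            continue
        f_scores.append((1 + beta**2) * precision * recall
                        / (beta**2 * precision + recall))
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0
```

For real evaluation, use `sacrebleu.corpus_chrf` rather than this sketch.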


🚀 Inference

```bash
pip install transformers sentencepiece torch protobuf
```

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch
import re

# Load the fine-tuned NLLB model and tokenizer
model_name = "ilprl-docse/NepTam20k-NLLB-200-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

# Map display names to NLLB language codes; Hindi stands in for Tamang
LANG_MAP = {
    "Nepali": "npi_Deva",
    "Tamang": "hin_Deva",  # Hindi used as proxy (shared Devanagari script)
}

def translate(text, source_lang="Nepali", target_lang="Tamang"):
    tokenizer.src_lang = LANG_MAP[source_lang]
    encoded = tokenizer(text, return_tensors="pt", padding=True).to(model.device)

    generated_tokens = model.generate(
        **encoded,
        # Force the decoder to start with the target-language token
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(LANG_MAP[target_lang]),
        max_length=512,
        num_beams=5,
        early_stopping=True,
    )

    translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    # Strip stray "hi"/"ne" language-tag prefixes occasionally emitted
    translation = re.sub(r'^(hi|ne)\s*[,.]?\s*', '', translation, flags=re.IGNORECASE).strip()
    return translation

# Example translation (Nepali → Tamang)
text = "पुर्खाको योगदानमाथिको ठुलो अवमूल्यन होइन र यो?"
print("Input (Nepali):", text)
print("Translation (Tamang):", translate(text, "Nepali", "Tamang"))
```
