
NepTam20k-NLLB-200-600M

Model Overview

NepTam20k-NLLB-200-600M is a fine-tuned version of Meta’s NLLB-200 (600M) model for Nepali ↔ Tamang machine translation.
The model was trained by the Information and Language Processing Research Lab (ILPRL) at Kathmandu University using the NepTam20K dataset.

Tamang, a low-resource Sino-Tibetan language, lacks representation in current multilingual MT vocabularies.
To overcome this, Hindi was used as a language proxy due to its shared Devanagari script, enabling effective tokenization and cross-lingual transfer.


🧩 Language Proxy Strategy

Because Tamang is absent from multilingual vocabularies, Tamang text is routed through the Hindi language code (hin_Deva), so the model reuses Hindi's language token and embeddings.
This approach exploits orthographic and phonetic similarities within Indic languages, leveraging shared Devanagari-script embeddings to improve generalization to the unseen language.
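The proxy hinges on Tamang and Nepali sharing the Devanagari script with Hindi. The illustrative check below (not part of the training pipeline) measures what fraction of a string's characters fall in the Devanagari Unicode block (U+0900–U+097F), which is what makes the hin_Deva vocabulary usable for Tamang:

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of non-space characters in the Devanagari Unicode block
    (U+0900..U+097F). A high ratio is what makes the hin_Deva proxy viable."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    deva = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return deva / len(chars)

# The Nepali example sentence from this card scores close to 1.0
# (only the punctuation falls outside the Devanagari block).
ratio = devanagari_ratio("पुर्खाको योगदानमाथिको ठुलो अवमूल्यन होइन र यो?")
```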


⚙️ Training Details

| Parameter | Setting |
| --- | --- |
| Base Model | NLLB-200 (600M) |
| Dataset | NepTam20K (15K train / 5K test pairs) |
| Learning Rate | 5e-4 |
| Batch Size | 16 |
| Dropout | 0.3 |
| Epochs | 5 |
| Optimizer | Adam |
| Scheduler | Inverse square root |
| Framework | Hugging Face Transformers |
| Tokenizer | NLLB-200 tokenizer |
| Proxy Language | Hindi (hin_Deva) |
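For reference, the hyper-parameters above map onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is a hedged sketch, not the authors' released script: `output_dir` and `warmup_steps` are illustrative assumptions, and dropout is set on the model config rather than the training arguments.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the table's hyper-parameters; output_dir and warmup_steps
# are illustrative assumptions, not values stated in this card.
training_args = Seq2SeqTrainingArguments(
    output_dir="neptam20k-nllb",       # assumption: any local path
    learning_rate=5e-4,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    lr_scheduler_type="inverse_sqrt",  # inverse square root schedule
    warmup_steps=500,                  # assumption: not stated in the card
    predict_with_generate=True,
)
```

Dropout (0.3) would be applied via the model rather than the trainer, e.g. `model.config.dropout = 0.3` before fine-tuning.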

📊 Evaluation Results

| Model | Direction | BLEU | chrF | chrF++ | METEOR | COMET |
| --- | --- | --- | --- | --- | --- | --- |
| NepTam_NLLB | ne→tam | 40.92 | 73.98 | 69.94 | 61.44 | 75.89 |
| NepTam_NLLB | tam→ne | 45.26 | 73.71 | 70.30 | 62.31 | 80.19 |

Scores were computed with sacreBLEU (BLEU, chrF, chrF++), METEOR, and COMET to give a balanced view of quality in both translation directions.
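For intuition about the character-level metrics in the table, here is a minimal pure-Python sketch of the chrF idea: an average of character n-gram F-beta scores (beta = 2, orders 1–6). It is a simplification, not the sacreBLEU implementation, which additionally supports word n-grams (chrF++) and multiple references:

```python
from collections import Counter

def chrf(hypothesis: str, reference: str, max_order: int = 6, beta: float = 2.0) -> float:
    """Minimal chrF sketch: mean character n-gram F-beta score (0-100).

    Whitespace is removed so scoring is purely character-based; a
    simplification of sacreBLEU's chrF for illustration only.
    """
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    f_scores = []
    for n in range(1, max_order + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not hyp_ngrams or not ref_ngrams:
            continue  # string too short for this n-gram order
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        precision = overlap / sum(hyp_ngrams.values())
        recall = overlap / sum(ref_ngrams.values())
        if precision + recall == 0:
            f_scores.append(0.0)
            continue
        f_scores.append((1 + beta**2) * precision * recall
                        / (beta**2 * precision + recall))
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0
```

For real evaluation, use `sacrebleu.corpus_chrf` rather than this sketch.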


🚀 Inference

```bash
pip install transformers sentencepiece torch protobuf
```

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch
import re

# Load the fine-tuned NLLB model and tokenizer
model_name = "ilprl-docse/NepTam20k-NLLB-200-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

# Map display names to NLLB language codes; Hindi stands in for Tamang
LANG_MAP = {
    "Nepali": "npi_Deva",
    "Tamang": "hin_Deva",  # Hindi used as proxy (shared Devanagari script)
}

def translate(text, source_lang="Nepali", target_lang="Tamang"):
    tokenizer.src_lang = LANG_MAP[source_lang]
    encoded = tokenizer(text, return_tensors="pt", padding=True).to(model.device)

    generated_tokens = model.generate(
        **encoded,
        # Force the decoder to start with the target-language token
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(LANG_MAP[target_lang]),
        max_length=512,
        num_beams=5,
        early_stopping=True,
    )

    translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    # Strip stray "hi"/"ne" language-tag prefixes occasionally emitted
    translation = re.sub(r'^(hi|ne)\s*[,.]?\s*', '', translation, flags=re.IGNORECASE).strip()
    return translation

# Example translation (Nepali → Tamang)
text = "पुर्खाको योगदानमाथिको ठुलो अवमूल्यन होइन र यो?"
print("Input (Nepali):", text)
print("Translation (Tamang):", translate(text, "Nepali", "Tamang"))
```
