**Tags:** Translation · Safetensors · Nepali · m2m_100 · low-resource · multilingual-translation · Nepali-Tamang

# NepTam20k-M2M-100-418M

## Model Overview

NepTam20k-M2M-100-418M is a fine-tuned version of Facebook’s M2M-100 (418M) model for Nepali ↔ Tamang translation.
The model was developed by the Information and Language Processing Research Lab (ILPRL) at Kathmandu University using the NepTam20K dataset.

Since Tamang is not included in M2M-100’s vocabulary, Hindi was adopted as a proxy language because it shares the Devanagari script, enabling effective tokenization and embedding sharing. This cross-lingual proxy strategy helps bridge low-resource gaps by exploiting orthographic and phonetic similarity.


## 🧩 Language Proxy Strategy

The Hindi proxy approach enables Tamang representation within M2M-100’s shared multilingual space.
This aligns Tamang tokens with Hindi embeddings, leveraging transfer learning benefits from a related, high-resource Indic language.
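
Since the proxy choice rests on Nepali and Hindi sharing a script, that shared-script premise can be sanity-checked with Python's standard `unicodedata` module. This is a toy illustration, not part of the released pipeline:

```python
import unicodedata

def scripts(text):
    # Collect the script prefix of each letter's Unicode character name,
    # e.g. "DEVANAGARI LETTER NA" -> "DEVANAGARI".
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

# Nepali and Hindi text occupy the same Unicode script block, which is
# what lets Tamang ride on the Hindi ("hi") token space.
print(scripts("नमस्ते"))         # Nepali greeting -> {'DEVANAGARI'}
print(scripts("नमस्ते दुनिया"))  # Hindi phrase   -> {'DEVANAGARI'}
```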


## ⚙️ Training Details

| Parameter | Setting |
|---|---|
| Base Model | M2M-100 (418M) |
| Dataset | NepTam20K (15K train / 5K test pairs) |
| Learning Rate | 7e-5 |
| Batch Size | 8 |
| Gradient Accumulation | 2 |
| Weight Decay | 0.01 |
| Epochs | 5 |
| Optimizer | Adam |
| Framework | Hugging Face Transformers |
| Tokenizer | M2M-100 Tokenizer |
| Proxy Language | Hindi (hi) |
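
The exact training script is not published; the sketch below simply restates the table as a configuration dict, keyed by the argument names that Hugging Face's `Seq2SeqTrainingArguments` would expect (the mapping of table rows onto those names is an assumption):

```python
# Hyperparameters from the table above, keyed by the argument names
# used by transformers' Seq2SeqTrainingArguments (assumed mapping).
hyperparams = {
    "learning_rate": 7e-5,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "weight_decay": 0.01,
    "num_train_epochs": 5,
}

# Gradient accumulation multiplies the effective batch size per optimizer step:
effective_batch = (hyperparams["per_device_train_batch_size"]
                   * hyperparams["gradient_accumulation_steps"])
print(effective_batch)  # 16
```

With a batch size of 8 and 2 accumulation steps, each optimizer update effectively sees 16 sentence pairs.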

## 📊 Evaluation Results

| Model | Direction | BLEU | chrF | chrF++ | METEOR | COMET |
|---|---|---|---|---|---|---|
| NepTam_M2M | ne→tam | 40.24 | 73.30 | 69.27 | 60.81 | 75.67 |
| NepTam_M2M | tam→ne | 42.73 | 72.21 | 68.74 | 61.07 | 79.13 |
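
For intuition about the chrF column, here is a minimal pure-Python sketch of a sentence-level character n-gram F-score. The reported numbers were presumably produced with standard tooling such as sacreBLEU, whose implementation differs in details (whitespace handling, corpus-level aggregation); this toy version only illustrates what the metric measures:

```python
from collections import Counter

def char_ngrams(text, n):
    text = text.replace(" ", "")  # ignore whitespace, as chrF does by default
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Toy sentence-level chrF: F-beta over character n-grams, averaged over n=1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # sentence shorter than n characters
        overlap = sum((hyp & ref).values())
        prec, rec = overlap / sum(hyp.values()), overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0

print(chrf("hello world", "hello world"))  # 100.0
```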

## 🚀 Inference

```bash
pip install transformers sentencepiece torch
```

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
import torch

# Load the fine-tuned model from Hugging Face
model_name = "ilprl-docse/NepTam20k-M2M-100-418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = M2M100ForConditionalGeneration.from_pretrained(model_name).to(device)

# Language mapping: Hindi ("hi") stands in for Tamang,
# which has no code of its own in M2M-100
LANG_MAP = {
    "Nepali": "ne",
    "Tamang": "hi",  # proxy language code
}

def translate(text, source_lang, target_lang):
    tokenizer.src_lang = LANG_MAP[source_lang]
    encoded = tokenizer(text, return_tensors="pt").to(model.device)
    # Force generation in the target language (Hindi proxy for Tamang)
    generated_tokens = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.get_lang_id(LANG_MAP[target_lang]),
        max_length=512,
    )
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

# Example translation
text = "पुर्खाको योगदानमाथिको ठुलो अवमूल्यन होइन र यो?"
print("Input (Nepali):", text)
print("Translation (Tamang):", translate(text, "Nepali", "Tamang"))
```