NepTam20k-M2M-100-418M
Model Overview
NepTam20k-M2M-100-418M is a fine-tuned version of Facebook’s M2M-100 (418M) model for Nepali ↔ Tamang translation.
The model was developed by the Information and Language Processing Research Lab (ILPRL) at Kathmandu University using the NepTam20K dataset.
Since Tamang is not included in M2M-100’s vocabulary, Hindi was adopted as a proxy language because it shares the Devanagari script, enabling effective tokenization and embedding sharing. This cross-lingual proxy strategy helps bridge low-resource gaps by exploiting orthographic and phonetic similarity.
🧩 Language Proxy Strategy
The Hindi proxy approach enables Tamang representation within M2M-100’s shared multilingual space.
This aligns Tamang tokens with Hindi embeddings, leveraging transfer learning benefits from a related, high-resource Indic language.
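To make the proxy concrete, the sketch below shows schematically how M2M-100 frames a training pair when Hindi (`hi`) stands in for Tamang. `frame_pair` is a hypothetical helper for illustration only; the real `M2M100Tokenizer` prepends a language token id rather than a literal string.

```python
# Schematic illustration (not the real tokenizer) of how M2M-100 frames a
# training pair when Hindi ("hi") is used as the proxy code for Tamang.
# M2M100Tokenizer actually prepends a language token id; plain strings are
# shown here for clarity.
def frame_pair(src_text, tgt_text, src_code, tgt_code):
    src = f"__{src_code}__ {src_text} </s>"
    tgt = f"__{tgt_code}__ {tgt_text} </s>"
    return src, tgt

src, tgt = frame_pair("नेपाली वाक्य", "तामाङ वाक्य", "ne", "hi")
print(src)  # __ne__ नेपाली वाक्य </s>
print(tgt)  # the Tamang side carries the Hindi tag: __hi__ तामाङ वाक्य </s>
```

Because Tamang sentences are tagged with the Hindi code, they are tokenized and embedded exactly as Hindi text would be, which is what lets the model reuse Hindi's Devanagari subword inventory.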
⚙️ Training Details
| Parameter | Setting |
|---|---|
| Base Model | M2M-100 (418M) |
| Dataset | NepTam20K (15K train / 5K test pairs) |
| Learning Rate | 7e-5 |
| Batch Size | 8 |
| Gradient Accumulation | 2 |
| Weight Decay | 0.01 |
| Epochs | 5 |
| Optimizer | Adam |
| Framework | Hugging Face Transformers |
| Tokenizer | M2M-100 Tokenizer |
| Proxy Language | Hindi (hi) |
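The table above maps roughly onto Hugging Face `Seq2SeqTrainingArguments` as sketched below. This is illustrative only: the `output_dir` and any settings not listed in the table are assumptions, not taken from the training run.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the hyperparameters from the table above. The output_dir is a
# hypothetical path; unlisted settings keep their Transformers defaults.
training_args = Seq2SeqTrainingArguments(
    output_dir="neptam20k-m2m100-418m",  # assumption, not from the card
    learning_rate=7e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    weight_decay=0.01,
    num_train_epochs=5,
)
```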
📊 Evaluation Results
| Model | Direction | BLEU | chrF | chrF++ | METEOR | COMET |
|---|---|---|---|---|---|---|
| NepTam_M2M | ne→tam | 40.24 | 73.30 | 69.27 | 60.81 | 75.67 |
| NepTam_M2M | tam→ne | 42.73 | 72.21 | 68.74 | 61.07 | 79.13 |
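For readers unfamiliar with chrF: it is a character n-gram F-score. A simplified pure-Python version is sketched below; the sacreBLEU implementation used for reporting differs in details (chrF++ additionally mixes in word n-grams), so treat this only as an illustration of what the metric measures.

```python
from collections import Counter

def char_ngrams(text, n):
    # chrF operates on character n-grams; whitespace is removed by default.
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2):
    # Average n-gram precision and recall over orders 1..max_n, then F-beta
    # (beta=2 weights recall twice as heavily as precision).
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p, r = sum(precisions) / max_n, sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

An identical hypothesis and reference score 100; completely disjoint strings score 0.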
🚀 Inference
Install the dependencies:

```bash
pip install transformers sentencepiece torch
```

```python
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Load the fine-tuned model from Hugging Face
model_name = "ilprl-docse/NepTam20k-M2M-100-418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = M2M100ForConditionalGeneration.from_pretrained(model_name).to(device)

# Language mapping: Hindi ("hi") serves as the proxy code for Tamang
LANG_MAP = {
    "Nepali": "ne",
    "Tamang": "hi",  # proxy language code
}

def translate(text, source_lang, target_lang):
    tokenizer.src_lang = LANG_MAP[source_lang]
    encoded = tokenizer(text, return_tensors="pt").to(model.device)
    # Force generation in the target language (Hindi proxy for Tamang)
    generated_tokens = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.get_lang_id(LANG_MAP[target_lang]),
        max_length=512,
    )
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

# Example translation
text = "पुर्खाको योगदानमाथिको ठुलो अवमूल्यन होइन र यो?"
print("Input (Nepali):", text)
print("Translation (Tamang):", translate(text, "Nepali", "Tamang"))
```
Model tree for ilprl-docse/NepTam20k-M2M-100-418M
- Base model: facebook/m2m100_418M