NepTam: A Nepali-Tamang Parallel Corpus and Baseline Machine
Collection
This collection includes datasets and models for Nepali-Tamang bilingual translation trained on 20k parallel sentences.
NepTam20K-Transformer is a vanilla Transformer model fine-tuned for Nepali ↔ Tamang neural machine translation.
Originally implemented in Fairseq, the model has been converted to HuggingFace MarianMT format for easier deployment and inference.
The model was developed by the Information and Language Processing Research Lab (ILPRL) at Kathmandu University, as part of low-resource MT research.
It uses Hindi (hi_IN) as a proxy for Tamang in tokenization, due to the absence of Tamang in standard tokenizers.
| Intended language | Tokenized as |
|---|---|
| Nepali | ne_NP |
| Tamang | hi_IN (proxy) |
Supported directions: Nepali → Tamang (ne→tam) and Tamang → Nepali (tam→ne).
Dataset: ilprl-docse/NepTam-A-Nepali-Tamang-Parallel-Corpus

| Model | Param Size | Direction | BLEU | chrF | chrF++ | METEOR | COMET |
|---|---|---|---|---|---|---|---|
| NepTam_Transformer | 49.07M | ne→tam | 37.71 | 71.71 | 67.74 | 58.20 | 75.00 |
| NepTam_Transformer | 49.07M | tam→ne | 38.01 | 69.89 | 66.30 | 57.60 | 77.37 |
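For readers unfamiliar with the chrF column: chrF is an F-score over character n-grams, and chrF++ additionally counts word n-grams. The sketch below is a simplified, sentence-level illustration of the character part only. The function names `char_ngrams` and `simple_chrf` are hypothetical, whitespace stripping and the averaging scheme are simplifying assumptions, and the result is not expected to reproduce the scores in the table, which would have come from a standard evaluation tool.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace (a simplifying assumption)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale (illustrative only).

    For every order n in 1..max_n where both strings contain at least one
    n-gram, compute precision and recall, average each across those orders,
    and combine the averages with an F-beta score (beta=2 weights recall more).
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is too short for this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


if __name__ == "__main__":
    # Placeholder strings; any pair of hypothesis/reference sentences works.
    print(simple_chrf("the cat sat", "the cat sat"))
    print(simple_chrf("the cat sat", "a cat sits"))
```

Because the score works on characters rather than whole words, it gives partial credit for near-miss word forms, which is one reason it is commonly reported alongside BLEU for morphologically rich or low-resource languages.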
This model uses the MarianMT HuggingFace API. The following code shows how to load and translate with it:
```python
import torch
from transformers import MarianMTModel, MarianTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer
model_path = "ilprl-docse/NepTam20k-transformer-49.07M"
tokenizer = MarianTokenizer.from_pretrained(model_path)
model = MarianMTModel.from_pretrained(model_path).to(device)

# Language map
LANG_MAP = {
    "Nepali": "__ne__",
    "Tamang": "__taj__",
}


def translate_with_transformer(text, src="Nepali", tgt="Tamang"):
    src_lang = LANG_MAP[src]  # looked up but not otherwise used
    tgt_lang = LANG_MAP[tgt]

    # Prefix target language token (MarianMT-style)
    input_text = f"{tgt_lang} {text}"
    encoded = tokenizer(input_text, return_tensors="pt", padding=True).to(device)

    with torch.no_grad():
        generated_ids = model.generate(
            **encoded,
            max_length=512,
            num_beams=5,
            early_stopping=True,
        )

    return tokenizer.decode(generated_ids[0], skip_special_tokens=True).strip()


# Example
text = "म आज धेरै खुसी छु।"
translation = translate_with_transformer(text, src="Nepali", tgt="Tamang")
print("Translation (Tamang):", translation)
```
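When translating many sentences, it may help to prefix and batch the inputs before calling the model. The following is a minimal sketch of that plumbing, assuming the same target-token prefix convention as the snippet above. The helpers `prefix_target`, `chunks`, and `translate_many` are hypothetical names, not part of the model card or of the transformers library, and `translate_batch` is a caller-supplied function.

```python
from typing import Callable, Iterator, List

# Prefix tokens copied from LANG_MAP in the usage snippet above.
LANG_MAP = {"Nepali": "__ne__", "Tamang": "__taj__"}


def prefix_target(text: str, tgt: str) -> str:
    """Prepend the target-language token, as in the single-sentence example."""
    return f"{LANG_MAP[tgt]} {text}"


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_many(
    texts: List[str],
    tgt: str,
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Translate `texts` in batches through a caller-supplied function.

    `translate_batch` (hypothetical) receives already-prefixed strings and
    must return one output string per input, in the same order.
    """
    results: List[str] = []
    for batch in chunks(texts, batch_size):
        results.extend(translate_batch([prefix_target(t, tgt) for t in batch]))
    return results


if __name__ == "__main__":
    # Stand-in "model" that only echoes its input, to show the data flow.
    echo = lambda batch: [f"<{s}>" for s in batch]
    print(translate_many(["a", "b", "c"], "Tamang", echo, batch_size=2))
```

In practice, `translate_batch` could wrap the tokenizer and model loaded earlier: tokenize the list with `padding=True`, call `model.generate`, and decode with `tokenizer.batch_decode(..., skip_special_tokens=True)`. Keeping that step behind a callable lets the batching logic be checked without downloading the model.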