NepTam20K-Transformer

๐Ÿ“ Model Overview

NepTam20K-Transformer is a vanilla Transformer model fine-tuned for Nepali ↔ Tamang neural machine translation.

Originally implemented in Fairseq, the model has been converted to HuggingFace MarianMT format for easier deployment and inference.

The model was developed by the Information and Language Processing Research Lab (ILPRL) at Kathmandu University, as part of low-resource MT research.

It uses Hindi (hi_IN) as a proxy for Tamang in tokenization, due to the absence of Tamang in standard tokenizers.


๐ŸŒ Supported Languages

Intended language    Tokenized as
Nepali               ne_NP
Tamang               hi_IN (proxy)

Supported directions:

  • Nepali → Tamang
  • Tamang → Nepali

โš™๏ธ Training Setup

🔧 Transformer Architecture

  • Encoder: 5 layers
  • Decoder: 5 layers
  • Embedding size: 512
  • Feed-forward hidden: 2048
  • Attention heads: 8
  • Dropout: 0.3
  • Weight decay: 1e-4
  • Learning rate: 5e-4
  • Scheduler: inverse square root
  • Optimizer: Adam
  • Epochs: 50
  • Effective batch size: 16 (a configuration sketch mirroring these settings follows below)
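
For orientation, the architectural settings above can be mirrored in a HuggingFace MarianConfig. This is a minimal sketch, not the released configuration file: the vocabulary size is a placeholder (the card does not state the joint vocabulary size), and training-time settings such as the optimizer, learning rate, scheduler, weight decay, and epochs do not belong in the config.

from transformers import MarianConfig, MarianMTModel

# Placeholder: the actual joint vocabulary size of the released checkpoint
# is not stated in this card.
VOCAB_SIZE = 32000

config = MarianConfig(
    vocab_size=VOCAB_SIZE,
    d_model=512,                  # embedding size
    encoder_layers=5,
    decoder_layers=5,
    encoder_attention_heads=8,
    decoder_attention_heads=8,
    encoder_ffn_dim=2048,         # feed-forward hidden size
    decoder_ffn_dim=2048,
    dropout=0.3,
)

# Randomly initialised model with the same shape as the converted checkpoint;
# useful only for sanity-checking the parameter count, not for translation.
model = MarianMTModel(config)
print(f"{model.num_parameters() / 1e6:.2f}M parameters")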

📦 Dataset

  • NepTam20K (15K train / 5K test)
  • Devanagari script
  • Cleaned and normalized parallel corpus
  • Source: ilprl-docse/NepTam-A-Nepali-Tamang-Parallel-Corpus (loading sketch below)
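
The corpus is hosted on the HuggingFace Hub, so it can be pulled with the datasets library. The snippet below only downloads and inspects the repository; the split and column names are not documented in this card, so check the printed features before wiring it into a training script.

from datasets import load_dataset

# Download the parallel corpus from the Hub and inspect its structure.
# (Split and column names are not described in this card, so print them first.)
corpus = load_dataset("ilprl-docse/NepTam-A-Nepali-Tamang-Parallel-Corpus")
print(corpus)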

📊 Evaluation

Model                 Param Size   Direction   BLEU    chrF    chrF++   METEOR   COMET
NepTam_Transformer    49.07M       ne→tam      37.71   71.71   67.74    58.20    75.00
NepTam_Transformer    49.07M       tam→ne      38.01   69.89   66.30    57.60    77.37
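
The card does not state which toolkit produced these scores. As an illustration only, BLEU, chrF, and chrF++ can be computed with sacrebleu as sketched below (METEOR and COMET require their own packages and are omitted); the hypothesis and reference strings are placeholders.

import sacrebleu

# Placeholder data: system outputs and gold references, one sentence per entry.
hypotheses = ["..."]
references = [["..."]]   # sacrebleu expects a list of reference streams

bleu   = sacrebleu.corpus_bleu(hypotheses, references)
chrf   = sacrebleu.corpus_chrf(hypotheses, references)                 # chrF
chrfpp = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)   # chrF++

print(f"BLEU {bleu.score:.2f}  chrF {chrf.score:.2f}  chrF++ {chrfpp.score:.2f}")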

🚀 Usage (Inference)

This model uses the MarianMT HuggingFace API. The following code shows how to load and translate with it:

import torch
from transformers import MarianMTModel, MarianTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer
model_path = "ilprl-docse/NepTam20k-transformer-49.07M"
tokenizer = MarianTokenizer.from_pretrained(model_path)
model = MarianMTModel.from_pretrained(model_path).to(device)

# Language map
LANG_MAP = {
    "Nepali": "__ne__",
    "Tamang": "__taj__"
}

def translate_with_transformer(text, src="Nepali", tgt="Tamang"):
    # Only the target-language token is prefixed to the input (MarianMT-style);
    # the source text itself carries no language tag.
    tgt_lang = LANG_MAP[tgt]
    input_text = f"{tgt_lang} {text}"
    encoded = tokenizer(input_text, return_tensors="pt", padding=True).to(device)
    
    with torch.no_grad():
        generated_ids = model.generate(
            **encoded,
            max_length=512,
            num_beams=5,
            early_stopping=True
        )
    
    return tokenizer.decode(generated_ids[0], skip_special_tokens=True).strip()

# Example
text = "म आज धेरै खुसी छु।"  # "I am very happy today."
translation = translate_with_transformer(text, src="Nepali", tgt="Tamang")
print("Translation (Tamang):", translation)
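
The reverse direction only requires swapping the src/tgt arguments, for example as a quick round-trip check on the output above:

# Tamang → Nepali: same function, opposite direction.
back_translation = translate_with_transformer(translation, src="Tamang", tgt="Nepali")
print("Back-translation (Nepali):", back_translation)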