NepTam20K-Transformer

๐Ÿ“ Model Overview

NepTam20K-Transformer is a vanilla Transformer model fine-tuned for Nepali ↔ Tamang neural machine translation.

Originally implemented in Fairseq, the model has been converted to HuggingFace MarianMT format for easier deployment and inference.

The model was developed by the Information and Language Processing Research Lab (ILPRL) at Kathmandu University, as part of low-resource MT research.

It uses Hindi (hi_IN) as a proxy for Tamang in tokenization, due to the absence of Tamang in standard tokenizers.


๐ŸŒ Supported Languages

Intended language    Tokenized as
Nepali               ne_NP
Tamang               hi_IN (proxy)

Supported directions:

  • Nepali → Tamang
  • Tamang → Nepali

โš™๏ธ Training Setup

🔧 Transformer Architecture

  • Encoder: 5 layers
  • Decoder: 5 layers
  • Embedding size: 512
  • Feed-forward hidden: 2048
  • Attention heads: 8
  • Dropout: 0.3
  • Weight decay: 1e-4
  • Learning rate: 5e-4
  • Scheduler: inverse square root
  • Optimizer: Adam
  • Epochs: 50
  • Effective batch size: 16 (a configuration sketch mirroring these settings follows below)
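
For orientation, the architectural settings above can be mirrored in a HuggingFace MarianConfig. This is a minimal sketch, not the released configuration file: the vocabulary size is a placeholder (the card does not state the joint vocabulary size), and training-time settings such as the optimizer, learning rate, scheduler, weight decay, and epochs do not belong in the config.

from transformers import MarianConfig, MarianMTModel

# Placeholder: the actual joint vocabulary size of the released checkpoint
# is not stated in this card.
VOCAB_SIZE = 32000

config = MarianConfig(
    vocab_size=VOCAB_SIZE,
    d_model=512,                  # embedding size
    encoder_layers=5,
    decoder_layers=5,
    encoder_attention_heads=8,
    decoder_attention_heads=8,
    encoder_ffn_dim=2048,         # feed-forward hidden size
    decoder_ffn_dim=2048,
    dropout=0.3,
)

# Randomly initialised model with the same shape as the converted checkpoint;
# useful only for sanity-checking the parameter count, not for translation.
model = MarianMTModel(config)
print(f"{model.num_parameters() / 1e6:.2f}M parameters")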

📦 Dataset

  • NepTam20K (15K train / 5K test)
  • Devanagari script
  • Cleaned and normalized parallel corpus
  • Source: ilprl-docse/NepTam-A-Nepali-Tamang-Parallel-Corpus (loading sketch below)
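
The corpus is hosted on the HuggingFace Hub, so it can be pulled with the datasets library. The snippet below only downloads and inspects the repository; the split and column names are not documented in this card, so check the printed features before wiring it into a training script.

from datasets import load_dataset

# Download the parallel corpus from the Hub and inspect its structure.
# (Split and column names are not described in this card, so print them first.)
corpus = load_dataset("ilprl-docse/NepTam-A-Nepali-Tamang-Parallel-Corpus")
print(corpus)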

📊 Evaluation

Model                 Param Size   Direction   BLEU    chrF    chrF++   METEOR   COMET
NepTam_Transformer    49.07M       ne→tam      37.71   71.71   67.74    58.20    75.00
NepTam_Transformer    49.07M       tam→ne      38.01   69.89   66.30    57.60    77.37
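
The card does not state which toolkit produced these scores. As an illustration only, BLEU, chrF, and chrF++ can be computed with sacrebleu as sketched below (METEOR and COMET require their own packages and are omitted); the hypothesis and reference strings are placeholders.

import sacrebleu

# Placeholder data: system outputs and gold references, one sentence per entry.
hypotheses = ["..."]
references = [["..."]]   # sacrebleu expects a list of reference streams

bleu   = sacrebleu.corpus_bleu(hypotheses, references)
chrf   = sacrebleu.corpus_chrf(hypotheses, references)                 # chrF
chrfpp = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)   # chrF++

print(f"BLEU {bleu.score:.2f}  chrF {chrf.score:.2f}  chrF++ {chrfpp.score:.2f}")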

🚀 Usage (Inference)

This model uses the MarianMT HuggingFace API. The following code shows how to load and translate with it:

import torch
from transformers import MarianMTModel, MarianTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer
model_path = "ilprl-docse/NepTam20k-transformer-49.07M"
tokenizer = MarianTokenizer.from_pretrained(model_path)
model = MarianMTModel.from_pretrained(model_path).to(device)

# Language map
LANG_MAP = {
    "Nepali": "__ne__",
    "Tamang": "__taj__"
}

def translate_with_transformer(text, src="Nepali", tgt="Tamang"):
    # Only the target-language token is prefixed to the input (MarianMT-style);
    # the source text itself carries no language tag.
    tgt_lang = LANG_MAP[tgt]
    input_text = f"{tgt_lang} {text}"
    encoded = tokenizer(input_text, return_tensors="pt", padding=True).to(device)
    
    with torch.no_grad():
        generated_ids = model.generate(
            **encoded,
            max_length=512,
            num_beams=5,
            early_stopping=True
        )
    
    return tokenizer.decode(generated_ids[0], skip_special_tokens=True).strip()

# Example
text = "म आज धेरै खुसी छु।"  # "I am very happy today."
translation = translate_with_transformer(text, src="Nepali", tgt="Tamang")
print("Translation (Tamang):", translation)
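
The reverse direction only requires swapping the src/tgt arguments, for example as a quick round-trip check on the output above:

# Tamang → Nepali: same function, opposite direction.
back_translation = translate_with_transformer(translation, src="Tamang", tgt="Nepali")
print("Back-translation (Nepali):", back_translation)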