Indic Translation Model: 55 Indian Languages

A fine-tuned translation model supporting 55 Indian languages including 33 low-resource languages. Translates bidirectionally between English and any supported Indic language.

Key Features

  • 55 Indian languages: covers both major and low-resource regional languages
  • Bidirectional: English to Indic and Indic to English
  • 1.3B parameter encoder-decoder model, ready to use out of the box
  • Fast inference: supports CTranslate2 int8 quantization for production deployments
  • Extended vocabulary: 256,243 tokens with 33 newly added language codes
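
To sanity-check the extended vocabulary after download, a minimal sketch (the expected size comes from the feature list above; the printed token id will depend on the released tokenizer):

from transformers import AutoTokenizer

# Verify the extended vocabulary and a newly added language code.
tokenizer = AutoTokenizer.from_pretrained("Anonym-050326/nirukti-translate-1.3b")
print(len(tokenizer))                               # expected: 256243
print(tokenizer.convert_tokens_to_ids("ahr_Deva"))  # a newly added code (Ahirani)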

Supported Languages

Language        Code       Script
Ahirani         ahr_Deva   Devanagari
Assamese        asm_Beng   Bengali
Awadhi          awa_Deva   Devanagari
Bagheli         bfy_Deva   Devanagari
Bagri           bgq_Deva   Devanagari
Banjari         brj_Deva   Devanagari
Bengali         ben_Beng   Bengali
Bhili           bhb_Deva   Devanagari
Bhojpuri        bho_Deva   Devanagari
Bodo            brx_Deva   Devanagari
Braj Bhasha     bra_Deva   Devanagari
Bundeli         bns_Deva   Devanagari
Chhattisgarhi   hne_Deva   Devanagari
Dakhini         dcc_Deva   Devanagari
Dogri           doi_Deva   Devanagari
Garhwali        gbm_Deva   Devanagari
Garo            grt_Latn   Latin
Gondi           gon_Deva   Devanagari
Gujarati        guj_Gujr   Gujarati
Haryanvi        bgc_Deva   Devanagari
Hindi           hin_Deva   Devanagari
Ho              hoc_Deva   Devanagari
Kangri          xnr_Deva   Devanagari
Kannada         kan_Knda   Kannada
Kashmiri        kas_Arab   Arabic
Kashmiri        kas_Deva   Devanagari
Khasi           kha_Latn   Latin
Khortha         kho_Deva   Devanagari
Kodava          kfa_Knda   Kannada
Konkani         kok_Deva   Devanagari
Kumaoni         kfy_Deva   Devanagari
Kurukh          kru_Deva   Devanagari
Kutchi          kfr_Deva   Devanagari
Magahi          mag_Deva   Devanagari
Maithili        mai_Deva   Devanagari
Malayalam       mal_Mlym   Malayalam
Manipuri        mni_Beng   Bengali
Manipuri        mni_Mtei   Meitei Mayek
Marathi         mar_Deva   Devanagari
Marwari         mwr_Deva   Devanagari
Mewari          mtr_Deva   Devanagari
Mizo            lus_Latn   Latin
Nepali          npi_Deva   Devanagari
Odia            ory_Orya   Odia
Pahari          phr_Deva   Devanagari
Punjabi         pan_Guru   Gurmukhi
Rajasthani      raj_Deva   Devanagari
Sambalpuri      spv_Orya   Odia
Sanskrit        san_Deva   Devanagari
Santali         sat_Olck   Ol Chiki
Sindhi          snd_Arab   Arabic
Sora            srb_Latn   Latin
Tamil           tam_Taml   Tamil
Telugu          tel_Telu   Telugu
Tulu            tcy_Knda   Kannada
Urdu            urd_Arab   Arabic
Wagdi           wbr_Deva   Devanagari

English uses the code eng_Latn, as in the examples below. Kashmiri and Manipuri are each supported in two scripts.

How to Use

Quick Start (Transformers)

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model
model = AutoModelForSeq2SeqLM.from_pretrained("Anonym-050326/nirukti-translate-1.3b")
tokenizer = AutoTokenizer.from_pretrained("Anonym-050326/nirukti-translate-1.3b")

# Translate English to Hindi
tokenizer.src_lang = "eng_Latn"
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
translated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("hin_Deva"),
    max_new_tokens=128,
)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
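
For many sentences at once, batching with padding is usually faster; a minimal sketch reusing the model and tokenizer loaded above:

# Batched English-to-Hindi translation with padding
sentences = ["Good morning.", "The weather is nice today."]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)
translated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("hin_Deva"),
    max_new_tokens=128,
)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))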

CTranslate2 (Fastest, Recommended for Production)

import ctranslate2
from transformers import AutoTokenizer

translator = ctranslate2.Translator("ct2-int8", device="cuda", compute_type="int8_float16")
tokenizer = AutoTokenizer.from_pretrained("Anonym-050326/nirukti-translate-1.3b")
tokenizer.src_lang = "eng_Latn"

text = "Hello, how are you?"
encoded = tokenizer(text, return_tensors=None, max_length=256, truncation=True)
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])

result = translator.translate_batch([tokens], target_prefix=[["hin_Deva"]], beam_size=5)
output_tokens = result[0].hypotheses[0][1:]  # skip language token
output_ids = tokenizer.convert_tokens_to_ids(output_tokens)
print(tokenizer.decode(output_ids, skip_special_tokens=True))
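
Note that "ct2-int8" above is a local directory produced by the CTranslate2 conversion step shown under Running Evaluation below.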

Translating Between Any Language Pair

Set tokenizer.src_lang to the source language code and use the target code as forced_bos_token_id (Transformers) or target_prefix (CTranslate2).

# Tamil to English
tokenizer.src_lang = "tam_Taml"
# forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn")

# English to Marwari
tokenizer.src_lang = "eng_Latn"
# forced_bos_token_id=tokenizer.convert_tokens_to_ids("mwr_Deva")
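
For convenience, both settings can be wrapped in one helper that reuses the model and tokenizer from the Quick Start; a minimal sketch (the translate function is illustrative, not part of the repository):

def translate(text, src_lang, tgt_lang, max_new_tokens=128):
    # Set the source language, then force the target language token at decode time.
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt")
    output = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(translate("Hello, how are you?", "eng_Latn", "mwr_Deva"))  # English to Marwari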

Testing Scripts

This repository includes evaluation and testing scripts in the testing_scripts/ directory:

Script                                 Description
testing_scripts/run_evaluation.py      Multi-GPU evaluation runner (BLEU + chrF++)
testing_scripts/download_datasets.py   Downloads standard Indic translation benchmarks
testing_scripts/excel_to_csv.py        Converts Excel test datasets to the evaluation format

Running Evaluation

# 1. Convert to CTranslate2 int8 for fast inference
ct2-transformers-converter --model Anonym-050326/nirukti-translate-1.3b --output_dir ct2-int8 --quantization int8

# 2. Download benchmark datasets
python testing_scripts/download_datasets.py --output-dir eval_data

# 3. Run multi-GPU evaluation
python testing_scripts/run_evaluation.py \
    --model ct2-int8 \
    --tokenizer Anonym-050326/nirukti-translate-1.3b \
    --manifest eval_data/manifest.json \
    --output-dir results \
    --num-gpus 4 --batch-size 64 --beam-size 5
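
BLEU and chrF++ can also be computed directly with sacrebleu; a minimal sketch of the scoring step (the example strings are placeholders, and whether the repository script uses sacrebleu internally is an assumption):

import sacrebleu

hypotheses = ["this is a test"]    # system outputs
references = ["this is the test"]  # gold translations

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2)  # word_order=2 gives chrF++
print(f"BLEU: {bleu.score:.2f}  chrF++: {chrf.score:.2f}")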

Training Details

Architecture

  • Base: 1.3B parameter encoder-decoder translation model
  • Fine-tuning: LoRA (rank 32, alpha 64) on attention + FFN layers, merged into the base weights (see the configuration sketch after this list)
  • Trainable parameters: ~0.75% of total during training
  • 2-stage training: embedding divergence followed by LoRA fine-tuning
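
The LoRA setup corresponds to a PEFT configuration along these lines; a minimal sketch, where the target module names assume an NLLB-style encoder-decoder and the dropout value is not stated in the card:

from peft import LoraConfig

# Rank-32, alpha-64 adapters on attention and feed-forward projections,
# matching the recipe above. Module names and dropout are assumptions.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
    lora_dropout=0.05,  # assumed; not stated in the card
    task_type="SEQ_2_SEQ_LM",
)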

Hyperparameters

Parameter              Value
Learning rate          3e-5 (cosine schedule)
Warmup steps           1,000
Effective batch size   256
Epochs                 5
Max sequence length    128 tokens
Precision              BF16
Label smoothing        0.1
Language balancing     Temperature sampling (T=5.0)
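
These settings map naturally onto Hugging Face training arguments; a minimal sketch, assuming an HF Seq2SeqTrainer setup (the per-device/accumulation split is illustrative):

from transformers import Seq2SeqTrainingArguments

# Rough mapping of the table above; per-device batch size and accumulation
# are assumptions that multiply out to the effective 256 on the 2x A100
# setup described under Compute.
args = Seq2SeqTrainingArguments(
    output_dir="nirukti-ft",        # illustrative output path
    learning_rate=3e-5,
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=2,  # 64 * 2 GPUs * 2 steps = 256 effective
    num_train_epochs=5,
    bf16=True,
    label_smoothing_factor=0.1,
)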

Training Data

Trained on a curated combination of parallel corpora covering all 55 languages, with hash-based deduplication and quality filtering. Low-resource language pairs are upsampled using temperature-based balancing to ensure adequate representation.
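
Temperature-based balancing conventionally raises each language's data share to the power 1/T and renormalizes, so T=5.0 strongly flattens the distribution toward low-resource languages. A minimal sketch of that standard formulation (assumed here, since the card does not spell out the exact formula):

import numpy as np

def temperature_sampling_weights(example_counts, T=5.0):
    # Standard temperature sampling: p_l proportional to (n_l / N) ** (1/T).
    # T=1 keeps the raw data distribution; larger T upsamples low-resource languages.
    p = np.asarray(example_counts, dtype=float)
    p /= p.sum()
    w = p ** (1.0 / T)
    return w / w.sum()

# A 100:1 data imbalance shrinks to roughly 2.5:1 at T=5.0
print(temperature_sampling_weights([1_000_000, 10_000]))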

Compute

  • Training: 2x NVIDIA A100-SXM4-80GB, PyTorch DDP
  • Inference: CTranslate2 int8 quantization

Limitations

  • Low-resource languages (the 33 newly added) have less training data and lower translation quality than high-resource languages
  • Performance varies by script: languages with multiple script variants may show different quality levels
  • Training data is primarily from news, government, and religious domains; informal or conversational text may translate less accurately
  • Very short sentences (1-5 words) are harder for the model due to limited context