# IndicTrans2: English to Nepali & Maithili (Domain-Adapted Finetune)

## Model Summary
This is a parameter-efficient fine-tuned (PEFT / QLoRA) version of the `ai4bharat/indictrans2-en-indic-dist-200M` translation model. It has been fine-tuned to improve translation accuracy and domain-specific terminology translation between English (`eng_Latn`) and two low-resource Indic languages:

- Nepali (`npi_Deva`)
- Maithili (`mai_Deva`)
The fine-tuning specifically targeted four high-impact socioeconomic domains: Agriculture, Banking, Health, and Micro-Insurance.
## Training Overview & Setup
- Architecture Base: IndicTrans2 (Encoder-Decoder)
- Base Model Size: ~218M parameters
- Finetuning Technique: LoRA via QLoRA (4-bit normal float quantization)
- Trainable Parameters: 6,488,064 (~2.97% of total weights)
- Precision: Mixed precision (`fp16`) training on a single RTX 4090 GPU
- Training Duration: ~8.8 hours (528.5 minutes) / ~5 epochs
- Frameworks: PyTorch, Transformers, PEFT, Accelerate, W&B.
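As a quick sanity check, the trainable-parameter fraction quoted above can be reproduced from the two counts in this list (the ~218M total is approximate, so the result lands at roughly 2.97–2.98%):

```python
# Sanity check of the trainable-parameter fraction quoted above.
total_params = 218_000_000      # approximate size of the distilled base model
trainable_params = 6_488_064    # LoRA adapter weights

fraction = 100 * trainable_params / total_params
print(f"Trainable: {fraction:.2f}% of total weights")
```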
## Training Data
The model was fine-tuned on custom parallel corpora (English mapped to Nepali and Maithili). The dataset contains highly specialized terminologies across the following domains:
- Agriculture
- Banking
- Health
- Micro-Insurance
Note: Data was cleaned, validated, normalized, and pre-tokenized prior to training to ensure high-quality mappings.
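The actual cleaning pipeline is not published; a minimal sketch of the kind of pair-level filtering and normalization described above might look like this (the length-ratio threshold and filters are illustrative assumptions, not the values used for this model):

```python
import unicodedata

def clean_pair(src: str, tgt: str, max_ratio: float = 3.0):
    """Return a cleaned (src, tgt) pair, or None if the pair is rejected.

    Illustrative filters only; the actual pipeline and thresholds
    used for this model are not published.
    """
    # NFC-normalize so Devanagari combining marks are in canonical form.
    src = unicodedata.normalize("NFC", src.strip())
    tgt = unicodedata.normalize("NFC", tgt.strip())
    if not src or not tgt:
        return None                      # drop pairs with an empty side
    ratio = len(src) / len(tgt)
    if ratio > max_ratio or ratio < 1 / max_ratio:
        return None                      # drop badly misaligned pairs
    return src, tgt

pairs = [
    ("Crop insurance protects farmers.", "बाली बीमाले किसानलाई सुरक्षा दिन्छ।"),
    ("", "खाली"),                        # rejected: empty English side
]
cleaned = [p for p in (clean_pair(s, t) for s, t in pairs) if p]
print(len(cleaned))  # 1
```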
## Training Hyperparameters & Execution
- Batching & Gradient Accumulation: Tuned for optimal RTX 4090 (24 GB VRAM) utilization using the PyTorch 2.x `expandable_segments` allocator feature.
- Learning Rate Strategy: Smooth decay, peaking around ~9.9e-5 and dropping to ~5e-6 by the final epoch.
- Loss: Training loss converged smoothly from an initial 7.87 to a final 4.74.
- Evaluation Splits: Evaluated every ~500 steps against an isolated held-out test split.
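The learning-rate trajectory described above (warmup to ~9.9e-5, decay to ~5e-6) can be sketched as a simple schedule function. The exact scheduler family and warmup length used for this model are not published; linear warmup plus linear decay is one common shape consistent with those endpoint values:

```python
def lr_at(step: int, total_steps: int, warmup_steps: int = 500,
          peak_lr: float = 9.9e-5, floor_lr: float = 5e-6) -> float:
    """Linear warmup to peak_lr, then linear decay to floor_lr.

    The exact scheduler used for this model is not published; this is
    one common shape consistent with the peak/final values above.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr + (floor_lr - peak_lr) * progress

total = 10_000
print(f"{lr_at(500, total):.2e}")     # peak
print(f"{lr_at(10_000, total):.2e}")  # floor
```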
## Intermediate Validation Metrics
During training, the model showed the following snapshot performance on the validation split:

- Maximum internal validation BLEU: 0.49 (epoch ~2.4)
- Maximum internal validation chrF++: 19.54 (epoch ~2.4)

Formal final test-set metrics will be available upon completion of the held-out evaluation phase.
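For intuition on what chrF++ measures: it is built from character n-gram F-scores (plus word n-grams). Below is a simplified single-order character n-gram F-beta score illustrating the core idea; it is not the reference implementation (use sacreBLEU for comparable numbers, which averages character n-grams for n = 1..6 plus word 1-2 grams):

```python
from collections import Counter

def char_ngram_fscore(hyp: str, ref: str, n: int = 3, beta: float = 2.0) -> float:
    """Simplified character n-gram F-beta score (the core idea behind chrF).

    Illustration only; real chrF++ averages multiple n-gram orders and
    includes word n-grams.
    """
    hyp_grams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
    ref_grams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
    overlap = sum((hyp_grams & ref_grams).values())
    if not overlap:
        return 0.0
    precision = overlap / sum(hyp_grams.values())
    recall = overlap / sum(ref_grams.values())
    b2 = beta * beta
    return 100 * (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(char_ngram_fscore("micro insurance", "micro-insurance"), 2))
```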
## How to Use this Model

This model requires the base model to be loaded alongside the adapter weights using the `peft` library. You must pass `trust_remote_code=True` because IndicTrans2 uses custom tokenization logic.
### Installation

Ensure you have the required libraries installed:

```bash
pip install torch transformers peft bitsandbytes accelerate sentencepiece
```
### Inference Code

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_model_name = "ai4bharat/indictrans2-en-indic-dist-200M"
adapter_repo = "Firoj112/indictrans2-en-npi-mai-finetuned"

# 1. Load the customized tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    adapter_repo,
    trust_remote_code=True,
)

# 2. Configure 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# 3. Load the base model
base_model = AutoModelForSeq2SeqLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# 4. Apply the fine-tuned adapters
model = PeftModel.from_pretrained(base_model, adapter_repo)

# 5. Translate!
model.eval()
# IndicTrans2 expects the source and target language tags to be prepended
# to the input (here English -> Nepali; use mai_Deva for Maithili).
source_text = "eng_Latn npi_Deva Micro-insurance provides protection to individuals with low income against specific perils."
inputs = tokenizer(source_text, return_tensors="pt", padding=True,
                   truncation=True, max_length=256).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=256, num_beams=5)

translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(f"Translation: {translation}")
```

Note: the official IndicTrans2 examples preprocess input with `IndicProcessor` from AI4Bharat's IndicTransToolkit, which prepends these language tags and handles normalization/transliteration; manually prepending the tags as above is a minimal substitute.
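The snippet above translates a single sentence; for larger workloads you would feed the tokenizer fixed-size batches. A minimal chunking helper (the batch size of 16 is an arbitrary starting point, tune it to your VRAM):

```python
def batched(items, batch_size=16):
    """Yield successive fixed-size chunks for batched generation."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Usage with the model above (sketch):
# for batch in batched(tagged_sentences, 8):
#     inputs = tokenizer(batch, return_tensors="pt", padding=True,
#                        truncation=True, max_length=256).to(model.device)
#     outputs = model.generate(**inputs, max_length=256, num_beams=5)
#     translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
print(list(batched(list(range(10)), 4)))
```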
## Potential Risks and Limitations
- The model is specialized for Agriculture, Health, Banking, and Micro-insurance. Translations outside these targeted domains may revert to the performance level of the base IndicTrans2 model.
- Because the base is a 200M distilled model fine-tuned with QLoRA, highly complex or idiomatic conversational phrasing may still exhibit minor grammatical imperfections compared to the larger 1B-parameter IndicTrans2 models.
- Biases present in the original base model or the finetuning parallel dataset regarding region-specific socio-economic phrasing may surface in predictions.
## Environmental Impact
- Hardware Type: 1 x NVIDIA GeForce RTX 4090
- Hours used: ~8.8 hours
- Maximum VRAM Utilized: ~9 GB (Allocated/Reserved dynamically via expandable segments)
- Compute Provider: Bare Metal / Local Server
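The expandable-segments allocation mentioned above is a documented PyTorch 2.x CUDA allocator option, enabled via the `PYTORCH_CUDA_ALLOC_CONF` environment variable. It must be set before the first CUDA allocation, i.e. before importing `torch` or touching the GPU:

```python
import os

# Enable PyTorch's expandable-segments CUDA allocator (PyTorch >= 2.0).
# Set this before importing torch / making any CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```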
## Acknowledgements
- The AI4Bharat team for the foundational IndicTrans2 models.
- Datasets developed for expanding domain coverage for English → Nepali and English → Maithili cross-communication.