# IndicTrans2: English to Nepali & Maithili (Domain-Adapted Finetune)

## Model Summary
This is a parameter-efficient fine-tuned (PEFT / QLoRA) version of the `ai4bharat/indictrans2-en-indic-dist-200M` translation model. It has been fine-tuned to improve translation accuracy and domain-specific terminology translation between English (`eng_Latn`) and two low-resource Indic languages:

- Nepali (`npi_Deva`)
- Maithili (`mai_Deva`)
The fine-tuning specifically targeted four high-impact socioeconomic domains: Agriculture, Banking, Health, and Micro-Insurance.
## Training Overview & Setup
- Architecture Base: IndicTrans2 (Encoder-Decoder)
- Base Model Size: ~218M parameters
- Finetuning Technique: LoRA via QLoRA (4-bit normal float quantization)
- Trainable Parameters: 6,488,064 (~2.97% of total weights)
- Precision: Mixed precision (`fp16`) training on a single RTX 4090 GPU
- Training Duration: ~8.8 hours (528.5 minutes) / ~5 epochs
- Frameworks: PyTorch, Transformers, PEFT, Accelerate, W&B.
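As a quick sanity check, the trainable-parameter fraction quoted above can be reproduced from the two counts in this list (the ~218M total is approximate, so the result lands at roughly 2.97–2.98%):

```python
# Sanity check of the trainable-parameter fraction quoted above.
total_params = 218_000_000      # approximate size of the distilled base model
trainable_params = 6_488_064    # LoRA adapter weights

fraction = 100 * trainable_params / total_params
print(f"Trainable: {fraction:.2f}% of total weights")
```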
## Training Data
The model was fine-tuned on custom parallel corpora (English mapped to Nepali and Maithili). The dataset contains highly specialized terminologies across the following domains:
- Agriculture
- Banking
- Health
- Micro-Insurance
Note: Data was cleaned, validated, normalized, and pre-tokenized prior to training to ensure high-quality mappings.
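The actual cleaning pipeline is not published; a minimal sketch of the kind of pair-level filtering and normalization described above might look like this (the length-ratio threshold and filters are illustrative assumptions, not the values used for this model):

```python
import unicodedata

def clean_pair(src: str, tgt: str, max_ratio: float = 3.0):
    """Return a cleaned (src, tgt) pair, or None if the pair is rejected.

    Illustrative filters only; the actual pipeline and thresholds
    used for this model are not published.
    """
    # NFC-normalize so Devanagari combining marks are in canonical form.
    src = unicodedata.normalize("NFC", src.strip())
    tgt = unicodedata.normalize("NFC", tgt.strip())
    if not src or not tgt:
        return None                      # drop pairs with an empty side
    ratio = len(src) / len(tgt)
    if ratio > max_ratio or ratio < 1 / max_ratio:
        return None                      # drop badly misaligned pairs
    return src, tgt

pairs = [
    ("Crop insurance protects farmers.", "बाली बीमाले किसानलाई सुरक्षा दिन्छ।"),
    ("", "खाली"),                        # rejected: empty English side
]
cleaned = [p for p in (clean_pair(s, t) for s, t in pairs) if p]
print(len(cleaned))  # 1
```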
## Training Hyperparameters & Execution
- Batching & Gradient Accumulation: Tuned for optimal RTX 4090 (24 GB VRAM) utilization using the PyTorch 2.x `expandable_segments` allocator feature.
- Learning Rate Strategy: Smooth decay, peaking around ~9.9e-5 and dropping to ~5e-6 by the final epoch.
- Loss: Training loss converged smoothly from an initial 7.87 to a final 4.74.
- Evaluation Splits: Evaluated every ~500 steps against an isolated held-out test split.
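The learning-rate trajectory described above (warmup to ~9.9e-5, decay to ~5e-6) can be sketched as a simple schedule function. The exact scheduler family and warmup length used for this model are not published; linear warmup plus linear decay is one common shape consistent with those endpoint values:

```python
def lr_at(step: int, total_steps: int, warmup_steps: int = 500,
          peak_lr: float = 9.9e-5, floor_lr: float = 5e-6) -> float:
    """Linear warmup to peak_lr, then linear decay to floor_lr.

    The exact scheduler used for this model is not published; this is
    one common shape consistent with the peak/final values above.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr + (floor_lr - peak_lr) * progress

total = 10_000
print(f"{lr_at(500, total):.2e}")     # peak
print(f"{lr_at(10_000, total):.2e}")  # floor
```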
## Intermediate Validation Metrics
During training, the model showed the following snapshot performance on the validation split:

- Maximum internal validation BLEU: 0.49 (epoch ~2.4)
- Maximum internal validation chrF++: 19.54 (epoch ~2.4)

Formal final test-set metrics will be available upon completion of the held-out evaluation phase.
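For intuition on what chrF++ measures: it is built from character n-gram F-scores (plus word n-grams). Below is a simplified single-order character n-gram F-beta score illustrating the core idea; it is not the reference implementation (use sacreBLEU for comparable numbers, which averages character n-grams for n = 1..6 plus word 1-2 grams):

```python
from collections import Counter

def char_ngram_fscore(hyp: str, ref: str, n: int = 3, beta: float = 2.0) -> float:
    """Simplified character n-gram F-beta score (the core idea behind chrF).

    Illustration only; real chrF++ averages multiple n-gram orders and
    includes word n-grams.
    """
    hyp_grams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
    ref_grams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
    overlap = sum((hyp_grams & ref_grams).values())
    if not overlap:
        return 0.0
    precision = overlap / sum(hyp_grams.values())
    recall = overlap / sum(ref_grams.values())
    b2 = beta * beta
    return 100 * (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(char_ngram_fscore("micro insurance", "micro-insurance"), 2))
```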
## How to Use this Model

This model requires the base model to be loaded alongside the adapter weights using the `peft` library. You must pass `trust_remote_code=True` because IndicTrans2 uses custom tokenization logic.
### Installation

Ensure you have the required libraries installed:

```bash
pip install torch transformers peft bitsandbytes accelerate sentencepiece
```
### Inference Code

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_model_name = "ai4bharat/indictrans2-en-indic-dist-200M"
adapter_repo = "Firoj112/indictrans2-en-npi-mai-finetuned"

# 1. Load the customized tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    adapter_repo,
    trust_remote_code=True,
)

# 2. Configure 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# 3. Load the base model
base_model = AutoModelForSeq2SeqLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# 4. Apply the fine-tuned adapters
model = PeftModel.from_pretrained(base_model, adapter_repo)

# 5. Translate!
model.eval()
# IndicTrans2 expects the source and target language tags to be prepended
# to the input (here English -> Nepali; use mai_Deva for Maithili).
source_text = "eng_Latn npi_Deva Micro-insurance provides protection to individuals with low income against specific perils."
inputs = tokenizer(source_text, return_tensors="pt", padding=True,
                   truncation=True, max_length=256).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=256, num_beams=5)

translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(f"Translation: {translation}")
```

Note: the official IndicTrans2 examples preprocess input with `IndicProcessor` from AI4Bharat's IndicTransToolkit, which prepends these language tags and handles normalization/transliteration; manually prepending the tags as above is a minimal substitute.
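The snippet above translates a single sentence; for larger workloads you would feed the tokenizer fixed-size batches. A minimal chunking helper (the batch size of 16 is an arbitrary starting point, tune it to your VRAM):

```python
def batched(items, batch_size=16):
    """Yield successive fixed-size chunks for batched generation."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Usage with the model above (sketch):
# for batch in batched(tagged_sentences, 8):
#     inputs = tokenizer(batch, return_tensors="pt", padding=True,
#                        truncation=True, max_length=256).to(model.device)
#     outputs = model.generate(**inputs, max_length=256, num_beams=5)
#     translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
print(list(batched(list(range(10)), 4)))
```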
## Potential Risks and Limitations
- The model is specialized for Agriculture, Health, Banking, and Micro-insurance. Translations outside these targeted domains may revert to the performance level of the base IndicTrans2 model.
- Because the base is a 200M distilled model fine-tuned with QLoRA, highly complex or idiomatic conversational phrasing may still exhibit minor grammatical imperfections compared to the larger 1B-parameter IndicTrans2 models.
- Biases present in the original base model or the finetuning parallel dataset regarding region-specific socio-economic phrasing may surface in predictions.
## Environmental Impact
- Hardware Type: 1 x NVIDIA GeForce RTX 4090
- Hours used: ~8.8 hours
- Maximum VRAM Utilized: ~9 GB (Allocated/Reserved dynamically via expandable segments)
- Compute Provider: Bare Metal / Local Server
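The expandable-segments allocation mentioned above is a documented PyTorch 2.x CUDA allocator option, enabled via the `PYTORCH_CUDA_ALLOC_CONF` environment variable. It must be set before the first CUDA allocation, i.e. before importing `torch` or touching the GPU:

```python
import os

# Enable PyTorch's expandable-segments CUDA allocator (PyTorch >= 2.0).
# Set this before importing torch / making any CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```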
## Acknowledgements
- The AI4Bharat team for the foundational IndicTrans2 models.
- Datasets developed for expanding domain coverage for English → Nepali and English → Maithili cross-communication.