IndicTrans2: English to Nepali & Maithili (Domain-Adapted Finetune)

Model Summary

This is a parameter-efficient fine-tuned (PEFT / QLoRA) version of the ai4bharat/indictrans2-en-indic-dist-200M translation model. It has been fine-tuned to improve translation accuracy and the handling of domain-specific terminology between English (eng_Latn) and two low-resource Indic languages:

  • Nepali (npi_Deva)
  • Maithili (mai_Deva)

The fine-tuning specifically targeted four high-impact socioeconomic domains: Agriculture, Banking, Health, and Micro-Insurance.

Training Overview & Setup

  • Architecture Base: IndicTrans2 (Encoder-Decoder)
  • Base Model Size: ~218M parameters
  • Finetuning Technique: LoRA via QLoRA (4-bit NormalFloat / NF4 quantization)
    • Trainable Parameters: 6,488,064 (~2.97% of total weights)
  • Precision: Mixed precision (fp16), trained on a single RTX 4090 GPU.
  • Training Duration: ~8.8 Hours (528.5 minutes) / ~5 Epochs
  • Frameworks: PyTorch, Transformers, PEFT, Accelerate, W&B.
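
As a quick sanity check, the trainable-parameter fraction quoted above follows from simple arithmetic (the ~218M total is the approximate figure from this card; the exact percentage depends on the precise total parameter count):

```python
# Reproduce the trainable-parameter percentage quoted above.
trainable = 6_488_064     # LoRA adapter parameters
total = 218_000_000       # approximate total model size (~218M)

pct = 100 * trainable / total
print(f"Trainable: {pct:.2f}% of all weights")  # close to the ~2.97% quoted above
```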

Training Data

The model was fine-tuned on custom parallel corpora (English mapped to Nepali and Maithili). The dataset contains highly specialized terminologies across the following domains:

  1. Agriculture
  2. Banking
  3. Health
  4. Micro-Insurance

Note: Data was cleaned, validated, normalized, and pre-tokenized prior to training to ensure high-quality mappings.
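The cleaning pipeline itself is not published; as an illustration, a minimal normalization pass for English/Devanagari parallel data (Unicode NFC normalization plus whitespace cleanup; `clean_pair` is a hypothetical helper, not the actual training code) might look like:

```python
import unicodedata

def clean_pair(src: str, tgt: str):
    """Normalize an English/Devanagari sentence pair; drop pairs with an empty side."""
    # NFC-normalize (important for Devanagari combining marks) and collapse whitespace.
    src = " ".join(unicodedata.normalize("NFC", src).split())
    tgt = " ".join(unicodedata.normalize("NFC", tgt).split())
    if not src or not tgt:
        return None
    return src, tgt

print(clean_pair("Crop  insurance \u00a0 schemes", "बाली बीमा   योजना"))
```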

Training Hyperparameters & Execution

  • Batching & Gradient Accumulation: Tuned for the RTX 4090 (24 GB VRAM) using the PyTorch 2.x expandable_segments allocator feature.
  • Learning Rate Strategy: Peaked at ~9.9e-5, then decayed smoothly to ~5e-6 by the final epoch.
  • Loss: Training loss decreased smoothly from an initial 7.87 to 4.74.
  • Evaluation Splits: Evaluated every ~500 steps against a held-out validation split.
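
The exact scheduler is not specified; a linear warmup-then-decay schedule matching the peak (~9.9e-5) and final (~5e-6) values quoted above can be sketched as follows (`lr_at` and the warmup length are illustrative, not taken from the training code):

```python
def lr_at(step: int, total_steps: int,
          peak: float = 9.9e-5, final: float = 5e-6,
          warmup: int = 100) -> float:
    """Linear warmup to `peak`, then linear decay to `final`."""
    if step < warmup:
        return peak * step / warmup
    frac = (step - warmup) / max(1, total_steps - warmup)
    return peak + (final - peak) * frac

print(lr_at(0, 10_000))       # 0.0 at the very first step
print(lr_at(100, 10_000))     # peak learning rate
print(lr_at(10_000, 10_000))  # final learning rate
```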

Intermediate Validation Metrics

During training, the model reached the following mid-training snapshots on the validation split:

  • Maximum internal validation BLEU: 0.49 (epoch ~2.4)
  • Maximum internal validation chrF++: 19.54 (epoch ~2.4)

Formal final test-set metrics will be available once the held-out evaluation phase is complete.
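chrF++ is based on F-scores over character n-grams (plus word n-grams), which makes it more robust than BLEU for morphologically rich Devanagari-script languages. A minimal character n-gram F1 in the same spirit (illustrative only; use sacrebleu for the real metric) can be sketched as:

```python
from collections import Counter

def char_ngram_f1(hyp: str, ref: str, n: int = 3) -> float:
    """F1 over character n-grams — the core idea behind chrF."""
    def ngrams(s: str) -> Counter:
        s = s.replace(" ", "")
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))
    h, r = ngrams(hyp), ngrams(ref)
    if not h or not r:
        return 0.0
    overlap = sum((h & r).values())  # clipped n-gram matches
    prec = overlap / sum(h.values())
    rec = overlap / sum(r.values())
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

print(char_ngram_f1("बाली बीमा योजना", "बाली बीमा योजना"))  # 1.0 for identical strings
```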

How to Use this Model

This model requires the base model to be loaded alongside the adapter weights using the peft library. You must pass trust_remote_code=True because IndicTrans2 utilizes custom tokenization logic.

Installation

Ensure you have the required libraries installed:

pip install torch transformers peft bitsandbytes accelerate sentencepiece

Inference Code

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_model_name = "ai4bharat/indictrans2-en-indic-dist-200M"
adapter_repo = "Firoj112/indictrans2-en-npi-mai-finetuned"

# 1. Load the customized Tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    adapter_repo, 
    trust_remote_code=True
)

# 2. Configure 4-bit Quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# 3. Load the Base Model
base_model = AutoModelForSeq2SeqLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# 4. Apply the Finetuned Adapters
model = PeftModel.from_pretrained(base_model, adapter_repo)

# 5. Translate!
model.eval()

source_text = "Micro-insurance provides protection to individuals with low income against specific perils."
# IndicTrans2 expects the source and target language tags prepended to the input sentence.
tagged_text = f"eng_Latn npi_Deva {source_text}"
inputs = tokenizer(tagged_text, return_tensors="pt", padding=True, truncation=True, max_length=256).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=256, num_beams=5)

translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(f"Translation: {translation}")
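
IndicTrans2's preprocessing convention is to prepend the source and target language tags (e.g. eng_Latn npi_Deva) to each input sentence, so targeting Maithili only means swapping in mai_Deva. A small helper sketching this (illustrative; the official pipeline uses IndicProcessor from IndicTransToolkit for full preprocessing):

```python
def tag_input(text: str, src_lang: str, tgt_lang: str) -> str:
    """Prepend the IndicTrans2 source/target language tags to a sentence."""
    return f"{src_lang} {tgt_lang} {text}"

# Same sentence, two target languages:
print(tag_input("The bank approved the crop loan.", "eng_Latn", "npi_Deva"))
print(tag_input("The bank approved the crop loan.", "eng_Latn", "mai_Deva"))
```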

Potential Risks and Limitations

  • The model is specialized for Agriculture, Health, Banking, and Micro-insurance. Translations outside these targeted domains may revert to the performance level of the base IndicTrans2 model.
  • Because it is a 200M distilled version initialized with QLoRA, highly complex or idiomatic conversational phrasing might still exhibit minor grammatical imperfections compared to the larger 1B parameter IndicTrans2 base.
  • Biases present in the original base model or the finetuning parallel dataset regarding region-specific socio-economic phrasing may surface in predictions.

Environmental Impact

  • Hardware Type: 1 x NVIDIA GeForce RTX 4090
  • Hours used: ~8.8 hours
  • Maximum VRAM Utilized: ~9 GB (Allocated/Reserved dynamically via expandable segments)
  • Compute Provider: Bare Metal / Local Server

Acknowledgements

  • The AI4Bharat team for the foundational IndicTrans2 models.
  • Datasets developed for expanding domain connectivity for English ↔ Nepali and English ↔ Maithili cross-communication.