shalaka_indictrans2-en-indic-dist-200M_finetuned_eng_Latn_to_mar_Deva

This model is a fine-tuned version of ai4bharat/indictrans2-en-indic-dist-200M for English-to-Marathi machine translation. It is specifically designed to be robust to punctuation errors or omissions in the source English text.

The model was developed as part of the research presented in the paper Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation.

Model description

This model follows the Direct Fine-tuning methodology (referred to as Approach 2 in the paper). It was fine-tuned on an augmented version of the IITB-ENG-MAR corpus, where English source sentences were provided in both punctuated and unpunctuated forms (2x data size). This approach allows the model to implicitly learn the context required to resolve semantic ambiguities caused by missing punctuation—such as distinguishing vocative uses—directly during the translation process.
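The augmentation described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's actual code; the helper names are invented, and only ASCII punctuation is handled here:

```python
import string

def strip_punctuation(text: str) -> str:
    """Remove all ASCII punctuation and collapse any resulting double spaces."""
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())

def augment_pairs(pairs):
    """Given (english, marathi) sentence pairs, emit each pair twice: once
    with the original punctuated source and once with punctuation removed,
    doubling the corpus size while keeping the Marathi reference unchanged."""
    augmented = []
    for en, mr in pairs:
        augmented.append((en, mr))
        augmented.append((strip_punctuation(en), mr))
    return augmented
```

For a vocative example such as "Let's eat, Grandma!", the unpunctuated copy becomes "Lets eat Grandma", so the model sees both surface forms paired with the same reference translation.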

Intended uses & limitations

This model is intended for translating English text to Marathi, particularly in cases where the input may lack proper punctuation or contains punctuation-induced ambiguities.

The model was evaluated against the Virām benchmark, which focuses on 54 manually curated instances where punctuation significantly changes the structural or semantic meaning of the sentence.
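As a usage sketch: the base IndicTrans2 checkpoints are typically driven through IndicTransToolkit's IndicProcessor for pre- and post-processing, and this fine-tune should load the same way. The import path, processor API, and generation settings below are assumptions based on the base model's documentation, not instructions from this model's authors:

```python
SRC_LANG, TGT_LANG = "eng_Latn", "mar_Deva"
MODEL_ID = "thenlpresearcher/shalaka_indictrans2-en-indic-dist-200M_finetuned_eng_Latn_to_mar_Deva"

def translate(sentences):
    """Translate a list of English sentences to Marathi.

    Heavy dependencies (torch, transformers, IndicTransToolkit) are imported
    lazily so the helper can be defined without them installed.
    """
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    from IndicTransToolkit import IndicProcessor  # assumed import path

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID, trust_remote_code=True)
    ip = IndicProcessor(inference=True)

    # Prepend language tags and normalize, as the IndicTrans2 family expects.
    batch = ip.preprocess_batch(sentences, src_lang=SRC_LANG, tgt_lang=TGT_LANG)
    inputs = tokenizer(batch, truncation=True, padding="longest", return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(**inputs, max_length=256, num_beams=5)
    decoded = tokenizer.batch_decode(
        generated, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )
    return ip.postprocess_batch(decoded, lang=TGT_LANG)
```

Calling `translate(["Lets eat Grandma"])` would then exercise exactly the punctuation-robustness scenario this fine-tune targets.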

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 3
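For reference, these hyperparameters map onto a transformers Seq2SeqTrainingArguments configuration roughly as follows. This is a sketch: the output path is a placeholder, and predict_with_generate is an assumption (needed for the generation-based metrics reported below), not a setting confirmed by the authors:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="indictrans2-en-mar-punct",  # placeholder path
    learning_rate=5e-6,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    optim="adamw_torch",            # betas=(0.9, 0.999), epsilon=1e-08 are the defaults
    lr_scheduler_type="linear",
    num_train_epochs=3,
    predict_with_generate=True,     # assumption: enables BLEU/chrF++ eval during training
)
```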

Training results

| Training Loss | Epoch  | Step  | Validation Loss | Bleu   | Chrfpp  | Comet  | Bleurt | Gen Len |
|:-------------:|:------:|:-----:|:---------------:|:------:|:-------:|:------:|:------:|:-------:|
| 0.4656        | 0.2530 | 6000  | 0.4509          | 8.9549 | 31.0209 | 0.5263 | None   | 20.8016 |
| 0.4422        | 0.5059 | 12000 | 0.4280          | 9.2261 | 31.368  | 0.5289 | None   | 20.8337 |
| 0.4497        | 0.7589 | 18000 | 0.4152          | 9.5103 | 31.6753 | 0.5318 | None   | 20.8757 |
| 0.4041        | 1.0119 | 24000 | 0.4080          | 9.5567 | 31.8224 | 0.5328 | None   | 20.8751 |
| 0.3974        | 1.2649 | 30000 | 0.4023          | 9.6304 | 31.9291 | 0.5333 | None   | 20.8776 |
| 0.3874        | 1.5178 | 36000 | 0.3973          | 9.6925 | 31.9776 | 0.5336 | None   | 20.8777 |
| 0.3899        | 1.7708 | 42000 | 0.3937          | 9.7282 | 32.031  | 0.5337 | None   | 20.8775 |
| 0.3735        | 2.0238 | 48000 | 0.3917          | 9.7847 | 32.0965 | 0.5341 | None   | 20.8778 |
| 0.3825        | 2.2768 | 54000 | 0.3887          | 9.7889 | 32.1173 | 0.5342 | None   | 20.8768 |
| 0.3736        | 2.5297 | 60000 | 0.3878          | 9.8511 | 32.1897 | 0.5346 | None   | 20.8772 |
| 0.3684        | 2.7827 | 66000 | 0.3869          | 9.858  | 32.2074 | 0.5346 | None   | 20.877  |

Framework versions

  • Transformers 4.50.3
  • Pytorch 2.4.0a0+f70bd71a48.nv24.06
  • Datasets 3.6.0
  • Tokenizers 0.21.4

Citation

@article{shejole2025assessing,
  title={Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation},
  author={Shejole, Kaustubh and Sakhare, Shalaka and Deoghare, Sourabh},
  journal={arXiv preprint},
  year={2025}
}
