shalaka_indictrans2-en-indic-dist-200M_finetuned_eng_Latn_to_mar_Deva

This model is a fine-tuned version of ai4bharat/indictrans2-en-indic-dist-200M for English-to-Marathi machine translation. It is specifically designed to be robust to punctuation errors or omissions in the source English text.

The model was developed as part of the research presented in the paper Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation.

Model description

This model follows the Direct Fine-tuning methodology (referred to as Approach 2 in the paper). It was fine-tuned on an augmented version of the IITB-ENG-MAR corpus, where English source sentences were provided in both punctuated and unpunctuated forms (2x data size). This approach allows the model to implicitly learn the context required to resolve semantic ambiguities caused by missing punctuation—such as distinguishing vocative uses—directly during the translation process.
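The augmentation described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's actual code; the helper names are invented, and only ASCII punctuation is handled here:

```python
import string

def strip_punctuation(text: str) -> str:
    """Remove all ASCII punctuation and collapse any resulting double spaces."""
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())

def augment_pairs(pairs):
    """Given (english, marathi) sentence pairs, emit each pair twice: once
    with the original punctuated source and once with punctuation removed,
    doubling the corpus size while keeping the Marathi reference unchanged."""
    augmented = []
    for en, mr in pairs:
        augmented.append((en, mr))
        augmented.append((strip_punctuation(en), mr))
    return augmented
```

For a vocative example such as "Let's eat, Grandma!", the unpunctuated copy becomes "Lets eat Grandma", so the model sees both surface forms paired with the same reference translation.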

Intended uses & limitations

This model is intended for translating English text to Marathi, particularly in cases where the input may lack proper punctuation or contains punctuation-induced ambiguities.

The model was evaluated against the Virām benchmark, which focuses on 54 manually curated instances where punctuation significantly changes the structural or semantic meaning of the sentence.
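As a usage sketch: the base IndicTrans2 checkpoints are typically driven through IndicTransToolkit's IndicProcessor for pre- and post-processing, and this fine-tune should load the same way. The import path, processor API, and generation settings below are assumptions based on the base model's documentation, not instructions from this model's authors:

```python
SRC_LANG, TGT_LANG = "eng_Latn", "mar_Deva"
MODEL_ID = "thenlpresearcher/shalaka_indictrans2-en-indic-dist-200M_finetuned_eng_Latn_to_mar_Deva"

def translate(sentences):
    """Translate a list of English sentences to Marathi.

    Heavy dependencies (torch, transformers, IndicTransToolkit) are imported
    lazily so the helper can be defined without them installed.
    """
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    from IndicTransToolkit import IndicProcessor  # assumed import path

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID, trust_remote_code=True)
    ip = IndicProcessor(inference=True)

    # Prepend language tags and normalize, as the IndicTrans2 family expects.
    batch = ip.preprocess_batch(sentences, src_lang=SRC_LANG, tgt_lang=TGT_LANG)
    inputs = tokenizer(batch, truncation=True, padding="longest", return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(**inputs, max_length=256, num_beams=5)
    decoded = tokenizer.batch_decode(
        generated, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )
    return ip.postprocess_batch(decoded, lang=TGT_LANG)
```

Calling `translate(["Lets eat Grandma"])` would then exercise exactly the punctuation-robustness scenario this fine-tune targets.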

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 3
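For reference, these hyperparameters map onto a transformers Seq2SeqTrainingArguments configuration roughly as follows. This is a sketch: the output path is a placeholder, and predict_with_generate is an assumption (needed for the generation-based metrics reported below), not a setting confirmed by the authors:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="indictrans2-en-mar-punct",  # placeholder path
    learning_rate=5e-6,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    optim="adamw_torch",            # betas=(0.9, 0.999), epsilon=1e-08 are the defaults
    lr_scheduler_type="linear",
    num_train_epochs=3,
    predict_with_generate=True,     # assumption: enables BLEU/chrF++ eval during training
)
```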

Training results

| Training Loss | Epoch  | Step  | Validation Loss | Bleu   | Chrfpp  | Comet  | Bleurt | Gen Len |
|:-------------:|:------:|:-----:|:---------------:|:------:|:-------:|:------:|:------:|:-------:|
| 0.4656        | 0.2530 | 6000  | 0.4509          | 8.9549 | 31.0209 | 0.5263 | None   | 20.8016 |
| 0.4422        | 0.5059 | 12000 | 0.4280          | 9.2261 | 31.368  | 0.5289 | None   | 20.8337 |
| 0.4497        | 0.7589 | 18000 | 0.4152          | 9.5103 | 31.6753 | 0.5318 | None   | 20.8757 |
| 0.4041        | 1.0119 | 24000 | 0.4080          | 9.5567 | 31.8224 | 0.5328 | None   | 20.8751 |
| 0.3974        | 1.2649 | 30000 | 0.4023          | 9.6304 | 31.9291 | 0.5333 | None   | 20.8776 |
| 0.3874        | 1.5178 | 36000 | 0.3973          | 9.6925 | 31.9776 | 0.5336 | None   | 20.8777 |
| 0.3899        | 1.7708 | 42000 | 0.3937          | 9.7282 | 32.031  | 0.5337 | None   | 20.8775 |
| 0.3735        | 2.0238 | 48000 | 0.3917          | 9.7847 | 32.0965 | 0.5341 | None   | 20.8778 |
| 0.3825        | 2.2768 | 54000 | 0.3887          | 9.7889 | 32.1173 | 0.5342 | None   | 20.8768 |
| 0.3736        | 2.5297 | 60000 | 0.3878          | 9.8511 | 32.1897 | 0.5346 | None   | 20.8772 |
| 0.3684        | 2.7827 | 66000 | 0.3869          | 9.858  | 32.2074 | 0.5346 | None   | 20.877  |

Framework versions

  • Transformers 4.50.3
  • Pytorch 2.4.0a0+f70bd71a48.nv24.06
  • Datasets 3.6.0
  • Tokenizers 0.21.4

Citation

@article{shejole2025assessing,
  title={Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation},
  author={Shejole, Kaustubh and Sakhare, Shalaka and Deoghare, Sourabh},
  journal={arXiv preprint},
  year={2025}
}
