# shalaka_indictrans2-en-indic-dist-200M_finetuned_eng_Latn_to_mar_Deva
This model is a fine-tuned version of ai4bharat/indictrans2-en-indic-dist-200M for English-to-Marathi machine translation. It is specifically designed to be robust to punctuation errors or omissions in the source English text.
The model was developed as part of the research presented in the paper Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation.
## Model description
This model follows the Direct Fine-tuning methodology (referred to as Approach 2 in the paper). It was fine-tuned on an augmented version of the IITB-ENG-MAR corpus in which each English source sentence appears in both punctuated and unpunctuated forms, doubling the dataset size. This approach lets the model implicitly learn, during the translation process itself, the context required to resolve semantic ambiguities caused by missing punctuation, such as distinguishing vocative uses ("Let's eat, Grandma" vs. "Let's eat Grandma").
- Paper: Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation
- Repository: Viram_Marathi GitHub
- Diagnostic Benchmark: Virām (Punct-Eng-Mar)
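The punctuated/unpunctuated augmentation described above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact preprocessing: the `strip_punct` helper, the punctuation set, and the sample sentence pair are all assumptions.

```python
import string

# Punctuation characters removed from the English source; the exact set
# used in the paper's preprocessing is an assumption here.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punct(text: str) -> str:
    """Remove punctuation and collapse extra whitespace."""
    return " ".join(text.translate(_PUNCT_TABLE).split())

def augment_corpus(pairs):
    """Yield each (en, mr) pair twice: punctuated and unpunctuated source.

    The Marathi target is kept identical, so the fine-tuning corpus is
    doubled and the model sees both source variants.
    """
    for en, mr in pairs:
        yield en, mr
        yield strip_punct(en), mr

pairs = [("Let's eat, Grandma!", "चला जेवूया, आजी!")]  # illustrative pair
augmented = list(augment_corpus(pairs))
# augmented[1][0] == "Lets eat Grandma"
```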
## Intended uses & limitations
This model is intended for translating English text to Marathi, particularly in cases where the input may lack proper punctuation or contains punctuation-induced ambiguities.
The model was evaluated against the Virām benchmark, which focuses on 54 manually curated instances where punctuation significantly changes the structural or semantic meaning of the sentence.
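A minimal inference sketch, following the usage pattern recommended for the base IndicTrans2 models: preprocessing and postprocessing go through `IndicProcessor` from the `IndicTransToolkit` package, whose exact API may differ between versions. Running it requires downloading the model weights, so the heavy imports are deferred into the function and the generation settings (`num_beams`, `max_length`) are illustrative defaults, not values from the paper.

```python
SRC_LANG = "eng_Latn"   # FLORES-style language tags used by IndicTrans2
TGT_LANG = "mar_Deva"
MODEL_ID = "thenlpresearcher/shalaka_indictrans2-en-indic-dist-200M_finetuned_eng_Latn_to_mar_Deva"

def translate(sentences, model_id=MODEL_ID):
    """Translate a list of English sentences to Marathi."""
    # Deferred imports: torch, transformers, and IndicTransToolkit are only
    # needed when inference is actually run.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    from IndicTransToolkit import IndicProcessor  # API may vary by version

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id, trust_remote_code=True)
    ip = IndicProcessor(inference=True)

    batch = ip.preprocess_batch(sentences, src_lang=SRC_LANG, tgt_lang=TGT_LANG)
    inputs = tokenizer(batch, padding="longest", truncation=True, return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(**inputs, max_length=256, num_beams=5)
    decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)
    return ip.postprocess_batch(decoded, lang=TGT_LANG)

if __name__ == "__main__":
    # Unpunctuated input: the case this fine-tune is designed to handle.
    print(translate(["lets eat grandma"]))
```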
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-06
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: linear
- num_epochs: 3
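As an illustration of the `linear` scheduler with `learning_rate: 5e-06`, the rate decays from its peak at the start of training to zero at the final step. This sketch assumes no warmup phase (the Hugging Face Trainer default of `warmup_steps=0`); the training setup may have used one.

```python
PEAK_LR = 5e-06  # learning_rate from the hyperparameters above

def linear_lr(step: int, total_steps: int, peak: float = PEAK_LR) -> float:
    """Linearly decay the learning rate from `peak` at step 0 to 0 at the end.

    Assumes no warmup phase; clamps at 0 after training completes.
    """
    return peak * max(0.0, 1.0 - step / total_steps)

# Halfway through training the rate is half the peak value.
assert linear_lr(500, 1000) == 2.5e-06
```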
### Training results
| Training Loss | Epoch | Step | Validation Loss | BLEU | ChrF++ | COMET | BLEURT | Gen Len |
|---|---|---|---|---|---|---|---|---|
| 0.4656 | 0.2530 | 6000 | 0.4509 | 8.9549 | 31.0209 | 0.5263 | None | 20.8016 |
| 0.4422 | 0.5059 | 12000 | 0.4280 | 9.2261 | 31.368 | 0.5289 | None | 20.8337 |
| 0.4497 | 0.7589 | 18000 | 0.4152 | 9.5103 | 31.6753 | 0.5318 | None | 20.8757 |
| 0.4041 | 1.0119 | 24000 | 0.4080 | 9.5567 | 31.8224 | 0.5328 | None | 20.8751 |
| 0.3974 | 1.2649 | 30000 | 0.4023 | 9.6304 | 31.9291 | 0.5333 | None | 20.8776 |
| 0.3874 | 1.5178 | 36000 | 0.3973 | 9.6925 | 31.9776 | 0.5336 | None | 20.8777 |
| 0.3899 | 1.7708 | 42000 | 0.3937 | 9.7282 | 32.031 | 0.5337 | None | 20.8775 |
| 0.3735 | 2.0238 | 48000 | 0.3917 | 9.7847 | 32.0965 | 0.5341 | None | 20.8778 |
| 0.3825 | 2.2768 | 54000 | 0.3887 | 9.7889 | 32.1173 | 0.5342 | None | 20.8768 |
| 0.3736 | 2.5297 | 60000 | 0.3878 | 9.8511 | 32.1897 | 0.5346 | None | 20.8772 |
| 0.3684 | 2.7827 | 66000 | 0.3869 | 9.858 | 32.2074 | 0.5346 | None | 20.877 |
### Framework versions
- Transformers 4.50.3
- Pytorch 2.4.0a0+f70bd71a48.nv24.06
- Datasets 3.6.0
- Tokenizers 0.21.4
## Citation
```bibtex
@article{shejole2025assessing,
  title={Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation},
  author={Shejole, Kaustubh and Sakhare, Shalaka and Deoghare, Sourabh},
  journal={arXiv preprint},
  year={2025}
}
```
## Evaluation results

- BLEU on Virām (Punct-Eng-Mar): 9.858 (self-reported)