shalaka_fd_indictrans2-en-indic-dist-200M_finetuned_eng_Latn_to_mar_Deva
This model is a fine-tuned version of ai4bharat/indictrans2-en-indic-dist-200M for English-to-Marathi machine translation. It was developed to address punctuation robustness, as presented in the paper: Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation.
Model Description
This model follows Approach 2 (Direct Fine-tuning) described in the research. It is the Combined Finetuned (1x Punct) variant, where the base model was fine-tuned on the IITB-ENG-MAR dataset using a strategy that alternates between keeping and removing punctuation in the source English text (a small sketch of this idea follows the list below). This helps the model implicitly learn context and resolve semantic ambiguities (e.g., distinguishing "Honey" as a name versus a substance) even when punctuation is missing.
- Source Language: English (Latin script)
- Target Language: Marathi (Devanagari script)
- Paper: Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation
- GitHub Repository: KaustubhShejole/Viram_Marathi
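The card does not include the exact augmentation script; the sketch below illustrates one plausible way to realize the alternating keep/strip-punctuation strategy. The function name, the alternation rule, and the placeholder data are assumptions for illustration only, not the authors' implementation.

```python
import string

# Translation table that removes all ASCII punctuation from a string.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def alternate_punctuation(pairs):
    """Illustrative 1x-punct augmentation: every other English source sentence
    keeps its punctuation, the rest have it stripped; Marathi targets stay unchanged.
    The exact recipe used to train this model may differ."""
    augmented = []
    for i, (src, tgt) in enumerate(pairs):
        if i % 2 == 0:
            augmented.append((src, tgt))                          # keep punctuation
        else:
            augmented.append((src.translate(PUNCT_TABLE), tgt))   # strip punctuation
    return augmented

# Hypothetical parallel data; targets are placeholders, not real references.
pairs = [
    ("Honey, where are you?", "<Marathi reference 1>"),
    ("He bought honey, bread, and milk.", "<Marathi reference 2>"),
]
print(alternate_punctuation(pairs))
```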
Intended Uses & Limitations
The model is intended for English-to-Marathi translation tasks, particularly those involving informal or unpunctuated source text (like social media or speech transcripts) where standard models might struggle with meaning-changing ambiguities.
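A minimal inference sketch follows, assuming the standard IndicTrans2 workflow: loading the checkpoint with trust_remote_code=True and pre/post-processing with IndicProcessor from the IndicTransToolkit package. The toolkit import path, generation settings, and example input are assumptions; consult the base model's documentation for the recommended pipeline.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor  # assumption: standard IndicTrans2 preprocessing toolkit

model_id = "thenlpresearcher/shalaka_fd_indictrans2-en-indic-dist-200M_finetuned_eng_Latn_to_mar_Deva"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, trust_remote_code=True)
ip = IndicProcessor(inference=True)

# Unpunctuated input -- the setting this model is designed to handle.
sentences = ["honey where are you"]

batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="mar_Deva")
inputs = tokenizer(batch, padding="longest", truncation=True, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(**inputs, max_length=256, num_beams=5, num_return_sequences=1)

decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)
translations = ip.postprocess_batch(decoded, lang="mar_Deva")
print(translations)
```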
Training and Evaluation Data
- Fine-tuning Data: A variant of the IITB-ENG-MAR dataset.
- Evaluation Benchmark: Virām (Punct-Eng-Mar/PEM), a diagnostic benchmark of 54 manually curated, punctuation-ambiguous instances designed to test MT robustness.
Training Results
The model achieves the following results on the evaluation set (a sketch of how such metrics are commonly computed follows the list):
- Loss: 0.4160
- BLEU: 9.5026
- chrF++: 31.6551
- COMET: 0.5336
- Generation length: 20.8752
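The evaluation code is not shown in the card; the sketch below illustrates how BLEU and chrF++ scores of this kind are commonly computed with sacrebleu. The placeholder data is hypothetical, and COMET would be scored separately with the Unbabel comet package.

```python
import sacrebleu

# Hypothetical outputs and references, aligned index by index.
hypotheses = ["<model translation 1>", "<model translation 2>"]
references = [["<reference 1>", "<reference 2>"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrfpp = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 -> chrF++

print(f"BLEU: {bleu.score:.4f}  chrF++: {chrfpp.score:.4f}")
# COMET is typically scored with the Unbabel `comet` package and also needs
# the English source sentences alongside hypotheses and references.
```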
Training Procedure
Training Hyperparameters
The following hyperparameters were used during training (a hedged configuration sketch follows the list):
- learning_rate: 5e-06
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: adamw_torch
- lr_scheduler_type: linear
- num_epochs: 3
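These settings map directly onto transformers Seq2SeqTrainingArguments, as sketched below. Everything not listed above (output directory, generation flag) is an assumption rather than part of the card.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="indictrans2-en-mar-punct",   # assumed name, not from the card
    learning_rate=5e-6,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    optim="adamw_torch",
    lr_scheduler_type="linear",
    num_train_epochs=3,
    predict_with_generate=True,              # assumption: needed to report BLEU/chrF++ during evaluation
)
```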
Framework Versions
- Transformers 4.50.3
- PyTorch 2.4.0
- Datasets 3.6.0
- Tokenizers 0.21.4