shalaka_fd_indictrans2-en-indic-dist-200M_finetuned_eng_Latn_to_mar_Deva
This model is a fine-tuned version of ai4bharat/indictrans2-en-indic-dist-200M for English-to-Marathi machine translation. It was developed to address punctuation robustness, as presented in the paper: Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation.
Model Description
This model follows Approach 2 (Direct Fine-tuning) described in the research. It is the Combined Finetuned (1x Punct) variant, where the base model was fine-tuned on the IITB-ENG-MAR dataset using a strategy that alternates between keeping and removing punctuation in the source English text (a small sketch of this idea follows the list below). This helps the model implicitly learn context and resolve semantic ambiguities (e.g., distinguishing "Honey" as a name versus a substance) even when punctuation is missing.
- Source Language: English (Latin script)
- Target Language: Marathi (Devanagari script)
- Paper: Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation
- GitHub Repository: KaustubhShejole/Viram_Marathi
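The card does not include the exact augmentation script; the sketch below illustrates one plausible way to realize the alternating keep/strip-punctuation strategy. The function name, the alternation rule, and the placeholder data are assumptions for illustration only, not the authors' implementation.

```python
import string

# Translation table that removes all ASCII punctuation from a string.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def alternate_punctuation(pairs):
    """Illustrative 1x-punct augmentation: every other English source sentence
    keeps its punctuation, the rest have it stripped; Marathi targets stay unchanged.
    The exact recipe used to train this model may differ."""
    augmented = []
    for i, (src, tgt) in enumerate(pairs):
        if i % 2 == 0:
            augmented.append((src, tgt))                          # keep punctuation
        else:
            augmented.append((src.translate(PUNCT_TABLE), tgt))   # strip punctuation
    return augmented

# Hypothetical parallel data; targets are placeholders, not real references.
pairs = [
    ("Honey, where are you?", "<Marathi reference 1>"),
    ("He bought honey, bread, and milk.", "<Marathi reference 2>"),
]
print(alternate_punctuation(pairs))
```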
Intended Uses & Limitations
The model is intended for English-to-Marathi translation tasks, particularly those involving informal or unpunctuated source text (like social media or speech transcripts) where standard models might struggle with meaning-changing ambiguities.
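A minimal inference sketch follows, assuming the standard IndicTrans2 workflow: loading the checkpoint with trust_remote_code=True and pre/post-processing with IndicProcessor from the IndicTransToolkit package. The toolkit import path, generation settings, and example input are assumptions; consult the base model's documentation for the recommended pipeline.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor  # assumption: standard IndicTrans2 preprocessing toolkit

model_id = "thenlpresearcher/shalaka_fd_indictrans2-en-indic-dist-200M_finetuned_eng_Latn_to_mar_Deva"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, trust_remote_code=True)
ip = IndicProcessor(inference=True)

# Unpunctuated input -- the setting this model is designed to handle.
sentences = ["honey where are you"]

batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="mar_Deva")
inputs = tokenizer(batch, padding="longest", truncation=True, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(**inputs, max_length=256, num_beams=5, num_return_sequences=1)

decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)
translations = ip.postprocess_batch(decoded, lang="mar_Deva")
print(translations)
```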
Training and Evaluation Data
- Fine-tuning Data: A variant of the IITB-ENG-MAR dataset.
- Evaluation Benchmark: Virām (Punct-Eng-Mar/PEM), a diagnostic benchmark of 54 manually curated, punctuation-ambiguous instances designed to test MT robustness.
Training Results
The model achieves the following results on the evaluation set (a sketch of how such metrics are commonly computed follows the list):
- Loss: 0.4160
- BLEU: 9.5026
- chrF++: 31.6551
- COMET: 0.5336
- Generation length: 20.8752
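The evaluation code is not shown in the card; the sketch below illustrates how BLEU and chrF++ scores of this kind are commonly computed with sacrebleu. The placeholder data is hypothetical, and COMET would be scored separately with the Unbabel comet package.

```python
import sacrebleu

# Hypothetical outputs and references, aligned index by index.
hypotheses = ["<model translation 1>", "<model translation 2>"]
references = [["<reference 1>", "<reference 2>"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrfpp = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 -> chrF++

print(f"BLEU: {bleu.score:.4f}  chrF++: {chrfpp.score:.4f}")
# COMET is typically scored with the Unbabel `comet` package and also needs
# the English source sentences alongside hypotheses and references.
```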
Training Procedure
Training Hyperparameters
The following hyperparameters were used during training (a hedged configuration sketch follows the list):
- learning_rate: 5e-06
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: adamw_torch
- lr_scheduler_type: linear
- num_epochs: 3
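These settings map directly onto transformers Seq2SeqTrainingArguments, as sketched below. Everything not listed above (output directory, generation flag) is an assumption rather than part of the card.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="indictrans2-en-mar-punct",   # assumed name, not from the card
    learning_rate=5e-6,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    optim="adamw_torch",
    lr_scheduler_type="linear",
    num_train_epochs=3,
    predict_with_generate=True,              # assumption: needed to report BLEU/chrF++ during evaluation
)
```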
Framework Versions
- Transformers 4.50.3
- PyTorch 2.4.0
- Datasets 3.6.0
- Tokenizers 0.21.4