# mt5-small-indic-gec-tamil

A Grammatical Error Correction (GEC) model for Tamil, fine-tuned from mT5-small. Developed as part of the multilingual BHASHA 2025 Shared Task 1: IndicGEC.
- Developed by: Manav Dhamecha, Gaurav Damor, Sunil Choudhary, Pruthwik Mishra
- License: MIT
- Base model: google/mt5-small
- Paper: Team Horizon at BHASHA Task 1
- Repository: manavdhamecha77/IndicGEC2025
- GitHub.io: Multilingual IndicGEC
## What it does
Given a grammatically incorrect Tamil sentence, the model outputs a corrected version. It handles errors across spelling, morphology, tense, word order, punctuation, missing/extra words, and semantic issues. Tamil morphological and word-order errors are a primary focus.
GLEU score on the Tamil test set: 86.03 (ranked 5th at BHASHA 2025).
## Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load the fine-tuned model and tokenizer
model_name = "manavdhamecha77/GEC-mT5-Small-Tamil"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

sentences = [
    "நான் பள்ளிக்கு போகிறேன்",  # example Tamil sentence
]

# The model expects the "correct this: " task prefix
inputs = ["correct this: " + s for s in sentences]
encoded = tokenizer(
    inputs,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128,
).to(device)

outputs = model.generate(**encoded, max_length=128, num_beams=4)
corrected = tokenizer.batch_decode(outputs, skip_special_tokens=True)

for orig, corr in zip(sentences, corrected):
    print(f"Input: {orig}")
    print(f"Corrected: {corr}\n")
```
## Training Details
The model was fine-tuned using a sequence-to-sequence objective on parallel noisy–clean sentence pairs. Tamil had the smallest official dataset (~91 training pairs), which was expanded to ~10k pairs using a synthetic error injection pipeline covering 10 linguistic error categories.
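The paper's injection pipeline covers 10 linguistic error categories; as a toy illustration, here is a sketch of just two of them (missing-word and word-order errors). The function name, probabilities, and corruption logic are illustrative assumptions, not the authors' actual pipeline:

```python
# Hypothetical sketch of synthetic error injection: corrupt a clean sentence
# by randomly dropping words or swapping adjacent words. Real pipelines also
# cover spelling, morphology, tense, punctuation, and other categories.
import random

def inject_errors(sentence, p_drop=0.3, p_swap=0.3, seed=1):
    rng = random.Random(seed)
    words = sentence.split()
    out, i = [], 0
    while i < len(words):
        r = rng.random()
        if r < p_drop and len(words) > 1:
            i += 1                                # missing-word error
            continue
        if r < p_drop + p_swap and i + 1 < len(words):
            out.extend([words[i + 1], words[i]])  # word-order error
            i += 2
            continue
        out.append(words[i])
        i += 1
    return " ".join(out)

clean = "நான் பள்ளிக்கு போகிறேன்"
noisy = inject_errors(clean)  # a corrupted variant of the clean sentence
```

Pairing each corrupted sentence with its clean original yields the noisy–clean training pairs a seq2seq model learns from.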
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 5e-5 |
| Batch Size | 16–32 |
| Epochs | 10–15 |
| Max Sequence Length | 128 |
| Early Stopping | Based on GLEU (dev set) |
Input format: `correct this: <incorrect sentence>`
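The hyperparameters and the task prefix above can be collected into a small config; a minimal sketch (key names are illustrative assumptions, not the authors' exact training script):

```python
# Illustrative fine-tuning configuration mirroring the table above.
config = {
    "base_model": "google/mt5-small",
    "optimizer": "AdamW",
    "learning_rate": 5e-5,
    "batch_size": 16,        # 16-32 in the paper
    "num_epochs": 10,        # 10-15, with early stopping on dev-set GLEU
    "max_seq_length": 128,
    "task_prefix": "correct this: ",
}

def build_input(sentence, cfg=config):
    """Prepend the task prefix the model was trained with."""
    return cfg["task_prefix"] + sentence
```

Every training and inference example goes through the same `build_input` step, so the model always sees the prefix it was fine-tuned with.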
## Evaluation
| Language | Model | GLEU |
|---|---|---|
| Tamil | mT5-small | 86.03 |
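For intuition about the metric, here is a simplified sentence-level GLEU in the Wu et al. (2016) style: the minimum of n-gram precision and recall over n = 1..4. This is illustrative only; the shared task's official scorer may differ in details such as tokenization or corpus-level aggregation.

```python
# Simplified sentence-level GLEU: min(precision, recall) over 1..4-grams.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def gleu(reference, hypothesis, max_n=4):
    ref_toks, hyp_toks = reference.split(), hypothesis.split()
    matches = total_ref = total_hyp = 0
    for n in range(1, max_n + 1):
        ref_ng, hyp_ng = ngrams(ref_toks, n), ngrams(hyp_toks, n)
        matches += sum((ref_ng & hyp_ng).values())  # clipped n-gram overlap
        total_ref += sum(ref_ng.values())
        total_hyp += sum(hyp_ng.values())
    precision = matches / total_hyp if total_hyp else 0.0
    recall = matches / total_ref if total_ref else 0.0
    return min(precision, recall)

score = gleu("நான் பள்ளிக்கு போகிறேன்", "நான் பள்ளிக்கு போகிறேன்")  # exact match -> 1.0
```

Unlike BLEU, GLEU penalizes both missing and spurious n-grams symmetrically, which suits GEC, where a correction can err by under- or over-editing.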
## Limitations
- Performance may degrade on heavy code-mixing, informal slang, or dialectal text.
- Trained primarily on formal written Tamil; may not generalize to all domains.
- Evaluation uses automatic metrics (GLEU) only; human evaluation not conducted.
- The original annotated dataset for Tamil is very small (~91 pairs); most training data is synthetic.
## Citation

```bibtex
@inproceedings{dhamecha2025horizon,
  title     = {Team Horizon at {BHASHA} Task 1: Multilingual {IndicGEC} with Transformer-based Grammatical Error Correction Models},
  author    = {Dhamecha, Manav and Damor, Gaurav and Choudhary, Sunil and Mishra, Pruthwik},
  booktitle = {Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)},
  year      = {2025},
  url       = {https://aclanthology.org/2025.bhasha-1.14/}
}
```