# mt5-small-indic-gec-tamil

A Grammatical Error Correction (GEC) model for Tamil, fine-tuned from mT5-small. Developed as part of the multilingual BHASHA 2025 Shared Task 1: IndicGEC.
- Developed by: Manav Dhamecha, Gaurav Damor, Sunil Choudhary, Pruthwik Mishra
- License: MIT
- Base model: google/mt5-small
- Paper: Team Horizon at BHASHA Task 1
- Repository: manavdhamecha77/IndicGEC2025
- GitHub.io: Multilingual IndicGEC
## What it does
Given a grammatically incorrect Tamil sentence, the model outputs a corrected version. It handles errors across spelling, morphology, tense, word order, punctuation, missing/extra words, and semantic issues. Tamil morphological and word-order errors are a primary focus.
GLEU score on the Tamil test set: 86.03 (ranked 5th at BHASHA 2025).
## Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load the fine-tuned model and tokenizer
model_name = "manavdhamecha77/GEC-mT5-Small-Tamil"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

sentences = [
    "நான் பள்ளிக்கு போகிறேன்",  # example Tamil sentence
]

# The model expects the "correct this: " task prefix
inputs = ["correct this: " + s for s in sentences]
encoded = tokenizer(
    inputs,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128,
).to(device)

outputs = model.generate(**encoded, max_length=128, num_beams=4)
corrected = tokenizer.batch_decode(outputs, skip_special_tokens=True)

for orig, corr in zip(sentences, corrected):
    print(f"Input: {orig}")
    print(f"Corrected: {corr}\n")
```
## Training Details
The model was fine-tuned using a sequence-to-sequence objective on parallel noisy–clean sentence pairs. Tamil had the smallest official dataset (~91 training pairs), which was expanded to ~10k pairs using a synthetic error injection pipeline covering 10 linguistic error categories.
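The paper's injection pipeline covers 10 linguistic error categories; as a toy illustration, here is a sketch of just two of them (missing-word and word-order errors). The function name, probabilities, and corruption logic are illustrative assumptions, not the authors' actual pipeline:

```python
# Hypothetical sketch of synthetic error injection: corrupt a clean sentence
# by randomly dropping words or swapping adjacent words. Real pipelines also
# cover spelling, morphology, tense, punctuation, and other categories.
import random

def inject_errors(sentence, p_drop=0.3, p_swap=0.3, seed=1):
    rng = random.Random(seed)
    words = sentence.split()
    out, i = [], 0
    while i < len(words):
        r = rng.random()
        if r < p_drop and len(words) > 1:
            i += 1                                # missing-word error
            continue
        if r < p_drop + p_swap and i + 1 < len(words):
            out.extend([words[i + 1], words[i]])  # word-order error
            i += 2
            continue
        out.append(words[i])
        i += 1
    return " ".join(out)

clean = "நான் பள்ளிக்கு போகிறேன்"
noisy = inject_errors(clean)  # a corrupted variant of the clean sentence
```

Pairing each corrupted sentence with its clean original yields the noisy–clean training pairs a seq2seq model learns from.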
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 5e-5 |
| Batch Size | 16–32 |
| Epochs | 10–15 |
| Max Sequence Length | 128 |
| Early Stopping | Based on GLEU (dev set) |
Input format: `correct this: <incorrect sentence>`
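The hyperparameters and the task prefix above can be collected into a small config; a minimal sketch (key names are illustrative assumptions, not the authors' exact training script):

```python
# Illustrative fine-tuning configuration mirroring the table above.
config = {
    "base_model": "google/mt5-small",
    "optimizer": "AdamW",
    "learning_rate": 5e-5,
    "batch_size": 16,        # 16-32 in the paper
    "num_epochs": 10,        # 10-15, with early stopping on dev-set GLEU
    "max_seq_length": 128,
    "task_prefix": "correct this: ",
}

def build_input(sentence, cfg=config):
    """Prepend the task prefix the model was trained with."""
    return cfg["task_prefix"] + sentence
```

Every training and inference example goes through the same `build_input` step, so the model always sees the prefix it was fine-tuned with.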
## Evaluation
| Language | Model | GLEU |
|---|---|---|
| Tamil | mT5-small | 86.03 |
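For intuition about the metric, here is a simplified sentence-level GLEU in the Wu et al. (2016) style: the minimum of n-gram precision and recall over n = 1..4. This is illustrative only; the shared task's official scorer may differ in details such as tokenization or corpus-level aggregation.

```python
# Simplified sentence-level GLEU: min(precision, recall) over 1..4-grams.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def gleu(reference, hypothesis, max_n=4):
    ref_toks, hyp_toks = reference.split(), hypothesis.split()
    matches = total_ref = total_hyp = 0
    for n in range(1, max_n + 1):
        ref_ng, hyp_ng = ngrams(ref_toks, n), ngrams(hyp_toks, n)
        matches += sum((ref_ng & hyp_ng).values())  # clipped n-gram overlap
        total_ref += sum(ref_ng.values())
        total_hyp += sum(hyp_ng.values())
    precision = matches / total_hyp if total_hyp else 0.0
    recall = matches / total_ref if total_ref else 0.0
    return min(precision, recall)

score = gleu("நான் பள்ளிக்கு போகிறேன்", "நான் பள்ளிக்கு போகிறேன்")  # exact match -> 1.0
```

Unlike BLEU, GLEU penalizes both missing and spurious n-grams symmetrically, which suits GEC, where a correction can err by under- or over-editing.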
## Limitations
- Performance may degrade on heavy code-mixing, informal slang, or dialectal text.
- Trained primarily on formal written Tamil; may not generalize to all domains.
- Evaluation uses automatic metrics (GLEU) only; human evaluation not conducted.
- The original annotated dataset for Tamil is very small (~91 pairs); most training data is synthetic.
## Citation

```bibtex
@inproceedings{dhamecha2025horizon,
  title     = {Team Horizon at {BHASHA} Task 1: Multilingual {IndicGEC} with Transformer-based Grammatical Error Correction Models},
  author    = {Dhamecha, Manav and Damor, Gaurav and Choudhary, Sunil and Mishra, Pruthwik},
  booktitle = {Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)},
  year      = {2025},
  url       = {https://aclanthology.org/2025.bhasha-1.14/}
}
```