
mt5-small-indic-gec-tamil

A Grammatical Error Correction (GEC) model for Tamil, fine-tuned from the multilingual mT5-small. Developed as part of the BHASHA 2025 Shared Task 1: IndicGEC.


What it does

Given a grammatically incorrect Tamil sentence, the model outputs a corrected version. It handles errors across spelling, morphology, tense, word order, punctuation, missing/extra words, and semantic issues. Tamil morphological and word-order errors are a primary focus.

GLEU score on the Tamil test set: 86.03 (ranked 5th at BHASHA 2025)


Quick Start

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_name = "manavdhamecha77/GEC-mT5-Small-Tamil"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

sentences = [
    "நான் பள்ளிக்கு போகிறேன்",   # example Tamil sentence
]

# prepend the task prefix the model was trained with
inputs = ["correct this: " + s for s in sentences]

encoded = tokenizer(
    inputs,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128
).to(device)

outputs = model.generate(**encoded, max_length=128, num_beams=4)
corrected = tokenizer.batch_decode(outputs, skip_special_tokens=True)

for orig, corr in zip(sentences, corrected):
    print(f"Input:     {orig}")
    print(f"Corrected: {corr}\n")

Training Details

The model was fine-tuned using a sequence-to-sequence objective on parallel noisy–clean sentence pairs. Tamil had the smallest official dataset (~91 training pairs), which was expanded to ~10k pairs using a synthetic error injection pipeline covering 10 linguistic error categories.
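The actual error-injection pipeline is not published with this card; as an illustration only, a rule-based noiser of this kind can be sketched as below. The three perturbations shown (word drop, adjacent swap, duplication) are hypothetical stand-ins, not the task's actual 10 error categories.

```python
import random

# Hypothetical noisers simulating a few GEC error types.
def drop_word(tokens, rng):
    # simulate a "missing word" error
    if len(tokens) > 1:
        i = rng.randrange(len(tokens))
        return tokens[:i] + tokens[i + 1:]
    return tokens

def swap_adjacent(tokens, rng):
    # simulate a word-order error
    if len(tokens) > 1:
        i = rng.randrange(len(tokens) - 1)
        tokens = tokens[:]
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens

def duplicate_word(tokens, rng):
    # simulate an "extra word" error
    i = rng.randrange(len(tokens))
    return tokens[:i + 1] + [tokens[i]] + tokens[i + 1:]

NOISERS = [drop_word, swap_adjacent, duplicate_word]

def make_pair(clean_sentence, seed=None):
    """Return one (noisy, clean) training pair from a clean sentence."""
    rng = random.Random(seed)
    tokens = clean_sentence.split()
    noisy = " ".join(rng.choice(NOISERS)(tokens, rng))
    return noisy, clean_sentence
```

In practice each clean sentence would be noised several times with different error types to expand ~91 gold pairs toward ~10k synthetic ones.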

Parameter             Value
---------             -----
Optimizer             AdamW
Learning rate         5e-5
Batch size            16–32
Epochs                10–15
Max sequence length   128
Early stopping        based on GLEU on the dev set

Input format: "correct this: <incorrect sentence>"


Evaluation

Language   Model       GLEU
--------   -----       ----
Tamil      mT5-small   86.03
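For a rough sanity check of reported scores, sentence-level GLEU can be sketched in its min(n-gram precision, n-gram recall) formulation; note the shared task's official scorer may use a different GLEU variant, so this is an approximation only.

```python
from collections import Counter

def ngrams(tokens, n):
    # multiset of n-grams of order n
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_gleu(reference, hypothesis, max_n=4):
    """Minimal GLEU: min(precision, recall) over n-gram orders 1..max_n."""
    ref, hyp = reference.split(), hypothesis.split()
    matches = hyp_total = ref_total = 0
    for n in range(1, max_n + 1):
        ref_ng, hyp_ng = ngrams(ref, n), ngrams(hyp, n)
        matches += sum((ref_ng & hyp_ng).values())  # clipped n-gram overlap
        hyp_total += sum(hyp_ng.values())
        ref_total += sum(ref_ng.values())
    if hyp_total == 0 or ref_total == 0:
        return 0.0
    return min(matches / hyp_total, matches / ref_total)
```

An exact match scores 1.0 and a fully disjoint output scores 0.0; the corpus score is typically computed over pooled counts rather than averaged per sentence.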

Limitations

  • Performance may degrade on heavy code-mixing, informal slang, or dialectal text.
  • Trained primarily on formal written Tamil; may not generalize to all domains.
  • Evaluation uses automatic metrics (GLEU) only; human evaluation not conducted.
  • Original annotated dataset for Tamil is very small (~91 pairs); most training data is synthetic.

Citation

@inproceedings{dhamecha2025horizon,
  title     = {Team Horizon at {BHASHA} Task 1: Multilingual {IndicGEC} with Transformer-based Grammatical Error Correction Models},
  author    = {Dhamecha, Manav and Damor, Gaurav and Choudhary, Sunil and Mishra, Pruthwik},
  booktitle = {Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)},
  year      = {2025},
  url       = {https://aclanthology.org/2025.bhasha-1.14/}
}
Model size: ~300M parameters (F32, Safetensors)
Base model: google/mt5-small