# Gemma3-Singlish-Sinhala-Merged
A fine-tuned Gemma 3 model for Romanized Sinhala (Singlish) → Sinhala script transliteration, developed for the Indo NLP Shared Task on Romanized Sinhala transliteration.
## What is this for?
Sri Lankans commonly type Sinhala phonetically in Roman script — e.g., kohomada instead of කොහොමද. This model converts that Romanized input back into proper Sinhala Unicode script — including the messy, inconsistent, real-world typing patterns that standard phonetic models struggle with.
## Handling Ad-hoc Input
Ad-hoc Romanized Sinhala (casual, inconsistent spellings like kohomda, kohomadha, kohmda) is notoriously hard. This model was built to handle it. Training proceeded in three phases:
- Base phonetic training — structured, rule-consistent Romanized Sinhala pairs
- Ad-hoc fine-tuning — noisy, user-generated spelling variations
- Merged fine-tuning — joint training on both distributions to prevent catastrophic forgetting
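The merged phase can be pictured as simple dataset mixing: both distributions appear in the same training stream, so neither is forgotten. A minimal sketch, using toy pairs and an illustrative 1:1 ratio (not the actual training data or configuration):

```python
import random

# Hypothetical toy pairs; the real datasets are far larger.
phonetic_pairs = [("kohomada", "කොහොමද"), ("mama giye", "මම ගියේ")]
adhoc_pairs = [("kohomda", "කොහොමද"), ("kohomadha", "කොහොමද")]

def build_merged_dataset(phonetic, adhoc, seed=42):
    # Concatenate both distributions and shuffle deterministically,
    # so every training batch sees phonetic and ad-hoc examples.
    merged = list(phonetic) + list(adhoc)
    random.Random(seed).shuffle(merged)
    return merged

merged = build_merged_dataset(phonetic_pairs, adhoc_pairs)
print(len(merged))  # → 4
```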
Alongside the three-phase curriculum, artificial data augmentation was applied to simulate real-world ad-hoc spelling patterns — random character substitutions, vowel dropping, and phoneme collisions common in casual Sri Lankan typing.
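The augmentation above can be sketched as a noising function over clean Romanized input. The substitution map and drop probability below are illustrative assumptions, not the exact augmentation pipeline used in training:

```python
import random

# Hypothetical digraph substitutions common in casual typing
# (e.g. "kohomadha" vs "kohomada"); assumption, not the real map.
SUBSTITUTIONS = {"th": "t", "dh": "d", "aa": "a"}
VOWELS = set("aeiou")

def augment(word: str, drop_prob: float = 0.3, seed: int = 0) -> str:
    rng = random.Random(seed)
    # Randomly apply digraph substitutions
    for src, dst in SUBSTITUTIONS.items():
        if rng.random() < 0.5:
            word = word.replace(src, dst)
    # Randomly drop non-initial vowels ("kohomada" -> "kohomda")
    chars = [word[0]] + [
        c for c in word[1:] if c not in VOWELS or rng.random() > drop_prob
    ]
    return "".join(chars)

print(augment("kohomada"))
```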
## Performance (Indo NLP Shared Task)
| Metric | Phonetic | Ad-hoc |
|---|---|---|
| CER | 0.0182 | 0.0416 |
| WER | 0.0931 | 0.1587 |
| Exact Acc | 0.37 | 0.205 |
| BLEU-4 Word | 0.7757 | 0.6666 |
| BLEU-4 Char | 0.9569 | 0.9225 |
| BERTScore F1 | 0.986 | 0.9706 |
*Phonetic* = structured phonetic romanization inputs. *Ad-hoc* = casual, inconsistent user typing (the harder, more realistic setting). Lower is better for CER and WER; higher is better for the remaining metrics.
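For reference, character error rate (CER) is edit distance between hypothesis and reference, normalized by reference length. A minimal sketch of that definition (the shared task's official scorer may differ in details such as normalization):

```python
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hyp: str, ref: str) -> float:
    # Edit distance normalized by reference length.
    return levenshtein(hyp, ref) / max(len(ref), 1)

print(cer("කොහොමද", "කොහොමද"))  # → 0.0
```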
## Quick Start
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "savinugunarathna/Gemma3-Singlish-Sinhala-Merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def transliterate(singlish_text: str) -> str:
    prompt = (
        "Transliterate the following Romanized Sinhala to Sinhala script:\n"
        f"{singlish_text}\nSinhala:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=128)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # The decoded text echoes the prompt; keep only the answer.
    return decoded.split("Sinhala:")[-1].strip()

# Works on clean phonetic input
print(transliterate("kohomada"))   # → කොහොමද

# Also handles messy ad-hoc input
print(transliterate("kohomda"))    # → කොහොමද
print(transliterate("mama giye"))  # → මම ගියේ
```
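The prompt template and answer extraction in `transliterate` can be exercised without loading the model. A small sketch using the same marker string as above, with a mock decoded output:

```python
MARKER = "Sinhala:"

def build_prompt(singlish_text: str) -> str:
    # Same template as the Quick Start transliterate() function.
    return (
        "Transliterate the following Romanized Sinhala to Sinhala script:\n"
        f"{singlish_text}\n{MARKER}"
    )

def extract_answer(decoded: str) -> str:
    # The generated text echoes the prompt, so keep only the text
    # after the final marker.
    return decoded.split(MARKER)[-1].strip()

# Round-trip check on a mock decoded string (no model needed)
prompt = build_prompt("kohomada")
print(extract_answer(prompt + " කොහොමද"))  # → කොහොමද
```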
## Cite This Model
If you use this model in your work, please cite:
```bibtex
@misc{gunarathna2025gemma3singlish,
  title={Gemma3-Singlish-Sinhala-Merged: A Three-Phase Fine-Tuned Model for Romanized Sinhala Transliteration},
  author={Gunarathna, Savinu},
  year={2025},
  howpublished={\url{https://huggingface.co/savinugunarathna/Gemma3-Singlish-Sinhala-Merged}},
  note={Indo NLP Shared Task submission}
}
```
## Acknowledgements & Related Work
This model builds on the following datasets and resources:
```bibtex
@article{sumanathilaka2025swa,
  title={Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources},
  author={Sumanathilaka, Deshan and Perera, Sameera and Dharmasiri, Sachithya and Athukorala, Maneesha and Herath, Anuja Dilrukshi and Dias, Rukshan and Gamage, Pasindu and Weerasinghe, Ruvan and Priyadarshana, YHPP},
  journal={arXiv preprint arXiv:2507.09245},
  year={2025}
}

@article{ranasinghe2022sold,
  title={SOLD: Sinhala Offensive Language Dataset},
  author={Ranasinghe, Tharindu and Anuradha, Isuri and Premasiri, Damith and Silva, Kanishka and Hettiarachchi, Hansi and Uyangodage, Lasitha and Zampieri, Marcos},
  journal={arXiv preprint arXiv:2212.00851},
  year={2022}
}

@inproceedings{Nsina2024,
  author={Hettiarachchi, Hansi and Premasiri, Damith and Uyangodage, Lasitha and Ranasinghe, Tharindu},
  title={{NSINA: A News Corpus for Sinhala}},
  booktitle={The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  year={2024},
  month={May}
}
```