# Gemma3-Singlish-Sinhala-Merged
A fine-tuned Gemma 3 model for Romanized Sinhala (Singlish) → Sinhala script transliteration, developed for the Indo NLP Shared Task on Romanized Sinhala transliteration.
## What is this for?
Sri Lankans commonly type Sinhala phonetically in Roman script — e.g., kohomada instead of කොහොමද. This model converts that Romanized input back into proper Sinhala Unicode script — including the messy, inconsistent, real-world typing patterns that standard phonetic models struggle with.
## Handling Ad-hoc Input
Ad-hoc Romanized Sinhala (casual, inconsistent spellings like kohomda, kohomadha, kohmda) is notoriously hard. This model was built to handle it. Training proceeded in three phases:
- Base phonetic training — structured, rule-consistent Romanized Sinhala pairs
- Ad-hoc fine-tuning — noisy, user-generated spelling variations
- Merged fine-tuning — joint training on both distributions to prevent catastrophic forgetting
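The merged phase can be pictured as simple dataset mixing: both distributions appear in the same training stream, so neither is forgotten. A minimal sketch, using toy pairs and an illustrative 1:1 ratio (not the actual training data or configuration):

```python
import random

# Hypothetical toy pairs; the real datasets are far larger.
phonetic_pairs = [("kohomada", "කොහොමද"), ("mama giye", "මම ගියේ")]
adhoc_pairs = [("kohomda", "කොහොමද"), ("kohomadha", "කොහොමද")]

def build_merged_dataset(phonetic, adhoc, seed=42):
    # Concatenate both distributions and shuffle deterministically,
    # so every training batch sees phonetic and ad-hoc examples.
    merged = list(phonetic) + list(adhoc)
    random.Random(seed).shuffle(merged)
    return merged

merged = build_merged_dataset(phonetic_pairs, adhoc_pairs)
print(len(merged))  # → 4
```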
Alongside the three-phase curriculum, artificial data augmentation was applied to simulate real-world ad-hoc spelling patterns — random character substitutions, vowel dropping, and phoneme collisions common in casual Sri Lankan typing.
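The augmentation above can be sketched as a noising function over clean Romanized input. The substitution map and drop probability below are illustrative assumptions, not the exact augmentation pipeline used in training:

```python
import random

# Hypothetical digraph substitutions common in casual typing
# (e.g. "kohomadha" vs "kohomada"); assumption, not the real map.
SUBSTITUTIONS = {"th": "t", "dh": "d", "aa": "a"}
VOWELS = set("aeiou")

def augment(word: str, drop_prob: float = 0.3, seed: int = 0) -> str:
    rng = random.Random(seed)
    # Randomly apply digraph substitutions
    for src, dst in SUBSTITUTIONS.items():
        if rng.random() < 0.5:
            word = word.replace(src, dst)
    # Randomly drop non-initial vowels ("kohomada" -> "kohomda")
    chars = [word[0]] + [
        c for c in word[1:] if c not in VOWELS or rng.random() > drop_prob
    ]
    return "".join(chars)

print(augment("kohomada"))
```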
## Performance (Indo NLP Shared Task)
| Metric | Phonetic | Ad-hoc |
|---|---|---|
| CER | 0.0182 | 0.0416 |
| WER | 0.0931 | 0.1587 |
| Exact Acc | 0.37 | 0.205 |
| BLEU-4 Word | 0.7757 | 0.6666 |
| BLEU-4 Char | 0.9569 | 0.9225 |
| BERTScore F1 | 0.986 | 0.9706 |
*Phonetic* = structured phonetic romanization inputs. *Ad-hoc* = casual, inconsistent user typing (the harder, more realistic setting). Lower is better for CER and WER; higher is better for the remaining metrics.
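For reference, character error rate (CER) is edit distance between hypothesis and reference, normalized by reference length. A minimal sketch of that definition (the shared task's official scorer may differ in details such as normalization):

```python
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hyp: str, ref: str) -> float:
    # Edit distance normalized by reference length.
    return levenshtein(hyp, ref) / max(len(ref), 1)

print(cer("කොහොමද", "කොහොමද"))  # → 0.0
```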
## Quick Start
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "savinugunarathna/Gemma3-Singlish-Sinhala-Merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def transliterate(singlish_text: str) -> str:
    prompt = (
        "Transliterate the following Romanized Sinhala to Sinhala script:\n"
        f"{singlish_text}\nSinhala:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=128)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # The decoded text echoes the prompt; keep only the answer.
    return decoded.split("Sinhala:")[-1].strip()

# Works on clean phonetic input
print(transliterate("kohomada"))   # → කොහොමද

# Also handles messy ad-hoc input
print(transliterate("kohomda"))    # → කොහොමද
print(transliterate("mama giye"))  # → මම ගියේ
```
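The prompt template and answer extraction in `transliterate` can be exercised without loading the model. A small sketch using the same marker string as above, with a mock decoded output:

```python
MARKER = "Sinhala:"

def build_prompt(singlish_text: str) -> str:
    # Same template as the Quick Start transliterate() function.
    return (
        "Transliterate the following Romanized Sinhala to Sinhala script:\n"
        f"{singlish_text}\n{MARKER}"
    )

def extract_answer(decoded: str) -> str:
    # The generated text echoes the prompt, so keep only the text
    # after the final marker.
    return decoded.split(MARKER)[-1].strip()

# Round-trip check on a mock decoded string (no model needed)
prompt = build_prompt("kohomada")
print(extract_answer(prompt + " කොහොමද"))  # → කොහොමද
```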
## Cite This Model
If you use this model in your work, please cite:
```bibtex
@misc{gunarathna2025gemma3singlish,
  title={Gemma3-Singlish-Sinhala-Merged: A Three-Phase Fine-Tuned Model for Romanized Sinhala Transliteration},
  author={Gunarathna, Savinu},
  year={2025},
  howpublished={\url{https://huggingface.co/savinugunarathna/Gemma3-Singlish-Sinhala-Merged}},
  note={Indo NLP Shared Task submission}
}
```
## Acknowledgements & Related Work
This model builds on the following datasets and resources:
```bibtex
@article{sumanathilaka2025swa,
  title={Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources},
  author={Sumanathilaka, Deshan and Perera, Sameera and Dharmasiri, Sachithya and Athukorala, Maneesha and Herath, Anuja Dilrukshi and Dias, Rukshan and Gamage, Pasindu and Weerasinghe, Ruvan and Priyadarshana, YHPP},
  journal={arXiv preprint arXiv:2507.09245},
  year={2025}
}

@article{ranasinghe2022sold,
  title={SOLD: Sinhala Offensive Language Dataset},
  author={Ranasinghe, Tharindu and Anuradha, Isuri and Premasiri, Damith and Silva, Kanishka and Hettiarachchi, Hansi and Uyangodage, Lasitha and Zampieri, Marcos},
  journal={arXiv preprint arXiv:2212.00851},
  year={2022}
}

@inproceedings{Nsina2024,
  author={Hettiarachchi, Hansi and Premasiri, Damith and Uyangodage, Lasitha and Ranasinghe, Tharindu},
  title={{NSINA: A News Corpus for Sinhala}},
  booktitle={The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  year={2024},
  month={May}
}
```