Gemma3 Singlish → Sinhala Transliteration Model
Overview
This model performs Singlish (Romanized Sinhala) to Sinhala script transliteration.
It is fine-tuned from the base model `savinugunarathna/Gemma3-Singlish-Sinhala-CodeMix`.
The model was trained using a two-stage curriculum training strategy to improve transliteration accuracy, especially for rare Sinhala conjunct characters and difficult code-mixed inputs.
Training Strategy
Stage 1 — General Language Learning
Datasets used:
- Phonetic dataset (1M rows)
- Adhoc transliteration dataset
- Code-mixed Sinhala-English dataset
Goal:
Learn general phonetic transliteration patterns from Singlish text and improve handling of normal transliteration and code-mixed inputs.
Stage 2 — Rare Conjunct Booster
Datasets used:
- Adjunct dataset (rare Sinhala clusters)
- Replay samples from phonetic dataset
- Replay samples from adhoc dataset
- Replay samples from code-mixed dataset
Goal:
Improve transliteration of difficult Sinhala conjuncts such as:
- ඥ
- ක්ෂ
- ශ්‍ර
- ස්ථ
- මඤ්ඤ
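The stage-2 mixture described above can be sketched as a simple replay-sampling step. This is a minimal illustration only; the replay fraction, dataset contents, and helper name are assumptions, not the values used in training:

```python
import random

def build_stage2_mix(adjunct, phonetic, adhoc, code_mixed,
                     replay_frac=0.1, seed=0):
    """Combine the rare-conjunct (adjunct) set with small replayed
    slices of the stage-1 datasets to limit catastrophic forgetting.
    replay_frac=0.1 is an assumed value for illustration."""
    rng = random.Random(seed)
    mix = list(adjunct)  # the full booster set is always kept
    for ds in (phonetic, adhoc, code_mixed):
        k = max(1, int(len(ds) * replay_frac))
        mix.extend(rng.sample(ds, k))  # replay a small slice of each
    rng.shuffle(mix)
    return mix

# Toy datasets standing in for the real corpora
stage2 = build_stage2_mix(
    adjunct=[("gnathin", "ඥාතින්")] * 5,
    phonetic=[("mama", "මම")] * 50,
    adhoc=[("oni", "ඕනි")] * 20,
    code_mixed=[("api meeting ekata", "අපි meeting එකට")] * 30,
)
```

With these toy sizes the mixture holds all 5 adjunct pairs plus 5 + 2 + 3 replayed pairs.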
Example Transliteration
| Singlish | Sinhala |
|---|---|
| gnathin | ඥාතින් |
| jnana | ඥානය |
| mannyokka | මඤ්ඤොක්කා |
| kshana | ක්ෂණ |
| shraddha | ශ්‍රද්ධා |
Datasets
The model was trained using the following datasets:
1. Phonetic Dataset (~500k rows)
Romanized Sinhala to Sinhala script.
2. Adhoc Dataset
Common Singlish words and phrases.
3. Code-Mixed Dataset
Sinhala and English mixed sentences.
4. Adjunct Dataset
Synthetic and extracted dataset focusing on rare Sinhala conjunct characters.
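Each dataset row is a Romanized-Sinhala → Sinhala pair. One plausible way to serialize such pairs into training strings, mirroring the inference prompt shown later in this card (the exact training template is not documented, so this helper is an assumption):

```python
def format_example(singlish: str, sinhala: str) -> str:
    """Serialize one transliteration pair into a prompt/target string.
    The 'Transliterate to Sinhala:' prefix mirrors the inference prompt
    used elsewhere in this card; the template itself is an assumption."""
    return f"Transliterate to Sinhala: {singlish}\n{sinhala}"

print(format_example("gnathin", "ඥාතින්"))
```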
Training Details
| Parameter | Value |
|---|---|
| Base Model | Gemma3-Singlish-Sinhala-CodeMix |
| Fine-tuning Method | LoRA |
| LoRA Rank | 32 |
| Batch Size | 2 |
| Gradient Accumulation | 8 |
| Learning Rate | 1.5e-4 |
| Scheduler | Cosine |
| Precision | BF16 |
| Max Length | 256 |
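The table above maps directly onto a PEFT `LoraConfig` plus `TrainingArguments`. A hedged configuration sketch: `lora_alpha`, `target_modules`, and `output_dir` are assumptions, as the card does not state them.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Values from the table above; lora_alpha and target_modules are assumptions.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="gemma3-singlish-stage2",   # assumed path
    per_device_train_batch_size=2,         # effective batch = 2 * 8 = 16
    gradient_accumulation_steps=8,
    learning_rate=1.5e-4,
    lr_scheduler_type="cosine",
    bf16=True,
)
```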
Intended Use
This model is intended for:
- Singlish to Sinhala transliteration
- Sinhala-English code-mixed input normalization
- Rare conjunct transliteration improvement
- NLP research on Romanized Sinhala text
Limitations
This model may still struggle with:
- Extremely noisy or highly ambiguous spellings
- Very long paragraphs
- Rare unseen slang forms
- Inputs where meaning depends on broader discourse context
Usage Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Pudamya/gemma3-singlish-sinhala-2stage-adjunct-final"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Prompt the model with a Singlish word to transliterate
prompt = "Transliterate to Sinhala: gnathin"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Example Predictions
| Input | Output |
|---|---|
| gnathin | ඥාතින් |
| mannyokka | මඤ්ඤොක්කා |
| api gnathin ganna oni | අපි ඥාතින් ගන්න ඕනි |
| pls explain kshetra | කරුණාකර ක්ෂේත්ර විස්තර කරන්න |
Metrics
| Metric | Value |
|---|---|
| BLEU | 59.71 |
| CER | 0.1296 |
| WER | 0.2195 |
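CER and WER are edit-distance ratios computed over characters and whitespace-separated words, respectively. A minimal stdlib sketch of how such values can be computed (an illustration of the standard metric definitions, not the evaluation script used for this card):

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance over reference length."""
    return levenshtein(ref, hyp) / max(1, len(ref))

def wer(ref, hyp):
    """Word error rate: edit distance over word tokens."""
    return levenshtein(ref.split(), hyp.split()) / max(1, len(ref.split()))
```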
Author
Pudamya Vidusini Rathnayake
Model Tree
Base model: google/gemma-3-270m

Evaluation Results
Self-reported, on a mixed evaluation set:
| Metric | Value |
|---|---|
| BLEU | 59.710 |
| CER | 0.130 |
| WER | 0.220 |
| Exact Match | 0.320 |