Gemma3 Singlish → Sinhala Transliteration Model

Overview

This model performs Singlish (Romanized Sinhala) to Sinhala script transliteration.

It is fine-tuned from the base model:

savinugunarathna/Gemma3-Singlish-Sinhala-CodeMix

The model was trained using a two-stage curriculum training strategy to improve transliteration accuracy, especially for rare Sinhala conjunct characters and difficult code-mixed inputs.


Training Strategy

Stage 1 — General Language Learning

Datasets used:

  • Phonetic dataset (1M rows)
  • Adhoc transliteration dataset
  • Code-mixed Sinhala-English dataset

Goal:

Learn general phonetic transliteration patterns from Singlish text and improve handling of normal transliteration and code-mixed inputs.
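
As a rough illustration, each (Singlish, Sinhala) pair can be rendered as an instruction-style prompt/completion example. The template below mirrors the prompt shown in the Usage Example section; the exact training format and field names are assumptions, not the card's documented format:

```python
# Sketch of turning transliteration pairs into instruction-style training
# examples. The prompt template matches the one in the Usage Example below;
# the dict field names ("prompt", "completion") are illustrative assumptions.

def make_example(singlish: str, sinhala: str) -> dict:
    """Render one (Singlish, Sinhala) pair as a prompt/completion example."""
    return {
        "prompt": f"Transliterate to Sinhala: {singlish}",
        "completion": sinhala,
    }

examples = [
    make_example("gnathin", "ඥාතින්"),
    make_example("kshana", "ක්ෂණ"),
]

for ex in examples:
    print(ex["prompt"], "->", ex["completion"])
```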


Stage 2 — Rare Conjunct Booster

Datasets used:

  • Adjunct dataset (rare Sinhala clusters)
  • Replay samples from phonetic dataset
  • Replay samples from adhoc dataset
  • Replay samples from code-mixed dataset

Goal:

Improve transliteration of difficult Sinhala conjunct clusters such as:

  • ක්ෂ
  • ශ්‍ර
  • ස්ථ
  • මඤ්ඤ
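
The Stage 2 mix described above — rare-conjunct examples plus replay samples from the Stage 1 datasets so earlier skills are not forgotten — can be sketched as below. The replay fraction and the toy datasets are illustrative assumptions, not the actual training configuration:

```python
import random

def build_stage2_mix(adjunct, phonetic, adhoc, code_mixed,
                     replay_fraction=0.3, seed=0):
    """Combine the adjunct dataset with replay samples from Stage 1 data.

    replay_fraction is an assumed ratio, not the one used in training.
    """
    rng = random.Random(seed)
    n_replay = int(len(adjunct) * replay_fraction)
    replay_pool = phonetic + adhoc + code_mixed
    replay = rng.sample(replay_pool, min(n_replay, len(replay_pool)))
    mix = adjunct + replay
    rng.shuffle(mix)
    return mix

# Toy stand-ins for the four datasets, using pairs from this card
adjunct = [("kshana", "ක්ෂණ")] * 10
phonetic = [("gnathin", "ඥාතින්")] * 10
adhoc = [("mannyokka", "මඤ්ඤොක්කා")] * 10
code_mixed = [("pls explain kshetra", "කරුණාකර ක්ෂේත්‍ර විස්තර කරන්න")] * 10

mix = build_stage2_mix(adjunct, phonetic, adhoc, code_mixed)
print(len(mix))  # 10 adjunct + 3 replay = 13 examples
```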

Example Transliteration

Singlish      Sinhala
---------     ---------
gnathin       ඥාතින්
jnana         ඥානය
mannyokka     මඤ්ඤොක්කා
kshana        ක්ෂණ
shraddha      ශ්‍රද්ධා
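
Part of what makes conjuncts like ශ්‍ර difficult is that a single visual cluster is several Unicode codepoints, including an invisible zero-width joiner. A quick way to inspect this:

```python
# The joined conjunct ශ්‍ර is four codepoints: ශ, the virama ්,
# a zero-width joiner (U+200D), and ර.
s = "ශ්‍ර"
for ch in s:
    print(f"U+{ord(ch):04X}")
```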

Datasets

The model was trained using the following datasets:

1. Phonetic Dataset (~500k rows)

Romanized Sinhala to Sinhala script.

2. Adhoc Dataset

Common Singlish words and phrases.

3. Code-Mixed Dataset

Sinhala and English mixed sentences.

4. Adjunct Dataset

Synthetic and extracted dataset focusing on rare Sinhala conjunct characters.


Training Details

Parameter               Value
---------               -----
Base Model              Gemma3-Singlish-Sinhala-CodeMix
Fine-tuning Method      LoRA
LoRA Rank               32
Batch Size              2
Gradient Accumulation   8
Learning Rate           1.5e-4
Scheduler               Cosine
Precision               BF16
Max Length              256
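
For intuition on what LoRA rank 32 means in parameter terms, here is a back-of-envelope sketch; the layer dimensions are illustrative assumptions, not Gemma3's actual shapes. It also derives the effective batch size implied by the table:

```python
# LoRA replaces the update to a frozen d_in x d_out weight with two
# low-rank factors A (r x d_in) and B (d_out x r), so only r*(d_in + d_out)
# parameters are trained. r = 32 comes from the table; the dimensions
# below are illustrative assumptions.
r = 32
d_in, d_out = 2048, 2048

full_params = d_in * d_out           # parameters in the frozen base weight
lora_params = r * (d_in + d_out)     # trainable adapter parameters

# Effective batch size = per-device batch size x gradient accumulation steps
effective_batch = 2 * 8

print(lora_params, lora_params / full_params, effective_batch)
```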

Intended Use

This model is intended for:

  • Singlish to Sinhala transliteration
  • Sinhala-English code-mixed input normalization
  • Rare conjunct transliteration improvement
  • NLP research on Romanized Sinhala text

Limitations

This model may still struggle with:

  • Extremely noisy or highly ambiguous spellings
  • Very long paragraphs
  • Rare unseen slang forms
  • Inputs where meaning depends on broader discourse context

Usage Example

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Pudamya/gemma3-singlish-sinhala-2stage-adjunct-final"

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Build the transliteration prompt and generate the Sinhala output
prompt = "Transliterate to Sinhala: gnathin"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Example Predictions

Input                   Output
-----                   ------
gnathin                 ඥාතින්
mannyokka               මඤ්ඤොක්කා
api gnathin ganna oni   අපි ඥාතින් ගන්න ඕනි
pls explain kshetra     කරුණාකර ක්ෂේත්‍ර විස්තර කරන්න

Metrics

Metric   Value
------   -----
BLEU     59.71
CER      0.1296
WER      0.2195
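
For reference, CER and WER are edit-distance rates: the Levenshtein distance between prediction and reference, normalized by reference length in characters or words respectively. A minimal implementation of the definitions (the numbers in the table come from the author's evaluation, not this code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j - 1] + 1,       # insertion
                                     dp[j] + 1,           # deletion
                                     prev + (r != h))     # substitution
    return dp[-1]

def cer(ref, hyp):
    """Character error rate: edit distance over reference length in chars."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    """Word error rate: edit distance over reference length in words."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

print(cer("ඥාතින්", "ඥාතින්"))  # 0.0 for an exact match
```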

Author

Pudamya Vidusini Rathnayake

Model size: 0.3B parameters (Safetensors, tensor type BF16)