# ByT5 — Singlish → Sinhala Transliteration
A fine-tuned version of `google/byt5-base` for Singlish-to-Sinhala transliteration, developed as part of the IndoNLP 2025 Shared Task on Singlish–Sinhala Transliteration.
This is the final merged model (LoRA weights absorbed into the base).
## Task
Singlish (romanised colloquial Sinhala) → Sinhala script transliteration.
| Input (Singlish) | Output (Sinhala) |
|---|---|
| mama giya | මම ගිය |
| kohomada | කොහොමද |
## Model Notes
ByT5 operates at the raw byte level with no tokenizer vocabulary — it processes UTF-8 bytes directly. This makes it well-suited for transliteration involving low-resource scripts like Sinhala, where subword tokenizers may over-fragment characters.
Each Sinhala Unicode character encodes to approximately 3 UTF-8 bytes, producing longer token sequences than sentencepiece-based models. Use max_length=512 at inference to cover 99.9% of target sequences in this dataset (measured p99=412, max=543 tokens).
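As a quick sanity check of the byte-length point (an illustrative snippet, not part of the training code): ByT5 consumes one token per UTF-8 byte, so a target string's token count is its encoded byte length.

```python
# ByT5 tokenizes raw UTF-8 bytes, so each 3-byte Sinhala character costs
# three tokens instead of one subword.
text = "මම ගිය"           # target side of the "mama giya" example above
n_chars = len(text)         # 6 code points (5 Sinhala characters + 1 space)
n_bytes = len(text.encode("utf-8"))
print(n_chars, n_bytes)     # → 6 16 (five 3-byte characters plus a 1-byte space)
```

The roughly 3× expansion is why the recommended `max_length=512` is needed even for moderately long sentences.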
## Training Pipeline
Trained with a three-phase curriculum strategy and LoRA, sharing the same pipeline as the Small100 and mT5 variants.
### Data
| Split | Source | Size |
|---|---|---|
| Phase 1 & 2 training | phonetic_train_1M.csv | 1,000,000 samples |
| Adhoc fine-tuning | adhoc.csv | 11,937 samples |
| Phonetic validation | phonetic_test.csv | 10,003 samples |
| Adhoc validation | adhoc_test.csv | 5,003 samples |
### Synthetic Augmentation
Adhoc data was expanded with a rule-based Singlish augmenter:
- Vowel dropping — randomly drops non-boundary vowels
- Cluster simplification — collapses common digraphs (`th→t`, `sh→s`, `nd→n`, etc.)
- Vowel swapping — substitutes phonetically similar vowels (`a↔e`, `i↔e`, `o↔u`)

Aggression factor: 0.5. Applied at 15% / 20% / 15% across the three phases.
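The three operations above can be sketched as follows. This is a hypothetical reconstruction, not the actual augmenter: the function name, digraph table, and exact swap pairs beyond those listed are assumptions; only the three operation types and the aggression factor come from the description above.

```python
import random

# Assumed rule tables; only th→t, sh→s, nd→n and a↔e, i↔e, o↔u are from the card.
DIGRAPHS = {"th": "t", "sh": "s", "nd": "n"}
VOWEL_SWAPS = {"a": "e", "e": "a", "i": "e", "o": "u", "u": "o"}
VOWELS = set("aeiou")

def augment(word: str, aggression: float = 0.5, rng=random) -> str:
    # Cluster simplification: collapse common digraphs with prob = aggression.
    for src, dst in DIGRAPHS.items():
        if rng.random() < aggression:
            word = word.replace(src, dst)
    chars = []
    for i, ch in enumerate(word):
        # Vowel dropping: remove non-boundary vowels at random.
        if ch in VOWELS and 0 < i < len(word) - 1 and rng.random() < aggression:
            continue
        # Vowel swapping: substitute a phonetically similar vowel.
        if ch in VOWEL_SWAPS and rng.random() < aggression:
            ch = VOWEL_SWAPS[ch]
        chars.append(ch)
    return "".join(chars)

print(augment("kohomada", aggression=0.0))  # → kohomada (no-op at aggression 0)
```

Noising only the Singlish side while keeping the Sinhala target fixed teaches the model to normalize informal romanisation variants.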
### Three-Phase Curriculum
| Phase | Data | Epochs | LR | Validation | Aug |
|---|---|---|---|---|---|
| 1 — Foundation | 65% of phonetic train (~650K) | 2 | 1e-4 | Phonetic | 15% |
| 2 — Expansion | Remaining phonetic + 5× adhoc + 80K replay | 2 | 5e-5 | Adhoc | 20% |
| 3 — Mastery | 10× adhoc + 200K phonetic mix | 2 | 2e-5 | Adhoc | 15% |
### LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, out_proj, fc1, fc2 |
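Expressed with the `peft` library, the table above corresponds to roughly the following configuration (a sketch: hyperparameters are from the table, module names are copied verbatim from it, and `task_type` is an assumption based on the seq2seq setup):

```python
from peft import LoraConfig

# LoRA hyperparameters as listed in the table above.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
    task_type="SEQ_2_SEQ_LM",  # assumed for this encoder-decoder model
)
```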
### Training Arguments
| Parameter | Value |
|---|---|
| Batch size | 8 |
| Gradient accumulation | 4 (effective batch: 32) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Warmup ratio | 0.03 |
| Optimizer | AdamW fused |
| Precision | bfloat16 / fp16 |
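For reference, the arguments above map onto `transformers`' `Seq2SeqTrainingArguments` roughly as below (a sketch: `output_dir` is a placeholder, epochs and learning rate vary per phase as in the curriculum table, and the fused-AdamW/bfloat16 flags are the assumed spellings of the table entries):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="byt5-singlish-sinhala",  # placeholder path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,       # effective batch size 32
    weight_decay=0.01,
    max_grad_norm=1.0,
    warmup_ratio=0.03,
    optim="adamw_torch_fused",           # AdamW fused
    bf16=True,                           # or fp16=True on GPUs without bfloat16
)
```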
## Evaluation Results
| Test Set | CER ↓ | WER ↓ | BLEU ↑ | BERTScore ↑ |
|---|---|---|---|---|
| Phonetic | 0.0287 | 0.1289 | 0.7056 | 0.9905 |
| Adhoc | 0.0634 | 0.2414 | 0.5078 | 0.9901 |
BERTScore was computed using `Ransaka/sinhala-bert-medium-v2`.
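For context on the CER column: character error rate is character-level edit distance divided by reference length. A minimal pure-Python sketch (the shared task's official scoring script may differ in normalization details):

```python
def cer(ref: str, hyp: str) -> float:
    """Levenshtein distance between ref and hyp, normalized by len(ref)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / len(ref)

print(round(cer("මම ගිය", "මම ගියා"), 4))  # → 0.1667 (one extra character over 6)
```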
ByT5 ranked 2nd on both test sets among models evaluated in this shared task.
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "savinugunarathna/ByT5-Singlish-Sinhala-Merged2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("mama giya", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=4,
    max_length=512,  # important: ByT5 produces longer byte sequences
    length_penalty=1.0,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# → මම ගිය
```
Note: Always use `max_length=512` for ByT5. Each Sinhala character encodes to ~3 byte tokens, so sequences are much longer than with sentencepiece models.