# ByT5 — Singlish → Sinhala Transliteration
A fine-tuned version of `google/byt5-base` for Singlish-to-Sinhala transliteration, developed as part of the IndoNLP 2025 Shared Task on Singlish–Sinhala Transliteration.
This is the final merged model (LoRA weights absorbed into the base).
## Task
Singlish (romanised colloquial Sinhala) → Sinhala script transliteration.
| Input (Singlish) | Output (Sinhala) |
|---|---|
| mama giya | මම ගිය |
| kohomada | කොහොමද |
## Model Notes
ByT5 operates at the raw byte level with no tokenizer vocabulary — it processes UTF-8 bytes directly. This makes it well-suited for transliteration involving low-resource scripts like Sinhala, where subword tokenizers may over-fragment characters.
Each Sinhala Unicode character encodes to approximately 3 UTF-8 bytes, producing longer token sequences than sentencepiece-based models. Use max_length=512 at inference to cover 99.9% of target sequences in this dataset (measured p99=412, max=543 tokens).
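As a quick sanity check of the byte-length point (an illustrative snippet, not part of the training code): ByT5 consumes one token per UTF-8 byte, so a target string's token count is its encoded byte length.

```python
# ByT5 tokenizes raw UTF-8 bytes, so each 3-byte Sinhala character costs
# three tokens instead of one subword.
text = "මම ගිය"           # target side of the "mama giya" example above
n_chars = len(text)         # 6 code points (5 Sinhala characters + 1 space)
n_bytes = len(text.encode("utf-8"))
print(n_chars, n_bytes)     # → 6 16 (five 3-byte characters plus a 1-byte space)
```

The roughly 3× expansion is why the recommended `max_length=512` is needed even for moderately long sentences.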
## Training Pipeline
Trained with a three-phase curriculum strategy and LoRA, sharing the same pipeline as the Small100 and mT5 variants.
### Data
| Split | Source | Size |
|---|---|---|
| Phase 1 & 2 training | phonetic_train_1M.csv | 1,000,000 samples |
| Adhoc fine-tuning | adhoc.csv | 11,937 samples |
| Phonetic validation | phonetic_test.csv | 10,003 samples |
| Adhoc validation | adhoc_test.csv | 5,003 samples |
### Synthetic Augmentation
Adhoc data was expanded with a rule-based Singlish augmenter:
- Vowel dropping — randomly drops non-boundary vowels
- Cluster simplification — collapses common digraphs (`th→t`, `sh→s`, `nd→n`, etc.)
- Vowel swapping — substitutes phonetically similar vowels (`a↔e`, `i↔e`, `o↔u`)

Aggression factor: 0.5. Applied at 15% / 20% / 15% across the three phases.
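The three operations above can be sketched as follows. This is a hypothetical reconstruction, not the actual augmenter: the function name, digraph table, and exact swap pairs beyond those listed are assumptions; only the three operation types and the aggression factor come from the description above.

```python
import random

# Assumed rule tables; only th→t, sh→s, nd→n and a↔e, i↔e, o↔u are from the card.
DIGRAPHS = {"th": "t", "sh": "s", "nd": "n"}
VOWEL_SWAPS = {"a": "e", "e": "a", "i": "e", "o": "u", "u": "o"}
VOWELS = set("aeiou")

def augment(word: str, aggression: float = 0.5, rng=random) -> str:
    # Cluster simplification: collapse common digraphs with prob = aggression.
    for src, dst in DIGRAPHS.items():
        if rng.random() < aggression:
            word = word.replace(src, dst)
    chars = []
    for i, ch in enumerate(word):
        # Vowel dropping: remove non-boundary vowels at random.
        if ch in VOWELS and 0 < i < len(word) - 1 and rng.random() < aggression:
            continue
        # Vowel swapping: substitute a phonetically similar vowel.
        if ch in VOWEL_SWAPS and rng.random() < aggression:
            ch = VOWEL_SWAPS[ch]
        chars.append(ch)
    return "".join(chars)

print(augment("kohomada", aggression=0.0))  # → kohomada (no-op at aggression 0)
```

Noising only the Singlish side while keeping the Sinhala target fixed teaches the model to normalize informal romanisation variants.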
### Three-Phase Curriculum
| Phase | Data | Epochs | LR | Validation | Aug |
|---|---|---|---|---|---|
| 1 — Foundation | 65% of phonetic train (~650K) | 2 | 1e-4 | Phonetic | 15% |
| 2 — Expansion | Remaining phonetic + 5× adhoc + 80K replay | 2 | 5e-5 | Adhoc | 20% |
| 3 — Mastery | 10× adhoc + 200K phonetic mix | 2 | 2e-5 | Adhoc | 15% |
### LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, out_proj, fc1, fc2 |
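Expressed with the `peft` library, the table above corresponds to roughly the following configuration (a sketch: hyperparameters are from the table, module names are copied verbatim from it, and `task_type` is an assumption based on the seq2seq setup):

```python
from peft import LoraConfig

# LoRA hyperparameters as listed in the table above.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
    task_type="SEQ_2_SEQ_LM",  # assumed for this encoder-decoder model
)
```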
### Training Arguments
| Parameter | Value |
|---|---|
| Batch size | 8 |
| Gradient accumulation | 4 (effective batch: 32) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Warmup ratio | 0.03 |
| Optimizer | AdamW fused |
| Precision | bfloat16 / fp16 |
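For reference, the arguments above map onto `transformers`' `Seq2SeqTrainingArguments` roughly as below (a sketch: `output_dir` is a placeholder, epochs and learning rate vary per phase as in the curriculum table, and the fused-AdamW/bfloat16 flags are the assumed spellings of the table entries):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="byt5-singlish-sinhala",  # placeholder path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,       # effective batch size 32
    weight_decay=0.01,
    max_grad_norm=1.0,
    warmup_ratio=0.03,
    optim="adamw_torch_fused",           # AdamW fused
    bf16=True,                           # or fp16=True on GPUs without bfloat16
)
```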
## Evaluation Results
| Test Set | CER ↓ | WER ↓ | BLEU ↑ | BERTScore ↑ |
|---|---|---|---|---|
| Phonetic | 0.0287 | 0.1289 | 0.7056 | 0.9905 |
| Adhoc | 0.0634 | 0.2414 | 0.5078 | 0.9901 |
BERTScore was computed using `Ransaka/sinhala-bert-medium-v2`.
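For context on the CER column: character error rate is character-level edit distance divided by reference length. A minimal pure-Python sketch (the shared task's official scoring script may differ in normalization details):

```python
def cer(ref: str, hyp: str) -> float:
    """Levenshtein distance between ref and hyp, normalized by len(ref)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / len(ref)

print(round(cer("මම ගිය", "මම ගියා"), 4))  # → 0.1667 (one extra character over 6)
```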
ByT5 ranked 2nd on both test sets among models evaluated in this shared task.
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "savinugunarathna/ByT5-Singlish-Sinhala-Merged2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("mama giya", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=4,
    max_length=512,  # important: ByT5 produces longer byte sequences
    length_penalty=1.0,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# → මම ගිය
```
Note: Always use `max_length=512` for ByT5. Each Sinhala character encodes to ~3 byte tokens, so sequences are much longer than with sentencepiece models.