# ByT5 — Singlish → Sinhala Transliteration

A fine-tuned version of `google/byt5-base` for Singlish-to-Sinhala transliteration, developed as part of the IndoNLP 2025 Shared Task on Singlish–Sinhala Transliteration.

This is the final merged model, with the LoRA adapter weights folded into the base model.


## Task

Singlish (romanised colloquial Sinhala) → Sinhala script transliteration.

| Input (Singlish) | Output (Sinhala) |
|---|---|
| mama giya | මම ගිය |
| kohomada | කොහොමද |

## Model Notes

ByT5 operates at the raw byte level with no tokenizer vocabulary — it processes UTF-8 bytes directly. This makes it well-suited for transliteration involving low-resource scripts like Sinhala, where subword tokenizers may over-fragment characters.

Each Sinhala Unicode character encodes to approximately 3 UTF-8 bytes, producing longer token sequences than sentencepiece-based models. Use max_length=512 at inference to cover 99.9% of target sequences in this dataset (measured p99=412, max=543 tokens).
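The byte-length difference is easy to verify with plain Python (no model needed):

```python
# ASCII Singlish input: one byte per character.
singlish = "mama giya"
print(len(singlish), len(singlish.encode("utf-8")))   # 9 characters, 9 bytes

# Sinhala output: each character in the U+0D80–U+0DFF block takes 3 UTF-8
# bytes, so the byte-level target is roughly 3x its character count.
sinhala = "මම ගිය"
print(len(sinhala), len(sinhala.encode("utf-8")))     # 6 characters, 16 bytes
```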


## Training Pipeline

Trained with LoRA using a three-phase curriculum strategy, following the same pipeline as the Small100 and mT5 variants.

### Data

| Split | Source | Size |
|---|---|---|
| Phase 1 & 2 training | `phonetic_train_1M.csv` | 1,000,000 samples |
| Adhoc fine-tuning | `adhoc.csv` | 11,937 samples |
| Phonetic validation | `phonetic_test.csv` | 10,003 samples |
| Adhoc validation | `adhoc_test.csv` | 5,003 samples |

### Synthetic Augmentation

Adhoc data was expanded with a rule-based Singlish augmenter:

- **Vowel dropping** — randomly drops non-boundary vowels
- **Cluster simplification** — collapses common digraphs (th→t, sh→s, nd→n, etc.)
- **Vowel swapping** — substitutes phonetically similar vowels (a↔e, i↔e, o↔u)

Aggression factor: 0.5. Applied at 15% / 20% / 15% across the three phases.
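A hypothetical sketch of such an augmenter (function names, rule order, and the exact digraph/vowel tables are assumptions, not the project's actual code):

```python
import random

VOWELS = set("aeiou")
CLUSTERS = {"th": "t", "sh": "s", "nd": "n"}                # digraph -> simplified
SWAPS = {"a": "e", "e": "i", "i": "e", "o": "u", "u": "o"}  # similar-vowel pairs

def augment(word, aggression=0.5, seed=None):
    """Apply each noisy rewrite rule with probability `aggression`."""
    rng = random.Random(seed)
    # Cluster simplification: collapse common digraphs.
    for digraph, simple in CLUSTERS.items():
        if digraph in word and rng.random() < aggression:
            word = word.replace(digraph, simple)
    # Vowel dropping / swapping on non-boundary positions only.
    chars = list(word)
    for i in range(1, len(chars) - 1):
        if chars[i] in VOWELS and rng.random() < aggression:
            chars[i] = "" if rng.random() < 0.5 else SWAPS[chars[i]]
    return "".join(chars)
```

With `aggression=0.0` the input passes through unchanged; with higher values, progressively noisier Singlish variants are produced for the same Sinhala target.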

### Three-Phase Curriculum

| Phase | Data | Epochs | LR | Validation | Aug |
|---|---|---|---|---|---|
| 1 — Foundation | 65% of phonetic train (~650K) | 2 | 1e-4 | Phonetic | 15% |
| 2 — Expansion | Remaining phonetic + 5× adhoc + 80K replay | 2 | 5e-5 | Adhoc | 20% |
| 3 — Mastery | 10× adhoc + 200K phonetic mix | 2 | 2e-5 | Adhoc | 15% |
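A minimal sketch of how the phase mixes above could be assembled (the function name and sampling details are assumptions; sizes are capped so the sketch also works on small lists):

```python
import random

def build_phases(phonetic, adhoc, seed=42):
    """Assemble the three curriculum phases from (singlish, sinhala) pairs."""
    rng = random.Random(seed)
    phonetic = phonetic[:]              # avoid mutating the caller's list
    rng.shuffle(phonetic)

    split = int(0.65 * len(phonetic))
    phase1 = phonetic[:split]                                 # 65% of phonetic train

    replay = rng.sample(phase1, k=min(80_000, len(phase1)))   # replay buffer
    phase2 = phonetic[split:] + adhoc * 5 + replay            # remainder + 5x adhoc

    phonetic_mix = rng.sample(phonetic, k=min(200_000, len(phonetic)))
    phase3 = adhoc * 10 + phonetic_mix                        # adhoc-heavy mastery mix
    return phase1, phase2, phase3
```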

### LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, out_proj, fc1, fc2 |
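Expressed as a `peft` configuration, this would correspond to something like the following (a sketch; assumes the `peft` library and a seq2seq task type):

```python
from peft import LoraConfig, TaskType

# LoRA hyperparameters from the table above; the target module names follow
# this project's setup as listed in the table.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
)
```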

### Training Arguments

| Parameter | Value |
|---|---|
| Batch size | 8 |
| Gradient accumulation | 4 (effective batch: 32) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Warmup ratio | 0.03 |
| Optimizer | AdamW (fused) |
| Precision | bfloat16 / fp16 |
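The equivalent `transformers` setup might look like this (a sketch; `output_dir` is a placeholder, and the learning rate and epoch count shown are for phase 1 of the curriculum):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="byt5-singlish-sinhala",   # placeholder
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,        # effective batch size 32
    learning_rate=1e-4,                   # phase 1; 5e-5 / 2e-5 in phases 2 / 3
    num_train_epochs=2,
    weight_decay=0.01,
    max_grad_norm=1.0,
    warmup_ratio=0.03,
    optim="adamw_torch_fused",            # fused AdamW
    bf16=True,                            # or fp16=True on older GPUs
)
```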

## Evaluation Results

| Test Set | CER ↓ | WER ↓ | BLEU ↑ | BERTScore ↑ |
|---|---|---|---|---|
| Phonetic | 0.0287 | 0.1289 | 0.7056 | 0.9905 |
| Adhoc | 0.0634 | 0.2414 | 0.5078 | 0.9901 |

BERTScore was computed using `Ransaka/sinhala-bert-medium-v2`.
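CER here is the character-level Levenshtein edit distance normalised by reference length; a minimal reference implementation:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (single-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def cer(hyp, ref):
    """Character error rate: edits needed per reference character."""
    return levenshtein(hyp, ref) / len(ref)

print(cer("මම ගයා", "මම ගිය"))   # proportion of character edits
```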

ByT5 ranked 2nd on both test sets among the models evaluated in this shared task.


## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "savinugunarathna/ByT5-Singlish-Sinhala-Merged2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("mama giya", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=4,
    max_length=512,   # important — ByT5 produces longer byte sequences
    length_penalty=1.0,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# → මම ගිය
```

**Note:** Always use `max_length=512` for ByT5. Each Sinhala character encodes to ~3 byte tokens, so output sequences are much longer than with sentencepiece-based models.
