Bangla Transliteration (Roman to Bangla)
A character-level Transformer that converts romanized Bangla (Banglish) to proper Bangla script.
Input: ami tomake bhalobashi
Output: আমি তোমাকে ভালোবাসি
9.18% character error rate (CER) on the held-out test set (2,429 pairs). 6M parameters, 24MB model, runs on CPU.
Quick Start
pip install torch gradio editdistance
Interactive:
python infer.py --model model/ro2bn_ft
Gradio demo:
python app.py
Evaluate on test set:
python infer.py --model model/ro2bn_ft --eval data/test.csv
How It Works
Two-stage training on the BanglaTLit dataset:
- Pretrain on 206K pairs (41K gold human-labeled + 165K pseudo-labeled silver) with a warmup LR schedule (sketched below)
- Fine-tune on the 41K gold-only pairs with a constant lr=5e-5
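The stage-1 schedule ramps the learning rate linearly over the warmup steps; what follows the ramp is an assumption here (inverse-sqrt decay, the classic Transformer shape), so treat this as a sketch rather than the repo's exact code:

```python
import torch

def warmup_inv_sqrt(optimizer, warmup_steps=800):
    # Linear ramp for warmup_steps, then inverse-sqrt decay.
    def lr_lambda(step):
        step = max(step, 1)
        return min(step / warmup_steps, (warmup_steps / step) ** 0.5)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```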
The model is a standard encoder-decoder Transformer (d_model=256, 8 heads, 4 layers each side, FFN=512) operating at the character level. No subword tokenization, no pretrained embeddings -- just raw characters in, characters out.
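A minimal sketch of that architecture in PyTorch; the class and argument names are illustrative rather than the repo's actual code, and learned positional embeddings are an assumption:

```python
import torch
import torch.nn as nn

class CharTransliterator(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=256, nhead=8,
                 layers=4, ffn=512, dropout=0.2, max_len=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # learned positions (an assumption)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=layers, num_decoder_layers=layers,
            dim_feedforward=ffn, dropout=dropout, batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src, tgt):
        # src, tgt: (batch, seq) tensors of character ids
        sp = torch.arange(src.size(1), device=src.device)
        tp = torch.arange(tgt.size(1), device=tgt.device)
        causal = self.transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(src.device)
        h = self.transformer(self.src_emb(src) + self.pos(sp),
                             self.tgt_emb(tgt) + self.pos(tp),
                             tgt_mask=causal)
        return self.out(h)  # (batch, seq, tgt_vocab) logits
```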
Data
All data comes from BanglaTLit, a Bangla social media transliteration corpus.
| Split | Pairs |
|---|---|
| Train | 41,785 |
| Val | 1,461 |
| Test | 2,429 |
Each CSV has two columns: roman and bangla.
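A minimal loader for that layout (a sketch, assuming UTF-8 files with a header row):

```python
import csv

def load_pairs(path):
    # Returns (roman, bangla) string pairs from one split CSV.
    with open(path, newline="", encoding="utf-8") as f:
        return [(row["roman"], row["bangla"]) for row in csv.DictReader(f)]

pairs = load_pairs("data/train.csv")
```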
Model
| Spec | Value |
|---|---|
| Architecture | Char-level encoder-decoder Transformer |
| Parameters | 6M |
| d_model | 256 |
| Heads | 8 |
| Layers | 4 enc + 4 dec |
| FFN | 512 |
| Size on disk | 24MB |
Saved in model/ro2bn_ft/ (PyTorch state dict + vocab JSONs + config).
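A sketch of loading the checkpoint pieces; the file names inside model/ro2bn_ft/ are assumptions here (infer.py defines the actual layout):

```python
import json
import torch

ckpt_dir = "model/ro2bn_ft"
state = torch.load(f"{ckpt_dir}/model.pt", map_location="cpu")  # assumed file name
with open(f"{ckpt_dir}/src_vocab.json", encoding="utf-8") as f:
    src_vocab = json.load(f)  # assumed format: char -> id mapping
```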
Training
Pretrain (stage 1):
python train.py \
--train data/train.csv --val data/val.csv --test data/test.csv \
--d_model 256 --nhead 8 --layers 4 --ffn 512 \
--epochs 100 --batch 256 --warmup 800 --dropout 0.2 \
--out model/ro2bn_v3
Fine-tune (stage 2):
python finetune.py \
--checkpoint model/ro2bn_v3 \
--train data/train.csv --val data/val.csv --test data/test.csv \
--out model/ro2bn_ft \
--epochs 40 --lr 5e-5 --batch 128
Trained on a single RTX 3090 (24GB): pretraining took ~4 hours, fine-tuning ~3.5 hours.
Results
| Metric | Value |
|---|---|
| Test CER | 9.18% |
| Perfect match | 435 / 2,429 (17.9%) |
Most of the remaining errors are spacing differences (কত? vs কত ?), valid spelling variants (থ্যাংকস / থ্যাংক্স), or English loanword transliteration choices (অ্যাপ vs এপ).
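For reference, CER is total character-level edit distance divided by total reference length. A minimal sketch using the editdistance package from Quick Start (whether infer.py normalizes whitespace first is an assumption left out here):

```python
import editdistance

def corpus_cer(refs, hyps):
    # Total character edits across the corpus / total reference characters.
    edits = sum(editdistance.eval(r, h) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)
```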
Files
tlit/
├── app.py # Gradio demo
├── infer.py # Inference (interactive / batch / eval)
├── train.py # Stage 1: pretrain
├── finetune.py # Stage 2: fine-tune on gold data
├── model/
│ └── ro2bn_ft/ # Final model weights + vocab
├── data/
│ ├── train.csv # 41K gold pairs
│ ├── val.csv
│ └── test.csv
└── scripts/
├── extract_gold.py # Data extraction pipeline
├── extract_silver.py
├── extract_bronze.py
└── combine.py
Limitations
- Trained on social media text (TrickBD forum posts). Results on formal or literary Bangla may be worse.
- Roman-to-Bangla only. Reverse direction (bn2ro) not yet trained.
- No beam search -- decoding is greedy (sketched after this list). Beam search would likely reduce CER by another 1-2 points.
- Doesn't convert ASCII digits to Bangla digits (e.g., 10 stays 10, not ১০).
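For reference, a sketch of the greedy loop for a model with a (src, tgt) -> logits forward pass like the one sketched under How It Works; the BOS/EOS ids here are assumptions (infer.py defines the real ones):

```python
import torch

@torch.no_grad()
def greedy_decode(model, src_ids, bos=1, eos=2, max_len=128):
    src = torch.tensor([src_ids])   # (1, src_len) character ids
    out = torch.tensor([[bos]])     # running target prefix
    for _ in range(max_len):
        nxt = model(src, out)[0, -1].argmax().item()  # most likely next char
        out = torch.cat([out, torch.tensor([[nxt]])], dim=1)
        if nxt == eos:
            break
    return out[0, 1:].tolist()      # drop BOS; EOS kept if reached
```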
Credits
Built by Md Nahid Hasan (@nahidstaq).
Training data sourced from BanglaTLit by Ashfak Yeahhia et al.