
Bangla Transliteration (Roman to Bangla)

A character-level Transformer that converts romanized Bangla (Banglish) to proper Bangla script.

Input:  ami tomake bhalobashi
Output: আমি তোমাকে ভালোবাসি

9.18% CER on the held-out test set (2,429 pairs). 6M parameters, 24 MB model, runs on CPU.

Quick Start

pip install torch gradio editdistance

Interactive:

python infer.py --model model/ro2bn_ft

Gradio demo:

python app.py

Evaluate on test set:

python infer.py --model model/ro2bn_ft --eval data/test.csv

How It Works

Two-stage training on the BanglaTLit dataset:

  1. Pretrain on 206K pairs (41K gold human-labeled + 165K pseudo-labeled silver) with warmup LR schedule
  2. Fine-tune on 41K gold-only pairs with constant lr=5e-5
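The warmup schedule in stage 1 can be sketched as follows. This is an assumption about what --warmup 800 implements: a linear ramp over the first 800 steps followed by inverse-square-root decay (the classic "Noam" schedule); the repo's actual scheduler may differ.

```python
# Sketch of a Transformer-style warmup schedule (assumed behavior of the
# --warmup 800 flag: linear ramp, then inverse-sqrt decay).
def warmup_lr_lambda(warmup_steps):
    """Return a function giving the LR multiplier at a given optimizer step."""
    def fn(step):
        step = max(step, 1)  # avoid division by zero at step 0
        return min(step / warmup_steps, (warmup_steps / step) ** 0.5)
    return fn

# Plug into PyTorch with:
#   torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_lr_lambda(800))
```

The multiplier peaks at 1.0 exactly at step 800 and decays smoothly afterward, which is why stage 2 switches to a small constant lr=5e-5 instead.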

The model is a standard encoder-decoder Transformer (d_model=256, 8 heads, 4 layers each side, FFN=512) operating at the character level. No subword tokenization, no pretrained embeddings -- just raw characters in, characters out.
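A minimal sketch of that architecture using torch.nn.Transformer, under stated assumptions: the class name, vocab sizes (roman and Bangla character inventories), and module layout are guesses, and positional encoding is omitted for brevity; the repo's actual code may differ.

```python
import torch
import torch.nn as nn

# Sketch of the described char-level encoder-decoder Transformer
# (d_model=256, 8 heads, 4+4 layers, FFN=512). Vocab sizes are rough
# guesses at the character inventories; positional encoding omitted.
class CharTransliterator(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=256, nhead=8,
                 num_layers=4, ffn=512, dropout=0.2):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=ffn, dropout=dropout, batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Causal mask so the decoder can't peek at future target characters
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        h = self.transformer(self.src_emb(src_ids), self.tgt_emb(tgt_ids),
                             tgt_mask=tgt_mask)
        return self.out(h)

model = CharTransliterator(src_vocab=70, tgt_vocab=120)
n_params = sum(p.numel() for p in model.parameters())
```

With these dimensions the parameter count lands in the ballpark of the quoted 6M, most of it in the encoder/decoder stacks rather than the tiny character embeddings.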

Data

All sourced from BanglaTLit (Bangla social media transliteration corpus).

Split   Pairs
Train  41,785
Val     1,461
Test    2,429

Each CSV has two columns: roman and bangla.
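With that two-column layout, each split can be read with the standard library's csv module. A small sketch (the column names roman and bangla come from the README; the in-memory sample stands in for data/train.csv):

```python
import csv
import io

# Read (roman, bangla) pairs from one of the data/*.csv splits.
def load_pairs(f):
    return [(row["roman"], row["bangla"]) for row in csv.DictReader(f)]

# Tiny in-memory example in place of a real split file:
sample = io.StringIO("roman,bangla\nami tomake bhalobashi,আমি তোমাকে ভালোবাসি\n")
pairs = load_pairs(sample)
```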

Model

Architecture  Char-level encoder-decoder Transformer
Parameters    6M
d_model       256
Heads         8
Layers        4 enc + 4 dec
FFN           512
Size          24 MB

Saved in model/ro2bn_ft/ (PyTorch state dict + vocab JSONs + config).

Training

Pretrain (stage 1):

python train.py \
  --train data/train.csv --val data/val.csv --test data/test.csv \
  --d_model 256 --nhead 8 --layers 4 --ffn 512 \
  --epochs 100 --batch 256 --warmup 800 --dropout 0.2 \
  --out model/ro2bn_v3

Fine-tune (stage 2):

python finetune.py \
  --checkpoint model/ro2bn_v3 \
  --train data/train.csv --val data/val.csv --test data/test.csv \
  --out model/ro2bn_ft \
  --epochs 40 --lr 5e-5 --batch 128

Trained on a single RTX 3090 (24GB). Pretrain took ~4 hours, fine-tune ~3.5 hours.

Results

Metric         Value
Test CER       9.18%
Perfect match  435 / 2,429 (17.9%)

Most of the remaining errors are spacing differences (কত? vs কত ?), valid spelling variants (থ্যাংকস / থ্যাংক্স), or English loanword transliteration choices (অ্যাপ vs এপ).
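For reference, CER here is the total character-level edit distance between hypotheses and references, divided by the total reference length. The Quick Start installs the editdistance package for this; below is a dependency-free sketch of the same computation (the function names are mine, not the repo's):

```python
# Levenshtein distance -- the same quantity editdistance.eval computes.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Character error rate over a set of (hypothesis, reference) pairs.
def cer(hyps, refs):
    dist = sum(edit_distance(h, r) for h, r in zip(hyps, refs))
    return dist / sum(len(r) for r in refs)
```

Note that CER counts every character, so the spacing and spelling-variant mismatches described above each add a small, fixed penalty even when the output is perfectly readable.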

Files

tlit/
├── app.py              # Gradio demo
├── infer.py            # Inference (interactive / batch / eval)
├── train.py            # Stage 1: pretrain
├── finetune.py         # Stage 2: fine-tune on gold data
├── model/
│   └── ro2bn_ft/       # Final model weights + vocab
├── data/
│   ├── train.csv       # 41K gold pairs
│   ├── val.csv
│   └── test.csv
└── scripts/
    ├── extract_gold.py     # Data extraction pipeline
    ├── extract_silver.py
    ├── extract_bronze.py
    └── combine.py

Limitations

  • Trained on social media text (TrickBD forum posts). Accuracy may degrade on formal or literary Bangla.
  • Roman-to-Bangla only. Reverse direction (bn2ro) not yet trained.
  • No beam search -- uses greedy decoding. Beam search would likely improve CER by 1-2%.
  • Doesn't convert ASCII digits to Bangla digits (e.g., 10 stays 10, not ১০).

Credits

Built by Md Nahid Hasan (@nahidstaq).

Training data sourced from BanglaTLit by Ashfak Yeahhia et al.
