# NastaliqBERT

Character-level NeoBERT encoder pretrained on unvocalized Urdu Nastaliq text.
NastaliqBERT is the first character-level pretrained encoder for Urdu and the first NeoBERT-architecture model for any South Asian language. It achieves a masked language modelling perplexity of 1.67 on CC-100 Urdu.
## Model description
NastaliqBERT uses the NeoBERT architecture with RoPE positional embeddings, SwiGLU activation, and Pre-RMSNorm. It operates on a 116-token character vocabulary covering all Urdu Nastaliq letters (including final forms), ASCII, and special tokens — no subword tokenization.
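To make the "no subword tokenization" point concrete, here is a minimal sketch of what a character-level encoder's tokenization looks like. The class name, special tokens, and tiny alphabet are hypothetical; the real mapping is defined by `CharTokenizer` in the repository's `src/tokenizer.py`.

```python
# Minimal character-level tokenizer sketch (hypothetical vocab; the real
# 116-entry mapping lives in src/tokenizer.py).
SPECIALS = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]

class TinyCharTokenizer:
    def __init__(self, alphabet):
        # one id per character -- no subword merges, no ligature splitting
        self.vocab = {tok: i for i, tok in enumerate(SPECIALS + sorted(set(alphabet)))}

    def encode(self, text, max_length=None):
        unk = self.vocab["[UNK]"]
        ids = [self.vocab["[CLS]"]]
        ids += [self.vocab.get(ch, unk) for ch in text]
        ids.append(self.vocab["[SEP]"])
        if max_length is not None:
            ids = ids[:max_length]
        return ids

tok = TinyCharTokenizer("ابجدھے")
ids = tok.encode("اب")  # [CLS] + one id per character + [SEP]
```

Every input character maps to exactly one id, so the sequence length is the character count plus the two special tokens.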
| Property | Value |
|---|---|
| Architecture | NeoBERT (RoPE + SwiGLU + Pre-RMSNorm) |
| Vocabulary | 116 characters |
| Hidden size | 512 |
| Layers | 6 |
| Attention heads | 8 |
| Parameters | ~19.4M |
| Pretraining data | CC-100 Urdu (~8 GB) |
| MLM perplexity | 1.67 |
## Why character-level for Urdu?
Urdu is written in Nastaliq, a right-to-left abjad in which short vowels are not written. Subword tokenizers fragment Nastaliq ligatures at arbitrary boundaries. Character-level modelling preserves:
- The aspiration digraph ھ (do-chashmi-he) as a distinct unit
- Final letter forms (ے vs ی, ں vs ن)
- Correct morphological boundaries
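Each of these units is a single Unicode code point, so plain character iteration keeps them intact. The example word below is illustrative:

```python
# Each Urdu letter is one Unicode code point, so character-level splitting
# preserves do-chashmi-he and noon ghunna as distinct units.
word = "دھواں"  # "smoke": contains ھ (U+06BE) and ں (U+06BA)
chars = list(word)
print(chars)              # ['د', 'ھ', 'و', 'ا', 'ں']
print("\u06be" in chars)  # True  -- ھ is its own token, not merged into a digraph
print("\u06ba" in chars)  # True  -- ں stays distinct from ن (U+0646)
```

A subword tokenizer trained on bytes or merges can split or fuse these code points unpredictably; the character vocabulary rules that out by construction.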
## Usage
```python
import json
import sys

import torch

sys.path.insert(0, 'src')
from model import BertCharUr, BertCharUrConfig
from tokenizer import CharTokenizer

tok = CharTokenizer()
with open('config.json') as f:
    cfg = BertCharUrConfig.from_dict(json.load(f))

model = BertCharUr(cfg)
state = torch.load('model.pt', map_location='cpu', weights_only=True)
model.load_state_dict(state['model_state'])
model.eval()

ids = tok.encode("مشاہیر", max_length=64)
input_ids = torch.tensor([ids])
mask = torch.ones_like(input_ids)
with torch.no_grad():
    hidden = model.get_encoder_output(input_ids, mask)
# hidden: (1, seq_len, 512)
```
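The encoder returns one 512-dimensional vector per character. A common way to get a single fixed-size sentence embedding from that output is masked mean pooling; the sketch below uses random tensors as a stand-in for the real encoder output and is a downstream recipe, not part of the model's API.

```python
import torch

# Masked mean pooling sketch: average the hidden states over real (non-pad)
# positions to get one embedding per sequence. Random tensors stand in for
# the encoder output shown above.
hidden = torch.randn(1, 64, 512)  # (batch, seq_len, hidden)
mask = torch.ones(1, 64)          # 1 for real tokens, 0 for padding
emb = (hidden * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)
print(emb.shape)  # torch.Size([1, 512])
```

Multiplying by the mask before summing ensures padded positions contribute nothing, and dividing by the mask sum normalizes by the true sequence length rather than the padded one.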
## Downstream use
NastaliqBERT is used as the encoder in UrduPhon, a grapheme-to-phoneme (G2P) model for Urdu TTS that uses the Phonological Feature Decoder (PFD).
## Training
Pretrained for 100,000 steps on CC-100 Urdu (`statmt/cc100`, `name="ur"`) using:
- MLM with 15% mask rate (no NSP)
- AdamW optimizer, lr=1e-4, cosine decay
- Effective batch size 64 (32 × 2 grad accumulation)
- Mixed precision (bf16)
- FlashAttention-2 (via `F.scaled_dot_product_attention`)
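The MLM objective above follows the standard BERT corruption recipe. The sketch below shows that recipe at the character level (15% of positions selected; of those, 80% replaced with the mask token, 10% with a random character, 10% left unchanged). Function and variable names are illustrative, not the repository's implementation.

```python
import random

# Standard BERT-style MLM corruption sketch (15% rate, 80/10/10 split).
# mask_id and the ids themselves are illustrative.
def mask_for_mlm(ids, mask_id, vocab_size, rate=0.15, seed=0):
    rng = random.Random(seed)
    corrupted = list(ids)
    labels = [-100] * len(ids)  # -100 = position ignored by the loss
    for i, tok_id in enumerate(ids):
        if rng.random() < rate:
            labels[i] = tok_id          # predict the original character here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_id  # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original character
    return corrupted, labels

corrupted, labels = mask_for_mlm(list(range(20)), mask_id=3, vocab_size=116)
```

With a 116-character vocabulary the random-replacement branch draws from the full character set, and no next-sentence-prediction head is needed.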
## Citation
```bibtex
@misc{nastaliqbert2025,
  title = {NastaliqBERT: A Character-Level NeoBERT Encoder for Urdu's Defective Abjad Script},
  author = {Mahwiz},
  year = {2026},
  howpublished = {\url{https://huggingface.co/mahwizzzz/NastaliqBERT}},
}
```