# NastaliqBERT

Character-level NeoBERT encoder pretrained on unvocalized Urdu Nastaliq text.
NastaliqBERT is the first character-level pretrained encoder for Urdu and the first NeoBERT-architecture model for any South Asian language. It achieves a masked language modelling perplexity of 1.67 on CC-100 Urdu.
## Model description
NastaliqBERT uses the NeoBERT architecture with RoPE positional embeddings, SwiGLU activation, and Pre-RMSNorm. It operates on a 116-token character vocabulary covering all Urdu Nastaliq letters (including final forms), ASCII, and special tokens — no subword tokenization.
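To make the "no subword tokenization" point concrete, here is a minimal sketch of what a character-level encoder's tokenization looks like. The class name, special tokens, and tiny alphabet are hypothetical; the real mapping is defined by `CharTokenizer` in the repository's `src/tokenizer.py`.

```python
# Minimal character-level tokenizer sketch (hypothetical vocab; the real
# 116-entry mapping lives in src/tokenizer.py).
SPECIALS = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]

class TinyCharTokenizer:
    def __init__(self, alphabet):
        # one id per character -- no subword merges, no ligature splitting
        self.vocab = {tok: i for i, tok in enumerate(SPECIALS + sorted(set(alphabet)))}

    def encode(self, text, max_length=None):
        unk = self.vocab["[UNK]"]
        ids = [self.vocab["[CLS]"]]
        ids += [self.vocab.get(ch, unk) for ch in text]
        ids.append(self.vocab["[SEP]"])
        if max_length is not None:
            ids = ids[:max_length]
        return ids

tok = TinyCharTokenizer("ابجدھے")
ids = tok.encode("اب")  # [CLS] + one id per character + [SEP]
```

Every input character maps to exactly one id, so the sequence length is the character count plus the two special tokens.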
| Property | Value |
|---|---|
| Architecture | NeoBERT (RoPE + SwiGLU + Pre-RMSNorm) |
| Vocabulary | 116 characters |
| Hidden size | 512 |
| Layers | 6 |
| Attention heads | 8 |
| Parameters | ~19.4M |
| Pretraining data | CC-100 Urdu (~8 GB) |
| MLM perplexity | 1.67 |
## Why character-level for Urdu?
Urdu is written in Nastaliq, a right-to-left abjad in which short vowels are not written. Subword tokenizers fragment Nastaliq ligatures at arbitrary boundaries. Character-level modelling preserves:
- The aspiration digraph ھ (do-chashmi-he) as a distinct unit
- Final letter forms (ے vs ی, ں vs ن)
- Correct morphological boundaries
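Each of these units is a single Unicode code point, so plain character iteration keeps them intact. The example word below is illustrative:

```python
# Each Urdu letter is one Unicode code point, so character-level splitting
# preserves do-chashmi-he and noon ghunna as distinct units.
word = "دھواں"  # "smoke": contains ھ (U+06BE) and ں (U+06BA)
chars = list(word)
print(chars)              # ['د', 'ھ', 'و', 'ا', 'ں']
print("\u06be" in chars)  # True  -- ھ is its own token, not merged into a digraph
print("\u06ba" in chars)  # True  -- ں stays distinct from ن (U+0646)
```

A subword tokenizer trained on bytes or merges can split or fuse these code points unpredictably; the character vocabulary rules that out by construction.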
## Usage
```python
import json
import sys

import torch

sys.path.insert(0, 'src')
from model import BertCharUr, BertCharUrConfig
from tokenizer import CharTokenizer

tok = CharTokenizer()
with open('config.json') as f:
    cfg = BertCharUrConfig.from_dict(json.load(f))

model = BertCharUr(cfg)
state = torch.load('model.pt', map_location='cpu', weights_only=True)
model.load_state_dict(state['model_state'])
model.eval()

ids = tok.encode("مشاہیر", max_length=64)
input_ids = torch.tensor([ids])
mask = torch.ones_like(input_ids)
with torch.no_grad():
    hidden = model.get_encoder_output(input_ids, mask)
# hidden: (1, seq_len, 512)
```
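The encoder returns one 512-dimensional vector per character. A common way to get a single fixed-size sentence embedding from that output is masked mean pooling; the sketch below uses random tensors as a stand-in for the real encoder output and is a downstream recipe, not part of the model's API.

```python
import torch

# Masked mean pooling sketch: average the hidden states over real (non-pad)
# positions to get one embedding per sequence. Random tensors stand in for
# the encoder output shown above.
hidden = torch.randn(1, 64, 512)  # (batch, seq_len, hidden)
mask = torch.ones(1, 64)          # 1 for real tokens, 0 for padding
emb = (hidden * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)
print(emb.shape)  # torch.Size([1, 512])
```

Multiplying by the mask before summing ensures padded positions contribute nothing, and dividing by the mask sum normalizes by the true sequence length rather than the padded one.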
## Downstream use
NastaliqBERT is used as the encoder in UrduPhon, a grapheme-to-phoneme (G2P) model for Urdu TTS that uses the Phonological Feature Decoder (PFD).
## Training
Pretrained for 100,000 steps on CC-100 Urdu (`statmt/cc100`, `name="ur"`) using:
- MLM with 15% mask rate (no NSP)
- AdamW optimizer, lr=1e-4, cosine decay
- Effective batch size 64 (32 × 2 grad accumulation)
- Mixed precision (bf16)
- FlashAttention-2 (via `F.scaled_dot_product_attention`)
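The MLM objective above follows the standard BERT corruption recipe. The sketch below shows that recipe at the character level (15% of positions selected; of those, 80% replaced with the mask token, 10% with a random character, 10% left unchanged). Function and variable names are illustrative, not the repository's implementation.

```python
import random

# Standard BERT-style MLM corruption sketch (15% rate, 80/10/10 split).
# mask_id and the ids themselves are illustrative.
def mask_for_mlm(ids, mask_id, vocab_size, rate=0.15, seed=0):
    rng = random.Random(seed)
    corrupted = list(ids)
    labels = [-100] * len(ids)  # -100 = position ignored by the loss
    for i, tok_id in enumerate(ids):
        if rng.random() < rate:
            labels[i] = tok_id          # predict the original character here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_id  # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original character
    return corrupted, labels

corrupted, labels = mask_for_mlm(list(range(20)), mask_id=3, vocab_size=116)
```

With a 116-character vocabulary the random-replacement branch draws from the full character set, and no next-sentence-prediction head is needed.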
## Citation
```bibtex
@misc{nastaliqbert2025,
  title = {NastaliqBERT: A Character-Level NeoBERT Encoder for Urdu's Defective Abjad Script},
  author = {Mahwiz},
  year = {2026},
  howpublished = {\url{https://huggingface.co/mahwizzzz/NastaliqBERT}},
}
```