Dia-1.6B-Urdu — Urdu TTS Fine-tune
Model Summary
Dia-1.6B-Urdu is a fine-tuned version of nari-labs/Dia-1.6B for Urdu (اردو) text-to-speech synthesis. It is, to our knowledge, one of the first openly released neural TTS models fine-tuned specifically on a large-scale Urdu corpus using the Dia architecture.
| Property | Value |
|---|---|
| Base Model | nari-labs/Dia-1.6B |
| Language | Urdu (ur) |
| Script | Perso-Arabic, Nastaliq style (RTL) |
| Training Data | 221,832 Urdu audio-text pairs |
| Audio Duration | 200+ hours |
| Sample Rate | 44,100 Hz |
| Architecture | Encoder-Decoder Transformer + DAC codec |
| Parameters | ~1.6B |
| Fine-tune Steps | See training details below |
| Hardware | NVIDIA RTX 3090 (24GB VRAM) |
Intended Use
This model is designed for:
- Urdu speech synthesis from text input
- Audiobook and narration generation in Urdu
- Research in low-resource TTS for South Asian languages
- Voice interface applications targeting Urdu speakers (~230M speakers globally, including second-language speakers)
Out-of-Scope Use
- Real-time streaming TTS (latency not optimized)
- Non-Urdu languages (for English, use the base Dia-1.6B)
- Voice cloning of specific individuals without consent
Usage
Installation
pip install torch torchaudio soundfile huggingface_hub
pip install git+https://github.com/nari-labs/dia.git
Basic Inference
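import torch
import soundfile as sf
from huggingface_hub import hf_hub_download
from dia.model import Dia

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the base model first; its weights are replaced below.
model = Dia.from_pretrained("nari-labs/Dia-1.6B", device=device)

# Download the fine-tuned checkpoint and load it into the base model.
# Assumes model.pth stores a plain state_dict.
print("Loading fine-tuned weights...")
ckpt_path = hf_hub_download(repo_id="mahwizzzz/Dia-1.6B-Urdu", filename="model.pth")
ckpt = torch.load(ckpt_path, map_location=device)
model.model.load_state_dict(ckpt, strict=True)
model.model.eval()
model.model = model.model.float()  # run inference in float32

text = "[ur] روزے کے وقت کھانے کی سہولت دینے کیا منع ہے؟"

with torch.no_grad():
    output = model.generate(
        text,
        max_tokens=512,
        cfg_scale=3.0,
    )

# Write the generated waveform at the model's 44.1 kHz sample rate.
output_path = "/tmp/test_urdu_26000.wav"
sf.write(output_path, output, 44100)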
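For a rough sense of how much audio a given `max_tokens` budget allows, the sketch below converts tokens to seconds. It assumes each generated step corresponds to one DAC frame and that the 44.1 kHz DAC codec uses a hop length of 512 samples (about 86 frames per second). Neither figure is stated in this card, and `max_duration_seconds` is a hypothetical helper, not part of the `dia` package.

```python
SAMPLE_RATE = 44_100       # output sample rate listed in this card
ASSUMED_HOP_LENGTH = 512   # assumption: audio samples per DAC frame


def max_duration_seconds(
    max_tokens: int,
    sample_rate: int = SAMPLE_RATE,
    hop_length: int = ASSUMED_HOP_LENGTH,
) -> float:
    """Upper bound on generated audio length for a given token budget."""
    return max_tokens * hop_length / sample_rate


print(round(max_duration_seconds(512), 2))
```

Under these assumptions, `max_tokens=512` caps the output at roughly six seconds, so longer sentences would likely need a larger budget.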
Language Tag
Always prefix your Urdu text with [ur] for best results:
text = "[ur] آپ کا متن یہاں لکھیں۔"
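If input text comes from users or files, a small wrapper can make sure the tag is always present. This is a minimal sketch; `with_language_tag` is a hypothetical helper and not part of the `dia` package.

```python
LANGUAGE_TAG = "[ur]"


def with_language_tag(text: str, tag: str = LANGUAGE_TAG) -> str:
    """Return text prefixed with the language tag, without duplicating it."""
    stripped = text.strip()
    if stripped.startswith(tag):
        return stripped
    return f"{tag} {stripped}"


print(with_language_tag("آپ کا متن یہاں لکھیں۔"))
```

The helper is idempotent, so it is safe to call on text that already carries the tag.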
Training Details
Dataset
- Size: 221,832 audio-text pairs after filtering
- Duration filter: 1.0s – 15.0s per clip
- Rejected clips: 97 (duration out of range), 0 corrupted
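The duration filter can be sketched as follows. The original filtering script is not part of this card, so the function name, the `(path, duration)` input format, and the inclusive bounds are all assumptions.

```python
MIN_SECONDS = 1.0
MAX_SECONDS = 15.0


def filter_by_duration(clips, min_s=MIN_SECONDS, max_s=MAX_SECONDS):
    """Split (path, duration_seconds) pairs into kept and rejected lists."""
    kept, rejected = [], []
    for path, duration in clips:
        if min_s <= duration <= max_s:  # assumption: bounds are inclusive
            kept.append((path, duration))
        else:
            rejected.append((path, duration))
    return kept, rejected


# Hypothetical clips for illustration only.
example = [("a.wav", 0.4), ("b.wav", 3.2), ("c.wav", 15.0), ("d.wav", 21.7)]
kept, rejected = filter_by_duration(example)
print(len(kept), len(rejected))
```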
Audio Preprocessing
- Resampled from 16,000 Hz → 44,100 Hz (ffmpeg)
- Loudness normalized to -23 LUFS (pyloudnorm)
- Peak limiting at 0.99
- Format: 16-bit PCM mono WAV
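One way to reproduce these steps is sketched below. It is not the original preprocessing script. The function names and the exact ffmpeg flags are assumptions, and peak limiting is implemented here as simple downscaling, which may differ from what was actually used. The audio libraries are imported inside the function so that the command builder can be used without them installed.

```python
def build_resample_command(src: str, dst: str, rate: int = 44_100) -> list:
    """ffmpeg arguments to resample to `rate` Hz, mono, 16-bit PCM WAV."""
    return ["ffmpeg", "-y", "-i", src,
            "-ar", str(rate), "-ac", "1", "-c:a", "pcm_s16le", dst]


def normalize_and_limit(in_path: str, out_path: str,
                        target_lufs: float = -23.0,
                        peak_limit: float = 0.99) -> None:
    """Loudness-normalize a WAV file, then cap its peak amplitude."""
    import numpy as np
    import pyloudnorm as pyln
    import soundfile as sf

    data, rate = sf.read(in_path)
    meter = pyln.Meter(rate)                    # ITU-R BS.1770 loudness meter
    loudness = meter.integrated_loudness(data)
    normalized = pyln.normalize.loudness(data, loudness, target_lufs)

    # Assumption: limit by scaling the whole clip down rather than clipping.
    peak = np.max(np.abs(normalized))
    if peak > peak_limit:
        normalized = normalized * (peak_limit / peak)

    sf.write(out_path, normalized, rate, subtype="PCM_16")


print(" ".join(build_resample_command("in_16k.wav", "out_44k.wav")))
```

Note that scaling a clip down to satisfy the peak limit also lowers its loudness slightly below the -23 LUFS target.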
Training Configuration
| Hyperparameter | Value |
|---|---|
| Base model | nari-labs/Dia-1.6B |
| Precision | float16 (--half) |
| Batch size | 2 |
| Learning rate | 1e-5 |
| LR schedule | Cosine with warmup |
| Optimizer | AdamW |
| Hardware | NVIDIA RTX 3090 24GB |
| Framework | PyTorch 2.9.1 + CUDA 12.8 |
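A cosine schedule with linear warmup, as named in the table, can be sketched like this. The peak learning rate of 1e-5 comes from the table. The warmup length and total step count are placeholder assumptions, since the card does not state them, and `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step: int,
          base_lr: float = 1e-5,
          warmup_steps: int = 1_000,    # assumption: not stated in the card
          total_steps: int = 50_000):   # assumption: not stated in the card
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


for s in (0, 500, 1_000, 25_500, 50_000):
    print(s, lr_at(s))
```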
Train/Validation Split
- Train: 221,167 samples
- Validation: 665 samples
- Split ratio: 99.7% / 0.3%
- Random seed: 42
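A seeded split with these sizes could look like the following sketch. The actual split procedure is not described in this card, so shuffling indices with Python's `random` module is an assumption, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_total: int, n_val: int, seed: int = 42):
    """Shuffle indices deterministically and hold out n_val for validation."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(221_832, 665)
print(len(train_idx), len(val_idx))  # 221167 665
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state.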
Evaluation
Evaluation audio samples are generated every 2,000 training steps using the Urdu test sentence:
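آپ کو اپنے ملک یا علاقے کے قوانین کو جاننے کیلئے مقامی حکومت یا قانونی مشاور سے رجوع کرنا چاہئے۔
(English: "You should consult the local government or a legal advisor to learn the laws of your country or region.")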
Formal MOS (Mean Opinion Score) evaluation is planned for future releases.
Limitations
- Early training: Early checkpoints may produce rough or accented speech; quality improves with additional training steps
- OOV words: Roman Urdu or mixed-script text may not synthesize correctly
- Prosody: Long sentences may have unnatural prosody at early checkpoints
- DAC dependency: Requires the DAC (Descript Audio Codec) model for inference
Ethical Considerations
- This model was trained exclusively on publicly available Urdu narrations
- No private or personally identifiable voice data was used
- The model should not be used to synthesize speech impersonating real individuals without consent
- Misuse for disinformation or voice fraud is strictly prohibited
Citation
If you use this model in your research or applications, please cite:
@misc{dia_urdu_2026,
title = {Dia-1.6B-Urdu: Fine-tuned Urdu Text-to-Speech},
author = {Mahwiz Khalil},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/mahwizzzz/Dia-1.6B-Urdu}},
note = {Fine-tuned from nari-labs/Dia-1.6B on Urdu corpus}
}
Also cite the original Dia model:
@misc{dia2025,
title = {Dia: A 1.6B Dialogue Text-to-Speech Model},
author = {Nari Labs},
year = {2025},
url = {https://github.com/nari-labs/dia}
}
Acknowledgements
Model trained and released by mahwizzzz (Muhammad Mahwiz Khalil).