Dia-1.6B-Urdu — Urdu TTS Fine-tune

Model Summary

Dia-1.6B-Urdu is a fine-tuned version of nari-labs/Dia-1.6B for Urdu (اردو) text-to-speech synthesis. It is, to our knowledge, one of the first openly released neural TTS models fine-tuned specifically on a large-scale Urdu corpus using the Dia architecture.

| Property | Value |
|---|---|
| Base Model | nari-labs/Dia-1.6B |
| Language | Urdu (ur) |
| Script | Nastaliq (RTL) |
| Training Data | 221,832 Urdu audio-text pairs |
| Audio Duration | 200+ hours |
| Sample Rate | 44,100 Hz |
| Architecture | Encoder-Decoder Transformer + DAC codec |
| Parameters | ~1.6B |
| Fine-tune Steps | See training details below |
| Hardware | NVIDIA RTX 3090 (24GB VRAM) |

Intended Use

This model is designed for:

  • Urdu speech synthesis from text input
  • Audiobook and narration generation in Urdu
  • Research in low-resource TTS for South Asian languages
  • Voice interface applications targeting Urdu speakers (~230M native speakers globally)

Out-of-Scope Use

  • Real-time streaming TTS (latency not optimized)
  • Non-Urdu languages (use the base Dia-1.6B for multilingual use)
  • Voice cloning of specific individuals without consent

Usage

Installation

pip install torch torchaudio soundfile
pip install git+https://github.com/nari-labs/dia.git

Basic Inference

import torch
import soundfile as sf
from huggingface_hub import hf_hub_download
from dia.model import Dia

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the base model, then swap in the fine-tuned Urdu weights
model = Dia.from_pretrained("nari-labs/Dia-1.6B", device=device)

print("Loading fine-tuned weights...")
ckpt_path = hf_hub_download(repo_id="mahwizzzz/Dia-1.6B-Urdu", filename="model.pth")
ckpt = torch.load(ckpt_path, map_location=device)
model.model.load_state_dict(ckpt, strict=True)
model.model.eval()
model.model = model.model.float()

# Urdu input text with the required [ur] language tag
text = "[ur] روزے کے وقت کھانے کی سہولت دینے کیا منع ہے؟"

with torch.no_grad():
    output = model.generate(
        text,
        max_tokens=512,
        cfg_scale=3.0
    )

output_path = "urdu_output.wav"
sf.write(output_path, output, 44100)

Language Tag

Always prefix your Urdu text with [ur] for best results:

text = "[ur] آپ کا متن یہاں لکھیں۔"
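A small helper (hypothetical, not part of the released code) can make sure the tag is always present before text reaches `model.generate`:

```python
def ensure_ur_tag(text: str, tag: str = "[ur]") -> str:
    """Prepend the Urdu language tag if the text does not already start with it."""
    text = text.strip()
    return text if text.startswith(tag) else f"{tag} {text}"

print(ensure_ur_tag("آپ کا متن یہاں لکھیں۔"))  # tag gets prepended
print(ensure_ur_tag("[ur] آپ کا متن یہاں لکھیں۔"))  # already tagged, unchanged
```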

Training Details

Dataset

  • Size: 221,832 audio-text pairs after filtering
  • Duration filter: 1.0s – 15.0s per clip
  • Rejected clips: 97 (duration out of range), 0 corrupted
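The duration filter described above can be sketched as a simple predicate (a hypothetical helper; the actual filtering script is not part of this release):

```python
def keep_clip(duration_s: float, min_s: float = 1.0, max_s: float = 15.0) -> bool:
    """Return True if a clip's duration falls inside the accepted range."""
    return min_s <= duration_s <= max_s

# Example: partition (clip_id, duration) pairs into kept and rejected
clips = [("a", 0.5), ("b", 3.2), ("c", 14.9), ("d", 15.1)]
kept = [cid for cid, dur in clips if keep_clip(dur)]
rejected = [cid for cid, dur in clips if not keep_clip(dur)]
print(kept)      # ['b', 'c']
print(rejected)  # ['a', 'd']
```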

Audio Preprocessing

  • Resampled from 16,000 Hz → 44,100 Hz (ffmpeg)
  • Loudness normalized to -23 LUFS (pyloudnorm)
  • Peak limiting at 0.99
  • Format: 16-bit PCM mono WAV
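The peak-limiting step can be illustrated with a minimal sketch (pure-Python for clarity; the pipeline described above operates on WAV sample arrays, and the loudness normalization itself is handled by pyloudnorm):

```python
def peak_limit(samples, peak: float = 0.99):
    """Scale the whole clip down uniformly if any sample exceeds the peak ceiling."""
    max_abs = max(abs(s) for s in samples)
    if max_abs <= peak:
        return list(samples)  # already within limits, leave untouched
    scale = peak / max_abs
    return [s * scale for s in samples]

limited = peak_limit([0.5, -1.2, 0.8])
print(max(abs(s) for s in limited))  # at most 0.99
```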

Training Configuration

| Hyperparameter | Value |
|---|---|
| Base model | nari-labs/Dia-1.6B |
| Precision | float16 (--half) |
| Batch size | 2 |
| Learning rate | 1e-5 |
| LR schedule | Cosine with warmup |
| Optimizer | AdamW |
| Hardware | NVIDIA RTX 3090 24GB |
| Framework | PyTorch 2.9.1 + CUDA 12.8 |
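The warmup length and total step count are not stated in this card; the sketch below only shows the standard cosine-with-warmup shape, with illustrative values:

```python
import math

def lr_at(step: int, base_lr: float = 1e-5,
          warmup_steps: int = 1000, total_steps: int = 50000) -> float:
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(500))    # halfway through warmup: 5e-06
print(lr_at(1000))   # end of warmup: 1e-05
print(lr_at(50000))  # end of training: 0.0
```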

Train/Validation Split

  • Train: 221,167 samples
  • Validation: 665 samples
  • Split ratio: 99.7% / 0.3%
  • Random seed: 42
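A reproducible split matching the counts above could look like this (a hypothetical sketch; the actual split script is not included in this release):

```python
import random

def split_dataset(items, val_size: int = 665, seed: int = 42):
    """Shuffle deterministically, then carve off a fixed-size validation set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[val_size:], shuffled[:val_size]

train, val = split_dataset(range(221_832))
print(len(train), len(val))  # 221167 665
```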

Evaluation

Evaluation audio samples are generated every 2,000 training steps using the Urdu test sentence:

آپ کو اپنے ملک یا علاقے کے قوانین کو جاننے کیلئے مقامی حکومت یا قانونی مشاور سے رجوع کرنا چاہئے۔

("You should consult your local government or a legal advisor to learn the laws of your country or region.")

Formal MOS (Mean Opinion Score) evaluation is planned for future releases.


Limitations

  • Early checkpoints: may produce rough or accented speech; quality improves significantly with more training steps
  • OOV words: Roman Urdu or mixed-script text may not synthesize correctly
  • Prosody: Long sentences may have unnatural prosody at early checkpoints
  • DAC dependency: Requires the DAC (Descript Audio Codec) model for inference

Ethical Considerations

  • This model was trained exclusively on publicly available Urdu narrations
  • No private or personally identifiable voice data was used
  • The model should not be used to synthesize speech impersonating real individuals without consent
  • Misuse for disinformation or voice fraud is strictly prohibited

Citation

If you use this model in your research or applications, please cite:

@misc{dia_urdu_2026,
  title        = {Dia-1.6B-Urdu: Fine-tuned Urdu Text-to-Speech},
  author       = {Mahwiz Khalil},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/mahwizzzz/Dia-1.6B-Urdu}},
  note         = {Fine-tuned from nari-labs/Dia-1.6B on Urdu corpus}
}

Also cite the original Dia model:

@misc{dia2025,
  title  = {Dia: A 1.6B Dialogue Text-to-Speech Model},
  author = {Nari Labs},
  year   = {2025},
  url    = {https://github.com/nari-labs/dia}
}

Acknowledgements

  • Nari Labs for the Dia-1.6B base model and fine-tuning pipeline
  • Descript for the DAC audio codec

Model trained and released by mahwizzzz — Muhammad Mahwiz Khalil
