Dia-1.6B-Urdu — Urdu TTS Fine-tune

Model Summary

Dia-1.6B-Urdu is a fine-tuned version of nari-labs/Dia-1.6B for Urdu (اردو) text-to-speech synthesis. It is, to our knowledge, one of the first openly released neural TTS models fine-tuned specifically on a large-scale Urdu corpus using the Dia architecture.

| Property | Value |
|---|---|
| Base Model | nari-labs/Dia-1.6B |
| Language | Urdu (ur) |
| Script | Nastaliq (RTL) |
| Training Data | 221,832 Urdu audio-text pairs |
| Audio Duration | 200+ hours |
| Sample Rate | 44,100 Hz |
| Architecture | Encoder-Decoder Transformer + DAC codec |
| Parameters | ~1.6B |
| Fine-tune Steps | See training details below |
| Hardware | NVIDIA RTX 3090 (24GB VRAM) |

Intended Use

This model is designed for:

  • Urdu speech synthesis from text input
  • Audiobook and narration generation in Urdu
  • Research in low-resource TTS for South Asian languages
  • Voice interface applications targeting Urdu speakers (~230M native speakers globally)

Out-of-Scope Use

  • Real-time streaming TTS (latency not optimized)
  • Non-Urdu languages (use the base Dia-1.6B for multilingual use)
  • Voice cloning of specific individuals without consent

Usage

Installation

pip install torch torchaudio soundfile
pip install git+https://github.com/nari-labs/dia.git

Basic Inference

import torch
import soundfile as sf
from huggingface_hub import hf_hub_download
from dia.model import Dia

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the base model, then swap in the fine-tuned Urdu weights
model = Dia.from_pretrained("nari-labs/Dia-1.6B", device=device)

print("Loading fine-tuned weights...")
ckpt_path = hf_hub_download(repo_id="mahwizzzz/Dia-1.6B-Urdu", filename="model.pth")
ckpt = torch.load(ckpt_path, map_location=device)
model.model.load_state_dict(ckpt, strict=True)
model.model.eval()
model.model = model.model.float()

# Urdu input text with the required [ur] language tag
text = "[ur] روزے کے وقت کھانے کی سہولت دینے کیا منع ہے؟"

with torch.no_grad():
    output = model.generate(
        text,
        max_tokens=512,
        cfg_scale=3.0
    )

output_path = "urdu_output.wav"
sf.write(output_path, output, 44100)

Language Tag

Always prefix your Urdu text with [ur] for best results:

text = "[ur] آپ کا متن یہاں لکھیں۔"
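A small helper (hypothetical, not part of the released code) can make sure the tag is always present before text reaches `model.generate`:

```python
def ensure_ur_tag(text: str, tag: str = "[ur]") -> str:
    """Prepend the Urdu language tag if the text does not already start with it."""
    text = text.strip()
    return text if text.startswith(tag) else f"{tag} {text}"

print(ensure_ur_tag("آپ کا متن یہاں لکھیں۔"))  # tag gets prepended
print(ensure_ur_tag("[ur] آپ کا متن یہاں لکھیں۔"))  # already tagged, unchanged
```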

Training Details

Dataset

  • Size: 221,832 audio-text pairs after filtering
  • Duration filter: 1.0s – 15.0s per clip
  • Rejected clips: 97 (duration out of range), 0 corrupted
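The duration filter described above can be sketched as a simple predicate (a hypothetical helper; the actual filtering script is not part of this release):

```python
def keep_clip(duration_s: float, min_s: float = 1.0, max_s: float = 15.0) -> bool:
    """Return True if a clip's duration falls inside the accepted range."""
    return min_s <= duration_s <= max_s

# Example: partition (clip_id, duration) pairs into kept and rejected
clips = [("a", 0.5), ("b", 3.2), ("c", 14.9), ("d", 15.1)]
kept = [cid for cid, dur in clips if keep_clip(dur)]
rejected = [cid for cid, dur in clips if not keep_clip(dur)]
print(kept)      # ['b', 'c']
print(rejected)  # ['a', 'd']
```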

Audio Preprocessing

  • Resampled from 16,000 Hz → 44,100 Hz (ffmpeg)
  • Loudness normalized to -23 LUFS (pyloudnorm)
  • Peak limiting at 0.99
  • Format: 16-bit PCM mono WAV
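The peak-limiting step can be illustrated with a minimal sketch (pure-Python for clarity; the pipeline described above operates on WAV sample arrays, and the loudness normalization itself is handled by pyloudnorm):

```python
def peak_limit(samples, peak: float = 0.99):
    """Scale the whole clip down uniformly if any sample exceeds the peak ceiling."""
    max_abs = max(abs(s) for s in samples)
    if max_abs <= peak:
        return list(samples)  # already within limits, leave untouched
    scale = peak / max_abs
    return [s * scale for s in samples]

limited = peak_limit([0.5, -1.2, 0.8])
print(max(abs(s) for s in limited))  # at most 0.99
```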

Training Configuration

| Hyperparameter | Value |
|---|---|
| Base model | nari-labs/Dia-1.6B |
| Precision | float16 (--half) |
| Batch size | 2 |
| Learning rate | 1e-5 |
| LR schedule | Cosine with warmup |
| Optimizer | AdamW |
| Hardware | NVIDIA RTX 3090 24GB |
| Framework | PyTorch 2.9.1 + CUDA 12.8 |
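The warmup length and total step count are not stated in this card; the sketch below only shows the standard cosine-with-warmup shape, with illustrative values:

```python
import math

def lr_at(step: int, base_lr: float = 1e-5,
          warmup_steps: int = 1000, total_steps: int = 50000) -> float:
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(500))    # halfway through warmup: 5e-06
print(lr_at(1000))   # end of warmup: 1e-05
print(lr_at(50000))  # end of training: 0.0
```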

Train/Validation Split

  • Train: 221,167 samples
  • Validation: 665 samples
  • Split ratio: 99.7% / 0.3%
  • Random seed: 42
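A reproducible split matching the counts above could look like this (a hypothetical sketch; the actual split script is not included in this release):

```python
import random

def split_dataset(items, val_size: int = 665, seed: int = 42):
    """Shuffle deterministically, then carve off a fixed-size validation set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[val_size:], shuffled[:val_size]

train, val = split_dataset(range(221_832))
print(len(train), len(val))  # 221167 665
```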

Evaluation

Evaluation audio samples are generated every 2,000 training steps using the Urdu test sentence:

آپ کو اپنے ملک یا علاقے کے قوانین کو جاننے کیلئے مقامی حکومت یا قانونی مشاور سے رجوع کرنا چاہئے۔

("You should consult your local government or a legal advisor to learn the laws of your country or region.")

Formal MOS (Mean Opinion Score) evaluation is planned for future releases.


Limitations

  • Early checkpoints: may produce rough or accented speech; quality improves significantly with more training steps
  • OOV words: Roman Urdu or mixed-script text may not synthesize correctly
  • Prosody: Long sentences may have unnatural prosody at early checkpoints
  • DAC dependency: Requires the DAC (Descript Audio Codec) model for inference

Ethical Considerations

  • This model was trained exclusively on publicly available Urdu narrations
  • No private or personally identifiable voice data was used
  • The model should not be used to synthesize speech impersonating real individuals without consent
  • Misuse for disinformation or voice fraud is strictly prohibited

Citation

If you use this model in your research or applications, please cite:

@misc{dia_urdu_2026,
  title        = {Dia-1.6B-Urdu: Fine-tuned Urdu Text-to-Speech},
  author       = {Mahwiz Khalil},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/mahwizzzz/Dia-1.6B-Urdu}},
  note         = {Fine-tuned from nari-labs/Dia-1.6B on Urdu corpus}
}

Also cite the original Dia model:

@misc{dia2025,
  title  = {Dia: A 1.6B Dialogue Text-to-Speech Model},
  author = {Nari Labs},
  year   = {2025},
  url    = {https://github.com/nari-labs/dia}
}

Acknowledgements

  • Nari Labs for the Dia-1.6B base model and fine-tuning pipeline
  • Descript for the DAC audio codec

Model trained and released by mahwizzzz — Muhammad Mahwiz Khalil
