# Bangladeshi Bangla TTS – VITS
A Text-to-Speech model for Bangladeshi Bangla (Bengali) built using the VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) architecture. Trained to capture authentic Bangladeshi pronunciation and accent patterns.
## Model Details
| Property | Value |
|---|---|
| Architecture | VITS (end-to-end TTS) |
| Language | Bangla (bn) |
| Focus | Bangladeshi pronunciation & accent |
| License | Apache 2.0 |
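A minimal inference sketch. The `VitsModel`/`AutoTokenizer` loading path is an assumption on my part: it requires the checkpoint to be exported in the Hugging Face `transformers` VITS layout (a Coqui-TTS checkpoint would load differently), and `torch` and `transformers` must be installed.

```python
def synthesize(text: str, model_id: str = "EMTIAZZ/bangladeshi-bangla-tts-vits"):
    """Return (waveform array, sampling rate) for Bangla `text`.

    Assumes the checkpoint follows the `transformers` VITS layout
    (assumption, not verified here).
    """
    import torch
    from transformers import AutoTokenizer, VitsModel

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = VitsModel.from_pretrained(model_id)

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # VITS is end-to-end: the model emits a waveform directly,
        # with no separate vocoder step.
        waveform = model(**inputs).waveform  # shape: (batch, samples)
    return waveform[0].numpy(), model.config.sampling_rate


if __name__ == "__main__":
    audio, sr = synthesize("আমি বাংলায় কথা বলি।")
    print(sr, audio.shape)
```

The returned array can then be written to disk, e.g. with `scipy.io.wavfile.write("out.wav", sr, audio)`.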
## About VITS
VITS (Kim et al., 2021) is an end-to-end TTS model that:
- Directly learns a mapping from text to waveform (no separate vocoder)
- Uses Variational Autoencoder (VAE) + Normalizing Flows + GAN discriminator
- Achieves near human-level naturalness on single-speaker datasets
- Offers faster inference than two-stage (acoustic model + vocoder) pipelines
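The normalizing-flow component is built from invertible transforms such as affine coupling layers. A toy pure-Python sketch of the invertibility property (the scale and shift values here are made up; in a real flow they are predicted by a small network from the untouched half, which is what keeps the transform invertible):

```python
import math

def coupling_forward(x1, x2, log_scale, shift):
    # Affine coupling: the first half passes through unchanged; the second
    # half is scaled and shifted. Because x1 is preserved, the inverse can
    # recompute the same log_scale/shift and undo the transform exactly.
    return x1, x2 * math.exp(log_scale) + shift

def coupling_inverse(y1, y2, log_scale, shift):
    # Exact inverse: undo the shift, then the scale.
    return y1, (y2 - shift) * math.exp(-log_scale)

x1, x2 = 0.7, -1.3  # toy latent values
y1, y2 = coupling_forward(x1, x2, log_scale=0.5, shift=0.2)
rx1, rx2 = coupling_inverse(y1, y2, log_scale=0.5, shift=0.2)
print(abs(rx1 - x1), abs(rx2 - x2))  # both ~0: the round trip is exact
```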
## Why Bangla TTS?
Bangla is spoken by 230M+ people worldwide, making it the 7th most spoken language globally. Yet, high-quality, publicly available Bangla TTS models remain scarce. This model is part of an effort to bridge that gap for the Bangladeshi speech community.
## Related Models (Larger Scale)
For production-quality Bangla TTS, see the Orpheus 3B fine-tuned models which use a much larger LLM backbone:
| Model | Architecture (training data) | Quality |
|---|---|---|
| This model | VITS | Good |
| orpheus-3b-bangla-high-data | Orpheus 3B (99K samples) | Best |
| orpheus-3b-bangla-small-data | Orpheus 3B (39K samples) | Better |
## Repository
Training code and the data pipeline are available at [github.com/EMTIAZ-RUET/bangladeshi-bangla-tts-finetuning](https://github.com/EMTIAZ-RUET/bangladeshi-bangla-tts-finetuning).
## Citation
```bibtex
@misc{emtiaz2025banglavits,
  author = {Emtiaz Uddin Ahmed},
  title  = {Bangladeshi Bangla TTS -- VITS},
  year   = {2025},
  url    = {https://huggingface.co/EMTIAZZ/bangladeshi-bangla-tts-vits}
}
```
## Author
Emtiaz Uddin Ahmed – AI/ML Engineer at Markopolo.ai, Dhaka, Bangladesh.