# Bangladeshi Bangla TTS – VITS
A Text-to-Speech model for Bangladeshi Bangla (Bengali) built using the VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) architecture. Trained to capture authentic Bangladeshi pronunciation and accent patterns.
## Model Details
| Property | Value |
|---|---|
| Architecture | VITS (end-to-end TTS) |
| Language | Bangla (bn) |
| Focus | Bangladeshi pronunciation & accent |
| License | Apache 2.0 |
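A minimal inference sketch. The `VitsModel`/`AutoTokenizer` loading path is an assumption on my part: it requires the checkpoint to be exported in the Hugging Face `transformers` VITS layout (a Coqui-TTS checkpoint would load differently), and `torch` and `transformers` must be installed.

```python
def synthesize(text: str, model_id: str = "EMTIAZZ/bangladeshi-bangla-tts-vits"):
    """Return (waveform array, sampling rate) for Bangla `text`.

    Assumes the checkpoint follows the `transformers` VITS layout
    (assumption, not verified here).
    """
    import torch
    from transformers import AutoTokenizer, VitsModel

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = VitsModel.from_pretrained(model_id)

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # VITS is end-to-end: the model emits a waveform directly,
        # with no separate vocoder step.
        waveform = model(**inputs).waveform  # shape: (batch, samples)
    return waveform[0].numpy(), model.config.sampling_rate


if __name__ == "__main__":
    audio, sr = synthesize("আমি বাংলায় কথা বলি।")
    print(sr, audio.shape)
```

The returned array can then be written to disk, e.g. with `scipy.io.wavfile.write("out.wav", sr, audio)`.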
## About VITS
VITS (Kim et al., 2021) is an end-to-end TTS model that:
- Directly learns a mapping from text to waveform (no separate vocoder)
- Uses Variational Autoencoder (VAE) + Normalizing Flows + GAN discriminator
- Achieves near human-level naturalness on single-speaker datasets
- Offers faster inference than two-stage (acoustic model + vocoder) pipelines
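The normalizing-flow component is built from invertible transforms such as affine coupling layers. A toy pure-Python sketch of the invertibility property (the scale and shift values here are made up; in a real flow they are predicted by a small network from the untouched half, which is what keeps the transform invertible):

```python
import math

def coupling_forward(x1, x2, log_scale, shift):
    # Affine coupling: the first half passes through unchanged; the second
    # half is scaled and shifted. Because x1 is preserved, the inverse can
    # recompute the same log_scale/shift and undo the transform exactly.
    return x1, x2 * math.exp(log_scale) + shift

def coupling_inverse(y1, y2, log_scale, shift):
    # Exact inverse: undo the shift, then the scale.
    return y1, (y2 - shift) * math.exp(-log_scale)

x1, x2 = 0.7, -1.3  # toy latent values
y1, y2 = coupling_forward(x1, x2, log_scale=0.5, shift=0.2)
rx1, rx2 = coupling_inverse(y1, y2, log_scale=0.5, shift=0.2)
print(abs(rx1 - x1), abs(rx2 - x2))  # both ~0: the round trip is exact
```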
## Why Bangla TTS?
Bangla is spoken by 230M+ people worldwide, making it the 7th most spoken language globally. Yet, high-quality, publicly available Bangla TTS models remain scarce. This model is part of an effort to bridge that gap for the Bangladeshi speech community.
## Related Models (Larger Scale)
For production-quality Bangla TTS, see the Orpheus 3B fine-tuned models which use a much larger LLM backbone:
| Model | Architecture (training data) | Quality |
|---|---|---|
| This model | VITS | Good |
| orpheus-3b-bangla-high-data | Orpheus 3B (99K samples) | Best |
| orpheus-3b-bangla-small-data | Orpheus 3B (39K samples) | Better |
## Repository
Training code and the data pipeline are available at [github.com/EMTIAZ-RUET/bangladeshi-bangla-tts-finetuning](https://github.com/EMTIAZ-RUET/bangladeshi-bangla-tts-finetuning).
## Citation
```bibtex
@misc{emtiaz2025banglavits,
  author = {Emtiaz Uddin Ahmed},
  title  = {Bangladeshi Bangla TTS -- VITS},
  year   = {2025},
  url    = {https://huggingface.co/EMTIAZZ/bangladeshi-bangla-tts-vits}
}
```
## Author
Emtiaz Uddin Ahmed – AI/ML Engineer at Markopolo.ai, Dhaka, Bangladesh.