Bangladeshi Bangla TTS (VITS)

A Text-to-Speech model for Bangladeshi Bangla (Bengali) built using the VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) architecture. Trained to capture authentic Bangladeshi pronunciation and accent patterns.

Model Details

| Property     | Value                               |
|--------------|-------------------------------------|
| Architecture | VITS (end-to-end TTS)               |
| Language     | Bangla (bn)                         |
| Focus        | Bangladeshi pronunciation & accent  |
| License      | Apache 2.0                          |

About VITS

VITS (Kim et al., 2021) is an end-to-end TTS model that:

  • Learns a direct mapping from text to waveform, with no separate vocoder
  • Combines a variational autoencoder (VAE), normalizing flows, and a GAN discriminator
  • Achieves near human-level naturalness on single-speaker datasets
  • Offers faster inference than two-stage (acoustic model + vocoder) pipelines
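Because VITS generates the waveform directly, inference is a single forward pass. The snippet below is a minimal sketch, assuming this checkpoint is compatible with the standard Hugging Face `VitsModel`/`AutoTokenizer` classes (the repo id is taken from the citation URL and is not verified here):

```python
def synthesize(text: str, repo_id: str = "EMTIAZZ/bangladeshi-bangla-tts-vits"):
    """Return (waveform tensor, sampling rate) for the given Bangla text.

    Sketch only: assumes the checkpoint loads with transformers' VITS classes.
    """
    # Local imports so the sketch can be read without torch/transformers installed.
    import torch
    from transformers import AutoTokenizer, VitsModel

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = VitsModel.from_pretrained(repo_id)

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # VitsModel returns the audio directly; shape is (batch, num_samples).
        waveform = model(**inputs).waveform
    return waveform[0], model.config.sampling_rate


if __name__ == "__main__":
    audio, sr = synthesize("আমার সোনার বাংলা")  # sample Bangla text
    print(audio.shape, sr)
```

The returned tensor can be written to disk with, for example, `scipy.io.wavfile.write(path, sr, audio.numpy())`.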

Why Bangla TTS?

Bangla is spoken by 230M+ people worldwide, making it the 7th most spoken language globally. Yet, high-quality, publicly available Bangla TTS models remain scarce. This model is part of an effort to bridge that gap for the Bangladeshi speech community.

Related Models (Larger Scale)

For production-quality Bangla TTS, see the Orpheus 3B fine-tuned models, which use a much larger LLM backbone:

| Model                        | Scale                    | Quality |
|------------------------------|--------------------------|---------|
| This model                   | VITS                     | Good    |
| orpheus-3b-bangla-high-data  | Orpheus 3B (99K samples) | Best    |
| orpheus-3b-bangla-small-data | Orpheus 3B (39K samples) | Better  |

Repository

The training code and data pipeline are available at: github.com/EMTIAZ-RUET/bangladeshi-bangla-tts-finetuning

Citation

@misc{emtiaz2025banglavits,
  author = {Emtiaz Uddin Ahmed},
  title  = {Bangladeshi Bangla TTS (VITS)},
  year   = {2025},
  url    = {https://huggingface.co/EMTIAZZ/bangladeshi-bangla-tts-vits}
}

Author

Emtiaz Uddin Ahmed, AI/ML Engineer at Markopolo.ai, Dhaka, Bangladesh. GitHub · Portfolio
