Chatterbox Indic LoRA: Indian Language TTS
LoRA adapters + extended tokenizer to add 8 Indian languages to Chatterbox-Multilingual by Resemble AI.
No phoneme engineering. No G2P. Just grapheme-level fine-tuning on 1.4% of the model parameters.
Audio Samples
| Language | CER | Samples |
|---|---|---|
| Hindi (hi) | 0.1058 | Male / Female |
| Telugu (te) | 0.2853 | Male / Female |
| Kannada (kn) | 0.1434 | Male / Female |
| Bengali (bn) | 0.2450 | Male |
| Tamil (ta) | 0.1608 | Male / Female |
| Malayalam (ml) | 0.8593 | Male / Female |
| Marathi (mr) | 0.1976 | Male / Female |
| Gujarati (gu) | 0.2377 | Male / Female |
Supported Languages
| Language | Script | Training Data | CER (mean) | Status |
|---|---|---|---|---|
| Hindi | Devanagari | ~10h (IndicTTS) | 0.1058 | Stable |
| Telugu | Telugu | ~52h (IndicTTS + ai4bharat Rasa) | 0.2853 | Trained |
| Kannada | Kannada | ~7h (IndicTTS) | 0.1434 | Trained |
| Bengali | Bengali | ~15h (IndicTTS) | 0.2450 | Trained |
| Tamil | Tamil | ~10h (IndicTTS + ai4bharat Rasa) | 0.1608 | Trained |
| Malayalam | Malayalam | ~10h (IndicTTS + ai4bharat Rasa) | 0.8593 | Experimental |
| Marathi | Devanagari | ~10h (IndicTTS + ai4bharat Rasa) | 0.1976 | Trained |
| Gujarati | Gujarati | ~10h (IndicTTS + ai4bharat Rasa) | 0.2377 | Trained |
| English | Latin | N/A | Preserved | Base model (frozen) |
CER measured via Whisper large-v3 ASR on 100 held-out samples per language.
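For reference, character error rate (CER) is the Levenshtein edit distance between the ASR transcript and the reference text, normalized by reference length. A minimal, self-contained implementation:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / len(ref)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

print(cer("abc", "abd"))  # one substitution over three characters, ~0.33
```

Note that CER computed this way depends on the ASR model's own error floor, so the numbers above bound model quality only loosely.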
How It Works
The base Chatterbox-Multilingual model supports 23 languages, but no Dravidian languages and no Indo-Aryan languages beyond Hindi. This adapter extends it by:
- Extended Tokenizer: graphemes for Telugu, Kannada, Bengali, Tamil, Malayalam, Marathi, and Gujarati added to the MTLTokenizer vocabulary (2454 → 2871 tokens)
- Brahmic Warm-Start: new character embeddings initialized from phonetically equivalent Devanagari characters (e.g., Telugu "క" from Hindi "क")
- LoRA Fine-Tuning: rank-32 adapters on the q/k/v/o projections of the T3 Llama backbone (~7.8M trainable params out of 544M total)
- Gradient Masking: original embedding rows frozen during training; only the new script embeddings update
The speech vocabulary, vocoder (S3Gen), and speaker encoder remain completely frozen. Only T3's text understanding is adapted.
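The warm-start and gradient-masking steps can be sketched in PyTorch. This is an illustrative toy, not the actual training code: the embedding width, token ids, and `init_map` below are made-up placeholders; only the vocabulary sizes come from this card.

```python
import torch
import torch.nn as nn

# Vocab sizes from the card; DIM and the token ids are assumed placeholders.
OLD_VOCAB, NEW_VOCAB, DIM = 2454, 2871, 64

emb = nn.Embedding(NEW_VOCAB, DIM)

# Brahmic warm-start: copy the embedding of a phonetically equivalent
# Devanagari token into each new-script row (ids here are illustrative).
init_map = {2500: 120}  # new Telugu token id -> existing Devanagari token id
with torch.no_grad():
    for new_id, src_id in init_map.items():
        emb.weight[new_id] = emb.weight[src_id]

# Gradient masking: zero the gradient for the original rows so only the
# newly added script embeddings receive updates during training.
def mask_old_rows(grad):
    grad = grad.clone()
    grad[:OLD_VOCAB] = 0
    return grad

emb.weight.register_hook(mask_old_rows)

# One toy backward pass touching an old token (100) and a new token (2500).
loss = emb(torch.tensor([100, 2500])).sum()
loss.backward()
```

After `backward()`, `emb.weight.grad` is zero for every original row (the old token at id 100 included) and nonzero only for the new Telugu row, which is the behavior the card describes.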
Quick Start
Option A: Python (3 lines)
Install from the fork (not `pip install chatterbox-tts`, which has dependency conflicts):

```bash
# 1. Install PyTorch for your GPU first (example for CUDA 12.8 / Blackwell / 50-series):
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128

# 2. Install from the fork (relaxed deps, Indic support built in):
pip install git+https://github.com/reenigne314/chatterbox-indic-lora.git
```
Then generate speech:

```python
import soundfile as sf
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# Load base model + LoRA + tokenizer + speaker, all in one call
model = ChatterboxMultilingualTTS.from_indic_lora(device="cuda", speaker="te_female")

# Generate Telugu speech
wav = model.generate("నమస్కారం, మీరు ఎలా ఉన్నారు?", language_id="te")
sf.write("output_telugu.wav", wav.squeeze(0).cpu().numpy(), model.sr)

# Switch speaker on the fly
from chatterbox.mtl_tts import Conditionals
model.conds = Conditionals.load("path/to/hi_male.pt").to("cuda")
wav = model.generate("नमस्ते, आप कैसे हैं?", language_id="hi")
sf.write("output_hindi.wav", wav.squeeze(0).cpu().numpy(), model.sr)
```
Option B: Docker (one command)
```bash
git clone https://huggingface.co/reenigne314/chatterbox-indic-lora
cd chatterbox-indic-lora
docker compose up
# Open http://localhost:7860
```
Option C: Gradio Web UI
```bash
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install git+https://github.com/reenigne314/chatterbox-indic.git
pip install "gradio>=4.0.0"

python app.py          # http://localhost:7860
python app.py --share  # public link
```
Available Speakers
| File | Language | Gender |
|---|---|---|
| hi_female.pt / hi_male.pt | Hindi | Female / Male |
| te_female.pt / te_male.pt | Telugu | Female / Male |
| kn_female.pt / kn_male.pt | Kannada | Female / Male |
| bn_female.pt / bn_male.pt | Bengali | Female / Male |
| ta_female.pt / ta_male.pt | Tamil | Female / Male |
| ml_female.pt / ml_male.pt | Malayalam | Female / Male |
| mr_female.pt / mr_male.pt | Marathi | Female / Male |
| gu_female.pt / gu_male.pt | Gujarati | Female / Male |
Included Files
```
.
├── app.py                       # Gradio Web UI
├── Dockerfile                   # Docker support
├── docker-compose.yml
├── requirements.txt
├── checkpoints/
│   └── best.pt                  # LoRA weights + extended embeddings
├── tokenizer/
│   ├── extended_tokenizer.json  # Extended vocab (2454 → 2871 tokens)
│   └── brahmic_init_map.json    # Brahmic → Devanagari mapping
├── conds/
│   ├── {lang}_{gender}.pt       # 16 speaker conditioning files
│   └── conds_manifest.json      # Speaker metadata
└── README.md                    # This file
```
Base model not included. `from_indic_lora()` auto-downloads it from `ResembleAI/chatterbox` on first run.
Training Details
| Setting | Value |
|---|---|
| Base model | Chatterbox-Multilingual (T3 Llama 520M) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| LoRA targets | q_proj, k_proj, v_proj, o_proj |
| Trainable params | ~7.8M / 544M (1.4%) |
| Precision | bf16 |
| Hardware | 1x RTX PRO 6000 Blackwell (96GB) |
| Primary data | SPRINGLab IndicTTS, ai4bharat Rasa |
| Training script | scripts/train_t3_lora.py |
Training Approach
Languages were added incrementally with weighted sampling to prevent catastrophic forgetting:
- Round 1: Hindi only (validate pipeline)
- Round 2: Telugu + Hindi (extended vocab, Brahmic warm-start)
- Round 3: Telugu-heavy with larger dataset (ai4bharat Rasa ~52h)
- Round 4: Telugu refinement with expanded data
- Round 5: Kannada + Telugu + Hindi
- Round 6: All 8 languages (Hi, Te, Kn, Bn, Ta, Ml, Mr, Gu)
Hindi CER improved even after adding new languages; no catastrophic forgetting was observed.
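The weighted-sampling idea behind the rounds above can be sketched as follows. The per-language weights here are illustrative placeholders, not the actual training configuration; the point is that earlier languages keep a nonzero sampling weight so they continue to be rehearsed as new ones are added.

```python
import random
from collections import Counter

# Hypothetical Round-6-style weights: the newest focus language (Telugu)
# dominates, but every previously trained language is still sampled.
weights = {"hi": 0.15, "te": 0.35, "kn": 0.10, "bn": 0.10,
           "ta": 0.10, "ml": 0.10, "mr": 0.05, "gu": 0.05}

def sample_language() -> str:
    """Draw the language for the next training batch."""
    langs, w = zip(*weights.items())
    return random.choices(langs, weights=w, k=1)[0]

random.seed(0)
draws = Counter(sample_language() for _ in range(10_000))
# Telugu dominates, but Hindi keeps appearing, which is what prevents
# the model from forgetting it.
```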
Limitations
- Malayalam CER is high (0.86). The model struggles with Malayalam and likely needs more training data or dedicated fine-tuning. Treat Malayalam as experimental.
- CER is the primary metric. Naturalness (MOS), speaker similarity, and prosody have not been formally evaluated yet. The audio sounds clean to the ear, but systematic subjective evaluation is pending.
- 2 speakers per language. Training data has one male and one female speaker from IndicTTS per language. The model may not generalize well to all voice types.
- No code-mix yet. Hindi+English or Telugu+English mixed sentences are not specifically trained. This is planned for a future release.
- Single codebook. Chatterbox uses single-stream S3 tokens (25 Hz). Fine acoustic details may be less sharp than multi-codebook systems.
Citation
If you use this model, please cite both this work and the original Chatterbox:
```bibtex
@misc{chatterbox_indic_lora_2025,
  author       = {Bharadwaj Kommanamanchi},
  title        = {Chatterbox Indic LoRA: Indian Language TTS via Grapheme-Level Fine-Tuning},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/reenigne314/chatterbox-indic-lora}},
  note         = {LoRA adapters for Chatterbox-Multilingual}
}

@misc{chatterboxtts2025,
  author       = {{Resemble AI}},
  title        = {{Chatterbox-TTS}},
  year         = {2025},
  howpublished = {\url{https://github.com/resemble-ai/chatterbox}},
  note         = {GitHub repository}
}
```
Acknowledgements
- Resemble AI: for open-sourcing Chatterbox under the MIT license. This work would not exist without their model and architecture.
- SPRINGLab / IIT Madras: IndicTTS dataset
- ai4bharat: Rasa dataset
- CosyVoice: S3Gen architecture (adapted by Resemble AI)
- Meta / Llama 3: T3 backbone architecture