Chatterbox Indic LoRA — Indian Language TTS

LoRA adapters + extended tokenizer to add 8 Indian languages to Chatterbox-Multilingual by Resemble AI.

No phoneme engineering. No G2P. Just grapheme-level fine-tuning on 1.4% of the model parameters.

Article Series: Teaching an AI to Speak Indian Languages


Open In Colab


Audio Samples

Hindi (hi) — CER 0.1058

Male Female

Telugu (te) — CER 0.2853

Male Female

Kannada (kn) — CER 0.1434

Male Female

Bengali (bn) — CER 0.2450

Male

Tamil (ta) — CER 0.1608

Male Female

Malayalam (ml) — CER 0.8593

Male Female

Marathi (mr) — CER 0.1976

Male Female

Gujarati (gu) — CER 0.2377

Male Female

Supported Languages

Language Script Training Data CER (mean) Status
Hindi Devanagari ~10h (IndicTTS) 0.1058 Stable
Telugu Telugu ~52h (IndicTTS + ai4bharat Rasa) 0.2853 Trained
Kannada Kannada ~7h (IndicTTS) 0.1434 Trained
Bengali Bengali ~15h (IndicTTS) 0.2450 Trained
Tamil Tamil ~10h (IndicTTS + ai4bharat Rasa) 0.1608 Trained
Malayalam Malayalam ~10h (IndicTTS + ai4bharat Rasa) 0.8593 Experimental
Marathi Devanagari ~10h (IndicTTS + ai4bharat Rasa) 0.1976 Trained
Gujarati Gujarati ~10h (IndicTTS + ai4bharat Rasa) 0.2377 Trained
English Latin — Preserved Base model (frozen)

CER measured via Whisper large-v3 ASR on 100 held-out samples per language.
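
For reference, CER here is the character-level edit distance between the Whisper transcript and the ground-truth text, normalised by reference length. A minimal pure-Python sketch (the actual evaluation presumably normalises punctuation and whitespace before scoring):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    # Character Error Rate: edit distance normalised by reference length.
    return levenshtein(reference, hypothesis) / max(len(reference), 1)
```

Per-language CER is then the mean of this score over the 100 held-out samples.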


How It Works

The base Chatterbox-Multilingual model supports 23 languages, but covers no Dravidian languages and no Indo-Aryan languages beyond Hindi. This adapter extends it by:

  1. Extended Tokenizer — Added graphemes for Telugu, Kannada, Bengali, Tamil, Malayalam, Marathi, Gujarati to the MTLTokenizer vocabulary (2454 → 2871 tokens)
  2. Brahmic Warm-Start — New character embeddings initialized from phonetically equivalent Devanagari characters (e.g., Telugu "క" ← Hindi "क")
  3. LoRA Fine-Tuning — Rank-32 adapters on q/k/v/o projections of the T3 Llama backbone (~7.8M trainable params / 544M total)
  4. Gradient Masking — Original embedding rows frozen during training; only new script embeddings update

The speech vocabulary, vocoder (S3Gen), and speaker encoder remain completely frozen. Only T3's text understanding is adapted.
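
A framework-agnostic sketch of the tokenizer extension and warm-start, assuming the embedding table is a list of row vectors and the init map mirrors tokenizer/brahmic_init_map.json (the names here are illustrative, not the repo's actual API):

```python
import random

def extend_with_warm_start(emb, vocab, new_chars, init_map, dim=4):
    """Append embedding rows for new graphemes. Rows with a phonetic
    Devanagari equivalent in init_map are copied from that row (Brahmic
    warm-start); the rest are random-initialised. Existing rows are never
    modified, mirroring the gradient masking that freezes them."""
    for ch in new_chars:
        if ch in vocab:           # already in the base vocabulary
            continue
        vocab[ch] = len(emb)      # assign the next free token id
        src = init_map.get(ch)
        if src is not None and src in vocab:
            emb.append(list(emb[vocab[src]]))   # copy the Devanagari row
        else:
            emb.append([random.gauss(0.0, 0.02) for _ in range(dim)])
    return emb, vocab

# Telugu "క" warm-started from Hindi "क":
vocab = {"क": 0}
emb = [[0.1, 0.2, 0.3, 0.4]]
emb, vocab = extend_with_warm_start(emb, vocab, ["క"], {"క": "क"})
```

In the real model this operates on the T3 text-embedding tensor, and gradient masking zeroes updates to the original rows during backprop.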


Quick Start

Option A: Python (3 lines)

Install from the fork (not pip install chatterbox-tts — that has dependency conflicts):

# 1. Install PyTorch for your GPU first (example for CUDA 12.8 / Blackwell / 50-series):
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128

# 2. Install from fork (relaxed deps, Indic support built in):
pip install git+https://github.com/reenigne314/chatterbox-indic-lora.git

Then generate speech:

import soundfile as sf
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# Load base model + LoRA + tokenizer + speaker — all in one call
model = ChatterboxMultilingualTTS.from_indic_lora(device="cuda", speaker="te_female")

# Generate Telugu speech
wav = model.generate("నమస్కారం, మీరు ఎలా ఉన్నారు?", language_id="te")
sf.write("output_telugu.wav", wav.squeeze(0).cpu().numpy(), model.sr)
# Switch speaker on the fly
from chatterbox.mtl_tts import Conditionals
model.conds = Conditionals.load("path/to/hi_male.pt").to("cuda")
wav = model.generate("नमस्ते, आप कैसे हैं?", language_id="hi")
sf.write("output_hindi.wav", wav.squeeze(0).cpu().numpy(), model.sr)

Option B: Docker (one command)

git clone https://huggingface.co/reenigne314/chatterbox-indic-lora
cd chatterbox-indic-lora
docker compose up
# Open http://localhost:7860

Option C: Gradio Web UI

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install git+https://github.com/reenigne314/chatterbox-indic.git
pip install "gradio>=4.0.0"

python app.py              # http://localhost:7860
python app.py --share      # public link

Available Speakers

File Language Gender
hi_female.pt / hi_male.pt Hindi Female / Male
te_female.pt / te_male.pt Telugu Female / Male
kn_female.pt / kn_male.pt Kannada Female / Male
bn_female.pt / bn_male.pt Bengali Female / Male
ta_female.pt / ta_male.pt Tamil Female / Male
ml_female.pt / ml_male.pt Malayalam Female / Male
mr_female.pt / mr_male.pt Marathi Female / Male
gu_female.pt / gu_male.pt Gujarati Female / Male
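
The table above maps directly onto the conds/ directory, so a full sweep over all 16 voices is just a loop. The sketch below only builds the (language, conds-file) pairs; loading a speaker and generating audio then follows the speaker-switching snippet in the Quick Start:

```python
# Hypothetical helper: enumerate the 16 speaker conditioning files
# shipped in conds/, as (language_id, path) pairs.
LANGS = ["hi", "te", "kn", "bn", "ta", "ml", "mr", "gu"]

def speaker_files(langs=LANGS, genders=("female", "male")):
    # Yields e.g. ("te", "conds/te_female.pt")
    for lang in langs:
        for gender in genders:
            yield lang, f"conds/{lang}_{gender}.pt"

pairs = list(speaker_files())   # 8 languages x 2 genders = 16 files
```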

Included Files

.
├── app.py                             # Gradio Web UI
├── Dockerfile                         # Docker support
├── docker-compose.yml
├── requirements.txt
├── checkpoints/
│   └── best.pt                        # LoRA weights + extended embeddings
├── tokenizer/
│   ├── extended_tokenizer.json        # Extended vocab (2454 → 2871 tokens)
│   └── brahmic_init_map.json          # Brahmic → Devanagari mapping
├── conds/
│   ├── {lang}_{gender}.pt             # 16 speaker conditioning files
│   └── conds_manifest.json            # Speaker metadata
└── README.md                          # This file

Base model not included. from_indic_lora() auto-downloads it from ResembleAI/chatterbox on first run.


Training Details

Setting Value
Base model Chatterbox-Multilingual (T3 Llama 520M)
LoRA rank 32
LoRA alpha 64
LoRA targets q_proj, k_proj, v_proj, o_proj
Trainable params ~7.8M / 544M (1.4%)
Precision bf16
Hardware 1x RTX PRO 6000 Blackwell (96GB)
Primary data SPRINGLab IndicTTS, ai4bharat Rasa
Training script scripts/train_t3_lora.py
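
Under these settings, the adapter setup in scripts/train_t3_lora.py plausibly looks like the following Hugging Face peft configuration. This is a sketch matching the table, not the repo's actual code; lora_dropout and bias are assumptions:

```python
from peft import LoraConfig

# Rank-32 adapters on the attention projections of the T3 Llama backbone,
# matching the table above. alpha / r = 2.0 scaling.
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,   # assumed; not stated in the card
    bias="none",
)

# The backbone would then be wrapped (t3_backbone is the loaded T3 model):
# from peft import get_peft_model
# t3 = get_peft_model(t3_backbone, lora_cfg)
# t3.print_trainable_parameters()   # expect ~7.8M trainable / 544M total
```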

Training Approach

Languages were added incrementally with weighted sampling to prevent catastrophic forgetting:

  • Round 1: Hindi only (validate pipeline)
  • Round 2: Telugu + Hindi (extended vocab, Brahmic warm-start)
  • Round 3: Telugu-heavy with larger dataset (ai4bharat Rasa ~52h)
  • Round 4: Telugu refinement with expanded data
  • Round 5: Kannada + Telugu + Hindi
  • Round 6: All 8 languages (Hi, Te, Kn, Bn, Ta, Ml, Mr, Gu)

Hindi CER improved even after adding new languages — no catastrophic forgetting observed.
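
The weighted sampling in these rounds can be implemented as a simple per-batch language draw, so earlier languages keep appearing during later rounds. The weights below are illustrative, not the ones actually used in training:

```python
import random

def make_sampler(weights: dict[str, float], seed: int = 0):
    """Return a callable that draws a language id per training batch,
    with probability proportional to the given weights."""
    rng = random.Random(seed)
    langs = list(weights)
    total = sum(weights.values())
    probs = [weights[l] / total for l in langs]
    def next_lang() -> str:
        return rng.choices(langs, weights=probs, k=1)[0]
    return next_lang

# e.g. a Telugu-heavy round that still replays Hindi and Kannada:
sampler = make_sampler({"te": 0.6, "hi": 0.3, "kn": 0.1})
```

Replaying previously learned languages at a reduced rate is a standard guard against catastrophic forgetting during incremental fine-tuning.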


Limitations

  • Malayalam CER is high (0.86). The model struggles with Malayalam and likely needs more training data or dedicated fine-tuning. Treat Malayalam as experimental.
  • CER is the primary metric. Naturalness (MOS), speaker similarity, and prosody have not been formally evaluated yet. The audio sounds clean to the ear, but systematic subjective evaluation is pending.
  • 2 speakers per language. Training data has one male and one female speaker from IndicTTS per language. The model may not generalize well to all voice types.
  • No code-mix yet. Hindi+English or Telugu+English mixed sentences are not specifically trained. This is planned for a future release.
  • Single codebook. Chatterbox uses single-stream S3 tokens (25 Hz). Fine acoustic details may be less sharp than multi-codebook systems.

Citation

If you use this model, please cite both this work and the original Chatterbox:

@misc{chatterbox_indic_lora_2025,
  author       = {Bharadwaj Kommanamanchi},
  title        = {Chatterbox Indic LoRA — Indian Language TTS via Grapheme-Level Fine-Tuning},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/reenigne314/chatterbox-indic-lora}},
  note         = {LoRA adapters for Chatterbox-Multilingual}
}

@misc{chatterboxtts2025,
  author       = {{Resemble AI}},
  title        = {{Chatterbox-TTS}},
  year         = {2025},
  howpublished = {\url{https://github.com/resemble-ai/chatterbox}},
  note         = {GitHub repository}
}

Acknowledgements

  • Resemble AI — for open-sourcing Chatterbox under the MIT license. This work would not exist without their model and architecture.
  • SPRINGLab / IIT Madras — IndicTTS dataset
  • ai4bharat — Rasa dataset for Telugu
  • CosyVoice — S3Gen architecture (adapted by Resemble AI)
  • Meta / Llama 3 — T3 backbone architecture