Chatterbox Indic LoRA: Indian Language TTS
LoRA adapters + extended tokenizer to add 8 Indian languages to Chatterbox-Multilingual by Resemble AI.
No phoneme engineering. No G2P. Just grapheme-level fine-tuning on 1.4% of the model parameters.
Audio Samples
| Language | CER | Samples |
|---|---|---|
| Hindi (hi) | 0.1058 | Male / Female |
| Telugu (te) | 0.2853 | Male / Female |
| Kannada (kn) | 0.1434 | Male / Female |
| Bengali (bn) | 0.2450 | Male |
| Tamil (ta) | 0.1608 | Male / Female |
| Malayalam (ml) | 0.8593 | Male / Female |
| Marathi (mr) | 0.1976 | Male / Female |
| Gujarati (gu) | 0.2377 | Male / Female |
Supported Languages
| Language | Script | Training Data | CER (mean) | Status |
|---|---|---|---|---|
| Hindi | Devanagari | ~10h (IndicTTS) | 0.1058 | Stable |
| Telugu | Telugu | ~52h (IndicTTS + ai4bharat Rasa) | 0.2853 | Trained |
| Kannada | Kannada | ~7h (IndicTTS) | 0.1434 | Trained |
| Bengali | Bengali | ~15h (IndicTTS) | 0.2450 | Trained |
| Tamil | Tamil | ~10h (IndicTTS + ai4bharat Rasa) | 0.1608 | Trained |
| Malayalam | Malayalam | ~10h (IndicTTS + ai4bharat Rasa) | 0.8593 | Experimental |
| Marathi | Devanagari | ~10h (IndicTTS + ai4bharat Rasa) | 0.1976 | Trained |
| Gujarati | Gujarati | ~10h (IndicTTS + ai4bharat Rasa) | 0.2377 | Trained |
| English | Latin | N/A | Preserved | Base model (frozen) |
CER measured via Whisper large-v3 ASR on 100 held-out samples per language.
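For reference, character error rate (CER) is the Levenshtein edit distance between the ASR transcript and the reference text, normalized by reference length. A minimal, self-contained implementation:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / len(ref)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

print(cer("abc", "abd"))  # one substitution over three characters, ~0.33
```

Note that CER computed this way depends on the ASR model's own error floor, so the numbers above bound model quality only loosely.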
How It Works
The base Chatterbox-Multilingual model supports 23 languages, but no Dravidian languages and no Indo-Aryan languages beyond Hindi. This adapter extends it by:
- Extended Tokenizer: graphemes for Telugu, Kannada, Bengali, Tamil, Malayalam, Marathi, and Gujarati added to the MTLTokenizer vocabulary (2454 → 2871 tokens)
- Brahmic Warm-Start: new character embeddings initialized from phonetically equivalent Devanagari characters (e.g., Telugu "క" from Hindi "क")
- LoRA Fine-Tuning: rank-32 adapters on the q/k/v/o projections of the T3 Llama backbone (~7.8M trainable params out of 544M total)
- Gradient Masking: original embedding rows frozen during training; only the new script embeddings update
The speech vocabulary, vocoder (S3Gen), and speaker encoder remain completely frozen. Only T3's text understanding is adapted.
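The warm-start and gradient-masking steps can be sketched in PyTorch. This is an illustrative toy, not the actual training code: the embedding width, token ids, and `init_map` below are made-up placeholders; only the vocabulary sizes come from this card.

```python
import torch
import torch.nn as nn

# Vocab sizes from the card; DIM and the token ids are assumed placeholders.
OLD_VOCAB, NEW_VOCAB, DIM = 2454, 2871, 64

emb = nn.Embedding(NEW_VOCAB, DIM)

# Brahmic warm-start: copy the embedding of a phonetically equivalent
# Devanagari token into each new-script row (ids here are illustrative).
init_map = {2500: 120}  # new Telugu token id -> existing Devanagari token id
with torch.no_grad():
    for new_id, src_id in init_map.items():
        emb.weight[new_id] = emb.weight[src_id]

# Gradient masking: zero the gradient for the original rows so only the
# newly added script embeddings receive updates during training.
def mask_old_rows(grad):
    grad = grad.clone()
    grad[:OLD_VOCAB] = 0
    return grad

emb.weight.register_hook(mask_old_rows)

# One toy backward pass touching an old token (100) and a new token (2500).
loss = emb(torch.tensor([100, 2500])).sum()
loss.backward()
```

After `backward()`, `emb.weight.grad` is zero for every original row (the old token at id 100 included) and nonzero only for the new Telugu row, which is the behavior the card describes.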
Quick Start
Option A: Python (3 lines)
Install from the fork (not `pip install chatterbox-tts`, which has dependency conflicts):

```bash
# 1. Install PyTorch for your GPU first (example for CUDA 12.8 / Blackwell / 50-series):
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128

# 2. Install from the fork (relaxed deps, Indic support built in):
pip install git+https://github.com/reenigne314/chatterbox-indic-lora.git
```
Then generate speech:

```python
import soundfile as sf
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# Load base model + LoRA + tokenizer + speaker, all in one call
model = ChatterboxMultilingualTTS.from_indic_lora(device="cuda", speaker="te_female")

# Generate Telugu speech
wav = model.generate("నమస్కారం, మీరు ఎలా ఉన్నారు?", language_id="te")
sf.write("output_telugu.wav", wav.squeeze(0).cpu().numpy(), model.sr)

# Switch speaker on the fly
from chatterbox.mtl_tts import Conditionals
model.conds = Conditionals.load("path/to/hi_male.pt").to("cuda")
wav = model.generate("नमस्ते, आप कैसे हैं?", language_id="hi")
sf.write("output_hindi.wav", wav.squeeze(0).cpu().numpy(), model.sr)
```
Option B: Docker (one command)
```bash
git clone https://huggingface.co/reenigne314/chatterbox-indic-lora
cd chatterbox-indic-lora
docker compose up
# Open http://localhost:7860
```
Option C: Gradio Web UI
```bash
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install git+https://github.com/reenigne314/chatterbox-indic.git
pip install "gradio>=4.0.0"

python app.py          # http://localhost:7860
python app.py --share  # public link
```
Available Speakers
| File | Language | Gender |
|---|---|---|
| hi_female.pt / hi_male.pt | Hindi | Female / Male |
| te_female.pt / te_male.pt | Telugu | Female / Male |
| kn_female.pt / kn_male.pt | Kannada | Female / Male |
| bn_female.pt / bn_male.pt | Bengali | Female / Male |
| ta_female.pt / ta_male.pt | Tamil | Female / Male |
| ml_female.pt / ml_male.pt | Malayalam | Female / Male |
| mr_female.pt / mr_male.pt | Marathi | Female / Male |
| gu_female.pt / gu_male.pt | Gujarati | Female / Male |
Included Files
```
.
├── app.py                       # Gradio Web UI
├── Dockerfile                   # Docker support
├── docker-compose.yml
├── requirements.txt
├── checkpoints/
│   └── best.pt                  # LoRA weights + extended embeddings
├── tokenizer/
│   ├── extended_tokenizer.json  # Extended vocab (2454 → 2871 tokens)
│   └── brahmic_init_map.json    # Brahmic → Devanagari mapping
├── conds/
│   ├── {lang}_{gender}.pt       # 16 speaker conditioning files
│   └── conds_manifest.json      # Speaker metadata
└── README.md                    # This file
```
Base model not included. `from_indic_lora()` auto-downloads it from `ResembleAI/chatterbox` on first run.
Training Details
| Setting | Value |
|---|---|
| Base model | Chatterbox-Multilingual (T3 Llama 520M) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| LoRA targets | q_proj, k_proj, v_proj, o_proj |
| Trainable params | ~7.8M / 544M (1.4%) |
| Precision | bf16 |
| Hardware | 1x RTX PRO 6000 Blackwell (96GB) |
| Primary data | SPRINGLab IndicTTS, ai4bharat Rasa |
| Training script | scripts/train_t3_lora.py |
Training Approach
Languages were added incrementally with weighted sampling to prevent catastrophic forgetting:
- Round 1: Hindi only (validate pipeline)
- Round 2: Telugu + Hindi (extended vocab, Brahmic warm-start)
- Round 3: Telugu-heavy with larger dataset (ai4bharat Rasa ~52h)
- Round 4: Telugu refinement with expanded data
- Round 5: Kannada + Telugu + Hindi
- Round 6: All 8 languages (Hi, Te, Kn, Bn, Ta, Ml, Mr, Gu)
Hindi CER improved even after adding new languages; no catastrophic forgetting was observed.
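The weighted-sampling idea behind the rounds above can be sketched as follows. The per-language weights here are illustrative placeholders, not the actual training configuration; the point is that earlier languages keep a nonzero sampling weight so they continue to be rehearsed as new ones are added.

```python
import random
from collections import Counter

# Hypothetical Round-6-style weights: the newest focus language (Telugu)
# dominates, but every previously trained language is still sampled.
weights = {"hi": 0.15, "te": 0.35, "kn": 0.10, "bn": 0.10,
           "ta": 0.10, "ml": 0.10, "mr": 0.05, "gu": 0.05}

def sample_language() -> str:
    """Draw the language for the next training batch."""
    langs, w = zip(*weights.items())
    return random.choices(langs, weights=w, k=1)[0]

random.seed(0)
draws = Counter(sample_language() for _ in range(10_000))
# Telugu dominates, but Hindi keeps appearing, which is what prevents
# the model from forgetting it.
```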
Limitations
- Malayalam CER is high (0.86). The model struggles with Malayalam and likely needs more training data or dedicated fine-tuning. Treat Malayalam as experimental.
- CER is the primary metric. Naturalness (MOS), speaker similarity, and prosody have not been formally evaluated yet. The audio sounds clean to the ear, but systematic subjective evaluation is pending.
- 2 speakers per language. Training data has one male and one female speaker from IndicTTS per language. The model may not generalize well to all voice types.
- No code-mix yet. Hindi+English or Telugu+English mixed sentences are not specifically trained. This is planned for a future release.
- Single codebook. Chatterbox uses single-stream S3 tokens (25 Hz). Fine acoustic details may be less sharp than multi-codebook systems.
Citation
If you use this model, please cite both this work and the original Chatterbox:
```bibtex
@misc{chatterbox_indic_lora_2025,
  author       = {Bharadwaj Kommanamanchi},
  title        = {Chatterbox Indic LoRA: Indian Language TTS via Grapheme-Level Fine-Tuning},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/reenigne314/chatterbox-indic-lora}},
  note         = {LoRA adapters for Chatterbox-Multilingual}
}

@misc{chatterboxtts2025,
  author       = {{Resemble AI}},
  title        = {{Chatterbox-TTS}},
  year         = {2025},
  howpublished = {\url{https://github.com/resemble-ai/chatterbox}},
  note         = {GitHub repository}
}
```
Acknowledgements
- Resemble AI: for open-sourcing Chatterbox under the MIT license. This work would not exist without their model and architecture.
- SPRINGLab / IIT Madras: IndicTTS dataset
- ai4bharat: Rasa dataset
- CosyVoice: S3Gen architecture (adapted by Resemble AI)
- Meta / Llama 3: T3 backbone architecture