Praxy Voice R6 — LoRA for Indic TTS on Chatterbox

Praxy Voice R6 is a LoRA adapter that extends ResembleAI Chatterbox Multilingual to high-quality Telugu and Tamil text-to-speech, two languages the Chatterbox base does not natively cover.

This adapter is part of a larger recipe described in the accompanying paper "Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base" (arXiv:2604.25441). The recipe is:

  1. BUPS: ISO-15919 romanisation of Indic text before tokenisation
  2. This LoRA adapter: rank-32 attention-only adapter on Chatterbox's t3 transformer (7.86M trainable params)
  3. Voice-prompt recovery: an 8–11 s reference audio clip in the target language plus three sampling overrides at inference (Config B: exaggeration 0.7, temperature 0.6, min_p 0.1)
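
Step 3 depends on the reference clip falling inside the 8 to 11 s window. A minimal pre-flight check is sketched below using only the Python standard library. It assumes the clip is an uncompressed PCM WAV file, and the helper names `clip_duration_seconds` and `check_reference_clip` are hypothetical, not part of Chatterbox or the Praxy repo.

```python
import wave

MIN_SECONDS = 8.0   # lower bound of the recommended window
MAX_SECONDS = 11.0  # upper bound of the recommended window

def clip_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

def check_reference_clip(path):
    """Return the clip duration, or raise ValueError if it is outside the window."""
    duration = clip_duration_seconds(path)
    if not (MIN_SECONDS <= duration <= MAX_SECONDS):
        raise ValueError(
            f"{path}: {duration:.2f} s is outside the "
            f"{MIN_SECONDS:.0f}-{MAX_SECONDS:.0f} s window"
        )
    return duration
```

Calling `check_reference_clip` on the clip before passing it as `audio_prompt_path` catches clips that are too short or too long before any GPU work starts.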

For Hindi, this LoRA actively regresses semantic accuracy — Hindi should be synthesised with the vanilla Chatterbox base, still applying the voice-prompt + Config B recipe. See paper §5.3 for the rationale.

What this adapter does NOT cover

Praxy v1 ships three deployment branches; this LoRA is only one of them.

| Input class | Branch | Where it lives |
|---|---|---|
| Telugu (pure script) | R6 LoRA + Chatterbox | This repo ✓ |
| Tamil (pure script) | R6 LoRA + Chatterbox | This repo ✓ |
| Hindi (pure script) | Vanilla Chatterbox + voice prompt | ResembleAI/chatterbox (no adapter needed) |
| Hi/Te/Ta with English code-mix | transliterate → IndicF5 | Recipe in praxelhq/praxy, see paper §III.E |

Code-mixed text (e.g. "मैंने WhatsApp पे message किया but notification नहीं आया।") is not in scope for this LoRA — using it on code-mix will degrade quality (the LoRA romanises English chunks into Indic phonetics). Code-mix synthesis routes through AI4Bharat IndicF5 with a native-script transliteration preprocessor; the routing is one regex (≥1 Latin word ≥2 chars) and is implemented in serving/praxy_router.py.
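
The routing rule described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of serving/praxy_router.py: the branch labels, the function names, and the script detection by Unicode block are all hypothetical, and Devanagari is treated as Hindi because that is the only Devanagari-script language this card covers.

```python
import re

# At least one run of two or more ASCII letters counts as a Latin word.
LATIN_WORD = re.compile(r"[A-Za-z]{2,}")

# Unicode blocks for the three scripts covered by this card.
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari (assumed to mean Hindi here)
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "te": (0x0C00, 0x0C7F),  # Telugu
}

def detect_language(text):
    """Return the language code whose script has the most characters, or None."""
    counts = {lang: 0 for lang in SCRIPT_RANGES}
    for ch in text:
        cp = ord(ch)
        for lang, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[lang] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None

def route(text):
    """Pick a deployment branch label (labels are illustrative)."""
    lang = detect_language(text)
    if lang is None:
        raise ValueError("no supported Indic script found")
    if LATIN_WORD.search(text):
        return "indicf5"              # code-mix branch
    if lang in ("te", "ta"):
        return "r6_lora"              # this adapter
    return "vanilla_chatterbox"       # pure-script Hindi

print(route("నేను ఇవాళ బాగున్నాను"))
print(route("मैंने WhatsApp पे message किया but notification नहीं आया।"))
```

A single stray Latin letter does not trigger the code-mix branch, because the pattern requires at least two consecutive letters.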

🌐 Try the full system (all three branches behind one UI, with voice cloning): Praxel/praxy-voice-demo Space.


Quick start

Install

pip install chatterbox-tts peft indic-transliteration indic-num2words

Telugu or Tamil (via this LoRA + BUPS)

import torch
from chatterbox.mtl_tts import ChatterboxMultilingualTTS
from peft import LoraConfig, get_peft_model
from huggingface_hub import hf_hub_download

# Load base model (frozen)
model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")
for p in model.t3.parameters(): p.requires_grad_(False)

# Wrap t3 with the same LoRA shape used at training
lora_cfg = LoraConfig(
    r=32, lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05, bias="none",
)
model.t3 = get_peft_model(model.t3, lora_cfg)

# Load Praxy R6 weights
ckpt_path = hf_hub_download("Praxel/praxy-voice-r6", "lora_state.pt")
sd = torch.load(ckpt_path, map_location="cuda")
model.t3.load_state_dict(sd, strict=False, assign=True)
model.t3.eval()

# BUPS romanisation (required for Te/Ta)
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

def bups(text, script):
    # script in {'devanagari', 'telugu', 'tamil', ...}
    script_map = {
        'devanagari': sanscript.DEVANAGARI,
        'telugu': sanscript.TELUGU,
        'tamil': sanscript.TAMIL,
    }
    return transliterate(text, script_map[script], sanscript.ISO)

text = "నేను ఇవాళ బాగున్నాను"      # "I am well today" in Telugu
text_roman = bups(text, "telugu")   # "nēnu ivāḷa bāgunnānu"

# Inference with voice-prompt recovery + Config B
wav = model.generate(
    text_roman,
    language_id="hi",                 # Hi-proxy (Te/Ta aren't in Chatterbox's 23-lang roster)
    audio_prompt_path="path/to/your_te_voice_9s.wav",  # BYOR: 8-11s same-language clip
    exaggeration=0.7,
    temperature=0.6,
    min_p=0.1,
)
# wav is a torch.Tensor; save via torchaudio or soundfile
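
Because the loading step above calls `load_state_dict` with `strict=False`, a key-name mismatch between lora_state.pt and the PEFT-wrapped module would be ignored silently and leave the adapter at its initial weights. The sketch below checks the result object that PyTorch's `load_state_dict` returns, which exposes `missing_keys` and `unexpected_keys`. The helper name `verify_lora_load` is hypothetical, and the check assumes the checkpoint holds only LoRA tensors whose names contain "lora_".

```python
def verify_lora_load(missing_keys, unexpected_keys):
    """Raise RuntimeError if the LoRA checkpoint did not load cleanly.

    missing_keys: module keys absent from the checkpoint (frozen base
        weights are expected here; LoRA tensors are not).
    unexpected_keys: checkpoint keys that matched nothing in the module.
    """
    if unexpected_keys:
        raise RuntimeError(
            f"{len(unexpected_keys)} checkpoint keys matched nothing, "
            f"e.g. {unexpected_keys[0]!r}"
        )
    missing_lora = [k for k in missing_keys if "lora_" in k]
    if missing_lora:
        raise RuntimeError(
            f"{len(missing_lora)} LoRA tensors were not loaded, "
            f"e.g. {missing_lora[0]!r}"
        )
```

In the snippet above this would be used by capturing the return value, `result = model.t3.load_state_dict(sd, strict=False, assign=True)`, and then calling `verify_lora_load(result.missing_keys, result.unexpected_keys)`.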

Hindi (via vanilla Chatterbox — NOT this LoRA)

from chatterbox.mtl_tts import ChatterboxMultilingualTTS

model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")
wav = model.generate(
    "मेरे पास एक सपना है",             # "I have a dream" in Hindi Devanagari
    language_id="hi",
    audio_prompt_path="path/to/your_hi_voice_9s.wav",  # 8-11s Hindi reference clip
    exaggeration=0.7,
    temperature=0.6,
    min_p=0.1,
)

Numbers, currency, percent (highly recommended)

Chatterbox's tokeniser fragments digit runs. Use the unified Indic normaliser before synthesis:

pip install indic-num2words

from num_to_words import num_to_word
import re

def normalise_indic(text, lang):
    def _sub(m):
        n = int(m.group(0))
        return num_to_word(n, lang).replace(",", "").strip()
    return re.sub(r"\d+", _sub, text)

# "జనవరి 26, 2026న" → "జనవరి ఇరవై ఆరు, రెండు వేల ఇరవై ఆరున" (only digit runs are replaced; the comma in the input stays)

(For a richer normaliser with currency / percent / date-ordinal handling, see praxy/linguistics/indic_numbers.py in the code repo.)


Benchmarks

Evaluated on the companion PSP (Phoneme Substitution Profile) benchmark, 10-utterance pilot sets per language:

| Lang | Metric | Praxy R6 (this repo) | Sarvam Bulbul | ElevenLabs v3 | Cartesia Sonic-3 |
|---|---|---|---|---|---|
| Te | Retroflex collapse ↓ | 26.7% | 33.3% | 40.0% | 50.0% |
| Te | PSD ↓ | 13.1 | 11.1 | 154.4 | 33.8 |
| Te | FAD ↓ | 291.3 | 250.4 | 328.9 | 458.1 |
| Te | LLM-WER ↓ | 0.033 | 0.029 | 0.041 | 0.029 |
| Ta | ZF (zha) collapse ↓ | 71% | 86% | 86% | 86% |
| Ta | Retroflex collapse ↓ | 69.2% | 70.5% | 69.2% | 69.2% |
| Ta | FAD ↓ | 276.0 | 200.3 | 239.4 | 404.3 |
| Ta | LLM-WER ↓ | 0.041 | | | |
| Hi | LLM-WER ↓ (vanilla, not this LoRA) | 0.025 | 0.007 | 0.006 | 0.025 |
| Hi | Intent ↑ | 1.00 | 0.90 | | |

Highlight: On the Telugu pilot set, Praxy's retroflex collapse rate (26.7%) is the lowest of the four systems in the table, ahead of Sarvam Bulbul (33.3%), the next-best system on this metric.

See the paper (§5) for the full table including Indic Parler-TTS, native-audio noise floors, and all four ablations.


What this adapter does and does not do

It does:

  • Extend Chatterbox Multilingual to Telugu and Tamil (languages it had zero native coverage for)
  • Preserve Chatterbox's full zero-shot voice-cloning capability via the audio_prompt_path interface
  • Work alongside the unified Indic text normaliser (numbers, currency, dates, %)

It does not:

  • Improve Hindi — use vanilla Chatterbox for Hi
  • Replace your need for a reference voice — the voice-prompt recovery recipe requires an 8–11 s same-language clip
  • Train a new acoustic model — the acoustic decoder (s3gen) stays frozen

Training details

  • Base: ResembleAI/chatterbox (MIT)
  • Adapter target: t3 transformer, attention projections only (q_proj, k_proj, v_proj, o_proj)
  • Rank / alpha / dropout: 32 / 64 / 0.05
  • Trainable params: 7.86M (0.97% of base), held in 240 tensors: one lora_A and one lora_B for each of the 4 attention projections in each of the 30 transformer layers
  • Training data: 1,886 h of IndicTTS + Rasa + FLEURS + Shrutilipi (Te/Ta/Hi, CC-BY-4.0)
  • Optimiser / schedule: AdamW (β=0.9, 0.95), cosine with 500-step warmup, peak LR 3e-6
  • Precision: bf16 mixed
  • Compute: 1× A100-80GB, 8000 steps, ~11 hours, ~$45
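
The trainable-parameter figure in the list above can be reproduced with a few lines of arithmetic. The sketch assumes every targeted projection is a square 1024 × 1024 linear layer. That width is an assumption chosen because it reproduces the stated total; it is not a value given in this card.

```python
def lora_param_count(rank, d_in, d_out, n_layers, n_proj):
    """Count LoRA parameters.

    Each adapted projection adds lora_A (rank x d_in) and lora_B (d_out x rank).
    """
    per_projection = rank * d_in + d_out * rank
    return per_projection * n_proj * n_layers

HIDDEN = 1024  # assumed projection width, not stated in this card

total = lora_param_count(rank=32, d_in=HIDDEN, d_out=HIDDEN, n_layers=30, n_proj=4)
tensors = 30 * 4 * 2  # lora_A and lora_B for each projection in each layer

print(total, tensors)  # 7864320 240
```

Under that assumption the count is 7,864,320, which rounds to the 7.86M quoted above.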

Training code: github.com/praxelhq/praxy (MIT).
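
The schedule listed under Training details (500 warmup steps, cosine decay, peak LR 3e-6, 8000 steps) can be sketched in plain Python. Linear warmup from zero and decay to zero at the final step are assumptions; the card states neither the warmup shape nor the final learning rate, and the function name `lr_at` is hypothetical.

```python
import math

def lr_at(step, peak_lr=3e-6, warmup_steps=500, total_steps=8000):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)  # clamp so steps past the end stay at zero
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

With these defaults the rate peaks at step 500 and falls to half the peak at step 4250, the midpoint of the decay phase.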


Limitations

  • 10-utt pilots only — 300-utt benchmarks planned for v2. Do not treat single-digit-percent differences as statistically separable.
  • Voice-prompt CUDA bug: Sarvam-Hi-female reference audio triggers a CUDA assertion in Chatterbox's s3gen positional embedding. Workaround: use Cartesia-Hi-female or any other Hi reference.
  • No MOS — formal subjective evaluation deferred to v2.
  • Acoustic-decoder unchanged — the LoRA does not touch s3gen; acoustic quality beyond voice-prompt-recoverable is bounded by base Chatterbox's acoustic prior.
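
The first limitation can be made concrete with a confidence interval. The sketch below computes a 95% Wilson score interval for a proportion. The counts 8/30 and 10/30 are hypothetical, chosen only because they round to 26.7% and 33.3%; this card does not state how many items were scored.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

low_a, high_a = wilson_interval(8, 30)    # hypothetical counts for 26.7%
low_b, high_b = wilson_interval(10, 30)   # hypothetical counts for 33.3%

# True when the two intervals share any values.
overlap = low_a <= high_b and low_b <= high_a
print(overlap)
```

With counts of this size the two intervals overlap heavily, which is why small percentage gaps in the pilot tables should not be read as a ranking.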

License and attribution

  • This LoRA adapter: Apache-2.0
  • Chatterbox base (not included in this repo): MIT (© Resemble AI)
  • Training data: CC-BY-4.0 / similarly permissive

If you use this model in published work, please cite the companion papers:

@misc{praxyvoice2026,
  title={Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost},
  author={Menta, Venkata Pushpak Teja},
  year={2026},
  eprint={2604.25441},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.25441}
}

@misc{psp2026,
  title={PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech},
  author={Menta, Venkata Pushpak Teja},
  year={2026},
  eprint={2604.25476},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.25476}
}

Contact

Issues, bugs, reference-voice quirks: github.com/praxelhq/praxy/issues.

Commercial licensing / enterprise support: pushpak@praxel.in.
