Tags: Automatic Speech Recognition · PEFT · Safetensors · Telugu · whisper · telugu · indic · lora · entity-dense

Praxy-STT-Te-rb: Entity-Dense Telugu ASR via TTS↔STT Flywheel

A LoRA adapter on top of vasista22/whisper-telugu-large-v2, trained on the EDSA (Entity-Dense Synthetic Audio) corpus to recover Indian-style entity recognition (digit strings, currency amounts, addresses, brand names, English/Telugu code-mix) where the base model fails.

Headline results (entity-dense Telugu, n=102, Cartesia held-out)

| System | EHR | WER | SFR |
|---|---|---|---|
| Vanilla Whisper-large-v3 | 0.560 | 1.330 | 0.566 |
| vasista22 (open SOTA, our base) | 0.027 | 0.582 | 1.000 |
| Deepgram Nova-3 (commercial) | 0.160 | 0.690 | 0.978 |
| Praxy-STT-Te-rb (this model) | 0.473 | 0.324 | 0.928 |

That is a 17× improvement in EHR over the open SOTA base and 3× over the commercial system on Indian-entity recognition.

Read-prose performance is preserved within +6 pp WER on FLEURS-Te (0.39 vs 0.33 for vasista22), tied on IndicVoices conversational, and +1 pp on Common Voice 25.
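The WER column above is standard word error rate; EHR and SFR are corpus-specific metrics whose exact definitions this card does not spell out. A minimal pure-Python WER sketch (whitespace tokenisation is an assumption; real Telugu evaluation typically normalises text first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

print(wer("pay 450 rupees now", "pay 450 rupees"))  # → 0.25 (one deletion over 4 words)
```

Note that WER can exceed 1.0 when the hypothesis inserts more words than the reference contains, which is how vanilla Whisper reaches 1.330 in the table.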

Usage

```python
import torch
import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel

base_model = "vasista22/whisper-telugu-large-v2"
processor = WhisperProcessor.from_pretrained(base_model, language="telugu", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(base_model, torch_dtype=torch.bfloat16).to("cuda")

# vasista22's saved generation_config requires explicit forced_decoder_ids under transformers >= 4.40
forced = processor.tokenizer.get_decoder_prompt_ids(language="telugu", task="transcribe")
model.config.forced_decoder_ids = forced
model.generation_config.forced_decoder_ids = forced
model.generation_config.suppress_tokens = []

model = PeftModel.from_pretrained(model, "Praxel/praxy-stt-te-rb")
model.eval()

# Transcribe: Whisper expects 16 kHz mono audio
audio, _ = librosa.load("path/to/audio.wav", sr=16000, mono=True)
feats = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda", dtype=torch.bfloat16)
pred_ids = model.generate(feats, max_new_tokens=400, num_beams=1)
text = processor.tokenizer.decode(pred_ids[0], skip_special_tokens=True).strip()
print(text)
```

Training

  • Base: vasista22/whisper-telugu-large-v2 (IIT-Madras Speech Lab, Apache-2.0)
  • LoRA config: rank 16, alpha 32, dropout 0.05, target modules q_proj k_proj v_proj out_proj
  • Training corpus: Entity-Dense Synthetic Audio (~22 audio-hours per language) from Praxy R6 / vanilla Chatterbox / IndicF5 / ElevenLabs / Cartesia synthesis; Cartesia rows held out as evaluation set
  • Steps: 4000 on Modal A10G, ~$5 compute
  • Pin chain: transformers==4.36.2, peft==0.10.0, torch==2.4.0 (vasista22's saved generation_config is incompatible with newer transformers)
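The LoRA hyperparameters above can be expressed as a peft `LoraConfig` directly; a minimal sketch (only the rank, alpha, dropout, and target-module values come from this card — everything else is left at peft defaults, and the training loop itself is not shown):

```python
from peft import LoraConfig

# Values stated in the card; all other LoraConfig fields are peft defaults (an assumption)
lora_config = LoraConfig(
    r=16,                 # LoRA rank
    lora_alpha=32,        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],  # attention projections
)
```

This config would be applied to the base Whisper model with `peft.get_peft_model(model, lora_config)` before fine-tuning.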

License + companion work

Apache-2.0 (matches upstream vasista22 license).

This is paper #3 in a series. Companion β models: Praxel/praxy-stt-hi-rb, Praxel/praxy-stt-ta-rb.

Limitations

  • Entity-dense evaluation is on Cartesia-synthesised audio held-out from training; transfer to native human entity-dense speech is not directly measured.
  • Pre-registered EHR ≥ 0.75 target was missed (achieved 0.473); entity-dense Indic ASR remains substantially open as a research direction.
  • Read-prose regression is bounded but exists (+6 pp on FLEURS-Te); for pure read-prose deployment the upstream vasista22 base is preferable.

Citation

```bibtex
@misc{praxy_stt_2026,
  author = {Menta, Venkata Pushpak Teja},
  title = {The TTS--STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail},
  year = {2026},
  publisher = {Praxel Ventures},
  howpublished = {\url{https://huggingface.co/Praxel/praxy-stt-te-rb}},
}
```