The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail
Paper: 2605.03073
A LoRA adapter on top of vasista22/whisper-telugu-large-v2, trained on the EDSA (Entity-Dense Synthetic Audio) corpus to recover Indian-style entities (digit strings, currency amounts, addresses, brand names, English/Telugu code-mix) that the underlying base model misses.
| System | EHR | WER | SFR |
|---|---|---|---|
| Vanilla Whisper-large-v3 | 0.560 | 1.330 | 0.566 |
| vasista22 (open SOTA, our base) | 0.027 | 0.582 | 1.000 |
| Deepgram Nova-3 (commercial) | 0.160 | 0.690 | 0.978 |
| Praxy-STT-Te-rb (this model) | 0.473 | 0.324 | 0.928 |
= 17× the open-SOTA EHR (0.473 vs 0.027) and ~3× the commercial EHR (0.473 vs 0.160) on Indian-entity recognition.
Read-prose performance is preserved: within +6 pp WER on FLEURS-Te (0.39 vs vasista22's 0.33), tied on IndicVoices conversational, +1 pp on Common Voice 25.
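The paper's exact EHR scoring rules are not reproduced in this card, but the metric (fraction of reference entities recovered verbatim in the hypothesis) can be sketched in a few lines. This is an illustrative approximation with a hypothetical `entity_hit_rate` helper, not the official evaluation script:

```python
def entity_hit_rate(ref_entities, hypothesis):
    """Fraction of reference entities that appear verbatim in the hypothesis.

    Simplified stand-in for the paper's EHR metric: exact substring match,
    no normalization of digits, currency symbols, or code-mixed spans.
    """
    if not ref_entities:
        return 1.0
    hits = sum(1 for entity in ref_entities if entity in hypothesis)
    return hits / len(ref_entities)

# Toy example: three reference entities, the hypothesis recovers two.
ref = ["₹1,250", "9876543210", "Flipkart"]
hyp = "ఆర్డర్ ₹1,250 Flipkart నుండి వచ్చింది"
print(round(entity_hit_rate(ref, hyp), 3))  # two of three entities found
```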
```python
import librosa
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

base_model = "vasista22/whisper-telugu-large-v2"
processor = WhisperProcessor.from_pretrained(base_model, language="telugu", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(base_model, torch_dtype=torch.bfloat16).to("cuda")

# vasista22's saved generation_config requires explicit forced_decoder_ids under transformers >= 4.40
forced = processor.tokenizer.get_decoder_prompt_ids(language="telugu", task="transcribe")
model.config.forced_decoder_ids = forced
model.generation_config.forced_decoder_ids = forced
model.generation_config.suppress_tokens = []

# Attach the EDSA LoRA adapter
model = PeftModel.from_pretrained(model, "Praxel/praxy-stt-te-rb")
model.eval()

# Transcribe a 16 kHz mono clip
audio, _ = librosa.load("path/to/audio.wav", sr=16000, mono=True)
feats = processor.feature_extractor(
    audio, sampling_rate=16000, return_tensors="pt"
).input_features.to("cuda", dtype=torch.bfloat16)
pred_ids = model.generate(feats, max_new_tokens=400, num_beams=1)
text = processor.tokenizer.decode(pred_ids[0], skip_special_tokens=True).strip()
print(text)
```
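Whisper's feature extractor pads or truncates input to 30-second windows, so recordings longer than that need chunking before the snippet above. A minimal sketch, assuming fixed 30 s windows with a small overlap and no model-side timestamp merging (the `chunk_audio` helper is hypothetical, not part of this repo):

```python
import numpy as np

def chunk_audio(audio, sr=16000, window_s=30.0, overlap_s=1.0):
    """Split a mono waveform into fixed-length windows Whisper can ingest.

    Returns a list of float arrays; adjacent chunks share `overlap_s`
    seconds so an entity at a boundary is not cut in half.
    """
    win = int(window_s * sr)
    hop = int((window_s - overlap_s) * sr)
    return [audio[i:i + win] for i in range(0, max(len(audio) - 1, 1), hop)]

# 75 s of audio at 16 kHz splits into three overlapping windows.
dummy = np.zeros(75 * 16000, dtype=np.float32)
print(len(chunk_audio(dummy)))  # three chunks
```

Each chunk can then be fed through `processor.feature_extractor` and `model.generate` as in the snippet above, concatenating the decoded texts.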
- Base model: vasista22/whisper-telugu-large-v2 (IIT-Madras Speech Lab, Apache-2.0)
- LoRA target modules: `q_proj`, `k_proj`, `v_proj`, `out_proj`
- Tested environment: `transformers==4.36.2`, `peft==0.10.0`, `torch==2.4.0` (vasista22's saved generation_config is incompatible with newer transformers)
- License: Apache-2.0 (matches upstream vasista22 license)
This is paper #3 in a series.
Companion β models: Praxel/praxy-stt-hi-rb, Praxel/praxy-stt-ta-rb.
```bibtex
@misc{praxy_stt_2026,
  author       = {Menta, Venkata Pushpak Teja},
  title        = {The TTS--STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail},
  year         = {2026},
  publisher    = {Praxel Ventures},
  howpublished = {\url{https://huggingface.co/Praxel/praxy-stt-te-rb}},
}
```