Tags: Automatic Speech Recognition · PEFT · Safetensors · Telugu · whisper · telugu · indic · lora · entity-dense

Praxy-STT-Te-rb: Entity-Dense Telugu ASR via TTS↔STT Flywheel

A LoRA adapter on top of vasista22/whisper-telugu-large-v2, trained on the EDSA (Entity-Dense Synthetic Audio) corpus to recover Indian-style entity recognition (digit strings, currency amounts, addresses, brand names, English/Telugu code-mix) where the base model fails.

Headline results (entity-dense Telugu, n=102, Cartesia held-out)

| System | EHR | WER | SFR |
|---|---|---|---|
| Vanilla Whisper-large-v3 | 0.560 | 1.330 | 0.566 |
| vasista22 (open SOTA, our base) | 0.027 | 0.582 | 1.000 |
| Deepgram Nova-3 (commercial) | 0.160 | 0.690 | 0.978 |
| Praxy-STT-Te-rb (this model) | 0.473 | 0.324 | 0.928 |

That is a 17× improvement in EHR over the open SOTA base and 3× over the commercial system on Indian-entity recognition.

Read-prose performance is preserved within +6 pp WER on FLEURS-Te (0.39 vs 0.33 for vasista22), tied on IndicVoices conversational, and +1 pp on Common Voice 25.
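The WER column above is standard word error rate; EHR and SFR are corpus-specific metrics whose exact definitions this card does not spell out. A minimal pure-Python WER sketch (whitespace tokenisation is an assumption; real Telugu evaluation typically normalises text first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

print(wer("pay 450 rupees now", "pay 450 rupees"))  # → 0.25 (one deletion over 4 words)
```

Note that WER can exceed 1.0 when the hypothesis inserts more words than the reference contains, which is how vanilla Whisper reaches 1.330 in the table.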

Usage

```python
import torch
import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel

base_model = "vasista22/whisper-telugu-large-v2"
processor = WhisperProcessor.from_pretrained(base_model, language="telugu", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(base_model, torch_dtype=torch.bfloat16).to("cuda")

# vasista22's saved generation_config requires explicit forced_decoder_ids under transformers >= 4.40
forced = processor.tokenizer.get_decoder_prompt_ids(language="telugu", task="transcribe")
model.config.forced_decoder_ids = forced
model.generation_config.forced_decoder_ids = forced
model.generation_config.suppress_tokens = []

model = PeftModel.from_pretrained(model, "Praxel/praxy-stt-te-rb")
model.eval()

# Transcribe: Whisper expects 16 kHz mono audio
audio, _ = librosa.load("path/to/audio.wav", sr=16000, mono=True)
feats = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda", dtype=torch.bfloat16)
pred_ids = model.generate(feats, max_new_tokens=400, num_beams=1)
text = processor.tokenizer.decode(pred_ids[0], skip_special_tokens=True).strip()
print(text)
```

Training

  • Base: vasista22/whisper-telugu-large-v2 (IIT-Madras Speech Lab, Apache-2.0)
  • LoRA config: rank 16, alpha 32, dropout 0.05, target modules q_proj k_proj v_proj out_proj
  • Training corpus: Entity-Dense Synthetic Audio (~22 audio-hours per language) from Praxy R6 / vanilla Chatterbox / IndicF5 / ElevenLabs / Cartesia synthesis; Cartesia rows held out as evaluation set
  • Steps: 4000 on Modal A10G, ~$5 compute
  • Pin chain: transformers==4.36.2, peft==0.10.0, torch==2.4.0 (vasista22's saved generation_config is incompatible with newer transformers)
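The LoRA hyperparameters above can be expressed as a peft `LoraConfig` directly; a minimal sketch (only the rank, alpha, dropout, and target-module values come from this card — everything else is left at peft defaults, and the training loop itself is not shown):

```python
from peft import LoraConfig

# Values stated in the card; all other LoraConfig fields are peft defaults (an assumption)
lora_config = LoraConfig(
    r=16,                 # LoRA rank
    lora_alpha=32,        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],  # attention projections
)
```

This config would be applied to the base Whisper model with `peft.get_peft_model(model, lora_config)` before fine-tuning.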

License + companion work

Apache-2.0 (matches upstream vasista22 license).

This is paper #3 in a series. Companion β models: Praxel/praxy-stt-hi-rb, Praxel/praxy-stt-ta-rb.

Limitations

  • Entity-dense evaluation is on Cartesia-synthesised audio held-out from training; transfer to native human entity-dense speech is not directly measured.
  • Pre-registered EHR ≥ 0.75 target was missed (achieved 0.473); entity-dense Indic ASR remains substantially open as a research direction.
  • Read-prose regression is bounded but exists (+6 pp on FLEURS-Te); for pure read-prose deployment the upstream vasista22 base is preferable.

Citation

```bibtex
@misc{praxy_stt_2026,
  author = {Menta, Venkata Pushpak Teja},
  title = {The TTS--STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail},
  year = {2026},
  publisher = {Praxel Ventures},
  howpublished = {\url{https://huggingface.co/Praxel/praxy-stt-te-rb}},
}
```