The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail
Paper: 2605.03073
LoRA adapter on top of vasista22/whisper-tamil-large-v2 trained on the EDSA (Entity-Dense Synthetic Audio) corpus.
| System | EHR (entity hit rate; higher is better) |
|---|---|
| vasista22 (open SOTA) | 0.025 |
| Deepgram Nova-3 (commercial) | 0.025 |
| Praxy-STT-TA-rb (this model) | 0.543 |
That is a 22× improvement over both the open SOTA and the commercial system: vasista22 and Deepgram both effectively fail (EHR 0.025) on entity-dense Tamil audio. β-Ta is the cleanest demonstration of the flywheel's value.
Read-prose regression vs vasista22 base: +9 pp on FLEURS, +3 pp on CV25, tied on IV.
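The card does not define how EHR is computed. A plausible, purely illustrative definition, assuming EHR means the fraction of gold entities recovered verbatim in the hypothesis transcript (the function name and example data below are hypothetical):

```python
def entity_hit_rate(gold_entities, hypothesis):
    """Fraction of gold entities appearing verbatim in the hypothesis.

    Assumed definition: the card does not spell out how EHR is computed,
    so treat this as a sketch, not the benchmark's actual scorer.
    """
    if not gold_entities:
        return 0.0
    hyp = hypothesis.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in hyp)
    return hits / len(gold_entities)

# Hypothetical example: 2 of 3 entities recovered.
gold = ["metformin", "500 mg", "Dr. Subramanian"]
hyp = "patient takes metformin 500 mg daily"
print(round(entity_hit_rate(gold, hyp), 3))  # → 0.667
```

A real scorer would likely normalize script, whitespace, and numerals before matching; exact substring matching is the simplest baseline.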
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel
import torch, librosa

base_model = "vasista22/whisper-tamil-large-v2"

# Processor handles feature extraction and tokenization for Tamil transcription.
processor = WhisperProcessor.from_pretrained(base_model, language="tamil", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.bfloat16
).to("cuda")

# Force Tamil transcription and disable default token suppression.
forced = processor.tokenizer.get_decoder_prompt_ids(language="tamil", task="transcribe")
model.config.forced_decoder_ids = forced
model.generation_config.forced_decoder_ids = forced
model.generation_config.suppress_tokens = []

# Attach the LoRA adapter on top of the base model.
model = PeftModel.from_pretrained(model, "Praxel/praxy-stt-ta-rb")
model.eval()

# Load audio at Whisper's expected 16 kHz mono.
audio, _ = librosa.load("path/to/audio.wav", sr=16000, mono=True)
feats = processor.feature_extractor(
    audio, sampling_rate=16000, return_tensors="pt"
).input_features.to("cuda", dtype=torch.bfloat16)

# Pass input_features by keyword so PeftModel.generate forwards it correctly.
pred_ids = model.generate(input_features=feats, max_new_tokens=400, num_beams=1)
print(processor.tokenizer.decode(pred_ids[0], skip_special_tokens=True).strip())
```
LoRA rank 16, alpha 32, applied to the q/k/v/out_proj projections of vasista22/whisper-tamil-large-v2. Trained for 4,000 steps on a Modal A10G. Cartesia rows were held out for evaluation. Pinned versions: transformers==4.36.2, peft==0.10.0.
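The full training script is not published; a minimal `peft` configuration consistent with the stated hyperparameters (rank 16, alpha 32, q/k/v/out_proj targets) might look like the sketch below. Dropout and bias settings are assumptions, as the card does not state them:

```python
from peft import LoraConfig

# r and lora_alpha come from the card; lora_dropout and bias are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
    lora_dropout=0.05,  # assumption: not stated on the card
    bias="none",
)

# Applied with: model = get_peft_model(base_whisper_model, lora_config)
```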
License: Apache-2.0.
Base model: vasista22/whisper-tamil-large-v2