# Praxy-STT-TE-r2: Whisper-large-v3 + Per-Language LoRA for Telugu
Companion to the paper The TTS↔STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail (preprint forthcoming).
This is the LoRA variant with openai/whisper-large-v3 as its base, trained on the EDSA entity-dense corpus. The paper's headline-recommended model is Praxel/praxy-stt-te-rb (vasista22 base), which has substantially better read-prose performance.
## Headline result
| System | EHR | WER | SFR |
|---|---|---|---|
| Vanilla Whisper-v3 | 0.560 | 1.330 | 0.566 |
| Praxy-STT-TE-r2 (this model) | 0.853† | 0.515 | 0.793 |
† Note: this LoRA's training corpus included Cartesia-synthesised entity-dense rows; the cartesia-excluded held-out evaluation of Praxel/praxy-stt-te-rb (vasista22 base) is the cleaner number for entity-dense Te.
Native human Te audio (n=20, paper §5.2): EHR 0.839, WER 0.515, SFR 0.753.
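For readers reproducing the table, WER is the standard word-level edit distance normalised by reference length, which can exceed 1.0 when the hypothesis contains many insertions (as in the vanilla row). A minimal self-contained sketch — a plain dynamic-programming implementation for illustration, not the paper's evaluation code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, a hypothesis with three spurious insertions against a one-word reference yields WER 3.0, which is how scores above 1.0 arise.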
## Status
This is the openai/whisper-large-v3-based variant; we recommend Praxel/praxy-stt-te-rb (vasista22 base + EDSA LoRA) for production use, as its read-prose performance is substantially better. This r2 release accompanies the paper as the methods row demonstrating the Script Collapse fix mechanism on a generic Whisper-v3 base.
## Usage
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel
import torch, librosa

base = "openai/whisper-large-v3"
processor = WhisperProcessor.from_pretrained(base, language="telugu", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(
    base, torch_dtype=torch.bfloat16
).to("cuda")

# Force Telugu transcription and clear Whisper's default decoding constraints.
model.generation_config.language = "telugu"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None
model.generation_config.suppress_tokens = []

# Attach the EDSA LoRA adapter on top of the base model.
model = PeftModel.from_pretrained(model, "Praxel/praxy-stt-te-r2")
model.eval()

# Whisper expects 16 kHz mono input.
audio, _ = librosa.load("path/to/audio.wav", sr=16000, mono=True)
feats = processor.feature_extractor(
    audio, sampling_rate=16000, return_tensors="pt"
).input_features.to("cuda", dtype=torch.bfloat16)

pred_ids = model.generate(
    feats, max_new_tokens=400, num_beams=1, language="telugu", task="transcribe"
)
print(processor.tokenizer.decode(pred_ids[0], skip_special_tokens=True).strip())
```
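Whisper operates on 30-second windows, so audio longer than that must be split before running the pipeline above. A minimal sketch of fixed-window chunking at the sample level; the window and overlap sizes are illustrative assumptions, not values from the paper:

```python
def chunk_audio(samples, sr=16000, chunk_s=30.0, overlap_s=1.0):
    """Split a 1-D sample sequence into overlapping windows for Whisper.

    A small overlap gives downstream stitching some slack at chunk
    boundaries; chunk_s matches Whisper's 30 s receptive window.
    """
    size = int(chunk_s * sr)               # samples per window
    step = int((chunk_s - overlap_s) * sr)  # hop between window starts
    if len(samples) <= size:
        return [samples]
    chunks = []
    for start in range(0, len(samples), step):
        chunks.append(samples[start:start + size])
        if start + size >= len(samples):
            break
    return chunks
```

Each chunk can then be fed through `processor.feature_extractor` and `model.generate` as above; merging overlapping transcripts is left to the caller.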
## Training
LoRA rank 16, alpha 32, applied to the q/k/v/out_proj attention projections of openai/whisper-large-v3; 6,000 steps on a Modal A10G. Per-language decoder prefix `<|te|>` (no Hindi proxy). Pinned environment: transformers==4.49.0, peft>=0.13, torch==2.4.0.
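As a sanity check on adapter size, here is a back-of-envelope count of the trainable parameters this config implies. The whisper-large-v3 dimensions used (d_model=1280, 32 encoder and 32 decoder layers) are the standard ones; whether decoder cross-attention blocks are adapted depends on how the target modules were matched, so treat the count as an estimate rather than the released adapter's exact size:

```python
# Back-of-envelope LoRA parameter count, assuming adapters on q/k/v/out_proj
# of every attention block (encoder self-attn, decoder self- and cross-attn).
d_model = 1280                      # whisper-large-v3 hidden size
rank = 16                           # LoRA rank from the training config
attn_blocks = 32 + 32 * 2           # encoder self + decoder self + cross
linears = attn_blocks * 4           # q, k, v, out_proj per block
per_linear = rank * (d_model + d_model)  # A: d_model x r, plus B: r x d_model
total = linears * per_linear
print(f"~{total / 1e6:.1f}M trainable LoRA parameters")
```

At rank 16 this comes to roughly 15.7M trainable parameters, about 1% of the ~1.5B-parameter base model.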
## Companion artefacts
- Paper #3 (this work): arXiv:2605.03073; code at github.com/praxelhq/stt-flywheel
- EDSA-LoRA on vasista22 base (recommended): Praxel/praxy-stt-te-rb
- Paper #1 (Praxy Voice TTS): arXiv:2604.25441
- Paper #2 (PSP): arXiv:2604.25476
- Paper #4 (LASE): arXiv:2605.00777; code at github.com/praxelhq/lase
License: Apache-2.0.