Praxy-STT-HI-r2: Whisper-large-v3 + Per-Language LoRA for Hindi
Companion to the paper "The TTS↔STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail" (preprint forthcoming).
This is the LoRA variant built on the Whisper-large-v3 base and trained on the EDSA entity-dense corpus. The paper's headline recommendation is Praxel/praxy-stt-hi-rb (vasista22 base), which has substantially better read-prose performance.
⚠️ Honest status: contraindicated for production
This LoRA was trained as part of a 3-language ablation series. For Hindi, vanilla Whisper-large-v3 already achieves SFR $\geq 0.98$, so there is no Script Collapse to correct. Applying this LoRA on top of vanilla Whisper-large-v3 for Hindi raises WER on all three holdouts (paper §5.3, Table 3):
| Holdout | Vanilla v3 WER | Hi-r2 WER |
|---|---|---|
| FLEURS-Hi | 0.32 | 0.51 |
| CV25-Hi | 0.42 | 1.11 |
| IV-Hi | 0.52 | 0.89 |
Use Praxel/praxy-stt-hi-rb instead (vasista22 base + EDSA LoRA) for entity-dense Hindi, or use vanilla vasista22/whisper-hindi-large-v2 for read-prose Hindi.
This r2 release is published for reproducibility of the language-conditional finding (paper §5.3) and should not be used as a deployment target.
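The WER figures in the table above can be read two ways: as absolute values, and as a relative change against the vanilla baseline (roughly +59%, +164% and +71% for the three holdouts). The sketch below recomputes the relative change from the table and includes a minimal word-level WER function to show why a value above 1.0, such as the 1.11 on CV25-Hi, is possible: insertions count as errors but do not lengthen the reference. The function and variable names are illustrative, and the paper's own scoring and text normalization may differ.

```python
# WER values copied from the table: holdout -> (vanilla v3, Hi-r2)
TABLE = {
    "FLEURS-Hi": (0.32, 0.51),
    "CV25-Hi": (0.42, 1.11),
    "IV-Hi": (0.52, 0.89),
}


def relative_change(before: float, after: float) -> float:
    """Relative WER change; positive means the LoRA made things worse."""
    return (after - before) / before


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length.

    Minimal sketch with whitespace tokenization and no text normalization
    (an assumption; real evaluations usually normalize first).
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    for name, (vanilla, lora) in TABLE.items():
        print(f"{name}: {relative_change(vanilla, lora):+.0%}")
    # A hypothesis much longer than the reference pushes WER above 1.0.
    print(wer("a b", "x y z w c"))
```

A WER above 1.0 therefore usually points to the model emitting far more words than were spoken, for example through repetition or hallucinated text.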
Usage
import librosa
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the vanilla base model and its processor.
base = "openai/whisper-large-v3"
processor = WhisperProcessor.from_pretrained(base, language="hindi", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(base, torch_dtype=torch.bfloat16).to("cuda")

# Pin language and task through the generation config instead of forced decoder ids.
model.generation_config.language = "hindi"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None
model.generation_config.suppress_tokens = []

# Attach the LoRA adapter on top of the base weights.
model = PeftModel.from_pretrained(model, "Praxel/praxy-stt-hi-r2")
model.eval()

# Whisper expects 16 kHz mono audio.
audio, _ = librosa.load("path/to/audio.wav", sr=16000, mono=True)
feats = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda", dtype=torch.bfloat16)

# Greedy decoding (num_beams=1).
pred_ids = model.generate(input_features=feats, max_new_tokens=400, num_beams=1, language="hindi", task="transcribe")
print(processor.tokenizer.decode(pred_ids[0], skip_special_tokens=True).strip())
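Because the release exists for reproducibility, it may be worth confirming that the environment matches the pin chain listed under Training before running the snippet above. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and the version comparison is deliberately simplified: it compares leading numeric components and ignores local labels such as `+cu121` and pre-release suffixes.

```python
import re
from importlib import metadata

# Pins as stated in the Training section: package -> (operator, version)
PINS = {
    "transformers": ("==", "4.49.0"),
    "peft": (">=", "0.13"),
    "torch": ("==", "2.4.0"),
}


def _numeric(version: str) -> tuple:
    """Leading numeric components of a version, e.g. '2.4.0+cu121' -> (2, 4, 0)."""
    parts = []
    for piece in version.split("+", 1)[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies(installed: str, op: str, wanted: str) -> bool:
    """Simplified check supporting only the '==' and '>=' operators used above."""
    a, b = _numeric(installed), _numeric(wanted)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a == b if op == "==" else a >= b


def check_pins(installed: dict) -> list:
    """Return one human-readable problem per unmet pin (empty list if all match)."""
    problems = []
    for name, (op, wanted) in PINS.items():
        have = installed.get(name)
        if have is None:
            problems.append(f"{name}: not installed (need {op}{wanted})")
        elif not satisfies(have, op, wanted):
            problems.append(f"{name}: found {have}, need {op}{wanted}")
    return problems


def installed_versions() -> dict:
    """Look up the installed version of each pinned package, skipping missing ones."""
    found = {}
    for name in PINS:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found


if __name__ == "__main__":
    for line in check_pins(installed_versions()) or ["all pins satisfied"]:
        print(line)
```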
Training
LoRA rank 16, alpha 32, applied to the attention projections (q/k/v/out_proj) of openai/whisper-large-v3. Trained for 6000 steps on a Modal A10G. Per-language decoder prefix <|hi|> (no Hindi-proxy). Pin chain: transformers==4.49.0, peft>=0.13, torch==2.4.0.
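The hyperparameters above map onto a `peft` `LoraConfig` roughly as sketched below. Only the rank, alpha and target projections come from this card. Expanding "q/k/v/out_proj" to the module names `q_proj`, `k_proj`, `v_proj` and `out_proj` is an interpretation based on how the Hugging Face Whisper implementation names its attention layers. Dropout, bias handling and every other setting are left at library defaults, which may not match the actual training run.

```python
# Values stated in the Training section; everything else is left at peft defaults.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    # Assumed expansion of "q/k/v/out_proj" into Whisper attention module names.
    "target_modules": ["q_proj", "k_proj", "v_proj", "out_proj"],
}


def lora_scaling(r: int, lora_alpha: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update (alpha / r)."""
    return lora_alpha / r


def build_lora_config(**overrides):
    """Build a peft LoraConfig from the stated values.

    peft is imported lazily so the constants above can be inspected
    without the library installed.
    """
    from peft import LoraConfig

    return LoraConfig(**{**LORA_KWARGS, **overrides})


if __name__ == "__main__":
    print("scaling factor:", lora_scaling(LORA_KWARGS["r"], LORA_KWARGS["lora_alpha"]))
```

With rank 16 and alpha 32 the low-rank update is scaled by a factor of 2 before being added to the frozen base weights.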
Companion artefacts
- Paper #3 (this work): arXiv:2605.03073; code at github.com/praxelhq/stt-flywheel
- EDSA-LoRA on vasista22 base (recommended): Praxel/praxy-stt-hi-rb
- Paper #1 (Praxy Voice TTS): arXiv:2604.25441
- Paper #2 (PSP): arXiv:2604.25476
- Paper #4 (LASE): arXiv:2605.00777; code at github.com/praxelhq/lase
License: Apache-2.0.