Initial release: Praxy-STT-TE-rb (vasista22 + EDSA LoRA)
- README.md +102 -0
- adapter_config.json +34 -0
- adapter_model.safetensors +3 -0
- preprocessor_config.json +14 -0
README.md
ADDED
@@ -0,0 +1,102 @@
---
base_model: vasista22/whisper-telugu-large-v2
library_name: peft
language: te
license: apache-2.0
tags:
- automatic-speech-recognition
- whisper
- telugu
- indic
- lora
- entity-dense
metrics:
- wer
- ehr
datasets:
- ai4bharat/IndicVoices
- mozilla-foundation/common_voice_25_0
- google/fleurs
---

# Praxy-STT-Te-rb: Entity-Dense Telugu ASR via TTS↔STT Flywheel

A LoRA adapter on top of `vasista22/whisper-telugu-large-v2`, trained on the EDSA (Entity-Dense Synthetic Audio) corpus to recover Indian-style entity recognition (digit strings, currency amounts, addresses, brand names, English/Telugu code-mix) where the underlying base model fails.

## Headline results (entity-dense Telugu, n=102, Cartesia held-out)

| System | EHR ↑ | WER ↓ | SFR |
|---|---|---|---|
| Vanilla Whisper-large-v3 | 0.560 | 1.330 | 0.566 |
| vasista22 (open SOTA, our base) | 0.027 | 0.582 | 1.000 |
| Deepgram Nova-3 (commercial) | 0.160 | 0.690 | 0.978 |
| **Praxy-STT-Te-rb (this model)** | **0.473** | **0.324** | 0.928 |

That is **17× the open-SOTA EHR and 3× the commercial EHR** on Indian-entity recognition.

Read-prose accuracy is preserved to within +6 pp WER on FLEURS-Te (0.39 vs vasista22's 0.33), tied on IndicVoices conversational, and within +1 pp on Common Voice 25.

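The WER column is standard word error rate: word-level Levenshtein distance divided by reference length, so values above 1.0 are possible when a system inserts heavily (as vanilla Whisper does here). A minimal sketch, independent of whatever text normalization the evaluation applies:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # d[j] = distance(ref[:i], hyp[:j])
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
            prev, d[j] = d[j], cur
    return d[-1] / max(len(ref), 1)

print(wer("pay 1500 rupees now", "pay 1500 rupees now"))           # 0.0
print(wer("pay 1500 rupees now", "pay fifteen hundred rupees now"))  # 0.5
```

EHR additionally requires per-utterance entity annotations, so it is not reproducible from transcripts alone.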
## Usage

```python
import torch
import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel

base_model = "vasista22/whisper-telugu-large-v2"
processor = WhisperProcessor.from_pretrained(base_model, language="telugu", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(base_model, torch_dtype=torch.bfloat16).to("cuda")

# vasista22's saved generation_config requires explicit forced_decoder_ids under transformers >=4.40
forced = processor.tokenizer.get_decoder_prompt_ids(language="telugu", task="transcribe")
model.config.forced_decoder_ids = forced
model.generation_config.forced_decoder_ids = forced
model.generation_config.suppress_tokens = []

model = PeftModel.from_pretrained(model, "Praxel/praxy-stt-te-rb")
model.eval()

# Transcribe a 16 kHz mono clip
audio, _ = librosa.load("path/to/audio.wav", sr=16000, mono=True)
feats = processor.feature_extractor(
    audio, sampling_rate=16000, return_tensors="pt"
).input_features.to("cuda", dtype=torch.bfloat16)
with torch.no_grad():
    pred_ids = model.generate(feats, max_new_tokens=400, num_beams=1)
text = processor.tokenizer.decode(pred_ids[0], skip_special_tokens=True).strip()
print(text)
```

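For serving without a `peft` dependency at inference time, the adapter can be folded into the base weights with the standard `peft` merge API; a sketch continuing from the snippet above (the output directory name is our own, not part of the release):

```python
# Assumes `model` is the PeftModel and `processor` the WhisperProcessor
# from the usage snippet above.
merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights
merged.save_pretrained("praxy-stt-te-rb-merged")
processor.save_pretrained("praxy-stt-te-rb-merged")
# The merged checkpoint then loads with plain
# WhisperForConditionalGeneration.from_pretrained("praxy-stt-te-rb-merged")
```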
## Training

- **Base:** `vasista22/whisper-telugu-large-v2` (IIT-Madras Speech Lab, Apache-2.0)
- **LoRA config:** rank 16, alpha 32, dropout 0.05, target modules `q_proj k_proj v_proj out_proj`
- **Training corpus:** Entity-Dense Synthetic Audio (~22 audio-hours per language) from Praxy R6 / vanilla Chatterbox / IndicF5 / ElevenLabs / Cartesia synthesis; Cartesia rows held out as the evaluation set
- **Steps:** 4000 on a Modal A10G, ~$5 of compute
- **Pin chain:** `transformers==4.36.2`, `peft==0.10.0`, `torch==2.4.0` (vasista22's saved generation_config is incompatible with newer transformers)

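The pin chain above translates directly into an environment setup; a minimal sketch (the virtualenv name is arbitrary):

```shell
python -m venv .venv
source .venv/bin/activate
pip install "transformers==4.36.2" "peft==0.10.0" "torch==2.4.0" librosa
```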
## License + companion work

Apache-2.0 (matches the upstream vasista22 license).

This is paper #3 in a series:
- **Praxy Voice TTS** (paper #1, the synthesis half of this flywheel): [arXiv:2604.25441](https://arxiv.org/abs/2604.25441)
- **PSP** (paper #2, the accent metric used to validate synthesis quality): [arXiv:2604.25476](https://arxiv.org/abs/2604.25476)
- **STT Flywheel** (this paper): preprint forthcoming

Companion β models: `Praxel/praxy-stt-hi-rb`, `Praxel/praxy-stt-ta-rb`.

## Limitations

- Entity-dense evaluation uses Cartesia-synthesised audio held out from training; transfer to natively spoken entity-dense speech is not directly measured.
- The pre-registered target of EHR ≥ 0.75 was missed (0.473 achieved); entity-dense Indic ASR remains substantially open as a research direction.
- The read-prose regression is bounded but real (+6 pp on FLEURS-Te); for pure read-prose deployment the upstream vasista22 base is preferable.

## Citation

```bibtex
@misc{praxy_stt_2026,
  author       = {Menta, Venkata Pushpak Teja},
  title        = {The TTS--STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail},
  year         = {2026},
  publisher    = {Praxel Ventures},
  howpublished = {\url{https://huggingface.co/Praxel/praxy-stt-te-rb}},
}
```
adapter_config.json
ADDED
@@ -0,0 +1,34 @@
{
  "alpha_pattern": {},
  "auto_mapping": {
    "base_model_class": "WhisperForConditionalGeneration",
    "parent_library": "transformers.models.whisper.modeling_whisper"
  },
  "base_model_name_or_path": "vasista22/whisper-telugu-large-v2",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 16,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "v_proj",
    "q_proj",
    "out_proj",
    "k_proj"
  ],
  "task_type": null,
  "use_dora": false,
  "use_rslora": false
}
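As a sanity check (our arithmetic, not part of the release), the adapter size is consistent with this config: Whisper-large-v2 has d_model = 1280 with 32 encoder layers (self-attention only) and 32 decoder layers (self- plus cross-attention), and each targeted projection gets two rank-16 LoRA matrices:

```python
# LoRA parameter count implied by adapter_config.json for Whisper-large-v2.
d_model, r = 1280, 16
enc_projs = 32 * 4           # q/k/v/out_proj in each encoder self-attention block
dec_projs = 32 * 2 * 4       # decoder has self- AND cross-attention per layer
per_proj = r * d_model + d_model * r   # A (r x d_model) + B (d_model x r)
total = (enc_projs + dec_projs) * per_proj
print(total)      # 15728640 trainable parameters
print(total * 2)  # 31457280 bytes at 2 bytes/param (bf16)
```

At 2 bytes per parameter this gives ~31.5 MB, matching the listed 31,568,280-byte shard up to safetensors header overhead.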
adapter_model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d63ca26518e18997022be01080e332125c9155a4c9d442e999b93ee3c7ed371d
size 31568280
preprocessor_config.json
ADDED
@@ -0,0 +1,14 @@
{
  "chunk_length": 30,
  "feature_extractor_type": "WhisperFeatureExtractor",
  "feature_size": 80,
  "hop_length": 160,
  "n_fft": 400,
  "n_samples": 480000,
  "nb_max_frames": 3000,
  "padding_side": "right",
  "padding_value": 0.0,
  "processor_class": "WhisperProcessor",
  "return_attention_mask": false,
  "sampling_rate": 16000
}
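For reference, these values are mutually consistent with Whisper's fixed 30-second input window; a quick check of the arithmetic (our derivation, not part of the config):

```python
# Whisper feature-extractor arithmetic implied by preprocessor_config.json.
sampling_rate = 16000
chunk_length = 30                          # seconds per window
hop_length = 160                           # samples between STFT frames
n_samples = sampling_rate * chunk_length   # 480000, as in the config
nb_max_frames = n_samples // hop_length    # 3000, as in the config
print(n_samples, nb_max_frames)            # 480000 3000
```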