Automatic Speech Recognition
PEFT
Safetensors
Telugu
whisper
telugu
indic
lora
entity-dense
praxelhq committed · Commit ea2dc41 · verified · 1 Parent(s): c82ca7f

Initial release: Praxy-STT-TE-rb (vasista22 + EDSA LoRA)

README.md ADDED
@@ -0,0 +1,102 @@
+ ---
+ base_model: vasista22/whisper-telugu-large-v2
+ library_name: peft
+ language: te
+ license: apache-2.0
+ tags:
+ - automatic-speech-recognition
+ - whisper
+ - telugu
+ - indic
+ - lora
+ - entity-dense
+ metrics:
+ - wer
+ - ehr
+ datasets:
+ - ai4bharat/IndicVoices
+ - mozilla-foundation/common_voice_25_0
+ - google/fleurs
+ ---
+
22
+ # Praxy-STT-Te-rb: Entity-Dense Telugu ASR via TTS↔STT Flywheel
23
+
24
+ LoRA adapter on top of `vasista22/whisper-telugu-large-v2` trained on the EDSA (Entity-Dense Synthetic Audio) corpus to recover Indian-style entity recognition (digit strings, currency amounts, addresses, brand names, English/Telugu code-mix) where the underlying base model fails.
25
+
26
+ ## Headline results (entity-dense Telugu, n=102, Cartesia held-out)
27
+
28
+ | System | EHR | WER | SFR |
29
+ |---|---|---|---|
30
+ | Vanilla Whisper-large-v3 | 0.560 | 1.330 | 0.566 |
31
+ | vasista22 (open SOTA, our base) | 0.027 | 0.582 | 1.000 |
32
+ | Deepgram Nova-3 (commercial) | 0.160 | 0.690 | 0.978 |
33
+ | **Praxy-STT-Te-rb (this model)** | **0.473** | **0.324** | 0.928 |
34
+
35
+ = **17× over open SOTA, 3× over commercial** on Indian-entity recognition.
36
+
37
+ Read-prose preserved within +6 pp WER on FLEURS-Te (0.39 vs vasista22 0.33), tied on IndicVoices conversational, +1 pp on Common Voice 25.
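WER in the table above is the standard word-level edit distance normalised by reference length; a minimal pure-Python reference implementation of the metric (a sketch, not this project's evaluation script):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count.

    Note WER can exceed 1.0 when the hypothesis inserts many spurious words
    (cf. the 1.330 for vanilla Whisper-large-v3 in the table).
    """
    r, h = ref.split(), hyp.split()
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution over 3 words
```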
+
+ ## Usage
+
+ ```python
+ import torch
+ import librosa
+ from transformers import WhisperForConditionalGeneration, WhisperProcessor
+ from peft import PeftModel
+
+ base_model = "vasista22/whisper-telugu-large-v2"
+ processor = WhisperProcessor.from_pretrained(base_model, language="telugu", task="transcribe")
+ model = WhisperForConditionalGeneration.from_pretrained(base_model, torch_dtype=torch.bfloat16).to("cuda")
+
+ # vasista22's saved generation_config requires explicit forced_decoder_ids under transformers >= 4.40
+ forced = processor.tokenizer.get_decoder_prompt_ids(language="telugu", task="transcribe")
+ model.config.forced_decoder_ids = forced
+ model.generation_config.forced_decoder_ids = forced
+ model.generation_config.suppress_tokens = []
+
+ model = PeftModel.from_pretrained(model, "Praxel/praxy-stt-te-rb")
+ model.eval()
+
+ # Transcribe a 16 kHz mono clip
+ audio, _ = librosa.load("path/to/audio.wav", sr=16000, mono=True)
+ feats = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda", dtype=torch.bfloat16)
+ pred_ids = model.generate(feats, max_new_tokens=400, num_beams=1)
+ text = processor.tokenizer.decode(pred_ids[0], skip_special_tokens=True).strip()
+ print(text)
+ ```
+
+ ## Training
+
+ - **Base:** `vasista22/whisper-telugu-large-v2` (IIT-Madras Speech Lab, Apache-2.0)
+ - **LoRA config:** rank 16, alpha 32, dropout 0.05, target modules `q_proj k_proj v_proj out_proj`
+ - **Training corpus:** Entity-Dense Synthetic Audio (~22 audio-hours per language) synthesised with Praxy R6, vanilla Chatterbox, IndicF5, ElevenLabs, and Cartesia; Cartesia rows held out as the evaluation set
+ - **Steps:** 4,000 on a Modal A10G, ~$5 of compute
+ - **Pin chain:** `transformers==4.36.2`, `peft==0.10.0`, `torch==2.4.0` (vasista22's saved generation_config is incompatible with newer transformers)
+
+ ## License + companion work
+
+ Apache-2.0 (matches the upstream vasista22 license).
+
+ This is paper #3 in a series:
+ - **Praxy Voice TTS** (paper #1, the synthesis half of this flywheel): [arXiv:2604.25441](https://arxiv.org/abs/2604.25441)
+ - **PSP** (paper #2, the accent metric used to validate synthesis quality): [arXiv:2604.25476](https://arxiv.org/abs/2604.25476)
+ - **STT Flywheel** (this paper): preprint forthcoming
+
+ Companion β models: `Praxel/praxy-stt-hi-rb`, `Praxel/praxy-stt-ta-rb`.
+
+ ## Limitations
+
+ - Entity-dense evaluation uses Cartesia-synthesised audio held out from training; transfer to native human entity-dense speech is not directly measured.
+ - The pre-registered EHR ≥ 0.75 target was missed (0.473 achieved); entity-dense Indic ASR remains substantially open as a research direction.
+ - The read-prose regression is bounded but real (+6 pp on FLEURS-Te); for pure read-prose deployment the upstream vasista22 base is preferable.
+
+ ## Citation
+
+ ```bibtex
+ @misc{praxy_stt_2026,
+   author       = {Menta, Venkata Pushpak Teja},
+   title        = {The TTS--STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail},
+   year         = {2026},
+   publisher    = {Praxel Ventures},
+   howpublished = {\url{https://huggingface.co/Praxel/praxy-stt-te-rb}},
+ }
+ ```
adapter_config.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "alpha_pattern": {},
+   "auto_mapping": {
+     "base_model_class": "WhisperForConditionalGeneration",
+     "parent_library": "transformers.models.whisper.modeling_whisper"
+   },
+   "base_model_name_or_path": "vasista22/whisper-telugu-large-v2",
+   "bias": "none",
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 32,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "r": 16,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "v_proj",
+     "q_proj",
+     "out_proj",
+     "k_proj"
+   ],
+   "task_type": null,
+   "use_dora": false,
+   "use_rslora": false
+ }
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d63ca26518e18997022be01080e332125c9155a4c9d442e999b93ee3c7ed371d
+ size 31568280
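The ~31.6 MB adapter size is consistent with a 16-bit rank-16 LoRA over the attention projections of Whisper-large-v2. A back-of-the-envelope check, assuming 32 encoder and 32 decoder layers, d_model 1280, and (as peft's suffix matching of `q_proj`/`k_proj`/`v_proj`/`out_proj` would give) adapters on decoder cross-attention as well as both self-attentions:

```python
# Rank-16 LoRA parameter count over Whisper-large-v2 attention projections.
# Assumptions: q/k/v/out adapted in encoder self-attention, decoder
# self-attention, and decoder cross-attention (4 + 4 modules per decoder layer).
d_model, r = 1280, 16
enc_layers, dec_layers = 32, 32

n_proj = enc_layers * 4 + dec_layers * (4 + 4)  # 384 adapted projections
params = n_proj * 2 * r * d_model               # each has A (r x d) + B (d x r)

print(params)      # 15_728_640 trainable parameters
print(params * 2)  # 31_457_280 bytes at 2 bytes/param -- near the 31_568_280 file
```

The remaining ~110 KB is plausibly safetensors header metadata.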
preprocessor_config.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "chunk_length": 30,
+   "feature_extractor_type": "WhisperFeatureExtractor",
+   "feature_size": 80,
+   "hop_length": 160,
+   "n_fft": 400,
+   "n_samples": 480000,
+   "nb_max_frames": 3000,
+   "padding_side": "right",
+   "padding_value": 0.0,
+   "processor_class": "WhisperProcessor",
+   "return_attention_mask": false,
+   "sampling_rate": 16000
+ }
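The fields above are mutually consistent: 30 s windows at 16 kHz give 480,000 samples, and a 160-sample hop gives 3,000 mel frames. A quick derivation of the dependent fields from the independent ones:

```python
# Derive the dependent Whisper preprocessor fields from the independent ones
chunk_length, sampling_rate, hop_length = 30, 16000, 160

n_samples = chunk_length * sampling_rate  # samples per 30 s window
nb_max_frames = n_samples // hop_length   # mel frames per window

print(n_samples)      # 480000, matches "n_samples" above
print(nb_max_frames)  # 3000, matches "nb_max_frames" above
```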