Initial release: Praxy-STT-TE-rb (vasista22 + EDSA LoRA)
- README.md +102 -0
- adapter_config.json +34 -0
- adapter_model.safetensors +3 -0
- preprocessor_config.json +14 -0
README.md
ADDED
@@ -0,0 +1,102 @@
---
base_model: vasista22/whisper-telugu-large-v2
library_name: peft
language: te
license: apache-2.0
tags:
- automatic-speech-recognition
- whisper
- telugu
- indic
- lora
- entity-dense
metrics:
- wer
- ehr
datasets:
- ai4bharat/IndicVoices
- mozilla-foundation/common_voice_25_0
- google/fleurs
---

# Praxy-STT-Te-rb: Entity-Dense Telugu ASR via TTS↔STT Flywheel

A LoRA adapter on top of `vasista22/whisper-telugu-large-v2`, trained on the EDSA (Entity-Dense Synthetic Audio) corpus to recover Indian-style entity recognition (digit strings, currency amounts, addresses, brand names, English/Telugu code-mix) where the underlying base model fails.

## Headline results (entity-dense Telugu, n=102, Cartesia held-out)

| System | EHR ↑ | WER ↓ | SFR |
|---|---|---|---|
| Vanilla Whisper-large-v3 | 0.560 | 1.330 | 0.566 |
| vasista22 (open SOTA, our base) | 0.027 | 0.582 | 1.000 |
| Deepgram Nova-3 (commercial) | 0.160 | 0.690 | 0.978 |
| **Praxy-STT-Te-rb (this model)** | **0.473** | **0.324** | 0.928 |

That is **17× the open-SOTA EHR and 3× the commercial EHR** on Indian-entity recognition.

Read-prose accuracy is preserved to within +6 pp WER on FLEURS-Te (0.39 vs vasista22's 0.33), tied on IndicVoices conversational, and within +1 pp on Common Voice 25.

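The WER column is standard word error rate: word-level Levenshtein distance divided by reference length, so values above 1.0 are possible when a system inserts heavily (as vanilla Whisper does here). A minimal sketch, independent of whatever text normalization the evaluation applies:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # d[j] = distance(ref[:i], hyp[:j])
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
            prev, d[j] = d[j], cur
    return d[-1] / max(len(ref), 1)

print(wer("pay 1500 rupees now", "pay 1500 rupees now"))           # 0.0
print(wer("pay 1500 rupees now", "pay fifteen hundred rupees now"))  # 0.5
```

EHR additionally requires per-utterance entity annotations, so it is not reproducible from transcripts alone.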
## Usage

```python
import torch
import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel

base_model = "vasista22/whisper-telugu-large-v2"
processor = WhisperProcessor.from_pretrained(base_model, language="telugu", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(base_model, torch_dtype=torch.bfloat16).to("cuda")

# vasista22's saved generation_config requires explicit forced_decoder_ids under transformers >=4.40
forced = processor.tokenizer.get_decoder_prompt_ids(language="telugu", task="transcribe")
model.config.forced_decoder_ids = forced
model.generation_config.forced_decoder_ids = forced
model.generation_config.suppress_tokens = []

model = PeftModel.from_pretrained(model, "Praxel/praxy-stt-te-rb")
model.eval()

# Transcribe a 16 kHz mono clip
audio, _ = librosa.load("path/to/audio.wav", sr=16000, mono=True)
feats = processor.feature_extractor(
    audio, sampling_rate=16000, return_tensors="pt"
).input_features.to("cuda", dtype=torch.bfloat16)
with torch.no_grad():
    pred_ids = model.generate(feats, max_new_tokens=400, num_beams=1)
text = processor.tokenizer.decode(pred_ids[0], skip_special_tokens=True).strip()
print(text)
```

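For serving without a `peft` dependency at inference time, the adapter can be folded into the base weights with the standard `peft` merge API; a sketch continuing from the snippet above (the output directory name is our own, not part of the release):

```python
# Assumes `model` is the PeftModel and `processor` the WhisperProcessor
# from the usage snippet above.
merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights
merged.save_pretrained("praxy-stt-te-rb-merged")
processor.save_pretrained("praxy-stt-te-rb-merged")
# The merged checkpoint then loads with plain
# WhisperForConditionalGeneration.from_pretrained("praxy-stt-te-rb-merged")
```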
## Training

- **Base:** `vasista22/whisper-telugu-large-v2` (IIT-Madras Speech Lab, Apache-2.0)
- **LoRA config:** rank 16, alpha 32, dropout 0.05, target modules `q_proj k_proj v_proj out_proj`
- **Training corpus:** Entity-Dense Synthetic Audio (~22 audio-hours per language) from Praxy R6 / vanilla Chatterbox / IndicF5 / ElevenLabs / Cartesia synthesis; Cartesia rows held out as the evaluation set
- **Steps:** 4000 on a Modal A10G, ~$5 of compute
- **Pin chain:** `transformers==4.36.2`, `peft==0.10.0`, `torch==2.4.0` (vasista22's saved generation_config is incompatible with newer transformers)

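The pin chain above translates directly into an environment setup; a minimal sketch (the virtualenv name is arbitrary):

```shell
python -m venv .venv
source .venv/bin/activate
pip install "transformers==4.36.2" "peft==0.10.0" "torch==2.4.0" librosa
```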
## License + companion work

Apache-2.0 (matches the upstream vasista22 license).

This is paper #3 in a series:
- **Praxy Voice TTS** (paper #1, the synthesis half of this flywheel): [arXiv:2604.25441](https://arxiv.org/abs/2604.25441)
- **PSP** (paper #2, the accent metric used to validate synthesis quality): [arXiv:2604.25476](https://arxiv.org/abs/2604.25476)
- **STT Flywheel** (this paper): preprint forthcoming

Companion β models: `Praxel/praxy-stt-hi-rb`, `Praxel/praxy-stt-ta-rb`.

## Limitations

- Entity-dense evaluation uses Cartesia-synthesised audio held out from training; transfer to natively spoken entity-dense speech is not directly measured.
- The pre-registered target of EHR ≥ 0.75 was missed (0.473 achieved); entity-dense Indic ASR remains substantially open as a research direction.
- The read-prose regression is bounded but real (+6 pp on FLEURS-Te); for pure read-prose deployment the upstream vasista22 base is preferable.

## Citation

```bibtex
@misc{praxy_stt_2026,
  author       = {Menta, Venkata Pushpak Teja},
  title        = {The TTS--STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail},
  year         = {2026},
  publisher    = {Praxel Ventures},
  howpublished = {\url{https://huggingface.co/Praxel/praxy-stt-te-rb}},
}
```
adapter_config.json
ADDED
@@ -0,0 +1,34 @@
{
  "alpha_pattern": {},
  "auto_mapping": {
    "base_model_class": "WhisperForConditionalGeneration",
    "parent_library": "transformers.models.whisper.modeling_whisper"
  },
  "base_model_name_or_path": "vasista22/whisper-telugu-large-v2",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 16,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "v_proj",
    "q_proj",
    "out_proj",
    "k_proj"
  ],
  "task_type": null,
  "use_dora": false,
  "use_rslora": false
}
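As a sanity check (our arithmetic, not part of the release), the adapter size is consistent with this config: Whisper-large-v2 has d_model = 1280 with 32 encoder layers (self-attention only) and 32 decoder layers (self- plus cross-attention), and each targeted projection gets two rank-16 LoRA matrices:

```python
# LoRA parameter count implied by adapter_config.json for Whisper-large-v2.
d_model, r = 1280, 16
enc_projs = 32 * 4           # q/k/v/out_proj in each encoder self-attention block
dec_projs = 32 * 2 * 4       # decoder has self- AND cross-attention per layer
per_proj = r * d_model + d_model * r   # A (r x d_model) + B (d_model x r)
total = (enc_projs + dec_projs) * per_proj
print(total)      # 15728640 trainable parameters
print(total * 2)  # 31457280 bytes at 2 bytes/param (bf16)
```

At 2 bytes per parameter this gives ~31.5 MB, matching the listed 31,568,280-byte shard up to safetensors header overhead.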
adapter_model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d63ca26518e18997022be01080e332125c9155a4c9d442e999b93ee3c7ed371d
size 31568280
preprocessor_config.json
ADDED
@@ -0,0 +1,14 @@
{
  "chunk_length": 30,
  "feature_extractor_type": "WhisperFeatureExtractor",
  "feature_size": 80,
  "hop_length": 160,
  "n_fft": 400,
  "n_samples": 480000,
  "nb_max_frames": 3000,
  "padding_side": "right",
  "padding_value": 0.0,
  "processor_class": "WhisperProcessor",
  "return_attention_mask": false,
  "sampling_rate": 16000
}
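For reference, these values are mutually consistent with Whisper's fixed 30-second input window; a quick check of the arithmetic (our derivation, not part of the config):

```python
# Whisper feature-extractor arithmetic implied by preprocessor_config.json.
sampling_rate = 16000
chunk_length = 30                          # seconds per window
hop_length = 160                           # samples between STFT frames
n_samples = sampling_rate * chunk_length   # 480000, as in the config
nb_max_frames = n_samples // hop_length    # 3000, as in the config
print(n_samples, nb_max_frames)            # 480000 3000
```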