bc7ec356 committed
Commit ddfad4d · verified · 1 parent: 4cad574

Update README.md

Files changed (1): README.md (+55 −63)
README.md CHANGED
@@ -222,11 +222,63 @@ metrics:
  pipeline_tag: automatic-speech-recognition
  ---
 
- # HEEP Universal
-
- **High Entropy Exponential Pruning for State-of-the-Art Multilingual ASR**
-
- HEEP Universal is a state-of-the-art automatic speech recognition model that demonstrates how strategic entropy-based data curation outperforms brute-force data scaling. With a composite word error rate (WER) of **3.10%** on English benchmarks, it challenges the "more data is better" paradigm by training on carefully selected high-information samples.
 
  ## Model Overview
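The composite WER quoted in the description above is the standard word error rate. As a minimal illustrative sketch (not the authors' evaluation code), WER is the word-level edit distance between hypothesis and reference, normalized by reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance. Illustrative sketch only."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))               # 0.0
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.333
```

A "composite" WER such as the 3.10% figure is then an aggregate of this metric across several English benchmarks.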
 
@@ -361,66 +413,6 @@ Output: Curated dataset D*
 
-
-
- # Post-Rebuttal Update: Cross-Architecture Validation with HEEP-Indic
-
- **Addressing Q1 (Gain Attribution), Q2 (Baselines), and Q3 (Base Model Dependency)**
-
- We apologize for the supplementary post after the rebuttal period. These results were finalized shortly after the deadline, and we wanted to ensure complete experimental evidence was available rather than leave placeholders.
-
- ### 🔗 Resources
-
- * **Reproducibility (Universal Model):** [https://huggingface.co/bc7ec356/heep-universal](https://huggingface.co/bc7ec356/heep-universal)
- * **Cross-Architecture Model (Indic):** [https://huggingface.co/bc7ec356/heep-indic](https://huggingface.co/bc7ec356/heep-indic)
-
- ## Cross-Architecture Generalization
-
- To directly address concerns about generalization beyond Whisper V3 Turbo, we trained **Qwen3-ASR (1.7B)**, an architecturally distinct audio-language model, on HEEP-curated data spanning **46 Indian languages** (~4.78M utterances). The curation pipeline is identical to the one described in the paper with no architecture-specific tuning.
-
- ## Hindi Benchmark Comparison (7 Benchmarks)
-
- | Model | Kathbath | Kathbath Noisy | CommonVoice | FLEURS | IndicTTS | RESPIN | Gramvaani | **Avg** |
- | :------------------------- | :------: | :------------: | :---------: | :-------: | :------: | :-------: | :-------: | :------: |
- | Google STT | 14.3 | 16.7 | 20.8 | 19.4 | 18.3 | – | 59.9 | 24.9 |
- | IndicWav2Vec | 12.2 | 16.2 | 20.2 | 18.3 | 15.0 | – | 42.1 | 20.7 |
- | Azure STT | 13.6 | 15.1 | 14.6 | 24.3 | 15.2 | – | 42.3 | 20.8 |
- | Nvidia Conformer-CTC Large | 12.7 | 14.2 | 21.2 | 15.7 | 12.2 | – | 42.6 | 19.8 |
- | IndicWhisper | 10.3 | 12.0 | 15.0 | 11.4 | 7.6 | – | 26.8 | 13.8 |
- | **HEEP-Indic** | **8.53** | **8.97** | **9.96** | **11.04** | **6.59** | **12.05** | **25.98** | **11.9** |
-
- **HEEP-Indic achieves 11.9% average Hindi WER vs. 13.8% for IndicWhisper (14% relative improvement).**
-
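The relative-improvement figure quoted just above follows directly from the two benchmark averages; a minimal sketch of the arithmetic:

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction: (baseline - new) / baseline."""
    return (baseline_wer - new_wer) / baseline_wer

# HEEP-Indic 11.9 average Hindi WER vs. IndicWhisper 13.8 (values from the table above)
print(round(100 * relative_improvement(13.8, 11.9)))  # 14 (percent)
```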
397
- ## Multilingual Results (16 Languages)
398
-
399
- | Dataset | Ben | Bho | Chh | Guj | Hin | Kan | Mag | Mai | Mal | Mar | Odi | Pun | San | Tam | Tel | Urd | **Avg** |
400
- | :------------ | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: |
401
- | Kathbath | 14.6 | – | – | 17.4 | 8.5 | 23.0 | – | – | 39.3 | 19.2 | 25.4 | 15.8 | 41.4 | 30.3 | 29.0 | 12.1 | 23.0 |
402
- | Kathbath Hard | 15.7 | – | – | 18.5 | 9.0 | 25.1 | – | – | 41.2 | 20.4 | 27.7 | 16.6 | 43.6 | 32.6 | 30.3 | 11.9 | 24.4 |
403
- | CommonVoice | 21.0 | – | – | – | 10.0 | – | – | – | 46.0 | 21.5 | 34.6 | 17.5 | – | 34.0 | – | 20.6 | 25.7 |
404
- | FLEURS | 22.4 | – | – | 23.3 | 11.0 | 23.1 | – | – | 34.4 | 25.5 | 33.3 | 25.0 | – | 35.1 | 31.9 | 22.4 | 26.1 |
405
- | IndicTTS | 15.8 | – | – | 16.9 | 6.6 | 19.6 | – | – | 26.4 | 14.5 | 14.8 | – | – | 22.6 | 31.3 | – | 18.7 |
406
- | Gramvaani | – | – | – | – | 26.0 | – | – | – | – | – | – | – | – | – | – | – | 26.0 |
407
- | RESPIN | 32.5 | 21.3 | 21.6 | – | 12.1 | 45.6 | 27.7 | 41.1 | – | 32.7 | – | – | – | – | 37.5 | – | 30.2 |
408
- | **Avg** | **20.4** | **21.3** | **21.6** | **19.0** | **11.9** | **27.3** | **27.7** | **41.1** | **37.5** | **22.3** | **27.2** | **18.7** | **42.5** | **30.9** | **32.0** | **16.7** | **24.6** |
409
-
410
- ## Key Takeaways
411
-
412
- 1. **Cross-architecture generalization confirmed.** The same HEEP pipeline improves two distinct backbones: Whisper V3 Turbo (0.8B, encoder-decoder) and Qwen3-ASR (1.7B, audio-language model), without modification.
413
-
414
- 2. **Controlled multilingual evaluation.** Results span 16 languages across Indo-Aryan, Dravidian, and Classical families on standardized benchmarks with consistent evaluation protocols.
415
-
416
- 3. **Model-independent scoring.** Entropy scoring operates on MFCCs, G2P phonemes, and token distributions, not model internals. The same curated dataset was used for both backbones.
417
-
418
- 4. **Reproducibility.** Model weights, curation code, and training scripts for both backbones are at the anonymous repository.
419
-
420
- *We hope Reviewers 2ezj, oXjG, and S4Jd also find this supplementary evidence relevant to their earlier questions on generalization and controlled multilingual evaluation.*
421
-
422
- ---
423
-
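The "model-independent scoring" point above names the inputs (MFCCs, G2P phonemes, token distributions) but not the exact estimator. As a rough sketch only, under the assumption that the score is a plain Shannon entropy over an empirical distribution (the discretization and any weighting across the three signal types are my assumptions, not the published pipeline):

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """Shannon entropy (bits) of the empirical distribution of `tokens`.
    Higher entropy = more information-dense sample, the property HEEP is
    described as selecting for. Sketch only; the actual pipeline also
    scores MFCC frames and G2P phoneme sequences, not just text tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A repetitive transcript scores lower than a varied one:
low = shannon_entropy("la la la la la la".split())            # 0.0 (one distinct token)
high = shannon_entropy("the quick brown fox jumps over".split())  # log2(6) ≈ 2.585
print(low, high)
```

The same scoring function can be applied to phoneme sequences or quantized acoustic frames, which is what makes the score independent of any particular ASR backbone.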
  ## Usage
 
  ```python
  pipeline_tag: automatic-speech-recognition
  ---
 
+ # Post-Rebuttal Update: Cross-Architecture Validation with HEEP-Indic
+
+ **Addressing Q1 (Gain Attribution), Q2 (Baselines), and Q3 (Base Model Dependency)**
+
+ We apologize for posting this supplementary material after the rebuttal period. These results were finalized shortly after the deadline, and we wanted to provide complete experimental evidence rather than leave placeholders.
+
+ ### 🔗 Resources
+
+ * **Reproducibility (Universal Model):** [https://huggingface.co/bc7ec356/heep-universal](https://huggingface.co/bc7ec356/heep-universal)
+ * **Cross-Architecture Model (Indic):** [https://huggingface.co/bc7ec356/heep-indic](https://huggingface.co/bc7ec356/heep-indic)
+
+ ## Cross-Architecture Generalization
+
+ To directly address concerns about generalization beyond Whisper V3 Turbo, we trained **Qwen3-ASR (1.7B)**, an architecturally distinct audio-language model, on HEEP-curated data spanning **46 Indian languages** (~4.78M utterances). The curation pipeline is identical to the one described in the paper, with no architecture-specific tuning.
+
+ ## Hindi Benchmark Comparison (7 Benchmarks)
+
+ | Model | Kathbath | Kathbath Noisy | CommonVoice | FLEURS | IndicTTS | RESPIN | Gramvaani | **Avg** |
+ | :------------------------- | :------: | :------------: | :---------: | :-------: | :------: | :-------: | :-------: | :------: |
+ | Google STT | 14.3 | 16.7 | 20.8 | 19.4 | 18.3 | – | 59.9 | 24.9 |
+ | IndicWav2Vec | 12.2 | 16.2 | 20.2 | 18.3 | 15.0 | – | 42.1 | 20.7 |
+ | Azure STT | 13.6 | 15.1 | 14.6 | 24.3 | 15.2 | – | 42.3 | 20.8 |
+ | Nvidia Conformer-CTC Large | 12.7 | 14.2 | 21.2 | 15.7 | 12.2 | – | 42.6 | 19.8 |
+ | IndicWhisper | 10.3 | 12.0 | 15.0 | 11.4 | 7.6 | – | 26.8 | 13.8 |
+ | **HEEP-Indic** | **8.53** | **8.97** | **9.96** | **11.04** | **6.59** | **12.05** | **25.98** | **11.9** |
+
+ **HEEP-Indic achieves 11.9% average Hindi WER vs. 13.8% for IndicWhisper, a 14% relative improvement.**
+
+ ## Multilingual Results (16 Languages)
+
+ | Dataset | Ben | Bho | Chh | Guj | Hin | Kan | Mag | Mai | Mal | Mar | Odi | Pun | San | Tam | Tel | Urd | **Avg** |
+ | :------------ | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: |
+ | Kathbath | 14.6 | – | – | 17.4 | 8.5 | 23.0 | – | – | 39.3 | 19.2 | 25.4 | 15.8 | 41.4 | 30.3 | 29.0 | 12.1 | 23.0 |
+ | Kathbath Hard | 15.7 | – | – | 18.5 | 9.0 | 25.1 | – | – | 41.2 | 20.4 | 27.7 | 16.6 | 43.6 | 32.6 | 30.3 | 11.9 | 24.4 |
+ | CommonVoice | 21.0 | – | – | – | 10.0 | – | – | – | 46.0 | 21.5 | 34.6 | 17.5 | – | 34.0 | – | 20.6 | 25.7 |
+ | FLEURS | 22.4 | – | – | 23.3 | 11.0 | 23.1 | – | – | 34.4 | 25.5 | 33.3 | 25.0 | – | 35.1 | 31.9 | 22.4 | 26.1 |
+ | IndicTTS | 15.8 | – | – | 16.9 | 6.6 | 19.6 | – | – | 26.4 | 14.5 | 14.8 | – | – | 22.6 | 31.3 | – | 18.7 |
+ | Gramvaani | – | – | – | – | 26.0 | – | – | – | – | – | – | – | – | – | – | – | 26.0 |
+ | RESPIN | 32.5 | 21.3 | 21.6 | – | 12.1 | 45.6 | 27.7 | 41.1 | – | 32.7 | – | – | – | – | 37.5 | – | 30.2 |
+ | **Avg** | **20.4** | **21.3** | **21.6** | **19.0** | **11.9** | **27.3** | **27.7** | **41.1** | **37.5** | **22.3** | **27.2** | **18.7** | **42.5** | **30.9** | **32.0** | **16.7** | **24.6** |
+
+ ## Key Takeaways
+
+ 1. **Cross-architecture generalization confirmed.** The same HEEP pipeline improves two distinct backbones, Whisper V3 Turbo (0.8B, encoder-decoder) and Qwen3-ASR (1.7B, audio-language model), without modification.
+
+ 2. **Controlled multilingual evaluation.** Results span 16 languages across Indo-Aryan, Dravidian, and Classical families on standardized benchmarks with consistent evaluation protocols.
+
+ 3. **Model-independent scoring.** Entropy scoring operates on MFCCs, G2P phonemes, and token distributions, not model internals. The same curated dataset was used for both backbones.
+
+ 4. **Reproducibility.** Model weights, curation code, and training scripts for both backbones are available at the anonymous repository.
+
+ *We hope Reviewers 2ezj, oXjG, and S4Jd also find this supplementary evidence relevant to their earlier questions on generalization and controlled multilingual evaluation.*
+
+ ---
 
  ## Model Overview
 
 