Automatic Speech Recognition
ONNX
Transformers.js
onnxruntime
qwen3_asr
text-generation
onnxruntime-web
asr
speech-recognition
robust-asr
quantized
int4
int8
matmulnbits
gptq
on-device
browser
web
qwen3
qwen3-asr
mega-asr
How to use Reza2kn/mega-asr-onnx with Transformers.js:
// npm i @huggingface/transformers
import { pipeline } from '@huggingface/transformers';

// Allocate pipeline
const pipe = await pipeline('automatic-speech-recognition', 'Reza2kn/mega-asr-onnx');
Update quality table with mixed8 (INT8 enc + INT4 dec = 91.9%) as recommended config
README.md CHANGED
@@ -73,19 +73,24 @@ Conv2d front-end at fp16.
73   ## Quality
74
75   Benchmarked on the 8 [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
76 - example clips (real-world noisy conditions), word-level
77 -
78 -
79 -
80 -
81 -
82 -
83 -
84 -
85 -
86 -
87 -
88 -
89
90   ## Inference (Python)
91
73   ## Quality
74
75   Benchmarked on the 8 [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
76 + example clips (real-world noisy conditions, all English), word-level
77 + agreement (1 − WER), prompt forced to `language English`:
78 +
79 + | Variant | Encoder | Decoder | Avg agreement | 100% samples | Total size |
80 + | --- | --- | --- | ---: | ---: | ---: |
81 + | PT bf16 (original) | fp16 | fp16 | 95.1% | 6 / 8 | 7.5 GB |
82 + | ONNX fp16 (ref) | fp32 | fp16 | **96.7%** | 7 / 8 | 8.2 GB |
83 + | **ONNX recommended** | **INT8** | **INT4** | **91.9%** | **6 / 8** | **2.3 GB** |
84 + | ONNX small | INT4 | INT4 | 87.8% | 6 / 8 | 2.0 GB |
85 +
86 + The recommended config (INT8 audio encoder + INT4 LLM decoder) is the
87 + size/quality sweet spot for browser deployment. Forcing the language
88 + (rather than auto-detecting) recovers most of the quantization drift.
89 +
90 + **Note**: fp16 ONNX actually beats PT bf16 (96.7 vs 95.1) because forcing
91 + language skips the model's audio-quality-router language detection,
92 + which is where the PT model loses points on `echo` (heard "by the walk"
93 + instead of "while walking") and `recording` (truncated).
94
95   ## Inference (Python)
96
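The Quality table in this diff reports word-level agreement (1 − WER) per clip, averaged over the 8 clips, plus a count of clips scored at 100%. The README does not include the scoring script, so the following is only a minimal sketch of how such numbers could be computed. The function names (`tokenize`, `wordEditDistance`, `wordAgreement`, `summarize`) are hypothetical, and the text normalization and the clamping at 0 are assumptions, not the author's method.

```javascript
// Hypothetical helpers for illustration; not part of Reza2kn/mega-asr-onnx.

// Assumed normalization: lowercase, drop punctuation, split on whitespace.
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}'\s]/gu, ' ')
    .split(/\s+/)
    .filter(Boolean);
}

// Word-level Levenshtein distance: minimum number of substitutions,
// insertions, and deletions needed to turn `ref` into `hyp`.
function wordEditDistance(ref, hyp) {
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // match or substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length];
}

// Agreement = 1 - WER, where WER = edit distance / reference length.
// Clamped at 0 (an assumption) because WER can exceed 1.
function wordAgreement(reference, hypothesis) {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 1 : 0;
  return Math.max(0, 1 - wordEditDistance(ref, hyp) / ref.length);
}

// Aggregate per-clip scores into the two table columns:
// average agreement and the number of clips with a perfect score.
function summarize(scores) {
  const avg = scores.reduce((a, b) => a + b, 0) / scores.length;
  const perfect = scores.filter((s) => s === 1).length;
  return { avg, perfect, total: scores.length };
}

// Toy strings, invented for this sketch (not benchmark transcripts).
const scores = [
  wordAgreement('the cat sat on the mat', 'the cat sat on the mat'),
  wordAgreement('the cat sat on the mat', 'the cat sat on a mat'),
];
console.log(summarize(scores));
```

A different normalization (for example, expanding numerals or contractions) would shift the scores slightly, so figures from this sketch should not be expected to reproduce the table exactly.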