Add README with full breakdown, tags, base_model_relation: quantized
README.md
ADDED
@@ -0,0 +1,135 @@
---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: onnxruntime
tags:
- onnx
- onnxruntime
- onnxruntime-web
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- int4
- int8
- matmulnbits
- on-device
- browser
- web
- qwen3
- qwen3-asr
- mega-asr
- transformers.js
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
---

# Mega-ASR: INT4 ONNX

INT4 ONNX export of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR),
a 1.7B-parameter multilingual ASR foundation model built on Qwen3-ASR-1.7B with
2.6M training samples covering noise, far-field speech, obstruction, recording
artifacts, echo, dropout, and transmission dropout.

The model is split into three ONNX files (Whisper-style: audio encoder + LLM
decoder prefill + LLM decoder step) so it can be loaded **directly in the
browser** via [`onnxruntime-web`](https://onnxruntime.ai/docs/api/javascript/) or as
a CPU/GPU service via `onnxruntime`. INT4 weight quantization (MatMulNBits, 4-bit
block-32 symmetric) compresses the model from ~7.5 GB fp16 down to **~2 GB**
total, small enough for a one-time browser cache.
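
To make "4-bit block-32 symmetric" concrete, here is a minimal pure-Python sketch of the idea: every run of 32 weights shares one scale, and each weight is stored as a signed 4-bit code. It is an illustration only. The scale rule, the rounding, and the assumption of one fp16 scale per block are choices made for this sketch, not a description of how `onnxruntime` packs MatMulNBits weights.

```python
from typing import List, Sequence, Tuple

BLOCK_SIZE = 32      # block size named in the text above
QMIN, QMAX = -8, 7   # signed 4-bit range (assumption for this sketch)

Block = Tuple[List[int], float]  # (4-bit codes, per-block scale)


def quantize_block(block: Sequence[float]) -> Block:
    """Symmetric quantization: one scale per block, no zero point."""
    max_abs = max(abs(w) for w in block)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [max(QMIN, min(QMAX, round(w / scale))) for w in block]
    return codes, scale


def quantize(weights: Sequence[float]) -> List[Block]:
    """Split a flat weight list into blocks and quantize each one."""
    return [
        quantize_block(weights[i:i + BLOCK_SIZE])
        for i in range(0, len(weights), BLOCK_SIZE)
    ]


def dequantize(blocks: Sequence[Block]) -> List[float]:
    """Rebuild approximate weights from codes and scales."""
    return [code * scale for codes, scale in blocks for code in codes]


def bits_per_weight(scale_bits: int = 16) -> float:
    """Storage cost per weight, assuming one scale of `scale_bits` per block."""
    return 4 + scale_bits / BLOCK_SIZE


# Toy weights, not taken from the model.
weights = [((i * 37) % 101 - 50) / 25.0 for i in range(64)]
restored = dequantize(quantize(weights))
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max abs error: {worst:.4f}, bits per weight: {bits_per_weight()}")
```

Under these assumptions each weight costs 4.5 bits instead of 16, which caps the saving at roughly 3.6× for fully quantized tensors; layers left at fp16 pull the real figure lower.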

## What's in this repo

| File | Size | Role |
| --- | ---: | --- |
| `onnx/audio_encoder_int4.onnx` (+ `.data`) | **214 MB** | mel features → audio embeddings (24-layer Whisper-style encoder) |
| `onnx/decoder_prefill_int4.onnx` (+ `.data`) | **968 MB** | Qwen3 1.7B decoder, full-length prefill (no KV cache) |
| `onnx/decoder_step_int4.onnx` (+ `.data`) | **968 MB** | Qwen3 1.7B decoder, single-token step (with KV cache) |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench) |
| `tokenizer_config.json` / `vocab.json` / `merges.txt` | - | Qwen3 BPE tokenizer assets |
| `preprocessor_config.json` | - | Whisper-style mel feature extractor config |
| `inference.py` | - | Standalone Python ASR pipeline using these ONNX files |
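
The split between a prefill graph and a step graph implies a two-phase decoding loop: run prefill once over the whole prompt, then call step once per generated token while passing the KV cache along. The sketch below shows only that control flow. The `prefill` and `step` callables, their signatures, and the toy stand-ins are hypothetical; the real tensor names and shapes live in `inference.py` and are not reproduced here.

```python
from typing import Any, Callable, List, Sequence, Tuple

Cache = Any  # opaque KV-cache handle; in practice a collection of tensors


def greedy_decode(
    prefill: Callable[[Sequence[int]], Tuple[int, Cache]],
    step: Callable[[int, Cache], Tuple[int, Cache]],
    prompt_ids: Sequence[int],
    eos_id: int,
    max_new_tokens: int = 256,
) -> List[int]:
    """Run the prefill graph once, then the step graph once per new token."""
    next_id, cache = prefill(prompt_ids)
    output: List[int] = []
    for _ in range(max_new_tokens):
        if next_id == eos_id:
            break
        output.append(next_id)
        next_id, cache = step(next_id, cache)
    return output


# Toy stand-ins so the loop runs without the model files (hypothetical).
def fake_prefill(prompt_ids: Sequence[int]) -> Tuple[int, Cache]:
    return len(prompt_ids), list(prompt_ids)


def fake_step(token_id: int, cache: Cache) -> Tuple[int, Cache]:
    return token_id + 1, cache + [token_id]


tokens = greedy_decode(fake_prefill, fake_step, prompt_ids=[10, 11, 12], eos_id=6)
print(tokens)  # [3, 4, 5]
```

In a real pipeline the two callables would wrap the prefill and step ONNX sessions, with the audio encoder's output already merged into the prefill inputs.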

## Compression vs original

| Component | Original (fp16 PT) | This (INT4 ONNX) | Savings |
| --- | ---: | ---: | ---: |
| Audio encoder | ~635 MB | **214 MB** | 3.0× |
| LLM decoder | ~3.4 GB × 2 (prefill + step) | **968 MB × 2** | 3.5× |
| **Total deploy** | **~7.5 GB** | **~2.15 GB** | **~3.5× smaller** |

The decoder is ~95% MatMul ops, so MatMulNBits 4-bit quantization works well.
The audio encoder is mostly Conv2d and Linear ops in transformer layers:
MatMulNBits quantizes the transformer Linear ops (most of the weight) but
leaves the small Conv2d front-end at fp16.
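
The ratios can be rechecked from the per-component sizes listed above. This short calculation assumes decimal units (1 GB = 1000 MB), which the README does not state. Summing the listed files gives about 2.15 GB, so the overall ratio comes out near 3.5×; rounding the total down to 2.0 GB gives a higher figure.

```python
# Sizes in MB, copied from the tables above; 1 GB = 1000 MB is an assumption.
original_mb = {"audio_encoder": 635, "decoder_prefill": 3400, "decoder_step": 3400}
int4_mb = {"audio_encoder": 214, "decoder_prefill": 968, "decoder_step": 968}


def ratio(component: str) -> float:
    """Original size divided by INT4 size for one component."""
    return original_mb[component] / int4_mb[component]


total_original = sum(original_mb.values())
total_int4 = sum(int4_mb.values())

for name in original_mb:
    print(f"{name}: {ratio(name):.1f}x")
print(f"total: {total_original} MB -> {total_int4} MB "
      f"({total_original / total_int4:.1f}x)")
```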

## Quality

Benchmarked on the 8 [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
example clips (real-world noisy conditions), word-level agreement (1 − WER):

| Variant | Average agreement | 100% samples |
| --- | ---: | ---: |
| PT bf16 (original) | **95.1%** | 6 / 8 |
| ONNX fp16 (this export) | 87.8% | 6 / 8 |
| **ONNX INT4 (this repo)** | **87.8%** | 6 / 8 |

INT4 quantization adds **no measurable loss on top of the export**: on these
8 clips it scores the same as the fp16 ONNX export. The gap to bf16 PT comes
from the export itself, not the quantization: numerical drift through dynamo +
Cache lowering on a few hard samples (`echo` / `recording`). Most clean speech
transcribes at 100%.
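
For readers who want to reproduce the metric, word-level agreement (1 − WER) can be computed with a word-level edit distance. The sketch below is one plausible implementation, not the script used for the table. Lowercasing, whitespace tokenization, and clipping agreement at 0 are assumptions; the README does not say how text was normalized.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def agreement(reference: str, hypothesis: str) -> float:
    """1 - WER, clipped at 0 because WER can exceed 1 (assumption)."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


# One substituted word out of five.
print(round(agreement("turn the lights off please", "turn the light off please"), 3))
```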

## Inference (Python)

```bash
pip install onnxruntime numpy soundfile transformers qwen-asr
git clone https://huggingface.co/Reza2kn/mega-asr-onnx
cd mega-asr-onnx
python inference.py --audio examples/noise.wav
```

## Inference (browser)

A live browser demo (loads these ONNX models directly via `onnxruntime-web`
and WebGPU) is at
[Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench).
The first visit downloads ~2 GB of model weights, cached by the browser for
subsequent runs.

## Performance

| Hardware | Cold (model load) | Warm (3-4 s audio) |
| --- | ---: | ---: |
| RTX 5080 (CUDAExecutionProvider) | ~5 s | ~1.5 s |
| M-series Mac (CPUExecutionProvider) | ~12 s | ~6 s |
| Browser, WebGPU (RTX 5080) | ~10 s + ~1 GB download (cached) | ~3 s |
| Browser, WASM CPU | ~10 s + download | ~30 s |
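
One way to read the warm column is as a real-time factor (RTF): processing time divided by audio duration, where values below 1 mean faster than real time. The calculation below uses the approximate warm latencies from the table and the stated 3-4 s clip length. The RTF framing is an addition for illustration, not a figure reported by the benchmark.

```python
# Approximate warm latencies in seconds, copied from the table above.
WARM_LATENCY_S = {
    "RTX 5080 (CUDA)": 1.5,
    "M-series Mac (CPU)": 6.0,
    "Browser, WebGPU": 3.0,
    "Browser, WASM CPU": 30.0,
}
AUDIO_RANGE_S = (3.0, 4.0)  # clip length range given in the column header


def real_time_factor(latency_s: float, audio_s: float) -> float:
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return latency_s / audio_s


for name, latency in WARM_LATENCY_S.items():
    best = real_time_factor(latency, AUDIO_RANGE_S[1])
    worst = real_time_factor(latency, AUDIO_RANGE_S[0])
    print(f"{name}: RTF {best:.2f} to {worst:.2f}")
```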

## Conversion details

- Exported via `torch.onnx.export(..., dynamo=True)` from PyTorch 2.12.
- Audio encoder rewrites: replaced packed-sequence flash-attention with
  standard batched attention + chunked Conv2d (parity cos ≈ 0.998 vs original).
- Decoder uses a single `DecoderForExport` wrapper that accepts a flat tuple
  of KV cache tensors; prefill and step are two specialisations of the same
  Python wrapper exported separately.
- Quantization: `onnxruntime.quantization.matmul_nbits_quantizer.MatMulNBitsQuantizer`
  with `block_size=32`, `is_symmetric=True`, `bits=4`. Non-MatMul ops (LayerNorm,
  RMSNorm, residuals, RoPE, the audio-encoder Conv2d front-end) stay at fp16.
- KV cache: dynamic past-length axis (`Dim("T_past")`) via dynamo's
  `dynamic_shapes` API.
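
The parity figure quoted for the audio encoder rewrite is a cosine similarity between outputs of the original and rewritten modules. A minimal way to compute such a figure on flattened outputs is sketched below. The vectors are toy values, and how the outputs were flattened or averaged for the 0.998 number is not stated in the README.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for flattened encoder outputs (not real model activations).
reference_out = [0.12, -0.53, 0.88, 0.07, -0.31]
rewritten_out = [0.11, -0.55, 0.86, 0.09, -0.30]
print(f"parity cos = {cosine_similarity(reference_out, rewritten_out):.4f}")
```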

## Credits

- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B params, Apache-2.0)
- ONNX export + quantization: this repo
- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
- Live demo: [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench)