---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: onnxruntime
tags:
- onnx
- onnxruntime
- onnxruntime-web
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- int4
- int8
- matmulnbits
- gptq
- on-device
- browser
- web
- qwen3
- qwen3-asr
- mega-asr
- transformers.js
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
---
# Mega-ASR: INT4 ONNX (GPTQ-calibrated)
INT4 ONNX export of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR),
a 1.7B-parameter multilingual ASR foundation model built on Qwen3-ASR-1.7B and
trained on 2.6M samples covering noise, far-field speech, obstruction, distortion,
recording artifacts, echo, and transmission dropout.
The model is split into three ONNX files (Whisper-style: audio encoder + LLM
decoder prefill + LLM decoder step) so it can be loaded **directly in the
browser** via [`onnxruntime-web`](https://onnxruntime.ai/docs/api/javascript/) or as
a CPU/GPU service via `onnxruntime`. INT4 weight quantization (MatMulNBits, 4-bit
block-32 asymmetric) compresses the model from ~7.5 GB fp16 down to **~2 GB**
total, small enough for a one-time browser cache.
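As a rough illustration of what "4-bit block-32 asymmetric" means, the sketch below quantizes one block of 32 weights to 4-bit codes with a per-block scale and zero point, then dequantizes it. It is a plain-Python toy under stated assumptions: the function names are hypothetical, and it does not reproduce the packing layout or kernels of the actual MatMulNBits operator.

```python
BLOCK_SIZE = 32   # weights per quantization block, as stated above
LEVELS = 15       # 4-bit codes span 0..15


def quantize_block(weights):
    """Return (codes, scale, zero_point) for one block of floats."""
    # Asymmetric: the range need not be centred on zero, but we make sure
    # 0.0 lies inside it so that the zero point fits in 0..15.
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / LEVELS or 1.0   # fall back to 1.0 for an all-zero block
    zero_point = round(-lo / scale)
    codes = [min(LEVELS, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_block(codes, scale, zero_point):
    """Map 4-bit codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


# Illustrative block with an off-centre range (-0.12 .. 0.19).
block = [((i * 7) % 32 - 12) / 100 for i in range(BLOCK_SIZE)]
codes, scale, zp = quantize_block(block)
restored = dequantize_block(codes, scale, zp)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.5f} zero_point={zp} worst_error={worst:.5f}")
```

The worst-case reconstruction error per weight is about half a quantization step (`scale / 2`), which is why smaller blocks (tighter ranges) tend to cost less accuracy at the price of more per-block metadata.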
**Both decoder halves are GPTQ-calibrated** on English Voices-in-the-Wild
samples (168 for prefill, 63 for step). The step model uses past-KV-cache-aware
calibration: prefill output is piped into step, so the calibration captures the
realistic activation distribution of autoregressive decoding.
## What's in this repo
| File | Size | Role |
| --- | ---: | --- |
| `onnx/audio_encoder_int4.onnx` (+ `.data`) | **214 MB** | mel features → audio embeddings (24-layer Whisper-style encoder) |
| `onnx/decoder_prefill_int4.onnx` (+ `.data`) | **968 MB** | Qwen3 1.7B decoder, full-length prefill (no KV cache, **GPTQ-calibrated**) |
| `onnx/decoder_step_int4.onnx` (+ `.data`) | **968 MB** | Qwen3 1.7B decoder, single-token step (with KV cache, **GPTQ-calibrated**) |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench) |
| `tokenizer_config.json` / `vocab.json` / `merges.txt` | - | Qwen3 BPE tokenizer assets |
| `preprocessor_config.json` | - | Whisper-style mel feature extractor config |
| `inference.py` | - | Standalone Python ASR pipeline using these ONNX files |
## Compression vs original
| Component | Original (fp16 PT) | This (INT4 ONNX) | Savings |
| --- | ---: | ---: | ---: |
| Audio encoder | ~635 MB | **214 MB** | 3.0× |
| LLM decoder | ~3.4 GB × 2 (prefill + step) | **968 MB × 2** | 3.5× |
| **Total deploy** | **~7.5 GB** | **~2.15 GB** | **~3.5× smaller** |
The decoder is ~95% MatMul ops, so MatMulNBits 4-bit quantization works well.
The audio encoder is mostly Conv2d plus the Linear layers inside its transformer
blocks: MatMulNBits quantizes the transformer Linear ops (most of the weight)
but leaves the small Conv2d front-end at fp16.
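The decoder row of the table above (~3.4 GB down to 968 MB, roughly 3.5×) is close to what simple bookkeeping predicts. The sketch below computes the average storage per weight for 4-bit block-32 asymmetric quantization. It assumes one fp16 scale and one 4-bit zero point per block, and ignores tensors left at fp16, so treat the result as an estimate, not a measurement.

```python
def bits_per_weight(bits=4, block_size=32, scale_bits=16, zero_point_bits=4):
    """Average storage per weight: the code itself plus per-block metadata."""
    return bits + (scale_bits + zero_point_bits) / block_size


def compression_vs_fp16(**kwargs):
    """Size ratio of fp16 weights to block-quantized weights."""
    return 16 / bits_per_weight(**kwargs)


effective = bits_per_weight()        # 4 + 20/32 = 4.625
ratio = compression_vs_fp16()        # 16 / 4.625
print(f"{effective:.3f} bits/weight, {ratio:.2f}x smaller than fp16")
# prints: 4.625 bits/weight, 3.46x smaller than fp16
```

Under the same assumptions, a block size of 128 would give `bits_per_weight(block_size=128)`, about 4.16 bits per weight: slightly smaller files, but each scale then has to cover a wider range of weights.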
## Quality
Benchmarked on the 8 [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
example clips (real-world noisy conditions, all English), word-level
agreement (1 − WER), with the prompt forced to `language English`:
| Variant | Encoder | Decoder | Avg agreement | 100% samples | Total size |
| --- | --- | --- | ---: | ---: | ---: |
| PT bf16 (original) | fp16 | fp16 | 95.1% | 6 / 8 | 7.5 GB |
| ONNX fp16 (ref) | fp32 | fp16 | **96.7%** | 7 / 8 | 8.2 GB |
| **ONNX recommended (GPTQ)** | **INT8** | **INT4 GPTQ** | **92.7%** | **6 / 8** | **2.3 GB** |
| ONNX RTN (previous ship) | INT8 | INT4 RTN | 91.9% | 6 / 8 | 2.3 GB |
| ONNX small | INT4 | INT4 RTN | 87.8% | 6 / 8 | 2.0 GB |
The recommended config (INT8 audio encoder + GPTQ-INT4 LLM decoder) is the
size/quality sweet spot for browser deployment. Forcing the language
(rather than auto-detecting) recovers most of the quantization drift.
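For readers who want to reproduce the "agreement" column, here is a minimal sketch of word-level agreement as 1 − WER, where WER is the word-level edit distance divided by the reference length. The function names are hypothetical, and the text normalisation used for the table (casing, punctuation) is not specified here, so this version assumes both strings are already normalised and whitespace-tokenised.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance and the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))          # distance from empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def agreement(reference, hypothesis):
    """1 - WER; can go negative if the hypothesis has many insertions."""
    errors, n_ref = word_errors(reference, hypothesis)
    if n_ref == 0:
        return 1.0 if errors == 0 else 0.0
    return 1.0 - errors / n_ref


# Made-up sentence pair, not one of the benchmark clips.
print(round(agreement("the cat sat on the mat", "the cat sat on a mat"), 3))
```

A "100% sample" in the table then corresponds to a clip where the edit distance is zero.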
GPTQ calibration on both prefill and step yields **+0.8 percentage points over
plain RTN** (92.7% vs 91.9%) at the same model size, most visibly on the `echo`
sample, where the RTN-quantized decoder previously hallucinated *"the size was
fine standing up at the terrible white wall"*; the GPTQ-quantized decoder produces
*"the size feels fine standing up against terrible white walls"*, recovering
the leading clause exactly.
**Note**: fp16 ONNX actually beats PT bf16 (96.7% vs 95.1%) because forcing the
language skips the model's audio-quality-router language detection,
which is where the PT model loses points on the `echo` and `recording`
clips (truncated output).
## Inference (Python)
```bash
pip install onnxruntime numpy soundfile transformers qwen-asr
git clone https://huggingface.co/Reza2kn/mega-asr-onnx
cd mega-asr-onnx
python inference.py --audio examples/noise.wav
```
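If you would rather inspect the three graphs yourself than go through `inference.py`, a minimal sketch along these lines opens one `onnxruntime` session per file and prints each graph's input names. The helper names (`pick_providers`, `load_sessions`) are hypothetical, the paths come from the file table above, and the tensor names are deliberately printed instead of assumed, because they are not listed in this card.

```python
from pathlib import Path

# File names from the "What's in this repo" table.
MODEL_FILES = {
    "encoder": "onnx/audio_encoder_int4.onnx",
    "prefill": "onnx/decoder_prefill_int4.onnx",
    "step": "onnx/decoder_step_int4.onnx",
}


def pick_providers(available):
    """Prefer CUDA when onnxruntime reports it, otherwise fall back to CPU."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]


def load_sessions(repo_dir="."):
    """Open one InferenceSession per ONNX file and print its input names."""
    import onnxruntime as ort  # lazy import: the helpers above work without it

    providers = pick_providers(ort.get_available_providers())
    sessions = {}
    for role, rel_path in MODEL_FILES.items():
        # Each external `.data` file must sit next to its `.onnx` file.
        sess = ort.InferenceSession(str(Path(repo_dir) / rel_path), providers=providers)
        print(role, [inp.name for inp in sess.get_inputs()])
        sessions[role] = sess
    return sessions
```

Calling `load_sessions(".")` from the cloned repo directory should list the inputs each graph expects, which is a quick sanity check before wiring up your own pipeline.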
## Inference (browser)
A live browser demo (loads these ONNX models directly via `onnxruntime-web`
and WebGPU) is at
[Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench).
The first visit downloads ~2 GB of model weights, cached by the browser for
subsequent runs.
## Performance
| Hardware | Cold (model load) | Warm (3-4 s audio) |
| --- | ---: | ---: |
| RTX 5080 (CUDAExecutionProvider) | ~5 s | ~1.5 s |
| M-series Mac (CPUExecutionProvider) | ~12 s | ~6 s |
| Browser, WebGPU (RTX 5080) | ~10 s + ~1 GB download (cached) | ~3 s |
| Browser, WASM CPU | ~10 s + download | ~30 s |
## Conversion details
- Exported via `torch.onnx.export(..., dynamo=True)` from PyTorch 2.12.
- Audio encoder rewrites: replaced packed-sequence flash-attention with
standard batched attention + chunked Conv2d (output parity: cosine similarity ≈ 0.998 vs the original).
- Decoder uses a single `DecoderForExport` wrapper that accepts a flat tuple
of KV cache tensors; prefill and step are two specialisations of the same
Python wrapper exported separately.
- Quantization: `onnxruntime.quantization.matmul_nbits_quantizer.MatMulNBitsQuantizer`
with `block_size=32`, `is_symmetric=False`, `bits=4`, `algo=GPTQ`.
Non-MatMul ops (LayerNorm, RMSNorm, residuals, RoPE, the audio-encoder
Conv2d front-end) stay at fp16.
- KV cache: dynamic past-length axis (`Dim("T_past")`) via dynamo's
`dynamic_shapes` API.
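The prefill/step split in the list above can be pictured as a greedy decoding loop: prefill runs once over the whole prompt, then step consumes one token at a time while the KV cache grows along the past-length axis. The sketch below uses toy stand-ins for both graphs; the callables and their signatures are hypothetical and only illustrate the control flow, not the exported models' real inputs and outputs.

```python
def greedy_decode(prefill, step, prompt_embeds, eos_id, max_new_tokens=64):
    """Run prefill once, then feed one token at a time with the growing cache.

    prefill(prompt_embeds)            -> (next_token_id, kv_cache)
    step(token_id, kv_cache, position) -> (next_token_id, kv_cache)
    Both callables are hypothetical stand-ins for the two exported graphs.
    """
    token, kv = prefill(prompt_embeds)
    position = len(prompt_embeds)     # first generated token sits at index L
    output = []
    while token != eos_id and len(output) < max_new_tokens:
        output.append(token)
        token, kv = step(token, kv, position)
        position += 1
    return output


# Toy stand-ins: the "cache" is just the list of positions seen so far.
def toy_prefill(embeds):
    return 100, list(range(len(embeds)))


def toy_step(token, kv, position):
    kv = kv + [position]              # cache grows by one entry per step
    return (token + 1 if token < 102 else 0), kv


print(greedy_decode(toy_prefill, toy_step, prompt_embeds=[0.0] * 5, eos_id=0))
# prints: [100, 101, 102]
```

Exporting the two phases separately lets the step graph take a single new token plus the cache, instead of re-running attention over the whole sequence on every iteration.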
### GPTQ calibration
The default ORT GPTQ implementation in
`onnxruntime.quantization.neural_compressor.weight_only` is CPU-only (numpy
matmul + `np.linalg.cholesky` for the Hessian inverse) and takes ~90 min
for a 1.7B model on a workstation CPU. For this release we ported the
GPTQ inner loop + Hessian accumulation to `torch.cuda` and added a
diagonal-jitter retry on the Cholesky factorisation (the fp32 path is stricter
than LAPACK on near-singular Hessians). On an RTX 5080 Laptop the
prefill GPTQ runs at ~99% GPU utilisation and finishes in **~35 min**; the step
GPTQ takes **~3 min** (fewer unique MatMul input names because of GQA
sharing).
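The diagonal-jitter retry mentioned above can be sketched in a few lines. This is a dependency-free illustration of the idea, not the CUDA port used for this release: the function names, the starting damping (`percdamp` times the mean diagonal) and the tenfold growth per retry are all assumptions made for the example.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor; raises ValueError if not positive definite."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low


def cholesky_with_jitter(hessian, percdamp=0.01, max_tries=5):
    """Retry the factorisation, growing the diagonal damping tenfold per failure."""
    n = len(hessian)
    jitter = percdamp * sum(hessian[i][i] for i in range(n)) / n
    for _ in range(max_tries):
        damped = [[hessian[i][j] + (jitter if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
        try:
            return cholesky(damped), jitter
        except ValueError:
            jitter *= 10.0
    raise ValueError("Hessian still not positive definite after damping")


# A rank-1 (singular) 2x2 "Hessian": plain Cholesky fails, damping rescues it.
factor, used = cholesky_with_jitter([[1.0, 1.0], [1.0, 1.0]])
print(f"jitter used: {used:.2f}, factor: {factor}")
```

Larger jitter makes the factorisation safer but biases the Hessian, so starting small and escalating only on failure keeps the perturbation as small as the data allows.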
- **Prefill calibration**: 168 samples (24 per noise/far_field/obstructed/
distortion/recording/echo/dropout split), English-only filter on the
`text` field, audio decoded via `soundfile` (`Audio(decode=False)`)
to avoid the `torchcodec` import on streaming `cast_column`.
- **Step calibration**: 63 samples from the same English-only set; each
sample's calibration feed is built by running the fp16 prefill ONNX,
capturing all 56 `present.{0..27}.{key,value}` tensors, embedding the
greedy first predicted token, and pairing it with `attention_mask` of
length `L + 1` and `position_ids = L`. This gives the step's GPTQ Hessian
exactly the autoregressive-decode activation distribution it sees at
inference.
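The step-calibration feed described in the list above can be sketched as follows. Only the `present.{i}.{key,value}` names, the mask length `L + 1` and `position_ids = L` come from this card; the step-side input names (`past.*`, `inputs_embeds`) and the helper names are hypothetical, and plain lists stand in for the real tensors so the example runs without dependencies.

```python
NUM_LAYERS = 28  # present.{0..27} in the prefill outputs


def present_names(num_layers=NUM_LAYERS):
    """Names of the KV tensors captured from the fp16 prefill run."""
    return [f"present.{i}.{kind}"
            for i in range(num_layers) for kind in ("key", "value")]


def build_step_feed(prefill_outputs, first_token_embed, prompt_len):
    """Assemble one calibration feed for the step graph from a prefill run."""
    feed = {
        "inputs_embeds": first_token_embed,          # embedding of the greedy first token
        "attention_mask": [1] * (prompt_len + 1),    # L prompt positions + 1 new token
        "position_ids": [prompt_len],                # the new token sits at index L
    }
    for name in present_names():
        # Hypothetical naming: each prefill output becomes the matching past input.
        feed[name.replace("present", "past", 1)] = prefill_outputs[name]
    return feed


# Dummy prefill outputs (empty placeholders instead of real KV tensors).
outputs = {name: [] for name in present_names()}
feed = build_step_feed(outputs, first_token_embed=[0.0], prompt_len=7)
print(len(present_names()), len(feed["attention_mask"]), feed["position_ids"])
# prints: 56 8 [7]
```

In a real pipeline each value would be an array with batch and head dimensions, and the feed would be handed to whatever calibration data reader the quantizer expects.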
## Credits
- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B params, Apache-2.0)
- ONNX export + GPTQ quantization: this repo
- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
- Live demo: [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench)