---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: onnxruntime
tags:
- onnx
- onnxruntime
- onnxruntime-web
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- int4
- int8
- matmulnbits
- gptq
- on-device
- browser
- web
- qwen3
- qwen3-asr
- mega-asr
- transformers.js
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
---

# Mega-ASR: INT4 ONNX (GPTQ-calibrated)

INT4 ONNX export of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR),
a 1.7B-parameter multilingual ASR foundation model built on Qwen3-ASR-1.7B with
2.6M training samples covering noise, far-field speech, obstruction, recording
artifacts, echo, dropout, and transmission dropout.

The model is split into three ONNX files (Whisper-style: audio encoder + LLM
decoder prefill + LLM decoder step) so it can be loaded **directly in the
browser** via [`onnxruntime-web`](https://onnxruntime.ai/docs/api/javascript/) or as
a CPU/GPU service via `onnxruntime`. INT4 weight quantization (MatMulNBits, 4-bit
block-32 asymmetric) compresses the model from ~7.5 GB fp16 down to **~2 GB**
total, small enough for a one-time browser cache.
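
To make "4-bit block-32 asymmetric" concrete, here is a minimal pure-Python sketch of block-wise asymmetric quantization with plain round-to-nearest. It is an illustration only: the function names are hypothetical, it does not reproduce ORT's MatMulNBits packing, and it does not include the GPTQ error compensation used for the decoder in this repo.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # weights per quantization block, as stated above
QMAX = 15        # largest unsigned 4-bit code (2**4 - 1)


def quantize_block(block: List[float]) -> Tuple[List[int], float, int]:
    """Round-to-nearest asymmetric 4-bit quantization of one block.

    Returns (codes, scale, zero_point) with weight ~= (code - zero_point) * scale.
    """
    # Extend the range to include 0.0 so that zero stays exactly representable.
    lo = min(min(block), 0.0)
    hi = max(max(block), 0.0)
    scale = (hi - lo) / QMAX
    if scale == 0.0:  # all-zero block: any non-zero scale works
        scale = 1.0
    zero_point = max(0, min(QMAX, round(-lo / scale)))
    codes = [max(0, min(QMAX, round(w / scale) + zero_point)) for w in block]
    return codes, scale, zero_point


def dequantize_block(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map 4-bit codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row: List[float]) -> List[Tuple[List[int], float, int]]:
    """Quantize one weight row block by block; each block gets its own scale."""
    return [quantize_block(row[i:i + BLOCK_SIZE]) for i in range(0, len(row), BLOCK_SIZE)]
```

Because every block of 32 weights carries its own scale and zero point, an outlier only degrades the resolution of its own block rather than the whole row.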

**Both decoder halves are GPTQ-calibrated** on 168 / 63 English Voices-in-the-Wild
samples (prefill / step respectively). The step model uses past-KV-cache-aware
calibration: prefill output is piped into step, so the calibration captures the
realistic activation distribution of autoregressive decode.

## What's in this repo

| File | Size | Role |
| --- | ---: | --- |
| `onnx/audio_encoder_int4.onnx` (+ `.data`) | **214 MB** | mel features → audio embeddings (24-layer Whisper-style encoder) |
| `onnx/decoder_prefill_int4.onnx` (+ `.data`) | **968 MB** | Qwen3 1.7B decoder, full-length prefill (no KV cache, **GPTQ-calibrated**) |
| `onnx/decoder_step_int4.onnx` (+ `.data`) | **968 MB** | Qwen3 1.7B decoder, single-token step (with KV cache, **GPTQ-calibrated**) |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench) |
| `tokenizer_config.json` / `vocab.json` / `merges.txt` | - | Qwen3 BPE tokenizer assets |
| `preprocessor_config.json` | - | Whisper-style mel feature extractor config |
| `inference.py` | - | Standalone Python ASR pipeline using these ONNX files |

## Compression vs original

| Component | Original (fp16 PT) | This (INT4 ONNX) | Savings |
| --- | ---: | ---: | ---: |
| Audio encoder | ~635 MB | **214 MB** | 3.0× |
| LLM decoder | ~3.4 GB × 2 (prefill + step) | **968 MB × 2** | 3.5× |
| **Total deploy** | **~7.5 GB** | **~2.0 GB** | **3.7× smaller** |

The decoder is ~95% MatMul ops, so MatMulNBits 4-bit quantization works well.
The audio encoder is mostly Conv2d and transformer Linear layers: MatMulNBits
quantizes the transformer Linear ops (most of the weight) but leaves the small
Conv2d front-end at fp16.
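
As a rough sanity check on the decoder ratio, the per-weight storage cost of block-wise 4-bit quantization can be worked out directly. The sketch below assumes one fp16 scale and one 4-bit zero point per block of 32 weights; that layout is an assumption for illustration, and ops left at fp16 are ignored.

```python
def effective_bits_per_weight(bits: int = 4, block_size: int = 32,
                              scale_bits: int = 16, zero_point_bits: int = 4) -> float:
    """Bits stored per weight once per-block metadata is amortised.

    scale_bits and zero_point_bits describe an assumed storage layout.
    """
    return bits + (scale_bits + zero_point_bits) / block_size


def compression_vs_fp16(bits_per_weight: float, fp_bits: int = 16) -> float:
    """Size ratio of an fp16 weight matrix to its quantized form."""
    return fp_bits / bits_per_weight


bpw = effective_bits_per_weight()
print(f"{bpw:.3f} bits/weight, {compression_vs_fp16(bpw):.2f}x smaller than fp16")
# 4.625 bits/weight, 3.46x smaller than fp16
```

Under these assumptions the ceiling is about 3.46×, which is in line with the ~3.5× reported for the decoder.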

## Quality

Benchmarked on the 8 [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
example clips (real-world noisy conditions, all English), word-level
agreement (1 − WER), prompt forced to `language English`:

| Variant | Encoder | Decoder | Avg agreement | 100% samples | Total size |
| --- | --- | --- | ---: | ---: | ---: |
| PT bf16 (original) | fp16 | fp16 | 95.1% | 6 / 8 | 7.5 GB |
| ONNX fp16 (ref) | fp32 | fp16 | **96.7%** | 7 / 8 | 8.2 GB |
| **ONNX recommended (GPTQ)** | **INT8** | **INT4 GPTQ** | **92.7%** | **6 / 8** | **2.3 GB** |
| ONNX RTN (previous ship) | INT8 | INT4 RTN | 91.9% | 6 / 8 | 2.3 GB |
| ONNX small | INT4 | INT4 RTN | 87.8% | 6 / 8 | 2.0 GB |

The recommended config (INT8 audio encoder + GPTQ-INT4 LLM decoder) is the
size/quality sweet spot for browser deployment. Forcing the language
(rather than auto-detecting) recovers most of the quantization drift.

GPTQ calibration on both prefill and step yields **+0.8 percentage points over
plain RTN** (92.7% vs 91.9%) at the same model size, most visibly on the `echo`
sample, where the RTN-quantized decoder previously hallucinated *"the size was
fine standing up at the terrible white wall"*. The GPTQ-quantized decoder
produces *"the size feels fine standing up against terrible white walls"*,
recovering the leading clause exactly.

**Note**: fp16 ONNX actually beats PT bf16 (96.7 vs 95.1) because forcing
language skips the model's audio-quality-router language detection,
which is where the PT model loses points on `echo` and `recording`
(truncated).
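
For reference, word-level agreement (1 − WER) can be computed with a word-level edit distance. The sketch below is one minimal way to do it; the normalisation (lowercasing, whitespace split) and the clipping at zero are assumptions, since the exact scoring script is not part of this card.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)


def word_agreement(reference: str, hypothesis: str) -> float:
    """1 - WER, clipped at 0 because WER can exceed 1."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))
```

With this definition, a "100% sample" is a clip whose transcript matches the reference word for word after normalisation.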

## Inference (Python)

```bash
pip install onnxruntime numpy soundfile transformers qwen-asr
git clone https://huggingface.co/Reza2kn/mega-asr-onnx
cd mega-asr-onnx
python inference.py --audio examples/noise.wav
```
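
`inference.py` is not reproduced here. The sketch below only illustrates the control flow that the three-file split implies: encode once, prefill once, then call the step model token by token. The callables, the end-of-sequence id, and the token budget are hypothetical placeholders; in practice each callable would wrap an `onnxruntime.InferenceSession`, and the prompt (for example the forced language) would be assembled inside `prefill`.

```python
from typing import Any, Callable, List, Tuple


def greedy_transcribe(
    mel: Any,
    encode: Callable[[Any], Any],
    prefill: Callable[[Any], Tuple[int, Any]],
    step: Callable[[int, Any], Tuple[int, Any]],
    eos_id: int,
    max_new_tokens: int = 256,
) -> List[int]:
    """Greedy decode: audio encoder -> decoder prefill -> decoder step loop."""
    audio_embeds = encode(mel)                # audio encoder, runs once
    token, kv_cache = prefill(audio_embeds)   # first token + initial KV cache
    tokens: List[int] = []
    while token != eos_id and len(tokens) < max_new_tokens:
        tokens.append(token)
        token, kv_cache = step(token, kv_cache)  # one token per call
    return tokens


# Stub models that replay a fixed token script, to show the call pattern.
script = [5, 6, 7, 0]  # 0 plays the role of the end-of-sequence id
out = greedy_transcribe(
    mel=None,
    encode=lambda m: m,
    prefill=lambda embeds: (script[0], 1),
    step=lambda tok, n: (script[n], n + 1),
    eos_id=0,
)
print(out)  # [5, 6, 7]
```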

## Inference (browser)

A live browser demo (loads these ONNX models directly via `onnxruntime-web`
and WebGPU) is at
[Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench).
The first visit downloads ~2 GB of model weights, cached by the browser for
subsequent runs.

## Performance

| Hardware | Cold (model load) | Warm (3-4 s audio) |
| --- | ---: | ---: |
| RTX 5080 (CUDAExecutionProvider) | ~5 s | ~1.5 s |
| M-series Mac (CPUExecutionProvider) | ~12 s | ~6 s |
| Browser, WebGPU (RTX 5080) | ~10 s + ~1 GB download (cached) | ~3 s |
| Browser, WASM CPU | ~10 s + download | ~30 s |

## Conversion details

- Exported via `torch.onnx.export(..., dynamo=True)` from PyTorch 2.12.
- Audio encoder rewrites: replaced packed-sequence flash-attention with
  standard batched attention + chunked Conv2d (output cosine similarity ≈ 0.998 vs original).
- Decoder uses a single `DecoderForExport` wrapper that accepts a flat tuple
  of KV cache tensors; prefill and step are two specialisations of the same
  Python wrapper exported separately.
- Quantization: `onnxruntime.quantization.matmul_nbits_quantizer.MatMulNBitsQuantizer`
  with `block_size=32`, `is_symmetric=False`, `bits=4`, `algo=GPTQ`.
  Non-MatMul ops (LayerNorm, RMSNorm, residuals, RoPE, the audio-encoder
  Conv2d front-end) stay at fp16.
- KV cache: dynamic past-length axis (`Dim("T_past")`) via dynamo's
  `dynamic_shapes` API.

### GPTQ calibration

The default ORT GPTQ implementation in
`onnxruntime.quantization.neural_compressor.weight_only` is CPU-only (numpy
matmul + `np.linalg.cholesky` for the Hessian inverse) and takes ~90 min
for a 1.7B model on a workstation CPU. For this release we ported the
GPTQ inner loop + Hessian accumulation to `torch.cuda` and added a
diagonal-jitter retry on the Cholesky factorisation (fp32 is stricter
than LAPACK on barely-singular Hessians). On an RTX 5080 Laptop the
prefill GPTQ runs at ~99% GPU util and finishes in **~35 min**; the step
GPTQ takes **~3 min** (fewer unique MatMul input names because of GQA
sharing).
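
The diagonal-jitter retry can be illustrated on a small matrix without a GPU. The sketch below is dependency-free Python, not the `torch.cuda` port itself; the jitter schedule (a fraction of the mean diagonal, grown tenfold per retry) and the function names are assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def cholesky(a: Matrix) -> Matrix:
    """Lower-triangular L with L @ L.T == a; raises if a is not positive definite."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low


def cholesky_with_jitter(h: Matrix, base_damp: float = 0.01,
                         max_tries: int = 5) -> Tuple[Matrix, float]:
    """Try a plain factorisation first, then retry with growing diagonal jitter.

    Returns (L, jitter) where jitter is the value that was added to the diagonal.
    """
    n = len(h)
    mean_diag = sum(h[i][i] for i in range(n)) / n or 1.0
    jitter = 0.0
    for attempt in range(max_tries):
        damped = [[h[i][j] + (jitter if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
        try:
            return cholesky(damped), jitter
        except ValueError:
            # Assumed schedule: 1x, 10x, 100x ... of base_damp * mean(diag(H)).
            jitter = base_damp * mean_diag * (10 ** attempt)
    raise ValueError("Hessian still not positive definite after jitter retries")
```

A rank-deficient Hessian such as `[[1.0, 1.0], [1.0, 1.0]]` fails the first attempt and succeeds once a small multiple of the identity is added, while a well-conditioned one is factorised unchanged.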

- **Prefill calibration**: 168 samples (24 per noise/far_field/obstructed/
  distortion/recording/echo/dropout split), English-only filter on the
  `text` field, audio decoded via `soundfile` (`Audio(decode=False)`)
  to avoid the `torchcodec` import on streaming `cast_column`.
- **Step calibration**: 63 samples from the same English-only set; each
  sample's calibration feed is built by running the fp16 prefill ONNX,
  capturing all 56 `present.{0..27}.{key,value}` tensors, embedding the
  greedy first predicted token, and pairing it with `attention_mask` of
  length `L + 1` and `position_ids = L`. This gives the step's GPTQ Hessian
  exactly the autoregressive-decode activation distribution it sees at
  inference.
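
The step calibration feed described above can be sketched as a small helper. The `present.{i}.{key,value}` names come from the description; the step-side input names (`inputs_embeds`, `past.{i}.{key,value}`) are hypothetical, and plain lists stand in for the arrays a real feed would carry.

```python
from typing import Any, Dict

NUM_LAYERS = 28  # present.0 .. present.27, i.e. 56 key/value tensors


def build_step_feed(prefill_outputs: Dict[str, Any],
                    first_token_embed: Any,
                    prompt_len: int) -> Dict[str, Any]:
    """Assemble one step-model calibration feed from a prefill run.

    prompt_len is L, the number of positions the prefill processed.
    """
    feed: Dict[str, Any] = {
        "inputs_embeds": first_token_embed,          # hypothetical input name
        "attention_mask": [[1] * (prompt_len + 1)],  # L cached positions + 1 new token
        "position_ids": [[prompt_len]],              # zero-based index of the new token
    }
    for layer in range(NUM_LAYERS):
        for kind in ("key", "value"):
            # Hypothetical naming: prefill "present.*" outputs become step "past.*" inputs.
            feed[f"past.{layer}.{kind}"] = prefill_outputs[f"present.{layer}.{kind}"]
    return feed
```

Feeding the real prefill cache, rather than random or zero tensors, is what lets the Hessian statistics reflect the activations the step model sees during decoding.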

## Credits

- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B params, Apache-2.0)
- ONNX export + GPTQ quantization: this repo
- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
- Live demo: [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench)