---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: onnxruntime
tags:
- onnx
- onnxruntime
- onnxruntime-web
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- int4
- int8
- matmulnbits
- gptq
- on-device
- browser
- web
- qwen3
- qwen3-asr
- mega-asr
- transformers.js
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
---
# Mega-ASR: INT4 ONNX (GPTQ-calibrated)
INT4 ONNX export of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR),
a 1.7B-parameter multilingual ASR foundation model built on Qwen3-ASR-1.7B with
2.6M training samples covering noise, far-field speech, obstruction, recording
artifacts, echo, dropout, and transmission dropout.
The model is split into three ONNX files (Whisper-style: audio encoder + LLM
decoder prefill + LLM decoder step) so it can be loaded **directly in the
browser** via [`onnxruntime-web`](https://onnxruntime.ai/docs/api/javascript/) or as
a CPU/GPU service via `onnxruntime`. INT4 weight quantization (MatMulNBits, 4-bit
block-32 asymmetric) compresses the model from ~7.5 GB fp16 down to **~2 GB**
total, small enough for a one-time browser cache.

**Both decoder halves are GPTQ-calibrated** on 168 / 63 English Voices-in-the-Wild
samples (prefill / step respectively). The step model uses past-KV-cache-aware
calibration: prefill output is piped into step, so the calibration captures the
realistic activation distribution of autoregressive decode.
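
The three-file split maps onto a simple greedy decode loop: encode once, prefill once, then call the step model until end-of-sequence. The sketch below uses plain Python callables in place of the three ONNX sessions. The function names, argument shapes, and toy stand-ins are illustrative assumptions, not the interface of `inference.py`.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_transcribe(
    encode: Callable[[Sequence[float]], list],
    prefill: Callable[[list], Tuple[int, object]],
    step: Callable[[int, object], Tuple[int, object]],
    mel: Sequence[float],
    eos_id: int,
    max_new_tokens: int = 256,
) -> List[int]:
    """Encode once, prefill once, then step until EOS or the token budget."""
    audio_embeddings = encode(mel)                # audio encoder: mel -> embeddings
    token, kv_cache = prefill(audio_embeddings)   # full-length pass, builds the KV cache
    tokens: List[int] = []
    while token != eos_id and len(tokens) < max_new_tokens:
        tokens.append(token)
        token, kv_cache = step(token, kv_cache)   # one token in, one token out
    return tokens


# Toy stand-ins (hypothetical) so the loop can run without any model files.
def toy_encode(mel):
    return list(mel)


def toy_prefill(embeddings):
    return 1, {"length": len(embeddings)}


def toy_step(token, cache):
    cache = {"length": cache["length"] + 1}
    next_token = token + 1 if token < 3 else 0    # emits 2, 3, then EOS (0)
    return next_token, cache


print(greedy_transcribe(toy_encode, toy_prefill, toy_step, [0.1, 0.2], eos_id=0))
# prints [1, 2, 3]
```

In a real pipeline each callable would wrap one `onnxruntime.InferenceSession`, and the prefill input would also carry the text prompt tokens.
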
## What's in this repo
| File | Size | Role |
| --- | ---: | --- |
| `onnx/audio_encoder_int4.onnx` (+ `.data`) | **214 MB** | mel features → audio embeddings (24-layer Whisper-style encoder) |
| `onnx/decoder_prefill_int4.onnx` (+ `.data`) | **968 MB** | Qwen3 1.7B decoder, full-length prefill (no KV cache, **GPTQ-calibrated**) |
| `onnx/decoder_step_int4.onnx` (+ `.data`) | **968 MB** | Qwen3 1.7B decoder, single-token step (with KV cache, **GPTQ-calibrated**) |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench) |
| `tokenizer_config.json` / `vocab.json` / `merges.txt` | - | Qwen3 BPE tokenizer assets |
| `preprocessor_config.json` | - | Whisper-style mel feature extractor config |
| `inference.py` | - | Standalone Python ASR pipeline using these ONNX files |
## Compression vs original
| Component | Original (fp16 PT) | This (INT4 ONNX) | Savings |
| --- | ---: | ---: | ---: |
| Audio encoder | ~635 MB | **214 MB** | 3.0× |
| LLM decoder | ~3.4 GB × 2 (prefill + step) | **968 MB × 2** | 3.5× |
| **Total deploy** | **~7.5 GB** | **~2.0 GB** | **3.7× smaller** |
The decoder is ~95% MatMul ops so MatMulNBits 4-bit quantization works well.
The audio encoder is a small Conv2d front-end followed by transformer layers
built from Linear ops: MatMulNBits quantizes the transformer Linear ops (most
of the weight) but leaves the Conv2d front-end at fp16.
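
As a back-of-the-envelope check on the roughly 3.5× decoder figure: with 4-bit block-32 asymmetric quantization, every block of 32 weights carries one scale and one zero point on top of the packed 4-bit codes. The sketch below assumes an fp16 scale and a 4-bit zero point per block, which is an assumption about the storage layout rather than something measured from these files.

```python
def bits_per_weight(bits=4, block_size=32, scale_bits=16, zero_point_bits=4):
    """Effective storage cost of one weight, including per-block metadata."""
    return bits + (scale_bits + zero_point_bits) / block_size


def ratio_vs_fp16(**kwargs):
    """Compression ratio of the quantized MatMul weights relative to fp16."""
    return 16 / bits_per_weight(**kwargs)


print(bits_per_weight())            # 4.625
print(round(ratio_vs_fp16(), 2))    # 3.46
```

That is in the same range as the per-component savings in the table; tensors that stay at fp16 and container overhead shift the real file sizes somewhat.
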
## Quality
Benchmarked on the 8 [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
example clips (real-world noisy conditions, all English). The metric is
word-level agreement (1 − WER), with the prompt forced to `language English`:
| Variant | Encoder | Decoder | Avg agreement | 100% samples | Total size |
| --- | --- | --- | ---: | ---: | ---: |
| PT bf16 (original) | fp16 | fp16 | 95.1% | 6 / 8 | 7.5 GB |
| ONNX fp16 (ref) | fp32 | fp16 | **96.7%** | 7 / 8 | 8.2 GB |
| **ONNX recommended (GPTQ)** | **INT8** | **INT4 GPTQ** | **92.7%** | **6 / 8** | **2.3 GB** |
| ONNX RTN (previous ship) | INT8 | INT4 RTN | 91.9% | 6 / 8 | 2.3 GB |
| ONNX small | INT4 | INT4 RTN | 87.8% | 6 / 8 | 2.0 GB |
The recommended config (INT8 audio encoder + GPTQ-INT4 LLM decoder) is the
size/quality sweet spot for browser deployment. Forcing the language
(rather than auto-detecting) recovers most of the quantization drift.

GPTQ calibration on both prefill and step yields **+0.8 percentage points
over plain RTN** at the same model size. The gain is most visible on the
`echo` sample: the RTN-quantized decoder previously hallucinated *"the size
was fine standing up at the terrible white wall"*, while the GPTQ-quantized
decoder produces *"the size feels fine standing up against terrible white
walls"*, recovering the leading clause exactly.

**Note**: fp16 ONNX actually beats PT bf16 (96.7 vs 95.1) because forcing
language skips the model's audio-quality-router language detection,
which is where the PT model loses points on `echo` and `recording`
(truncated).
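
For readers who want to reproduce the agreement metric, here is a minimal sketch of word-level agreement as 1 − WER, using a word-level Levenshtein distance. The text normalization (lowercasing and whitespace splitting) and the example sentences are assumptions; the benchmark's own scoring may normalize differently.

```python
def word_errors(ref_words, hyp_words):
    """Minimum substitutions + insertions + deletions to turn ref into hyp."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # match or substitution
            ))
        prev = cur
    return prev[-1]


def word_agreement(reference, hypothesis):
    """Return 1 - WER, where WER = word errors / reference word count."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return 1.0 - word_errors(ref_words, hyp_words) / len(ref_words)


# Hypothetical transcripts: one substitution and one deletion over 5 words.
print(round(word_agreement("turn the volume down please", "turn the volume town"), 2))
# prints 0.6
```
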
## Inference (Python)
```bash
pip install onnxruntime numpy soundfile transformers qwen-asr
git clone https://huggingface.co/Reza2kn/mega-asr-onnx
cd mega-asr-onnx
python inference.py --audio examples/noise.wav
```
## Inference (browser)
A live browser demo (loads these ONNX models directly via `onnxruntime-web`
and WebGPU) is at
[Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench).
The first visit downloads ~2 GB of model weights, cached by the browser for
subsequent runs.
## Performance
| Hardware | Cold (model load) | Warm (3-4 s audio) |
| --- | ---: | ---: |
| RTX 5080 (CUDAExecutionProvider) | ~5 s | ~1.5 s |
| M-series Mac (CPUExecutionProvider) | ~12 s | ~6 s |
| Browser, WebGPU (RTX 5080) | ~10 s + ~1 GB download (cached) | ~3 s |
| Browser, WASM CPU | ~10 s + download | ~30 s |
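
The native rows above differ only in the execution provider. A minimal sketch of choosing one at load time follows; `pick_providers` and `open_session` are hypothetical helper names, and only `onnxruntime.get_available_providers()` and `onnxruntime.InferenceSession(path, providers=...)` are existing ONNX Runtime calls.

```python
from typing import Iterable, List, Sequence

# Preference order: GPU first, CPU as the fallback.
PREFERRED_PROVIDERS = ("CUDAExecutionProvider", "CPUExecutionProvider")


def pick_providers(
    available: Iterable[str],
    preferred: Sequence[str] = PREFERRED_PROVIDERS,
) -> List[str]:
    """Keep the preferred providers that this onnxruntime build offers."""
    available = set(available)
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError("no supported execution provider is available")
    return chosen


def open_session(model_path: str):
    """Open one of the ONNX files with the best available provider."""
    import onnxruntime as ort  # imported lazily so the helper above has no dependencies

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)


print(pick_providers(["CPUExecutionProvider"]))
# prints ['CPUExecutionProvider']
```
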
## Conversion details
- Exported via `torch.onnx.export(..., dynamo=True)` from PyTorch 2.12.
- Audio encoder rewrites: replaced packed-sequence flash-attention with
standard batched attention + chunked Conv2d (parity: cosine similarity ≈ 0.998 vs original).
- Decoder uses a single `DecoderForExport` wrapper that accepts a flat tuple
of KV cache tensors; prefill and step are two specialisations of the same
Python wrapper exported separately.
- Quantization: `onnxruntime.quantization.matmul_nbits_quantizer.MatMulNBitsQuantizer`
with `block_size=32`, `is_symmetric=False`, `bits=4`, `algo=GPTQ`.
Non-MatMul ops (LayerNorm, RMSNorm, residuals, RoPE, the audio-encoder
Conv2d front-end) stay at fp16.
- KV cache: dynamic past-length axis (`Dim("T_past")`) via dynamo's
`dynamic_shapes` API.
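
To make the `block_size=32`, `is_symmetric=False`, `bits=4` settings concrete, here is a pure-Python sketch of block-wise asymmetric 4-bit quantization. It uses plain round-to-nearest (RTN), not GPTQ, and it is a generic illustration rather than ONNX Runtime's kernel; the function names are hypothetical.

```python
def quantize_block(block, bits=4):
    """Map one block of floats to integer codes in [0, 2**bits - 1]."""
    qmax = (1 << bits) - 1                 # 15 for 4-bit
    lo = min(min(block), 0.0)              # keep 0.0 inside the representable range
    hi = max(max(block), 0.0)
    scale = (hi - lo) / qmax
    if scale == 0.0:                       # all-zero block
        scale = 1.0
    zero_point = round(-lo / scale)        # integer code that represents 0.0
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in block]
    return codes, scale, zero_point


def dequantize_block(codes, scale, zero_point):
    """Reconstruct approximate floats from codes, scale, and zero point."""
    return [(code - zero_point) * scale for code in codes]


def quantize_row(row, block_size=32, bits=4):
    """Quantize a weight row block by block; each block has its own scale."""
    return [quantize_block(row[i:i + block_size], bits)
            for i in range(0, len(row), block_size)]


row = [(i - 20) / 10 for i in range(64)]   # toy weights, two blocks of 32
blocks = quantize_row(row)
restored = [w for block in blocks for w in dequantize_block(*block)]
worst = max(abs(a - b) for a, b in zip(row, restored))
print(len(blocks), worst <= max(scale for _, scale, _ in blocks) / 2 + 1e-9)
```

The per-block scale is why a small block size tracks outliers better: one large weight only widens the range of its own 32 neighbours. GPTQ keeps this storage format but chooses the codes using calibration activations instead of independent rounding.
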
### GPTQ calibration
The default ORT GPTQ implementation in
`onnxruntime.quantization.neural_compressor.weight_only` is CPU-only (numpy
matmul + `np.linalg.cholesky` for the Hessian inverse) and takes ~90 min
for a 1.7B model on a workstation CPU. For this release we ported the
GPTQ inner loop + Hessian accumulation to `torch.cuda` and added a
diagonal-jitter retry on the Cholesky factorisation (fp32 is stricter
than LAPACK on barely-singular Hessians). On an RTX 5080 Laptop the
prefill GPTQ runs at ~99% GPU util and finishes in **~35 min**; the step
GPTQ takes **~3 min** (fewer unique MatMul input names because of GQA
sharing).
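
The diagonal-jitter retry can be sketched in a few lines. The version below is pure Python for readability, whereas the release runs on `torch.cuda`; the starting jitter, growth factor, and retry count are hypothetical values, not the ones used for this model.

```python
import math


def cholesky(matrix):
    """Lower-triangular Cholesky factor; raises if a pivot is not positive."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def cholesky_with_jitter(hessian, start=1e-6, growth=10.0, max_tries=8):
    """Retry the factorisation, adding a growing multiple of I on failure."""
    n = len(hessian)
    mean_diag = sum(hessian[i][i] for i in range(n)) / n or 1.0
    jitter = 0.0
    for _ in range(max_tries):
        damped = [[hessian[i][j] + (jitter if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
        try:
            return cholesky(damped), jitter
        except ValueError:
            # First failure starts at start * mean diagonal, then grows geometrically.
            jitter = start * mean_diag if jitter == 0.0 else jitter * growth
    raise ValueError("Cholesky failed even after adding jitter")


factor, used = cholesky_with_jitter([[1.0, 1.0], [1.0, 1.0]])  # exactly singular
print(used)
# prints 1e-06
```

Scaling the jitter by the mean diagonal keeps the perturbation proportional to the Hessian's magnitude, so well-conditioned layers are factorised unchanged.
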
- **Prefill calibration**: 168 samples (24 per noise/far_field/obstructed/
distortion/recording/echo/dropout split), English-only filter on the
`text` field, audio decoded via `soundfile` (`Audio(decode=False)`)
to avoid the `torchcodec` import on streaming `cast_column`.
- **Step calibration**: 63 samples from the same English-only set; each
sample's calibration feed is built by running the fp16 prefill ONNX,
capturing all 56 `present.{0..27}.{key,value}` tensors, embedding the
greedy first predicted token, and pairing it with `attention_mask` of
length `L + 1` and `position_ids = L`. This gives the step's GPTQ Hessian
exactly the autoregressive-decode activation distribution it sees at
inference.
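
The step calibration feed described above can be sketched as a small helper that turns one prefill result into one step input. The output names `present.{i}.{key,value}` come from the text; the input names `inputs_embeds` and `past.{i}.{key,value}` are hypothetical and should be checked against `session.get_inputs()` on the real step model. Plain lists stand in for tensors.

```python
def build_step_feed(prefill_outputs, prompt_len, first_token_embedding, num_layers=28):
    """Build one step-model calibration feed from one prefill run.

    prefill_outputs: dict holding `present.{i}.key` / `present.{i}.value`
    prompt_len: L, the number of positions consumed by prefill
    first_token_embedding: embedding of the greedy first predicted token
    """
    feed = {
        "inputs_embeds": first_token_embedding,        # hypothetical input name
        "attention_mask": [[1] * (prompt_len + 1)],    # length L + 1
        "position_ids": [[prompt_len]],                # new token sits at position L
    }
    for layer in range(num_layers):
        for kind in ("key", "value"):
            # The prefill's present cache becomes the step's past cache.
            feed[f"past.{layer}.{kind}"] = prefill_outputs[f"present.{layer}.{kind}"]
    return feed


# Toy prefill output with placeholder tensors, prompt length L = 5.
toy_outputs = {f"present.{i}.{k}": [] for i in range(28) for k in ("key", "value")}
feed = build_step_feed(toy_outputs, prompt_len=5, first_token_embedding=[[0.0]])
print(len(feed), len(feed["attention_mask"][0]), feed["position_ids"])
# prints 59 6 [[5]]
```
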
## Credits
- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B params, Apache-2.0)
- ONNX export + GPTQ quantization: this repo
- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
- Live demo: [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench)