	---
	license: apache-2.0
	language:
	- en
	- zh
	- ja
	- ko
	- multilingual
	library_name: onnxruntime
	tags:
	- onnx
	- onnxruntime
	- onnxruntime-web
	- automatic-speech-recognition
	- asr
	- speech-recognition
	- robust-asr
	- quantized
	- int4
	- int8
	- matmulnbits
	- gptq
	- on-device
	- browser
	- web
	- qwen3
	- qwen3-asr
	- mega-asr
	- transformers.js
	pipeline_tag: automatic-speech-recognition
	base_model: zhifeixie/Mega-ASR
	base_model_relation: quantized
	---

	# Mega-ASR — INT4 ONNX (GPTQ-calibrated)

	INT4 ONNX export of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR),
	a 1.7B-parameter multilingual ASR foundation model built on Qwen3-ASR-1.7B with
	2.6M training samples covering noise, far-field speech, obstruction, distortion,
	recording artifacts, echo, and transmission dropout.

	The model is split into three ONNX files (Whisper-style: audio encoder + LLM
	decoder prefill + LLM decoder step) so it can be loaded **directly in the
	browser** via [`onnxruntime-web`](https://onnxruntime.ai/docs/api/javascript/) or as
	a CPU/GPU service via `onnxruntime`. INT4 weight quantization (MatMulNBits, 4-bit
	block-32 asymmetric) compresses the model from ~7.5 GB fp16 down to ~2 GB
	total — small enough for a one-time browser cache.
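
As a rough illustration of what "4-bit block-32 asymmetric" means, the sketch below quantizes a row of weights in blocks of 32 with plain round-to-nearest. It is not the MatMulNBits kernel and it omits GPTQ's error compensation. The helper names, the choice to keep 0.0 inside each block's range, and the storage assumption (one fp16 scale plus one 4-bit zero point per block) are illustrative assumptions, not details taken from this repo.

```python
BLOCK_SIZE = 32   # matches the block-32 setting described above
BITS = 4

def quantize_block(block, bits=BITS):
    """Asymmetric round-to-nearest quantization of one block of floats."""
    qmax = (1 << bits) - 1                      # 15 for 4-bit codes
    lo = min(min(block), 0.0)                   # keep 0.0 inside the range (assumed, common choice)
    hi = max(max(block), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)             # integer code that represents 0.0
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in block]
    return codes, scale, zero_point

def dequantize_block(codes, scale, zero_point):
    return [(q - zero_point) * scale for q in codes]

def quantize_row(row, block_size=BLOCK_SIZE):
    """Split a weight row into blocks and quantize each one independently."""
    return [quantize_block(row[i:i + block_size]) for i in range(0, len(row), block_size)]

def bits_per_weight(block_size=BLOCK_SIZE, bits=BITS, scale_bits=16):
    # Assumed storage: one fp16 scale and one 4-bit zero point per block.
    return bits + (scale_bits + bits) / block_size

row = [((i * 37) % 64 - 32) / 100 for i in range(64)]   # toy weights, two blocks
blocks = quantize_row(row)
restored = [w for b in blocks for w in dequantize_block(*b)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"max abs error: {max_err:.4f}, bits per weight: {bits_per_weight()}")
```

Under those storage assumptions each weight costs about 4.6 bits instead of 16, which is roughly in line with the ~3.5× decoder reduction reported further down.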

	Both decoder halves are GPTQ-calibrated on English Voices-in-the-Wild samples:
	168 for prefill and 63 for step. The step model uses past-KV-cache-aware
	calibration: prefill output is piped into step, so the calibration captures the
	realistic activation distribution of autoregressive decode.

	## What's in this repo

	| File | Size | Role |
	| --- | ---: | --- |
	| `onnx/audio_encoder_int4.onnx` (+ `.data`) | 214 MB | mel features → audio embeddings (24-layer Whisper-style encoder) |
	| `onnx/decoder_prefill_int4.onnx` (+ `.data`) | 968 MB | Qwen3 1.7B decoder, full-length prefill (no KV cache, GPTQ-calibrated) |
	| `onnx/decoder_step_int4.onnx` (+ `.data`) | 968 MB | Qwen3 1.7B decoder, single-token step (with KV cache, GPTQ-calibrated) |
	| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench) |
	| `tokenizer_config.json` / `vocab.json` / `merges.txt` | - | Qwen3 BPE tokenizer assets |
	| `preprocessor_config.json` | - | Whisper-style mel feature extractor config |
	| `inference.py` | - | Standalone Python ASR pipeline using these ONNX files |

	## Compression vs original

	| Component | Original (fp16 PT) | This (INT4 ONNX) | Savings |
	| --- | ---: | ---: | ---: |
	| Audio encoder | ~635 MB | 214 MB | 3.0× |
	| LLM decoder | ~3.4 GB × 2 (prefill + step) | 968 MB × 2 | 3.5× |
	| Total deploy | ~7.5 GB | ~2.0 GB | 3.7× |

	The decoder is ~95% MatMul ops, so MatMulNBits 4-bit quantization works well
	on it. The audio encoder combines a small Conv2d front-end with Linear layers
	inside its transformer blocks; MatMulNBits quantizes the Linear ops (most of
	the weight) and leaves the Conv2d front-end at fp16.
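
The ratios in the compression table can be re-derived from the listed sizes. The snippet below is only a sanity check on that arithmetic. It takes 1 GB as 1000 MB, which is an assumption, and the variable names are illustrative.

```python
# Sizes in MB, copied from the tables above; "~" figures are approximate and
# 1 GB is taken as 1000 MB (an assumption, the README does not say which unit).
ORIGINAL_MB = {"audio_encoder": 635, "decoder_prefill": 3400, "decoder_step": 3400}
INT4_MB = {"audio_encoder": 214, "decoder_prefill": 968, "decoder_step": 968}

def compression_ratio(before_mb, after_mb):
    return before_mb / after_mb

per_file = {name: compression_ratio(ORIGINAL_MB[name], INT4_MB[name]) for name in ORIGINAL_MB}
total_before = sum(ORIGINAL_MB.values())
total_after = sum(INT4_MB.values())
total_ratio = compression_ratio(total_before, total_after)

for name, r in per_file.items():
    print(f"{name}: {r:.1f}x")
print(f"total: {total_before} MB -> {total_after} MB ({total_ratio:.1f}x)")
```

Summing the unrounded file sizes gives about 7.4 GB against 2.15 GB, or roughly 3.5×. The 3.7× in the table follows from dividing the rounded ~7.5 GB and ~2.0 GB totals.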

	## Quality

	Benchmarked on the 8 [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
	example clips (real-world noisy conditions, all English), word-level
	agreement (1 − WER), prompt forced to `language English`:

	| Variant | Encoder | Decoder | Avg agreement | 100% samples | Total size |
	| --- | --- | --- | ---: | ---: | ---: |
	| PT bf16 (original) | fp16 | fp16 | 95.1% | 6 / 8 | 7.5 GB |
	| ONNX fp16 (ref) | fp32 | fp16 | 96.7% | 7 / 8 | 8.2 GB |
	| ONNX recommended (GPTQ) | INT8 | INT4 GPTQ | 92.7% | 6 / 8 | 2.3 GB |
	| ONNX RTN (previous ship) | INT8 | INT4 RTN | 91.9% | 6 / 8 | 2.3 GB |
	| ONNX small | INT4 | INT4 RTN | 87.8% | 6 / 8 | 2.0 GB |

	The recommended config (INT8 audio encoder + GPTQ-INT4 LLM decoder) is the
	size/quality sweet spot for browser deployment. Forcing the language
	(rather than auto-detecting) recovers most of the quantization drift.
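
The agreement figures in the table are 1 − WER at word level. A minimal sketch of that metric follows, assuming lowercasing and whitespace tokenization. The README does not say what text normalization the benchmark applies, and the example sentences are made up rather than taken from the benchmark clips.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))            # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # match or substitution
        prev = curr
    return prev[-1], len(ref)

def agreement(reference, hypothesis):
    """1 - WER; can go below zero when the hypothesis has many insertions."""
    errors, n_ref = word_errors(reference, hypothesis)
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return 1.0 - errors / n_ref

def average_agreement(pairs):
    return sum(agreement(ref, hyp) for ref, hyp in pairs) / len(pairs)

# Made-up (reference, hypothesis) pairs for illustration only.
pairs = [
    ("turn the kitchen lights off", "turn the kitchen light off"),
    ("please call me back tomorrow", "please call me back tomorrow"),
]
print(f"average agreement: {average_agreement(pairs):.1%}")
```

With only eight clips, one changed word moves the average noticeably, so small gaps between rows should be read with that in mind.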

	GPTQ calibration on both prefill and step yields +0.8 percentage points over
	plain RTN at the same model size (92.7% vs 91.9%). The gain is most visible on
	the `echo` sample: the RTN-quantized decoder previously hallucinated *"the size
	was fine standing up at the terrible white wall"*, while the GPTQ-quantized
	decoder produces *"the size feels fine standing up against terrible white
	walls"*, recovering the leading clause exactly.

	Note: fp16 ONNX scores higher than PT bf16 (96.7% vs 95.1%) because forcing the
	language skips the model's audio-quality-router language detection, which is
	where the PT model loses points on the `echo` and `recording` clips (its
	transcripts there are truncated).

	## Inference (Python)

	```bash
	pip install onnxruntime numpy soundfile transformers qwen-asr
	git clone https://huggingface.co/Reza2kn/mega-asr-onnx
	cd mega-asr-onnx
	python inference.py --audio examples/noise.wav
	```
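
`inference.py` is not reproduced in this README, but the three-file split (encoder, prefill, step) implies a control flow along the lines of the sketch below. The callables are toy stand-ins so the loop runs without the ONNX files. Their signatures, the EOS id, and the way the prefill length is returned are all assumptions made for illustration, not the actual interface of `inference.py` or of the exported graphs.

```python
def greedy_decode(encode, prefill, step, mel, prompt_ids, eos_id, max_new_tokens=64):
    """Greedy loop over the three-stage split: encoder -> prefill -> repeated step."""
    audio_embeds = encode(mel)                                  # audio encoder runs once
    logits, kv_cache, seq_len = prefill(audio_embeds, prompt_ids)  # whole prompt, no past KV
    generated = []
    position = seq_len                                          # first new token sits at index L
    for _ in range(max_new_tokens):
        token = max(range(len(logits)), key=logits.__getitem__)  # argmax over the vocabulary
        if token == eos_id:
            break
        generated.append(token)
        logits, kv_cache = step(token, position, kv_cache)      # one token, reuses the KV cache
        position += 1
    return generated

# Toy stand-ins (all hypothetical) so the loop can run without the ONNX models.
SCRIPT = [7, 8, 9, 0]      # tokens the fake decoder emits; 0 plays the role of EOS
POSITIONS = []             # records the position index passed to each step call

def one_hot(token, vocab=10):
    return [1.0 if i == token else 0.0 for i in range(vocab)]

def fake_encode(mel):
    return [sum(frame) for frame in mel]

def fake_prefill(audio_embeds, prompt_ids):
    seq_len = len(audio_embeds) + len(prompt_ids)
    return one_hot(SCRIPT[0]), {"steps": 0}, seq_len

def fake_step(token, position, kv_cache):
    POSITIONS.append(position)
    kv_cache = {"steps": kv_cache["steps"] + 1}
    return one_hot(SCRIPT[kv_cache["steps"]]), kv_cache

mel = [[0.1, 0.2], [0.3, 0.4]]
tokens = greedy_decode(fake_encode, fake_prefill, fake_step, mel, [1, 2, 3], eos_id=0)
print(tokens, POSITIONS)
```

In a real pipeline each stand-in would presumably wrap an `onnxruntime.InferenceSession` for the corresponding file, with the prefill outputs fed back as the step model's past-KV inputs.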

	## Inference (browser)

	A live browser demo (loads these ONNX models directly via `onnxruntime-web`
	and WebGPU) is at
	[Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench).
	The first visit downloads ~2 GB of model weights, cached by the browser for
	subsequent runs.

	## Performance

	| Hardware | Cold (model load) | Warm (3-4 s audio) |
	| --- | ---: | ---: |
	| RTX 5080 (CUDAExecutionProvider) | ~5 s | ~1.5 s |
	| M-series Mac (CPUExecutionProvider) | ~12 s | ~6 s |
	| Browser, WebGPU (RTX 5080) | ~10 s + ~1 GB download (cached) | ~3 s |
	| Browser, WASM CPU | ~10 s + download | ~30 s |

	## Conversion details

	- Exported via `torch.onnx.export(..., dynamo=True)` from PyTorch 2.12.
	- Audio encoder rewrites: replaced packed-sequence flash-attention with
	standard batched attention + chunked Conv2d (cosine similarity ≈ 0.998
	against the original encoder's output).
	- Decoder uses a single `DecoderForExport` wrapper that accepts a flat tuple
	of KV cache tensors; prefill and step are two specialisations of the same
	Python wrapper exported separately.
	- Quantization: `onnxruntime.quantization.matmul_nbits_quantizer.MatMulNBitsQuantizer`
	with `block_size=32`, `is_symmetric=False`, `bits=4`, `algo=GPTQ`.
	Non-MatMul ops (LayerNorm, RMSNorm, residuals, RoPE, the audio-encoder
	Conv2d front-end) stay at fp16.
	- KV cache: dynamic past-length axis (`Dim("T_past")`) via dynamo's
	`dynamic_shapes` API.

	### GPTQ calibration

	The default ORT GPTQ implementation in
	`onnxruntime.quantization.neural_compressor.weight_only` is CPU-only (numpy
	matmul + `np.linalg.cholesky` for the Hessian inverse) and takes ~90 min
	for a 1.7B model on a workstation CPU. For this release we ported the
	GPTQ inner loop and Hessian accumulation to `torch.cuda` and added a
	diagonal-jitter retry around the Cholesky factorisation, because the fp32
	CUDA factorisation rejects near-singular Hessians that the LAPACK path
	accepts. On an RTX 5080 Laptop the prefill GPTQ runs at ~99% GPU
	utilisation and finishes in ~35 min; the step GPTQ takes ~3 min (fewer
	unique MatMul input names because of GQA sharing).
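
To make the Hessian accumulation and the diagonal-jitter retry concrete, here is a toy sketch in plain Python rather than `torch.cuda`. It is not the code used for this release. The function names, the jitter schedule (scaled by the mean diagonal and grown tenfold per retry), and the tiny 3-dimensional example are all assumptions chosen for illustration.

```python
import math

def accumulate_hessian(samples, dim):
    """H = sum of x x^T over calibration rows (GPTQ's layer-input Hessian, up to a constant)."""
    H = [[0.0] * dim for _ in range(dim)]
    for x in samples:
        for i in range(dim):
            for j in range(dim):
                H[i][j] += x[i] * x[j]
    return H

def cholesky(A):
    """Lower-triangular L with L L^T = A; raises ValueError if A is not positive definite."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError(f"non-positive pivot at row {i}")
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def cholesky_with_jitter(A, rel_jitter=1e-6, max_tries=8):
    """Retry the factorisation, growing a diagonal damping term each time it fails."""
    n = len(A)
    mean_diag = sum(A[i][i] for i in range(n)) / n
    jitter = 0.0
    for attempt in range(max_tries):
        damped = [[A[i][j] + (jitter if i == j else 0.0) for j in range(n)] for i in range(n)]
        try:
            return cholesky(damped), jitter
        except ValueError:
            jitter = rel_jitter * mean_diag * (10 ** attempt)   # assumed schedule
    raise ValueError("Hessian still not positive definite after jitter retries")

# One calibration row in three dimensions gives a rank-1 (singular) Hessian.
H = accumulate_hessian([[1.0, 2.0, 3.0]], dim=3)
L, used_jitter = cholesky_with_jitter(H)
print(f"factorised with jitter = {used_jitter:.2e}")
```

Because `H` is a sum of outer products it is positive semi-definite, so any positive diagonal term makes it factorisable in exact arithmetic. The retry loop only has to find a value large enough to survive floating-point rounding.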

	- Prefill calibration: 168 samples (24 per noise/far_field/obstructed/
	distortion/recording/echo/dropout split), English-only filter on the
	`text` field. The audio column is cast to `Audio(decode=False)` and decoded
	with `soundfile`, which avoids the `torchcodec` import on a streaming
	`cast_column`.
	- Step calibration: 63 samples from the same English-only set; each
	sample's calibration feed is built by running the fp16 prefill ONNX,
	capturing all 56 `present.{0..27}.{key,value}` tensors, embedding the
	greedy first predicted token, and pairing it with `attention_mask` of
	length `L + 1` and `position_ids = L`. This gives the step's GPTQ Hessian
	exactly the autoregressive-decode activation distribution it sees at
	inference.
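
The step-calibration feed described in the last bullet can be sketched as a small helper. Only the `present.{layer}.{key,value}` output names, the 28-layer count, the `L + 1` mask length, and `position_ids = L` come from the text above. The `past.*` and `inputs_embeds` input names are hypothetical, and nested lists stand in for the arrays a real feed would carry.

```python
NUM_LAYERS = 28   # from the `present.{0..27}` names above

def build_step_feed(prefill_outputs, first_token_embed, prefill_len):
    """Assemble one step-model calibration feed from a single prefill run."""
    feed = {}
    for layer in range(NUM_LAYERS):
        for kind in ("key", "value"):
            # Hypothetical naming: each `present.*` output feeds a `past.*` input.
            feed[f"past.{layer}.{kind}"] = prefill_outputs[f"present.{layer}.{kind}"]
    feed["inputs_embeds"] = first_token_embed              # hypothetical input name
    feed["attention_mask"] = [[1] * (prefill_len + 1)]     # L cached positions + the new token
    feed["position_ids"] = [[prefill_len]]                 # zero-based index of the new token
    return feed

# Placeholder prefill outputs; a real run would hold KV tensors here.
L_PREFILL = 5
fake_outputs = {
    f"present.{layer}.{kind}": f"kv[{layer}][{kind}]"
    for layer in range(NUM_LAYERS)
    for kind in ("key", "value")
}
feed = build_step_feed(fake_outputs, first_token_embed=[[0.0] * 4], prefill_len=L_PREFILL)
print(len(feed), len(feed["attention_mask"][0]), feed["position_ids"])
```

One such feed per calibration sample would then go to the GPTQ calibrator for the step graph, so its Hessians are accumulated on activations that follow a real prefill.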

	## Credits

	- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B params, Apache-2.0)
	- ONNX export + GPTQ quantization: this repo
	- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
	- Live demo: [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench)