license: apache-2.0
language:
  - en
  - zh
  - ja
  - ko
  - multilingual
library_name: onnxruntime
tags:
  - onnx
  - onnxruntime
  - onnxruntime-web
  - automatic-speech-recognition
  - asr
  - speech-recognition
  - robust-asr
  - quantized
  - int4
  - int8
  - matmulnbits
  - gptq
  - on-device
  - browser
  - web
  - qwen3
  - qwen3-asr
  - mega-asr
  - transformers.js
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized

Mega-ASR — INT4 ONNX (GPTQ-calibrated)

INT4 ONNX export of zhifeixie/Mega-ASR, a 1.7B-parameter multilingual ASR foundation model built on Qwen3-ASR-1.7B with 2.6M training samples covering noise, far-field speech, obstruction, recording artifacts, echo, dropout, and transmission dropout.

The model is split into three ONNX files (Whisper-style: audio encoder + LLM decoder prefill + LLM decoder step) so it can be loaded directly in the browser via onnxruntime-web or as a CPU/GPU service via onnxruntime. INT4 weight quantization (MatMulNBits, 4-bit block-32 asymmetric) compresses the model from 7.5 GB fp16 down to **2 GB** total — small enough for a one-time browser cache.
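As a rough illustration of what "4-bit block-32 asymmetric" means, the sketch below quantizes a weight row in blocks of 32 values, each with its own scale and zero point. It is a NumPy stand-in written for this README, not the ONNX Runtime kernel: the function names are hypothetical, the range is widened to include zero (an assumption), and each 4-bit code is kept in its own uint8 rather than packed two per byte.

```python
import numpy as np

BLOCK_SIZE = 32   # block size quoted in this README
LEVELS = 15       # unsigned 4-bit codes span 0..15

def quantize_blocks(w):
    """Quantize a 1-D float array to 4-bit codes with one scale/zero point per block."""
    w = np.asarray(w, dtype=np.float32)
    if w.size % BLOCK_SIZE != 0:
        raise ValueError("pad the row to a multiple of the block size first")
    blocks = w.reshape(-1, BLOCK_SIZE)
    # Assumption: widen each block's range to include 0 so the zero point fits in 0..15.
    lo = np.minimum(blocks.min(axis=1, keepdims=True), 0.0)
    hi = np.maximum(blocks.max(axis=1, keepdims=True), 0.0)
    # Guard against all-zero blocks, where hi == lo.
    scale = np.maximum(hi - lo, 1e-8) / LEVELS
    zero_point = np.round(-lo / scale)
    codes = np.clip(np.round(blocks / scale + zero_point), 0, LEVELS).astype(np.uint8)
    return codes, scale, zero_point

def dequantize_blocks(codes, scale, zero_point):
    """Map 4-bit codes back to approximate float weights."""
    return ((codes.astype(np.float32) - zero_point) * scale).reshape(-1)
```

Under these assumptions the per-weight reconstruction error stays within about half a quantization step of the block it belongs to, which is why small blocks help when a row contains a few large outliers.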

Both decoder halves are GPTQ-calibrated on 168 / 63 English Voices-in-the-Wild samples (prefill / step respectively). The step model uses past-KV-cache-aware calibration: prefill output is piped into step, so the calibration captures the realistic activation distribution of autoregressive decode.

What's in this repo

File	Size	Role
`onnx/audio_encoder_int4.onnx` (+ `.data`)	214 MB	mel features → audio embeddings (24-layer Whisper-style encoder)
`onnx/decoder_prefill_int4.onnx` (+ `.data`)	968 MB	Qwen3 1.7B decoder, full-length prefill (no KV cache, GPTQ-calibrated)
`onnx/decoder_step_int4.onnx` (+ `.data`)	968 MB	Qwen3 1.7B decoder, single-token step (with KV cache, GPTQ-calibrated)
`examples/*.wav`	~3 MB	8 noisy benchmark clips from Voices-in-the-Wild-Bench
`tokenizer_config.json` / `vocab.json` / `merges.txt`	—	Qwen3 BPE tokenizer assets
`preprocessor_config.json`	—	Whisper-style mel feature extractor config
`inference.py`	—	Standalone Python ASR pipeline using these ONNX files

Compression vs original

Component	Original (fp16 PT)	This (INT4 ONNX)	Savings
Audio encoder	~635 MB	214 MB	3.0×
LLM decoder	~3.4 GB × 2 (prefill + step)	968 MB × 2	3.5×
Total deploy	~7.5 GB	~2.0 GB	3.7× smaller

The decoder is ~95% MatMul ops so MatMulNBits 4-bit quantization works well. The audio encoder is mostly Conv2d / Linear in transformer layers — MatMulNBits quantizes the transformer Linear ops (most of the weight) but leaves the small Conv2d front-end at fp16.
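The decoder ratio in the table can be sanity-checked with a back-of-the-envelope estimate of storage per quantized weight. The sketch assumes one fp16 scale and one 4-bit zero point per block of 32 weights; those storage widths are assumptions for illustration and are not stated in this README.

```python
def effective_bits_per_weight(bits=4, block_size=32, scale_bits=16, zero_point_bits=4):
    """Bits stored per weight once per-block scale and zero point are amortised."""
    return bits + (scale_bits + zero_point_bits) / block_size

bits_per_weight = effective_bits_per_weight()   # 4 + 20/32 = 4.625
fp16_ratio = 16 / bits_per_weight               # compression relative to fp16

# Ratios implied by the sizes quoted in the table above (MB).
encoder_ratio = 635 / 214
decoder_ratio = 3400 / 968
```

Under these assumptions the theoretical ratio for a fully quantized tensor is about 3.46×, roughly in line with the decoder row. The encoder ratio is lower, which is consistent with part of it staying at fp16.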

Quality

Benchmarked on the 8 Voices-in-the-Wild-Bench example clips (real-world noisy conditions, all English), word-level agreement (1 − WER), prompt forced to language English:

Variant	Encoder	Decoder	Avg agreement	100% samples	Total size
PT bf16 (original)	fp16	fp16	95.1%	6 / 8	7.5 GB
ONNX fp16 (ref)	fp32	fp16	96.7%	7 / 8	8.2 GB
ONNX recommended (GPTQ)	INT8	INT4 GPTQ	92.7%	6 / 8	2.3 GB
ONNX RTN (previous ship)	INT8	INT4 RTN	91.9%	6 / 8	2.3 GB
ONNX small	INT4	INT4 RTN	87.8%	6 / 8	2.0 GB

The recommended config (INT8 audio encoder + GPTQ-INT4 LLM decoder) is the size/quality sweet spot for browser deployment. Forcing the language (rather than auto-detecting) recovers most of the quantization drift.

GPTQ calibration on both prefill and step yields +0.8 percentage points over plain RTN at the same model size (92.7% vs 91.9%). The gain is most visible on the echo sample: the RTN-quantized decoder previously hallucinated "the size was fine standing up at the terrible white wall", whereas the GPTQ-quantized decoder produces "the size feels fine standing up against terrible white walls", recovering the leading clause exactly.

Note: fp16 ONNX actually beats PT bf16 (96.7 vs 95.1) because forcing language skips the model's audio-quality-router language detection, which is where the PT model loses points on echo and recording (truncated).
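For readers who want to reproduce the agreement numbers, here is a minimal sketch of word-level agreement as 1 − WER, using a word-level edit distance. The text normalisation (lowercasing and whitespace splitting) is an assumption; the benchmark's actual normalisation is not described here, so scores may differ slightly.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))          # distances for an empty reference
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]                             # deleting i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            cur.append(min(prev[j] + 1,       # deletion
                           cur[j - 1] + 1,    # insertion
                           substitution))
        prev = cur
    return prev[-1] / max(len(ref), 1)

def agreement(reference, hypothesis):
    """Agreement as used in the table above: 1 - WER, floored at 0."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))
```

Averaging `agreement` over the 8 clips gives an "Avg agreement" style figure, and counting clips where it equals 1.0 gives the "100% samples" column.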

Inference (Python)

pip install onnxruntime numpy soundfile transformers qwen-asr
git clone https://huggingface.co/Reza2kn/mega-asr-onnx
cd mega-asr-onnx
python inference.py --audio examples/noise.wav

Inference (browser)

A live browser demo (loads these ONNX models directly via onnxruntime-web and WebGPU) is at Reza2kn/mega-asr-bench. The first visit downloads ~2 GB of model weights, cached by the browser for subsequent runs.

Performance

Hardware	Cold (model load)	Warm (3-4 s audio)
RTX 5080 (CUDAExecutionProvider)	~5 s	~1.5 s
M-series Mac (CPUExecutionProvider)	~12 s	~6 s
Browser, WebGPU (RTX 5080)	~10 s + ~2 GB download (cached)	~3 s
Browser, WASM CPU	~10 s + download	~30 s
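The cold-load figures above depend on which execution provider is picked. A minimal sketch of selecting a provider with CPU fallback and timing the session load follows; `pick_providers` and `load_session` are hypothetical helpers, not part of `inference.py`.

```python
import time

# Preference order: GPU first, then CPU (the two providers named in the table above).
PREFERRED = ["CUDAExecutionProvider", "CPUExecutionProvider"]

def pick_providers(available, preferred=PREFERRED):
    """Keep preferred providers that are actually available, falling back to CPU."""
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]

def load_session(model_path):
    """Create an InferenceSession and report how long the cold load took."""
    import onnxruntime as ort  # imported lazily so pick_providers works without it
    providers = pick_providers(ort.get_available_providers())
    start = time.perf_counter()
    session = ort.InferenceSession(model_path, providers=providers)
    return session, time.perf_counter() - start
```

Calling `load_session("onnx/decoder_step_int4.onnx")` from the repo root would then return the session together with a cold-load time in seconds.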

Conversion details

Exported via torch.onnx.export(..., dynamo=True) from PyTorch 2.12.
Audio encoder rewrites: replaced packed-sequence flash-attention with standard batched attention + chunked Conv2d (parity cos ≈ 0.998 vs original).
Decoder uses a single DecoderForExport wrapper that accepts a flat tuple of KV cache tensors; prefill and step are two specialisations of the same Python wrapper exported separately.
Quantization: onnxruntime.quantization.matmul_nbits_quantizer.MatMulNBitsQuantizer with block_size=32, is_symmetric=False, bits=4, algo=GPTQ. Non-MatMul ops (LayerNorm, RMSNorm, residuals, RoPE, the audio-encoder Conv2d front-end) stay at fp16.
KV cache: dynamic past-length axis (Dim("T_past")) via dynamo's dynamic_shapes API.
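The flat KV tuple and the dynamic past-length axis can be pictured with a small NumPy sketch. The (key, value) ordering within the flat tuple, the tensor layout `(batch, kv_heads, T_past, head_dim)`, and whether the concatenation happens inside the graph or in the host loop are all assumptions made for illustration; the helper names are hypothetical.

```python
import numpy as np

NUM_LAYERS = 28  # 28 layers x (key, value) = the 56 present.* tensors named in this README

def flatten_kv(per_layer):
    """[(k0, v0), (k1, v1), ...] -> (k0, v0, k1, v1, ...)."""
    return tuple(t for kv in per_layer for t in kv)

def unflatten_kv(flat):
    """Inverse of flatten_kv: regroup a flat tuple into per-layer (key, value) pairs."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of tensors")
    return [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]

def append_step(flat, new_flat, time_axis=2):
    """Grow every cached tensor by the new token's key/value along the past-length axis."""
    return tuple(np.concatenate([old, new], axis=time_axis)
                 for old, new in zip(flat, new_flat))
```

Because the past length grows by one per decoded token, that axis has to be exported as dynamic rather than baked in at a fixed size.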

GPTQ calibration

The default ORT GPTQ implementation in onnxruntime.quantization.neural_compressor.weight_only is CPU-only (numpy matmul + np.linalg.cholesky for the Hessian inverse) and takes 90 min for a 1.7B model on a workstation CPU. For this release we ported the GPTQ inner loop and Hessian accumulation to torch.cuda and added a diagonal-jitter retry on the Cholesky factorisation (fp32 is stricter than LAPACK on barely-singular Hessians). On an RTX 5080 Laptop the prefill GPTQ runs at ~99% GPU util and finishes in **35 min**; the step GPTQ takes ~3 min (fewer unique MatMul input names because of GQA sharing).
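The diagonal-jitter retry can be sketched in a few lines. This is a NumPy stand-in for illustration only, not the torch.cuda port used for this release; the function names, the starting jitter, and the ×10 schedule are assumptions.

```python
import numpy as np

def hessian_from_activations(x):
    """GPTQ-style layer Hessian proxy, 2/n * X^T X, for activations x of shape (n, d)."""
    x = np.asarray(x, dtype=np.float64)
    return 2.0 / x.shape[0] * (x.T @ x)

def cholesky_with_jitter(h, rel_jitter=1e-6, max_tries=8):
    """Cholesky-factorise h, adding growing diagonal jitter if it is not positive definite."""
    h = np.asarray(h, dtype=np.float64)
    diag_scale = float(np.mean(np.diag(h))) or 1.0   # fall back to 1.0 for an all-zero diagonal
    jitter = 0.0
    for attempt in range(max_tries):
        try:
            factor = np.linalg.cholesky(h + jitter * np.eye(h.shape[0]))
            return factor, jitter
        except np.linalg.LinAlgError:
            # Not positive definite: retry with a 10x larger diagonal offset.
            jitter = diag_scale * rel_jitter * (10.0 ** attempt)
    raise np.linalg.LinAlgError("Hessian still not positive definite after jitter retries")
```

Returning the jitter that was finally applied makes it easy to log which layers needed regularisation, which is useful when calibration sets are small and Hessians are rank-deficient.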

Prefill calibration: 168 samples (24 per noise/far_field/obstructed/distortion/recording/echo/dropout split), English-only filter on the text field, audio decoded via soundfile (Audio(decode=False)) to avoid the torchcodec import on streaming cast_column.
Step calibration: 63 samples from the same English-only set; each sample's calibration feed is built by running the fp16 prefill ONNX, capturing all 56 present.{0..27}.{key,value} tensors, embedding the greedy first predicted token, and pairing it with attention_mask of length L + 1 and position_ids = L. This gives the step's GPTQ Hessian exactly the autoregressive-decode activation distribution it sees at inference.
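The step-calibration feed described above might be assembled roughly as follows. The output names follow the present.{0..27}.{key,value} pattern from the text; the step input names (`past.{i}.{key,value}`, `inputs_embeds`) and the helper itself are hypothetical and may not match the exported graph.

```python
import numpy as np

NUM_LAYERS = 28  # 28 layers x (key, value) = 56 cached tensors

def build_step_feed(prefill_outputs, first_token_embedding, prompt_len):
    """Turn one prefill run into a single calibration feed for the step model."""
    feed = {}
    for layer in range(NUM_LAYERS):
        for kind in ("key", "value"):
            # Assumption: step inputs are named past.* and mirror prefill's present.* outputs.
            feed[f"past.{layer}.{kind}"] = prefill_outputs[f"present.{layer}.{kind}"]
    # Embedding of the greedy first predicted token (input name is an assumption).
    feed["inputs_embeds"] = first_token_embedding
    # The mask covers the L cached positions plus the one new token.
    feed["attention_mask"] = np.ones((1, prompt_len + 1), dtype=np.int64)
    # The new token sits at position L (0-based), right after the prompt.
    feed["position_ids"] = np.array([[prompt_len]], dtype=np.int64)
    return feed
```

One such feed per calibration sample means the step model's Hessians are accumulated on real cached keys and values rather than on synthetic or empty caches.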

Credits

Original model: zhifeixie/Mega-ASR (1.7B params, Apache-2.0)
ONNX export + GPTQ quantization: this repo
Benchmark: Voices-in-the-Wild-Bench
Live demo: Reza2kn/mega-asr-bench