---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: onnxruntime
tags:
- onnx
- onnxruntime
- onnxruntime-web
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- int4
- int8
- matmulnbits
- gptq
- on-device
- browser
- web
- qwen3
- qwen3-asr
- mega-asr
- transformers.js
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
---

# Mega-ASR: INT4 ONNX (GPTQ-calibrated)

INT4 ONNX export of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR), a 1.7B-parameter multilingual ASR foundation model built on Qwen3-ASR-1.7B with 2.6M training samples covering noise, far-field speech, obstruction, recording artifacts, echo, dropout, and transmission dropout.

The model is split into three ONNX files (Whisper-style: audio encoder + LLM decoder prefill + LLM decoder step) so it can be loaded **directly in the browser** via [`onnxruntime-web`](https://onnxruntime.ai/docs/api/javascript/) or as a CPU/GPU service via `onnxruntime`. INT4 weight quantization (MatMulNBits, 4-bit block-32 asymmetric) compresses the model from ~7.5 GB fp16 down to **~2 GB** total, small enough for a one-time browser cache.

**Both decoder halves are GPTQ-calibrated** on 168 / 63 English Voices-in-the-Wild samples (prefill / step respectively). The step model uses past-KV-cache-aware calibration: prefill output is piped into step, so the calibration captures the realistic activation distribution of autoregressive decode.

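To make the "4-bit block-32 asymmetric" wording above concrete, here is a minimal pure-Python sketch of round-to-nearest block quantization. It is an illustration under stated assumptions, not the ONNX Runtime kernel: the helper names (`quantize_block`, `dequantize_block`, `quantize_row`) are hypothetical, widening the value range to include zero is an assumption, and GPTQ's error-compensating weight updates are left out entirely.

```python
def quantize_block(block, bits=4):
    """Asymmetric round-to-nearest quantization of one block of weights."""
    qmax = (1 << bits) - 1                      # 15 for 4-bit codes
    # Assumption: widen the range so that 0.0 is always representable.
    lo = min(min(block), 0.0)
    hi = max(max(block), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    # The zero point is the integer code that dequantizes to 0.0.
    zero_point = min(qmax, max(0, round(-lo / scale)))
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in block]
    return codes, scale, zero_point


def dequantize_block(codes, scale, zero_point):
    """Map integer codes back to approximate float weights."""
    return [(q - zero_point) * scale for q in codes]


def quantize_row(row, block_size=32, bits=4):
    """Quantize a weight row block by block (one scale/zero point per block)."""
    return [
        quantize_block(row[i:i + block_size], bits)
        for i in range(0, len(row), block_size)
    ]


# Toy weight row of 64 values in [-0.5, 0.5): two blocks of 32.
row = [((i * 37) % 64 - 32) / 64.0 for i in range(64)]
blocks = quantize_row(row)
restored = [w for blk in blocks for w in dequantize_block(*blk)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

Each block of 32 weights ends up as 32 four-bit codes plus one scale and one zero point, and the per-weight reconstruction error stays within roughly half a quantization step. GPTQ keeps this storage layout but chooses the codes using calibration data instead of plain rounding.
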
## What's in this repo

| File | Size | Role |
| --- | ---: | --- |
| `onnx/audio_encoder_int4.onnx` (+ `.data`) | **214 MB** | mel features → audio embeddings (24-layer Whisper-style encoder) |
| `onnx/decoder_prefill_int4.onnx` (+ `.data`) | **968 MB** | Qwen3 1.7B decoder, full-length prefill (no KV cache, **GPTQ-calibrated**) |
| `onnx/decoder_step_int4.onnx` (+ `.data`) | **968 MB** | Qwen3 1.7B decoder, single-token step (with KV cache, **GPTQ-calibrated**) |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench) |
| `tokenizer_config.json` / `vocab.json` / `merges.txt` | - | Qwen3 BPE tokenizer assets |
| `preprocessor_config.json` | - | Whisper-style mel feature extractor config |
| `inference.py` | - | Standalone Python ASR pipeline using these ONNX files |

## Compression vs original

| Component | Original (fp16 PT) | This (INT4 ONNX) | Savings |
| --- | ---: | ---: | ---: |
| Audio encoder | ~635 MB | **214 MB** | 3.0× |
| LLM decoder | ~3.4 GB × 2 (prefill + step) | **968 MB × 2** | 3.5× |
| **Total deploy** | **~7.5 GB** | **~2.0 GB** | **3.7× smaller** |

The decoder is ~95% MatMul ops, so MatMulNBits 4-bit quantization works well. The audio encoder is mostly Conv2d / Linear in transformer layers: MatMulNBits quantizes the transformer Linear ops (most of the weight) but leaves the small Conv2d front-end at fp16.

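As a rough sanity check on the decoder ratio above, the storage cost of a block-quantized weight can be worked out by hand. The sketch below assumes one fp16 scale and one packed 4-bit zero point per block of 32 weights; those metadata sizes are assumptions about the layout rather than figures from this repo, and the function names are hypothetical.

```python
def bits_per_weight(bits=4, block_size=32, scale_bits=16, zero_point_bits=4):
    """Average storage cost of one weight, including per-block metadata."""
    return bits + (scale_bits + zero_point_bits) / block_size


def compression_vs_fp16(**layout):
    """How many times smaller a quantized MatMul weight is than fp16."""
    return 16 / bits_per_weight(**layout)


effective_bits = bits_per_weight()        # 4 + 20 / 32 = 4.625 bits
ratio = compression_vs_fp16()             # about 3.46x versus fp16
print(f"{effective_bits:.3f} bits per weight, {ratio:.2f}x smaller than fp16")
```

Under these assumptions the per-block metadata adds 0.625 bits per weight, which puts a pure-MatMul graph near 3.46× and is roughly in line with the ~3.5× decoder figure in the table. Tensors that stay at fp16, such as the audio encoder's Conv2d front-end, pull the overall ratio down.
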
## Quality

Benchmarked on the 8 [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench) example clips (real-world noisy conditions, all English). The metric is word-level agreement (1 − WER), with the prompt forced to `language English`:

| Variant | Encoder | Decoder | Avg agreement | 100% samples | Total size |
| --- | --- | --- | ---: | ---: | ---: |
| PT bf16 (original) | fp16 | fp16 | 95.1% | 6 / 8 | 7.5 GB |
| ONNX fp16 (ref) | fp32 | fp16 | **96.7%** | 7 / 8 | 8.2 GB |
| **ONNX recommended (GPTQ)** | **INT8** | **INT4 GPTQ** | **92.7%** | **6 / 8** | **2.3 GB** |
| ONNX RTN (previous ship) | INT8 | INT4 RTN | 91.9% | 6 / 8 | 2.3 GB |
| ONNX small | INT4 | INT4 RTN | 87.8% | 6 / 8 | 2.0 GB |

The recommended config (INT8 audio encoder + GPTQ-INT4 LLM decoder) is the size/quality sweet spot for browser deployment. Forcing the language (rather than auto-detecting) recovers most of the quantization drift.

GPTQ calibration on both prefill and step yields **+0.8 percentage points over plain RTN** at the same model size. The gain is most visible on the `echo` sample, where the RTN-quantized decoder previously hallucinated *"the size was fine standing up at the terrible white wall"*; the GPTQ-quantized decoder produces *"the size feels fine standing up against terrible white walls"*, recovering the leading clause exactly.

**Note**: fp16 ONNX actually beats PT bf16 (96.7 vs 95.1) because forcing the language skips the model's audio-quality-router language detection, which is where the PT model loses points on `echo` and `recording` (truncated).

## Inference (Python)

```bash
# Install the runtime and the audio/tokenizer dependencies
pip install onnxruntime numpy soundfile transformers qwen-asr
# Fetch the ONNX files, tokenizer assets, and example clips
git clone https://huggingface.co/Reza2kn/mega-asr-onnx
cd mega-asr-onnx
# Transcribe one of the bundled noisy examples
python inference.py --audio examples/noise.wav
```

## Inference (browser)

A live browser demo (loads these ONNX models directly via `onnxruntime-web` and WebGPU) is at [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench).
The first visit downloads ~2 GB of model weights, which the browser caches for subsequent runs.

## Performance

| Hardware | Cold (model load) | Warm (3-4 s audio) |
| --- | ---: | ---: |
| RTX 5080 (CUDAExecutionProvider) | ~5 s | ~1.5 s |
| M-series Mac (CPUExecutionProvider) | ~12 s | ~6 s |
| Browser, WebGPU (RTX 5080) | ~10 s + ~1 GB download (cached) | ~3 s |
| Browser, WASM CPU | ~10 s + download | ~30 s |

## Conversion details

- Exported via `torch.onnx.export(..., dynamo=True)` from PyTorch 2.12.
- Audio encoder rewrites: replaced packed-sequence flash-attention with standard batched attention + chunked Conv2d (parity cos ≈ 0.998 vs original).
- Decoder uses a single `DecoderForExport` wrapper that accepts a flat tuple of KV cache tensors; prefill and step are two specialisations of the same Python wrapper, exported separately.
- Quantization: `onnxruntime.quantization.matmul_nbits_quantizer.MatMulNBitsQuantizer` with `block_size=32`, `is_symmetric=False`, `bits=4`, `algo=GPTQ`. Non-MatMul ops (LayerNorm, RMSNorm, residuals, RoPE, the audio-encoder Conv2d front-end) stay at fp16.
- KV cache: dynamic past-length axis (`Dim("T_past")`) via dynamo's `dynamic_shapes` API.

### GPTQ calibration

The default ORT GPTQ implementation in `onnxruntime.quantization.neural_compressor.weight_only` is CPU-only (numpy matmul + `np.linalg.cholesky` for the Hessian inverse) and takes ~90 min for a 1.7B model on a workstation CPU. For this release we ported the GPTQ inner loop and the Hessian accumulation to `torch.cuda` and added a diagonal-jitter retry on the Cholesky factorisation (fp32 is stricter than LAPACK on barely-singular Hessians). On an RTX 5080 Laptop the prefill GPTQ runs at ~99% GPU utilisation and finishes in **~35 min**; the step GPTQ takes **~3 min** (fewer unique MatMul input names because of GQA sharing).

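The diagonal-jitter retry mentioned above can be sketched as follows. This is a minimal stdlib-only illustration, not the code used for this release: the function names, the relative jitter of `1e-6`, the 10× growth factor, and the retry limit are all assumptions, and the hand-written `cholesky` merely stands in for a GPU call such as `torch.linalg.cholesky`.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L @ L.T == a, or raise ValueError."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                pivot = a[i][i] - s
                if pivot <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(pivot)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def cholesky_with_jitter(hessian, rel_jitter=1e-6, growth=10.0, max_tries=8):
    """Retry the factorisation, adding a growing value to the diagonal."""
    n = len(hessian)
    mean_diag = sum(hessian[i][i] for i in range(n)) / n
    base = mean_diag if mean_diag > 0.0 else 1.0
    jitter = 0.0                      # the first attempt uses the raw Hessian
    for _ in range(max_tries):
        damped = [
            [hessian[i][j] + (jitter if i == j else 0.0) for j in range(n)]
            for i in range(n)
        ]
        try:
            return cholesky(damped), jitter
        except ValueError:
            # Assumed schedule: start relative to the mean diagonal, then grow.
            jitter = rel_jitter * base if jitter == 0.0 else jitter * growth
    raise ValueError("Hessian still not positive definite after jitter retries")


# A rank-deficient "Hessian": the plain factorisation fails, the retry succeeds.
singular = [[1.0, 1.0], [1.0, 1.0]]
factor, used_jitter = cholesky_with_jitter(singular)
```

Scaling the jitter by the mean diagonal keeps the perturbation proportional to the Hessian's magnitude, so a well-conditioned matrix is factorised untouched and only a near-singular one gets damped.
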
- **Prefill calibration**: 168 samples (24 per split across `noise`, `far_field`, `obstructed`, `distortion`, `recording`, `echo`, and `dropout`), with an English-only filter on the `text` field. Audio is decoded via `soundfile` (`Audio(decode=False)`) to avoid the `torchcodec` import on streaming `cast_column`.
- **Step calibration**: 63 samples from the same English-only set. Each sample's calibration feed is built by running the fp16 prefill ONNX, capturing all 56 `present.{0..27}.{key,value}` tensors, embedding the greedy first predicted token, and pairing it with an `attention_mask` of length `L + 1` and `position_ids = L`. This gives the step model's GPTQ Hessian the activation distribution it actually sees during autoregressive decoding.

## Credits

- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B params, Apache-2.0)
- ONNX export + GPTQ quantization: this repo
- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
- Live demo: [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench)