---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: onnxruntime
tags:
- onnx
- onnxruntime
- onnxruntime-web
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- int4
- int8
- matmulnbits
- gptq
- on-device
- browser
- web
- qwen3
- qwen3-asr
- mega-asr
- transformers.js
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
---
Mega-ASR: INT4 ONNX (GPTQ-calibrated)
INT4 ONNX export of zhifeixie/Mega-ASR, a 1.7B-parameter multilingual ASR foundation model built on Qwen3-ASR-1.7B with 2.6M training samples covering noise, far-field speech, obstruction, recording artifacts, echo, dropout, and transmission dropout.
The model is split into three ONNX files (Whisper-style: audio encoder + LLM
decoder prefill + LLM decoder step) so it can be loaded directly in the
browser via onnxruntime-web or run as a CPU/GPU service via onnxruntime.
INT4 weight quantization (MatMulNBits, 4-bit, block size 32, asymmetric)
compresses the model from 7.5 GB in fp16 down to **2 GB** total, small
enough for a one-time browser cache.
Both decoder halves are GPTQ-calibrated on 168 / 63 English Voices-in-the-Wild samples (prefill / step respectively). The step model uses past-KV-cache-aware calibration: prefill output is piped into step, so the calibration captures the realistic activation distribution of autoregressive decode.
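The three-file split implies a simple control flow: run the encoder once, run prefill once over the full prompt, then call the step model once per generated token while carrying the KV cache forward. The sketch below shows only that control flow with greedy decoding. The three callables and their signatures are hypothetical stand-ins for the ONNX sessions, not the interface of `inference.py`.

```python
from typing import Callable, List, Tuple

import numpy as np

KVCache = Tuple[np.ndarray, ...]


def greedy_transcribe(
    encode: Callable[[np.ndarray], np.ndarray],
    prefill: Callable[[np.ndarray, np.ndarray], Tuple[np.ndarray, KVCache]],
    step: Callable[[int, int, KVCache], Tuple[np.ndarray, KVCache]],
    mel: np.ndarray,
    prompt_ids: np.ndarray,
    eos_id: int,
    max_new_tokens: int = 256,
) -> List[int]:
    """Greedy decode over encoder -> prefill -> repeated step calls.

    Assumed (hypothetical) contracts:
      encode(mel)                      -> audio embeddings
      prefill(audio_embeds, prompt)    -> (logits of shape (L, vocab), kv cache)
      step(token_id, position, kv)     -> (logits of shape (vocab,), updated kv cache)
    """
    audio_embeds = encode(mel)                      # audio encoder runs once
    logits, kv = prefill(audio_embeds, prompt_ids)  # full-length pass, no past KV
    position = logits.shape[0]                      # first step sits at position L
    token = int(np.argmax(logits[-1]))              # greedy first predicted token

    generated: List[int] = []
    while token != eos_id and len(generated) < max_new_tokens:
        generated.append(token)
        step_logits, kv = step(token, position, kv)  # single-token pass with KV cache
        token = int(np.argmax(step_logits))
        position += 1
    return generated
```

Splitting prefill from step this way keeps the per-token graph small, since each step call processes one new position and reuses the cached keys and values.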
What's in this repo
| File | Size | Role |
|---|---|---|
| `onnx/audio_encoder_int4.onnx` (+ `.data`) | 214 MB | mel features → audio embeddings (24-layer Whisper-style encoder) |
| `onnx/decoder_prefill_int4.onnx` (+ `.data`) | 968 MB | Qwen3 1.7B decoder, full-length prefill (no KV cache, GPTQ-calibrated) |
| `onnx/decoder_step_int4.onnx` (+ `.data`) | 968 MB | Qwen3 1.7B decoder, single-token step (with KV cache, GPTQ-calibrated) |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
| `tokenizer_config.json` / `vocab.json` / `merges.txt` | - | Qwen3 BPE tokenizer assets |
| `preprocessor_config.json` | - | Whisper-style mel feature extractor config |
| `inference.py` | - | Standalone Python ASR pipeline using these ONNX files |
Compression vs original
| Component | Original (fp16 PT) | This (INT4 ONNX) | Savings |
|---|---|---|---|
| Audio encoder | ~635 MB | 214 MB | 3.0× |
| LLM decoder | ~3.4 GB × 2 (prefill + step) | 968 MB × 2 | 3.5× |
| Total deploy | ~7.5 GB | ~2.0 GB | 3.7× smaller |
The decoder is ~95% MatMul ops, so MatMulNBits 4-bit quantization works well. The audio encoder is mostly Conv2d / Linear in transformer layers: MatMulNBits quantizes the transformer Linear ops (most of the weight) but leaves the small Conv2d front-end at fp16.
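To make "4-bit, block size 32, asymmetric" concrete, here is a minimal NumPy sketch of round-to-nearest blockwise quantization. It illustrates the numeric scheme only: it is not ONNX Runtime's packed storage layout, and it does not include GPTQ's error compensation. Extending each block's range to include zero, and the storage estimate of one fp16 scale plus one 4-bit zero-point per block, are assumptions of this sketch.

```python
import numpy as np


def quantize_int4_blockwise(w: np.ndarray, block_size: int = 32):
    """Asymmetric round-to-nearest 4-bit quantization, one scale/zero-point per block.

    Assumes w.size is a multiple of block_size (a real quantizer would pad).
    """
    assert w.size % block_size == 0, "sketch assumes no padding is needed"
    blocks = w.astype(np.float32).reshape(-1, block_size)
    # Assumption: include 0 in the range so the zero-point lands in [0, 15].
    w_min = np.minimum(blocks.min(axis=1, keepdims=True), 0.0)
    w_max = np.maximum(blocks.max(axis=1, keepdims=True), 0.0)
    scale = (w_max - w_min) / 15.0                 # 4 bits -> 16 levels, 15 intervals
    scale = np.where(scale == 0.0, 1.0, scale)     # all-zero block: any scale works
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(blocks / scale) + zero_point, 0, 15).astype(np.uint8)
    return q, scale.astype(np.float32), zero_point.astype(np.uint8)


def dequantize_int4_blockwise(q: np.ndarray, scale: np.ndarray, zero_point: np.ndarray):
    """Inverse mapping: w is approximated by (q - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point.astype(np.float32)) * scale


def bits_per_weight(block_size: int = 32, scale_bits: int = 16, zero_point_bits: int = 4) -> float:
    """Storage cost per weight under the assumed per-block metadata sizes."""
    return 4 + (scale_bits + zero_point_bits) / block_size
```

Under those storage assumptions each weight costs 4.625 bits, so an fp16 MatMul weight shrinks by about 16 / 4.625 ≈ 3.5×, which is in line with the decoder row in the table above.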
Quality
Benchmarked on the 8 Voices-in-the-Wild-Bench
example clips (real-world noisy conditions, all English), using word-level
agreement (1 − WER) with the prompt forced to language English:
| Variant | Encoder | Decoder | Avg agreement | 100% samples | Total size |
|---|---|---|---|---|---|
| PT bf16 (original) | fp16 | fp16 | 95.1% | 6 / 8 | 7.5 GB |
| ONNX fp16 (ref) | fp32 | fp16 | 96.7% | 7 / 8 | 8.2 GB |
| ONNX recommended (GPTQ) | INT8 | INT4 GPTQ | 92.7% | 6 / 8 | 2.3 GB |
| ONNX RTN (previous ship) | INT8 | INT4 RTN | 91.9% | 6 / 8 | 2.3 GB |
| ONNX small | INT4 | INT4 RTN | 87.8% | 6 / 8 | 2.0 GB |
The recommended config (INT8 audio encoder + GPTQ-INT4 LLM decoder) is the size/quality sweet spot for browser deployment. Forcing the language (rather than auto-detecting) recovers most of the quantization drift.
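For readers who want to reproduce the agreement metric, the sketch below computes word-level agreement as 1 − WER using a standard Levenshtein distance over words. The text normalisation (lowercasing, whitespace tokenisation) and the clamp at zero are assumptions; this card does not specify how the reported numbers were normalised.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumed normalisation: lowercase and split on whitespace.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / max(len(ref), 1)


def word_agreement(reference: str, hypothesis: str) -> float:
    """Agreement = 1 - WER, clamped at 0 because WER can exceed 1."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


print(word_agreement("a b c d", "a x c d"))  # one substitution in four words
```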
GPTQ calibration on both prefill and step yields +0.8 percentage points over
plain RTN at the same model size (92.7% vs 91.9%), most visibly on the echo
sample: the RTN-quantized decoder previously hallucinated "the size was fine
standing up at the terrible white wall", whereas the GPTQ-quantized decoder
produces "the size feels fine standing up against terrible white walls",
recovering the leading clause exactly.
Note: fp16 ONNX actually beats PT bf16 (96.7 vs 95.1) because forcing
language skips the model's audio-quality-router language detection,
which is where the PT model loses points on echo and recording
(truncated).
Inference (Python)
```bash
pip install onnxruntime numpy soundfile transformers qwen-asr
git clone https://huggingface.co/Reza2kn/mega-asr-onnx
cd mega-asr-onnx
python inference.py --audio examples/noise.wav
```
Inference (browser)
A live browser demo (loads these ONNX models directly via onnxruntime-web
and WebGPU) is at
Reza2kn/mega-asr-bench.
The first visit downloads ~2 GB of model weights, cached by the browser for
subsequent runs.
Performance
| Hardware | Cold (model load) | Warm (3-4 s audio) |
|---|---|---|
| RTX 5080 (CUDAExecutionProvider) | ~5 s | ~1.5 s |
| M-series Mac (CPUExecutionProvider) | ~12 s | ~6 s |
| Browser, WebGPU (RTX 5080) | ~10 s + ~1 GB download (cached) | ~3 s |
| Browser, WASM CPU | ~10 s + download | ~30 s |
Conversion details
- Exported via `torch.onnx.export(..., dynamo=True)` from PyTorch 2.12.
- Audio encoder rewrites: replaced packed-sequence flash-attention with standard batched attention + chunked Conv2d (parity cos ≈ 0.998 vs original).
- Decoder uses a single `DecoderForExport` wrapper that accepts a flat tuple of KV cache tensors; prefill and step are two specialisations of the same Python wrapper exported separately.
- Quantization: `onnxruntime.quantization.matmul_nbits_quantizer.MatMulNBitsQuantizer` with `block_size=32`, `is_symmetric=False`, `bits=4`, `algo=GPTQ`. Non-MatMul ops (LayerNorm, RMSNorm, residuals, RoPE, the audio-encoder Conv2d front-end) stay at fp16.
- KV cache: dynamic past-length axis (`Dim("T_past")`) via dynamo's `dynamic_shapes` API.
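The "flat tuple of KV cache tensors" can be pictured with a small helper that maps per-layer (key, value) pairs to a flat tuple and back. The layer count of 28 follows from the 56 `present.{0..27}.{key,value}` tensors mentioned under GPTQ calibration; the key-before-value ordering within each layer is an assumption of this sketch, and the helpers are illustrative rather than the actual `DecoderForExport` code.

```python
from typing import List, Sequence, Tuple

import numpy as np

NUM_LAYERS = 28  # 56 present tensors / 2 (key + value) per layer

LayerKV = Tuple[np.ndarray, np.ndarray]


def present_names(num_layers: int = NUM_LAYERS) -> List[str]:
    """Output names in flat order: present.0.key, present.0.value, present.1.key, ..."""
    return [f"present.{i}.{kind}" for i in range(num_layers) for kind in ("key", "value")]


def flatten_kv(layers: Sequence[LayerKV]) -> Tuple[np.ndarray, ...]:
    """[(k0, v0), (k1, v1), ...] -> (k0, v0, k1, v1, ...), as an exporter-friendly flat tuple."""
    return tuple(tensor for pair in layers for tensor in pair)


def unflatten_kv(flat: Sequence[np.ndarray]) -> List[LayerKV]:
    """Inverse of flatten_kv; expects an even number of tensors."""
    if len(flat) % 2 != 0:
        raise ValueError("flat KV cache must hold one key and one value per layer")
    return [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]
```

A flat tuple gives every cache tensor its own named graph input and output, which is what allows a per-tensor dynamic past-length axis to be declared at export time.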
GPTQ calibration
The default ORT GPTQ implementation in
`onnxruntime.quantization.neural_compressor.weight_only` is CPU-only (numpy
matmul + `np.linalg.cholesky` for the Hessian inverse), and takes 90 min
for a 1.7B model on a workstation CPU. For this release we ported the
GPTQ inner loop + Hessian accumulation to `torch.cuda` and added a
diagonal-jitter retry on the Cholesky factorisation (fp32 is stricter
than LAPACK on barely-singular Hessians). On an RTX 5080 Laptop the
prefill GPTQ runs at ~99% GPU util and finishes in **35 min**; the step
GPTQ takes ~3 min (fewer unique MatMul input names because of GQA
sharing).
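The diagonal-jitter retry mentioned above can be sketched as follows. This is a NumPy illustration of the idea, not the `torch.cuda` port used for this release: the initial jitter size, its scaling by the mean diagonal, and the tenfold growth per retry are all assumptions chosen for the example.

```python
import numpy as np


def cholesky_with_jitter(
    hessian: np.ndarray,
    initial_damp: float = 1e-6,
    growth: float = 10.0,
    max_tries: int = 8,
):
    """Cholesky factorisation that retries with growing diagonal jitter.

    Returns (L, damp) where L @ L.T equals hessian + damp * I and damp is the
    jitter that finally made the factorisation succeed (0.0 if none was needed).
    """
    h = np.asarray(hessian, dtype=np.float64)
    mean_diag = float(np.mean(np.diag(h)))
    unit = mean_diag if mean_diag > 0 else 1.0     # assumed: jitter relative to mean diagonal
    identity = np.eye(h.shape[0])
    damp = 0.0
    for _ in range(max_tries + 1):
        try:
            return np.linalg.cholesky(h + damp * identity), damp
        except np.linalg.LinAlgError:
            # Not positive definite at this jitter level: start or grow the jitter.
            damp = initial_damp * unit if damp == 0.0 else damp * growth
    raise np.linalg.LinAlgError("Hessian not positive definite even after jitter")


# A rank-1 (singular) Hessian fails without jitter and succeeds with it.
factor, used = cholesky_with_jitter(np.array([[1.0, 1.0], [1.0, 1.0]]))
print(used > 0.0)
```

Growing the jitter geometrically keeps the perturbation as small as possible for well-conditioned layers while still guaranteeing progress on the nearly singular ones.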
- Prefill calibration: 168 samples (24 per noise/far_field/obstructed/distortion/recording/echo/dropout split), English-only filter on the `text` field, audio decoded via `soundfile` (`Audio(decode=False)`) to avoid the `torchcodec` import on streaming `cast_column`.
- Step calibration: 63 samples from the same English-only set; each sample's calibration feed is built by running the fp16 prefill ONNX, capturing all 56 `present.{0..27}.{key,value}` tensors, embedding the greedy first predicted token, and pairing it with `attention_mask` of length `L + 1` and `position_ids = L`. This gives the step's GPTQ Hessian exactly the autoregressive-decode activation distribution it sees at inference.
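The step calibration feed described above can be assembled roughly like this. Only the `present.{i}.{key,value}` names, the mask length `L + 1`, and `position_ids = L` come from this card; the step input names (`inputs_embeds`, `past.{i}.{key,value}`) and the tensor shapes are hypothetical and should be checked against the actual ONNX graph.

```python
from typing import Dict

import numpy as np


def build_step_calibration_feed(
    present: Dict[str, np.ndarray],
    next_token_embed: np.ndarray,
    prefill_len: int,
) -> Dict[str, np.ndarray]:
    """Turn prefill outputs into one calibration feed for the step model.

    present          : prefill outputs keyed "present.{i}.{key,value}"
    next_token_embed : embedding of the greedy first predicted token, shape (hidden,)
    prefill_len      : L, the sequence length seen by prefill
    """
    feed = {
        # Hypothetical input name: one new token, batch size 1.
        "inputs_embeds": next_token_embed[None, None, :],
        # The mask covers the L cached positions plus the new token.
        "attention_mask": np.ones((1, prefill_len + 1), dtype=np.int64),
        # The new token sits at zero-based position L.
        "position_ids": np.array([[prefill_len]], dtype=np.int64),
    }
    for name, tensor in present.items():
        # Hypothetical naming: present.* outputs feed the matching past.* inputs.
        feed[name.replace("present.", "past.", 1)] = tensor
    return feed
```

Feeding real prefill outputs, rather than random or zero caches, is what lets the step model's calibration statistics reflect the activations it sees during actual decoding.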
Credits
- Original model: zhifeixie/Mega-ASR (1.7B params, Apache-2.0)
- ONNX export + GPTQ quantization: this repo
- Benchmark: Voices-in-the-Wild-Bench
- Live demo: Reza2kn/mega-asr-bench