Mega-ASR: INT4 ONNX (GPTQ-calibrated)

INT4 ONNX export of zhifeixie/Mega-ASR, a 1.7B-parameter multilingual ASR foundation model built on Qwen3-ASR-1.7B with 2.6M training samples covering noise, far-field speech, obstruction, recording artifacts, echo, dropout, and transmission dropout.

The model is split into three ONNX files (Whisper-style: audio encoder + LLM decoder prefill + LLM decoder step) so it can be loaded directly in the browser via onnxruntime-web or as a CPU/GPU service via onnxruntime. INT4 weight quantization (MatMulNBits, 4-bit block-32 asymmetric) compresses the model from 7.5 GB fp16 down to **2 GB** total, small enough for a one-time browser cache.
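To make "4-bit block-32 asymmetric" concrete, here is a minimal NumPy sketch of that quantization scheme. It is an illustration under stated assumptions, not the MatMulNBits implementation: the function names are hypothetical, codes are kept one per byte rather than packed two per byte, and the rounding details in ONNX Runtime may differ.

```python
import numpy as np

def quantize_block32_asym(w, block_size=32, bits=4):
    """Quantize a 1-D float array to unsigned codes, one scale/zero-point per block."""
    qmax = (1 << bits) - 1  # 15 for 4-bit codes
    blocks = np.asarray(w, dtype=np.float32).reshape(-1, block_size)
    # Asymmetric range per block; widened to include 0 so the zero-point fits in [0, qmax].
    lo = np.minimum(blocks.min(axis=1, keepdims=True), 0.0)
    hi = np.maximum(blocks.max(axis=1, keepdims=True), 0.0)
    scale = np.maximum(hi - lo, 1e-8) / qmax          # guard against all-zero blocks
    zero_point = np.clip(np.round(-lo / scale), 0, qmax)
    codes = np.clip(np.round(blocks / scale) + zero_point, 0, qmax).astype(np.uint8)
    return codes, scale, zero_point

def dequantize_block32(codes, scale, zero_point):
    """Invert the mapping above: w_hat = (code - zero_point) * scale."""
    return ((codes.astype(np.float32) - zero_point) * scale).reshape(-1)

# Usage: quantize 128 weights (4 blocks of 32) and reconstruct them.
rng = np.random.default_rng(0)
weights = rng.standard_normal(128).astype(np.float32)
codes, scale, zero_point = quantize_block32_asym(weights)
restored = dequantize_block32(codes, scale, zero_point)
```

Each block stores 32 four-bit codes plus one scale and one zero-point, which is why smaller blocks track the weight distribution more closely at the cost of more metadata.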

Both decoder halves are GPTQ-calibrated on 168 / 63 English Voices-in-the-Wild samples (prefill / step respectively). The step model uses past-KV-cache-aware calibration: prefill output is piped into step, so the calibration captures the realistic activation distribution of autoregressive decode.

What's in this repo

| File | Size | Role |
|---|---|---|
| onnx/audio_encoder_int4.onnx (+ .data) | 214 MB | mel features → audio embeddings (24-layer Whisper-style encoder) |
| onnx/decoder_prefill_int4.onnx (+ .data) | 968 MB | Qwen3 1.7B decoder, full-length prefill (no KV cache, GPTQ-calibrated) |
| onnx/decoder_step_int4.onnx (+ .data) | 968 MB | Qwen3 1.7B decoder, single-token step (with KV cache, GPTQ-calibrated) |
| examples/*.wav | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
| tokenizer_config.json / vocab.json / merges.txt | - | Qwen3 BPE tokenizer assets |
| preprocessor_config.json | - | Whisper-style mel feature extractor config |
| inference.py | - | Standalone Python ASR pipeline using these ONNX files |

Compression vs original

| Component | Original (fp16 PT) | This (INT4 ONNX) | Savings |
|---|---|---|---|
| Audio encoder | ~635 MB | 214 MB | 3.0× |
| LLM decoder | ~3.4 GB × 2 (prefill + step) | 968 MB × 2 | 3.5× |
| Total deploy | ~7.5 GB | ~2.0 GB | 3.7× smaller |

The decoder is ~95% MatMul ops, so MatMulNBits 4-bit quantization works well. The audio encoder is mostly Conv2d / Linear in transformer layers: MatMulNBits quantizes the transformer Linear ops (most of the weight) but leaves the small Conv2d front-end at fp16.

Quality

Benchmarked on the 8 Voices-in-the-Wild-Bench example clips (real-world noisy conditions, all English), word-level agreement (1 - WER), prompt forced to language English:

| Variant | Encoder | Decoder | Avg agreement | 100% samples | Total size |
|---|---|---|---|---|---|
| PT bf16 (original) | fp16 | fp16 | 95.1% | 6 / 8 | 7.5 GB |
| ONNX fp16 (ref) | fp32 | fp16 | 96.7% | 7 / 8 | 8.2 GB |
| ONNX recommended (GPTQ) | INT8 | INT4 GPTQ | 92.7% | 6 / 8 | 2.3 GB |
| ONNX RTN (previous ship) | INT8 | INT4 RTN | 91.9% | 6 / 8 | 2.3 GB |
| ONNX small | INT4 | INT4 RTN | 87.8% | 6 / 8 | 2.0 GB |

The recommended config (INT8 audio encoder + GPTQ-INT4 LLM decoder) is the size/quality sweet spot for browser deployment. Forcing the language (rather than auto-detecting) recovers most of the quantization drift.

GPTQ calibration on both prefill and step yields +0.8 percentage points of average agreement over plain RTN at the same model size (92.7% vs 91.9%). The gain is most visible on the echo sample, where the RTN-quantized decoder previously hallucinated "the size was fine standing up at the terrible white wall"; the GPTQ-quantized decoder produces "the size feels fine standing up against terrible white walls", recovering the leading clause exactly.

Note: fp16 ONNX actually beats PT bf16 (96.7 vs 95.1) because forcing language skips the model's audio-quality-router language detection, which is where the PT model loses points on echo and recording (truncated).
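For readers who want to reproduce the agreement metric, the following is a minimal sketch of word-level agreement as 1 - WER, using a word-level Levenshtein distance. The text normalization here (lowercasing and whitespace splitting) is an assumption; the model card does not describe the normalizer used for the numbers above, and the function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / max(len(ref), 1)

def agreement(reference: str, hypothesis: str) -> float:
    """Agreement = 1 - WER, floored at 0 because WER can exceed 1."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))

# Usage with placeholder strings (not benchmark data):
score = agreement("turn the volume down please", "turn the volume up please")
```

Averaging `agreement` over the clips gives a figure comparable in spirit to the "Avg agreement" column, and counting clips with a score of 1.0 corresponds to the "100% samples" column.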

Inference (Python)

pip install onnxruntime numpy soundfile transformers qwen-asr
git clone https://huggingface.co/Reza2kn/mega-asr-onnx
cd mega-asr-onnx
python inference.py --audio examples/noise.wav

Inference (browser)

A live browser demo (loads these ONNX models directly via onnxruntime-web and WebGPU) is at Reza2kn/mega-asr-bench. The first visit downloads ~2 GB of model weights, cached by the browser for subsequent runs.

Performance

| Hardware | Cold (model load) | Warm (3-4 s audio) |
|---|---|---|
| RTX 5080 (CUDAExecutionProvider) | ~5 s | ~1.5 s |
| M-series Mac (CPUExecutionProvider) | ~12 s | ~6 s |
| Browser, WebGPU (RTX 5080) | ~10 s + ~1 GB download (cached) | ~3 s |
| Browser, WASM CPU | ~10 s + download | ~30 s |

Conversion details

  • Exported via torch.onnx.export(..., dynamo=True) from PyTorch 2.12.
  • Audio encoder rewrites: replaced packed-sequence flash-attention with standard batched attention + chunked Conv2d (parity cos ≈ 0.998 vs original).
  • Decoder uses a single DecoderForExport wrapper that accepts a flat tuple of KV cache tensors; prefill and step are two specialisations of the same Python wrapper exported separately.
  • Quantization: onnxruntime.quantization.matmul_nbits_quantizer.MatMulNBitsQuantizer with block_size=32, is_symmetric=False, bits=4, algo=GPTQ. Non-MatMul ops (LayerNorm, RMSNorm, residuals, RoPE, the audio-encoder Conv2d front-end) stay at fp16.
  • KV cache: dynamic past-length axis (Dim("T_past")) via dynamo's dynamic_shapes API.
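The prefill/step split described in the list above implies a two-phase decode loop: one prefill call over the whole prompt, then repeated single-token step calls that carry the KV cache forward. The sketch below shows that control flow only. The `prefill` and `step` callables are hypothetical stand-ins for the two ONNX sessions, the tensors are unbatched for brevity, and `prompt_embeds` is assumed to already contain the audio embeddings merged with the text prompt; the real graph inputs and outputs are defined by inference.py, not by this sketch.

```python
from typing import Callable, List, Tuple
import numpy as np

KVCache = Tuple[np.ndarray, ...]  # flat tuple of key/value tensors
PrefillFn = Callable[[np.ndarray], Tuple[np.ndarray, KVCache]]
StepFn = Callable[[int, KVCache, int], Tuple[np.ndarray, KVCache]]

def greedy_decode(prefill: PrefillFn, step: StepFn, prompt_embeds: np.ndarray,
                  eos_id: int, max_new_tokens: int = 64) -> List[int]:
    """Greedy decoding: one prefill pass, then one step call per generated token."""
    # Prefill sees the full prompt (length L) and returns logits plus the initial cache.
    logits, kv = prefill(prompt_embeds)
    position = prompt_embeds.shape[0]  # the first generated token sits at position L
    token = int(np.argmax(logits[-1]))
    tokens: List[int] = []
    while token != eos_id and len(tokens) < max_new_tokens:
        tokens.append(token)
        # Step consumes one token and the cache, and returns the grown cache.
        logits, kv = step(token, kv, position)
        position += 1
        token = int(np.argmax(logits[-1]))
    return tokens
```

Keeping the two phases as separate graphs lets the step graph stay small and fixed-shape in the token dimension, with only the past-length axis dynamic.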

GPTQ calibration

The default ORT GPTQ implementation in onnxruntime.quantization.neural_compressor.weight_only is CPU-only (numpy matmul + np.linalg.cholesky for the Hessian inverse) and takes 90 min for a 1.7B model on a workstation CPU. For this release we ported the GPTQ inner loop and Hessian accumulation to torch.cuda and added a diagonal-jitter retry on the Cholesky factorisation (fp32 is stricter than LAPACK on barely-singular Hessians). On an RTX 5080 Laptop the prefill GPTQ runs at ~99% GPU utilization and finishes in **35 min**; the step GPTQ takes ~3 min (fewer unique MatMul input names because of GQA sharing).
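The diagonal-jitter retry mentioned above can be sketched as follows. This is a NumPy, CPU-side illustration of the idea rather than the torch.cuda port used for this release; the function name, the jitter schedule (starting at 1e-6 of the mean diagonal and growing tenfold), and the retry limit are all assumptions.

```python
import numpy as np

def cholesky_with_jitter(hessian, initial_jitter=1e-6, growth=10.0, max_tries=8):
    """Cholesky-factorise a Hessian in fp32, adding diagonal jitter when it fails.

    Returns the lower-triangular factor and the jitter that was finally applied.
    """
    h = np.asarray(hessian, dtype=np.float32)
    eye = np.eye(h.shape[0], dtype=np.float32)
    # Scale the jitter to the Hessian's magnitude (assumed schedule, see lead-in).
    base = float(np.mean(np.diag(h))) or 1.0
    jitter = 0.0
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(h + np.float32(jitter) * eye), jitter
        except np.linalg.LinAlgError:
            # Not numerically positive definite: bump the diagonal and retry.
            jitter = initial_jitter * base if jitter == 0.0 else jitter * growth
    raise np.linalg.LinAlgError("Hessian is not positive definite even after jitter")

# Usage: a rank-1 Hessian X^T X from a single calibration row is singular,
# so the first attempt fails and a small jitter is added.
x = np.array([[1.0, 2.0, 3.0]], dtype=np.float32)
factor, used_jitter = cholesky_with_jitter(x.T @ x)
```

The jitter acts like a small amount of extra damping on the Hessian, so it is kept as small as the factorisation allows.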

  • Prefill calibration: 168 samples (24 per noise/far_field/obstructed/distortion/recording/echo/dropout split), English-only filter on the text field, audio decoded via soundfile (Audio(decode=False)) to avoid the torchcodec import on streaming cast_column.
  • Step calibration: 63 samples from the same English-only set; each sample's calibration feed is built by running the fp16 prefill ONNX, capturing all 56 present.{0..27}.{key,value} tensors, embedding the greedy first predicted token, and pairing it with attention_mask of length L + 1 and position_ids = L. This gives the step's GPTQ Hessian exactly the autoregressive-decode activation distribution it sees at inference.
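The step calibration feed described in the list above can be sketched roughly as below. Only the `present.{i}.{key,value}` output names, the mask length L + 1, and position_ids = L come from the text. The step input names (`inputs_embeds`, `past_key_values.{i}.{key,value}`), the batch dimension of 1, and the KV layout with the sequence axis second to last are assumptions made for illustration.

```python
import numpy as np

NUM_LAYERS = 28  # matches present.{0..27} in the text: 28 layers x (key, value) = 56 tensors

def build_step_feed(prefill_outputs, first_token_embed):
    """Turn one prefill result into a calibration feed for the step graph."""
    # Assumed KV layout: (batch, kv_heads, seq_len, head_dim), so seq_len is axis -2.
    prompt_len = prefill_outputs["present.0.key"].shape[-2]
    feed = {
        # Embedding of the greedy first predicted token, shaped (batch, 1, hidden).
        "inputs_embeds": np.asarray(first_token_embed).reshape(1, 1, -1),
        # The mask covers the L cached positions plus the new token.
        "attention_mask": np.ones((1, prompt_len + 1), dtype=np.int64),
        # The new token sits at position L (zero-based).
        "position_ids": np.full((1, 1), prompt_len, dtype=np.int64),
    }
    for layer in range(NUM_LAYERS):
        for kind in ("key", "value"):
            # Hypothetical step input name paired with the prefill output name.
            feed[f"past_key_values.{layer}.{kind}"] = prefill_outputs[f"present.{layer}.{kind}"]
    return feed
```

Feeding the quantizer with caches produced by a real prefill pass, rather than random tensors, is what lets the step Hessian reflect the activations seen during autoregressive decode.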
