---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: onnxruntime
tags:
- onnx
- onnxruntime
- onnxruntime-web
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- int4
- int8
- matmulnbits
- gptq
- on-device
- browser
- web
- qwen3
- qwen3-asr
- mega-asr
- transformers.js
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
---
# Mega-ASR: INT4 ONNX (GPTQ-calibrated)

INT4 ONNX export of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR),
a 1.7B-parameter multilingual ASR foundation model built on Qwen3-ASR-1.7B with
2.6M training samples covering noise, far-field speech, obstruction, recording
artifacts, echo, dropout, and transmission dropout.

The model is split into three ONNX files (Whisper-style: audio encoder + LLM
decoder prefill + LLM decoder step) so it can be loaded **directly in the
browser** via [`onnxruntime-web`](https://onnxruntime.ai/docs/api/javascript/) or as
a CPU/GPU service via `onnxruntime`. INT4 weight quantization (MatMulNBits, 4-bit
block-32 asymmetric) compresses the model from ~7.5 GB fp16 down to **~2 GB**
total, small enough for a one-time browser cache.

**Both decoder halves are GPTQ-calibrated** on 168 / 63 English Voices-in-the-Wild
samples (prefill / step respectively). The step model uses past-KV-cache-aware
calibration: prefill output is piped into step, so the calibration captures the
realistic activation distribution of autoregressive decode.
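
To make the "4-bit block-32 asymmetric" scheme mentioned above concrete, here is a minimal NumPy sketch of block-wise asymmetric quantization. It is an illustration only, not the MatMulNBits kernel: the function names are hypothetical, weights are treated as a flat vector, and the 4-bit codes are kept one per byte rather than packed.

```python
import numpy as np


def quantize_blockwise_int4(w: np.ndarray, block_size: int = 32):
    """Asymmetric 4-bit quantization with one (scale, zero_point) per block."""
    assert w.ndim == 1 and w.size % block_size == 0
    blocks = w.astype(np.float32).reshape(-1, block_size)
    # Extend each block's range to include 0 so the zero point fits in 0..15.
    w_min = np.minimum(blocks.min(axis=1, keepdims=True), 0.0)
    w_max = np.maximum(blocks.max(axis=1, keepdims=True), 0.0)
    # 4 bits give 16 levels (0..15); guard against all-zero blocks.
    scale = np.maximum((w_max - w_min) / 15.0, 1e-8)
    zero_point = np.clip(np.round(-w_min / scale), 0, 15)
    q = np.clip(np.round(blocks / scale + zero_point), 0, 15).astype(np.uint8)
    return q, scale, zero_point


def dequantize_blockwise_int4(q, scale, zero_point):
    """Map 4-bit codes back to approximate float weights."""
    return ((q.astype(np.float32) - zero_point) * scale).reshape(-1)


rng = np.random.default_rng(0)
weights = rng.normal(size=128).astype(np.float32)
q, scale, zp = quantize_blockwise_int4(weights)
restored = dequantize_blockwise_int4(q, scale, zp)
max_err = float(np.abs(weights - restored).max())
```

Because every block of 32 weights gets its own scale and zero point, a single outlier only degrades the resolution of its own block, which is the usual motivation for small block sizes.
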
## What's in this repo

| File | Size | Role |
| --- | ---: | --- |
| `onnx/audio_encoder_int4.onnx` (+ `.data`) | **214 MB** | mel features → audio embeddings (24-layer Whisper-style encoder) |
| `onnx/decoder_prefill_int4.onnx` (+ `.data`) | **968 MB** | Qwen3 1.7B decoder, full-length prefill (no KV cache, **GPTQ-calibrated**) |
| `onnx/decoder_step_int4.onnx` (+ `.data`) | **968 MB** | Qwen3 1.7B decoder, single-token step (with KV cache, **GPTQ-calibrated**) |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench) |
| `tokenizer_config.json` / `vocab.json` / `merges.txt` | - | Qwen3 BPE tokenizer assets |
| `preprocessor_config.json` | - | Whisper-style mel feature extractor config |
| `inference.py` | - | Standalone Python ASR pipeline using these ONNX files |
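
The prefill/step split in the table above implies a two-phase decode loop: one prefill call over the whole prompt, then one step call per generated token. The sketch below shows that control flow with plain Python callables standing in for the two ONNX sessions. The function signatures, the toy stand-ins, and the way the cache is passed around are assumptions for illustration; the actual tensor names and the merging of audio embeddings into the prompt are handled in `inference.py` and are not reproduced here.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signatures: each callable returns (next-token logits, KV cache).
PrefillFn = Callable[[Sequence[int]], Tuple[List[float], object]]
StepFn = Callable[[int, object], Tuple[List[float], object]]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest logit."""
    return max(range(len(values)), key=values.__getitem__)


def greedy_decode(prompt_ids: Sequence[int], prefill: PrefillFn, step: StepFn,
                  eos_id: int, max_new_tokens: int = 256) -> List[int]:
    # Prefill: process the whole prompt once, with no cache as input.
    logits, cache = prefill(prompt_ids)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = argmax(logits)
        if token == eos_id:
            break
        generated.append(token)
        # Step: feed only the new token plus the cache from the previous call.
        logits, cache = step(token, cache)
    return generated


# Toy stand-ins (not real models): a 5-token vocabulary that counts down to EOS.
def one_hot(index: int, size: int = 5) -> List[float]:
    return [1.0 if i == index else 0.0 for i in range(size)]


def toy_prefill(ids: Sequence[int]):
    return one_hot(3), len(ids)


def toy_step(token: int, cache: int):
    return one_hot(token - 1), cache + 1


tokens = greedy_decode([4, 4], toy_prefill, toy_step, eos_id=0)
```
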
## Compression vs original

| Component | Original (fp16 PT) | This (INT4 ONNX) | Savings |
| --- | ---: | ---: | ---: |
| Audio encoder | ~635 MB | **214 MB** | 3.0× |
| LLM decoder | ~3.4 GB × 2 (prefill + step) | **968 MB × 2** | 3.5× |
| **Total deploy** | **~7.5 GB** | **~2.0 GB** | **3.7× smaller** |

The decoder is ~95% MatMul ops, so MatMulNBits 4-bit quantization works well.
The audio encoder is mostly Conv2d / Linear in transformer layers: MatMulNBits
quantizes the transformer Linear ops (most of the weight) but leaves the small
Conv2d front-end at fp16.
## Quality

Benchmarked on the 8 [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
example clips (real-world noisy conditions, all English), using word-level
agreement (1 − WER) with the prompt forced to `language English`:

| Variant | Encoder | Decoder | Avg agreement | 100% samples | Total size |
| --- | --- | --- | ---: | ---: | ---: |
| PT bf16 (original) | fp16 | fp16 | 95.1% | 6 / 8 | 7.5 GB |
| ONNX fp16 (ref) | fp32 | fp16 | **96.7%** | 7 / 8 | 8.2 GB |
| **ONNX recommended (GPTQ)** | **INT8** | **INT4 GPTQ** | **92.7%** | **6 / 8** | **2.3 GB** |
| ONNX RTN (previous ship) | INT8 | INT4 RTN | 91.9% | 6 / 8 | 2.3 GB |
| ONNX small | INT4 | INT4 RTN | 87.8% | 6 / 8 | 2.0 GB |

The recommended config (INT8 audio encoder + GPTQ-INT4 LLM decoder) is the
size/quality sweet spot for browser deployment. Forcing the language
(rather than auto-detecting) recovers most of the quantization drift.

GPTQ calibration on both prefill and step yields **+0.8 percentage points over
plain RTN** (92.7% vs 91.9%) at the same model size. The gain is most visible on
the `echo` sample, where the RTN-quantized decoder previously hallucinated
*"the size was fine standing up at the terrible white wall"*; the
GPTQ-quantized decoder produces
*"the size feels fine standing up against terrible white walls"*, recovering
the leading clause exactly.

**Note**: fp16 ONNX actually beats PT bf16 (96.7 vs 95.1) because forcing the
language skips the model's audio-quality-router language detection,
which is where the PT model loses points on `echo` and `recording`
(truncated).
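
For readers who want to reproduce the agreement metric, the sketch below computes word error rate as a word-level Levenshtein distance divided by the reference length, then reports 1 − WER. The text normalization (lowercasing and whitespace splitting) and the clamping at zero are assumptions; the exact normalization used for the numbers above is not stated in this card.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / max(len(ref), 1)


def agreement(reference: str, hypothesis: str) -> float:
    """Word-level agreement, 1 - WER, clamped at 0 when WER exceeds 1."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


# Made-up strings for illustration only; one reference word is dropped.
score = agreement("the cat sat on the mat", "the cat sat on mat")
```
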
## Inference (Python)

```bash
pip install onnxruntime numpy soundfile transformers qwen-asr
git clone https://huggingface.co/Reza2kn/mega-asr-onnx
cd mega-asr-onnx
python inference.py --audio examples/noise.wav
```
## Inference (browser)

A live browser demo (loads these ONNX models directly via `onnxruntime-web`
and WebGPU) is at
[Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench).
The first visit downloads ~2 GB of model weights, cached by the browser for
subsequent runs.
## Performance

| Hardware | Cold (model load) | Warm (3-4 s audio) |
| --- | ---: | ---: |
| RTX 5080 (CUDAExecutionProvider) | ~5 s | ~1.5 s |
| M-series Mac (CPUExecutionProvider) | ~12 s | ~6 s |
| Browser, WebGPU (RTX 5080) | ~10 s + ~1 GB download (cached) | ~3 s |
| Browser, WASM CPU | ~10 s + download | ~30 s |
## Conversion details

- Exported via `torch.onnx.export(..., dynamo=True)` from PyTorch 2.12.
- Audio encoder rewrites: replaced packed-sequence flash-attention with
  standard batched attention + chunked Conv2d (parity cos ≈ 0.998 vs original).
- Decoder uses a single `DecoderForExport` wrapper that accepts a flat tuple
  of KV cache tensors; prefill and step are two specialisations of the same
  Python wrapper exported separately.
- Quantization: `onnxruntime.quantization.matmul_nbits_quantizer.MatMulNBitsQuantizer`
  with `block_size=32`, `is_symmetric=False`, `bits=4`, `algo=GPTQ`.
  Non-MatMul ops (LayerNorm, RMSNorm, residuals, RoPE, the audio-encoder
  Conv2d front-end) stay at fp16.
- KV cache: dynamic past-length axis (`Dim("T_past")`) via dynamo's
  `dynamic_shapes` API.
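
The "flat tuple of KV cache tensors" convention above can be sketched as a pair of helpers that convert between per-layer `(key, value)` pairs and a flat sequence. The layer count of 28 follows from the 56 `present.{0..27}.{key,value}` tensors this card describes; the interleaved key-then-value ordering and the helper names are assumptions, not taken from the `DecoderForExport` source.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

NUM_LAYERS = 28  # 56 present.{0..27}.{key,value} tensors, two per layer


def kv_names(prefix: str, num_layers: int = NUM_LAYERS) -> List[str]:
    """Names such as 'present.0.key', 'present.0.value', ..., in layer order."""
    return [f"{prefix}.{layer}.{kind}"
            for layer in range(num_layers)
            for kind in ("key", "value")]


def flatten_kv(layers: Sequence[Tuple[T, T]]) -> Tuple[T, ...]:
    """[(k0, v0), (k1, v1), ...] -> (k0, v0, k1, v1, ...)."""
    return tuple(tensor for pair in layers for tensor in pair)


def unflatten_kv(flat: Sequence[T]) -> List[Tuple[T, T]]:
    """(k0, v0, k1, v1, ...) -> [(k0, v0), (k1, v1), ...]."""
    assert len(flat) % 2 == 0, "expected one key and one value per layer"
    return [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


names = kv_names("present")
# Strings stand in for tensors to keep the example dependency-free.
layers = [(f"k{i}", f"v{i}") for i in range(NUM_LAYERS)]
flat = flatten_kv(layers)
```

A flat argument list of this kind is a common way to keep an exported graph's inputs and outputs as plain named tensors instead of nested Python containers.
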
### GPTQ calibration

The default ORT GPTQ implementation in
`onnxruntime.quantization.neural_compressor.weight_only` is CPU-only (numpy
matmul + `np.linalg.cholesky` for the Hessian inverse), and runs at ~90 min
for a 1.7B model on a workstation CPU. For this release we ported the
GPTQ inner loop + Hessian accumulation to `torch.cuda` and added a
diagonal-jitter retry on the Cholesky factorisation (fp32 is stricter
than LAPACK on barely-singular Hessians). On an RTX 5080 Laptop the
prefill GPTQ runs at ~99% GPU util and finishes in **~35 min**; the step
GPTQ takes **~3 min** (fewer unique MatMul input names because of GQA
sharing).
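
The diagonal-jitter retry described above can be sketched as follows. This is a NumPy float64 illustration of the idea, not the `torch.cuda` code used for this release: the function name, the starting jitter, the tenfold escalation, and the retry limit are all assumptions. Scaling the jitter by the mean of the Hessian diagonal mirrors the damping term commonly used in GPTQ implementations.

```python
import numpy as np


def cholesky_with_jitter(hessian, initial_jitter=1e-6, max_tries=8):
    """Cholesky factorisation that retries with growing diagonal jitter.

    Returns (lower_triangular_factor, jitter_actually_added).
    """
    h = np.asarray(hessian, dtype=np.float64)
    identity = np.eye(h.shape[0])
    # Scale the jitter relative to the typical diagonal magnitude.
    base = abs(float(np.mean(np.diag(h)))) or 1.0
    jitter = 0.0
    for attempt in range(max_tries):
        try:
            return np.linalg.cholesky(h + jitter * identity), jitter
        except np.linalg.LinAlgError:
            # Not positive definite: increase the jitter tenfold and retry.
            jitter = initial_jitter * base * (10.0 ** attempt)
    raise np.linalg.LinAlgError("matrix is not positive definite even with jitter")


# A barely indefinite 2x2 "Hessian": the plain factorisation fails on it.
hessian = np.array([[1.0, 1.0], [1.0, 1.0 - 1e-9]])
factor, used_jitter = cholesky_with_jitter(hessian)
```

The returned jitter makes it easy to log how often, and by how much, a layer's Hessian had to be regularised.
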
- **Prefill calibration**: 168 samples (24 per noise/far_field/obstructed/
  distortion/recording/echo/dropout split), English-only filter on the
  `text` field, audio decoded via `soundfile` (`Audio(decode=False)`)
  to avoid the `torchcodec` import on streaming `cast_column`.
- **Step calibration**: 63 samples from the same English-only set; each
  sample's calibration feed is built by running the fp16 prefill ONNX,
  capturing all 56 `present.{0..27}.{key,value}` tensors, embedding the
  greedy first predicted token, and pairing it with `attention_mask` of
  length `L + 1` and `position_ids = L`. This gives the step's GPTQ Hessian
  exactly the autoregressive-decode activation distribution it sees at
  inference.
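
The step calibration feed described in the list above can be sketched with NumPy as follows. Only `present.{i}.{key,value}`, `attention_mask`, and `position_ids` come from this card; the `past.*` and `inputs_embeds` input names, the tensor shapes, and the helper name are hypothetical and should be checked against the actual step model's inputs.

```python
import numpy as np


def build_step_feed(present, first_token_embedding, prompt_length):
    """Build one step-model calibration feed from prefill outputs.

    present: dict mapping 'present.{i}.{key,value}' to prefill output arrays.
    first_token_embedding: embedding of the greedy first predicted token.
    prompt_length: L, the number of positions processed by prefill.
    """
    feed = {}
    for name, tensor in present.items():
        # Assumed naming: each 'present.N.*' output feeds a 'past.N.*' input.
        feed[name.replace("present.", "past.", 1)] = tensor
    feed["inputs_embeds"] = first_token_embedding  # hypothetical input name
    # The mask covers the L cached positions plus the one new token.
    feed["attention_mask"] = np.ones((1, prompt_length + 1), dtype=np.int64)
    # The new token sits at zero-based position L.
    feed["position_ids"] = np.array([[prompt_length]], dtype=np.int64)
    return feed


# Toy tensors with made-up shapes: (batch, kv_heads, L, head_dim).
L = 5
present = {
    f"present.{layer}.{kind}": np.zeros((1, 2, L, 4), dtype=np.float16)
    for layer in range(28)
    for kind in ("key", "value")
}
embedding = np.zeros((1, 1, 8), dtype=np.float16)
feed = build_step_feed(present, embedding, prompt_length=L)
```
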
## Credits

- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B params, Apache-2.0)
- ONNX export + GPTQ quantization: this repo
- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
- Live demo: [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench)