Reza2kn/mega-asr-onnx

Reza2kn committed about 15 hours ago

Commit a7f30d3 (verified)

1 Parent(s): cead59c

Add README with full breakdown, tags, base_model_relation: quantized

Files changed (1)

README.md +135 -0

README.md ADDED

	@@ -0,0 +1,135 @@

+---
+license: apache-2.0
+language:
+- en
+- zh
+- ja
+- ko
+- multilingual
+library_name: onnxruntime
+tags:
+- onnx
+- onnxruntime
+- onnxruntime-web
+- automatic-speech-recognition
+- asr
+- speech-recognition
+- robust-asr
+- quantized
+- int4
+- int8
+- matmulnbits
+- on-device
+- browser
+- web
+- qwen3
+- qwen3-asr
+- mega-asr
+- transformers.js
+pipeline_tag: automatic-speech-recognition
+base_model: zhifeixie/Mega-ASR
+base_model_relation: quantized
+---
+# Mega-ASR — INT4 ONNX
+INT4 ONNX export of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR),
+a 1.7B-parameter multilingual ASR foundation model built on Qwen3-ASR-1.7B and
+trained on 2.6M samples covering noise, far-field speech, obstruction, recording
+artifacts, echo, dropout, and transmission dropout.
+The model is split into three ONNX files (Whisper-style: audio encoder + LLM
+decoder prefill + LLM decoder step) so it can be loaded **directly in the
+browser** via [`onnxruntime-web`](https://onnxruntime.ai/docs/api/javascript/) or as
+a CPU/GPU service via `onnxruntime`. INT4 weight quantization (MatMulNBits,
+4-bit, block size 32, symmetric) compresses the model from ~7.5 GB in fp16 to
+**~2 GB** total, small enough to download once and keep in the browser cache.
+## What's in this repo
+| File | Size | Role |
+| --- | ---: | --- |
+| `onnx/audio_encoder_int4.onnx` (+ `.data`) | **214 MB** | mel features → audio embeddings (24-layer Whisper-style encoder) |
+| `onnx/decoder_prefill_int4.onnx` (+ `.data`) | **968 MB** | Qwen3 1.7B decoder, full-length prefill (no KV cache) |
+| `onnx/decoder_step_int4.onnx` (+ `.data`) | **968 MB** | Qwen3 1.7B decoder, single-token step (with KV cache) |
+| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench) |
+| `tokenizer_config.json` / `vocab.json` / `merges.txt` | — | Qwen3 BPE tokenizer assets |
+| `preprocessor_config.json` | — | Whisper-style mel feature extractor config |
+| `inference.py` | — | Standalone Python ASR pipeline using these ONNX files |
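The prefill/step split in the table above implies a two-phase decoding loop: one prefill pass over the whole prompt, then repeated single-token steps that reuse the KV cache. A minimal sketch of that control flow follows. The names `prefill_fn`, `step_fn`, and `eos_id` are hypothetical stand-ins, not the actual tensor names or API of these ONNX files; `inference.py` is the reference for the real pipeline, and greedy selection is an assumption made here for simplicity.

```python
from typing import Callable, List, Sequence, Tuple

Logits = Sequence[float]
Cache = object  # opaque KV cache handle in this sketch


def argmax(logits: Logits) -> int:
    """Index of the largest logit (greedy choice)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def greedy_decode(
    prompt_ids: List[int],
    prefill_fn: Callable[[List[int]], Tuple[Logits, Cache]],
    step_fn: Callable[[int, Cache], Tuple[Logits, Cache]],
    eos_id: int,
    max_new_tokens: int = 256,
) -> List[int]:
    """Run prefill once, then step token by token until EOS or the limit."""
    # Prefill: one pass over the whole prompt, which builds the KV cache.
    logits, cache = prefill_fn(prompt_ids)
    out: List[int] = []
    for _ in range(max_new_tokens):
        token = argmax(logits)
        if token == eos_id:
            break
        out.append(token)
        # Step: feed only the newest token and reuse the cache.
        logits, cache = step_fn(token, cache)
    return out
```

In a real deployment, `prefill_fn` and `step_fn` would wrap the two decoder sessions, and the prompt would include the audio embeddings produced by the encoder.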
+## Compression vs original
+| Component | Original (fp16 PT) | This (INT4 ONNX) | Savings |
+| --- | ---: | ---: | ---: |
+| Audio encoder | ~635 MB | **214 MB** | 3.0× |
+| LLM decoder | ~3.4 GB × 2 (prefill + step) | **968 MB × 2** | 3.5× |
+| **Total deploy** | **~7.5 GB** | **~2.0 GB** | **3.7× smaller** |
+The decoder is ~95% MatMul ops, so MatMulNBits 4-bit quantization compresses it
+well. The audio encoder is mostly Conv2d and transformer Linear layers:
+MatMulNBits quantizes the Linear ops (most of the weights) but leaves the small
+Conv2d front-end at fp16.
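To make the "4-bit, block size 32, symmetric" scheme concrete, here is a simplified pure-Python sketch of block-wise symmetric quantization. It is an illustration only: it assumes a signed code range of [-8, 7] and one fp16 scale per block, and it does not reproduce the exact rounding or packed storage layout that MatMulNBits uses.

```python
from typing import List, Tuple

BLOCK_SIZE = 32      # weights per block, as stated above
QMIN, QMAX = -8, 7   # signed 4-bit range (an assumption of this sketch)


def quantize_block(block: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization: one scale per block, zero maps to zero."""
    max_abs = max(abs(w) for w in block)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [min(QMAX, max(QMIN, round(w / scale))) for w in block]
    return codes, scale


def dequantize_block(codes: List[int], scale: float) -> List[float]:
    """Recover approximate weights from 4-bit codes and the block scale."""
    return [c * scale for c in codes]


def quantize(weights: List[float]) -> List[Tuple[List[int], float]]:
    """Split a flat weight list into blocks and quantize each one."""
    return [
        quantize_block(weights[i:i + BLOCK_SIZE])
        for i in range(0, len(weights), BLOCK_SIZE)
    ]


def bits_per_weight(scale_bits: int = 16) -> float:
    """4 bits per code plus one scale amortised over the block."""
    return 4 + scale_bits / BLOCK_SIZE
```

Under these assumptions the effective cost is 4.5 bits per weight, roughly 3.6× below 16-bit storage, which is in line with the decoder row in the table above.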
+## Quality
+Benchmarked on the 8 [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
+example clips (real-world noisy conditions), word-level agreement (1 − WER):
+| Variant | Average agreement | Clips at 100% |
+| --- | ---: | ---: |
+| PT bf16 (original) | **95.1%** | 6 / 8 |
+| ONNX fp16 (this export) | 87.8% | 6 / 8 |
+| **ONNX INT4 (this repo)** | **87.8%** | 6 / 8 |
+On these clips, INT4 quantization adds **no measurable loss on top of the
+export**: it scores the same as the fp16 ONNX export. The gap to the bf16
+PyTorch model comes from the export itself (numerical drift through dynamo and
+Cache lowering on a few hard samples, `echo` and `recording`), not from
+quantization. Most clean speech transcribes at 100%.
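For reference, word-level agreement (1 − WER) can be computed with a word-level edit distance, as sketched below. The text normalisation used here (lowercasing and whitespace splitting) and the floor at 0 are assumptions of this sketch; the README does not state how transcripts were normalised for the table above.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_agreement(reference: str, hypothesis: str) -> float:
    """1 - WER, floored at 0 because WER can exceed 1."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 1.0 if not hyp else 0.0
    wer = edit_distance(ref, hyp) / len(ref)
    return max(0.0, 1.0 - wer)
```

For example, one substituted word in a six-word reference gives a WER of 1/6 and an agreement of about 83.3%.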
+## Inference (Python)
+```bash
+pip install onnxruntime numpy soundfile transformers qwen-asr
+git clone https://huggingface.co/Reza2kn/mega-asr-onnx
+cd mega-asr-onnx
+python inference.py --audio examples/noise.wav
+```
+## Inference (browser)
+A live browser demo (loads these ONNX models directly via `onnxruntime-web`
+and WebGPU) is at
+[Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench).
+The first visit downloads ~2 GB of model weights, cached by the browser for
+subsequent runs.
+## Performance
+| Hardware | Cold (model load) | Warm (3-4 s audio) |
+| --- | ---: | ---: |
+| RTX 5080 (CUDAExecutionProvider) | ~5 s | ~1.5 s |
+| M-series Mac (CPUExecutionProvider) | ~12 s | ~6 s |
+| Browser, WebGPU (RTX 5080) | ~10 s + ~1 GB download (cached) | ~3 s |
+| Browser, WASM CPU | ~10 s + download | ~30 s |
+## Conversion details
+- Exported via `torch.onnx.export(..., dynamo=True)` with PyTorch 2.12.
+- Audio encoder rewrites: replaced packed-sequence flash-attention with
+  standard batched attention and chunked Conv2d (cosine similarity ≈ 0.998
+  against the original outputs).
+- Decoder uses a single `DecoderForExport` wrapper that accepts a flat tuple
+  of KV cache tensors; prefill and step are two specialisations of the same
+  Python wrapper exported separately.
+- Quantization: `onnxruntime.quantization.matmul_nbits_quantizer.MatMulNBitsQuantizer`
+  with `block_size=32`, `is_symmetric=True`, `bits=4`. Non-MatMul ops (LayerNorm,
+  RMSNorm, residuals, RoPE, the audio-encoder Conv2d front-end) stay at fp16.
+- KV cache: dynamic past-length axis (`Dim("T_past")`) via dynamo's
+  `dynamic_shapes` API.
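The cosine parity figure quoted for the audio encoder rewrite can be checked with a plain cosine similarity between the original and rewritten outputs. The sketch below assumes both outputs have been flattened to equal-length vectors; how the figure in this README was actually computed (per frame, per clip, or over the whole tensor) is not stated.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm
```

A value near 1.0 means the rewritten encoder points in almost the same direction as the original in embedding space; it does not by itself bound the per-element error.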
+## Credits
+- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B params, Apache-2.0)
+- ONNX export + quantization: this repo
+- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
+- Live demo: [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench)