Reza2kn committed · Commit a7f30d3 · verified · 1 Parent(s): cead59c

Add README with full breakdown, tags, base_model_relation: quantized

Files changed (1)
  1. README.md +135 -0
README.md ADDED
@@ -0,0 +1,135 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ - zh
+ - ja
+ - ko
+ - multilingual
+ library_name: onnxruntime
+ tags:
+ - onnx
+ - onnxruntime
+ - onnxruntime-web
+ - automatic-speech-recognition
+ - asr
+ - speech-recognition
+ - robust-asr
+ - quantized
+ - int4
+ - int8
+ - matmulnbits
+ - on-device
+ - browser
+ - web
+ - qwen3
+ - qwen3-asr
+ - mega-asr
+ - transformers.js
+ pipeline_tag: automatic-speech-recognition
+ base_model: zhifeixie/Mega-ASR
+ base_model_relation: quantized
+ ---
+
+ # Mega-ASR: INT4 ONNX
+
+ INT4 ONNX export of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR),
+ a 1.7B-parameter multilingual ASR foundation model built on Qwen3-ASR-1.7B with
+ 2.6M training samples covering noise, far-field speech, obstruction, recording
+ artifacts, echo, dropout, and transmission dropout.
+
+ The model is split into three ONNX files (Whisper-style: audio encoder + LLM
+ decoder prefill + LLM decoder step) so it can be loaded **directly in the
+ browser** via [`onnxruntime-web`](https://onnxruntime.ai/docs/api/javascript/) or as
+ a CPU/GPU service via `onnxruntime`. INT4 weight quantization (MatMulNBits, 4-bit
+ block-32 symmetric) compresses the model from ~7.5 GB fp16 down to **~2 GB**
+ total, small enough for a one-time browser cache.
+
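To make "4-bit block-32 symmetric" concrete, here is a minimal NumPy sketch of block-wise symmetric quantization. It is an illustration under stated assumptions, not the MatMulNBits kernel: the function names are hypothetical, blocks are taken over a flattened weight array rather than in the operator's own layout, and the 4-bit codes are kept in `int8` for readability instead of being packed two per byte.

```python
import numpy as np

def quantize_blockwise_int4(w: np.ndarray, block_size: int = 32):
    """Symmetric 4-bit quantization with one scale per block (illustrative only)."""
    # Assumes w.size is a multiple of block_size.
    flat = w.astype(np.float32).reshape(-1, block_size)
    # Symmetric: the largest magnitude in each block maps to code 7.
    max_abs = np.abs(flat).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0).astype(np.float32)
    q = np.clip(np.round(flat / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_blockwise_int4(q: np.ndarray, scale: np.ndarray, shape):
    """Reconstruct approximate weights from codes and per-block scales."""
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)
q, scale = quantize_blockwise_int4(w)
w_hat = dequantize_blockwise_int4(q, scale, w.shape)
max_err = float(np.abs(w - w_hat).max())
print(f"max abs reconstruction error: {max_err:.4f}")
```

A smaller block size means more scales to store but a tighter error bound per block, since each element's rounding error is at most half of its block's scale.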
+ ## What's in this repo
+
+ | File | Size | Role |
+ | --- | ---: | --- |
+ | `onnx/audio_encoder_int4.onnx` (+ `.data`) | **214 MB** | mel features → audio embeddings (24-layer Whisper-style encoder) |
+ | `onnx/decoder_prefill_int4.onnx` (+ `.data`) | **968 MB** | Qwen3 1.7B decoder, full-length prefill (no KV cache) |
+ | `onnx/decoder_step_int4.onnx` (+ `.data`) | **968 MB** | Qwen3 1.7B decoder, single-token step (with KV cache) |
+ | `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench) |
+ | `tokenizer_config.json` / `vocab.json` / `merges.txt` | - | Qwen3 BPE tokenizer assets |
+ | `preprocessor_config.json` | - | Whisper-style mel feature extractor config |
+ | `inference.py` | - | Standalone Python ASR pipeline using these ONNX files |
+
+ ## Compression vs original
+
+ | Component | Original (fp16 PT) | This (INT4 ONNX) | Savings |
+ | --- | ---: | ---: | ---: |
+ | Audio encoder | ~635 MB | **214 MB** | 3.0× |
+ | LLM decoder | ~3.4 GB × 2 (prefill + step) | **968 MB × 2** | 3.5× |
+ | **Total deploy** | **~7.5 GB** | **~2.0 GB** | **3.7× smaller** |
+
+ The decoder is ~95% MatMul ops, so MatMulNBits 4-bit quantization works well.
+ The audio encoder combines a small Conv2d front-end with Linear ops in its
+ transformer layers: MatMulNBits quantizes the Linear ops (most of the weights)
+ but leaves the Conv2d front-end at fp16.
+
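The decoder's ~3.5× saving can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes one fp16 scale per block of 32 weights and no stored zero points. That storage layout is an assumption, not something this README states, and the helper names are hypothetical. The ratio is an upper bound, since non-MatMul weights stay at fp16.

```python
def bits_per_weight(bits: int = 4, block_size: int = 32, scale_bits: int = 16) -> float:
    """Effective storage per quantized weight, including the per-block scale."""
    return bits + scale_bits / block_size

def compression_ratio(original_bits: int = 16, **kwargs) -> float:
    """Ideal size ratio versus an fp16 baseline if every weight were quantized."""
    return original_bits / bits_per_weight(**kwargs)

bpw = bits_per_weight()        # 4 + 16/32
ratio = compression_ratio()    # 16 / bpw

# Total download, taken from the file table: encoder + prefill + step.
total_mb = 214 + 2 * 968

print(f"{bpw} bits/weight, ideal ratio {ratio:.2f}x, total {total_mb} MB")
```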
+ ## Quality
+
+ Benchmarked on the 8 [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
+ example clips (real-world noisy conditions), word-level agreement (1 − WER):
+
+ | Variant | Average agreement | 100% samples |
+ | --- | ---: | ---: |
+ | PT bf16 (original) | **95.1%** | 6 / 8 |
+ | ONNX fp16 (this export) | 87.8% | 6 / 8 |
+ | **ONNX INT4 (this repo)** | **87.8%** | 6 / 8 |
+
+ On these clips the INT4 quantization is **lossless within the export envelope**:
+ it scores the same as the fp16 ONNX export. The gap to bf16 PT comes from the
+ export itself (numerical drift through dynamo + Cache lowering on a few hard
+ samples, `echo` / `recording`), not from the quantization. Most clean speech
+ transcribes at 100%.
+
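For readers who want to reproduce the metric, here is a minimal sketch of word-level agreement as 1 − WER, using a standard word-level edit distance. The text normalization (lowercasing, whitespace split) and the clamp at zero are assumptions. The README does not say how the reported numbers were normalized, and the function names are hypothetical.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance over word lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def agreement(reference: str, hypothesis: str) -> float:
    """Word-level agreement = 1 - WER, clamped to [0, 1]."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))

# One substituted word out of six reference words.
score = agreement("the cat sat on the mat", "the cat sat on a mat")
print(f"{score:.3f}")
```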
+ ## Inference (Python)
+
+ ```bash
+ pip install onnxruntime numpy soundfile transformers qwen-asr
+ git clone https://huggingface.co/Reza2kn/mega-asr-onnx
+ cd mega-asr-onnx
+ python inference.py --audio examples/noise.wav
+ ```
+
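If you would rather wire the three graphs up yourself than call `inference.py`, the sketch below shows one way to open them with `onnxruntime`. It is not taken from `inference.py`: the helper names are hypothetical, and because the README does not list the graphs' input names, `describe` reads them from the session at runtime rather than guessing.

```python
from pathlib import Path

# Relative paths as listed in the file table of this README.
MODEL_FILES = {
    "encoder": "onnx/audio_encoder_int4.onnx",
    "prefill": "onnx/decoder_prefill_int4.onnx",
    "step": "onnx/decoder_step_int4.onnx",
}

def model_paths(repo_dir: str = ".") -> dict:
    """Resolve the three ONNX files relative to a local clone of the repo."""
    root = Path(repo_dir)
    return {name: root / rel for name, rel in MODEL_FILES.items()}

def load_sessions(repo_dir: str = ".", providers=("CPUExecutionProvider",)) -> dict:
    """Create one InferenceSession per graph (requires onnxruntime and the files)."""
    import onnxruntime as ort  # imported lazily so the path helper works without it
    return {
        name: ort.InferenceSession(str(path), providers=list(providers))
        for name, path in model_paths(repo_dir).items()
    }

def describe(session) -> list:
    """List (name, shape) for each graph input instead of assuming names."""
    return [(inp.name, inp.shape) for inp in session.get_inputs()]

paths = model_paths("mega-asr-onnx")
print(paths["encoder"].as_posix())
```

Pass `providers=("CUDAExecutionProvider", "CPUExecutionProvider")` to `load_sessions` to try the GPU first, assuming a CUDA-enabled build of `onnxruntime` is installed.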
+ ## Inference (browser)
+
+ A live browser demo (loads these ONNX models directly via `onnxruntime-web`
+ and WebGPU) is at
+ [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench).
+ The first visit downloads ~2 GB of model weights, cached by the browser for
+ subsequent runs.
+
+ ## Performance
+
+ | Hardware | Cold (model load) | Warm (3-4 s audio) |
+ | --- | ---: | ---: |
+ | RTX 5080 (CUDAExecutionProvider) | ~5 s | ~1.5 s |
+ | M-series Mac (CPUExecutionProvider) | ~12 s | ~6 s |
+ | Browser, WebGPU (RTX 5080) | ~10 s + ~1 GB download (cached) | ~3 s |
+ | Browser, WASM CPU | ~10 s + download | ~30 s |
+
+ ## Conversion details
+
+ - Exported via `torch.onnx.export(..., dynamo=True)` from PyTorch 2.12.
+ - Audio encoder rewrites: replaced packed-sequence flash-attention with
+   standard batched attention + chunked Conv2d (parity cos ≈ 0.998 vs original).
+ - Decoder uses a single `DecoderForExport` wrapper that accepts a flat tuple
+   of KV cache tensors; prefill and step are two specialisations of the same
+   Python wrapper exported separately.
+ - Quantization: `onnxruntime.quantization.matmul_nbits_quantizer.MatMulNBitsQuantizer`
+   with `block_size=32`, `is_symmetric=True`, `bits=4`. Non-MatMul ops (LayerNorm,
+   RMSNorm, residuals, RoPE, the audio-encoder Conv2d front-end) stay at fp16.
+ - KV cache: dynamic past-length axis (`Dim("T_past")`) via dynamo's
+   `dynamic_shapes` API.
+
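The quantization step in the list above can be sketched as follows. Only the quantizer class and the three keyword arguments come from this README. The function name, file paths, and the choice to save with external data are assumptions, and the sketch needs a recent `onnxruntime` release that ships `matmul_nbits_quantizer`.

```python
# Settings quoted from the conversion notes above.
QUANT_KWARGS = {"block_size": 32, "is_symmetric": True, "bits": 4}

def quantize_to_int4(src_path: str, dst_path: str) -> None:
    """Quantize the MatMul weights of an fp16 ONNX graph to 4 bits (illustrative)."""
    # Imported lazily: both packages are only needed when actually converting.
    import onnx
    from onnxruntime.quantization.matmul_nbits_quantizer import MatMulNBitsQuantizer

    model = onnx.load(src_path)
    quantizer = MatMulNBitsQuantizer(model, **QUANT_KWARGS)
    quantizer.process()  # rewrites eligible MatMul nodes in place
    # Large graphs exceed the 2 GB protobuf limit, so weights go to a side file.
    quantizer.model.save_model_to_file(dst_path, use_external_data_format=True)

# Hypothetical usage; the input file name is not from this repo:
# quantize_to_int4("decoder_step_fp16.onnx", "onnx/decoder_step_int4.onnx")
```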
+ ## Credits
+
+ - Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B params, Apache-2.0)
+ - ONNX export + quantization: this repo
+ - Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
+ - Live demo: [Reza2kn/mega-asr-bench](https://huggingface.co/spaces/Reza2kn/mega-asr-bench)