---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: mlx
tags:
- mlx
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- int4
- 4bit
- mixed-precision
- dwq
- on-device
- apple-silicon
- qwen3
- qwen3-asr
- mega-asr
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
---

# Mega-ASR: MLX 4-bit

MLX deployment of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR),
the 1.7B-parameter robust multilingual ASR foundation model built on Qwen3-ASR-1.7B.

Two LLM variants ship in this repo. The **recommended** one is the mixed-precision
build (8-bit attention + 4-bit MLP layers), which nearly closes the quality gap to
the ONNX GPTQ build (92.2% vs. 92.7% agreement) at the smallest viable size.

## What's in this repo

| File | Size | Role |
| --- | ---: | --- |
| `mlx/llm-mixed8_4/` | **1.5 GB** | **Recommended** Qwen3 LLM, 8-bit attention + 4-bit MLP (5.0 bpw avg) |
| `mlx/llm-dwq4/` | 923 MB | 4-bit DWQ-distilled (smallest, slight quality drop) |
| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, run via onnxruntime). MLX port is on the roadmap. |
| `tokenizer/*` | - | Original Qwen3-ASR tokenizer (with audio special tokens `<\|audio_pad\|>` etc.) |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
| `inference.py` | - | End-to-end ASR pipeline: ONNX encoder + MLX LLM |

## Quality (bench)

8-clip [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
agreement (1 − WER), prompt forced to `language English`:

| Variant | Encoder | LLM | Bpw | Agreement | Total size |
| --- | --- | --- | ---: | ---: | ---: |
| PT bf16 (original) | fp16 | fp16 | 16 | 95.1% | 7.5 GB |
| ONNX recommended (GPTQ) | INT8 ONNX | INT4 GPTQ | ~4.5 | 92.7% | 2.3 GB |
| **MLX recommended (mixed)** | **fp32 ONNX** | **MLX 8/4 mixed** | **5.0** | **92.2%** | **~2.8 GB** |
| MLX 4-bit DWQ | fp32 ONNX | MLX 4-bit DWQ | 4.5 | 89.9% | ~2.2 GB |
| MLX 4-bit (no DWQ) | fp32 ONNX | MLX 4-bit | 4.5 | 89.1% | ~2.2 GB |

The mixed variant gets all 6 "easy" samples perfect and improves the 2 hard
samples (`echo`, `recording`); only the audio-quality-limited tail remains.
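
For readers who want to reproduce the agreement column, here is a minimal sketch of the metric, assuming word-level edit distance and a simple lowercase/whitespace normalization. The benchmark's actual text normalization and whether scores are pooled or averaged per clip are not stated here, so `word_error_rate` and `agreement` are illustrative names rather than the bench's own code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: lowercase + whitespace split is the only normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def agreement(reference: str, hypothesis: str) -> float:
    """Agreement as used in the table above: 1 - WER."""
    return 1.0 - word_error_rate(reference, hypothesis)
```

Because WER can exceed 1 when the hypothesis contains many insertions, agreement computed this way can go negative for a badly hallucinated transcript.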

### Why mixed precision

Pure 4-bit MLX hits a quality wall around 89% because mlx-lm's affine
quantization is naive groupwise (no calibration, no GPTQ-style error
redistribution). Attention layers are the most quality-sensitive in Qwen3.
Keeping them at 8-bit while dropping MLP layers to 4-bit recovers all the
4-bit quality loss at only ~12% more weight memory than uniform 4-bit.

| Variant | Attention | MLP | Bpw | Agreement |
| --- | --- | --- | ---: | ---: |
| pure 4-bit | 4-bit | 4-bit | 4.5 | 89.1% |
| **mixed 8/4** | **8-bit** | **4-bit** | **5.0** | **92.2%** |
| mixed 8/6 | 8-bit | 6-bit | 6.5 | 91.4% |
| 6-bit | 6-bit | 6-bit | 6.5 | 90.7% |
| 8-bit | 8-bit | 8-bit | 8.5 | 92.2% |

The mixed 8/4 build is Pareto-optimal: same quality as full 8-bit at ~60%
of its size, and 2.3 percentage points higher agreement than DWQ-distilled
4-bit. DWQ on plain-text data couldn't bridge the gap because Mega-ASR's
inference distribution (audio embeddings scattered into a text prompt) is
out-of-distribution for the plain-text calibration corpus used with the bf16
teacher.
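
The bits-per-weight (bpw) figures in the table can be sanity-checked with a little arithmetic. The sketch below assumes that affine groupwise quantization stores one 16-bit scale and one 16-bit bias per group of 64 weights, which adds 0.5 bpw on top of the nominal bit width. The attention share it derives is an inference from the rounded table values, not a measured figure.

```python
def affine_bpw(bits: int, group_size: int = 64,
               scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Effective bits per weight for groupwise affine quantization.

    Assumption: each group carries one scale and one bias, both 16-bit.
    """
    return bits + (scale_bits + bias_bits) / group_size


def implied_high_fraction(avg_bpw: float, low_bits: int, high_bits: int,
                          group_size: int = 64) -> float:
    """Share of weights that must sit at `high_bits` to reach `avg_bpw`."""
    low = affine_bpw(low_bits, group_size)
    high = affine_bpw(high_bits, group_size)
    return (avg_bpw - low) / (high - low)


if __name__ == "__main__":
    for bits in (4, 6, 8):
        print(f"{bits}-bit, group 64: {affine_bpw(bits)} bpw")
    # Mixed 8/4 at 5.0 bpw average, taken from the table above.
    share = implied_high_fraction(5.0, low_bits=4, high_bits=8)
    print(f"implied 8-bit share of weights: {share:.3f}")
    print(f"mixed 8/4 vs uniform 8-bit size: {5.0 / affine_bpw(8):.2f}")
```

Under these assumptions the 4.5, 6.5, and 8.5 bpw entries follow directly, and a 5.0 bpw average corresponds to roughly one eighth of the quantized weights being held at 8-bit.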

## Inference

```bash
pip install mlx mlx-lm onnxruntime soundfile transformers librosa numpy
git clone https://huggingface.co/Reza2kn/mega-asr-mlx
cd mega-asr-mlx
python inference.py --encoder-path onnx/audio_encoder_fp32.onnx \
                    --mlx-llm-path mlx/llm-mixed8_4 \
                    --examples-dir examples
```

Pipeline:
1. mel features (Whisper preprocessor) → ONNX audio encoder (onnxruntime CPU) → audio embeddings (1, F, 2048)
2. tokenize the Qwen3-ASR chat prompt with `audio_pad_id=151676`, expand the single placeholder to F copies
3. embed all tokens via `model.model.embed_tokens` (MLX), scatter audio embeddings at the audio_pad positions
4. greedy decode via MLX with `input_embeddings`
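
Steps 2 and 3 can be sketched with plain NumPy. This is an illustration under stated assumptions, not the code in `inference.py`: the helper names `expand_audio_pad` and `scatter_audio` are hypothetical, the batch dimension is dropped for brevity, and the real pipeline performs the embedding lookup in MLX.

```python
import numpy as np

AUDIO_PAD_ID = 151676  # audio_pad_id quoted in the pipeline description


def expand_audio_pad(token_ids, num_frames, pad_id=AUDIO_PAD_ID):
    """Replace each audio placeholder token with `num_frames` copies."""
    expanded = []
    for token in token_ids:
        if token == pad_id:
            expanded.extend([pad_id] * num_frames)
        else:
            expanded.append(token)
    return expanded


def scatter_audio(text_embeds, token_ids, audio_embeds, pad_id=AUDIO_PAD_ID):
    """Overwrite the rows at placeholder positions with audio embeddings.

    text_embeds:  (T, D) token embeddings for the expanded prompt
    audio_embeds: (F, D) encoder output, one row per audio frame
    """
    mask = np.asarray(token_ids) == pad_id
    if int(mask.sum()) != audio_embeds.shape[0]:
        raise ValueError("placeholder count must equal the number of audio frames")
    merged = text_embeds.copy()
    merged[mask] = audio_embeds  # rows are filled in prompt order
    return merged
```

A toy call such as `scatter_audio(np.zeros((5, 4)), expand_audio_pad([1, AUDIO_PAD_ID, 2], 3), np.ones((3, 4)))` returns a `(5, 4)` array whose middle three rows come from the audio encoder and whose first and last rows keep the text embeddings.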

## Conversion details

- LLM extracted from `zhifeixie/Mega-ASR/Qwen3-ASR-1.7B/` by stripping the
  `thinker.model.` prefix from layer weights and dropping the tied `lm_head`
  (relies on `tie_word_embeddings=True`).
- **Mixed-precision quant** via `mlx_lm.utils.quantize_model` with a
  per-layer `quant_predicate`:
  - q_proj / k_proj / v_proj / o_proj → 8-bit
  - gate_proj / up_proj / down_proj → 4-bit
  - group_size=64, mode=affine
- **DWQ variant** via `mlx_lm.quant.dwq --bits 4 --group-size 64
  --num-samples 64 --max-seq-length 256 --learning-rate 1e-6`. 64 distillation
  steps on tulu-3-sft-mixture reduced KL loss from ~0.18 to ~0.14.
- Audio encoder ONNX is reused unchanged from
  [Reza2kn/mega-asr-onnx](https://huggingface.co/Reza2kn/mega-asr-onnx).
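
The per-layer rule above might look something like the following predicate. This is a sketch, not the script used for this repo: the callback signature and accepted return values of `quant_predicate` have varied across mlx-lm releases, so the `(path, module, *unused)` shape and the dict return are assumptions to check against the installed version. The handling of layers outside the two lists is also an assumption, since it is not stated above.

```python
ATTENTION_PROJ = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP_PROJ = ("gate_proj", "up_proj", "down_proj")
GROUP_SIZE = 64


def mixed_8_4_predicate(path, module=None, *unused):
    """Pick quantization settings from the last component of a layer path.

    Assumption: returning a dict overrides the defaults for that layer,
    and returning True keeps whatever defaults the quantizer was given.
    """
    leaf = path.rsplit(".", 1)[-1]
    if leaf in ATTENTION_PROJ:
        return {"group_size": GROUP_SIZE, "bits": 8}
    if leaf in MLP_PROJ:
        return {"group_size": GROUP_SIZE, "bits": 4}
    # Layers not named in the list above (for example embeddings):
    # fall back to the quantizer defaults. This choice is a guess.
    return True
```

Matching on the final path component keeps the rule independent of layer index, so `model.layers.0.self_attn.q_proj` and `model.layers.27.self_attn.q_proj` receive the same settings.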

## Performance

| Hardware | Cold load | Warm (3-4 s audio) |
| --- | ---: | ---: |
| M-series Mac (MLX, mixed8_4) | ~3 s | ~1.5 s (LLM @ ~50 tps) |
| M-series Mac (MLX, dwq4) | ~3 s | ~1.5 s (LLM @ ~60 tps) |

## Credits

- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B params, Apache-2.0)
- MLX port + mixed-precision + DWQ: this repo
- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
- DWQ tool: [`mlx_lm.quant.dwq`](https://github.com/ml-explore/mlx-lm) (Apple Inc.)