---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: mlx
tags:
- mlx
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- int4
- 4bit
- mixed-precision
- dwq
- on-device
- apple-silicon
- qwen3
- qwen3-asr
- mega-asr
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
---
# Mega-ASR: MLX 4-bit
MLX deployment of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR),
the 1.7B-parameter robust multilingual ASR foundation model built on Qwen3-ASR-1.7B.
Two LLM variants ship in this repo. The **recommended** one is the mixed-precision
build (8-bit attention + 4-bit MLP layers), which comes within 0.5 percentage points
of the ONNX GPTQ build (92.2% vs 92.7% agreement) while keeping the LLM at 1.5 GB.
## What's in this repo
| File | Size | Role |
| --- | ---: | --- |
| `mlx/llm-mixed8_4/` | **1.5 GB** | **Recommended** Qwen3 LLM, 8-bit attention + 4-bit MLP (5.0 bpw avg) |
| `mlx/llm-dwq4/` | 923 MB | 4-bit DWQ-distilled (smallest, slight quality drop) |
| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, run via onnxruntime). MLX port is on the roadmap. |
| `tokenizer/*` | n/a | Original Qwen3-ASR tokenizer (with audio special tokens `<\|audio_pad\|>` etc.) |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
| `inference.py` | n/a | End-to-end ASR pipeline: ONNX encoder + MLX LLM |
## Quality (bench)
8-clip [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
agreement (1 - WER), prompt forced to `language English`:
| Variant | Encoder | LLM | Bpw | Agreement | Total size |
| --- | --- | --- | ---: | ---: | ---: |
| PT bf16 (original) | fp16 | fp16 | 16 | 95.1% | 7.5 GB |
| ONNX recommended (GPTQ) | INT8 ONNX | INT4 GPTQ | ~4.5 | 92.7% | 2.3 GB |
| **MLX recommended (mixed)** | **fp32 ONNX** | **MLX 8/4 mixed** | **5.0** | **92.2%** | **~2.8 GB** |
| MLX 4-bit DWQ | fp32 ONNX | MLX 4-bit DWQ | 4.5 | 89.9% | ~2.2 GB |
| MLX 4-bit (no DWQ) | fp32 ONNX | MLX 4-bit | 4.5 | 89.1% | ~2.2 GB |
The mixed variant transcribes all 6 "easy" samples perfectly and improves on the
2 hard samples (`echo`, `recording`); only the audio-quality-limited tail remains.
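For readers who want to reproduce the agreement metric, here is a minimal sketch of word-level WER and the derived agreement score. The text normalization (lowercasing, whitespace splitting) and the clipping at zero are assumptions; the benchmark's own scoring script may normalize or aggregate differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def agreement(reference: str, hypothesis: str) -> float:
    """Agreement as 1 - WER, clipped at 0 (the clipping is an assumption)."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


# One substituted word out of six reference words.
print(agreement("the cat sat on the mat", "the cat sat on a mat"))
```

Whether the table averages per-clip scores or pools errors over all 8 clips is not stated, so treat this as a per-utterance illustration only.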
### Why mixed precision
Pure 4-bit MLX hits a quality wall around 89% because mlx-lm's affine
quantization is naive groupwise (no calibration, no GPTQ-style error
redistribution). Attention layers are the most quality-sensitive in Qwen3:
keeping them at 8-bit while dropping MLP layers to 4-bit recovers all of the
4-bit quality loss (matching uniform 8-bit at 92.2%) at only ~11% more weight
memory than uniform 4-bit (5.0 vs 4.5 bpw).
| Variant | Attention | MLP | Bpw | Agreement |
| --- | --- | --- | ---: | ---: |
| pure 4-bit | 4-bit | 4-bit | 4.5 | 89.1% |
| **mixed 8/4** | **8-bit** | **4-bit** | **5.0** | **92.2%** |
| mixed 8/6 | 8-bit | 6-bit | 6.5 | 91.4% |
| 6-bit | 6-bit | 6-bit | 6.5 | 90.7% |
| 8-bit | 8-bit | 8-bit | 8.5 | 92.2% |
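To make "naive groupwise" concrete, the following is a simplified sketch of affine quantization: each group of 64 weights gets its own scale and bias from the group's min and max, and values are rounded to the nearest level. It uses plain Python lists, and the function names are hypothetical; it is not mlx-lm's actual kernel.

```python
import math


def quantize_group(weights, bits):
    """Affine-quantize one group: integer codes plus a scale and a bias."""
    levels = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    # Fall back to 1.0 for constant groups to avoid dividing by zero.
    scale = (w_max - w_min) / levels or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, bias):
    """Map integer codes back to approximate float weights."""
    return [c * scale + bias for c in codes]


def quantize_row(row, bits, group_size=64):
    """Quantize a row group by group and return the reconstructed values."""
    restored = []
    for start in range(0, len(row), group_size):
        codes, scale, bias = quantize_group(row[start:start + group_size], bits)
        restored.extend(dequantize_group(codes, scale, bias))
    return restored


def max_abs_error(row, bits, group_size=64):
    """Worst-case reconstruction error over the row."""
    restored = quantize_row(row, bits, group_size)
    return max(abs(a - b) for a, b in zip(row, restored))


# Deterministic stand-in for a weight row (not real model weights).
row = [math.sin(0.37 * i) for i in range(256)]
err4 = max_abs_error(row, bits=4)
err8 = max_abs_error(row, bits=8)
print(f"max abs error: 4-bit {err4:.4f}, 8-bit {err8:.4f}")
```

Nothing in this scheme looks at activations or compensates for rounding error in neighbouring weights, which is the contrast the text draws with GPTQ-style methods.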
The mixed 8/4 build is Pareto-optimal among these variants: it matches full
8-bit quality at ~60% of its size and scores 2.3 percentage points higher
agreement than DWQ-distilled 4-bit (92.2% vs 89.9%). DWQ on plain-text data
couldn't bridge the gap because Mega-ASR's inference distribution (audio
embeddings scattered into a text prompt) is out-of-distribution for the
plain-text calibration corpus used to distill from the bf16 teacher.
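The Bpw column and the size ratios quoted above can be checked with a few lines of arithmetic. This sketch assumes each group of 64 weights stores one 16-bit scale and one 16-bit bias next to the quantized values; that assumption reproduces the 4.5 / 6.5 / 8.5 figures in the table, and the 5.0 average for the mixed build is taken from the table rather than derived.

```python
def bits_per_weight(bits: int, group_size: int = 64, overhead_bits: int = 32) -> float:
    """Payload bits plus per-group scale/bias overhead, per weight.

    overhead_bits=32 assumes one 16-bit scale and one 16-bit bias per group.
    """
    return bits + overhead_bits / group_size


for bits in (4, 6, 8):
    print(f"{bits}-bit uniform: {bits_per_weight(bits)} bpw")

MIXED_BPW = 5.0  # average reported in the table for the mixed 8/4 build

# Size of the mixed build relative to uniform 8-bit and uniform 4-bit.
print(f"mixed vs 8-bit: {MIXED_BPW / bits_per_weight(8):.2f}x")
print(f"mixed vs 4-bit: {MIXED_BPW / bits_per_weight(4):.2f}x")
```

These ratios cover quantized weights only; on-disk folder sizes can differ because of unquantized tensors and file metadata.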
## Inference
```bash
pip install mlx mlx-lm onnxruntime soundfile transformers librosa numpy
git clone https://huggingface.co/Reza2kn/mega-asr-mlx
cd mega-asr-mlx
python inference.py --encoder-path onnx/audio_encoder_fp32.onnx \
    --mlx-llm-path mlx/llm-mixed8_4 \
    --examples-dir examples
```
Pipeline:
1. mel features (Whisper preprocessor) → ONNX audio encoder (onnxruntime CPU) → audio embeddings (1, F, 2048)
2. tokenize the Qwen3-ASR chat prompt with `audio_pad_id=151676`, expand the single placeholder to F copies
3. embed all tokens via `model.model.embed_tokens` (MLX), scatter audio embeddings at the audio_pad positions
4. greedy decode via MLX with `input_embeddings`
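Steps 2 and 3 can be illustrated with a small, framework-free sketch. It uses plain Python lists in place of MLX arrays and 2-dimensional toy embeddings in place of the 2048-dimensional ones; the helper names are hypothetical and are not taken from `inference.py`.

```python
AUDIO_PAD_ID = 151676  # audio placeholder token id quoted in step 2


def expand_audio_placeholder(token_ids, num_frames, pad_id=AUDIO_PAD_ID):
    """Replace the single audio placeholder with num_frames copies of itself."""
    if token_ids.count(pad_id) != 1:
        raise ValueError("expected exactly one audio placeholder in the prompt")
    expanded = []
    for tok in token_ids:
        expanded.extend([pad_id] * num_frames if tok == pad_id else [tok])
    return expanded


def scatter_audio_embeddings(token_ids, text_embeddings, audio_embeddings,
                             pad_id=AUDIO_PAD_ID):
    """Overwrite embedding rows at placeholder positions with audio frames, in order."""
    positions = [i for i, tok in enumerate(token_ids) if tok == pad_id]
    if len(positions) != len(audio_embeddings):
        raise ValueError("number of placeholders must equal number of audio frames")
    merged = [list(row) for row in text_embeddings]  # copy; leave the input untouched
    for pos, frame in zip(positions, audio_embeddings):
        merged[pos] = list(frame)
    return merged


# Toy example: a 4-token prompt and F = 3 audio frames.
prompt = [1, 2, AUDIO_PAD_ID, 3]
audio = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]
expanded = expand_audio_placeholder(prompt, num_frames=len(audio))
# Stand-in for the embedding lookup: one toy row per token.
text_emb = [[float(i), float(i)] for i in range(len(expanded))]
merged = scatter_audio_embeddings(expanded, text_emb, audio)
print(len(expanded), merged[2], merged[5])
```

In the real pipeline the merged sequence would then be passed to the LLM as input embeddings instead of token ids.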
## Conversion details
- LLM extracted from `zhifeixie/Mega-ASR/Qwen3-ASR-1.7B/` by stripping the
`thinker.model.` prefix from layer weights and dropping the tied `lm_head`
(relies on `tie_word_embeddings=True`).
- **Mixed-precision quant** via `mlx_lm.utils.quantize_model` with a
  per-layer `quant_predicate`:
  - q_proj / k_proj / v_proj / o_proj → 8-bit
  - gate_proj / up_proj / down_proj → 4-bit
  - group_size=64, mode=affine
- **DWQ variant** via `mlx_lm.quant.dwq --bits 4 --group-size 64
--num-samples 64 --max-seq-length 256 --learning-rate 1e-6`. 64 distillation
steps on tulu-3-sft-mixture reduced KL loss from ~0.18 to ~0.14.
- Audio encoder ONNX is reused unchanged from
[Reza2kn/mega-asr-onnx](https://huggingface.co/Reza2kn/mega-asr-onnx).
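A per-layer predicate matching the mixed-precision bullet above might look like the sketch below. It is a hypothetical reconstruction, not the script used for this repo: the exact callback signature expected by `mlx_lm.utils.quantize_model` varies between mlx-lm versions (hence the tolerant `*unused`), and returning `True` for all other layers (meaning "use the default settings") is an assumption, since the handling of embeddings is not stated.

```python
ATTENTION_PROJ = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP_PROJ = ("gate_proj", "up_proj", "down_proj")


def mixed_8_4_predicate(path, module=None, *unused):
    """Choose quantization parameters from the last component of the module path."""
    leaf = path.rsplit(".", 1)[-1]
    if leaf in ATTENTION_PROJ:
        return {"bits": 8, "group_size": 64}
    if leaf in MLP_PROJ:
        return {"bits": 4, "group_size": 64}
    # Assumption: everything else falls back to the caller's default settings.
    return True


# Example module paths in the usual Qwen-style layout (illustrative only).
print(mixed_8_4_predicate("model.layers.0.self_attn.q_proj"))
print(mixed_8_4_predicate("model.layers.0.mlp.down_proj"))
```

Keying on the leaf name keeps the rule independent of layer index, so the same predicate applies to every transformer block.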
## Performance
| Hardware | Cold load | Warm (3-4 s audio) |
| --- | ---: | ---: |
| M-series Mac (MLX, mixed8_4) | ~3 s | ~1.5 s (LLM @ ~50 tps) |
| M-series Mac (MLX, dwq4) | ~3 s | ~1.5 s (LLM @ ~60 tps) |
## Credits
- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B params, Apache-2.0)
- MLX port + mixed-precision + DWQ: this repo
- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
- DWQ tool: [`mlx_lm.quant.dwq`](https://github.com/ml-explore/mlx-lm) (Apple Inc.)