---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- multilingual
library_name: mlx
tags:
- mlx
- automatic-speech-recognition
- asr
- speech-recognition
- robust-asr
- quantized
- int4
- 4bit
- mixed-precision
- dwq
- on-device
- apple-silicon
- qwen3
- qwen3-asr
- mega-asr
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized
---

# Mega-ASR — MLX 4-bit

MLX deployment of [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR),
the 1.7B-parameter robust multilingual ASR foundation model built on Qwen3-ASR-1.7B.

Two LLM variants ship in this repo. The **recommended** one is the mixed-precision
build (8-bit attention + 4-bit MLP layers), which comes within 0.5 points of the
ONNX GPTQ build (92.2% vs 92.7% agreement) at 5.0 bits per weight.

## What's in this repo

| File | Size | Role |
| --- | ---: | --- |
| `mlx/llm-mixed8_4/` | **1.5 GB** | **Recommended** Qwen3 LLM, 8-bit attention + 4-bit MLP (5.0 bpw avg) |
| `mlx/llm-dwq4/` | 923 MB | 4-bit DWQ-distilled (smallest, slight quality drop) |
| `onnx/audio_encoder_fp32.onnx` | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, run via onnxruntime). MLX port is on the roadmap. |
| `tokenizer/*` | — | Original Qwen3-ASR tokenizer (with audio special tokens `<\|audio_pad\|>` etc.) |
| `examples/*.wav` | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
| `inference.py` | — | End-to-end ASR pipeline: ONNX encoder + MLX LLM |

## Quality (bench)

8-clip [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
agreement (1 − WER), prompt forced to `language English`:

| Variant | Encoder | LLM | Bpw | Agreement | Total size |
| --- | --- | --- | ---: | ---: | ---: |
| PT bf16 (original) | fp16 | fp16 | 16 | 95.1% | 7.5 GB |
| ONNX recommended (GPTQ) | INT8 ONNX | INT4 GPTQ | ~4.5 | 92.7% | 2.3 GB |
| **MLX recommended (mixed)** | **fp32 ONNX** | **MLX 8/4 mixed** | **5.0** | **92.2%** | **~2.8 GB** |
| MLX 4-bit DWQ | fp32 ONNX | MLX 4-bit DWQ | 4.5 | 89.9% | ~2.2 GB |
| MLX 4-bit (no DWQ) | fp32 ONNX | MLX 4-bit | 4.5 | 89.1% | ~2.2 GB |

The mixed variant gets all 6 "easy" samples perfect and improves the 2 hard
samples (`echo`, `recording`) — only the audio-quality-limited tail remains.
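
For reference, the agreement metric (1 − WER) can be sketched as a word-level edit distance. This is a minimal illustration, not the benchmark's scoring script: the text normalization used here (lowercasing and whitespace splitting) is an assumption, and the function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase + whitespace split.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] = edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def agreement(reference: str, hypothesis: str) -> float:
    """Agreement as used in the tables: 1 - WER."""
    return 1.0 - word_error_rate(reference, hypothesis)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    score = agreement("the cat sat on the mat", "the cat sat on a mat")
    print(f"{score:.3f}")  # 0.833
```

Note that WER can exceed 1 when the hypothesis contains many insertions, so agreement computed this way can go negative on a badly hallucinated transcript.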

### Why mixed precision

Pure 4-bit MLX hits a quality wall around 89% because mlx-lm's affine
quantization is naive groupwise (no calibration, no GPTQ-style error
redistribution). Attention layers are the most quality-sensitive in Qwen3:
keeping them at 8-bit while dropping MLP layers to 4-bit recovers all of the
agreement lost between uniform 8-bit and uniform 4-bit, at roughly 11% more
weight memory than uniform 4-bit (5.0 vs 4.5 bpw).

| Variant | Attention | MLP | Bpw | Agreement |
| --- | --- | --- | ---: | ---: |
| pure 4-bit | 4-bit | 4-bit | 4.5 | 89.1% |
| **mixed 8/4** | **8-bit** | **4-bit** | **5.0** | **92.2%** |
| mixed 8/6 | 8-bit | 6-bit | 6.5 | 91.4% |
| 6-bit | 6-bit | 6-bit | 6.5 | 90.7% |
| 8-bit | 8-bit | 8-bit | 8.5 | 92.2% |

The mixed 8/4 build is Pareto-optimal among these variants: it matches full
8-bit agreement (92.2%) at ~60% of its size and scores 2.3 percentage points
above DWQ-distilled 4-bit (89.9%). DWQ on plain-text data couldn't bridge the
gap because Mega-ASR's inference inputs (audio embeddings scattered into a
text prompt) are out-of-distribution for the plain-text calibration corpus
used with the bf16 teacher.
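
The fractional Bpw values in the tables (4.5, 6.5, 8.5) are consistent with groupwise affine quantization storing one scale and one bias per group. The sketch below reproduces them under the assumption that both are 16-bit values and the group size is 64; `effective_bpw` is a hypothetical helper, not an mlx-lm function.

```python
def effective_bpw(bits: int, group_size: int = 64, overhead_bits: int = 32) -> float:
    """Bits per weight including per-group metadata.

    Assumption: each group of `group_size` weights carries a 16-bit scale
    and a 16-bit bias, i.e. 32 bits of overhead per group.
    """
    return bits + overhead_bits / group_size


if __name__ == "__main__":
    for bits in (4, 6, 8):
        print(bits, effective_bpw(bits))  # 4.5, 6.5, 8.5

    mixed_bpw = 5.0  # average reported for the mixed 8/4 build
    # Size of mixed 8/4 relative to uniform 8-bit and uniform 4-bit.
    print(f"{mixed_bpw / effective_bpw(8):.0%} of uniform 8-bit")
    print(f"{mixed_bpw / effective_bpw(4) - 1:.0%} larger than uniform 4-bit")
```

The 5.0 average itself depends on how the weights split between 8-bit and 4-bit layers, which this sketch takes from the table rather than deriving.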

## Inference

```bash
pip install mlx mlx-lm onnxruntime soundfile transformers librosa numpy
git clone https://huggingface.co/Reza2kn/mega-asr-mlx
cd mega-asr-mlx
python inference.py --encoder-path onnx/audio_encoder_fp32.onnx \
                    --mlx-llm-path mlx/llm-mixed8_4 \
                    --examples-dir examples
```

Pipeline:
1. mel features (Whisper preprocessor) → ONNX audio encoder (onnxruntime CPU) → audio embeddings (1, F, 2048)
2. tokenize the Qwen3-ASR chat prompt with `audio_pad_id=151676`, expand the single placeholder to F copies
3. embed all tokens via `model.model.embed_tokens` (MLX), scatter audio embeddings at the audio_pad positions
4. greedy decode via MLX with `input_embeddings`

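Steps 2 and 3 above can be sketched with NumPy as follows. This is an illustration of the placeholder expansion and scatter logic only, not the code in `inference.py`: the helper names are hypothetical, the token ids other than `151676` are made up, and the tiny embedding width stands in for the real 2048.

```python
import numpy as np

AUDIO_PAD_ID = 151676  # audio placeholder token id from the pipeline description


def expand_audio_placeholder(token_ids, n_frames, pad_id=AUDIO_PAD_ID):
    """Replace each placeholder token with n_frames copies of itself."""
    expanded = []
    for token in token_ids:
        expanded.extend([pad_id] * n_frames if token == pad_id else [token])
    return expanded


def scatter_audio_embeddings(text_embeds, token_ids, audio_embeds, pad_id=AUDIO_PAD_ID):
    """Overwrite the rows at placeholder positions with the audio embeddings.

    text_embeds:  (T, D) embeddings of the expanded prompt
    audio_embeds: (F, D) encoder output for F audio frames
    """
    mask = np.asarray(token_ids) == pad_id
    if int(mask.sum()) != audio_embeds.shape[0]:
        raise ValueError("placeholder count does not match number of audio frames")
    merged = text_embeds.copy()
    merged[mask] = audio_embeds  # rows are filled in frame order
    return merged


if __name__ == "__main__":
    prompt = [11, 12, AUDIO_PAD_ID, 13]          # hypothetical prompt token ids
    audio = np.ones((3, 4), dtype=np.float32)    # F=3 frames, toy width D=4
    ids = expand_audio_placeholder(prompt, n_frames=audio.shape[0])
    text = np.zeros((len(ids), 4), dtype=np.float32)  # stand-in for embed_tokens output
    merged = scatter_audio_embeddings(text, ids, audio)
    print(ids)
    print(merged.shape)
```

In the real pipeline the merged array would be converted to an MLX array and passed to the decoder as `input_embeddings`, as described in step 4.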
## Conversion details

- LLM extracted from `zhifeixie/Mega-ASR/Qwen3-ASR-1.7B/` by stripping the
  `thinker.model.` prefix from layer weights and dropping the tied `lm_head`
  (relies on `tie_word_embeddings=True`).
- **Mixed-precision quant** via `mlx_lm.utils.quantize_model` with a
  per-layer `quant_predicate`:
  - q_proj / k_proj / v_proj / o_proj → 8-bit
  - gate_proj / up_proj / down_proj → 4-bit
  - group_size=64, mode=affine
- **DWQ variant** via `mlx_lm.quant.dwq --bits 4 --group-size 64
  --num-samples 64 --max-seq-length 256 --learning-rate 1e-6`. 64 distillation
  steps on tulu-3-sft-mixture reduced KL loss from ~0.18 to ~0.14.
- Audio encoder ONNX is reused unchanged from
  [Reza2kn/mega-asr-onnx](https://huggingface.co/Reza2kn/mega-asr-onnx).
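
The per-layer rule from the mixed-precision bullet could be written as a predicate like the one below. This is a sketch rather than the script used for this repo: the function name is hypothetical, the exact `quant_predicate` signature varies between mlx-lm versions (some pass a third `config` argument, which is accepted and ignored here), and the fallback for other layers such as embeddings is an assumption since the conversion notes do not state how they were handled.

```python
ATTENTION_PROJS = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP_PROJS = ("gate_proj", "up_proj", "down_proj")


def mixed_8_4_predicate(path, module, config=None):
    """Choose quantization parameters from a module's dotted path.

    Returns a dict of per-layer parameters for the projections named in the
    conversion notes, or True to fall back to the caller's defaults.
    """
    leaf = path.rsplit(".", 1)[-1]
    if leaf in ATTENTION_PROJS:
        return {"group_size": 64, "bits": 8}
    if leaf in MLP_PROJS:
        return {"group_size": 64, "bits": 4}
    # Assumption: everything else that is quantizable uses the default settings.
    return True


if __name__ == "__main__":
    for name in (
        "model.layers.0.self_attn.q_proj",
        "model.layers.0.mlp.down_proj",
        "model.embed_tokens",
    ):
        print(name, mixed_8_4_predicate(name, module=None))
```

A function of this shape would be passed as the `quant_predicate` argument of `mlx_lm.utils.quantize_model`; check the installed mlx-lm version for the exact keyword names before relying on it.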

## Performance

| Hardware | Cold load | Warm (3-4 s audio) |
| --- | ---: | ---: |
| M-series Mac (MLX, mixed8_4) | ~3 s | ~1.5 s (LLM @ ~50 tps) |
| M-series Mac (MLX, dwq4) | ~3 s | ~1.5 s (LLM @ ~60 tps) |

## Credits

- Original model: [zhifeixie/Mega-ASR](https://huggingface.co/zhifeixie/Mega-ASR) (1.7B params, Apache-2.0)
- MLX port + mixed-precision + DWQ: this repo
- Benchmark: [Voices-in-the-Wild-Bench](https://github.com/xzf-thu/Voices-in-the-Wild-Bench)
- DWQ tool: [`mlx_lm.quant.dwq`](https://github.com/ml-explore/mlx-lm) (Apple Inc.)