license: apache-2.0
language:
  - en
  - zh
  - ja
  - ko
  - multilingual
library_name: mlx
tags:
  - mlx
  - automatic-speech-recognition
  - asr
  - speech-recognition
  - robust-asr
  - quantized
  - int4
  - 4bit
  - mixed-precision
  - dwq
  - on-device
  - apple-silicon
  - qwen3
  - qwen3-asr
  - mega-asr
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized

Mega-ASR — MLX 4-bit

MLX deployment of zhifeixie/Mega-ASR, the 1.7B-parameter robust multilingual ASR foundation model built on Qwen3-ASR-1.7B.

Two LLM variants ship in this repo. The recommended one is the mixed-precision build (8-bit attention + 4-bit MLP layers), which comes within 0.5 percentage points of the ONNX GPTQ build (92.2% vs. 92.7% agreement) at 5.0 bits per weight.

What's in this repo

File	Size	Role
`mlx/llm-mixed8_4/`	1.5 GB	Recommended Qwen3 LLM, 8-bit attention + 4-bit MLP (5.0 bpw avg)
`mlx/llm-dwq4/`	923 MB	4-bit DWQ-distilled (smallest, slight quality drop)
`onnx/audio_encoder_fp32.onnx`	1.27 GB	24-layer Whisper-style audio encoder (ONNX fp32, run via onnxruntime). MLX port is on the roadmap.
`tokenizer/*`	—	Original Qwen3-ASR tokenizer (with audio special tokens such as `<|audio_pad|>`)
`examples/*.wav`	~3 MB	8 noisy benchmark clips from Voices-in-the-Wild-Bench
`inference.py`	—	End-to-end ASR pipeline: ONNX encoder + MLX LLM

Quality (bench)

Agreement (1 − WER) on 8 clips from Voices-in-the-Wild-Bench, with the prompt language forced to English:

Variant	Encoder	LLM	Bpw	Agreement	Total size
PT bf16 (original)	bf16	bf16	16	95.1%	7.5 GB
ONNX recommended (GPTQ)	INT8 ONNX	INT4 GPTQ	~4.5	92.7%	2.3 GB
MLX recommended (mixed)	fp32 ONNX	MLX 8/4 mixed	5.0	92.2%	~2.8 GB
MLX 4-bit DWQ	fp32 ONNX	MLX 4-bit DWQ	4.5	89.9%	~2.2 GB
MLX 4-bit (no DWQ)	fp32 ONNX	MLX 4-bit	4.5	89.1%	~2.2 GB

The mixed variant transcribes all 6 "easy" samples perfectly and improves on the 2 hard samples (echo, recording); the errors that remain are limited by audio quality.
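
For readers who want to reproduce the metric, here is a minimal sketch of agreement as 1 − WER using a word-level edit distance. The text normalization (lowercasing, whitespace split) and scoring one clip at a time are assumptions; the benchmark's own normalization and averaging are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: lowercase + whitespace split stands in for the real normalizer.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def agreement(reference: str, hypothesis: str) -> float:
    """Agreement as used in the table: 1 - WER."""
    return 1.0 - word_error_rate(reference, hypothesis)


# One substituted word out of six reference words.
print(f"{agreement('the cat sat on the mat', 'the cat sat on a mat'):.3f}")  # 0.833
```

Note that WER can exceed 1 when the hypothesis has many insertions, so agreement computed this way can go negative on very poor transcripts.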

Why mixed precision

Pure 4-bit MLX hits a quality wall around 89% because mlx-lm's affine quantization is naive groupwise (no calibration, no GPTQ-style error redistribution). Attention layers are the most quality-sensitive in Qwen3: keeping them at 8-bit while dropping MLP layers to 4-bit matches uniform 8-bit agreement (92.2%, up from 89.1% for pure 4-bit) at only ~11% more bits per weight than uniform 4-bit (5.0 vs. 4.5 bpw).
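
To make "naive groupwise" concrete, the following NumPy sketch quantizes each group of 64 weights independently with its own scale and offset, using plain rounding and no calibration data. It is a generic illustration of the scheme, not MLX's kernel; the function names are hypothetical and MLX's exact choice of scale and bias may differ.

```python
import numpy as np


def quantize_groupwise(w, bits=4, group_size=64):
    """Affine quantization with one scale and one offset per group of weights."""
    groups = w.reshape(-1, group_size).astype(np.float32)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    levels = 2**bits - 1
    scale = (hi - lo) / levels
    scale = np.where(scale == 0, 1.0, scale)  # constant group: avoid divide-by-zero
    # Each weight is rounded on its own; error is never pushed onto neighbours,
    # which is the step GPTQ-style methods add.
    q = np.clip(np.round((groups - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale, lo


def dequantize_groupwise(q, scale, lo, shape):
    """Map integer codes back to floats and restore the original shape."""
    return (q * scale + lo).reshape(shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((128, 256)).astype(np.float32)  # toy weight matrix
for bits in (4, 8):
    q, scale, lo = quantize_groupwise(w, bits=bits)
    err = np.abs(dequantize_groupwise(q, scale, lo, w.shape) - w).max()
    print(f"{bits}-bit max abs reconstruction error: {err:.4f}")
```

On random weights the 8-bit error is roughly 16 times smaller than the 4-bit error, since each extra bit halves the step size within a group.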

Variant	Attention	MLP	Bpw	Agreement
pure 4-bit	4-bit	4-bit	4.5	89.1%
mixed 8/4	8-bit	4-bit	5.0	92.2%
mixed 8/6	8-bit	6-bit	6.5	91.4%
6-bit	6-bit	6-bit	6.5	90.7%
8-bit	8-bit	8-bit	8.5	92.2%

The mixed 8/4 build is Pareto-optimal among the MLX variants tested: it matches uniform 8-bit agreement (92.2%) at ~60% of the bits per weight (5.0 vs. 8.5) and scores 2.3 percentage points higher than DWQ-distilled 4-bit (89.9%). DWQ on plain-text data likely couldn't bridge the gap because at inference time Mega-ASR sees audio embeddings scattered into a text prompt, which is out-of-distribution relative to the plain-text calibration corpus used to distill from the bf16 teacher.
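
The uniform bits-per-weight figures in the table can be checked with a short calculation. The sketch below assumes each group of 64 weights stores one 16-bit scale and one 16-bit bias alongside the packed integers; that storage layout is an assumption, chosen because it reproduces the 4.5 / 6.5 / 8.5 values. The 5.0 bpw average for the mixed build is taken from the table, not derived.

```python
def affine_bpw(bits, group_size=64, scale_bits=16, bias_bits=16):
    """Effective bits per weight: payload bits plus per-group metadata overhead."""
    return bits + (scale_bits + bias_bits) / group_size


uniform = {bits: affine_bpw(bits) for bits in (4, 6, 8)}
print(uniform)  # {4: 4.5, 6: 6.5, 8: 8.5}

mixed_8_4 = 5.0  # average reported in the table above (not computed here)
print(f"mixed vs. uniform 4-bit: {mixed_8_4 / uniform[4]:.2f}x")  # 1.11x
print(f"mixed vs. uniform 8-bit: {mixed_8_4 / uniform[8]:.2f}x")  # 0.59x
```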

Inference

pip install mlx mlx-lm onnxruntime soundfile transformers librosa numpy
git clone https://huggingface.co/Reza2kn/mega-asr-mlx
cd mega-asr-mlx
python inference.py --encoder-path onnx/audio_encoder_fp32.onnx \
                    --mlx-llm-path mlx/llm-mixed8_4 \
                    --examples-dir examples

Pipeline:

1. mel features (Whisper preprocessor) → ONNX audio encoder (onnxruntime CPU) → audio embeddings (1, F, 2048)
2. tokenize the Qwen3-ASR chat prompt with audio_pad_id=151676, expand the single placeholder to F copies
3. embed all tokens via model.model.embed_tokens (MLX), scatter audio embeddings at the audio_pad positions
4. greedy decode via MLX with input_embeddings
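
The placeholder expansion and scatter steps can be sketched as follows. This uses NumPy with toy arrays in place of the real tokenizer, encoder output, and embed_tokens; the helper names and the example token ids other than 151676 are hypothetical, and inference.py may implement these steps differently.

```python
import numpy as np

AUDIO_PAD_ID = 151676  # audio_pad_id from the pipeline description
HIDDEN = 2048          # embedding width from the (1, F, 2048) encoder output


def expand_audio_placeholder(token_ids, num_frames):
    """Replace the single audio placeholder with num_frames copies."""
    if token_ids.count(AUDIO_PAD_ID) != 1:
        raise ValueError("expected exactly one audio placeholder in the prompt")
    i = token_ids.index(AUDIO_PAD_ID)
    return token_ids[:i] + [AUDIO_PAD_ID] * num_frames + token_ids[i + 1:]


def scatter_audio(text_embeds, token_ids, audio_embeds):
    """Overwrite the rows at placeholder positions with the audio embeddings."""
    mask = np.asarray(token_ids) == AUDIO_PAD_ID
    if mask.sum() != audio_embeds.shape[0]:
        raise ValueError("placeholder count does not match number of audio frames")
    merged = text_embeds.copy()
    merged[mask] = audio_embeds
    return merged


# Toy stand-ins: a 4-token prompt and F = 5 audio frames (batch dim dropped).
prompt_ids = [11, 22, AUDIO_PAD_ID, 33]          # hypothetical token ids
audio = np.ones((5, HIDDEN), dtype=np.float32)   # stand-in for encoder output
ids = expand_audio_placeholder(prompt_ids, audio.shape[0])
text = np.zeros((len(ids), HIDDEN), dtype=np.float32)  # stand-in for embed_tokens
merged = scatter_audio(text, ids, audio)
print(merged.shape)  # (8, 2048)
```

The merged array is what the decoder would consume as input embeddings: text rows stay untouched and the F placeholder rows carry the audio frames in order.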

Conversion details

LLM extracted from zhifeixie/Mega-ASR/Qwen3-ASR-1.7B/ by stripping the thinker.model. prefix from layer weights and dropping the tied lm_head (relies on tie_word_embeddings=True).
Mixed-precision quant via mlx_lm.utils.quantize_model with a per-layer quant_predicate:
- q_proj / k_proj / v_proj / o_proj → 8-bit
- gate_proj / up_proj / down_proj → 4-bit
- group_size=64, mode=affine
DWQ variant via mlx_lm.quant.dwq --bits 4 --group-size 64 --num-samples 64 --max-seq-length 256 --learning-rate 1e-6. 64 distillation steps on tulu-3-sft-mixture reduced KL loss from ~0.18 to ~0.14.
Audio encoder ONNX is reused unchanged from Reza2kn/mega-asr-onnx.
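
The extraction and per-layer bit assignment above can be sketched in plain Python. Both helpers are hypothetical illustrations: the exact checkpoint key names (including how lm_head appears) are assumptions, and the predicate's return value follows the convention of returning either a dict of quantization parameters or True for the caller's defaults. How layers outside attention and MLP were treated in this repo is not stated, so the sketch falls back to the defaults for them.

```python
ATTENTION_PROJS = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP_PROJS = ("gate_proj", "up_proj", "down_proj")


def remap_llm_weights(weights, prefix="thinker.model.", head_key="lm_head"):
    """Keep only LLM weights, strip the prefix, and drop the tied output head."""
    remapped = {}
    for name, tensor in weights.items():
        if head_key in name:
            # Assumption: the head is rebuilt from embed_tokens
            # because tie_word_embeddings=True.
            continue
        if name.startswith(prefix):
            remapped[name[len(prefix):]] = tensor
    return remapped


def mixed_8_4_predicate(path, module=None, *_unused):
    """Per-layer quantization settings keyed on the last component of the path.

    Extra positional arguments are accepted and ignored so the function can be
    called with or without a trailing config argument.
    """
    leaf = path.rsplit(".", 1)[-1]
    if leaf in ATTENTION_PROJS:
        return {"group_size": 64, "bits": 8}
    if leaf in MLP_PROJS:
        return {"group_size": 64, "bits": 4}
    return True  # assumption: everything else uses the caller's default settings


for path in ("model.layers.0.self_attn.q_proj", "model.layers.0.mlp.up_proj"):
    print(path, mixed_8_4_predicate(path))
```

Under these assumptions, the predicate would be passed as the quant_predicate argument mentioned above, with the default bit width covering any module it does not name.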

Performance

Hardware	Cold load	Warm (3-4 s audio)
M-series Mac (MLX, mixed8_4)	~3 s	~1.5 s (LLM @ ~50 tps)
M-series Mac (MLX, dwq4)	~3 s	~1.5 s (LLM @ ~60 tps)

Credits

Original model: zhifeixie/Mega-ASR (1.7B params, Apache-2.0)
MLX port + mixed-precision + DWQ: this repo
Benchmark: Voices-in-the-Wild-Bench
DWQ tool: mlx_lm.quant.dwq (Apple Inc.)