metadata
license: apache-2.0
language:
  - en
  - zh
  - ja
  - ko
  - multilingual
library_name: mlx
tags:
  - mlx
  - automatic-speech-recognition
  - asr
  - speech-recognition
  - robust-asr
  - quantized
  - int4
  - 4bit
  - mixed-precision
  - dwq
  - on-device
  - apple-silicon
  - qwen3
  - qwen3-asr
  - mega-asr
pipeline_tag: automatic-speech-recognition
base_model: zhifeixie/Mega-ASR
base_model_relation: quantized

Mega-ASR: MLX 4-bit

MLX deployment of zhifeixie/Mega-ASR, the 1.7B-parameter robust multilingual ASR foundation model built on Qwen3-ASR-1.7B.

Two LLM variants ship in this repo. The recommended one is the mixed-precision build (8-bit attention + 4-bit MLP layers), which comes within 0.5 percentage points of the ONNX GPTQ build (92.2% vs 92.7% agreement) at 1.5 GB for the LLM weights.

What's in this repo

| File | Size | Role |
|---|---|---|
| mlx/llm-mixed8_4/ | 1.5 GB | Recommended Qwen3 LLM, 8-bit attention + 4-bit MLP (5.0 bpw avg) |
| mlx/llm-dwq4/ | 923 MB | 4-bit DWQ-distilled (smallest, slight quality drop) |
| onnx/audio_encoder_fp32.onnx | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, run via onnxruntime). MLX port is on the roadmap. |
| tokenizer/* | n/a | Original Qwen3-ASR tokenizer (with audio special tokens `<\|audio_pad\|>` etc.) |
| examples/*.wav | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
| inference.py | n/a | End-to-end ASR pipeline: ONNX encoder + MLX LLM |

Quality (bench)

Agreement (1 − WER) on 8 clips from Voices-in-the-Wild-Bench, with the prompt language forced to English:

| Variant | Encoder | LLM | Bpw | Agreement | Total size |
|---|---|---|---|---|---|
| PT bf16 (original) | fp16 | fp16 | 16 | 95.1% | 7.5 GB |
| ONNX recommended (GPTQ) | INT8 ONNX | INT4 GPTQ | ~4.5 | 92.7% | 2.3 GB |
| MLX recommended (mixed) | fp32 ONNX | MLX 8/4 mixed | 5.0 | 92.2% | ~2.8 GB |
| MLX 4-bit DWQ | fp32 ONNX | MLX 4-bit DWQ | 4.5 | 89.9% | ~2.2 GB |
| MLX 4-bit (no DWQ) | fp32 ONNX | MLX 4-bit | 4.5 | 89.1% | ~2.2 GB |

The mixed variant transcribes all 6 "easy" samples perfectly and improves the results on the 2 hard samples (echo, recording); only the audio-quality-limited tail remains.
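
The agreement metric above is 1 − WER. As a rough illustration, here is a minimal sketch of a word-level WER computed with edit distance. The text normalization (lowercasing, whitespace splitting) and the function names are assumptions for this example; the benchmark's own normalization and its way of aggregating over the 8 clips are not specified in this README.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    # Guard against an empty reference to avoid division by zero.
    return prev[-1] / max(len(ref), 1)


def agreement(reference: str, hypothesis: str) -> float:
    """Agreement as used in the table: 1 - WER."""
    return 1.0 - word_error_rate(reference, hypothesis)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(agreement("the cat sat on the mat", "the cat sat on mat"))
```

Because WER can exceed 1 when the hypothesis contains many insertions, agreement computed this way can be negative for a single clip.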

Why mixed precision

Pure 4-bit MLX hits a quality wall around 89% because mlx-lm's affine quantization is naive groupwise rounding (no calibration, no GPTQ-style error redistribution). Attention layers are the most quality-sensitive in Qwen3: keeping them at 8-bit while dropping MLP layers to 4-bit matches uniform 8-bit agreement (92.2%) at only ~11% more weight memory than uniform 4-bit (5.0 vs 4.5 bpw).

| Variant | Attention | MLP | Bpw | Agreement |
|---|---|---|---|---|
| pure 4-bit | 4-bit | 4-bit | 4.5 | 89.1% |
| mixed 8/4 | 8-bit | 4-bit | 5.0 | 92.2% |
| mixed 8/6 | 8-bit | 6-bit | 6.5 | 91.4% |
| 6-bit | 6-bit | 6-bit | 6.5 | 90.7% |
| 8-bit | 8-bit | 8-bit | 8.5 | 92.2% |

Among the variants tested, the mixed 8/4 build is Pareto-optimal: it matches full 8-bit quality at ~60% of its size and scores 2.3 percentage points higher agreement than DWQ-distilled 4-bit. DWQ on plain-text data couldn't bridge the gap because Mega-ASR's inference distribution (audio embeddings scattered into a text prompt) is out-of-distribution for the plain-text calibration corpus used to distill from the bf16 teacher.
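
The bits-per-weight (bpw) figures in the table can be reproduced with a little arithmetic. The sketch below assumes that groupwise affine quantization stores one 16-bit scale and one 16-bit bias per group of 64 weights, which is consistent with the 4.5 / 6.5 / 8.5 values reported above; the helper name `affine_bpw` is made up for this example, and the 5.0 bpw average for the mixed build is taken from the table as given.

```python
def affine_bpw(bits: int, group_size: int = 64,
               scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Effective bits per weight for groupwise affine quantization.

    Assumption: each group of `group_size` weights carries one scale
    and one bias, so their storage cost is amortized over the group.
    """
    return bits + (scale_bits + bias_bits) / group_size


BPW_4BIT = affine_bpw(4)  # 4 + 32/64
BPW_6BIT = affine_bpw(6)  # 6 + 32/64
BPW_8BIT = affine_bpw(8)  # 8 + 32/64
BPW_MIXED = 5.0           # reported average for the mixed 8/4 build

# Size of the mixed build relative to uniform 8-bit.
mixed_vs_8bit = BPW_MIXED / BPW_8BIT

# Extra weight memory of the mixed build relative to uniform 4-bit.
mixed_over_4bit = BPW_MIXED / BPW_4BIT - 1.0

if __name__ == "__main__":
    print(f"4-bit: {BPW_4BIT} bpw, 6-bit: {BPW_6BIT} bpw, 8-bit: {BPW_8BIT} bpw")
    print(f"mixed 8/4 vs 8-bit size: {mixed_vs_8bit:.1%}")
    print(f"mixed 8/4 overhead over 4-bit: {mixed_over_4bit:.1%}")
```

These ratios cover quantized LLM weights only; the fp32 ONNX encoder adds the same 1.27 GB to every MLX variant.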

Inference

pip install mlx mlx-lm onnxruntime soundfile transformers librosa numpy
git clone https://huggingface.co/Reza2kn/mega-asr-mlx
cd mega-asr-mlx
python inference.py --encoder-path onnx/audio_encoder_fp32.onnx \
                    --mlx-llm-path mlx/llm-mixed8_4 \
                    --examples-dir examples

Pipeline:

  1. mel features (Whisper preprocessor) → ONNX audio encoder (onnxruntime CPU) → audio embeddings (1, F, 2048)
  2. tokenize the Qwen3-ASR chat prompt with audio_pad_id=151676, expand the single placeholder to F copies
  3. embed all tokens via model.model.embed_tokens (MLX), scatter audio embeddings at the audio_pad positions
  4. greedy decode via MLX with input_embeddings
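
Steps 2 and 3 can be sketched in isolation. The toy example below uses plain Python lists in place of MLX arrays and a 2-dimensional hidden size in place of 2048, so it only illustrates the bookkeeping: expanding the single placeholder token to F copies, then overwriting the embeddings at those positions with the audio frames. The function names and the toy embedding are hypothetical; only `audio_pad_id=151676` comes from the description above.

```python
AUDIO_PAD_ID = 151676  # audio_pad_id from the pipeline description


def expand_audio_placeholder(token_ids, num_frames, pad_id=AUDIO_PAD_ID):
    """Replace each placeholder token with `num_frames` copies of itself."""
    expanded = []
    for tok in token_ids:
        if tok == pad_id:
            expanded.extend([pad_id] * num_frames)
        else:
            expanded.append(tok)
    return expanded


def scatter_audio_embeddings(token_ids, text_embeds, audio_embeds,
                             pad_id=AUDIO_PAD_ID):
    """Overwrite the embedding rows at placeholder positions with audio frames."""
    positions = [i for i, tok in enumerate(token_ids) if tok == pad_id]
    if len(positions) != len(audio_embeds):
        raise ValueError(
            f"{len(positions)} placeholder positions but "
            f"{len(audio_embeds)} audio frames"
        )
    merged = [list(row) for row in text_embeds]  # copy, keep the input intact
    for pos, frame in zip(positions, audio_embeds):
        merged[pos] = list(frame)
    return merged


if __name__ == "__main__":
    prompt = [1, AUDIO_PAD_ID, 2]                      # toy prompt, one placeholder
    audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]       # F = 3 frames, hidden = 2
    ids = expand_audio_placeholder(prompt, num_frames=len(audio))
    # Toy stand-in for the embedding lookup: one row per token.
    text = [[float(tok), 0.0] for tok in ids]
    print(scatter_audio_embeddings(ids, text, audio))
```

In the real pipeline the merged sequence would be an array of shape (1, T, 2048) passed to the decoder as input embeddings.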

Conversion details

  • LLM extracted from zhifeixie/Mega-ASR/Qwen3-ASR-1.7B/ by stripping the thinker.model. prefix from layer weights and dropping the tied lm_head (relies on tie_word_embeddings=True).
  • Mixed-precision quant via mlx_lm.utils.quantize_model with a per-layer quant_predicate:
    • q_proj / k_proj / v_proj / o_proj → 8-bit
    • gate_proj / up_proj / down_proj → 4-bit
    • group_size=64, mode=affine
  • DWQ variant via mlx_lm.quant.dwq --bits 4 --group-size 64 --num-samples 64 --max-seq-length 256 --learning-rate 1e-6. 64 distillation steps on tulu-3-sft-mixture reduced KL loss from ~0.18 to ~0.14.
  • Audio encoder ONNX is reused unchanged from Reza2kn/mega-asr-onnx.
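
The per-layer rule from the list above can be written as a small predicate. This is a sketch, not the repo's conversion script: the callback signature (a layer path followed by the module and possibly a config) and the convention of returning a dict of quantization parameters are assumptions about the installed mlx-lm version, and the fallback for layers outside the two groups (for example the embeddings) is a guess, since the README does not say how they were handled.

```python
ATTENTION_PROJS = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP_PROJS = ("gate_proj", "up_proj", "down_proj")
GROUP_SIZE = 64  # group_size from the conversion details


def mixed_8_4_predicate(path, module=None, *unused):
    """Pick quantization parameters from a layer path.

    `path` is assumed to be a dotted name such as
    "model.layers.0.self_attn.q_proj". Extra positional arguments
    (e.g. a config dict) are accepted and ignored so the function
    tolerates either callback signature.
    """
    leaf = path.rsplit(".", 1)[-1]
    if leaf in ATTENTION_PROJS:
        return {"group_size": GROUP_SIZE, "bits": 8}
    if leaf in MLP_PROJS:
        return {"group_size": GROUP_SIZE, "bits": 4}
    # Assumption: True means "quantize with the caller's default settings".
    return True


if __name__ == "__main__":
    for name in ("model.layers.0.self_attn.q_proj",
                 "model.layers.0.mlp.down_proj",
                 "model.embed_tokens"):
        print(name, "->", mixed_8_4_predicate(name))
```

A function like this would be passed as the `quant_predicate` argument mentioned above; check the signature expected by your mlx-lm version before relying on it.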

Performance

| Hardware | Cold load | Warm (3-4 s audio) |
|---|---|---|
| M-series Mac (MLX, mixed8_4) | ~3 s | ~1.5 s (LLM @ ~50 tps) |
| M-series Mac (MLX, dwq4) | ~3 s | ~1.5 s (LLM @ ~60 tps) |

Credits