# Download the model from the Hub
pip install "huggingface_hub[hf_xet]"

huggingface-cli download --local-dir mega-asr-mlx Reza2kn/mega-asr-mlx

Mega-ASR: MLX 4-bit

MLX deployment of zhifeixie/Mega-ASR, the 1.7B-parameter robust multilingual ASR foundation model built on Qwen3-ASR-1.7B.

Two LLM variants ship in this repo. The recommended one is the mixed-precision build (8-bit attention + 4-bit MLP layers), which comes within 0.5 percentage points of the ONNX GPTQ build (92.2% vs 92.7% agreement) at the smallest viable size.

What's in this repo

| File | Size | Role |
| --- | --- | --- |
| mlx/llm-mixed8_4/ | 1.5 GB | Recommended Qwen3 LLM, 8-bit attention + 4-bit MLP (5.0 bpw avg) |
| mlx/llm-dwq4/ | 923 MB | 4-bit DWQ-distilled (smallest, slight quality drop) |
| onnx/audio_encoder_fp32.onnx | 1.27 GB | 24-layer Whisper-style audio encoder (ONNX fp32, run via onnxruntime). MLX port is on the roadmap. |
| tokenizer/* | - | Original Qwen3-ASR tokenizer (with audio special tokens <\|audio_pad\|> etc.) |
| examples/*.wav | ~3 MB | 8 noisy benchmark clips from Voices-in-the-Wild-Bench |
| inference.py | - | End-to-end ASR pipeline: ONNX encoder + MLX LLM |

Quality (bench)

8-clip Voices-in-the-Wild-Bench agreement (1 − WER), with the prompt language forced to English:

| Variant | Encoder | LLM | Bpw | Agreement | Total size |
| --- | --- | --- | --- | --- | --- |
| PT bf16 (original) | fp16 | fp16 | 16 | 95.1% | 7.5 GB |
| ONNX recommended (GPTQ) | INT8 ONNX | INT4 GPTQ | ~4.5 | 92.7% | 2.3 GB |
| MLX recommended (mixed) | fp32 ONNX | MLX 8/4 mixed | 5.0 | 92.2% | ~2.8 GB |
| MLX 4-bit DWQ | fp32 ONNX | MLX 4-bit DWQ | 4.5 | 89.9% | ~2.2 GB |
| MLX 4-bit (no DWQ) | fp32 ONNX | MLX 4-bit | 4.5 | 89.1% | ~2.2 GB |

The mixed variant gets all 6 "easy" samples perfect and improves the 2 hard samples (echo, recording); only the audio-quality-limited tail remains.
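The agreement column is 1 − WER. As a rough illustration of that metric, the sketch below computes word error rate for one clip using a word-level Levenshtein distance. The lowercasing and whitespace tokenization are assumptions; the text normalization and the way per-clip scores were aggregated for the table are not stated here, so this will not necessarily reproduce those numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()   # assumed normalization: lowercase + whitespace split
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or free match
            )
        prev = curr
    return prev[-1] / len(ref)


def agreement(reference: str, hypothesis: str) -> float:
    """Agreement as used in the table header: 1 - WER."""
    return 1.0 - word_error_rate(reference, hypothesis)
```

One substituted word in a four-word reference gives a WER of 0.25 and an agreement of 0.75. Because insertions count as errors, WER can exceed 1 and agreement can go negative on very poor transcripts.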

Why mixed precision

Pure 4-bit MLX hits a quality wall around 89% because mlx-lm's affine quantization is naive groupwise (no calibration, no GPTQ-style error redistribution). Attention layers are the most quality-sensitive in Qwen3; keeping them at 8-bit while dropping MLP layers to 4-bit recovers all of the 4-bit quality loss (92.2%, matching uniform 8-bit) at only ~11% more weight memory than uniform 4-bit (5.0 vs 4.5 bpw).

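| Variant | Attention | MLP | Bpw | Agreement |
| --- | --- | --- | --- | --- |
| pure 4-bit | 4-bit | 4-bit | 4.5 | 89.1% |
| mixed 8/4 | 8-bit | 4-bit | 5.0 | 92.2% |
| mixed 8/6 | 8-bit | 6-bit | 6.5 | 91.4% |
| 6-bit | 6-bit | 6-bit | 6.5 | 90.7% |
| 8-bit | 8-bit | 8-bit | 8.5 | 92.2% |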

The mixed 8/4 build is Pareto-optimal among these variants: it matches full 8-bit quality at ~60% of its size (5.0 vs 8.5 bpw) and scores 2.3 percentage points higher agreement than DWQ-distilled 4-bit (92.2% vs 89.9%). DWQ on plain-text data couldn't bridge the gap because Mega-ASR's inference inputs (audio embeddings scattered into a text prompt) are out-of-distribution relative to the plain-text corpus used to distill from the bf16 teacher.
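The bpw figures in the table can be sanity-checked with a little arithmetic. The sketch below assumes each group of 64 weights carries one 16-bit scale and one 16-bit bias (32 overhead bits per group), which reproduces the 4.5, 6.5, and 8.5 values for the uniform variants. The helper names are illustrative and not part of mlx-lm.

```python
def affine_bpw(bits: int, group_size: int = 64, overhead_bits: int = 32) -> float:
    """Bits per weight for groupwise affine quantization.

    Assumption: every group stores a 16-bit scale and a 16-bit bias,
    i.e. 32 extra bits shared by group_size weights.
    """
    return bits + overhead_bits / group_size


def mixed_bpw(frac_high: float, high_bits: int = 8, low_bits: int = 4,
              group_size: int = 64) -> float:
    """Average bpw when frac_high of the weights use high_bits, the rest low_bits."""
    high = affine_bpw(high_bits, group_size)
    low = affine_bpw(low_bits, group_size)
    return frac_high * high + (1.0 - frac_high) * low


def implied_high_fraction(avg_bpw: float, high_bits: int = 8, low_bits: int = 4,
                          group_size: int = 64) -> float:
    """Invert mixed_bpw: the share of high-precision weights implied by an average."""
    high = affine_bpw(high_bits, group_size)
    low = affine_bpw(low_bits, group_size)
    return (avg_bpw - low) / (high - low)


print(affine_bpw(4), affine_bpw(6), affine_bpw(8))
print(implied_high_fraction(5.0))
print(5.0 / affine_bpw(8))
```

Under these assumptions, a 5.0 bpw average corresponds to about one eighth of the quantized weights being stored at 8-bit, and 5.0 / 8.5 is roughly 0.59, in line with the "~60% of its size" figure.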

Inference

pip install mlx mlx-lm onnxruntime soundfile transformers librosa numpy
git clone https://huggingface.co/Reza2kn/mega-asr-mlx
cd mega-asr-mlx
python inference.py --encoder-path onnx/audio_encoder_fp32.onnx \
                    --mlx-llm-path mlx/llm-mixed8_4 \
                    --examples-dir examples

Pipeline:

  1. mel features (Whisper preprocessor) → ONNX audio encoder (onnxruntime CPU) → audio embeddings (1, F, 2048)
  2. tokenize the Qwen3-ASR chat prompt with audio_pad_id=151676, expand the single placeholder to F copies
  3. embed all tokens via model.model.embed_tokens (MLX), scatter audio embeddings at the audio_pad positions
  4. greedy decode via MLX with input_embeddings
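Steps 2 and 3 can be sketched as follows. This is a minimal illustration that uses NumPy arrays as a stand-in for MLX arrays and a toy hidden size of 4 instead of 2048; the function names and the made-up text token ids are hypothetical, and inference.py may implement these steps differently.

```python
import numpy as np

AUDIO_PAD_ID = 151676  # audio_pad_id quoted in step 2


def expand_audio_placeholder(token_ids, num_frames, pad_id=AUDIO_PAD_ID):
    """Replace the single audio placeholder with num_frames copies of itself."""
    if token_ids.count(pad_id) != 1:
        raise ValueError("expected exactly one audio placeholder in the prompt")
    expanded = []
    for tok in token_ids:
        expanded.extend([pad_id] * num_frames if tok == pad_id else [tok])
    return expanded


def scatter_audio_embeddings(token_embeds, token_ids, audio_embeds, pad_id=AUDIO_PAD_ID):
    """Overwrite the rows at placeholder positions with the audio embeddings.

    token_embeds: (T, H) embeddings of the expanded prompt
    audio_embeds: (F, H) encoder output with the batch axis removed
    """
    mask = np.asarray(token_ids) == pad_id
    if int(mask.sum()) != audio_embeds.shape[0]:
        raise ValueError("placeholder count must equal the number of audio frames")
    merged = token_embeds.copy()  # leave the caller's array untouched
    merged[mask] = audio_embeds   # rows are filled in frame order
    return merged


# Toy demo: 3 audio frames, hidden size 4, made-up text token ids 11, 12, 13.
prompt_ids = [11, AUDIO_PAD_ID, 12, 13]
audio = np.arange(12, dtype=np.float32).reshape(3, 4)
ids = expand_audio_placeholder(prompt_ids, num_frames=audio.shape[0])
text = np.zeros((len(ids), 4), dtype=np.float32)  # stand-in for embed_tokens output
merged = scatter_audio_embeddings(text, ids, audio)
print(ids)
print(merged.shape)
```

The merged array has one row per prompt position, so it can be handed to the decoder in place of token ids, which is what step 4 describes with input_embeddings.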

Conversion details

  • LLM extracted from zhifeixie/Mega-ASR/Qwen3-ASR-1.7B/ by stripping the thinker.model. prefix from layer weights and dropping the tied lm_head (relies on tie_word_embeddings=True).
  • Mixed-precision quant via mlx_lm.utils.quantize_model with a per-layer quant_predicate:
    • q_proj / k_proj / v_proj / o_proj → 8-bit
    • gate_proj / up_proj / down_proj → 4-bit
    • group_size=64, mode=affine
  • DWQ variant via mlx_lm.quant.dwq --bits 4 --group-size 64 --num-samples 64 --max-seq-length 256 --learning-rate 1e-6. 64 distillation steps on tulu-3-sft-mixture reduced KL loss from ~0.18 to ~0.14.
  • Audio encoder ONNX is reused unchanged from Reza2kn/mega-asr-onnx.
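The per-layer rule in the mixed-precision bullet can be sketched as a plain Python predicate. This is an illustration under assumptions rather than the script used for this repo: the exact callback signature that mlx_lm.utils.quantize_model expects may differ between mlx-lm versions, so the function accepts and ignores extra arguments, and the fallback for layers outside the two lists is a guess.

```python
ATTENTION_PROJECTIONS = {"q_proj", "k_proj", "v_proj", "o_proj"}
MLP_PROJECTIONS = {"gate_proj", "up_proj", "down_proj"}
GROUP_SIZE = 64  # group_size from the conversion details


def mixed_8_4_predicate(path, module=None, *extra):
    """Pick quantization settings from the last component of a layer path.

    Extra positional arguments are accepted and ignored because the
    callback signature is an assumption here.
    """
    leaf = path.rsplit(".", 1)[-1]
    if leaf in ATTENTION_PROJECTIONS:
        return {"bits": 8, "group_size": GROUP_SIZE}
    if leaf in MLP_PROJECTIONS:
        return {"bits": 4, "group_size": GROUP_SIZE}
    # Assumption: every other quantizable layer uses the quantizer's defaults.
    return True


# Hypothetical layer paths, used only to show the mapping.
for name in ("model.layers.0.self_attn.q_proj", "model.layers.0.mlp.down_proj"):
    print(name, mixed_8_4_predicate(name))
```

Matching on the leaf name keeps the rule independent of layer index, so the same predicate covers every transformer block.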

Performance

| Hardware | Cold load | Warm (3-4 s audio) |
| --- | --- | --- |
| M-series Mac (MLX, mixed8_4) | ~3 s | ~1.5 s (LLM @ ~50 tps) |
| M-series Mac (MLX, dwq4) | ~3 s | ~1.5 s (LLM @ ~60 tps) |

Credits
