# Qwen3-ASR-1.7B: Mixed-Precision MLX (audio-8bit / text-4bit)
A mixed-precision quantized version of Qwen3-ASR-1.7B for Apple Silicon (MLX framework).
| Component | Precision | group_size |
|---|---|---|
| `audio_tower.*` | 8-bit affine | 64 |
| `model.layers.*` | 4-bit affine | 64 |
| `model.embed_tokens` | 4-bit affine | 64 |
This configuration was identified as optimal after a systematic ablation study across 16 quantization configurations on Common Voice 22.0 (zh-TW / en / ja / ko).
## Performance (n=1000, Common Voice 22.0)
Evaluated against mlx-community/Qwen3-ASR-1.7B-4bit on the same benchmark under identical conditions.
| Metric | This model | mlx-community-4bit | Delta |
|---|---|---|---|
| Disk size | 1,253 MB | 1,529 MB | -18.0% |
| zh-TW CER (n=1000) | 3.670% | 3.644% | +0.026pp (within noise) |
| en WER (n=1000) | 6.458% | 6.482% | -0.024pp |
| Speed (zh-TW) | 0.244 s/sample | 0.268 s/sample | -9.9% |
| Speed (en) | 0.294 s/sample | 0.318 s/sample | -7.5% |
| RAM delta on load | ~0 MB | +1,095 MB | load-from-weights |
Verdict: Accuracy is statistically equivalent to mlx-community/Qwen3-ASR-1.7B-4bit. The model is 18% smaller on disk and ~10% faster at inference.
Note: Japanese (ja) and other languages with `embed_tokens` quantized to 4-bit exhibit hallucination (~172% CER). This is the same behaviour as `mlx-community/Qwen3-ASR-1.7B-4bit` (174% CER) and is an inherent limitation of 4-bit `embed_tokens` quantization, not specific to this model.
## Why a custom loader is needed
mlx-audio's `load_model` hardcodes a `model_quant_predicate` inside `Qwen3ASRForConditionalGeneration` that unconditionally prevents the `audio_tower` from being quantized:
```python
# mlx_audio/stt/models/qwen3_asr.py (~line 779)
def model_quant_predicate(self, path, module, group_size, bits):
    return not path.startswith("audio_tower")
```
Because of this guard, calling `load_model(repo_id)` on this repo would silently load the 8-bit audio weights into unquantized `Linear` layers, causing a shape mismatch and producing empty transcriptions.

The workaround is to load the bf16 base model, apply quantization manually (bypassing `model_quant_predicate`), then replace the weights with the saved ones. The complete loader is shown below.
## Usage

### Requirements

```shell
pip install "mlx-audio>=0.3.1" "mlx>=0.22"
```
### Transcription
```python
import mlx.core as mx
import mlx.nn as nn
from mlx_audio.stt.utils import load_model
from mlx_audio.stt.generate import generate_transcription

# ── Step 1: load the bf16 base ───────────────────────────────────────────────
bundle = load_model("mlx-community/Qwen3-ASR-1.7B-bf16")
model = bundle._model

# ── Step 2: apply the same mixed-precision quantization ──────────────────────
def pred_audio(path, m):
    if not isinstance(m, (nn.Linear, nn.Embedding)):
        return False
    return path.startswith("audio_tower.")

def pred_text(path, m):
    if not isinstance(m, (nn.Linear, nn.Embedding)):
        return False
    return path.startswith("model.") and not path.startswith("model.audio_tower")

nn.quantize(model, group_size=64, bits=8, class_predicate=pred_audio)
nn.quantize(model, group_size=64, bits=4, class_predicate=pred_text)

# ── Step 3: load the saved weights from this repo ────────────────────────────
from huggingface_hub import hf_hub_download

weights_path = hf_hub_download("Alkd/Qwen3-ASR-1.7B-audio8-text4-mlx", "model.safetensors")
saved = list(mx.load(weights_path).items())
model.load_weights(saved, strict=True)
mx.eval(model.parameters())

# ── Step 4: transcribe ───────────────────────────────────────────────────────
# language=None -> auto-detect (recommended for zh-TW / Chinese)
# language="English", "Japanese", "Korean", etc. -> force a language
result = generate_transcription(bundle, "audio.wav", language=None)

# generate_transcription returns either a str or an STTOutput object
if isinstance(result, str):
    text = result
elif hasattr(result, "text"):
    text = result.text or ""
else:
    text = str(result)
print(text)
```
### Language hints

| Language | `language` argument |
|---|---|
| Chinese (Mandarin / zh-TW / zh-CN) | None (auto-detect) |
| English | "English" |
| Japanese | "Japanese" (hallucination risk, see Limitations) |
| Korean | "Korean" |
## How this model was created

### Quantization procedure
```python
import mlx.nn as nn

def pred_audio(path, m):
    return isinstance(m, (nn.Linear, nn.Embedding)) and path.startswith("audio_tower.")

def pred_text(path, m):
    return (isinstance(m, (nn.Linear, nn.Embedding))
            and path.startswith("model.")
            and not path.startswith("model.audio_tower"))

nn.quantize(model, group_size=64, bits=8, class_predicate=pred_audio)  # audio tower -> 8-bit
nn.quantize(model, group_size=64, bits=4, class_predicate=pred_text)   # LM layers + embed -> 4-bit
```
### Architecture equivalence

The text+embed component (`model.layers` + `model.embed_tokens`) is quantized identically to `mlx-community/Qwen3-ASR-1.7B-4bit`: running the `pred_text` predicate alone on the bf16 base model reproduces the mlx-community weights bit-for-bit (verified across all four languages on the same benchmark).

The only difference is the `audio_tower`: mlx-community keeps it in bf16 (3,580 MB → 1,529 MB), while this model quantizes it to 8-bit (→ 1,253 MB), saving a further 276 MB.
### Component sizes
| Component | bf16 | This model | mlx-community-4bit |
|---|---|---|---|
| audio_tower | ~635 MB | ~341 MB (8-bit) | ~635 MB (bf16) |
| model.layers | ~793 MB | ~268 MB (4-bit) | ~268 MB (4-bit) |
| model.embed_tokens | ~623 MB | ~175 MB (4-bit) | ~175 MB (4-bit) |
| Total (on-disk) | ~4,076 MB | ~1,253 MB | ~1,529 MB |
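The per-component figures can be roughly cross-checked by hand. With affine quantization at group_size 64, each group of 64 weights stores the packed weights plus a 16-bit scale and a 16-bit bias, i.e. 4 + 32/64 = 4.5 effective bits per weight at 4-bit and 8.5 at 8-bit. A back-of-the-envelope sketch (unquantized tensors such as norms make the real sizes deviate slightly):

```python
def effective_bits(bits: int, group_size: int = 64) -> float:
    # Packed weight bits plus one 16-bit scale and one 16-bit bias per group.
    return bits + 32 / group_size

def quantized_mb(bf16_mb: float, bits: int, group_size: int = 64) -> float:
    # bf16 stores 16 bits per weight, so size scales with the bit-width ratio.
    return bf16_mb * effective_bits(bits, group_size) / 16

print(round(quantized_mb(623, 4)))  # embed_tokens: 175 MB (matches the table)
print(round(quantized_mb(635, 8)))  # audio_tower: 337 MB (table: ~341 MB)
```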
### Ablation study summary (n=250 per language)
The configuration was selected from 16 candidates evaluated on Common Voice 22.0 (zh-TW, en, ja, ko):
| Config | Size | zh-TW CER | en WER | ja CER† | ko CER |
|---|---|---|---|---|---|
| bf16-FULL (baseline) | 4,076 MB | 4.57% | 7.69% | 31.91% | 6.33% |
| mlx-community-4bit (anchor) | 1,603 MB | 4.569% | 8.559% | 35.57% | 8.077% |
| AUDIO-8-EMBED-TEXT-4 (this model) | 1,310 MB | 4.467% | 8.472% | 35.30% | 7.955% |
† Japanese CER methodology: 2 out of 250 samples trigger an infinite repetition loop (one short phrase repeated ~700 times against a 58-character reference), inflating the raw aggregate CER to ~163-174%. The reported figures use a 3× length cap (hypothesis truncated to `min(len(hyp), len(ref) * 3)`), which bounds the loop penalty. The remaining 248/250 samples show 32.4% CER, nearly identical to the bf16 baseline (31.91%). Both this model and `mlx-community/Qwen3-ASR-1.7B-4bit` exhibit the same loop behaviour; adding `repetition_penalty=1.1` or a `max_tokens` limit at inference time eliminates the issue entirely.
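The 3× cap itself is a one-line change to a character-level CER. A minimal sketch (illustrative strings, not the actual Common Voice sample or the project's evaluation harness):

```python
def levenshtein(a: str, b: str) -> int:
    # Row-by-row dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def capped_cer(hyp: str, ref: str) -> float:
    # Truncate the hypothesis to 3x the reference length so a runaway
    # repetition loop contributes a bounded penalty (CER <= 3.0) instead of
    # an insertion count proportional to the loop length.
    return levenshtein(hyp[: len(ref) * 3], ref) / len(ref)

ref = "the weather is nice today"         # 25-char stand-in reference
loop = "sunny " * 700                     # simulated repetition loop
print(levenshtein(loop, ref) / len(ref))  # raw CER explodes (>100)
print(capped_cer(loop, ref))              # bounded at <= 3.0
```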
## Limitations
- **Japanese repetition loop**: 4-bit `embed_tokens` causes 2/250 samples to enter an infinite repetition loop, inflating raw aggregate CER to ~163%. On the 248/250 unaffected samples the CER is 32.4% (comparable to the bf16 baseline). The same issue exists in `mlx-community/Qwen3-ASR-1.7B-4bit`. Mitigation: pass `repetition_penalty=1.1` or a `max_tokens` cap to `generate_transcription`.
- **Custom loader required**: see "Why a custom loader is needed" above.
- **MLX only**: weights are in MLX safetensors format and are not compatible with PyTorch/Transformers.
## Future Work / TODO
The current quantization scheme applies a uniform bit-width per component (all audio_tower layers get 8-bit, all text layers get 4-bit). This is a coarse-grained assignment. The following directions would yield a better Pareto frontier:
### 1. Per-layer sensitivity analysis (HAWQ-style)
HAWQ uses the Hessian trace of each layer's loss contribution as a sensitivity proxy. Layers with high Hessian trace are more sensitive to quantization and should receive more bits; insensitive layers can be pushed lower.
How to apply here: run a forward+backward pass on a small calibration set, compute Tr(H_i) per layer, then use the ranking to assign bit-widths via class_predicate. This replaces the current hand-tuned audio-8 / text-4 split with a data-driven one.
Possible outcome: some audio_tower middle layers could be dropped to 4-bit (saving ~80โ120 MB) while keeping early/late transformer layers and the CNN frontend at 8-bit.
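A practical note on computing Tr(H_i): the Hessian never needs to be materialized. Hutchinson's estimator recovers the trace from Hessian-vector products alone, since Tr(H) = E[vᵀHv] for Rademacher vectors v. A framework-agnostic NumPy sketch, with a small explicit matrix standing in for one layer's Hessian (in a real run, `hvp` would be a double-backward pass over a calibration batch):

```python
import numpy as np

def hutchinson_trace(hvp, dim, num_samples=2000, seed=0):
    # Estimate Tr(H) as the sample mean of v^T (H v) over Rademacher
    # vectors v; only the Hessian-vector product `hvp` is required.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=dim)
        total += v @ hvp(v)
    return total / num_samples

# Stand-in "layer Hessian" with true trace 9.0; ranking these estimates
# across layers is what would drive the per-layer bit assignment.
H = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])
print(hutchinson_trace(lambda v: H @ v, dim=3))  # close to 9.0
```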
### 2. Better within-layer quantization (AWQ-style)
AWQ improves quality at a fixed bit-width by identifying per-channel "salient" weights (those corresponding to large activation magnitudes) and applying a learned per-channel rescaling before quantization. The scale factors are absorbed into adjacent layers so inference cost is unchanged.
Distinction from HAWQ: AWQ answers how to quantize a layer better; HAWQ answers which bit-width each layer should get. They are complementary and can be stacked.
Implementation note: AWQ requires scale absorption into paired layers (e.g. Linear ↔ LayerNorm), which needs architecture-aware code beyond MLX's built-in `nn.quantize`, and is non-trivial for the audio CNN + transformer hybrid architecture.
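The rescaling identity at the heart of AWQ is cheap to verify even though the absorption machinery is not: x·W = (x/s)·(s⊙W) for any positive per-input-channel scale s. A NumPy sketch (`s` here is a made-up activation-magnitude scale, not a learned AWQ scale):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # a batch of activations
W = rng.standard_normal((8, 16))   # linear weight (in_features x out_features)

# Per-input-channel scales, e.g. derived from mean activation magnitude;
# salient channels get larger s, so their weights land on a finer
# relative grid when W_scaled is quantized.
s = np.abs(x).mean(axis=0) + 1e-6

W_scaled = W * s[:, None]          # the tensor that would actually be quantized
y_ref = x @ W
y_awq = (x / s) @ W_scaled         # identical output before quantization error

print(np.allclose(y_ref, y_awq))   # True
```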
### 3. `embed_tokens` at 8-bit (Japanese hallucination fix)

Keeping `embed_tokens` at 8-bit instead of 4-bit (inspired by Unsloth's dynamic quantization findings) would add ~155 MB but is predicted to eliminate the Japanese hallucination:
| Config | Size | ja CER |
|---|---|---|
| Current (embed 4-bit) | ~1,253 MB | ~172% (hallucination) |
| embed 8-bit variant | ~1,408 MB | ~32% expected (no hallucination) |
This would still be ~8% smaller than mlx-community/Qwen3-ASR-1.7B-4bit while restoring Japanese quality.
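Expressed with the predicate mechanism used elsewhere in this card, the variant is a three-way split. A sketch of the path logic only (hypothetical `pred_embed`; the real predicates would also keep the `isinstance(m, (nn.Linear, nn.Embedding))` check, and this variant has not been run against the actual weights):

```python
# Path-level predicates for the proposed audio-8 / embed-8 / text-4 split.
def pred_audio(path):
    return path.startswith("audio_tower.")

def pred_embed(path):
    # embed_tokens promoted from 4-bit to 8-bit: the proposed fix.
    return path == "model.embed_tokens"

def pred_text(path):
    return (path.startswith("model.")
            and not path.startswith("model.audio_tower")
            and path != "model.embed_tokens")

# The quantization calls would then become (untested):
# nn.quantize(model, group_size=64, bits=8, class_predicate=lambda p, m: pred_audio(p))
# nn.quantize(model, group_size=64, bits=8, class_predicate=lambda p, m: pred_embed(p))
# nn.quantize(model, group_size=64, bits=4, class_predicate=lambda p, m: pred_text(p))

print(pred_embed("model.embed_tokens"), pred_text("model.embed_tokens"))  # True False
```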
## Citation
If you use this model, please cite the original Qwen3-ASR work and mlx-audio:
```bibtex
@misc{qwen3asr2025,
  title  = {Qwen3-ASR},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-ASR-1.7B}
}
```
## License

Apache 2.0, same as the base model.