Qwen3-ASR-1.7B – Mixed-Precision MLX (audio-8bit / text-4bit)

A mixed-precision quantized version of Qwen3-ASR-1.7B for Apple Silicon (MLX framework).

| Component | Precision | group_size |
|---|---|---|
| audio_tower.* | 8-bit affine | 64 |
| model.layers.* | 4-bit affine | 64 |
| model.embed_tokens | 4-bit affine | 64 |

This configuration was identified as optimal after a systematic ablation study across 16 quantization configurations on Common Voice 22.0 (zh-TW / en / ja / ko).


Performance (n=1000, Common Voice 22.0)

Evaluated against mlx-community/Qwen3-ASR-1.7B-4bit on the same benchmark under identical conditions.

| Metric | This model | mlx-community-4bit | Delta |
|---|---|---|---|
| Disk size | 1,253 MB | 1,529 MB | −18.0% |
| zh-TW CER (n=1000) | 3.670% | 3.644% | +0.026 pp (within noise) |
| en WER (n=1000) | 6.458% | 6.482% | −0.024 pp |
| Speed (zh-TW) | 0.244 s/sample | 0.268 s/sample | −9.9% |
| Speed (en) | 0.294 s/sample | 0.318 s/sample | −7.5% |
| RAM delta on load | ~0 MB | +1,095 MB | load-from-weights |

Verdict: Accuracy is statistically equivalent to mlx-community/Qwen3-ASR-1.7B-4bit. The model is 18% smaller on disk and ~10% faster at inference.

Note: With embed_tokens quantized to 4-bit, Japanese (ja) exhibits hallucination (~172% raw CER). The same behaviour occurs in mlx-community/Qwen3-ASR-1.7B-4bit (174% CER); it is an inherent limitation of 4-bit embed_tokens quantization, not something specific to this model.


Why a custom loader is needed

mlx-audio's load_model hardcodes a model_quant_predicate inside Qwen3ASRForConditionalGeneration that unconditionally prevents the audio_tower from being quantized:

# mlx_audio/stt/models/qwen3_asr.py (~line 779)
def model_quant_predicate(self, path, module, group_size, bits):
    return not path.startswith("audio_tower")

Because of this guard, calling load_model(repo_id) on this repo would silently load the 8-bit audio weights into un-quantized Linear layers, causing a shape mismatch and producing empty transcriptions.

The workaround is to load the bf16 base model, apply quantization manually (bypassing model_quant_predicate), then replace the weights with the saved ones. The complete loader is shown below.


Usage

Requirements

pip install "mlx-audio>=0.3.1" "mlx>=0.22"

Transcription

import mlx.core as mx
import mlx.nn as nn
from mlx_audio.stt.utils import load_model
from mlx_audio.stt.generate import generate_transcription

# ── Step 1: load bf16 base ──────────────────────────────────────────────────
bundle = load_model("mlx-community/Qwen3-ASR-1.7B-bf16")
model  = bundle._model

# ── Step 2: apply the same mixed-precision quantization ─────────────────────
def pred_audio(path, m):
    if not isinstance(m, (nn.Linear, nn.Embedding)):
        return False
    return path.startswith("audio_tower.")

def pred_text(path, m):
    if not isinstance(m, (nn.Linear, nn.Embedding)):
        return False
    return path.startswith("model.") and not path.startswith("model.audio_tower")

nn.quantize(model, group_size=64, bits=8, class_predicate=pred_audio)
nn.quantize(model, group_size=64, bits=4, class_predicate=pred_text)

# ── Step 3: load the saved weights from this repo ───────────────────────────
from huggingface_hub import hf_hub_download

weights_path = hf_hub_download("Alkd/Qwen3-ASR-1.7B-audio8-text4-mlx", "model.safetensors")
saved = list(mx.load(weights_path).items())
model.load_weights(saved, strict=True)
mx.eval(model.parameters())

# ── Step 4: transcribe ──────────────────────────────────────────────────────
# language=None  → auto-detect (recommended for zh-TW / Chinese)
# language="English", "Japanese", "Korean", etc. → force language
result = generate_transcription(bundle, "audio.wav", language=None)

# generate_transcription returns either a str or an STTOutput object
if isinstance(result, str):
    text = result
elif hasattr(result, "text"):
    text = result.text or ""
else:
    text = str(result)

print(text)

Language hints

| Language | language argument |
|---|---|
| Chinese (Mandarin / zh-TW / zh-CN) | None (auto-detect) |
| English | "English" |
| Japanese | "Japanese" (⚠ hallucination risk) |
| Korean | "Korean" |
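The table above can be wrapped in a small helper. This is hypothetical convenience code (not part of mlx-audio); the codes and labels simply mirror the hints above:

```python
# Map an ISO-style language code to the `language` argument expected by
# generate_transcription. None means auto-detect.
LANGUAGE_HINTS = {
    "zh": None, "zh-TW": None, "zh-CN": None,  # Chinese: auto-detect recommended
    "en": "English",
    "ja": "Japanese",  # caution: hallucination risk with 4-bit embed_tokens
    "ko": "Korean",
}

def language_hint(code: str):
    """Return the language argument for a code; unknown codes fall back to auto-detect."""
    return LANGUAGE_HINTS.get(code)
```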

How this model was created

Quantization procedure

import mlx.nn as nn

def pred_audio(path, m):
    return isinstance(m, (nn.Linear, nn.Embedding)) and path.startswith("audio_tower.")

def pred_text(path, m):
    return (isinstance(m, (nn.Linear, nn.Embedding))
            and path.startswith("model.")
            and not path.startswith("model.audio_tower"))

nn.quantize(model, group_size=64, bits=8, class_predicate=pred_audio)  # audio tower → 8-bit
nn.quantize(model, group_size=64, bits=4, class_predicate=pred_text)   # LM layers + embed → 4-bit

Architecture equivalence

The text+embed component (model.layers + model.embed_tokens) is quantized identically to mlx-community/Qwen3-ASR-1.7B-4bit: applying the pred_text predicate alone to the bf16 base model reproduces the mlx-community weights digit-for-digit, verified by matching benchmark scores across all four languages.

The only difference is the audio_tower: mlx-community keeps it in bf16 (3,580 MB → 1,529 MB), while this model quantizes it to 8-bit (→ 1,253 MB), saving an additional 276 MB.

Component sizes

| Component | bf16 | This model | mlx-community-4bit |
|---|---|---|---|
| audio_tower | ~635 MB | ~341 MB (8-bit) | ~635 MB (bf16) |
| model.layers | ~793 MB | ~268 MB (4-bit) | ~268 MB (4-bit) |
| model.embed_tokens | ~623 MB | ~175 MB (4-bit) | ~175 MB (4-bit) |
| Total (on-disk) | ~4,076 MB | ~1,253 MB | ~1,529 MB |

Ablation study summary (n=250 per language)

The configuration was selected from 16 candidates evaluated on Common Voice 22.0 (zh-TW, en, ja, ko):

| Config | Size | zh-TW CER | en WER | ja CER† | ko CER |
|---|---|---|---|---|---|
| bf16-FULL (baseline) | 4,076 MB | 4.57% | 7.69% | 31.91% | 6.33% |
| mlx-community-4bit (anchor) | 1,603 MB | 4.569% | 8.559% | 35.57% | 8.077% |
| AUDIO-8-EMBED-TEXT-4 (this model) | 1,310 MB | 4.467% | 8.472% | 35.30% | 7.955% |

† Japanese CER methodology: 2 out of 250 samples trigger an infinite repetition loop (e.g. "イギリスの" × 700 times for a 58-char reference), inflating the raw aggregate CER to ~163–174%. The reported figures use a 3× length cap (hypothesis truncated to min(len(hyp), len(ref)×3)), which correctly bounds the loop penalty. The remaining 248/250 samples show 32.4% CER, nearly identical to the bf16 baseline (31.91%). Both this model and mlx-community/Qwen3-ASR-1.7B-4bit exhibit the same loop behaviour; adding repetition_penalty=1.1 or a max_tokens limit at inference time eliminates the issue entirely.
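The capping rule can be sketched as follows. This is a minimal illustration of the metric, not the actual evaluation harness (which is not published in this repo):

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via single-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def capped_cer(ref: str, hyp: str, cap: int = 3) -> float:
    """CER with the hypothesis truncated to cap * len(ref) characters,
    bounding the penalty a runaway repetition loop can contribute."""
    hyp = hyp[: len(ref) * cap]
    return levenshtein(ref, hyp) / max(len(ref), 1)
```

A looping hypothesis of 700 repetitions against a 58-character reference is thus scored as if it were at most 174 characters long, so a single degenerate sample cannot dominate the aggregate.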


Limitations

  • Japanese repetition loop: 4-bit embed_tokens causes 2/250 samples to enter an infinite repetition loop, inflating raw aggregate CER to ~163%. On the 248/250 unaffected samples the CER is 32.4% (comparable to bf16 baseline). The same issue exists in mlx-community/Qwen3-ASR-1.7B-4bit. Mitigation: pass repetition_penalty=1.1 or a max_tokens cap to generate_transcription.
  • Custom loader required: See Why a custom loader is needed above.
  • MLX only: Weights are in MLX safetensors format and are not compatible with PyTorch/Transformers.

Future Work / TODO

The current quantization scheme applies a uniform bit-width per component (all audio_tower layers get 8-bit, all text layers get 4-bit). This is a coarse-grained assignment. The following directions would yield a better Pareto frontier:

1. Per-layer sensitivity analysis (HAWQ-style)

HAWQ uses the Hessian trace of each layer's loss contribution as a sensitivity proxy. Layers with high Hessian trace are more sensitive to quantization and should receive more bits; insensitive layers can be pushed lower.

How to apply here: run a forward+backward pass on a small calibration set, compute Tr(H_i) per layer, then use the ranking to assign bit-widths via class_predicate. This replaces the current hand-tuned audio-8 / text-4 split with a data-driven one.

Possible outcome: some audio_tower middle layers could be dropped to 4-bit (saving ~80โ€“120 MB) while keeping early/late transformer layers and the CNN frontend at 8-bit.
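As a sketch, the bit-assignment step might look like the helper below. This is hypothetical code: the sensitivity scores are assumed to come from a separate Hutchinson-style Tr(H_i) estimation pass, and the names and threshold scheme are illustrative only:

```python
def assign_bits(sensitivity, frac_high=0.5, high=8, low=4):
    """Given per-layer sensitivity scores (e.g. Hutchinson estimates of the
    Hessian trace), give the most sensitive `frac_high` fraction of layers
    `high` bits and the rest `low` bits. Returns {layer_path: bits}."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    n_high = max(1, round(len(ranked) * frac_high))
    return {p: (high if i < n_high else low) for i, p in enumerate(ranked)}

# The resulting map would plug into nn.quantize via one predicate per bit-width:
#   nn.quantize(model, group_size=64, bits=8,
#               class_predicate=lambda p, m: bits_map.get(p) == 8)
```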

2. Better within-layer quantization (AWQ-style)

AWQ improves quality at a fixed bit-width by identifying per-channel "salient" weights (those corresponding to large activation magnitudes) and applying a learned per-channel rescaling before quantization. The scale factors are absorbed into adjacent layers so inference cost is unchanged.

Distinction from HAWQ: AWQ answers how to quantize a layer better; HAWQ answers which bit-width each layer should get. They are complementary and can be stacked.

Implementation note: AWQ requires scale absorption into paired layers (e.g. Linear โ†’ LayerNorm), which needs architecture-aware code beyond MLX's built-in nn.quantize. Non-trivial to implement for the audio CNN + transformer hybrid architecture.
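The identity AWQ relies on is that a per-channel scale applied to the weights can be cancelled by the inverse scale on the input, leaving the output unchanged. A minimal numeric illustration (pure Python; this shows only the scale-absorption identity, not AWQ itself):

```python
def matvec(W, x):
    """Row-major matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# y = W @ x is invariant under scaling column j of W by s[j] while dividing
# x[j] by s[j] -- this is how AWQ absorbs its per-channel scales into the
# preceding layer so inference cost is unchanged.
W = [[1.0, 2.0], [3.0, 4.0]]
x = [0.5, -1.0]
s = [2.0, 4.0]  # per-channel scales (salient channels get s > 1)
W_scaled = [[w * sj for w, sj in zip(row, s)] for row in W]
x_scaled = [xi / sj for xi, sj in zip(x, s)]
assert matvec(W, x) == matvec(W_scaled, x_scaled)
```

Quantizing W_scaled instead of W reduces the relative error on the salient channels, which is where AWQ gets its quality gain at a fixed bit-width.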

3. embed_tokens at 8-bit (Japanese hallucination fix)

Keeping embed_tokens at 8-bit instead of 4-bit (inspired by Unsloth's dynamic quantization findings) would add ~155 MB but is predicted to eliminate the Japanese hallucination:

| Config | Size | ja CER |
|---|---|---|
| Current (embed 4-bit) | ~1,253 MB | ~172% (⚠ hallucination) |
| embed 8-bit variant | ~1,408 MB | ~32% (expected, no hallucination) |

This would still be ~8% smaller than mlx-community/Qwen3-ASR-1.7B-4bit while restoring Japanese quality.
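A sketch of the three-way split for this untested variant. The path logic is shown on its own for clarity; in a real loader each predicate would also check isinstance(m, (nn.Linear, nn.Embedding)) as in the code above:

```python
def bits_for(path: str):
    """Proposed bit-width per parameter path for the embed-8-bit variant."""
    if path.startswith("audio_tower."):
        return 8   # audio tower stays at 8-bit
    if path.startswith("model.embed_tokens"):
        return 8   # raised from 4-bit to fix the Japanese hallucination
    if path.startswith("model.layers."):
        return 4   # LM transformer layers stay at 4-bit
    return None    # everything else left in bf16

# Applied via two nn.quantize passes, e.g.:
#   nn.quantize(model, group_size=64, bits=8,
#               class_predicate=lambda p, m: bits_for(p) == 8
#               and isinstance(m, (nn.Linear, nn.Embedding)))
#   nn.quantize(model, group_size=64, bits=4,
#               class_predicate=lambda p, m: bits_for(p) == 4
#               and isinstance(m, (nn.Linear, nn.Embedding)))
```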


Citation

If you use this model, please cite the original Qwen3-ASR work and mlx-audio:

@misc{qwen3asr2025,
  title  = {Qwen3-ASR},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-ASR-1.7B}
}

License

Apache 2.0 โ€” same as the base model.
