Qwen3-ASR-1.7B – Mixed-Precision MLX (audio-8bit / text-4bit)

A mixed-precision quantized version of Qwen3-ASR-1.7B for Apple Silicon (MLX framework).

| Component | Precision | group_size |
|---|---|---|
| audio_tower.* | 8-bit affine | 64 |
| model.layers.* | 4-bit affine | 64 |
| model.embed_tokens | 4-bit affine | 64 |

This configuration was identified as optimal after a systematic ablation study across 16 quantization configurations on Common Voice 22.0 (zh-TW / en / ja / ko).


Performance (n=1000, Common Voice 22.0)

Evaluated against mlx-community/Qwen3-ASR-1.7B-4bit on the same benchmark under identical conditions.

| Metric | This model | mlx-community-4bit | Delta |
|---|---|---|---|
| Disk size | 1,253 MB | 1,529 MB | −18.0% |
| zh-TW CER (n=1000) | 3.670% | 3.644% | +0.026 pp (within noise) |
| en WER (n=1000) | 6.458% | 6.482% | −0.024 pp |
| Speed (zh-TW) | 0.244 s/sample | 0.268 s/sample | −9.9% |
| Speed (en) | 0.294 s/sample | 0.318 s/sample | −7.5% |
| RAM delta on load | ~0 MB | +1,095 MB | load-from-weights |

Verdict: Accuracy is statistically equivalent to mlx-community/Qwen3-ASR-1.7B-4bit. The model is 18% smaller on disk and ~10% faster at inference.

Note: With embed_tokens quantized to 4-bit, Japanese (ja) exhibits hallucination (~172% raw CER). The same behaviour occurs in mlx-community/Qwen3-ASR-1.7B-4bit (174% CER); it is an inherent limitation of 4-bit embed_tokens quantization, not something specific to this model.


Why a custom loader is needed

mlx-audio's load_model hardcodes a model_quant_predicate inside Qwen3ASRForConditionalGeneration that unconditionally prevents the audio_tower from being quantized:

# mlx_audio/stt/models/qwen3_asr.py (~line 779)
def model_quant_predicate(self, path, module, group_size, bits):
    return not path.startswith("audio_tower")

Because of this guard, calling load_model(repo_id) on this repo would silently load the 8-bit audio weights into un-quantized Linear layers, causing a shape mismatch and producing empty transcriptions.

The workaround is to load the bf16 base model, apply quantization manually (bypassing model_quant_predicate), then replace the weights with the saved ones. The complete loader is shown below.


Usage

Requirements

pip install "mlx-audio>=0.3.1" "mlx>=0.22"

Transcription

import mlx.core as mx
import mlx.nn as nn
from mlx_audio.stt.utils import load_model
from mlx_audio.stt.generate import generate_transcription

# ── Step 1: load bf16 base ──────────────────────────────────────────────────
bundle = load_model("mlx-community/Qwen3-ASR-1.7B-bf16")
model  = bundle._model

# ── Step 2: apply the same mixed-precision quantization ─────────────────────
def pred_audio(path, m):
    if not isinstance(m, (nn.Linear, nn.Embedding)):
        return False
    return path.startswith("audio_tower.")

def pred_text(path, m):
    if not isinstance(m, (nn.Linear, nn.Embedding)):
        return False
    return path.startswith("model.") and not path.startswith("model.audio_tower")

nn.quantize(model, group_size=64, bits=8, class_predicate=pred_audio)
nn.quantize(model, group_size=64, bits=4, class_predicate=pred_text)

# ── Step 3: load the saved weights from this repo ───────────────────────────
from huggingface_hub import hf_hub_download

weights_path = hf_hub_download("Alkd/Qwen3-ASR-1.7B-audio8-text4-mlx", "model.safetensors")
saved = list(mx.load(weights_path).items())
model.load_weights(saved, strict=True)
mx.eval(model.parameters())

# ── Step 4: transcribe ──────────────────────────────────────────────────────
# language=None  → auto-detect (recommended for zh-TW / Chinese)
# language="English", "Japanese", "Korean", etc. → force language
result = generate_transcription(bundle, "audio.wav", language=None)

# generate_transcription returns either a str or an STTOutput object
if isinstance(result, str):
    text = result
elif hasattr(result, "text"):
    text = result.text or ""
else:
    text = str(result)

print(text)

Language hints

| Language | language argument |
|---|---|
| Chinese (Mandarin / zh-TW / zh-CN) | None (auto-detect) |
| English | "English" |
| Japanese | "Japanese" (⚠ hallucination risk) |
| Korean | "Korean" |
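The table above can be wrapped in a small helper. This is hypothetical convenience code (not part of mlx-audio); the codes and labels simply mirror the hints above:

```python
# Map an ISO-style language code to the `language` argument expected by
# generate_transcription. None means auto-detect.
LANGUAGE_HINTS = {
    "zh": None, "zh-TW": None, "zh-CN": None,  # Chinese: auto-detect recommended
    "en": "English",
    "ja": "Japanese",  # caution: hallucination risk with 4-bit embed_tokens
    "ko": "Korean",
}

def language_hint(code: str):
    """Return the language argument for a code; unknown codes fall back to auto-detect."""
    return LANGUAGE_HINTS.get(code)
```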

How this model was created

Quantization procedure

import mlx.nn as nn

def pred_audio(path, m):
    return isinstance(m, (nn.Linear, nn.Embedding)) and path.startswith("audio_tower.")

def pred_text(path, m):
    return (isinstance(m, (nn.Linear, nn.Embedding))
            and path.startswith("model.")
            and not path.startswith("model.audio_tower"))

nn.quantize(model, group_size=64, bits=8, class_predicate=pred_audio)  # audio tower → 8-bit
nn.quantize(model, group_size=64, bits=4, class_predicate=pred_text)   # LM layers + embed → 4-bit

Architecture equivalence

The text+embed component (model.layers + model.embed_tokens) is quantized identically to mlx-community/Qwen3-ASR-1.7B-4bit: applying the pred_text predicate alone to the bf16 base model reproduces the mlx-community weights digit-for-digit, verified by matching benchmark scores across all four languages.

The only difference is the audio_tower: mlx-community keeps it in bf16 (3,580 MB → 1,529 MB), while this model quantizes it to 8-bit (→ 1,253 MB), saving an additional 276 MB.

Component sizes

| Component | bf16 | This model | mlx-community-4bit |
|---|---|---|---|
| audio_tower | ~635 MB | ~341 MB (8-bit) | ~635 MB (bf16) |
| model.layers | ~793 MB | ~268 MB (4-bit) | ~268 MB (4-bit) |
| model.embed_tokens | ~623 MB | ~175 MB (4-bit) | ~175 MB (4-bit) |
| Total (on-disk) | ~4,076 MB | ~1,253 MB | ~1,529 MB |

Ablation study summary (n=250 per language)

The configuration was selected from 16 candidates evaluated on Common Voice 22.0 (zh-TW, en, ja, ko):

| Config | Size | zh-TW CER | en WER | ja CER† | ko CER |
|---|---|---|---|---|---|
| bf16-FULL (baseline) | 4,076 MB | 4.57% | 7.69% | 31.91% | 6.33% |
| mlx-community-4bit (anchor) | 1,603 MB | 4.569% | 8.559% | 35.57% | 8.077% |
| AUDIO-8-EMBED-TEXT-4 (this model) | 1,310 MB | 4.467% | 8.472% | 35.30% | 7.955% |

† Japanese CER methodology: 2 out of 250 samples trigger an infinite repetition loop (e.g. "イギリスの" × 700 times for a 58-char reference), inflating the raw aggregate CER to ~163–174%. The reported figures use a 3× length cap (hypothesis truncated to min(len(hyp), len(ref)×3)), which correctly bounds the loop penalty. The remaining 248/250 samples show 32.4% CER, nearly identical to the bf16 baseline (31.91%). Both this model and mlx-community/Qwen3-ASR-1.7B-4bit exhibit the same loop behaviour; adding repetition_penalty=1.1 or a max_tokens limit at inference time eliminates the issue entirely.
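The capping rule can be sketched as follows. This is a minimal illustration of the metric, not the actual evaluation harness (which is not published in this repo):

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via single-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def capped_cer(ref: str, hyp: str, cap: int = 3) -> float:
    """CER with the hypothesis truncated to cap * len(ref) characters,
    bounding the penalty a runaway repetition loop can contribute."""
    hyp = hyp[: len(ref) * cap]
    return levenshtein(ref, hyp) / max(len(ref), 1)
```

A looping hypothesis of 700 repetitions against a 58-character reference is thus scored as if it were at most 174 characters long, so a single degenerate sample cannot dominate the aggregate.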


Limitations

  • Japanese repetition loop: 4-bit embed_tokens causes 2/250 samples to enter an infinite repetition loop, inflating raw aggregate CER to ~163%. On the 248/250 unaffected samples the CER is 32.4% (comparable to bf16 baseline). The same issue exists in mlx-community/Qwen3-ASR-1.7B-4bit. Mitigation: pass repetition_penalty=1.1 or a max_tokens cap to generate_transcription.
  • Custom loader required: See Why a custom loader is needed above.
  • MLX only: Weights are in MLX safetensors format and are not compatible with PyTorch/Transformers.

Future Work / TODO

The current quantization scheme applies a uniform bit-width per component (all audio_tower layers get 8-bit, all text layers get 4-bit). This is a coarse-grained assignment. The following directions would yield a better Pareto frontier:

1. Per-layer sensitivity analysis (HAWQ-style)

HAWQ uses the Hessian trace of each layer's loss contribution as a sensitivity proxy. Layers with high Hessian trace are more sensitive to quantization and should receive more bits; insensitive layers can be pushed lower.

How to apply here: run a forward+backward pass on a small calibration set, compute Tr(H_i) per layer, then use the ranking to assign bit-widths via class_predicate. This replaces the current hand-tuned audio-8 / text-4 split with a data-driven one.

Possible outcome: some audio_tower middle layers could be dropped to 4-bit (saving ~80โ€“120 MB) while keeping early/late transformer layers and the CNN frontend at 8-bit.
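As a sketch, the bit-assignment step might look like the helper below. This is hypothetical code: the sensitivity scores are assumed to come from a separate Hutchinson-style Tr(H_i) estimation pass, and the names and threshold scheme are illustrative only:

```python
def assign_bits(sensitivity, frac_high=0.5, high=8, low=4):
    """Given per-layer sensitivity scores (e.g. Hutchinson estimates of the
    Hessian trace), give the most sensitive `frac_high` fraction of layers
    `high` bits and the rest `low` bits. Returns {layer_path: bits}."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    n_high = max(1, round(len(ranked) * frac_high))
    return {p: (high if i < n_high else low) for i, p in enumerate(ranked)}

# The resulting map would plug into nn.quantize via one predicate per bit-width:
#   nn.quantize(model, group_size=64, bits=8,
#               class_predicate=lambda p, m: bits_map.get(p) == 8)
```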

2. Better within-layer quantization (AWQ-style)

AWQ improves quality at a fixed bit-width by identifying per-channel "salient" weights (those corresponding to large activation magnitudes) and applying a learned per-channel rescaling before quantization. The scale factors are absorbed into adjacent layers so inference cost is unchanged.

Distinction from HAWQ: AWQ answers how to quantize a layer better; HAWQ answers which bit-width each layer should get. They are complementary and can be stacked.

Implementation note: AWQ requires scale absorption into paired layers (e.g. Linear โ†’ LayerNorm), which needs architecture-aware code beyond MLX's built-in nn.quantize. Non-trivial to implement for the audio CNN + transformer hybrid architecture.
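The identity AWQ relies on is that a per-channel scale applied to the weights can be cancelled by the inverse scale on the input, leaving the output unchanged. A minimal numeric illustration (pure Python; this shows only the scale-absorption identity, not AWQ itself):

```python
def matvec(W, x):
    """Row-major matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# y = W @ x is invariant under scaling column j of W by s[j] while dividing
# x[j] by s[j] -- this is how AWQ absorbs its per-channel scales into the
# preceding layer so inference cost is unchanged.
W = [[1.0, 2.0], [3.0, 4.0]]
x = [0.5, -1.0]
s = [2.0, 4.0]  # per-channel scales (salient channels get s > 1)
W_scaled = [[w * sj for w, sj in zip(row, s)] for row in W]
x_scaled = [xi / sj for xi, sj in zip(x, s)]
assert matvec(W, x) == matvec(W_scaled, x_scaled)
```

Quantizing W_scaled instead of W reduces the relative error on the salient channels, which is where AWQ gets its quality gain at a fixed bit-width.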

3. embed_tokens at 8-bit (Japanese hallucination fix)

Keeping embed_tokens at 8-bit instead of 4-bit (inspired by Unsloth's dynamic quantization findings) would add ~155 MB but is predicted to eliminate the Japanese hallucination:

| Config | Size | ja CER |
|---|---|---|
| Current (embed 4-bit) | ~1,253 MB | ~172% (⚠ hallucination) |
| embed 8-bit variant | ~1,408 MB | ~32% (expected, no hallucination) |

This would still be ~8% smaller than mlx-community/Qwen3-ASR-1.7B-4bit while restoring Japanese quality.
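A sketch of the three-way split for this untested variant. The path logic is shown on its own for clarity; in a real loader each predicate would also check isinstance(m, (nn.Linear, nn.Embedding)) as in the code above:

```python
def bits_for(path: str):
    """Proposed bit-width per parameter path for the embed-8-bit variant."""
    if path.startswith("audio_tower."):
        return 8   # audio tower stays at 8-bit
    if path.startswith("model.embed_tokens"):
        return 8   # raised from 4-bit to fix the Japanese hallucination
    if path.startswith("model.layers."):
        return 4   # LM transformer layers stay at 4-bit
    return None    # everything else left in bf16

# Applied via two nn.quantize passes, e.g.:
#   nn.quantize(model, group_size=64, bits=8,
#               class_predicate=lambda p, m: bits_for(p) == 8
#               and isinstance(m, (nn.Linear, nn.Embedding)))
#   nn.quantize(model, group_size=64, bits=4,
#               class_predicate=lambda p, m: bits_for(p) == 4
#               and isinstance(m, (nn.Linear, nn.Embedding)))
```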


Citation

If you use this model, please cite the original Qwen3-ASR work and mlx-audio:

@misc{qwen3asr2025,
  title  = {Qwen3-ASR},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-ASR-1.7B}
}

License

Apache 2.0 โ€” same as the base model.
