# Qwen3-ASR-0.6B: INT4 AWQ + Quantization-Aware Distillation
This is a compressed and distillation-refined version of Qwen/Qwen3-ASR-0.6B.
The model applies INT4 AWQ (Activation-aware Weight Quantization) post-training quantization followed by Quantization-Aware Distillation (QAD) to recover the accuracy lost during aggressive 4-bit compression, while preserving low-latency, low-memory inference.
## Method Overview
### Stage 1: INT4 AWQ (PTQ)
The base FP16 model is quantized to INT4 using the AWQ (awq_full) algorithm via NVIDIA ModelOpt. To preserve acoustic feature extraction and final token prediction integrity, the audio_tower and lm_head are explicitly excluded from quantization and kept in FP16. Calibration was performed dynamically across English, Chinese, and 28 other languages to ensure a balanced quantization scale mapping.
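As a rough sketch of how such a pass can be set up with ModelOpt (the `calib_batches` iterable is a hypothetical stand-in for the multilingual calibration set, and exact config keys may vary between ModelOpt versions):

```python
import copy

import modelopt.torch.quantization as mtq

# Start from ModelOpt's INT4 AWQ recipe and select the full AWQ search
config = copy.deepcopy(mtq.INT4_AWQ_CFG)
config["algorithm"] = {"method": "awq_full"}

# Keep the acoustic encoder and the output projection in FP16
config["quant_cfg"]["*audio_tower*"] = {"enable": False}
config["quant_cfg"]["*lm_head*"] = {"enable": False}

def forward_loop(model):
    # Run multilingual calibration batches through the model so AWQ can
    # observe activation statistics (calib_batches is hypothetical)
    for batch in calib_batches:
        model(**batch)

model = mtq.quantize(model, config, forward_loop)
```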
### Stage 2: Quantization-Aware Distillation (QAD)
To close the accuracy gap introduced by heavy 4-bit quantization, we apply a knowledge distillation fine-tuning stage where:
- Teacher: Qwen3-ASR-1.7B FP16 model (frozen)
- Student: the INT4-quantized 0.6B model (trainable quantized QKV/MLP weights; audio encoder and LM head frozen)
- Data: unlabeled speech data spanning 30 languages, with pseudo-labels generated by the 1.7B teacher model
- Loss: a combination of KL-divergence distillation loss (`alpha_kd = 0.5`) and cross-entropy loss; see the sketch after this list
- Optimizer: AdamW with a cosine decay learning rate schedule
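A minimal sketch of such a combined objective follows. The `(1 - alpha_kd)` weighting on the cross-entropy term, the `ignore_index` padding convention, and the function name are illustrative assumptions, not the published training recipe:

```python
import torch.nn.functional as F

def qad_loss(student_logits, teacher_logits, labels, alpha_kd=0.5):
    # KL divergence between the frozen FP16 teacher's and the INT4
    # student's next-token distributions
    kd = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    # Cross-entropy against the teacher-generated pseudo-labels
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # assumed padding convention
    )
    return alpha_kd * kd + (1.0 - alpha_kd) * ce
```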
## Benchmark Results (Trilingual Evaluation)
| Model | VIVOS (Vietnamese) WER ↓ | LibriSpeech (English) WER ↓ | Fleurs (Chinese) CER ↓ |
|---|---|---|---|
| Teacher 1.7B (FP16 base) | 7.24% | 2.32% | 7.12% |
| INT4 AWQ (pre-QAD) | 14.34% | 3.47% | 8.16% |
| INT4 + QAD (this model) | 12.81% | 3.41% | 8.10% |
QAD recovers ~1.53 percentage points of absolute WER on Vietnamese (14.34% → 12.81%) and keeps English and Chinese performance stable against the degradation introduced by aggressive 4-bit compression, with no additional labeled data required.
## Usage
Install dependencies:
```bash
pip install qwen-asr nvidia-modelopt soundfile numpy
```
Run inference:
```python
import soundfile as sf
import numpy as np
import torch
import modelopt.torch.opt as mto
from qwen_asr import Qwen3ASRModel

# Enable ModelOpt quantization state restore
mto.enable_huggingface_checkpointing()

model = Qwen3ASRModel.from_pretrained(
    "vrfai/Qwen3-ASR-0.6B-int4-QAD",
    dtype=torch.float16,
    device_map="cuda:0",
    max_new_tokens=256,
)

# Load audio, downmix stereo to mono, and convert to float32
audio, sr = sf.read("your_audio.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)
audio = audio.astype(np.float32)

# language=None lets the model infer the input language
results = model.transcribe(audio=(audio, sr), language=None)
print(results[0].text)
```
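Note: `mto.enable_huggingface_checkpointing()` must be called before `from_pretrained`; it hooks Hugging Face checkpoint loading so that the ModelOpt quantizer state stored with this repository is restored. Without it, the weights load but the INT4 quantization configuration is not reapplied.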
## Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-ASR-0.6B |
| Parameters | ~0.6B |
| Quantization | INT4 AWQ (awq_full via NVIDIA ModelOpt) |
| Audio encoder | Frozen (FP16, not quantized) |
| LM head | Frozen (FP16, not quantized) |
| Quantized scope | Transformer decoder layers |
| Distillation data | Multilingual unlabeled speech (30 languages) |
| License | Apache 2.0 |
## Citation
If you use this model, please also cite the original Qwen3-ASR work:
```bibtex
@misc{qwen3asr2025,
  title  = {Qwen3-ASR},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-ASR-0.6B}
}
```
## Acknowledgements
- Qwen Team for the base ASR model
- NVIDIA ModelOpt for quantization tooling