Qwen3-ASR-0.6B β€” INT4 AWQ + Quantization-Aware Distillation

This is a compressed and distillation-refined version of Qwen/Qwen3-ASR-0.6B.
The model applies INT4 AWQ (Activation-aware Weight Quantization) post-training quantization, followed by Quantization-Aware Distillation (QAD) to recover the accuracy lost during aggressive 4-bit compression while preserving low-latency, low-memory inference.


Method Overview

Stage 1 β€” INT4 AWQ (PTQ)

The base FP16 model is quantized to INT4 with the AWQ (awq_full) algorithm via NVIDIA ModelOpt. To preserve acoustic feature extraction and final token prediction, the audio_tower and lm_head are excluded from quantization and kept in FP16. Calibration ran over a multilingual mix of English, Chinese, and 28 other languages so that quantization scales stay balanced across languages.
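
A minimal sketch of this PTQ step is shown below, assuming ModelOpt's stock INT4 AWQ recipe. The module-name patterns, the algorithm field layout, and calib_loader are illustrative assumptions, not the exact recipe used for this checkpoint:

import copy
import modelopt.torch.quantization as mtq

# Start from ModelOpt's stock INT4 AWQ config and switch to the full AWQ
# search (the field layout is an assumption; check your ModelOpt version).
config = copy.deepcopy(mtq.INT4_AWQ_CFG)
config["algorithm"] = {"method": "awq_full"}

# Keep the acoustic encoder and output projection in FP16 by disabling
# their quantizers (wildcard patterns are illustrative).
config["quant_cfg"]["*audio_tower*"] = {"enable": False}
config["quant_cfg"]["*lm_head*"] = {"enable": False}

def forward_loop(model):
    # Feed multilingual calibration batches so AWQ can observe activation
    # ranges; calib_loader is a hypothetical DataLoader over 30 languages.
    for batch in calib_loader:
        model(**batch)

model = mtq.quantize(model, config, forward_loop)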

Stage 2 β€” Quantization-Aware Distillation (QAD)

To close the accuracy gap introduced by heavy 4-bit quantization, we apply a knowledge distillation fine-tuning stage where:

  • Teacher: Qwen3-ASR-1.7B FP16 model (frozen)
  • Student: the INT4-quantized 0.6B model (trainable quantized QKV/MLP weights; audio encoder and LM head frozen)
  • Data: unlabeled speech data spanning 30 languages, with pseudo-labels generated by the 1.7B teacher model
  • Loss: a combination of KL-divergence distillation loss (alpha_kd = 0.5) and cross-entropy loss, sketched after this list
  • Optimizer: AdamW with cosine decay learning rate schedule
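
A minimal sketch of the combined objective, assuming a standard temperature-scaled KL formulation and a convex alpha_kd blend (the temperature, learning rate, step count, and the hypothetical `student` variable are assumptions; alpha_kd = 0.5 is from the recipe above):

import torch
import torch.nn.functional as F

def qad_loss(student_logits, teacher_logits, labels, alpha_kd=0.5, temperature=1.0):
    # Temperature-scaled KL between teacher and student token distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Cross-entropy against the 1.7B teacher's pseudo-labels (-100 = padding).
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha_kd * kd + (1.0 - alpha_kd) * ce

# AdamW + cosine decay, as listed above; `student` is the INT4 0.6B model
# with only the quantized decoder weights left trainable.
optimizer = torch.optim.AdamW(
    (p for p in student.parameters() if p.requires_grad), lr=1e-5
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)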

Benchmark Results (Trilingual Evaluation)

Model                      VIVOS (Vietnamese) WER ↓   LibriSpeech (English) WER ↓   Fleurs (Chinese) CER ↓
Teacher 1.7B (FP16 base)   7.24%                      2.32%                         7.12%
INT4 AWQ (pre-QAD)         14.34%                     3.47%                         8.16%
INT4 + QAD (this model)    12.81%                     3.41%                         8.10%

QAD recovers about 1.5 points of absolute WER on Vietnamese (14.34% → 12.81%) and nudges English WER (3.47% → 3.41%) and Chinese CER (8.16% → 8.10%) back toward the FP16 baseline, with no additional labeled data required.
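
For reference, error rates of the kind reported above can be computed with the jiwer package; this is an assumption about tooling, since the card does not name the scorer actually used:

import jiwer

refs = ["the cat sat on the mat"]   # reference transcripts
hyps = ["the cat sat on a mat"]     # model outputs
print(jiwer.wer(refs, hyps))        # word error rate (English/Vietnamese)
print(jiwer.cer(refs, hyps))        # character error rate (Chinese)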


Usage

Install dependencies:

pip install torch qwen-asr nvidia-modelopt soundfile numpy

Run inference:

import soundfile as sf
import numpy as np
import torch
import modelopt.torch.opt as mto
from qwen_asr import Qwen3ASRModel

# Enable ModelOpt quantization state restore
mto.enable_huggingface_checkpointing()

model = Qwen3ASRModel.from_pretrained(
    "vrfai/Qwen3-ASR-0.6B-int4-QAD",
    dtype=torch.float16,
    device_map="cuda:0",
    max_new_tokens=256,
)

# Load audio; downmix multi-channel recordings to mono float32.
audio, sr = sf.read("your_audio.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)
audio = audio.astype(np.float32)

# language=None lets the model auto-detect the spoken language.
results = model.transcribe(audio=(audio, sr), language=None)
print(results[0].text)
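
If the transcribe API follows the base model's convention, the language can also be pinned rather than auto-detected; the call below is an assumption, not a documented signature:

# Assumption: an explicit language code skips auto-detection.
results = model.transcribe(audio=(audio, sr), language="en")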

Model Details

Property            Value
Base model          Qwen/Qwen3-ASR-0.6B
Parameters          ~0.6B
Quantization        INT4 AWQ (awq_full via NVIDIA ModelOpt)
Audio encoder       Frozen (FP16, not quantized)
LM head             Frozen (FP16, not quantized)
Quantized scope     Transformer decoder layers
Distillation data   Multilingual unlabeled speech (30 languages)
License             Apache 2.0

Citation

If you use this model, please also cite the original Qwen3-ASR work:

@misc{qwen3asr2025,
  title  = {Qwen3-ASR},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-ASR-0.6B}
}
