Qwen3-ASR-0.6B β€” INT4 AWQ + Quantization-Aware Distillation

This is a compressed and distillation-refined version of Qwen/Qwen3-ASR-0.6B.
The model applies INT4 AWQ (Activation-aware Weight Quantization) post-training quantization, followed by Quantization-Aware Distillation (QAD) to recover the accuracy lost during aggressive 4-bit compression while preserving low-latency, low-memory inference.


Method Overview

Stage 1 β€” INT4 AWQ (PTQ)

The base FP16 model is quantized to INT4 with the AWQ (awq_full) algorithm via NVIDIA ModelOpt. To preserve acoustic feature extraction and final token prediction, the audio_tower and lm_head are excluded from quantization and kept in FP16. Calibration ran over a multilingual mix of English, Chinese, and 28 other languages so that quantization scales stay balanced across languages.
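
A minimal sketch of this PTQ step is shown below, assuming ModelOpt's stock INT4 AWQ recipe. The module-name patterns, the algorithm field layout, and calib_loader are illustrative assumptions, not the exact recipe used for this checkpoint:

import copy
import modelopt.torch.quantization as mtq

# Start from ModelOpt's stock INT4 AWQ config and switch to the full AWQ
# search (the field layout is an assumption; check your ModelOpt version).
config = copy.deepcopy(mtq.INT4_AWQ_CFG)
config["algorithm"] = {"method": "awq_full"}

# Keep the acoustic encoder and output projection in FP16 by disabling
# their quantizers (wildcard patterns are illustrative).
config["quant_cfg"]["*audio_tower*"] = {"enable": False}
config["quant_cfg"]["*lm_head*"] = {"enable": False}

def forward_loop(model):
    # Feed multilingual calibration batches so AWQ can observe activation
    # ranges; calib_loader is a hypothetical DataLoader over 30 languages.
    for batch in calib_loader:
        model(**batch)

model = mtq.quantize(model, config, forward_loop)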

Stage 2 β€” Quantization-Aware Distillation (QAD)

To close the accuracy gap introduced by heavy 4-bit quantization, we apply a knowledge distillation fine-tuning stage where:

  • Teacher: Qwen3-ASR-1.7B FP16 model (frozen)
  • Student: the INT4-quantized 0.6B model (trainable quantized QKV/MLP weights; audio encoder and LM head frozen)
  • Data: unlabeled speech data spanning 30 languages, with pseudo-labels generated by the 1.7B teacher model
  • Loss: a combination of KL-divergence distillation loss (alpha_kd = 0.5) and cross-entropy loss, sketched after this list
  • Optimizer: AdamW with cosine decay learning rate schedule
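
A minimal sketch of the combined objective, assuming a standard temperature-scaled KL formulation and a convex alpha_kd blend (the temperature, learning rate, step count, and the hypothetical `student` variable are assumptions; alpha_kd = 0.5 is from the recipe above):

import torch
import torch.nn.functional as F

def qad_loss(student_logits, teacher_logits, labels, alpha_kd=0.5, temperature=1.0):
    # Temperature-scaled KL between teacher and student token distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Cross-entropy against the 1.7B teacher's pseudo-labels (-100 = padding).
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha_kd * kd + (1.0 - alpha_kd) * ce

# AdamW + cosine decay, as listed above; `student` is the INT4 0.6B model
# with only the quantized decoder weights left trainable.
optimizer = torch.optim.AdamW(
    (p for p in student.parameters() if p.requires_grad), lr=1e-5
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)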

Benchmark Results (Trilingual Evaluation)

Model                      VIVOS (Vietnamese) WER ↓   LibriSpeech (English) WER ↓   Fleurs (Chinese) CER ↓
Teacher 1.7B (FP16 base)   7.24%                      2.32%                         7.12%
INT4 AWQ (pre-QAD)         14.34%                     3.47%                         8.16%
INT4 + QAD (this model)    12.81%                     3.41%                         8.10%

QAD recovers about 1.5 points of absolute WER on Vietnamese (14.34% → 12.81%) and nudges English WER (3.47% → 3.41%) and Chinese CER (8.16% → 8.10%) back toward the FP16 baseline, with no additional labeled data required.
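
For reference, error rates of the kind reported above can be computed with the jiwer package; this is an assumption about tooling, since the card does not name the scorer actually used:

import jiwer

refs = ["the cat sat on the mat"]   # reference transcripts
hyps = ["the cat sat on a mat"]     # model outputs
print(jiwer.wer(refs, hyps))        # word error rate (English/Vietnamese)
print(jiwer.cer(refs, hyps))        # character error rate (Chinese)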


Usage

Install dependencies:

pip install torch qwen-asr nvidia-modelopt soundfile numpy

Run inference:

import soundfile as sf
import numpy as np
import torch
import modelopt.torch.opt as mto
from qwen_asr import Qwen3ASRModel

# Enable ModelOpt quantization state restore
mto.enable_huggingface_checkpointing()

model = Qwen3ASRModel.from_pretrained(
    "vrfai/Qwen3-ASR-0.6B-int4-QAD",
    dtype=torch.float16,
    device_map="cuda:0",
    max_new_tokens=256,
)

# Load audio; downmix multi-channel recordings to mono float32.
audio, sr = sf.read("your_audio.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)
audio = audio.astype(np.float32)

# language=None lets the model auto-detect the spoken language.
results = model.transcribe(audio=(audio, sr), language=None)
print(results[0].text)
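
If the transcribe API follows the base model's convention, the language can also be pinned rather than auto-detected; the call below is an assumption, not a documented signature:

# Assumption: an explicit language code skips auto-detection.
results = model.transcribe(audio=(audio, sr), language="en")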

Model Details

Property            Value
Base model          Qwen/Qwen3-ASR-0.6B
Parameters          ~0.6B
Quantization        INT4 AWQ (awq_full via NVIDIA ModelOpt)
Audio encoder       Frozen (FP16, not quantized)
LM head             Frozen (FP16, not quantized)
Quantized scope     Transformer decoder layers
Distillation data   Multilingual unlabeled speech (30 languages)
License             Apache 2.0

Citation

If you use this model, please also cite the original Qwen3-ASR work:

@misc{qwen3asr2025,
  title  = {Qwen3-ASR},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-ASR-0.6B}
}
