# Qwen3-ASR-0.6B: INT8 + Quantization-Aware Distillation
This is a compressed and distillation-refined version of [Qwen/Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B).
The model applies INT8 SmoothQuant post-training quantization followed by Quantization-Aware Distillation (QAD) to recover the accuracy lost during compression, while preserving low-latency, low-memory inference.
## Method Overview
### Stage 1: INT8 SmoothQuant (PTQ)
The base FP16 model is quantized to INT8 using SmoothQuant, a channel-wise activation-smoothing technique that migrates quantization difficulty from activations to weights. This reduces the model's memory footprint and accelerates inference on hardware with INT8 tensor cores.
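The exact calibration setup for this checkpoint is not published; below is a minimal sketch of what Stage 1 looks like with NVIDIA ModelOpt's PTQ API, where `base_model` and `calib_loader` are hypothetical placeholders:

```python
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Calibration pass: run representative speech batches through the model so
    # ModelOpt can record activation ranges and derive per-channel SmoothQuant
    # scales (roughly s_j ~ max|X_j|^a / max|W_j|^(1-a)).
    for batch in calib_loader:  # placeholder calibration data
        model(**batch)

# Apply ModelOpt's stock INT8 SmoothQuant recipe to the FP16 base model.
quantized_model = mtq.quantize(base_model, mtq.INT8_SMOOTHQUANT_CFG, forward_loop)
```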
### Stage 2: Quantization-Aware Distillation (QAD)
To close the accuracy gap introduced by quantization, we apply a knowledge distillation fine-tuning stage where:
- Teacher: original FP16 base model (frozen)
- Student: the INT8-quantized model (only the quantized weights are trainable; the audio encoder and LM head are frozen)
- Data: unlabeled speech data spanning 30 languages, with pseudo-labels generated by the teacher model
- Loss: a combination of KL-divergence distillation loss and cross-entropy loss, computed only on response tokens (positions after the audio-end token)
- Optimizer: AdamW with cosine decay learning rate schedule and linear warmup
The QAD stage teaches the quantized student to match the teacher's output distribution on diverse real-world speech, without requiring any manual transcription labels.
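A minimal sketch of the combined objective, assuming standard temperature-scaled KL distillation; the `alpha` weighting and `temperature` values are illustrative, not the published hyperparameters:

```python
import torch
import torch.nn.functional as F

def qad_loss(student_logits, teacher_logits, pseudo_labels, response_mask,
             alpha=0.5, temperature=2.0):
    """Distillation objective computed on response tokens only.

    student_logits, teacher_logits: (batch, seq, vocab)
    pseudo_labels: (batch, seq) token ids generated by the FP16 teacher
    response_mask: (batch, seq) bool, True for positions after the audio-end token
    alpha, temperature: assumed hyperparameters
    """
    # KL divergence between temperature-softened teacher and student distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="none",
    ).sum(-1) * (temperature ** 2)

    # Cross-entropy against the teacher-generated pseudo-labels.
    ce = F.cross_entropy(
        student_logits.transpose(1, 2), pseudo_labels, reduction="none"
    )

    # Average both terms over response-token positions only.
    mask = response_mask.float()
    denom = mask.sum().clamp(min=1.0)
    return (alpha * kl * mask).sum() / denom + ((1 - alpha) * ce * mask).sum() / denom
```

The optimizer setup can be reproduced with `torch.optim.AdamW` and a warmup-then-cosine schedule, e.g. `torch.optim.lr_scheduler.SequentialLR` combining `LinearLR` and `CosineAnnealingLR`.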
## Benchmark Results
### English: LibriSpeech dev-clean-2
| Model | WER ↓ |
|---|---|
| Qwen3-ASR-0.6B (FP16 base) | 2.66% |
| INT8 SmoothQuant (pre-QAD) | 2.81% |
| INT8 + QAD (this model) | 2.76% |
### Vietnamese: VIVOS test set
| Model | WER ↓ |
|---|---|
| Qwen3-ASR-0.6B (FP16 base) | 10.53% |
| INT8 SmoothQuant (pre-QAD) | 11.75% |
| INT8 + QAD (this model) | 11.55% |
QAD recovers part of the WER degradation introduced by INT8 quantization (0.05 of the 0.15-point gap on LibriSpeech, 0.20 of the 1.22-point gap on VIVOS), with no additional labeled data required.
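For reference, WER for comparisons like these can be computed with the `jiwer` package (a generic sketch with placeholder data; the exact text normalization behind the numbers above is not published):

```python
import jiwer

references = ["hello world"]   # placeholder ground-truth transcripts
hypotheses = ["hello word"]    # placeholder model outputs
print(f"WER: {jiwer.wer(references, hypotheses):.2%}")
```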
## Usage
Install dependencies:
```bash
pip install qwen-asr nvidia-modelopt soundfile numpy
```
Run inference:
```python
import soundfile as sf
import numpy as np
import torch
import modelopt.torch.opt as mto
from qwen_asr import Qwen3ASRModel

# Enable ModelOpt quantization state restore so the INT8 weights and
# quantizer states are loaded together with the checkpoint.
mto.enable_huggingface_checkpointing()

model = Qwen3ASRModel.from_pretrained(
    "vrfai/Qwen3-ASR-0.6B-int8-QAD",
    dtype=torch.float16,
    device_map="cuda:0",
    max_new_tokens=256,
)

# Load audio and downmix multi-channel recordings to mono float32.
audio, sr = sf.read("your_audio.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)
audio = audio.astype(np.float32)

results = model.transcribe(audio=(audio, sr), language=None)
print(results[0].text)
```
## Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-ASR-0.6B |
| Parameters | ~0.6B |
| Quantization | INT8 SmoothQuant (via NVIDIA ModelOpt) |
| Audio encoder | Frozen (FP16, not quantized) |
| LM head | Frozen |
| Quantized scope | Transformer decoder layers |
| Distillation data | Multilingual unlabeled speech (30 languages) |
| License | Apache 2.0 |
## Citation
If you use this model, please also cite the original Qwen3-ASR work:
```bibtex
@misc{qwen3asr2025,
  title  = {Qwen3-ASR},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-ASR-0.6B}
}
```
## Acknowledgements
- Qwen Team for the base ASR model
- NVIDIA ModelOpt for quantization tooling