Qwen3-ASR-0.6B β€” INT8 + Quantization-Aware Distillation

This is a compressed and distillation-refined version of Qwen/Qwen3-ASR-0.6B.
The model applies INT8 SmoothQuant post-training quantization followed by Quantization-Aware Distillation (QAD) to recover accuracy lost during compression β€” while preserving low-latency, low-memory inference.


Method Overview

Stage 1 β€” INT8 SmoothQuant (PTQ)

The base FP16 model is quantized to INT8 using SmoothQuant, a channel-wise activation smoothing technique that migrates quantization difficulty from activations to weights. This reduces model memory footprint and accelerates inference on hardware with INT8 tensor cores.
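The smoothing transform itself is simple to sketch. Below is a minimal numpy illustration of the per-channel scale migration (the actual quantization in this model is performed by NVIDIA ModelOpt; `alpha=0.5` is the migration strength commonly used in the SmoothQuant paper and is an assumption here, not a confirmed setting):

```python
import numpy as np

def smoothquant_scales(act_absmax, w_absmax, alpha=0.5):
    # Per-channel smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1-alpha).
    # Dividing activations by s and multiplying weights by s migrates
    # activation outliers into the weights, which quantize more gracefully.
    return act_absmax ** alpha / w_absmax ** (1.0 - alpha)

# Toy linear layer: X (tokens x in_channels) @ W (in_channels x out_channels)
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))
X[:, 3] *= 50.0                    # channel 3 carries an activation outlier
W = rng.normal(size=(8, 4))

s = smoothquant_scales(np.abs(X).max(axis=0), np.abs(W).max(axis=1))
X_s, W_s = X / s, W * s[:, None]   # mathematically equivalent layer
# X @ W == X_s @ W_s, but X_s has a much flatter per-channel range,
# so INT8 activation quantization loses less precision.
```

Because the rescaling is absorbed into the weights offline, inference pays no extra cost for the smoothing.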

Stage 2 β€” Quantization-Aware Distillation (QAD)

To close the accuracy gap introduced by quantization, we apply a knowledge distillation fine-tuning stage where:

  • Teacher: original FP16 base model (frozen)
  • Student: the INT8-quantized model (trainable weights only β€” audio encoder and LM head are frozen)
  • Data: unlabeled speech data spanning 30 languages, with pseudo-labels generated by the teacher model
  • Loss: a combination of KL-divergence distillation loss and cross-entropy loss, computed exclusively on response tokens (positions after the audio-end token)
  • Optimizer: AdamW with cosine decay learning rate schedule and linear warmup

The QAD stage teaches the quantized student to match the teacher's output distribution on diverse real-world speech, without requiring any manual transcription labels.
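The loss described above can be sketched as follows. This is a simplified numpy illustration of a KL-plus-cross-entropy objective masked to response tokens; the `kd_weight` mixing coefficient is a placeholder assumption, since the actual weighting used in training is not stated:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def qad_loss(student_logits, teacher_logits, pseudo_labels, response_mask,
             kd_weight=0.5):
    """Distillation loss computed only on response tokens.
    student_logits, teacher_logits: (T, V) arrays;
    pseudo_labels: (T,) int array of teacher-generated token ids;
    response_mask: (T,) float, 1.0 at positions after the audio-end token.
    """
    p_t = softmax(teacher_logits)
    log_p_s = np.log(softmax(student_logits) + 1e-12)
    # KL(teacher || student), per token
    kl = (p_t * (np.log(p_t + 1e-12) - log_p_s)).sum(axis=-1)
    # Cross-entropy against the teacher's pseudo-labels, per token
    ce = -log_p_s[np.arange(len(pseudo_labels)), pseudo_labels]
    per_token = kd_weight * kl + (1.0 - kd_weight) * ce
    # Average over response positions only; prompt/audio tokens contribute nothing
    return (per_token * response_mask).sum() / response_mask.sum()
```

When the student matches the teacher exactly, the KL term vanishes and only the pseudo-label cross-entropy remains.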


Benchmark Results

English β€” LibriSpeech dev-clean-2

| Model | WER ↓ |
|---|---|
| Qwen3-ASR-0.6B (FP16 base) | 2.66% |
| INT8 SmoothQuant (pre-QAD) | 2.81% |
| INT8 + QAD (this model) | 2.76% |

Vietnamese β€” VIVOS test set

| Model | WER ↓ |
|---|---|
| Qwen3-ASR-0.6B (FP16 base) | 10.53% |
| INT8 SmoothQuant (pre-QAD) | 11.75% |
| INT8 + QAD (this model) | 11.55% |

QAD recovers part of the WER degradation introduced by INT8 quantization: roughly one third of the gap on LibriSpeech and about one sixth on VIVOS, with no additional labeled data required.
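The recovery fraction, the share of the PTQ-induced WER increase that QAD undoes, follows directly from the tables above:

```python
def qad_recovery(fp16_wer, ptq_wer, qad_wer):
    # Fraction of the WER degradation introduced by PTQ that QAD recovers.
    return (ptq_wer - qad_wer) / (ptq_wer - fp16_wer)

librispeech = qad_recovery(2.66, 2.81, 2.76)    # ~0.33
vivos = qad_recovery(10.53, 11.75, 11.55)       # ~0.16
```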


Usage

Install dependencies:

```bash
pip install qwen-asr nvidia-modelopt soundfile numpy
```

Run inference:

```python
import soundfile as sf
import numpy as np
import torch
import modelopt.torch.opt as mto
from qwen_asr import Qwen3ASRModel

# Enable ModelOpt quantization state restore so the INT8 quantizer
# configuration is loaded along with the checkpoint weights
mto.enable_huggingface_checkpointing()

model = Qwen3ASRModel.from_pretrained(
    "vrfai/Qwen3-ASR-0.6B-int8-QAD",
    dtype=torch.float16,
    device_map="cuda:0",
    max_new_tokens=256,
)

# Load audio and downmix multi-channel recordings to mono
audio, sr = sf.read("your_audio.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)
audio = audio.astype(np.float32)

results = model.transcribe(audio=(audio, sr), language=None)
print(results[0].text)
```
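To evaluate transcriptions the way the benchmark tables above do, word error rate can be computed with a standard word-level edit distance. This is a minimal reference implementation, not the official scoring script (which typically also applies text normalization before scoring):

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    # One-row dynamic-programming table over hypothesis words
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev = d[0]
        d[0] = i
        for j, hw in enumerate(h, 1):
            cur = min(d[j] + 1,              # deletion
                      d[j - 1] + 1,          # insertion
                      prev + (rw != hw))     # substitution (or match)
            prev, d[j] = d[j], cur
    return d[-1] / max(len(r), 1)

wer("the cat sat", "the cat sat")   # 0.0
wer("the cat sat", "the hat sat")   # one substitution out of three words
```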

Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-ASR-0.6B |
| Parameters | ~0.6B |
| Quantization | INT8 SmoothQuant (via NVIDIA ModelOpt) |
| Audio encoder | Frozen (FP16, not quantized) |
| LM head | Frozen |
| Quantized scope | Transformer decoder layers |
| Distillation data | Multilingual unlabeled speech (30 languages) |
| License | Apache 2.0 |

Citation

If you use this model, please also cite the original Qwen3-ASR work:

```bibtex
@misc{qwen3asr2025,
  title  = {Qwen3-ASR},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-ASR-0.6B}
}
```
