Qwen3-ASR-0.6B β€” INT8 + Quantization-Aware Distillation

This is a compressed and distillation-refined version of Qwen/Qwen3-ASR-0.6B.
The model applies INT8 SmoothQuant post-training quantization followed by Quantization-Aware Distillation (QAD) to recover accuracy lost during compression β€” while preserving low-latency, low-memory inference.


Method Overview

Stage 1 β€” INT8 SmoothQuant (PTQ)

The base FP16 model is quantized to INT8 using SmoothQuant, a channel-wise activation smoothing technique that migrates quantization difficulty from activations to weights. This reduces model memory footprint and accelerates inference on hardware with INT8 tensor cores.
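The smoothing transform itself is simple to sketch. Below is a minimal numpy illustration of the per-channel scale migration (the actual quantization in this model is performed by NVIDIA ModelOpt; `alpha=0.5` is the migration strength commonly used in the SmoothQuant paper and is an assumption here, not a confirmed setting):

```python
import numpy as np

def smoothquant_scales(act_absmax, w_absmax, alpha=0.5):
    # Per-channel smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1-alpha).
    # Dividing activations by s and multiplying weights by s migrates
    # activation outliers into the weights, which quantize more gracefully.
    return act_absmax ** alpha / w_absmax ** (1.0 - alpha)

# Toy linear layer: X (tokens x in_channels) @ W (in_channels x out_channels)
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))
X[:, 3] *= 50.0                    # channel 3 carries an activation outlier
W = rng.normal(size=(8, 4))

s = smoothquant_scales(np.abs(X).max(axis=0), np.abs(W).max(axis=1))
X_s, W_s = X / s, W * s[:, None]   # mathematically equivalent layer
# X @ W == X_s @ W_s, but X_s has a much flatter per-channel range,
# so INT8 activation quantization loses less precision.
```

Because the rescaling is absorbed into the weights offline, inference pays no extra cost for the smoothing.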

Stage 2 β€” Quantization-Aware Distillation (QAD)

To close the accuracy gap introduced by quantization, we apply a knowledge distillation fine-tuning stage where:

  • Teacher: original FP16 base model (frozen)
  • Student: the INT8-quantized model (trainable weights only β€” audio encoder and LM head are frozen)
  • Data: unlabeled speech data spanning 30 languages, with pseudo-labels generated by the teacher model
  • Loss: a combination of KL-divergence distillation loss and cross-entropy loss, computed exclusively on response tokens (positions after the audio-end token)
  • Optimizer: AdamW with cosine decay learning rate schedule and linear warmup

The QAD stage teaches the quantized student to match the teacher's output distribution on diverse real-world speech, without requiring any manual transcription labels.
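The loss described above can be sketched as follows. This is a simplified numpy illustration of a KL-plus-cross-entropy objective masked to response tokens; the `kd_weight` mixing coefficient is a placeholder assumption, since the actual weighting used in training is not stated:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def qad_loss(student_logits, teacher_logits, pseudo_labels, response_mask,
             kd_weight=0.5):
    """Distillation loss computed only on response tokens.
    student_logits, teacher_logits: (T, V) arrays;
    pseudo_labels: (T,) int array of teacher-generated token ids;
    response_mask: (T,) float, 1.0 at positions after the audio-end token.
    """
    p_t = softmax(teacher_logits)
    log_p_s = np.log(softmax(student_logits) + 1e-12)
    # KL(teacher || student), per token
    kl = (p_t * (np.log(p_t + 1e-12) - log_p_s)).sum(axis=-1)
    # Cross-entropy against the teacher's pseudo-labels, per token
    ce = -log_p_s[np.arange(len(pseudo_labels)), pseudo_labels]
    per_token = kd_weight * kl + (1.0 - kd_weight) * ce
    # Average over response positions only; prompt/audio tokens contribute nothing
    return (per_token * response_mask).sum() / response_mask.sum()
```

When the student matches the teacher exactly, the KL term vanishes and only the pseudo-label cross-entropy remains.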


Benchmark Results

English β€” LibriSpeech dev-clean-2

| Model | WER ↓ |
|---|---|
| Qwen3-ASR-0.6B (FP16 base) | 2.66% |
| INT8 SmoothQuant (pre-QAD) | 2.81% |
| INT8 + QAD (this model) | 2.76% |

Vietnamese β€” VIVOS test set

| Model | WER ↓ |
|---|---|
| Qwen3-ASR-0.6B (FP16 base) | 10.53% |
| INT8 SmoothQuant (pre-QAD) | 11.75% |
| INT8 + QAD (this model) | 11.55% |

QAD recovers part of the WER degradation introduced by INT8 quantization: roughly one third of the gap on LibriSpeech and about one sixth on VIVOS, with no additional labeled data required.
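The recovery fraction, the share of the PTQ-induced WER increase that QAD undoes, follows directly from the tables above:

```python
def qad_recovery(fp16_wer, ptq_wer, qad_wer):
    # Fraction of the WER degradation introduced by PTQ that QAD recovers.
    return (ptq_wer - qad_wer) / (ptq_wer - fp16_wer)

librispeech = qad_recovery(2.66, 2.81, 2.76)    # ~0.33
vivos = qad_recovery(10.53, 11.75, 11.55)       # ~0.16
```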


Usage

Install dependencies:

```bash
pip install qwen-asr nvidia-modelopt soundfile numpy
```

Run inference:

```python
import soundfile as sf
import numpy as np
import torch
import modelopt.torch.opt as mto
from qwen_asr import Qwen3ASRModel

# Enable ModelOpt quantization state restore so the INT8 quantizer
# configuration is loaded along with the checkpoint weights
mto.enable_huggingface_checkpointing()

model = Qwen3ASRModel.from_pretrained(
    "vrfai/Qwen3-ASR-0.6B-int8-QAD",
    dtype=torch.float16,
    device_map="cuda:0",
    max_new_tokens=256,
)

# Load audio and downmix multi-channel recordings to mono
audio, sr = sf.read("your_audio.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)
audio = audio.astype(np.float32)

results = model.transcribe(audio=(audio, sr), language=None)
print(results[0].text)
```
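To evaluate transcriptions the way the benchmark tables above do, word error rate can be computed with a standard word-level edit distance. This is a minimal reference implementation, not the official scoring script (which typically also applies text normalization before scoring):

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    # One-row dynamic-programming table over hypothesis words
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev = d[0]
        d[0] = i
        for j, hw in enumerate(h, 1):
            cur = min(d[j] + 1,              # deletion
                      d[j - 1] + 1,          # insertion
                      prev + (rw != hw))     # substitution (or match)
            prev, d[j] = d[j], cur
    return d[-1] / max(len(r), 1)

wer("the cat sat", "the cat sat")   # 0.0
wer("the cat sat", "the hat sat")   # one substitution out of three words
```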

Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-ASR-0.6B |
| Parameters | ~0.6B |
| Quantization | INT8 SmoothQuant (via NVIDIA ModelOpt) |
| Audio encoder | Frozen (FP16, not quantized) |
| LM head | Frozen |
| Quantized scope | Transformer decoder layers |
| Distillation data | Multilingual unlabeled speech (30 languages) |
| License | Apache 2.0 |

Citation

If you use this model, please also cite the original Qwen3-ASR work:

```bibtex
@misc{qwen3asr2025,
  title  = {Qwen3-ASR},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-ASR-0.6B}
}
```
