# Qwen2.5-VL-3B-Instruct – AWQ INT4
Real AWQ (Activation-Aware Weight Quantization) applied to Qwen/Qwen2.5-VL-3B-Instruct.
Quantized using AutoAWQ with 64 calibration samples from the Pile dataset.
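The quantization recipe described above can be sketched with AutoAWQ's standard API. This is an illustrative outline, not the exact script used for this checkpoint; the output directory name is an assumption, and AutoAWQ's handling of multimodal models may vary by version.

```python
# Sketch of the AutoAWQ recipe (illustrative; paths and kwargs are assumptions).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

MODEL_PATH = "Qwen/Qwen2.5-VL-3B-Instruct"
QUANT_CONFIG = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

def quantize(save_dir="Qwen2.5-VL-3B-Instruct-AWQ-INT4"):
    model = AutoAWQForCausalLM.from_pretrained(MODEL_PATH)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
    # AWQ rescales salient weight channels using activation statistics
    # gathered from calibration text, then rounds weights to 4 bits.
    model.quantize(tokenizer, quant_config=QUANT_CONFIG)
    model.save_quantized(save_dir)
```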
## Quantization Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-VL-3B-Instruct (3.8B params) |
| Method | AWQ (Activation-Aware Weight Quantization) |
| Bits | 4-bit (INT4) |
| Group size | 128 |
| Quantized layers | 252 (WQLinear_GEMM) |
| Vision encoder | Unquantized (FP16) |
| Format | AutoAWQ GEMM |
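The group-size-128 scheme in the table means each run of 128 weights shares one FP scale and zero-point. A minimal NumPy sketch of this asymmetric per-group INT4 quantization (the rounding step only — AWQ additionally rescales channels by activation salience first):

```python
import numpy as np

def quantize_groupwise(w, group_size=128, bits=4):
    """Asymmetric per-group quantization: each group of `group_size` weights
    shares one FP scale and zero-point, as in group-size-128 AWQ."""
    qmax = 2**bits - 1                      # 15 for 4-bit
    w = w.reshape(-1, group_size)
    wmin = w.min(axis=1, keepdims=True)
    wmax = w.max(axis=1, keepdims=True)
    scale = (wmax - wmin) / qmax
    scale[scale == 0] = 1.0                 # guard against constant groups
    zero = np.round(-wmin / scale)
    q = np.clip(np.round(w / scale) + zero, 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize(q, scale, zero):
    """Recover approximate FP weights from INT4 codes."""
    return (q.astype(np.float32) - zero) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 128)).astype(np.float32)
q, scale, zero = quantize_groupwise(w)
w_hat = dequantize(q, scale, zero)
# per-element reconstruction error is bounded by half a quantization step
```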
## Model Size
| Format | Size |
|---|---|
| FP16 (original) | 7.1 GB |
| AWQ INT4 (this) | 3.2 GB |
| Compression ratio | 2.2x |
## Benchmark Results (VQAv2, 30 samples)
| Metric | AWQ INT4 | FP16 Baseline |
|---|---|---|
| Exact Match | 0.811 | ~0.85 |
| Contains | 0.900 | – |
| Token F1 | 0.922 | – |
| BLEU | 0.912 | – |
| ROUGE-L | 0.922 | – |
| GPU Memory | 3,334 MB | 7,503 MB |
| Latency | 0.33 s/sample | 0.33 s/sample |
## Compute Backend
Weights are stored in packed INT4 format (8 values per int32). At inference time, the Triton GEMM kernel unpacks INT4 weights per-tile and performs FP16 dot products in a fused kernel; the weights never exist as a full FP16 matrix in GPU memory.
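The 8-values-per-int32 layout can be illustrated with plain bit operations. Note that AutoAWQ's actual GEMM format interleaves nibbles in a kernel-specific order; this sketch uses simple little-endian nibble order for clarity.

```python
def pack_int4(vals):
    """Pack 8 unsigned 4-bit values (0..15) into one 32-bit word."""
    assert len(vals) == 8 and all(0 <= v < 16 for v in vals)
    word = 0
    for i, v in enumerate(vals):
        word |= v << (4 * i)    # nibble i occupies bits [4i, 4i+4)
    return word

def unpack_int4(word):
    """Recover the 8 nibbles, as the GEMM kernel does per tile."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

vals = [3, 15, 0, 7, 9, 1, 12, 5]
packed = pack_int4(vals)        # one int32 word holding all 8 values
assert unpack_int4(packed) == vals
```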
## Usage
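A minimal inference sketch, assuming a recent transformers release with Qwen2.5-VL support, `autoawq` installed for the AWQ weights, and the `qwen-vl-utils` helper package; `"demo.jpg"` is a placeholder path, not a file shipped with this repo.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Azaz666/Qwen2.5-VL-3B-Instruct-AWQ-INT4"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "demo.jpg"},   # placeholder image path
        {"type": "text", "text": "What is in this image?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=64)
# decode only the newly generated tokens
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])
```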
## Part of VLM Compression Benchmark
This model was quantized as part of a research project on deploying Vision-Language Models on edge devices (NVIDIA Jetson Orin Nano 8GB). See the full benchmark at the project repository.
**Note:** While this AWQ model works on datacenter GPUs (A6000, etc.), the AWQ CUDA/Triton kernels are not available on Jetson (aarch64). For Jetson deployment, see our custom PyTorch INT8 quantization method, which uses only standard torch.matmul operations.