# Qwen2.5-VL-3B-Instruct – AWQ INT4

Real AWQ (Activation-Aware Weight Quantization) applied to Qwen/Qwen2.5-VL-3B-Instruct.

Quantized using AutoAWQ with 64 calibration samples from the Pile dataset.

## Quantization Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-VL-3B-Instruct (3.8B params) |
| Method | AWQ (Activation-Aware Weight Quantization) |
| Bits | 4-bit (INT4) |
| Group size | 128 |
| Quantized layers | 252 (WQLinear_GEMM) |
| Vision encoder | Unquantized (FP16) |
| Format | AutoAWQ GEMM |
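
For reference, the settings above correspond roughly to the following AutoAWQ call. This is a hedged sketch, not the exact script used for this checkpoint: it assumes an AutoAWQ build that supports the Qwen2.5-VL architecture, and the calibration pipeline for the 64 Pile samples is not reproduced here.

```python
# Sketch of the quantization settings described in the table above.
# Assumptions: autoawq and transformers are installed, and the installed
# AutoAWQ version supports Qwen2.5-VL; the actual calibration run
# (64 Pile samples) is not reproduced here.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-VL-3B-Instruct"
quant_path = "Qwen2.5-VL-3B-Instruct-AWQ-INT4"

quant_config = {
    "zero_point": True,   # asymmetric quantization
    "q_group_size": 128,  # group size from the table above
    "w_bit": 4,           # INT4 weights
    "version": "GEMM",    # AutoAWQ GEMM format (WQLinear_GEMM layers)
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# AWQ calibrates on text; the language-model layers are quantized while
# the vision encoder is left in FP16.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```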

## Model Size

| Format | Size |
|---|---|
| FP16 (original) | 7.1 GB |
| AWQ INT4 (this model) | 3.2 GB |
| Compression ratio | 2.2x |

## Benchmark Results (VQAv2, 30 samples)

| Metric | AWQ INT4 | FP16 Baseline |
|---|---|---|
| Exact Match | 0.811 | ~0.85 |
| Contains | 0.900 | – |
| Token F1 | 0.922 | – |
| BLEU | 0.912 | – |
| ROUGE-L | 0.922 | – |
| GPU Memory | 3,334 MB | 7,503 MB |
| Latency | 0.33 s/sample | 0.33 s/sample |

## Compute Backend

Weights are stored as packed INT4 (8 values per int32). At inference time, the Triton GEMM kernel unpacks the INT4 weights per tile and performs FP16 dot products in a fused kernel, so the weights never exist as a full FP16 matrix in GPU memory.
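
A minimal PyTorch sketch of what that packing means, assuming a simple low-to-high nibble order (the actual WQLinear_GEMM bit layout and the fused Triton unpacking are AutoAWQ implementation details and are not reproduced here):

```python
# Illustration only: 8 x 4-bit values packed into one int32, plus per-group
# dequantization with group_size = 128. The real AWQ kernel does this
# unpacking inside a fused Triton GEMM; the nibble order here is an
# assumption for demonstration, not AutoAWQ's actual layout.
import torch

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Unpack int32 values into 8x as many 4-bit integers in [0, 15]."""
    shifts = torch.arange(0, 32, 4, device=packed.device)            # 8 nibbles per int32
    nibbles = torch.bitwise_right_shift(packed.unsqueeze(-1), shifts) & 0xF
    return nibbles.reshape(*packed.shape[:-1], -1)                   # last dim grows 8x

def dequantize(q: torch.Tensor, scales: torch.Tensor, zeros: torch.Tensor,
               group_size: int = 128) -> torch.Tensor:
    """Per-group dequantization: w = (q - zero) * scale."""
    out_features, in_features = q.shape
    q = q.float().reshape(out_features, in_features // group_size, group_size)
    w = (q - zeros.unsqueeze(-1)) * scales.unsqueeze(-1)
    return w.reshape(out_features, in_features).half()
```

Calling `unpack_int4` on the stored int32 weight tensor and then `dequantize` with the per-group scales and zero points would reconstruct the FP16 weights; the fused kernel avoids ever materializing that full matrix.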

## Usage
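
A minimal loading and generation sketch, assuming a recent transformers release with Qwen2.5-VL support, autoawq installed (so the AWQ config stored in the checkpoint is picked up by `from_pretrained`), and the qwen-vl-utils helper package. The image URL is a placeholder.

```python
# Assumptions: transformers with Qwen2.5-VL support, autoawq, qwen-vl-utils.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Azaz666/Qwen2.5-VL-3B-Instruct-AWQ-INT4"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpg"},  # placeholder image
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```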

## Part of VLM Compression Benchmark

This model was quantized as part of a research project on deploying Vision-Language Models on edge devices (NVIDIA Jetson Orin Nano 8GB). See the full benchmark at the project repository.

Note: While this AWQ model works on datacenter GPUs (A6000, etc.), the AWQ CUDA/Triton kernels are not available on Jetson (aarch64). For Jetson deployment, see our custom PyTorch INT8 quantization method which uses only standard torch.matmul operations.
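
For context, the general idea of such a matmul-only path is sketched below. This is an illustration under stated assumptions (per-output-channel INT8 weights dequantized on the fly), not the benchmark's actual implementation.

```python
# Illustration of a kernel-free INT8 weight path (not the benchmark's actual
# implementation): weights are stored as int8 with per-output-channel scales
# and dequantized on the fly, so inference needs only torch.matmul.
import torch

class Int8Linear(torch.nn.Module):
    def __init__(self, weight_fp16: torch.Tensor):
        super().__init__()
        scale = weight_fp16.abs().amax(dim=1, keepdim=True) / 127.0
        scale = scale.clamp(min=1e-8)  # avoid division by zero for all-zero rows
        self.register_buffer("w_int8", torch.round(weight_fp16 / scale).to(torch.int8))
        self.register_buffer("scale", scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.w_int8.to(x.dtype) * self.scale.to(x.dtype)  # dequantize per channel
        return torch.matmul(x, w.t())                         # plain matmul, no custom kernels
```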
