# Qwen2.5-VL-3B-Instruct – AWQ INT4
Real AWQ (Activation-Aware Weight Quantization) applied to Qwen/Qwen2.5-VL-3B-Instruct.
Quantized using AutoAWQ with 64 calibration samples from the Pile dataset.
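The quantization recipe described above can be sketched with AutoAWQ's standard API. This is an illustrative outline, not the exact script used for this checkpoint; the output directory name is an assumption, and AutoAWQ's handling of multimodal models may vary by version.

```python
# Sketch of the AutoAWQ recipe (illustrative; paths and kwargs are assumptions).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

MODEL_PATH = "Qwen/Qwen2.5-VL-3B-Instruct"
QUANT_CONFIG = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

def quantize(save_dir="Qwen2.5-VL-3B-Instruct-AWQ-INT4"):
    model = AutoAWQForCausalLM.from_pretrained(MODEL_PATH)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
    # AWQ rescales salient weight channels using activation statistics
    # gathered from calibration text, then rounds weights to 4 bits.
    model.quantize(tokenizer, quant_config=QUANT_CONFIG)
    model.save_quantized(save_dir)
```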
## Quantization Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-VL-3B-Instruct (3.8B params) |
| Method | AWQ (Activation-Aware Weight Quantization) |
| Bits | 4-bit (INT4) |
| Group size | 128 |
| Quantized layers | 252 (WQLinear_GEMM) |
| Vision encoder | Unquantized (FP16) |
| Format | AutoAWQ GEMM |
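The group-size-128 scheme in the table means each run of 128 weights shares one FP scale and zero-point. A minimal NumPy sketch of this asymmetric per-group INT4 quantization (the rounding step only — AWQ additionally rescales channels by activation salience first):

```python
import numpy as np

def quantize_groupwise(w, group_size=128, bits=4):
    """Asymmetric per-group quantization: each group of `group_size` weights
    shares one FP scale and zero-point, as in group-size-128 AWQ."""
    qmax = 2**bits - 1                      # 15 for 4-bit
    w = w.reshape(-1, group_size)
    wmin = w.min(axis=1, keepdims=True)
    wmax = w.max(axis=1, keepdims=True)
    scale = (wmax - wmin) / qmax
    scale[scale == 0] = 1.0                 # guard against constant groups
    zero = np.round(-wmin / scale)
    q = np.clip(np.round(w / scale) + zero, 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize(q, scale, zero):
    """Recover approximate FP weights from INT4 codes."""
    return (q.astype(np.float32) - zero) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 128)).astype(np.float32)
q, scale, zero = quantize_groupwise(w)
w_hat = dequantize(q, scale, zero)
# per-element reconstruction error is bounded by half a quantization step
```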
## Model Size
| Format | Size |
|---|---|
| FP16 (original) | 7.1 GB |
| AWQ INT4 (this) | 3.2 GB |
| Compression ratio | 2.2x |
## Benchmark Results (VQAv2, 30 samples)
| Metric | AWQ INT4 | FP16 Baseline |
|---|---|---|
| Exact Match | 0.811 | ~0.85 |
| Contains | 0.900 | – |
| Token F1 | 0.922 | – |
| BLEU | 0.912 | – |
| ROUGE-L | 0.922 | – |
| GPU Memory | 3,334 MB | 7,503 MB |
| Latency | 0.33 s/sample | 0.33 s/sample |
## Compute Backend
Weights are stored in packed INT4 format (8 values per int32). At inference time, the Triton GEMM kernel unpacks INT4 weights per-tile and performs FP16 dot products in a fused kernel; the weights never exist as a full FP16 matrix in GPU memory.
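The 8-values-per-int32 layout can be illustrated with plain bit operations. Note that AutoAWQ's actual GEMM format interleaves nibbles in a kernel-specific order; this sketch uses simple little-endian nibble order for clarity.

```python
def pack_int4(vals):
    """Pack 8 unsigned 4-bit values (0..15) into one 32-bit word."""
    assert len(vals) == 8 and all(0 <= v < 16 for v in vals)
    word = 0
    for i, v in enumerate(vals):
        word |= v << (4 * i)    # nibble i occupies bits [4i, 4i+4)
    return word

def unpack_int4(word):
    """Recover the 8 nibbles, as the GEMM kernel does per tile."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

vals = [3, 15, 0, 7, 9, 1, 12, 5]
packed = pack_int4(vals)        # one int32 word holding all 8 values
assert unpack_int4(packed) == vals
```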
## Usage
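A minimal inference sketch, assuming a recent transformers release with Qwen2.5-VL support, `autoawq` installed for the AWQ weights, and the `qwen-vl-utils` helper package; `"demo.jpg"` is a placeholder path, not a file shipped with this repo.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Azaz666/Qwen2.5-VL-3B-Instruct-AWQ-INT4"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "demo.jpg"},   # placeholder image path
        {"type": "text", "text": "What is in this image?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=64)
# decode only the newly generated tokens
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])
```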
## Part of VLM Compression Benchmark
This model was quantized as part of a research project on deploying Vision-Language Models on edge devices (NVIDIA Jetson Orin Nano 8GB). See the full benchmark at the project repository.
**Note:** While this AWQ model works on datacenter GPUs (A6000, etc.), the AWQ CUDA/Triton kernels are not available on Jetson (aarch64). For Jetson deployment, see our custom PyTorch INT8 quantization method, which uses only standard torch.matmul operations.