Qwen2.5-VL-3B-Instruct-W4A16-generic
Quantized with the NOVA quantization pipeline on 2026-04-27. Base model: Qwen/Qwen2.5-VL-3B-Instruct
Quantization details
| Parameter | Value |
|---|---|
| Method | W4A16 |
| Group size | 128 |
| Calibration | generic |
| Ignored modules | re:.*lm_head, re:.*visual.* |
| Tool | llm-compressor >= 0.4.2 |
Benchmark results
| Metric | Value |
|---|---|
| Perplexity (wikitext-2, 20 samples) | 20.466 |
| OCR sanity check | ✅ PASS |
| Tokens / second | 1.1 |
| TTFT (exact, prefill only) | 1094.5 ms |
| TPOT (exact, per output token) | 915.9 ms |
| Inference VRAM | 9.15 GB |
| Disk size | 3.41 GB |
TTFT and TPOT measured with
BaseStreamerinjection (prompt-skip corrected).
Registry notes
- Use llm-compressor==0.5.1 + transformers==4.51.3 for quantization.
- Use vLLM>=0.7.2 for inference — Marlin kernels active on Ampere.
- Projector (model.visual.merger) kept at FP32 — matched by visual.* regex.
- OCR and bbox grounding regress 5x faster than MMMU under aggressive quant.
- Keep merger at FP32, not BF16, for best bbox coordinate precision.
- W8A8 requires A100 for calibration — activation statistics need >24GB VRAM.
Usage
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"Mohaaxa/Qwen2.5-VL-3B-Instruct-W4A16-generic",
torch_dtype=torch.bfloat16,
device_map="auto",
)
processor = AutoProcessor.from_pretrained("Mohaaxa/Qwen2.5-VL-3B-Instruct-W4A16-generic")
Citation
If you use this model in research, please cite the NOVA project.
Pipeline source: Mohaaxa/nova-quant-pipeline
- Downloads last month
- 201
Model tree for Mohaaxa/Qwen2.5-VL-3B-Instruct-W4A16-generic
Base model
Qwen/Qwen2.5-VL-3B-Instruct