Qwen2.5-VL-3B-Instruct-W4A16-generic

Quantized with the NOVA quantization pipeline on 2026-04-27. Base model: Qwen/Qwen2.5-VL-3B-Instruct

Quantization details

Parameter	Value
Method	`W4A16`
Group size	128
Calibration	`generic`
Ignored modules	`re:.lm_head, re:.visual.*`
Tool	`llm-compressor >= 0.4.2`

Benchmark results

Metric	Value
Perplexity (wikitext-2, 20 samples)	20.466
OCR sanity check	✅ PASS
Tokens / second	1.1
TTFT (exact, prefill only)	1094.5 ms
TPOT (exact, per output token)	915.9 ms
Inference VRAM	9.15 GB
Disk size	3.41 GB

TTFT and TPOT measured with BaseStreamer injection (prompt-skip corrected).

Registry notes

Use llm-compressor==0.5.1 + transformers==4.51.3 for quantization.
Use vLLM>=0.7.2 for inference — Marlin kernels active on Ampere.
Projector (model.visual.merger) kept at FP32 — matched by visual.* regex.
OCR and bbox grounding regress 5x faster than MMMU under aggressive quant.
Keep merger at FP32, not BF16, for best bbox coordinate precision.
W8A8 requires A100 for calibration — activation statistics need >24GB VRAM.

Usage

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Mohaaxa/Qwen2.5-VL-3B-Instruct-W4A16-generic",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Mohaaxa/Qwen2.5-VL-3B-Instruct-W4A16-generic")

Citation

If you use this model in research, please cite the NOVA project. Pipeline source: Mohaaxa/nova-quant-pipeline

Downloads last month: 201

Safetensors

Model size

4B params

Tensor type

I64

I32

BF16

Model tree for Mohaaxa/Qwen2.5-VL-3B-Instruct-W4A16-generic

Base model

Qwen/Qwen2.5-VL-3B-Instruct

Quantized

(81)

this model