Qwen2.5-VL-3B-Instruct-W4A16-nova_floor

Quantized with the NOVA quantization pipeline on 2026-04-27. Base model: Qwen/Qwen2.5-VL-3B-Instruct

Quantization details

Parameter        Value
Method           W4A16
Group size       128
Calibration      nova_floor
Ignored modules  re:.*lm_head, re:.*visual.*
Tool             llm-compressor >= 0.4.2
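
To make the Method and Group size rows concrete, here is a minimal sketch of group-wise 4-bit weight quantization in plain Python. It assumes symmetric rounding to the signed range [-8, 7] with one scale per group of 128 weights. The card does not state the rounding algorithm, so this illustrates the storage idea only and is not the NOVA pipeline's code; all function names are hypothetical.

```python
import random

GROUP_SIZE = 128          # from the quantization table
QMIN, QMAX = -8, 7        # signed 4-bit range (assumed symmetric scheme)


def quantize_group(weights):
    """Quantize one group to 4-bit integers plus a single float scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    q = [min(QMAX, max(QMIN, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Recover approximate weights; activations stay 16-bit (the A16 part)."""
    return [v * scale for v in q]


def quantize_row(row, group_size=GROUP_SIZE):
    """Split a weight row into consecutive groups and quantize each one."""
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [quantize_group(g) for g in groups]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]   # toy weight row
packed = quantize_row(row)                            # two groups of 128
restored = [w for q, s in packed for w in dequantize_group(q, s)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
```

Each group stores its own scale, so the rounding error of any weight is bounded by half the scale of its group. Smaller groups give tighter scales at the cost of more stored metadata.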

Benchmark results

Metric                               Value
Perplexity (wikitext-2, 20 samples)  20.703
OCR sanity check                     ✅ PASS
Tokens / second                      1.1
TTFT (exact, prefill only)           978.3 ms
TPOT (exact, per output token)       933.2 ms
Inference VRAM                       9.13 GB
Disk size                            3.41 GB
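
For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that calculation on made-up log-probabilities. The function name and inputs are illustrative, and the card does not describe how the 20 wikitext-2 samples were windowed or aggregated.

```python
import math


def perplexity(token_logprobs):
    """exp of the average negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


# Toy check: if every token has probability 1/20, perplexity is 20.
uniform = [math.log(1 / 20)] * 50
ppl = perplexity(uniform)
```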

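TTFT and TPOT were measured by injecting a BaseStreamer into generation. The streamer's initial prompt callback is skipped so that the prompt is not counted as an output token.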
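
One way such a measurement could be set up is sketched below. It relies on transformers' `generate()` calling `streamer.put(...)` once with the prompt ids before any new token, then once per generated token, and `streamer.end()` at the end. The class name, the `start()` helper and the injectable `clock` argument are hypothetical. The class is duck-typed so it runs without transformers installed; in practice it would probably subclass `BaseStreamer`. The pipeline's actual measurement code is not shown in this card.

```python
import time


class TimingStreamer:
    """Duck-typed streamer: generate() only needs put() and end()."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._start = None
        self._stamps = []          # one timestamp per generated token
        self._prompt_seen = False

    def start(self):
        """Call immediately before model.generate(..., streamer=self)."""
        self._start = self._clock()

    def put(self, value):
        # The first put() carries the prompt ids, not a new token: skip it.
        if not self._prompt_seen:
            self._prompt_seen = True
            return
        self._stamps.append(self._clock())

    def end(self):
        pass

    def ttft_ms(self):
        """Time from start() to the first generated token."""
        return (self._stamps[0] - self._start) * 1000.0

    def tpot_ms(self):
        """Mean gap between consecutive generated tokens after the first."""
        gaps = [b - a for a, b in zip(self._stamps, self._stamps[1:])]
        return sum(gaps) / len(gaps) * 1000.0
```

As a cross-check on the table, a TPOT of 933.2 ms corresponds to 1000 / 933.2 ≈ 1.07 tokens per second, which agrees with the reported 1.1.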

Registry notes

  • Use llm-compressor==0.5.1 + transformers==4.51.3 for quantization.
  • Use vLLM>=0.7.2 for inference; Marlin kernels are active on Ampere GPUs.
  • The projector (model.visual.merger) is kept at FP32; it is excluded from quantization by the re:.*visual.* ignore pattern.
  • OCR and bounding-box grounding regress 5x faster than MMMU under aggressive quantization.
  • Keep the merger at FP32, not BF16, for best bounding-box coordinate precision.
  • W8A8 requires an A100 for calibration; collecting activation statistics needs >24 GB of VRAM.
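
The ignore patterns in the quantization table decide which modules stay unquantized, including the projector mentioned in these notes. The sketch below checks a few module names against them, assuming the `re:` prefix marks a regular expression applied from the start of the module name with `re.match`. The helper name is hypothetical, and the module names other than `model.visual.merger` are illustrative rather than read from the checkpoint.

```python
import re

IGNORE = ["re:.*lm_head", "re:.*visual.*"]   # from the quantization table


def is_ignored(module_name, patterns=IGNORE):
    """True if any pattern matches; 're:' entries are regexes, others exact names."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


names = [
    "model.visual.merger",                    # projector named in the notes
    "model.visual.blocks.0.attn.qkv",         # illustrative vision-tower layer
    "model.layers.0.self_attn.q_proj",        # illustrative decoder layer
    "lm_head",
]
result = {name: is_ignored(name) for name in names}
```

Under these assumptions everything below `visual` and the output head are skipped, while decoder projections such as `q_proj` remain candidates for W4A16 quantization.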

Usage

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Mohaaxa/Qwen2.5-VL-3B-Instruct-W4A16-nova_floor"

# Load the quantized checkpoint and place it on the available device(s).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# The processor handles both image preprocessing and text tokenization.
processor = AutoProcessor.from_pretrained(MODEL_ID)

Citation

If you use this model in research, please cite the NOVA project. Pipeline source: Mohaaxa/nova-quant-pipeline
