This is a clean HQQ 4-bit quantized version of Qwen2.5-VL-7B-Instruct with no meta tensor issues. It was quantized with the following HQQ configuration:

HqqConfig(
    nbits=4,        # 4-bit precision
    group_size=64,  # Quantization group size
    axis=1          # Quantization axis
)
Quantization Method:
HQQ (Half-Quadratic Quantization) is a fast, calibration-free quantization method: it needs no calibration data, quantizes the weights quickly, and pairs with optimized inference backends such as gemlite (see the supported backends listed below).
Usage:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from hqq.utils.patching import prepare_for_inference

# Load the quantized model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "LumenAI/qwen2.5-vl-7b-hqq-4bit-clean",
    torch_dtype=torch.float16,
    device_map="cuda:0",
    trust_remote_code=True
)
# Apply HQQ patching with backend
prepare_for_inference(model, backend='gemlite', verbose=True)
# Load processor
processor = AutoProcessor.from_pretrained(
    "LumenAI/qwen2.5-vl-7b-hqq-4bit-clean",
    trust_remote_code=True
)

# Prepare messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# Process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to("cuda")
# Generate
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)
# Trim the prompt tokens from each sequence and decode only the newly generated part
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output[0])
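The same pipeline also handles video inputs, since the processor call above already passes videos=video_inputs. A minimal sketch, reusing the model and processor loaded above and assuming a hypothetical local file path/to/video.mp4:

# Video input instead of an image; process_vision_info extracts the frames
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/video.mp4"},
            {"type": "text", "text": "Summarize this video."},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to("cuda")
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True
)[0])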
This model was quantized using the following approach:
- HQQ applied at load time via the quantization_config parameter
- _fast_init=False to avoid meta tensor issues
- safe_serialization=True for reliable model loading

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, HqqConfig

quant_config = HqqConfig(nbits=4, group_size=64, axis=1)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,
    device_map='cuda:0',
    quantization_config=quant_config,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    _fast_init=False,
).eval()

model.save_pretrained("output_dir", safe_serialization=True)
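Since the point of this repo is the absence of meta tensors, a quick sanity check after saving confirms the result. A minimal sketch, assuming the output_dir directory from the snippet above and the same loading pattern as the usage example:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Reload the saved quantized checkpoint the same way the published repo is loaded
reloaded = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "output_dir",
    torch_dtype=torch.float16,
    device_map="cuda:0",
    trust_remote_code=True,
)

# A meta tensor has no storage; none should remain after a clean save/load round trip
assert not any(p.is_meta for p in reloaded.parameters()), "found meta tensors"
print("reload OK, no meta tensors")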
Supported backends for prepare_for_inference:
- gemlite: Recommended for best performance on consumer GPUs
- torchao_int4: PyTorch native implementation
- bitblas: Optimized for specific hardware
- marlin: High-performance kernel for inference

This quantized model handles the same vision-language tasks as the base model while requiring significantly less GPU memory; a rough check of the footprint is sketched below.
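A minimal sketch of switching backends and checking the memory saving. The backend string is one of the names listed above, and get_memory_footprint() is the standard transformers helper; exact numbers will vary by hardware and backend:

from hqq.utils.patching import prepare_for_inference

# Pick any backend from the list above; gemlite is the card's recommendation for consumer GPUs,
# torchao_int4 is shown here as the PyTorch-native alternative
prepare_for_inference(model, backend="torchao_int4", verbose=True)

# 4-bit weights vs. fp16 weights: expect a footprint on the order of 3-4x smaller
print(f"quantized footprint: {model.get_memory_footprint() / 1e9:.1f} GB")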
License: Apache 2.0 (inherited from the base model)
Base model: Qwen/Qwen2.5-VL-7B-Instruct