This is a clean HQQ 4-bit quantized version of Qwen2.5-VL-7B-Instruct with no meta tensor issues. It was quantized with the following HQQ configuration:

HqqConfig(
    nbits=4,        # 4-bit precision
    group_size=64,  # Quantization group size
    axis=1          # Quantization axis
)
Quantization Method:
HQQ (Half-Quadratic Quantization) is a fast, calibration-free quantization method: it needs no calibration data, quantizes the weights quickly, and pairs with optimized inference backends such as gemlite (see the supported backends listed below).
Usage:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from hqq.utils.patching import prepare_for_inference

# Load the quantized model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "LumenAI/qwen2.5-vl-7b-hqq-4bit-clean",
    torch_dtype=torch.float16,
    device_map="cuda:0",
    trust_remote_code=True
)
# Apply HQQ patching with backend
prepare_for_inference(model, backend='gemlite', verbose=True)
# Load processor
processor = AutoProcessor.from_pretrained(
    "LumenAI/qwen2.5-vl-7b-hqq-4bit-clean",
    trust_remote_code=True
)

# Prepare messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# Process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to("cuda")
# Generate
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)
# Trim the prompt tokens from each sequence and decode only the newly generated part
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output[0])
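The same pipeline also handles video inputs, since the processor call above already passes videos=video_inputs. A minimal sketch, reusing the model and processor loaded above and assuming a hypothetical local file path/to/video.mp4:

# Video input instead of an image; process_vision_info extracts the frames
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/video.mp4"},
            {"type": "text", "text": "Summarize this video."},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to("cuda")
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True
)[0])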
This model was quantized using the following approach:
- HQQ applied at load time via the quantization_config parameter
- _fast_init=False to avoid meta tensor issues
- safe_serialization=True for reliable model loading

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, HqqConfig

quant_config = HqqConfig(nbits=4, group_size=64, axis=1)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,
    device_map='cuda:0',
    quantization_config=quant_config,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    _fast_init=False,
).eval()

model.save_pretrained("output_dir", safe_serialization=True)
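Since the point of this repo is the absence of meta tensors, a quick sanity check after saving confirms the result. A minimal sketch, assuming the output_dir directory from the snippet above and the same loading pattern as the usage example:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Reload the saved quantized checkpoint the same way the published repo is loaded
reloaded = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "output_dir",
    torch_dtype=torch.float16,
    device_map="cuda:0",
    trust_remote_code=True,
)

# A meta tensor has no storage; none should remain after a clean save/load round trip
assert not any(p.is_meta for p in reloaded.parameters()), "found meta tensors"
print("reload OK, no meta tensors")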
Supported backends for prepare_for_inference:
- gemlite: Recommended for best performance on consumer GPUs
- torchao_int4: PyTorch native implementation
- bitblas: Optimized for specific hardware
- marlin: High-performance kernel for inference

This quantized model handles the same vision-language tasks as the base model while requiring significantly less GPU memory; a rough check of the footprint is sketched below.
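A minimal sketch of switching backends and checking the memory saving. The backend string is one of the names listed above, and get_memory_footprint() is the standard transformers helper; exact numbers will vary by hardware and backend:

from hqq.utils.patching import prepare_for_inference

# Pick any backend from the list above; gemlite is the card's recommendation for consumer GPUs,
# torchao_int4 is shown here as the PyTorch-native alternative
prepare_for_inference(model, backend="torchao_int4", verbose=True)

# 4-bit weights vs. fp16 weights: expect a footprint on the order of 3-4x smaller
print(f"quantized footprint: {model.get_memory_footprint() / 1e9:.1f} GB")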
License: Apache 2.0 (inherited from the base model)
Base model: Qwen/Qwen2.5-VL-7B-Instruct