# Gemma-4-31B-it 4-bit GPTQ Quantization (W4A16)
This is a 4-bit GPTQ quantization of Google's Gemma-4-31B-it instruction-tuned multimodal model, optimized for deployment on NVIDIA GPUs with vLLM.
## Model Details
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Quantization Method | GPTQ W4A16 (asymmetric) |
| Weight Precision | 4-bit int4 |
| Activation Precision | FP16/BF16 |
| Group Size | 128 |
| Quantization Library | llm-compressor 0.10.0.1 |
| Format | compressed-tensors (pack-quantized) |
| Architecture | Gemma4ForConditionalGeneration |
| Layers | 60 decoder layers |
| Hidden Size | 5376 |
| Context Window | 262K tokens |
| Vision Tower | SigLIP (27 layers, NOT quantized) |
| Quantized Components | Text decoder + projector (vision tower preserved in BF16) |
## Hardware Requirements

- Verified Deployment: NVIDIA RTX PRO 6000 (96GB VRAM, Blackwell sm120)
- Actual VRAM Usage: ~35GB with `gpu_memory_utilization: 0.4` (full 256K context)
- CUDA Version: cu130 (CUDA 13.0)
- vLLM Version: 0.18.2+ (`vllm/vllm-openai:gemma4-cu130` Docker image)
Note: The 4-bit GPTQ quantization significantly reduces VRAM requirements compared to BF16 (~62GB). The vision tower remains in BF16, but the quantized text decoder enables deployment on high-end GPUs with substantial headroom for KV cache.
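As a rough sanity check on these figures, the quantized checkpoint size can be estimated from the parameter count. The split between int4 decoder weights and BF16 components below is a back-of-the-envelope assumption, not a measured value:

```python
# Back-of-the-envelope estimate of the quantized checkpoint size.
# Assumptions: ~31e9 total parameters, of which ~27e9 sit in the 4-bit
# text decoder and the rest (vision tower, embeddings, lm_head) stay BF16.
GiB = 1024**3

decoder_params = 27e9   # assumed int4-quantized parameters
other_params = 4e9      # assumed BF16 parameters (vision tower, etc.)
group_size = 128

int4_bytes = decoder_params * 0.5                 # 4 bits per weight
# Each 128-weight group stores a scale and zero-point (~4 bytes combined).
overhead_bytes = decoder_params / group_size * 4
bf16_bytes = other_params * 2                     # 2 bytes per BF16 weight

total_gib = (int4_bytes + overhead_bytes + bf16_bytes) / GiB
print(f"estimated checkpoint size: ~{total_gib:.1f} GiB")
```

The estimate lands in the same ballpark as the ~19.2GB `model.safetensors`; the remaining ~15GB of the observed ~35GB VRAM footprint goes to KV cache, activations, and CUDA graph buffers.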
## Quantization Recipe
This model was quantized using the following configuration:
```yaml
default_stage:
  default_modifiers:
    GPTQModifier:
      targets: [Linear]
      ignore:
        - lm_head
        - model.vision_tower.*
        - model.multi_modal_projector.*
      scheme: W4A16_ASYM
      block_size: 128
      dampening_frac: 0.01
      actorder: static
      offload_hessians: false
```
Calibration Dataset: Mixed dataset including:
- ise-uiuc/Magicoder-Evol-Instruct-110K (1024 samples)
- Salesforce/APIGen-MT-5k (512 samples)
- NousResearch/hermes-function-calling-v1 (512 samples)
Total: 2048 samples at 2048 tokens each (~4.2M calibration tokens)
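The calibration mix described above can be tallied directly:

```python
# Tally of the calibration mix: per-dataset sample counts and total tokens.
samples = {
    "ise-uiuc/Magicoder-Evol-Instruct-110K": 1024,
    "Salesforce/APIGen-MT-5k": 512,
    "NousResearch/hermes-function-calling-v1": 512,
}
total_samples = sum(samples.values())
total_tokens = total_samples * 2048   # 2048 tokens per sample

print(total_samples)        # 2048
print(f"{total_tokens:,}")  # 4,194,304
```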
## Usage

### Loading with Transformers
```python
from transformers import AutoModelForMultimodalLM, AutoTokenizer

model_id = "ebircak/gemma-4-31B-it-4bit-W4A16-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
```
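For multimodal input, chat messages typically interleave image and text parts. The content schema below follows the convention used by recent multimodal chat templates and is an assumption for this model; the shipped `chat_template.jinja` is the authoritative format:

```python
# Hypothetical multimodal chat message layout (schema is an assumption;
# check the repository's chat_template.jinja for the authoritative format).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.png"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

# With a processor loaded for this model, the messages would be rendered via
# processor.apply_chat_template(messages, add_generation_prompt=True, ...)
# before being passed to model.generate().
```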
### Loading with vLLM
Example vLLM Configuration (adapted from actual deployment):
```yaml
# vllm_config.yaml
model: /path/to/ebircak/gemma-4-31B-it-4bit-W4A16-GPTQ
quantization: compressed-tensors   # GPTQ W4A16
kv_cache_dtype: fp8_e4m3           # FP8 KV cache
gpu_memory_utilization: 0.4        # ~35GB VRAM on RTX PRO 6000; raise up to 0.95 if needed
max_model_len: 262144              # 256K context
tensor_parallel_size: 1            # Single GPU
enable_prefix_caching: true
enable_chunked_prefill: true

# Gemma4-specific (REQUIRED for tool calling)
enable_auto_tool_choice: true
tool_call_parser: gemma4
reasoning_parser: gemma4
```
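Once the server is running, it exposes an OpenAI-compatible API. A minimal sketch of a chat-completions request payload follows; the port and the served model name are assumptions derived from the config above:

```python
import json
import urllib.request

# OpenAI-style chat-completions payload (model name/endpoint are assumptions).
payload = {
    "model": "ebircak/gemma-4-31B-it-4bit-W4A16-GPTQ",
    "messages": [
        {"role": "user", "content": "Summarize GPTQ quantization in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}
body = json.dumps(payload).encode()

def send(url="http://localhost:8000/v1/chat/completions"):
    """POST the payload to a running vLLM server (not executed here)."""
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```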
## Files in This Repository

| File | Description |
|---|---|
| `model.safetensors` | Quantized model weights (~19.2GB) |
| `config.json` | Model configuration with `quantization_config` |
| `tokenizer.json` | Tokenizer vocabulary |
| `tokenizer_config.json` | Tokenizer config with chat template |
| `chat_template.jinja` | Gemma4 native chat template |
| `generation_config.json` | Generation parameters |
| `processor_config.json` | Processor configuration |
| `recipe.yaml` | Quantization recipe |
| `LICENSE` | Apache 2.0 License |
## License
This quantization is released under the Apache 2.0 License.
The base model google/gemma-4-31B-it is also licensed under Apache 2.0.
See LICENSE for the full text.
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{gemma4-31b-gptq-quantization,
  title        = {Gemma-4-31B-it 4-bit GPTQ Quantization},
  author       = {ebircak},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/ebircak/gemma-4-31B-it-4bit-W4A16-GPTQ}},
  note         = {Quantized with llm-compressor 0.10.0.1}
}
```
## Disclaimer
This is a community quantization of the Google Gemma-4-31B-it model. While efforts have been made to ensure quality, this model is provided "as is" without warranty of any kind. Users should evaluate the model for their specific use cases.
This quantization would not have been possible without the hardware support of Gratex International, a.s. (https://www.gratex.com).
## Model Card Version
This model card follows the Model Cards for Model Reporting standard.
Original Model: google/gemma-4-31B-it
Quantization Tool: llm-compressor
Quantization Format: compressed-tensors