# Gemma-4-31B-it 4-bit GPTQ Quantization (W4A16)
This is a 4-bit GPTQ quantization of Google's Gemma-4-31B-it instruction-tuned multimodal model, optimized for deployment on NVIDIA GPUs with vLLM.
## Model Details
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Quantization Method | GPTQ W4A16 (asymmetric) |
| Weight Precision | 4-bit int4 |
| Activation Precision | FP16/BF16 |
| Group Size | 128 |
| Quantization Library | llm-compressor 0.10.0.1 |
| Format | compressed-tensors (pack-quantized) |
| Architecture | Gemma4ForConditionalGeneration |
| Layers | 60 decoder layers |
| Hidden Size | 5376 |
| Context Window | 262K tokens |
| Vision Tower | SigLIP (27 layers, NOT quantized) |
| Quantized Components | Text decoder + projector (vision tower preserved in BF16) |
## Hardware Requirements

- Verified Deployment: NVIDIA RTX PRO 6000 (96GB VRAM, Blackwell sm120)
- Actual VRAM Usage: ~35GB with `gpu_memory_utilization: 0.4` (full 256K context)
- CUDA Version: cu130 (CUDA 13.0)
- vLLM Version: 0.18.2+ (`vllm/vllm-openai:gemma4-cu130` Docker image)
Note: The 4-bit GPTQ quantization significantly reduces VRAM requirements compared to BF16 (~62GB). The vision tower remains in BF16, but the quantized text decoder enables deployment on high-end GPUs with substantial headroom for KV cache.
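As a rough sanity check on these figures, the quantized checkpoint size can be estimated from the parameter count. The split between int4 decoder weights and BF16 components below is a back-of-the-envelope assumption, not a measured value:

```python
# Back-of-the-envelope estimate of the quantized checkpoint size.
# Assumptions: ~31e9 total parameters, of which ~27e9 sit in the 4-bit
# text decoder and the rest (vision tower, embeddings, lm_head) stay BF16.
GiB = 1024**3

decoder_params = 27e9   # assumed int4-quantized parameters
other_params = 4e9      # assumed BF16 parameters (vision tower, etc.)
group_size = 128

int4_bytes = decoder_params * 0.5                 # 4 bits per weight
# Each 128-weight group stores a scale and zero-point (~4 bytes combined).
overhead_bytes = decoder_params / group_size * 4
bf16_bytes = other_params * 2                     # 2 bytes per BF16 weight

total_gib = (int4_bytes + overhead_bytes + bf16_bytes) / GiB
print(f"estimated checkpoint size: ~{total_gib:.1f} GiB")
```

The estimate lands in the same ballpark as the ~19.2GB `model.safetensors`; the remaining ~15GB of the observed ~35GB VRAM footprint goes to KV cache, activations, and CUDA graph buffers.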
## Quantization Recipe
This model was quantized using the following configuration:
```yaml
default_stage:
  default_modifiers:
    GPTQModifier:
      targets: [Linear]
      ignore:
        - lm_head
        - model.vision_tower.*
        - model.multi_modal_projector.*
      scheme: W4A16_ASYM
      block_size: 128
      dampening_frac: 0.01
      actorder: static
      offload_hessians: false
```
Calibration Dataset: Mixed dataset including:
- ise-uiuc/Magicoder-Evol-Instruct-110K (1024 samples)
- Salesforce/APIGen-MT-5k (512 samples)
- NousResearch/hermes-function-calling-v1 (512 samples)
Total: 2048 samples at 2048 tokens each (~4.2M calibration tokens)
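The calibration mix described above can be tallied directly:

```python
# Tally of the calibration mix: per-dataset sample counts and total tokens.
samples = {
    "ise-uiuc/Magicoder-Evol-Instruct-110K": 1024,
    "Salesforce/APIGen-MT-5k": 512,
    "NousResearch/hermes-function-calling-v1": 512,
}
total_samples = sum(samples.values())
total_tokens = total_samples * 2048   # 2048 tokens per sample

print(total_samples)        # 2048
print(f"{total_tokens:,}")  # 4,194,304
```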
## Usage

### Loading with Transformers
```python
from transformers import AutoModelForMultimodalLM, AutoTokenizer

model_id = "ebircak/gemma-4-31B-it-4bit-W4A16-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
```
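For multimodal input, chat messages typically interleave image and text parts. The content schema below follows the convention used by recent multimodal chat templates and is an assumption for this model; the shipped `chat_template.jinja` is the authoritative format:

```python
# Hypothetical multimodal chat message layout (schema is an assumption;
# check the repository's chat_template.jinja for the authoritative format).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.png"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

# With a processor loaded for this model, the messages would be rendered via
# processor.apply_chat_template(messages, add_generation_prompt=True, ...)
# before being passed to model.generate().
```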
### Loading with vLLM
Example vLLM Configuration (adapted from actual deployment):
```yaml
# vllm_config.yaml
model: /path/to/ebircak/gemma-4-31B-it-4bit-W4A16-GPTQ
quantization: compressed-tensors   # GPTQ W4A16
kv_cache_dtype: fp8_e4m3           # FP8 KV cache
gpu_memory_utilization: 0.4        # ~35GB VRAM on RTX PRO 6000; raise up to 0.95 if needed
max_model_len: 262144              # 256K context
tensor_parallel_size: 1            # Single GPU
enable_prefix_caching: true
enable_chunked_prefill: true

# Gemma4-specific (REQUIRED for tool calling)
enable_auto_tool_choice: true
tool_call_parser: gemma4
reasoning_parser: gemma4
```
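Once the server is running, it exposes an OpenAI-compatible API. A minimal sketch of a chat-completions request payload follows; the port and the served model name are assumptions derived from the config above:

```python
import json
import urllib.request

# OpenAI-style chat-completions payload (model name/endpoint are assumptions).
payload = {
    "model": "ebircak/gemma-4-31B-it-4bit-W4A16-GPTQ",
    "messages": [
        {"role": "user", "content": "Summarize GPTQ quantization in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}
body = json.dumps(payload).encode()

def send(url="http://localhost:8000/v1/chat/completions"):
    """POST the payload to a running vLLM server (not executed here)."""
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```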
## Files in This Repository

| File | Description |
|---|---|
| `model.safetensors` | Quantized model weights (~19.2GB) |
| `config.json` | Model configuration with `quantization_config` |
| `tokenizer.json` | Tokenizer vocabulary |
| `tokenizer_config.json` | Tokenizer config with chat template |
| `chat_template.jinja` | Gemma4 native chat template |
| `generation_config.json` | Generation parameters |
| `processor_config.json` | Processor configuration |
| `recipe.yaml` | Quantization recipe |
| `LICENSE` | Apache 2.0 License |
## License
This quantization is released under the Apache 2.0 License.
The base model google/gemma-4-31B-it is also licensed under Apache 2.0.
See LICENSE for the full text.
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{gemma4-31b-gptq-quantization,
  title        = {Gemma-4-31B-it 4-bit GPTQ Quantization},
  author       = {ebircak},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/ebircak/gemma-4-31B-it-4bit-W4A16-GPTQ}},
  note         = {Quantized with llm-compressor 0.10.0.1}
}
```
## Disclaimer
This is a community quantization of the Google Gemma-4-31B-it model. While efforts have been made to ensure quality, this model is provided "as is" without warranty of any kind. Users should evaluate the model for their specific use cases.
This quantization would not have been possible without the hardware support of Gratex International, a.s. (https://www.gratex.com).
## Model Card Version
This model card follows the Model Cards for Model Reporting standard.
Original Model: google/gemma-4-31B-it
Quantization Tool: llm-compressor
Quantization Format: compressed-tensors