# Gemma-4-31B-it 4-bit AWQ Quantization (W4A16)
This is a 4-bit AWQ quantization of Google's Gemma-4-31B-it instruction-tuned multimodal model, optimized for deployment with vLLM.
**Format note:** This model was quantized with the AWQ algorithm via llm-compressor and is saved in the compressed-tensors format. When loading with vLLM, use `--quantization compressed-tensors` (not `--quantization awq`, which expects the AutoAWQ schema and will fail).
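To check which loader flag a local checkpoint expects, you can inspect `quantization_config` in its `config.json`. A minimal sketch, assuming the compressed-tensors schema (field names may differ across llm-compressor versions; verify against the actual file in this repository):

```python
import json

# Excerpt of a config.json "quantization_config" as emitted by llm-compressor
# (assumed field names, shown inline for illustration).
config = json.loads("""
{
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "format": "pack-quantized"
  }
}
""")

quant = config["quantization_config"]
# compressed-tensors checkpoints declare quant_method "compressed-tensors";
# AutoAWQ checkpoints declare quant_method "awq" instead.
flag = ("--quantization compressed-tensors"
        if quant.get("quant_method") == "compressed-tensors"
        else "--quantization awq")
print("load with:", flag)
```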
## Model Details
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Quantization Algorithm | AWQ (Activation-aware Weight Quantization) |
| Weight Scheme | W4A16_ASYM (4-bit asymmetric weights, 16-bit activations) |
| Weight Precision | 4-bit int4 |
| Activation Precision | FP16/BF16 |
| Group Size | 128 |
| Serialization Format | compressed-tensors (pack-quantized) |
| Quantization Library | llm-compressor (main branch, post-0.10.0.1) |
| Architecture | Gemma4ForConditionalGeneration |
| Decoder Layers | 60 |
| Hidden Size | 5376 |
| Context Window | 262K tokens (128K verified) |
| Vision Tower | SigLIP (27 layers, preserved in BF16 — NOT quantized) |
| Quantized Components | Text decoder only (vision tower + multimodal projector excluded) |
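The W4A16_ASYM scheme with group size 128 means each contiguous group of 128 weights shares one scale and one zero point; weights are stored as unsigned int4 (0 to 15) and dequantized to 16-bit floats at inference time. A toy illustration of asymmetric group quantization (pure Python; the real kernels pack two int4 values per byte and fuse dequantization into the matmul):

```python
def quantize_group(weights, n_bits=4):
    """Asymmetric quantization: map [w_min, w_max] onto integers 0..2^n_bits - 1."""
    qmax = (1 << n_bits) - 1                      # 15 for int4
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0         # guard against all-equal groups
    zero_point = round(-w_min / scale)
    q = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize_group(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]

# One group of 128 weights shares a single (scale, zero_point) pair.
group = [(-1) ** i * i / 128.0 for i in range(128)]
q, scale, zp = quantize_group(group)
recovered = dequantize_group(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(group, recovered))
print(f"max reconstruction error: {max_err:.4f} (quantization step: {scale:.4f})")
```

Smaller groups reduce quantization error (each scale covers a narrower range) at the cost of more metadata; group size 128 is the common middle ground.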
## Hardware Requirements
- Verified context: 128K tokens on 48GB total VRAM (2× 24GB GPUs) and 256K on 96GB single GPU
- Minimum VRAM: ~24GB for short contexts; ~48GB for 128K context
- Recommended: 48GB+ for production use with 128K+ context and FP8 KV cache
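A back-of-the-envelope check on these numbers (illustrative arithmetic only; actual VRAM usage also includes the KV cache, activations, and runtime overhead):

```python
# Rough weight-memory estimate for 31B parameters at 4 bits per weight,
# plus per-group quantization metadata (one FP16 scale and one zero point
# per group of 128 weights; assumed overhead, for illustration).
params = 31e9
bits_per_weight = 4
group_size = 128
overhead_bits = (16 + 4) / group_size          # ~0.16 extra bits per weight

weight_gb = params * (bits_per_weight + overhead_bits) / 8 / 1e9
print(f"quantized decoder weights: ~{weight_gb:.1f} GB")
# The unquantized BF16 vision tower and embeddings account for the
# remaining gap up to the ~17.9 GB checkpoint size listed below.
```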
## Quantization Recipe
This model was quantized using the AWQ algorithm with the following configuration:
```yaml
AWQModifier:
  targets: [Linear]
  scheme: W4A16_ASYM
  group_size: 128
  duo_scaling: both
  ignore:
    - lm_head
    - model.vision_tower.*
    - model.multi_modal_projector.*
    - model.embed_vision.*
  mappings: <per-layer hybrid-attention mappings — see recipe.yaml>
```
**Calibration Dataset** (512 samples total, 2048 tokens each):

| Dataset | Samples | Purpose |
|---|---|---|
| ise-uiuc/Magicoder-Evol-Instruct-110K | 256 | Coding baseline |
| Salesforce/APIGen-MT-5k | 128 | Tool calling / function calling |
| nvidia/When2Call (llm_judge split) | 128 | Tool-call decision making |
Gemma4 hybrid-attention handling: Gemma4-31B uses interleaved sliding_attention (50 layers, have v_proj) and full_attention (10 layers, v_proj=None) decoder layers. The quantization script generates per-layer AWQ mappings to handle this correctly — a single regex pattern would fail due to AWQ's lowest-common-ancestor grouping logic.
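The per-layer mapping generation can be sketched as follows. The layer interleave (five sliding_attention layers per full_attention layer) and the mapping schema here are assumptions for illustration; the authoritative mappings live in this repo's `recipe.yaml`:

```python
# Sketch: emit one AWQ mapping per decoder layer so that sliding_attention
# layers (which have v_proj) and full_attention layers (v_proj=None) never
# fall under a single shared regex.
layer_types = (["sliding_attention"] * 5 + ["full_attention"]) * 10  # 60 layers

def attn_mapping(i, kind):
    attn = rf"re:.*layers\.{i}\.self_attn"
    balance = [attn + r"\.q_proj", attn + r"\.k_proj"]
    if kind == "sliding_attention":              # v_proj exists only here
        balance.append(attn + r"\.v_proj")
    return {
        "smooth_layer": rf"re:.*layers\.{i}\.input_layernorm",
        "balance_layers": balance,
    }

mappings = [attn_mapping(i, k) for i, k in enumerate(layer_types)]
print(len(mappings), "attention mappings,",
      sum(len(m["balance_layers"]) == 3 for m in mappings), "with v_proj")
```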
## Usage

### vLLM Docker (Recommended)

vLLM publishes dedicated Docker images with Gemma4 support. Use a gemma4-tagged image for guaranteed compatibility with Gemma4's tool-call parser, reasoning parser, and chat template handling.
Available Gemma4-compatible vLLM images (with transformers 5.5.0 and gemma4 parsers):

| Image | Backend | Notes |
|---|---|---|
| `vllm/vllm-openai:gemma4-cu130` | CUDA 13.0 (Blackwell sm120+) | For NVIDIA Blackwell GPUs |
| `vllm/vllm-openai-rocm:gemma4` | ROCm (RDNA3/CDNA) | For AMD GPUs |
Example `docker run` (RTX Pro 6000 GPU, 1× 96GB):

```bash
docker run -it --rm \
  --network=host \
  --gpus all \
  --ipc=host \
  --shm-size=32g \
  -v /path/to/model:/model \
  vllm/vllm-openai:gemma4-cu130 \
  /model \
  --quantization compressed-tensors \
  --dtype auto \
  --tensor-parallel-size 1 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 262144 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 4 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --port 8000
```
For ROCm (AMD GPUs), replace the image and add device flags:

```bash
docker run -it --rm \
  --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add=video \
  --group-add=render \
  --ipc=host \
  --shm-size=32g \
  -e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
  -v /path/to/model:/model \
  vllm/vllm-openai-rocm:gemma4 \
  /model \
  --quantization compressed-tensors \
  --dtype auto \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 131072 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 4 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --port 8000
```
`HSA_OVERRIDE_GFX_VERSION=11.0.0` is required for RDNA3 GPUs (RX 7900 XTX, RX 7900 XT, etc.) to enable ROCm compute support.
### Loading with vLLM (CLI)

Use `--quantization compressed-tensors`, not `--quantization awq`. llm-compressor saves in compressed-tensors format regardless of the quantization algorithm; the `--quantization awq` flag expects the AutoAWQ schema and will fail.

Minimal example:
```bash
vllm serve ebircak/gemma-4-31B-it-4bit-W4A16-AWQ \
  --quantization compressed-tensors \
  --dtype auto \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95
```
With tool calling (recommended):

```bash
vllm serve ebircak/gemma-4-31B-it-4bit-W4A16-AWQ \
  --quantization compressed-tensors \
  --dtype auto \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4
```
Multi-GPU (tensor parallel, 2 GPUs):

```bash
vllm serve ebircak/gemma-4-31B-it-4bit-W4A16-AWQ \
  --quantization compressed-tensors \
  --dtype auto \
  --kv-cache-dtype fp8_e4m3 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4
```
Example vLLM config file:

```yaml
# vllm_config.yaml
model: ebircak/gemma-4-31B-it-4bit-W4A16-AWQ
quantization: compressed-tensors  # AWQ weights in compressed-tensors format
kv_cache_dtype: fp8_e4m3          # FP8 KV cache (applied at runtime by vLLM)
dtype: auto
gpu_memory_utilization: 0.95
max_model_len: 131072             # 128K context (verified)
tensor_parallel_size: 2           # Adjust to your GPU count
enable_prefix_caching: true

# Gemma4-specific (required for tool calling)
enable_auto_tool_choice: true
tool_call_parser: gemma4
reasoning_parser: gemma4

# Suppress thinking mode server-wide (clients can override per-request)
default_chat_template_kwargs:
  enable_thinking: false
```
### Loading with Transformers

```python
from transformers import AutoModelForMultimodalLM, AutoTokenizer

model_id = "ebircak/gemma-4-31B-it-4bit-W4A16-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
```
## Tool Calling

This model uses the Google Gemma4 chat template (baked into tokenizer_config.json), which is required for vLLM's gemma4 tool-call parser. The native Gemma4 tool-call format uses special tokens:

```
<|tool_call>call:function_name{key:<|"|>value<|"|>}<tool_call|>
```

vLLM's gemma4 tool_call_parser converts this to standard OpenAI `tool_calls` JSON automatically.
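As an illustration of what that conversion involves, here is a simplified parser for the single-call shape shown above (a sketch that handles string-valued arguments only; in practice, use vLLM's built-in gemma4 parser):

```python
import json
import re

# Matches: <|tool_call>call:NAME{key:<|"|>value<|"|>...}<tool_call|>
CALL = re.compile(r"<\|tool_call>call:(\w+)\{(.*?)\}<tool_call\|>", re.S)
ARG = re.compile(r'(\w+):<\|"\|>(.*?)<\|"\|>')

def parse_tool_call(text):
    """Convert one native-format tool call to an OpenAI-style tool_calls entry."""
    m = CALL.search(text)
    if m is None:
        return None
    name, body = m.groups()
    args = dict(ARG.findall(body))
    return {"type": "function",
            "function": {"name": name, "arguments": json.dumps(args)}}

call = parse_tool_call('<|tool_call>call:get_weather{city:<|"|>Paris<|"|>}<tool_call|>')
print(call["function"])
```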
## Files in This Repository

| File | Description |
|---|---|
| `model.safetensors` | Quantized model weights (~17.9 GB) |
| `config.json` | Model configuration with quantization_config |
| `tokenizer.json` | Tokenizer vocabulary |
| `tokenizer_config.json` | Tokenizer config with baked Gemma4 chat template |
| `chat_template.jinja` | Gemma4 native chat template |
| `generation_config.json` | Generation parameters |
| `processor_config.json` | Processor configuration |
| `recipe.yaml` | Full AWQ quantization recipe (per-layer mappings) |
| `LICENSE` | Apache 2.0 License |
## License
This quantization is released under the Apache 2.0 License.
The base model google/gemma-4-31B-it is also licensed under Apache 2.0.
See LICENSE for the full text.
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{gemma4-31b-awq-quantization,
  title = {Gemma-4-31B-it 4-bit AWQ Quantization},
  author = {ebircak},
  year = {2026},
  howpublished = {\url{https://huggingface.co/ebircak/gemma-4-31B-it-4bit-W4A16-AWQ}},
  note = {Quantized with llm-compressor (main branch) using AWQ W4A16\_ASYM}
}
```
## Disclaimer
This is a community quantization of the Google Gemma-4-31B-it model. While efforts have been made to ensure quality, this model is provided "as is" without warranty of any kind. Users should evaluate the model for their specific use cases.
This quantization would not be possible without the hardware support of Gratex International, a.s. (https://www.gratex.com).
## Model Card Version

This model card follows the Model Cards for Model Reporting standard.

- Original Model: google/gemma-4-31B-it
- Quantization Tool: llm-compressor
- Quantization Format: compressed-tensors