# Gemma 4 E4B IT — GPTQ 4-bit (auto-round)

GPTQ 4-bit quantization of google/gemma-4-E4B-it.
Fits in 12 GB VRAM (RTX 3080 Ti, RTX 4070, etc.) with vLLM.
## Model details

| | |
|---|---|
| Base model | google/gemma-4-E4B-it |
| Parameters | 8B total, 4B effective (PLE architecture) |
| Modalities | Text, Image, Audio, Video |
| Context | 128K native, 8K recommended for 12 GB GPUs |
| License | Apache 2.0 |
## Quantization details

| | |
|---|---|
| Method | auto-round (RTN mode, GPTQ-compatible output) |
| Bits | 4 |
| Group size | 128 |
| Symmetric | Yes |
| Format | auto_gptq (vLLM-compatible) |
| Quantized layers | Language model only (vision/audio towers kept at full precision) |
| Model loading VRAM | ~9.65 GiB |
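The scheme in the table (4-bit, symmetric, group size 128) can be illustrated with a minimal round-to-nearest sketch. This is a toy NumPy model of the arithmetic only, not auto-round's actual implementation:

```python
import numpy as np

def quantize_rtn_symmetric(w, bits=4, group_size=128):
    """Round-to-nearest symmetric quantization with one scale per group.

    Toy illustration of 4-bit / group-size-128 / symmetric; not the
    real auto-round code path.
    """
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit symmetric
    w = w.reshape(-1, group_size)               # one scale per 128 weights
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_rtn_symmetric(w)
w_hat = dequantize(q, scale).reshape(-1)
print("max abs error:", np.abs(w - w_hat).max())
```

With symmetric round-to-nearest, the reconstruction error per weight is bounded by half the group's scale, which is why larger groups (coarser scales) trade accuracy for a smaller metadata footprint.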
## Serving with vLLM

```bash
vllm serve ./gemma-4-E4B-it-W4A16 \
  --quantization gptq \
  --max-model-len 8192 \
  --enforce-eager \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --port 8000
```
## Using the service script

This model is supported as the `e4b` variant in the service script:

```bash
GEMMA_VARIANT=e4b ./service.sh up
GEMMA_VARIANT=e4b ./service.sh test
```
## API usage

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="gemma-4-E4B-it",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                },
                "required": ["location"],
            },
        },
    }],
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```
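When the model responds with a tool call, you execute the function locally and send the result back as a `tool` role message. A minimal dispatcher sketch, assuming a local `get_weather` implementation (hypothetical, not part of this repo):

```python
import json
from types import SimpleNamespace

# Hypothetical local implementation of the tool declared above.
def get_weather(location: str) -> str:
    return f"Sunny, 22°C in {location}"

TOOLS = {"get_weather": get_weather}

def run_tool_call(tool_call) -> dict:
    """Execute one assistant tool call and format the result as a
    `tool` role message for the follow-up request."""
    fn = TOOLS[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": fn(**args),
    }

# SimpleNamespace stands in for the SDK's tool_call object shape.
tc = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather",
                             arguments='{"location": "Paris"}'),
)
msg = run_tool_call(tc)
print(msg)
```

In a real loop you would append the assistant message and each `run_tool_call(...)` result to `messages`, then call `client.chat.completions.create` again so the model can produce its final answer.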
## Notes
- Vision/audio towers are kept at full precision (BF16) since vLLM's GPTQ loader only supports quantized Linear layers in the language model.
- softcap tensors from transformers 5.x have been removed from the safetensors files for vLLM 0.19.0 compatibility.
- For 12 GB GPUs, use `--max-model-len 8192` or lower. Reduce to `4096` if you hit OOM.
- AutoAWQ and llm-compressor do not support the Gemma 4 architecture; auto-round RTN mode is the only working quantization path as of April 2026.
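The context-length advice follows from KV-cache arithmetic: with ~9.65 GiB of weights loaded on a 12 GiB card, little headroom remains for a cache that grows linearly with sequence length. A back-of-the-envelope sketch (the layer/head/dim numbers below are illustrative placeholders, not Gemma 4's published config):

```python
def kv_cache_gib(seq_len, layers, kv_heads, head_dim, bytes_per_elt=2):
    """Rough KV-cache size in GiB: 2 tensors (K and V) per layer, each
    kv_heads * head_dim elements per token, at bytes_per_elt (2 = FP16/BF16)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elt / 2**30

# Placeholder architecture numbers, chosen only to show the scaling.
for seq_len in (4096, 8192):
    gib = kv_cache_gib(seq_len, layers=30, kv_heads=8, head_dim=256)
    print(f"{seq_len} tokens -> {gib:.2f} GiB KV cache")
```

Halving `--max-model-len` halves the worst-case KV-cache footprint, which is usually enough to clear an OOM at load time.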