Gemma 4 E4B IT — GPTQ 4-bit (auto-round)

GPTQ 4-bit quantization of google/gemma-4-E4B-it.

Fits in 12GB VRAM (RTX 3080 Ti, RTX 4070, etc.) with vLLM.

Model details

Base model: google/gemma-4-E4B-it
Parameters: 8B total, 4B effective (Per-Layer Embeddings, PLE)
Modalities: Text, image, audio, video
Context: 128K native; 8K recommended for 12GB GPUs
License: Apache 2.0

Quantization details

Method: auto-round (RTN mode, GPTQ-compatible output)
Bits: 4
Group size: 128
Symmetric: Yes
Format: auto_gptq (vLLM-compatible)
Quantized layers: Language model only (vision/audio towers kept at full precision)
Model-load VRAM: ~9.65 GiB
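
The settings above determine the per-weight storage cost. A back-of-the-envelope sketch (quantized layers only; the on-disk checkpoint is larger because the BF16 vision/audio towers and embeddings are not quantized):

```python
# Storage per quantized weight for 4-bit, group-size-128, symmetric
# quantization: packed 4-bit weights plus one FP16 scale per group
# (symmetric means no zero-point tensors).
BITS = 4
GROUP_SIZE = 128
SCALE_BYTES = 2  # one FP16 scale per group

weight_bytes = BITS / 8                 # 0.5 bytes per weight
scale_bytes = SCALE_BYTES / GROUP_SIZE  # ~0.0156 bytes per weight
total = weight_bytes + scale_bytes      # ~0.516 bytes per weight

# Roughly 3.9x smaller than BF16 (2 bytes per weight).
ratio = 2 / total
print(f"{total:.4f} B/weight, {ratio:.2f}x smaller than BF16")
```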

Serving with vLLM

vllm serve ./gemma-4-E4B-it-W4A16 \
  --served-model-name gemma-4-E4B-it \
  --quantization gptq \
  --max-model-len 8192 \
  --enforce-eager \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --port 8000

Using the service script

This model is supported as the e4b variant in the service script:

GEMMA_VARIANT=e4b ./service.sh up
GEMMA_VARIANT=e4b ./service.sh test

API usage

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="gemma-4-E4B-it",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"],
            },
        },
    }],
    tool_choice="auto",
)
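
When the model responds with a tool call, the client must execute it and send the result back as a `tool` message. A minimal sketch (the `get_weather` implementation and `run_tool_call` helper are illustrative, not part of this repo):

```python
import json

def get_weather(location: str) -> str:
    # Stand-in for a real weather lookup.
    return f"Sunny, 21°C in {location}"

TOOLS = {"get_weather": get_weather}

def run_tool_call(tool_call) -> dict:
    """Execute one tool call from the model and build the follow-up
    message in the OpenAI chat format."""
    fn = TOOLS[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": fn(**args),
    }
```

Append the assistant message containing the tool call, then the dict returned by `run_tool_call`, to `messages` and call `client.chat.completions.create()` again to get the final answer.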

Notes

  • Vision/audio towers are kept at full precision (BF16) since vLLM's GPTQ loader only supports quantized Linear layers in the language model.
  • softcap tensors from transformers 5.x have been removed from the safetensors files for vLLM 0.19.0 compatibility.
  • For 12GB GPUs, use --max-model-len 8192 or lower. Reduce to 4096 if you hit OOM.
  • AutoAWQ and llm-compressor do not support the Gemma 4 architecture. auto-round RTN mode is the only working quantization path as of April 2026.
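
The context-length advice above comes down to KV-cache growth, which scales linearly with `--max-model-len`. A rough sizing formula (the layer/head dimensions below are placeholders, not the actual Gemma 4 E4B config; substitute the values from the model's config.json):

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
#                  * bytes_per_elem * tokens
def kv_cache_gib(tokens, layers=30, kv_heads=8, head_dim=128,
                 bytes_per_elem=2):  # FP16/BF16 cache; dims are assumed
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 2**30

# Halving --max-model-len halves the per-sequence KV-cache budget.
print(f"8192 tokens: {kv_cache_gib(8192):.2f} GiB")
print(f"4096 tokens: {kv_cache_gib(4096):.2f} GiB")
```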