Gemma 4 E4B IT — GPTQ 4-bit (auto-round)

GPTQ 4-bit quantization of google/gemma-4-E4B-it.

Fits in 12GB VRAM (RTX 3080 Ti, RTX 4070, etc.) with vLLM.

Model details

Base model: google/gemma-4-E4B-it
Parameters: 8B total, 4B effective (Per-Layer Embeddings, PLE)
Modalities: Text, image, audio, video
Context: 128K native; 8K recommended for 12GB GPUs
License: Apache 2.0

Quantization details

Method: auto-round (RTN mode, GPTQ-compatible output)
Bits: 4
Group size: 128
Symmetric: Yes
Format: auto_gptq (vLLM-compatible)
Quantized layers: Language model only (vision/audio towers kept at full precision)
Model-load VRAM: ~9.65 GiB
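
The settings above determine the per-weight storage cost. A back-of-the-envelope sketch (quantized layers only; the on-disk checkpoint is larger because the BF16 vision/audio towers and embeddings are not quantized):

```python
# Storage per quantized weight for 4-bit, group-size-128, symmetric
# quantization: packed 4-bit weights plus one FP16 scale per group
# (symmetric means no zero-point tensors).
BITS = 4
GROUP_SIZE = 128
SCALE_BYTES = 2  # one FP16 scale per group

weight_bytes = BITS / 8                 # 0.5 bytes per weight
scale_bytes = SCALE_BYTES / GROUP_SIZE  # ~0.0156 bytes per weight
total = weight_bytes + scale_bytes      # ~0.516 bytes per weight

# Roughly 3.9x smaller than BF16 (2 bytes per weight).
ratio = 2 / total
print(f"{total:.4f} B/weight, {ratio:.2f}x smaller than BF16")
```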

Serving with vLLM

vllm serve ./gemma-4-E4B-it-W4A16 \
  --served-model-name gemma-4-E4B-it \
  --quantization gptq \
  --max-model-len 8192 \
  --enforce-eager \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --port 8000

Using the service script

This model is supported as the e4b variant in the service script:

GEMMA_VARIANT=e4b ./service.sh up
GEMMA_VARIANT=e4b ./service.sh test

API usage

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="gemma-4-E4B-it",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"],
            },
        },
    }],
    tool_choice="auto",
)
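
When the model responds with a tool call, the client must execute it and send the result back as a `tool` message. A minimal sketch (the `get_weather` implementation and `run_tool_call` helper are illustrative, not part of this repo):

```python
import json

def get_weather(location: str) -> str:
    # Stand-in for a real weather lookup.
    return f"Sunny, 21°C in {location}"

TOOLS = {"get_weather": get_weather}

def run_tool_call(tool_call) -> dict:
    """Execute one tool call from the model and build the follow-up
    message in the OpenAI chat format."""
    fn = TOOLS[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": fn(**args),
    }
```

Append the assistant message containing the tool call, then the dict returned by `run_tool_call`, to `messages` and call `client.chat.completions.create()` again to get the final answer.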

Notes

  • Vision/audio towers are kept at full precision (BF16) since vLLM's GPTQ loader only supports quantized Linear layers in the language model.
  • softcap tensors from transformers 5.x have been removed from the safetensors files for vLLM 0.19.0 compatibility.
  • For 12GB GPUs, use --max-model-len 8192 or lower. Reduce to 4096 if you hit OOM.
  • AutoAWQ and llm-compressor do not support the Gemma 4 architecture. auto-round RTN mode is the only working quantization path as of April 2026.
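
The context-length advice above comes down to KV-cache growth, which scales linearly with `--max-model-len`. A rough sizing formula (the layer/head dimensions below are placeholders, not the actual Gemma 4 E4B config; substitute the values from the model's config.json):

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
#                  * bytes_per_elem * tokens
def kv_cache_gib(tokens, layers=30, kv_heads=8, head_dim=128,
                 bytes_per_elem=2):  # FP16/BF16 cache; dims are assumed
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 2**30

# Halving --max-model-len halves the per-sequence KV-cache budget.
print(f"8192 tokens: {kv_cache_gib(8192):.2f} GiB")
print(f"4096 tokens: {kv_cache_gib(4096):.2f} GiB")
```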