# Gemma-4-26B-A4B-it-NVFP4

NVFP4-quantized version of `google/gemma-4-26B-A4B-it` for vLLM.

This checkpoint is a Gemma 4 MoE model. Current vLLM releases still need the bundled `gemma4_patched.py` for correct NVFP4 MoE scale-key loading on this model family.
## Quantization Profile

- Text experts: NVFP4
- Text self-attention: NVFP4
- Token embeddings: NVFP4
- Router: higher precision
- `lm_head`: higher precision
- Vision tower and vision embeddings: higher precision
- KV cache: FP8
## Included Files

- `gemma4_patched.py`: patched Gemma 4 vLLM model file, required for this checkpoint today
- `quantize_gemma4_moe.py`: quantization/export script, redistributed under Apache-2.0

Attribution:

- `quantize_gemma4_moe.py` is based on the community release published by bg-digitalservices/marioiseli
- `gemma4_patched.py` is derived from vLLM's Apache-2.0 Gemma 4 implementation
## Deployment

Run the commands below from the model directory after cloning or downloading the repo.

### Docker (recommended)
```bash
docker run -d \
  --name gemma4-a4b-nvfp4 \
  --gpus all \
  --ipc=host \
  --network host \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -v "$(pwd):/model:ro" \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  -v "$(pwd)/gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py:ro" \
  vllm/vllm-openai:gemma4-cu130 \
  /model \
  --served-model-name gemma-4 \
  --host 0.0.0.0 \
  --port 8000 \
  --generation-config vllm \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --moe-backend marlin \
  --trust-remote-code
```
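Once the container is up, a quick readiness check is to list the served models. The sketch below assumes host networking, so the OpenAI-compatible API is reachable on `localhost:8000`; `check_server` is a hypothetical helper, not part of this repo.

```python
# Minimal readiness check for the server started above (stdlib only).
import json
import urllib.request


def served_model_ids(models_json: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response."""
    return [m["id"] for m in models_json.get("data", [])]


def check_server(base_url: str = "http://localhost:8000") -> list:
    """Query /v1/models; raises URLError if the server is not up yet."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        return served_model_ids(json.load(resp))
```

Against a healthy deployment, `check_server()` should return `["gemma-4"]`, matching the `--served-model-name` flag above.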
### Native `vllm serve`

```bash
source .venv-vllm/bin/activate

# Locate the installed Gemma 4 model file, back it up, then apply the patch.
PATCH_TARGET="$(python -c 'import inspect, vllm.model_executor.models.gemma4 as m; print(inspect.getfile(m))')"
cp "$PATCH_TARGET" "${PATCH_TARGET}.bak"
cp ./gemma4_patched.py "$PATCH_TARGET"

VLLM_NVFP4_GEMM_BACKEND=marlin \
vllm serve . \
  --served-model-name gemma-4 \
  --host 0.0.0.0 \
  --port 8000 \
  --generation-config vllm \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --moe-backend marlin \
  --trust-remote-code
```

To restore the original vLLM file later:

```bash
mv "${PATCH_TARGET}.bak" "$PATCH_TARGET"
```
## Why these flags matter

- `--quantization modelopt`: required for ModelOpt NVFP4 checkpoints
- `--moe-backend marlin`: required for the MoE expert path
- `VLLM_NVFP4_GEMM_BACKEND=marlin`: uses Marlin on the non-MoE NVFP4 path as well
- `--kv-cache-dtype fp8`: reduces KV-cache memory pressure
- `--trust-remote-code`: required for Gemma 4
Official Gemma 4 vLLM recipe:
https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html
## Thinking and tool calling

To expose Gemma 4 reasoning parsing and tool calling on the server:

```bash
VLLM_NVFP4_GEMM_BACKEND=marlin \
vllm serve . \
  --served-model-name gemma-4 \
  --generation-config vllm \
  --max-model-len 32768 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --moe-backend marlin \
  --trust-remote-code \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --enable-auto-tool-choice
```
Important:

- `--reasoning-parser gemma4` parses Gemma 4 reasoning output
- Actual thinking mode is activated per request with `chat_template_kwargs.enable_thinking=true`
- With the OpenAI Python client, pass `extra_body={"chat_template_kwargs": {"enable_thinking": True}}`
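The per-request toggle above can be sketched as a small helper plus a client call. `thinking_chat_kwargs` is a hypothetical name; the commented usage assumes the server from this section is running and the `openai` package is installed.

```python
# Build the extra_body payload that toggles Gemma 4 thinking mode per request.
def thinking_chat_kwargs(enable: bool = True) -> dict:
    """chat_template_kwargs are forwarded to the chat template, which is
    where enable_thinking is read."""
    return {"chat_template_kwargs": {"enable_thinking": enable}}


# With the server running (pip install openai):
#
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#   resp = client.chat.completions.create(
#       model="gemma-4",
#       messages=[{"role": "user", "content": "Explain NVFP4 briefly."}],
#       extra_body=thinking_chat_kwargs(True),
#   )
#   print(resp.choices[0].message.content)
```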
Tip: For text-only workloads, add `--limit-mm-per-prompt '{"image":0,"video":0}'` to skip multimodal encoder allocation. In practice, `--gpu-memory-utilization` values between 0.75 and 0.85 are safer than hard-coding 0.90.
## Text Generation

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4",
    "messages": [
      {"role": "user", "content": "Explain quantum entanglement in simple terms."}
    ],
    "max_tokens": 512
  }'
```