Gemma-4-26B-A4B-it-NVFP4

An NVFP4-quantized build of google/gemma-4-26B-A4B-it for serving with vLLM.

This checkpoint is a Gemma 4 MoE model. Current vLLM releases still need the bundled gemma4_patched.py for correct NVFP4 MoE scale-key loading on this model family.

Quantization Profile

  • Text experts: NVFP4
  • Text self-attention: NVFP4
  • Token embeddings: NVFP4
  • Router: higher precision
  • lm_head: higher precision
  • Vision tower and vision embeddings: higher precision
  • KV cache: FP8
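The split above can be spot-checked directly against the checkpoint shards: a safetensors file starts with an 8-byte little-endian header length followed by a JSON header that records every tensor's dtype. A minimal stdlib-only sketch (the shard filename in the comment is a placeholder, not a file guaranteed to exist in this repo):

```python
import json
import struct
from collections import Counter

def dtype_summary(path):
    """Count tensors per dtype in a .safetensors file by parsing its JSON header."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]  # u64 LE header size
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # the metadata entry is not a tensor
    return Counter(entry["dtype"] for entry in header.values())

# Example (placeholder shard name):
# print(dtype_summary("model-00001-of-00007.safetensors"))
```

On this checkpoint you should see a mix of U8 (packed NVFP4 weights), F8_E4M3 (scales), and BF16 (the higher-precision tensors listed above).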

Included Files

  • gemma4_patched.py: patched Gemma 4 vLLM model file, currently required to load this checkpoint with released vLLM builds
  • quantize_gemma4_moe.py: quantization/export script redistributed under Apache-2.0

Attribution

  • quantize_gemma4_moe.py is based on the community release published by bg-digitalservices / marioiseli
  • gemma4_patched.py is derived from vLLM's Apache-2.0 Gemma 4 implementation

Deployment

Run the commands below from the model directory after cloning or downloading the repo.

Docker (recommended)

docker run -d \
  --name gemma4-a4b-nvfp4 \
  --gpus all \
  --ipc=host \
  --network host \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -v "$(pwd):/model:ro" \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  -v "$(pwd)/gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py:ro" \
  vllm/vllm-openai:gemma4-cu130 \
  /model \
  --served-model-name gemma-4 \
  --host 0.0.0.0 \
  --port 8000 \
  --generation-config vllm \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --moe-backend marlin \
  --trust-remote-code

Native vllm serve

source .venv-vllm/bin/activate

PATCH_TARGET="$(python -c 'import inspect, vllm.model_executor.models.gemma4 as m; print(inspect.getfile(m))')"
cp "$PATCH_TARGET" "${PATCH_TARGET}.bak"
cp ./gemma4_patched.py "$PATCH_TARGET"
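Before launching, it is worth confirming the copy actually took effect; comparing file hashes is a quick check. A small sketch (the helper is illustrative, paths as in the commands above):

```python
import hashlib

def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Example: the installed module should now match the bundled patch.
# assert sha256_of("gemma4_patched.py") == sha256_of(PATCH_TARGET)
```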

VLLM_NVFP4_GEMM_BACKEND=marlin \
vllm serve . \
  --served-model-name gemma-4 \
  --host 0.0.0.0 \
  --port 8000 \
  --generation-config vllm \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --moe-backend marlin \
  --trust-remote-code

To restore the original vLLM file later:

mv "${PATCH_TARGET}.bak" "$PATCH_TARGET"

Why these flags matter

  • --quantization modelopt: required for ModelOpt NVFP4 checkpoints
  • --moe-backend marlin: required for the MoE expert path
  • VLLM_NVFP4_GEMM_BACKEND=marlin: uses Marlin on the non-MoE NVFP4 path as well
  • --kv-cache-dtype fp8: reduces KV memory pressure
  • --trust-remote-code: required for Gemma 4

Official Gemma 4 vLLM recipe:

https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html

Thinking and tool calling

To expose Gemma 4 reasoning parsing and tool calling on the server:

VLLM_NVFP4_GEMM_BACKEND=marlin \
vllm serve . \
  --served-model-name gemma-4 \
  --generation-config vllm \
  --max-model-len 32768 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --moe-backend marlin \
  --trust-remote-code \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --enable-auto-tool-choice

Important:

  • --reasoning-parser gemma4 parses Gemma 4 reasoning output
  • actual thinking mode is activated per request with chat_template_kwargs.enable_thinking=true
  • with the OpenAI Python client, pass extra_body={"chat_template_kwargs": {"enable_thinking": True}}
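Concretely, the request body that reaches the server must carry chat_template_kwargs at the top level; that is exactly what extra_body merges in. A minimal sketch of building such a payload (the helper function is hypothetical; the model name matches --served-model-name above):

```python
import json

def build_chat_payload(prompt, enable_thinking=True, model="gemma-4"):
    """Build an OpenAI-style chat-completions body with thinking mode requested.

    With the OpenAI Python client, the "chat_template_kwargs" key is the part
    supplied via extra_body={"chat_template_kwargs": {...}}.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }

print(json.dumps(build_chat_payload("Why is the sky blue?"), indent=2))
```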

Tip: For text-only workloads, add --limit-mm-per-prompt '{"image":0,"video":0}' to skip multimodal encoder allocation. In practice, --gpu-memory-utilization 0.75 to 0.85 is safer than hard-coding 0.90.

Text Generation

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4",
    "messages": [
      {"role": "user", "content": "Explain quantum entanglement in simple terms."}
    ],
    "max_tokens": 512
  }'
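The server replies in the standard OpenAI chat-completions schema, so the generated text sits under choices[0].message.content. A sketch of extracting it from a decoded response (the sample dict is illustrative, not captured from this model):

```python
def completion_text(response):
    """Extract the assistant text from an OpenAI-style chat-completions response."""
    return response["choices"][0]["message"]["content"]

sample = {
    "choices": [
        {"index": 0,
         "message": {"role": "assistant",
                     "content": "Entangled particles share a single quantum state."}}
    ]
}
print(completion_text(sample))
```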