# Gemma-4-26B-A4B-it-NVFP4

NVFP4-quantized version of `google/gemma-4-26B-A4B-it` for vLLM.

This checkpoint is a Gemma 4 MoE model. Current vLLM releases still need the bundled `gemma4_patched.py` for correct NVFP4 MoE scale-key loading on this model family.
## Quantization Profile

- Text experts: NVFP4
- Text self-attention: NVFP4
- Token embeddings: NVFP4
- Router: higher precision
- `lm_head`: higher precision
- Vision tower and vision embeddings: higher precision
- KV cache: FP8
## Included Files

- `gemma4_patched.py`: patched Gemma 4 vLLM model file, required for this checkpoint today
- `quantize_gemma4_moe.py`: quantization/export script, redistributed under Apache-2.0

Attribution:

- `quantize_gemma4_moe.py` is based on the community release published by bg-digitalservices/marioiseli
- `gemma4_patched.py` is derived from vLLM's Apache-2.0 Gemma 4 implementation
## Deployment

Run the commands below from the model directory after cloning or downloading the repo.

### Docker (recommended)
```bash
docker run -d \
  --name gemma4-a4b-nvfp4 \
  --gpus all \
  --ipc=host \
  --network host \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -v "$(pwd):/model:ro" \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  -v "$(pwd)/gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py:ro" \
  vllm/vllm-openai:gemma4-cu130 \
  /model \
  --served-model-name gemma-4 \
  --host 0.0.0.0 \
  --port 8000 \
  --generation-config vllm \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --moe-backend marlin \
  --trust-remote-code
```
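Once the container is up, a quick readiness check is to list the served models. The sketch below assumes host networking, so the OpenAI-compatible API is reachable on `localhost:8000`; `check_server` is a hypothetical helper, not part of this repo.

```python
# Minimal readiness check for the server started above (stdlib only).
import json
import urllib.request


def served_model_ids(models_json: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response."""
    return [m["id"] for m in models_json.get("data", [])]


def check_server(base_url: str = "http://localhost:8000") -> list:
    """Query /v1/models; raises URLError if the server is not up yet."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        return served_model_ids(json.load(resp))
```

Against a healthy deployment, `check_server()` should return `["gemma-4"]`, matching the `--served-model-name` flag above.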
### Native `vllm serve`

```bash
source .venv-vllm/bin/activate

# Locate the installed Gemma 4 model file, back it up, then apply the patch.
PATCH_TARGET="$(python -c 'import inspect, vllm.model_executor.models.gemma4 as m; print(inspect.getfile(m))')"
cp "$PATCH_TARGET" "${PATCH_TARGET}.bak"
cp ./gemma4_patched.py "$PATCH_TARGET"

VLLM_NVFP4_GEMM_BACKEND=marlin \
vllm serve . \
  --served-model-name gemma-4 \
  --host 0.0.0.0 \
  --port 8000 \
  --generation-config vllm \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --moe-backend marlin \
  --trust-remote-code
```

To restore the original vLLM file later:

```bash
mv "${PATCH_TARGET}.bak" "$PATCH_TARGET"
```
## Why these flags matter

- `--quantization modelopt`: required for ModelOpt NVFP4 checkpoints
- `--moe-backend marlin`: required for the MoE expert path
- `VLLM_NVFP4_GEMM_BACKEND=marlin`: uses Marlin on the non-MoE NVFP4 path as well
- `--kv-cache-dtype fp8`: reduces KV-cache memory pressure
- `--trust-remote-code`: required for Gemma 4
Official Gemma 4 vLLM recipe:
https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html
## Thinking and tool calling

To expose Gemma 4 reasoning parsing and tool calling on the server:

```bash
VLLM_NVFP4_GEMM_BACKEND=marlin \
vllm serve . \
  --served-model-name gemma-4 \
  --generation-config vllm \
  --max-model-len 32768 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --moe-backend marlin \
  --trust-remote-code \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --enable-auto-tool-choice
```
Important:

- `--reasoning-parser gemma4` parses Gemma 4 reasoning output
- Actual thinking mode is activated per request with `chat_template_kwargs.enable_thinking=true`
- With the OpenAI Python client, pass `extra_body={"chat_template_kwargs": {"enable_thinking": True}}`
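The per-request toggle above can be sketched as a small helper plus a client call. `thinking_chat_kwargs` is a hypothetical name; the commented usage assumes the server from this section is running and the `openai` package is installed.

```python
# Build the extra_body payload that toggles Gemma 4 thinking mode per request.
def thinking_chat_kwargs(enable: bool = True) -> dict:
    """chat_template_kwargs are forwarded to the chat template, which is
    where enable_thinking is read."""
    return {"chat_template_kwargs": {"enable_thinking": enable}}


# With the server running (pip install openai):
#
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#   resp = client.chat.completions.create(
#       model="gemma-4",
#       messages=[{"role": "user", "content": "Explain NVFP4 briefly."}],
#       extra_body=thinking_chat_kwargs(True),
#   )
#   print(resp.choices[0].message.content)
```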
Tip: For text-only workloads, add `--limit-mm-per-prompt '{"image":0,"video":0}'` to skip multimodal encoder allocation. In practice, `--gpu-memory-utilization` values between 0.75 and 0.85 are safer than hard-coding 0.90.
## Text Generation

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4",
    "messages": [
      {"role": "user", "content": "Explain quantum entanglement in simple terms."}
    ],
    "max_tokens": 512
  }'
```