About quantization
How do you do the quantization? Do you have the scripts?
ModelOpt NVFP4 quantization results for Gemma 4 (non-MoE models)
We tried quantizing Gemma 4 E2B, E4B, and 26B-A4B using NVIDIA ModelOpt. Here's what we found:
Setup
Image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
modelopt: 0.43.0rc2 from pypi.nvidia.com
transformers: 5.5.0 (installed after modelopt, requires PIP_CONSTRAINT=/dev/null to bypass NGC dill constraint)
Config: NVFP4_MLP_ONLY_CFG (same approach as nvidia/Gemma-4-31B-IT-NVFP4)
Calibration: 500 samples
All 3 models quantize and export successfully
E2B: 9.5 GB → 7.3 GB (MLP_ONLY), calibration + export in ~3 min
Output includes hf_quant_config.json with correct exclude_modules for self_attn, vision, audio
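For reference, the exported hf_quant_config.json looks roughly like this. This is an illustrative sketch only; the exact field names, wildcard patterns, and group size are assumptions based on ModelOpt's NVFP4 export format, not copied from our actual checkpoint:

```json
{
  "producer": {
    "name": "modelopt",
    "version": "0.43.0rc2"
  },
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": null,
    "group_size": 16,
    "exclude_modules": ["*self_attn*", "*vision*", "*audio*", "lm_head"]
  }
}
```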
Serving: fails (shape mismatch)
When serving with vllm/vllm-openai:gemma4-cu130 (v0.18.2rc1) + --quantization modelopt:
AssertionError: Attempted to load weight (torch.Size([768, 1536]))
into parameter (torch.Size([768, 3072]))
The MLP weights are exported with half dimensions (FP4 values packed, reducing the input dim by 2x). vLLM's default_weight_loader expects full-dimension tensors and fails the shape assert.
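A minimal sketch of why the exported dim is halved: two 4-bit values get packed into each uint8 byte, so a [768, 3072] weight serializes as a [768, 1536] packed tensor. This pure-numpy illustration is not ModelOpt's actual packing code, just the general nibble-packing idea:

```python
import numpy as np

def pack_fp4(codes: np.ndarray) -> np.ndarray:
    """Pack pairs of 4-bit codes (values 0..15) along the last axis into uint8."""
    assert codes.shape[-1] % 2 == 0
    lo = codes[..., 0::2].astype(np.uint8)  # even positions -> low nibble
    hi = codes[..., 1::2].astype(np.uint8)  # odd positions  -> high nibble
    return lo | (hi << 4)  # two nibbles per byte: last dim halves

# A quantized MLP weight's 4-bit codes: 768 x 3072
codes = np.random.randint(0, 16, size=(768, 3072))
packed = pack_fp4(codes)
print(packed.shape)  # (768, 1536) -- the half-dimension tensor the default loader rejects
```

This is why a loader that compares raw tensor shapes against the model's parameter shapes trips the assert: the checkpoint tensor is legitimately half-width, and the loader has to know to account for the packing.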
We verified that nvidia/Gemma-4-31B-IT-NVFP4 (produced with modelopt 0.37.0) uses the exact same half-dimension format, so the packing hasn't changed between versions. The 31B loads fine, but our custom-quantized E2B does not.
We also tried:
--quantization modelopt_fp4 → same shape mismatch
gemma4_patched.py from bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4 → same error (the patch only fixes the MoE expert scale suffix mapping, not general weight loading for non-MoE layers)
Questions
Why does nvidia/Gemma-4-31B-IT-NVFP4 load fine with the same half-dim format but our E2B doesn't? Is there something specific about the 31B checkpoint structure vs what 0.43.0rc2 produces?
Has anyone successfully served a non-MoE Gemma 4 model quantized with modelopt 0.43.0rc2 on vLLM?
Dear @pachePizza, of course, just have a look in the repo: https://huggingface.co/bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4/blob/main/quantize_gemma4_moe.py
That's exactly the "magic" we did: a bit of surgery on the 3D tensors -> transform them to 2D matrices, quantize them, and glue them back together again 🙂
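The "surgery" boils down to something like the following simplified numpy sketch. The toy absmax round-trip stands in for the real NVFP4 quantizer, and all shapes and names are illustrative, not taken from the actual script:

```python
import numpy as np

def fake_quant_2d(w: np.ndarray) -> np.ndarray:
    """Stand-in for the real NVFP4 quantizer: per-row absmax round-trip to a 4-bit grid."""
    scale = np.abs(w).max(axis=-1, keepdims=True) / 7.0  # signed 4-bit range ~ [-7, 7]
    scale[scale == 0] = 1.0                              # avoid divide-by-zero on empty rows
    return np.round(w / scale).clip(-7, 7) * scale

def quantize_moe_experts(experts: np.ndarray) -> np.ndarray:
    """3D expert tensor (num_experts, out, in): flatten to 2D, quantize, reshape back."""
    e, out_dim, in_dim = experts.shape
    flat = experts.reshape(e * out_dim, in_dim)  # surgery: 3D -> 2D
    qflat = fake_quant_2d(flat)                  # quantize like an ordinary 2D matrix
    return qflat.reshape(e, out_dim, in_dim)     # glue back together: 2D -> 3D

experts = np.random.randn(8, 64, 128).astype(np.float32)
q = quantize_moe_experts(experts)
print(q.shape)  # (8, 64, 128) -- same shape, values snapped to the 4-bit grid
```

The point of the reshape is that the quantization routine only understands 2D weight matrices, so the stacked expert dimension is folded into the row dimension and unfolded afterwards.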
I sincerely hope you can save some time by using the provided quantization!
Thanks, I finally managed to do it with that.
Were you able to serve it with vLLM, or not?
Thanks for the feedback! We took your struggle with E2B as a feature request; here you go:
- Gemma-4-E2B-it-NVFP4 (W4A4)
- Gemma-4-E2B-it-NVFP4A16 (W4A16)
- Gemma-4-E2B-NVFP4 (W4A4, base)
- Gemma-4-E2B-NVFP4A16 (W4A16, base)
For serving with vLLM, you need two things:
1. A vLLM build with `transformers >= 5.4` (stock vLLM ships 4.57, which doesn't know gemma4). If you're on DGX Spark, spark-vllm-docker with `--tf5` works.
2. The included `gemma4_patched.py`: mount it over vLLM's `gemma4.py` to fix the NVFP4 MoE scale key mapping bug (filed as vllm-project/vllm#38912).
```bash
docker run -d \
  --gpus all --ipc=host --network host \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -v /path/to/model:/model \
  -v /path/to/gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py \
  <your-vllm-tf5-image> \
  vllm serve /model \
    --served-model-name gemma-4 \
    --host 0.0.0.0 --port 8888 \
    --quantization modelopt \
    --dtype auto --kv-cache-dtype fp8 \
    --moe-backend marlin \
    --trust-remote-code
```
We've tested all of these on DGX Spark and they work well. The E2B models have a large vision encoder (~5.8 GB in BF16) that we excluded from quantization, so disk size is ~7 GB rather than the ~3.4 GB you'd expect. Text inference still benefits from the quantized MoE experts.
P.S. We spun up an H200 for an hour just for you; if this saves you time, the coffee link is in the model card ☕🇮🇹