About quantization

#1
by pachePizza - opened

How do you do the quantization? Do you have the scripts?

ModelOpt NVFP4 quantization results for Gemma 4 (non-MoE models)
We tried quantizing Gemma 4 E2B, E4B, and 26B-A4B using NVIDIA ModelOpt. Here's what we found:

Setup
Image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
modelopt: 0.43.0rc2 from pypi.nvidia.com
transformers: 5.5.0 (installed after modelopt, requires PIP_CONSTRAINT=/dev/null to bypass NGC dill constraint)
Config: NVFP4_MLP_ONLY_CFG (same approach as nvidia/Gemma-4-31B-IT-NVFP4)
Calibration: 500 samples
All 3 models quantize and export successfully
E2B: 9.5 GB → 7.3 GB (MLP_ONLY), calibration + export in ~3 min
Output includes hf_quant_config.json with correct exclude_modules for self_attn, vision, audio
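For reference, the exported hf_quant_config.json looks roughly like this (field names follow ModelOpt's usual export schema; the exact exclude patterns here are an assumption, not copied from our checkpoint):

```json
{
  "producer": { "name": "modelopt", "version": "0.43.0rc2" },
  "quantization": {
    "quant_algo": "NVFP4",
    "exclude_modules": ["*self_attn*", "*vision*", "*audio*"]
  }
}
```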
Serving: fails (shape mismatch)
When serving with vllm/vllm-openai:gemma4-cu130 (v0.18.2rc1) + --quantization modelopt:

AssertionError: Attempted to load weight (torch.Size([768, 1536]))
into parameter (torch.Size([768, 3072]))
The MLP weights are exported with half dimensions (FP4 values packed, reducing the input dim by 2x). vLLM's default_weight_loader expects full-dimension tensors and fails the shape assert.
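The half-dimension shape falls out of nibble packing: two 4-bit codes share one byte along the input dimension. A minimal NumPy sketch (the real ModelOpt byte layout and code ordering may differ; `pack_fp4` is a hypothetical helper, not the library's API):

```python
import numpy as np

def pack_fp4(codes: np.ndarray) -> np.ndarray:
    """Pack 4-bit codes (values 0..15) two per byte along the last axis,
    halving that dimension: even indices in the low nibble, odd in the high."""
    assert codes.shape[-1] % 2 == 0
    lo = codes[..., 0::2].astype(np.uint8)
    hi = codes[..., 1::2].astype(np.uint8)
    return lo | (hi << 4)

# A [768, 3072] weight's 4-bit codes pack into a [768, 1536] uint8 tensor --
# exactly the half-dimension shape that trips vLLM's loader assert.
codes = np.random.randint(0, 16, size=(768, 3072))
packed = pack_fp4(codes)
print(packed.shape)  # (768, 1536)
```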

We verified that nvidia/Gemma-4-31B-IT-NVFP4 (produced with modelopt 0.37.0) uses the exact same half-dimension format, so the packing hasn't changed between versions. The 31B loads fine, but our custom-quantized E2B does not.

We also tried:

--quantization modelopt_fp4 → same shape mismatch
gemma4_patched.py from bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4 → same error (patch only fixes MoE expert scale suffix mapping, not the general weight loading for non-MoE layers)
Questions
Why does nvidia/Gemma-4-31B-IT-NVFP4 load fine with the same half-dim format but our E2B doesn't? Is there something specific about the 31B checkpoint structure vs what 0.43.0rc2 produces?
Has anyone successfully served a non-MoE Gemma 4 model quantized with modelopt 0.43.0rc2 on vLLM?

B&G Digital Services org

Dear @pachePizza - of course, just have a look in the repo: https://huggingface.co/bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4/blob/main/quantize_gemma4_moe.py

That's exactly the "magic" we've done: performed a bit of surgery on the 3D tensors -> transformed them to 2D matrices, quantized them, and glued them back together again 🅰
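That surgery can be sketched roughly like this in NumPy, with a hypothetical stand-in quantizer (real NVFP4 uses block scales, not one absmax scale per matrix; see the linked script for the actual implementation):

```python
import numpy as np

def quantize_2d(w: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the per-matrix NVFP4 quantizer:
    absmax-scale to FP4's max magnitude (6.0), round, rescale."""
    scale = np.abs(w).max() / 6.0
    if scale == 0.0:
        return w.copy()
    return np.round(w / scale) * scale

def quantize_moe_stack(experts: np.ndarray) -> np.ndarray:
    """The 'surgery': slice a 3D [num_experts, out_dim, in_dim] tensor
    into 2D matrices, quantize each one, and glue them back with stack()."""
    return np.stack([quantize_2d(w) for w in experts])

experts = np.random.randn(4, 8, 16)   # toy expert stack
q = quantize_moe_stack(experts)
print(q.shape)  # (4, 8, 16)
```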

I sincerely hope you can save some time by using the provided quantization!

Thanks, I finally got it working with that.
Were you able to serve it with vLLM, or did it fail for you too?

B&G Digital Services org

Thanks for the feedback! We saw your struggle with E2B as a feature request, so here you go:

For serving with vLLM, you need two things:

  1. A vLLM build with transformers >= 5.4 (stock vLLM ships transformers 4.57, which doesn't know about gemma4). If you're on DGX Spark, spark-vllm-docker with --tf5 works.

  2. The included gemma4_patched.py: mount it over vLLM's gemma4.py to fix the NVFP4 MoE scale key mapping bug (filed as vllm-project/vllm#38912).
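The kind of key remapping such a patch performs can be sketched like this (the suffix names in the example are hypothetical, for illustration only; the real mapping lives in gemma4_patched.py):

```python
def remap_scale_keys(state_dict: dict, suffix_map: dict) -> dict:
    """Rename quantization-scale keys so the checkpoint's suffixes match
    what vLLM's weight loader expects."""
    out = {}
    for key, value in state_dict.items():
        for old, new in suffix_map.items():
            if key.endswith(old):
                key = key[: -len(old)] + new
                break
        out[key] = value
    return out

# Hypothetical example suffixes, for illustration only:
sd = {"mlp.experts.0.w1.weight_scale_2": 0.5, "mlp.experts.0.w1.weight": 1.0}
fixed = remap_scale_keys(sd, {".weight_scale_2": ".weight_global_scale"})
print(sorted(fixed))  # the scale key is renamed, other keys pass through
```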

docker run -d \
  --gpus all --ipc=host --network host \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -v /path/to/model:/model \
  -v /path/to/gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py \
  <your-vllm-tf5-image> \
  vllm serve /model \
    --served-model-name gemma-4 \
    --host 0.0.0.0 --port 8888 \
    --quantization modelopt \
    --dtype auto --kv-cache-dtype fp8 \
    --moe-backend marlin \
    --trust-remote-code

We've tested all of this on DGX Spark and it works well. The E2B models have a large vision encoder (~5.8 GB in BF16) that we excluded from quantization, so disk size is ~7 GB rather than the ~3.4 GB you'd expect. Text inference still benefits from the quantized MoE experts.

P.S. We spun up an H200 for an hour just for you; if this saves you time, the coffee link is in the model card ☕🇮🇹

marioiseli changed discussion status to closed
