About quantization

#1
by pachePizza - opened

How do you do the quantization? Do you have the scripts?

ModelOpt NVFP4 quantization results for Gemma 4 (non-MoE models)
We tried quantizing Gemma 4 E2B, E4B, and 26B-A4B using NVIDIA ModelOpt. Here's what we found:

Setup
Image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
modelopt: 0.43.0rc2 from pypi.nvidia.com
transformers: 5.5.0 (installed after modelopt, requires PIP_CONSTRAINT=/dev/null to bypass NGC dill constraint)
Config: NVFP4_MLP_ONLY_CFG (same approach as nvidia/Gemma-4-31B-IT-NVFP4)
Calibration: 500 samples
All 3 models quantize and export successfully
E2B: 9.5 GB → 7.3 GB (MLP_ONLY), calibration + export in ~3 min
Output includes hf_quant_config.json with correct exclude_modules for self_attn, vision, audio
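For reference, the exported hf_quant_config.json looks roughly like this (field names follow ModelOpt's usual export schema; the exact exclude patterns here are an assumption, not copied from our checkpoint):

```json
{
  "producer": { "name": "modelopt", "version": "0.43.0rc2" },
  "quantization": {
    "quant_algo": "NVFP4",
    "exclude_modules": ["*self_attn*", "*vision*", "*audio*"]
  }
}
```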
Serving: fails (shape mismatch)
When serving with vllm/vllm-openai:gemma4-cu130 (v0.18.2rc1) + --quantization modelopt:

AssertionError: Attempted to load weight (torch.Size([768, 1536]))
into parameter (torch.Size([768, 3072]))
The MLP weights are exported with half dimensions (FP4 values packed, reducing the input dim by 2x). vLLM's default_weight_loader expects full-dimension tensors and fails the shape assert.
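The half-dimension shape falls out of nibble packing: two 4-bit codes share one byte along the input dimension. A minimal NumPy sketch (the real ModelOpt byte layout and code ordering may differ; `pack_fp4` is a hypothetical helper, not the library's API):

```python
import numpy as np

def pack_fp4(codes: np.ndarray) -> np.ndarray:
    """Pack 4-bit codes (values 0..15) two per byte along the last axis,
    halving that dimension: even indices in the low nibble, odd in the high."""
    assert codes.shape[-1] % 2 == 0
    lo = codes[..., 0::2].astype(np.uint8)
    hi = codes[..., 1::2].astype(np.uint8)
    return lo | (hi << 4)

# A [768, 3072] weight's 4-bit codes pack into a [768, 1536] uint8 tensor --
# exactly the half-dimension shape that trips vLLM's loader assert.
codes = np.random.randint(0, 16, size=(768, 3072))
packed = pack_fp4(codes)
print(packed.shape)  # (768, 1536)
```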

We verified that nvidia/Gemma-4-31B-IT-NVFP4 (produced with modelopt 0.37.0) uses the exact same half-dimension format, so the packing hasn't changed between versions. The 31B loads fine, but our custom-quantized E2B does not.

We also tried:

--quantization modelopt_fp4 → same shape mismatch
gemma4_patched.py from bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4 → same error (patch only fixes MoE expert scale suffix mapping, not the general weight loading for non-MoE layers)
Questions
Why does nvidia/Gemma-4-31B-IT-NVFP4 load fine with the same half-dim format but our E2B doesn't? Is there something specific about the 31B checkpoint structure vs what 0.43.0rc2 produces?
Has anyone successfully served a non-MoE Gemma 4 model quantized with modelopt 0.43.0rc2 on vLLM?

B&G Digital Services org

Dear @pachePizza - of course, just have a look in the repo: https://huggingface.co/bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4/blob/main/quantize_gemma4_moe.py

That's exactly the "magic" we've done: performed a bit of surgery on the 3D tensors -> transformed them to 2D matrices, quantized them, and glued them back together again 🅰
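That surgery can be sketched roughly like this in NumPy, with a hypothetical stand-in quantizer (real NVFP4 uses block scales, not one absmax scale per matrix; see the linked script for the actual implementation):

```python
import numpy as np

def quantize_2d(w: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the per-matrix NVFP4 quantizer:
    absmax-scale to FP4's max magnitude (6.0), round, rescale."""
    scale = np.abs(w).max() / 6.0
    if scale == 0.0:
        return w.copy()
    return np.round(w / scale) * scale

def quantize_moe_stack(experts: np.ndarray) -> np.ndarray:
    """The 'surgery': slice a 3D [num_experts, out_dim, in_dim] tensor
    into 2D matrices, quantize each one, and glue them back with stack()."""
    return np.stack([quantize_2d(w) for w in experts])

experts = np.random.randn(4, 8, 16)   # toy expert stack
q = quantize_moe_stack(experts)
print(q.shape)  # (4, 8, 16)
```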

I sincerely hope you can save some time by using the provided quantization!

Thanks, I finally got it working with that.
Were you able to serve it with vLLM, or did it fail for you too?

B&G Digital Services org

Thanks for the feedback! We saw your struggle with E2B as a feature request, so here you go:

For serving with vLLM, you need two things:

  1. A vLLM build with transformers >= 5.4 (stock vLLM ships transformers 4.57, which doesn't know about gemma4). If you're on DGX Spark, spark-vllm-docker with --tf5 works.

  2. The included gemma4_patched.py: mount it over vLLM's gemma4.py to fix the NVFP4 MoE scale key mapping bug (filed as vllm-project/vllm#38912).
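The kind of key remapping such a patch performs can be sketched like this (the suffix names in the example are hypothetical, for illustration only; the real mapping lives in gemma4_patched.py):

```python
def remap_scale_keys(state_dict: dict, suffix_map: dict) -> dict:
    """Rename quantization-scale keys so the checkpoint's suffixes match
    what vLLM's weight loader expects."""
    out = {}
    for key, value in state_dict.items():
        for old, new in suffix_map.items():
            if key.endswith(old):
                key = key[: -len(old)] + new
                break
        out[key] = value
    return out

# Hypothetical example suffixes, for illustration only:
sd = {"mlp.experts.0.w1.weight_scale_2": 0.5, "mlp.experts.0.w1.weight": 1.0}
fixed = remap_scale_keys(sd, {".weight_scale_2": ".weight_global_scale"})
print(sorted(fixed))  # the scale key is renamed, other keys pass through
```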

docker run -d \
  --gpus all --ipc=host --network host \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -v /path/to/model:/model \
  -v /path/to/gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py \
  <your-vllm-tf5-image> \
  vllm serve /model \
    --served-model-name gemma-4 \
    --host 0.0.0.0 --port 8888 \
    --quantization modelopt \
    --dtype auto --kv-cache-dtype fp8 \
    --moe-backend marlin \
    --trust-remote-code

We've tested all of this on DGX Spark and it works well. The E2B models have a large vision encoder (~5.8 GB in BF16) that we excluded from quantization, so disk size is ~7 GB rather than the ~3.4 GB you'd expect. Text inference still benefits from the quantized MoE experts.

P.S. We spun up an H200 for an hour just for you; if this saves you time, the coffee link is in the model card ☕🇮🇹

marioiseli changed discussion status to closed
