First community NVFP4 quantization of Gemma 4 26B-A4B-it (49GB → 16.5GB)
We published NVFP4 quantizations of this model, both W4A4 and W4A16:
- https://huggingface.co/bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4
- https://huggingface.co/bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4A16
Quantized and validated on a single NVIDIA DGX Spark (128GB, GB10 Blackwell). Chat, code generation, and tool calling all work via vLLM.
One thing worth noting for the Gemma team: the fused 3D expert tensor format (an nn.Parameter of shape [128, dim, dim] for gate_up_proj and down_proj) breaks all existing quantization tools. NVIDIA's modelopt, llm-compressor, and TensorRT-LLM all expect an nn.ModuleList of nn.Linear layers, so they silently skip the expert parameters, which account for 91% of the model's weights.
We wrote a custom modelopt plugin to unfuse the experts into individual nn.Linear layers before quantization. It works, but native tool support would help the community. Filed: https://github.com/NVIDIA/Model-Optimizer/issues/1173
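For anyone hitting the same issue before native support lands, the core of the unfuse step can be sketched roughly like this. This is a minimal illustration, not the actual plugin: it assumes the fused parameter is laid out as [num_experts, in_dim, out_dim], and it only builds the per-expert nn.Linear layers; the real plugin must also rewire the MoE forward pass and restore the fused layout on export.

```python
import torch
from torch import nn

def unfuse_experts(fused: torch.Tensor) -> nn.ModuleList:
    """Split a fused [num_experts, in_dim, out_dim] expert parameter
    into individual nn.Linear layers that quantization tools can see.
    Illustrative sketch only; layout assumptions may differ per model."""
    num_experts, in_dim, out_dim = fused.shape
    experts = nn.ModuleList()
    for e in range(num_experts):
        linear = nn.Linear(in_dim, out_dim, bias=False)
        # nn.Linear stores weight as [out_features, in_features],
        # so transpose each [in_dim, out_dim] expert slice.
        linear.weight = nn.Parameter(fused[e].T.contiguous())
        experts.append(linear)
    return experts

# Toy stand-in for the real [128, dim, dim] parameter.
fused = torch.randn(4, 8, 16)
experts = unfuse_experts(fused)
x = torch.randn(2, 8)
# Each unfused layer reproduces the matmul against its fused slice.
assert torch.allclose(experts[0](x), x @ fused[0], atol=1e-5)
```

Once the experts are exposed as ordinary nn.Linear modules, the standard quantization recipes pick them up instead of silently skipping them.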
Full quantization script and vLLM patches are included in the repos.
Hi Mario, thanks for your contribution. I'll take a look at your quantization. Have you already tried building a recipe (https://github.com/eugr/spark-vllm-docker/blob/main/recipes/README.md) and submitting it to https://spark-arena.com?
Hi @raphaelamorim -
Thanks for sharing this with the community! The NVFP4 quantizations and the vLLM validation are incredibly valuable, and your note on fused expert tensors breaking existing tooling is especially insightful.