First community NVFP4 quantization of Gemma 4 26B-A4B-it (49GB → 16.5GB)
We published NVFP4 quantizations of this model, both W4A4 and W4A16:
- https://huggingface.co/bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4
- https://huggingface.co/bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4A16
Quantized and validated on a single NVIDIA DGX Spark (128GB, GB10 Blackwell). Chat, code generation, and tool calling all work via vLLM.
One thing worth noting for the Gemma team: the fused 3D expert tensor format (an nn.Parameter of shape [128, dim, dim] for gate_up_proj and down_proj) breaks all existing quantization tools. NVIDIA's modelopt, llm-compressor, and TensorRT-LLM all expect an nn.ModuleList of nn.Linear layers, so they silently skip the expert parameters, which account for 91% of the model's weights.
We wrote a custom modelopt plugin to unfuse the experts into individual nn.Linear layers before quantization. It works, but native tool support would help the community. Filed: https://github.com/NVIDIA/Model-Optimizer/issues/1173
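For anyone hitting the same issue before native support lands, the core of the unfuse step can be sketched roughly like this. This is a minimal illustration, not the actual plugin: it assumes the fused parameter is laid out as [num_experts, in_dim, out_dim], and it only builds the per-expert nn.Linear layers; the real plugin must also rewire the MoE forward pass and restore the fused layout on export.

```python
import torch
from torch import nn

def unfuse_experts(fused: torch.Tensor) -> nn.ModuleList:
    """Split a fused [num_experts, in_dim, out_dim] expert parameter
    into individual nn.Linear layers that quantization tools can see.
    Illustrative sketch only; layout assumptions may differ per model."""
    num_experts, in_dim, out_dim = fused.shape
    experts = nn.ModuleList()
    for e in range(num_experts):
        linear = nn.Linear(in_dim, out_dim, bias=False)
        # nn.Linear stores weight as [out_features, in_features],
        # so transpose each [in_dim, out_dim] expert slice.
        linear.weight = nn.Parameter(fused[e].T.contiguous())
        experts.append(linear)
    return experts

# Toy stand-in for the real [128, dim, dim] parameter.
fused = torch.randn(4, 8, 16)
experts = unfuse_experts(fused)
x = torch.randn(2, 8)
# Each unfused layer reproduces the matmul against its fused slice.
assert torch.allclose(experts[0](x), x @ fused[0], atol=1e-5)
```

Once the experts are exposed as ordinary nn.Linear modules, the standard quantization recipes pick them up instead of silently skipping them.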
Full quantization script and vLLM patches are included in the repos.
Hi Mario, thanks for your contribution. I'll take a look at your quantization. Have you already tried building a recipe (https://github.com/eugr/spark-vllm-docker/blob/main/recipes/README.md) and submitting it to https://spark-arena.com?
Hi @raphaelamorim -
Thanks for sharing this with the community! The NVFP4 quantizations and the vLLM validation are incredibly valuable, and your note on fused expert tensors breaking existing tooling is especially insightful.