Cannot run in llm-compressor
Hi! Is it already possible to run llm-compressor on Gemma 4 to get NVFP4 quantizations?
I'm trying to quantize Gemma 4 models (E2B, E4B, 26B-A4B) to NVFP4 using llm-compressor, but I'm hitting a dependency conflict.
**Setup:**

- Running quantization as K8s Jobs on NVIDIA GPUs
- Base image: vllm/vllm-openai:gemma4-cu130 (which successfully serves Gemma 4 for inference)
**The problem:**

- Gemma 4 (`gemma4` model type) requires transformers >= 5.5.0
- The latest llmcompressor release (v0.10.0.1) pins transformers >= 4.56.1, <= 4.57.6
- transformers v5.0 removed `TORCH_INIT_FUNCTIONS` from `modeling_utils.py`, which llmcompressor imports at `llmcompressor/utils/dev.py:15`
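One stopgap, until llmcompressor stops importing the removed symbol, is to re-create it before llmcompressor is imported. This is only a sketch of that idea, not an official fix: the helper name is mine, and the dict contents are my reconstruction of what transformers 4.x exposed (in-place `torch.nn.init` functions keyed by name, which llmcompressor temporarily patches to skip weight init).

```python
import importlib
import importlib.util


def ensure_torch_init_functions() -> bool:
    """Re-create transformers.modeling_utils.TORCH_INIT_FUNCTIONS if missing.

    transformers v5 removed the symbol, but llmcompressor still imports it.
    Call this BEFORE `import llmcompressor`. Returns True if the symbol is
    present (already or after patching), False if transformers is not
    importable at all.
    """
    if importlib.util.find_spec("transformers") is None:
        return False
    mu = importlib.import_module("transformers.modeling_utils")
    if hasattr(mu, "TORCH_INIT_FUNCTIONS"):
        return True
    init = importlib.import_module("torch.nn.init")
    # Approximate the 4.x contents: the trailing-underscore (in-place)
    # init functions from torch.nn.init, keyed by their names.
    mu.TORCH_INIT_FUNCTIONS = {
        name: getattr(init, name)
        for name in dir(init)
        if name.endswith("_") and not name.startswith("_")
    }
    return True
```

Whether the patched dict is semantically close enough for llmcompressor's skip-init logic would need verification against the actual 4.x source; it is offered only as a direction, not a tested workaround.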
**What I tried:**

- vllm/vllm-openai:nightly + `pip install llmcompressor` → `KeyError: 'gemma4'` (transformers too old)
- vllm/vllm-openai:gemma4-cu130 + `pip install llmcompressor` → same `KeyError: 'gemma4'` (llmcompressor installs its own transformers <= 4.57.6)
- Installing transformers from GitHub main after llmcompressor → `ImportError: cannot import name 'TORCH_INIT_FUNCTIONS' from 'transformers.modeling_utils'`
So there's no version combination that currently works: gemma4 support requires transformers v5.5.0+, but llmcompressor is incompatible with transformers v5.x.
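For anyone reproducing this, here is a small diagnostic sketch that reports which side of the conflict a given image is on: whether the installed transformers still exposes the symbol llmcompressor imports, and whether it knows the `gemma4` model type. It only reads installed metadata and makes no changes.

```python
import importlib
import importlib.util


def diagnose() -> dict:
    """Check the installed transformers against both requirements:
    exposing TORCH_INIT_FUNCTIONS (needed by llmcompressor) and
    registering the gemma4 model type (needed to load the model)."""
    report = {}
    if importlib.util.find_spec("transformers") is None:
        report["transformers"] = None  # not installed
        return report
    tf = importlib.import_module("transformers")
    report["transformers"] = tf.__version__
    mu = importlib.import_module("transformers.modeling_utils")
    report["has_TORCH_INIT_FUNCTIONS"] = hasattr(mu, "TORCH_INIT_FUNCTIONS")
    cfg = importlib.import_module("transformers.models.auto.configuration_auto")
    report["knows_gemma4"] = "gemma4" in cfg.CONFIG_MAPPING_NAMES
    return report


print(diagnose())
```

If both `has_TORCH_INIT_FUNCTIONS` and `knows_gemma4` ever come back True on the same install, that version would be a candidate for this workflow.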
How did you get it to work?
My PR covers this exact setup: https://github.com/vllm-project/llm-compressor/pull/2561 — working example + install instructions in the comments.
I see you quantized it with llm-compressor from source using scheme NVFP4A16 (weight-only). Did you also try NVFP4 (weights + activations)?
When I try NVFP4 on Gemma 4, llm-compressor fails during torch.fx tracing with:

```
Proxy object cannot be iterated. This can be attempted when the Proxy is used in a loop
or as a *args or **kwargs function argument.
```
It seems the multimodal forward pass (vision/audio mask operations) can't be traced by torch.fx, which is needed for activation quantization in NVFP4 but not for weight-only NVFP4A16.
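That failure mode is easy to reproduce in isolation. A minimal sketch, unrelated to Gemma 4's actual modeling code: any data-dependent iteration over an input tensor inside `forward` trips `torch.fx` symbolic tracing, because the traced input is a `Proxy` with no concrete length.

```python
import torch
from torch import fx, nn


class IteratesOverInput(nn.Module):
    def forward(self, x):
        # During symbolic tracing, `x` is a Proxy; `iter(x)` cannot be
        # resolved symbolically and raises torch.fx.proxy.TraceError.
        total = x.new_zeros(())
        for row in x:
            total = total + row.sum()
        return total


err_msg = ""
try:
    fx.symbolic_trace(IteratesOverInput())
except Exception as err:
    err_msg = str(err)
print(err_msg)
```

Mask construction for the vision/audio paths presumably hits the same class of data-dependent control flow, which weight-only quantization never needs to trace.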
Did you hit the same issue, or did you find a way to make NVFP4 work with Gemma 4?
Hey @pachePizza, I didn't try full NVFP4. For reference, NVIDIA's own checkpoint (nvidia/Gemma-4-31B-IT-NVFP4) also skips the vision/audio towers and quantizes only the language-model weights, which is why the parameter count drops from 31B to 21B. The vision/audio towers use custom clippable linear ops and are more sensitive to quantization; quantizing them will break vllm inference. So weight-only NVFP4A16 on the LM layers only is the right approach, not a limitation.
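For completeness, a hedged sketch of what that weight-only, LM-only setup could look like as an llm-compressor recipe. The `ignore` patterns for the vision/audio towers are assumptions on my part; check them against the actual module names in the Gemma 4 checkpoint you load.

```python
# Sketch only: exact import paths and module-name patterns not verified
# against this specific Gemma 4 + llm-compressor-from-source setup.
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4A16",  # weight-only: activations stay 16-bit, so no fx tracing
    ignore=[
        "lm_head",
        "re:.*vision_tower.*",  # assumed module name for the vision tower
        "re:.*audio_tower.*",   # assumed module name for the audio tower
    ],
)
```

Because NVFP4A16 quantizes weights only, the recipe can be applied data-free; the `ignore` list is what keeps the sensitive non-LM towers in their original precision.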