Will it work on a 3090?

#1
by faheemraza1 - opened

Hi, since I made this model for the 24 GB VRAM limitation, it should work, yes :)

This is the error I am facing with vLLM: "type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"

We validated this checkpoint with Docker, using the vllm/vllm-openai:gemma4-cu130 image, not with a bare-metal vllm serve install.

Working command on our side:

```shell
docker run --rm \
  --gpus all \
  --ipc=host \
  --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:gemma4-cu130 \
  Neural-ICE/Gemma-4-31B-IT-NVFP4-24GB-compact \
  --quantization modelopt \
  --generation-config vllm \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```
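Once the container is up, a quick OpenAI-compatible smoke test looks like this. This is a hedged sketch: it assumes the default port 8000 (exposed directly because of --network host) and that the repo id is used as the served model name; adjust if you pass --port or --served-model-name.

```shell
# Smoke-test the running server with a minimal chat completion request.
# Falls back to a message when nothing is listening, so it is safe to run
# before the server has finished loading.
curl -s -m 5 http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Neural-ICE/Gemma-4-31B-IT-NVFP4-24GB-compact",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 32}' \
  || echo "server not reachable on localhost:8000"
```

A healthy server returns a JSON body with a "choices" array.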

Important:

  • we do not force --kv-cache-dtype fp8
  • we do not force any fp8e4nv dtype manually

The checkpoint already stores:

  • quant_algo = NVFP4
  • kv_cache_quant_algo = FP8

So on our side, --quantization modelopt is enough for vLLM to read hf_quant_config.json automatically.
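For reference, the metadata vLLM picks up lives in hf_quant_config.json at the root of the checkpoint. The snippet below is a sketch of the relevant fields only (field names follow the ModelOpt export layout; verify against the actual file in the repo), written to a temp file and read back:

```shell
# Write a sample of the relevant hf_quant_config.json fields (sketch, not the
# full file) and read back the two algorithms vLLM keys on.
cat > /tmp/hf_quant_config.json <<'EOF'
{
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": "FP8"
  }
}
EOF
python3 -c "import json; q = json.load(open('/tmp/hf_quant_config.json'))['quantization']; print(q['quant_algo'], q['kv_cache_quant_algo'])"
```

If both values are present in the checkpoint's own file, there is nothing to override on the command line.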


We reproduced this locally with vllm serve, and the key point is that the checkpoint should be loaded as a ModelOpt checkpoint directly.

Working command on our side:

```shell
source .venv-vllm/bin/activate
vllm serve /path/to/Gemma-4-31B-IT-NVFP4-24GB-compact \
  --quantization modelopt \
  --generation-config vllm \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```

Important:

  • do not force --kv-cache-dtype fp8
  • do not manually force any fp8e4nv dtype

This checkpoint already contains:

  • quant_algo = NVFP4
  • kv_cache_quant_algo = FP8

So vLLM should read that automatically from hf_quant_config.json when you pass --quantization modelopt.

Also, on our side this only worked correctly once the local stack recognized gemma4 properly. In practice, that meant using a recent vLLM together with a Transformers stack that supports Gemma 4. Our successful local environment was:

  • vllm==0.19.0
  • transformers==5.5.0
  • huggingface_hub==1.9.2

If your local install still throws fp8e4nv not supported in this architecture, I would first check:

  1. the exact output of vllm --version
  2. the exact output of python -c "import transformers; print(transformers.__version__)"
  3. whether you are passing any extra KV-cache dtype override manually
  4. your GPU architecture / CUDA stack
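A quick way to gather points 1, 2, and 4 in one paste-able output. This is a sketch: each probe degrades to "not found" so the script runs on any box, even one without vLLM or an NVIDIA driver installed.

```shell
# Collect environment details for the checklist above and write them to a log
# so they are easy to paste into the discussion. Each probe falls back to
# "not found" instead of aborting the script.
LOG=/tmp/vllm_diag.log
: > "$LOG"
for cmd in "vllm --version" \
           "python3 -c 'import transformers; print(transformers.__version__)'" \
           "nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader"; do
  echo "== $cmd ==" >> "$LOG"
  eval "$cmd" >> "$LOG" 2>/dev/null || echo "not found" >> "$LOG"
done
cat "$LOG"
```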
