Will it work on a 3090?

#1
by faheemraza1 - opened

Hi, since I made this model for the 24 GB VRAM limitation, it should work, yes :)

This is the error I am facing with vLLM: "type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"

We validated this checkpoint with Docker, using the vllm/vllm-openai:gemma4-cu130 image, not with a bare-metal vllm serve install.

Working command on our side:

```shell
docker run --rm \
  --gpus all \
  --ipc=host \
  --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:gemma4-cu130 \
  Neural-ICE/Gemma-4-31B-IT-NVFP4-24GB-compact \
  --quantization modelopt \
  --generation-config vllm \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```
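Once the container is up, a quick OpenAI-compatible smoke test looks like this. This is a hedged sketch: it assumes the default port 8000 (exposed directly because of --network host) and that the repo id is used as the served model name; adjust if you pass --port or --served-model-name.

```shell
# Smoke-test the running server with a minimal chat completion request.
# Falls back to a message when nothing is listening, so it is safe to run
# before the server has finished loading.
curl -s -m 5 http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Neural-ICE/Gemma-4-31B-IT-NVFP4-24GB-compact",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 32}' \
  || echo "server not reachable on localhost:8000"
```

A healthy server returns a JSON body with a "choices" array.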

Important:

  • we do not force --kv-cache-dtype fp8
  • we do not force any fp8e4nv dtype manually

The checkpoint already stores:

  • quant_algo = NVFP4
  • kv_cache_quant_algo = FP8

So on our side, --quantization modelopt is enough for vLLM to read hf_quant_config.json automatically.
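For reference, the metadata vLLM picks up lives in hf_quant_config.json at the root of the checkpoint. The snippet below is a sketch of the relevant fields only (field names follow the ModelOpt export layout; verify against the actual file in the repo), written to a temp file and read back:

```shell
# Write a sample of the relevant hf_quant_config.json fields (sketch, not the
# full file) and read back the two algorithms vLLM keys on.
cat > /tmp/hf_quant_config.json <<'EOF'
{
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": "FP8"
  }
}
EOF
python3 -c "import json; q = json.load(open('/tmp/hf_quant_config.json'))['quantization']; print(q['quant_algo'], q['kv_cache_quant_algo'])"
```

If both values are present in the checkpoint's own file, there is nothing to override on the command line.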


We reproduced this locally with vllm serve, and the key point is that the checkpoint should be loaded as a ModelOpt checkpoint directly.

Working command on our side:

```shell
source .venv-vllm/bin/activate
vllm serve /path/to/Gemma-4-31B-IT-NVFP4-24GB-compact \
  --quantization modelopt \
  --generation-config vllm \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```

Important:

  • do not force --kv-cache-dtype fp8
  • do not manually force any fp8e4nv dtype

This checkpoint already contains:

  • quant_algo = NVFP4
  • kv_cache_quant_algo = FP8

So vLLM should read that automatically from hf_quant_config.json when you pass --quantization modelopt.

Also, on our side this only worked correctly once the local stack recognized gemma4 properly. In practice, that meant using a recent vLLM together with a Transformers stack that supports Gemma 4. Our successful local environment was:

  • vllm==0.19.0
  • transformers==5.5.0
  • huggingface_hub==1.9.2

If your local install still throws fp8e4nv not supported in this architecture, I would first check:

  1. the exact output of vllm --version
  2. the exact output of python -c "import transformers; print(transformers.__version__)"
  3. whether you are passing any extra KV-cache dtype override manually
  4. your GPU architecture / CUDA stack
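A quick way to gather points 1, 2, and 4 in one paste-able output. This is a sketch: each probe degrades to "not found" so the script runs on any box, even one without vLLM or an NVIDIA driver installed.

```shell
# Collect environment details for the checklist above and write them to a log
# so they are easy to paste into the discussion. Each probe falls back to
# "not found" instead of aborting the script.
LOG=/tmp/vllm_diag.log
: > "$LOG"
for cmd in "vllm --version" \
           "python3 -c 'import transformers; print(transformers.__version__)'" \
           "nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader"; do
  echo "== $cmd ==" >> "$LOG"
  eval "$cmd" >> "$LOG" 2>/dev/null || echo "not found" >> "$LOG"
done
cat "$LOG"
```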
