Will it work on a 3090?
Will it?
Hi, since I built this model around a 24 GB VRAM limit, it should work, yes :)
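As a rough sanity check (back-of-the-envelope only; the 31B parameter count and ~10% overhead for block scales and unquantized layers are assumptions, not measurements), the NVFP4 weights alone should fit well under 24 GB:

```python
# Back-of-the-envelope VRAM estimate for NVFP4 (4-bit) weights.
# Assumed: 31e9 parameters, 4 bits per weight, ~10% overhead for
# block scales and any layers kept in higher precision.
params = 31e9
bits_per_weight = 4
overhead = 1.10

weights_gb = params * bits_per_weight / 8 / 1e9 * overhead
print(f"~{weights_gb:.1f} GB for weights")
```

That leaves several GB of headroom on a 24 GB card for activations and the FP8 KV cache, which is why `--max-model-len 8192` with `--gpu-memory-utilization 0.90` is a reasonable starting point.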
This is the error I am facing "type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')" with vLLM
We validated this checkpoint with Docker, using vllm/vllm-openai:gemma4-cu130, not with a bare-metal vllm serve install.
Working command on our side:
```
docker run --rm \
  --gpus all \
  --ipc=host \
  --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:gemma4-cu130 \
  Neural-ICE/Gemma-4-31B-IT-NVFP4-24GB-compact \
  --quantization modelopt \
  --generation-config vllm \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```
Important:
- we do not pass `--kv-cache-dtype fp8`
- we do not force any `fp8e4nv` dtype manually

The checkpoint already stores:

```
quant_algo = NVFP4
kv_cache_quant_algo = FP8
```
So on our side, --quantization modelopt is enough for vLLM to read hf_quant_config.json automatically.
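For reference, the ModelOpt sidecar file looks roughly like this (a sketch: the two algorithm fields are the ones mentioned above, but the surrounding structure is from memory of the ModelOpt export format, so treat it as an assumption rather than the exact file contents):

```json
{
  "producer": { "name": "modelopt" },
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": "FP8"
  }
}
```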
We reproduced this locally with vllm serve, and the key point is that the checkpoint should be loaded as a ModelOpt checkpoint directly.
Working command on our side:
```
source .venv-vllm/bin/activate
vllm serve /path/to/Gemma-4-31B-IT-NVFP4-24GB-compact \
  --quantization modelopt \
  --generation-config vllm \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```
Important:
- do not pass `--kv-cache-dtype fp8`
- do not manually force any `fp8e4nv` dtype

This checkpoint already contains:

```
quant_algo = NVFP4
kv_cache_quant_algo = FP8
```
So vLLM should read that automatically from hf_quant_config.json when you pass --quantization modelopt.
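To illustrate the mechanism (a simplified sketch of the idea, not vLLM's actual loader code; `read_modelopt_quant_config` is a hypothetical helper), the behavior amounts to reading the two fields from `hf_quant_config.json` and deriving the KV-cache dtype from the checkpoint instead of a CLI flag:

```python
import json
from pathlib import Path

def read_modelopt_quant_config(checkpoint_dir: str) -> dict:
    """Sketch of picking quant settings from the checkpoint itself
    rather than CLI overrides. Not vLLM's real implementation."""
    cfg = json.loads(Path(checkpoint_dir, "hf_quant_config.json").read_text())
    quant = cfg["quantization"]
    return {
        "weights": quant["quant_algo"],                # e.g. "NVFP4"
        "kv_cache": quant.get("kv_cache_quant_algo"),  # e.g. "FP8"
    }
```

This is why forcing `--kv-cache-dtype fp8` on top is redundant at best: the checkpoint already declares its KV-cache scheme.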
Also, on our side this only worked correctly once the local stack recognized gemma4 properly. In practice, that meant using a recent vLLM together with a Transformers stack that supports Gemma 4. Our successful local environment was:
```
vllm==0.19.0
transformers==5.5.0
huggingface_hub==1.9.2
```
If your local install still throws `fp8e4nv not supported in this architecture`, I would first check:
- the exact `vllm --version`
- the exact `python -c "import transformers; print(transformers.__version__)"`
- whether you are passing any extra KV-cache dtype override manually
- your GPU architecture / CUDA stack
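On the last point: the `fp8e4nv` error is consistent with the GPU generation itself. As far as I know, Triton's `fp8e4nv` (FP8 E4M3) kernels require compute capability 8.9 or newer (Ada/Hopper), while the RTX 3090 is Ampere (sm_86), which is why Triton only advertises `fp8e5` and `fp8e4b15` there. A quick way to check (`supports_fp8e4nv` is an illustrative helper, not a vLLM or Triton API):

```python
# Triton's fp8e4nv (FP8 E4M3) path needs compute capability >= 8.9
# (Ada / Hopper). An RTX 3090 is Ampere, sm_86, hence the error.
# `supports_fp8e4nv` is an illustrative helper, not a real API.

def supports_fp8e4nv(major: int, minor: int) -> bool:
    return (major, minor) >= (8, 9)

# On a real machine, get the capability via:
#   import torch; major, minor = torch.cuda.get_device_capability()
print(supports_fp8e4nv(8, 6))  # RTX 3090 (sm_86) -> False
print(supports_fp8e4nv(8, 9))  # RTX 4090 (sm_89) -> True
```

So if the error persists on a 3090, the fix is making sure nothing in the stack selects an `fp8e4nv` Triton kernel (e.g. a forced KV-cache dtype), not upgrading vLLM alone.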