# Gemma-4-31B-IT-NVFP4-24GB-compact

Compact NVFP4-quantized version of google/gemma-4-31B-it for serving with vLLM.
## Quantization Profile

- Text MLP: NVFP4
- Text self-attention: NVFP4
- Token embeddings: NVFP4
- lm_head: higher precision
- Vision tower and vision embeddings: higher precision
- KV cache: FP8
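For intuition on what the NVFP4 entries above mean, here is a minimal Python sketch of block-scaled 4-bit quantization: each small block of weights shares one scale, and individual values snap to the E2M1 4-bit grid. This is illustrative only (the real format uses 16-element micro-blocks with FP8 scales and lives in NVIDIA's kernels; the block here is shrunk for readability).

```python
# Sketch of NVFP4-style block quantization (assumption: one scale per
# micro-block, values snapped to the E2M1 4-bit grid; not the real kernel).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable magnitudes

def quantize_block(block):
    # Scale so the largest magnitude in the block maps to the top code (6.0)
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0

    def snap(x):
        mag = min(E2M1_GRID, key=lambda g: abs(g - abs(x) / scale))
        return mag if x >= 0 else -mag

    return scale, [snap(x) for x in block]

def dequantize(scale, codes):
    return [scale * c for c in codes]

scale, codes = quantize_block([0.1, -0.4, 2.5, 0.0])
print(dequantize(scale, codes))
```

The round-trip error is bounded by half a grid step times the block scale, which is why per-block scaling preserves accuracy far better than a single per-tensor scale would at 4 bits.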
## Usage

```shell
vllm serve Neural-ICE/Gemma-4-31B-IT-NVFP4-24GB-compact \
  --quantization modelopt \
  --gpu-memory-utilization 0.90
```
Official Gemma 4 vLLM recipe:
https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html
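To see why this fits the 24 GB target in the model name, a back-of-envelope memory estimate helps. The figure of ~4.5 effective bits per weight is an assumption (4-bit values plus per-block scale overhead), and the calculation ignores the higher-precision lm_head/vision parts and the KV cache.

```python
# Rough weight-memory estimate for a 31B-parameter model at ~4.5 bits/weight
# (assumption: 4-bit codes plus per-block scale overhead; ignores the
# higher-precision lm_head/vision tower and the FP8 KV cache).
params = 31e9
bits_per_weight = 4.5
gib = params * bits_per_weight / 8 / 2**30
print(f"~{gib:.1f} GiB of quantized weights")
```

The remaining headroom under `--gpu-memory-utilization 0.90` on a 24 GB card goes to the higher-precision layers, activations, and the KV cache.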
## Text Generation

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Neural-ICE/Gemma-4-31B-IT-NVFP4-24GB-compact",
    "messages": [
      {"role": "user", "content": "Explain quantum entanglement in simple terms."}
    ],
    "max_tokens": 512
  }'
```
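The same request can be built from Python. This sketch only constructs and prints the JSON body shown in the curl example; sending it (e.g. with `requests.post` against the running server) is left to the caller.

```python
import json

# Build the request body for vLLM's OpenAI-compatible
# /v1/chat/completions endpoint (mirrors the curl example above).
def chat_request(prompt,
                 model="Neural-ICE/Gemma-4-31B-IT-NVFP4-24GB-compact",
                 max_tokens=512):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = chat_request("Explain quantum entanglement in simple terms.")
print(json.dumps(body, indent=2))
```

With the server running, post it via `requests.post("http://localhost:8000/v1/chat/completions", json=body)` and read the reply from `response.json()["choices"][0]["message"]["content"]`.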
## Base Model

- google/gemma-4-31B-it