Gemma-4-31B-IT-NVFP4-24GB-compact

A compact NVFP4-quantized version of google/gemma-4-31B-it for serving with vLLM.

Quantization Profile

  • Text MLP: NVFP4
  • Text self-attention: NVFP4
  • Token embeddings: NVFP4
  • lm_head: higher precision
  • Vision tower and vision embeddings: higher precision
  • KV cache: FP8
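The profile above is what makes the model fit on a 24 GB GPU: NVFP4 stores 4-bit weight values plus a small per-block scale, while only the lm_head and vision components stay in higher precision. A back-of-envelope sketch of the weight footprint, using hypothetical round numbers for the per-module split (the real split for gemma-4-31B-it may differ):

```python
# Back-of-envelope weight-memory estimate for the mixed-precision
# layout above. Parameter counts below are ASSUMED round numbers,
# not the actual module sizes of gemma-4-31B-it.

def nvfp4_bytes(params: float, block: int = 16) -> float:
    """NVFP4: 4 bits per value plus one FP8 (1-byte) scale per block of 16."""
    return params * 0.5 + (params / block) * 1.0

def bf16_bytes(params: float) -> float:
    """BF16: 2 bytes per value."""
    return params * 2.0

# Assumed split: most weights quantized; lm_head, vision tower,
# and vision embeddings kept in BF16.
quantized_params = 29e9
high_prec_params = 2e9

total_gb = (nvfp4_bytes(quantized_params) + bf16_bytes(high_prec_params)) / 1e9
print(f"~{total_gb:.1f} GB of weights")
```

Under these assumptions the weights land around 20 GB, leaving some headroom on a 24 GB card for the FP8 KV cache and activations.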

Usage

vllm serve Neural-ICE/Gemma-4-31B-IT-NVFP4-24GB-compact \
  --quantization modelopt \
  --gpu-memory-utilization 0.90

Official Gemma 4 vLLM recipe:

https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html

Text Generation

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Neural-ICE/Gemma-4-31B-IT-NVFP4-24GB-compact",
    "messages": [
      {"role": "user", "content": "Explain quantum entanglement in simple terms."}
    ],
    "max_tokens": 512
  }'
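The same request can be made from Python against the server started above, since vLLM exposes an OpenAI-compatible API. A minimal stdlib-only sketch; `build_chat_payload` and `chat` are hypothetical helpers written for this example, not part of any library:

```python
# Sketch of the curl request above from Python, assuming the vLLM
# server is running locally on port 8000.
import json
import urllib.request

MODEL = "Neural-ICE/Gemma-4-31B-IT-NVFP4-24GB-compact"

def build_chat_payload(prompt: str, max_tokens: int = 512) -> dict:
    """Mirror the curl request body as a plain dict."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str,
         url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Explain quantum entanglement in simple terms."))
```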
Model Details

  • Format: Safetensors
  • Model size: 17B params
  • Tensor types: BF16, F8_E4M3, U8