Gemma-4-31B-IT-NVFP4-24GB-compact

A compact NVFP4-quantized version of google/gemma-4-31B-it for serving with vLLM.

Quantization Profile

  • Text MLP: NVFP4
  • Text self-attention: NVFP4
  • Token embeddings: NVFP4
  • lm_head: higher precision
  • Vision tower and vision embeddings: higher precision
  • KV cache: FP8
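The profile above is what makes the model fit on a 24 GB GPU: NVFP4 stores 4-bit weight values plus a small per-block scale, while only the lm_head and vision components stay in higher precision. A back-of-envelope sketch of the weight footprint, using hypothetical round numbers for the per-module split (the real split for gemma-4-31B-it may differ):

```python
# Back-of-envelope weight-memory estimate for the mixed-precision
# layout above. Parameter counts below are ASSUMED round numbers,
# not the actual module sizes of gemma-4-31B-it.

def nvfp4_bytes(params: float, block: int = 16) -> float:
    """NVFP4: 4 bits per value plus one FP8 (1-byte) scale per block of 16."""
    return params * 0.5 + (params / block) * 1.0

def bf16_bytes(params: float) -> float:
    """BF16: 2 bytes per value."""
    return params * 2.0

# Assumed split: most weights quantized; lm_head, vision tower,
# and vision embeddings kept in BF16.
quantized_params = 29e9
high_prec_params = 2e9

total_gb = (nvfp4_bytes(quantized_params) + bf16_bytes(high_prec_params)) / 1e9
print(f"~{total_gb:.1f} GB of weights")
```

Under these assumptions the weights land around 20 GB, leaving some headroom on a 24 GB card for the FP8 KV cache and activations.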

Usage

vllm serve Neural-ICE/Gemma-4-31B-IT-NVFP4-24GB-compact \
  --quantization modelopt \
  --gpu-memory-utilization 0.90

Official Gemma 4 vLLM recipe:

https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html

Text Generation

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Neural-ICE/Gemma-4-31B-IT-NVFP4-24GB-compact",
    "messages": [
      {"role": "user", "content": "Explain quantum entanglement in simple terms."}
    ],
    "max_tokens": 512
  }'
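The same request can be made from Python against the server started above, since vLLM exposes an OpenAI-compatible API. A minimal stdlib-only sketch; `build_chat_payload` and `chat` are hypothetical helpers written for this example, not part of any library:

```python
# Sketch of the curl request above from Python, assuming the vLLM
# server is running locally on port 8000.
import json
import urllib.request

MODEL = "Neural-ICE/Gemma-4-31B-IT-NVFP4-24GB-compact"

def build_chat_payload(prompt: str, max_tokens: int = 512) -> dict:
    """Mirror the curl request body as a plain dict."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str,
         url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Explain quantum entanglement in simple terms."))
```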
Model Details

  • Format: Safetensors
  • Model size: 17B params
  • Tensor types: BF16, F8_E4M3, U8