---
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
library_name: transformers
base_model:
- google/gemma-4-31B-it
- nvidia/Gemma-4-31B-IT-NVFP4
pipeline_tag: text-generation
tags:
- gemma4
- gemma-4-31b-it
- nvfp4
- modelopt
- vllm
- quantized
- nvidia
- lighthouse
model-index:
- name: gemma-4-31B-it-NVFP4-turbo
results:
- task:
type: text-generation
dataset:
name: GPQA Diamond
type: Idavidrein/gpqa
config: gpqa_diamond
metrics:
- name: Accuracy
type: accuracy
value: 72.73
- task:
type: text-generation
dataset:
name: MMLU Pro
type: TIGER-Lab/MMLU-Pro
metrics:
- name: Accuracy
type: accuracy
value: 83.93
---

# ⚡ Gemma 4 31B IT NVFP4 Turbo

A repackaged version of [nvidia/Gemma-4-31B-IT-NVFP4](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) that uses **68% less** GPU memory and runs **~2.5× faster** than the [base model](https://huggingface.co/google/gemma-4-31B-it), while retaining **nearly identical quality** (1–3% accuracy loss). It fits on a *single* RTX 5090 (🎉).
It fully leverages NVIDIA Blackwell FP4 tensor cores (RTX 5090, RTX PRO 6000, B200, and other SM 12.0+ GPUs with ≥20 GB VRAM) for **~2× higher concurrent throughput** than other quants like [prithivMLmods/gemma-4-31B-it-NVFP4](https://huggingface.co/prithivMLmods/gemma-4-31B-it-NVFP4) or [cyankiwi/gemma-4-31B-it-AWQ-4bit](https://huggingface.co/cyankiwi/gemma-4-31B-it-AWQ-4bit).
This variant is **text-only**: the video/audio weights and encoders have been stripped. If you need video/audio support, open an issue or PR.
## Benchmark

> [!NOTE]
> RTX PRO 6000, `vllm bench` @ 1K input / 200 output tokens. See [bench.sh](/bench/bench.sh).
>
> We also ran the ***⚡ Turbo*** benchmark on an RTX 5090, and it performed identically: at 16K context, performance is not limited by GPU memory.

| | [Base model](https://huggingface.co/google/gemma-4-31B-it) | [NVIDIA quant](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) | ***⚡ Turbo*** (this model) |
|------------------|------------------------------------------------------------|--------------------------------------------------------------------|---------------------------------------------|
| GPU memory | 58.9 GiB | 31 GiB | **18.5 GiB** *(-68% base, -40% nvidia)* |
| GPQA Diamond | 75.71% | 75.46% | **72.73%** *(-2.98% base, -2.73% nvidia)* |
| MMLU Pro | 85.25% | 84.94% | **83.93%** *(-1.32% base, -1.01% nvidia)* |
| Prefill | 6352 tok/s | 11069 tok/s | **15359 tok/s** *(+142% base, +39% nvidia)* |
| Decode (single) | 24.1 tok/s | 39.2 tok/s | **51 tok/s** *(+112% base, +30% nvidia)* |
| Decode (batched) | 494 tok/s | 913 tok/s | **1244 tok/s** *(+152% base, +36% nvidia)* |
| Concurrency      | 2.47 req/s | 4.56 req/s | **6.22 req/s** *(+152% base, +36% nvidia)* |

Other quants of similar size use kernel paths (compressed-tensors, Marlin) that don't leverage Blackwell's FP4 tensor cores, resulting in significantly lower prefill and concurrent throughput:

| | [prithivMLmods NVFP4](https://huggingface.co/prithivMLmods/gemma-4-31B-it-NVFP4) | [cyankiwi AWQ](https://huggingface.co/cyankiwi/gemma-4-31B-it-AWQ-4bit) | ***⚡ Turbo*** (this model) |
|------------------|----------------------------------------------------------------------------------|-------------------------------------------------------------------------|----------------------------|
| GPU memory | 19.6 GiB | 19.6 GiB | **18.5 GiB** |
| Prefill | 6647 tok/s | 6626 tok/s | **15359 tok/s** |
| Decode (single) | 64.3 tok/s | 64.4 tok/s | **51 tok/s** |
| Decode (batched) | 757 tok/s | 757 tok/s | **1244 tok/s** |
| Concurrency | 3.79 req/s | 3.78 req/s | **6.22 req/s** |
## Usage
Requirements:
- A **Blackwell GPU** (see [Compatibility](#compatibility))
- `transformers >= 5.5.0`
- `vllm >= 0.19` with CUDA 13.0
> **Note:** A plain `pip install vllm` installs the CUDA 12 build, which doesn't support Blackwell FP4 tensor cores. Use one of the methods below.
### Docker (recommended)
We recommend using the `vllm/vllm-openai:cu130-nightly` Docker image, which ships with CUDA 13.0 and Blackwell support out of the box.
```bash
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:cu130-nightly \
--model LilaRest/gemma-4-31B-it-NVFP4-turbo \
--quantization modelopt \
--max-model-len 16384 \
--max-num-seqs 128 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.95 \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--trust-remote-code
```
> If you get `model type gemma4 not recognized`, run `pip install "transformers>=5.5.0"` inside the container.
### pip (CUDA 13.0 wheel)
```bash
pip install https://github.com/vllm-project/vllm/releases/download/v0.19.0/vllm-0.19.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl
pip install "transformers>=5.5.0"
vllm serve LilaRest/gemma-4-31B-it-NVFP4-turbo \
--quantization modelopt \
--max-model-len 16384 \
--max-num-seqs 128 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.95 \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--trust-remote-code
```
### Key flags
- `--quantization modelopt` — required; activates NVIDIA's optimized CUTLASS kernels
- `--kv-cache-dtype fp8` — halves KV cache memory on Blackwell
- `--max-model-len 16384` — maximum context length per request. See [Compatibility](#compatibility) for max value per GPU.
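Once the server is running, it exposes vLLM's standard OpenAI-compatible API. As a minimal sketch, this builds a chat-completions request payload for it (the endpoint path and field names are vLLM/OpenAI defaults; the prompt and sampling values are illustrative):

```python
import json

# Chat-completions payload for the vLLM server started above,
# assumed to be listening at http://localhost:8000.
payload = {
    "model": "LilaRest/gemma-4-31B-it-NVFP4-turbo",
    "messages": [
        {"role": "user", "content": "Summarize NVFP4 in one sentence."}
    ],
    "max_tokens": 200,   # matches the 200-output-token benchmark workload
    "temperature": 0.7,
}
body = json.dumps(payload)
print(body)

# To send it (server from the Docker/pip section must be running):
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d @payload.json
```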
## Tuning
The above benchmarks use a generic workload (1K input / 200 output tokens). You can tune vLLM flags for your specific use case:
- **High-throughput classification / short output** — Reduce `--max-model-len` and limit output tokens (`max_tokens` in the API request). Less KV cache pressure means more concurrent requests. Expect **14+ req/s** on RTX 5090 for classification workloads (~1K input, ~10 output tokens).
- **Long context** — Increase `--max-model-len` (up to ~25K on RTX 5090, ~180K on PRO 6000). Trade concurrent capacity for longer sequences.
- **Latency-sensitive** — Keep concurrency low. Single-request decode is ~51 tok/s with TTFT under 70ms — fast enough for interactive use.
- **Batch processing** — Push `--max-num-seqs` higher and use `--request-rate inf` with `--max-concurrency` to saturate the GPU. Peak throughput is ~6.2 req/s on RTX PRO 6000 at 1K/200 workload.
## Compatibility
**Blackwell (SM 12.0+) — full FP4 tensor core support:**

| GPU | VRAM | Works? | Max context | Notes |
|--------------------|--------|--------|-------------|-------------------------------------------------------|
| RTX 5090           | 32 GB  | ✅     | ~25K        | Primary target                                        |
| RTX PRO 6000       | 96 GB  | ✅     | ~180K       | Ideal for high-concurrency or long-context workloads. |
| B200               | 192 GB | ✅     | 262K (full) | Datacenter, untested                                  |
| B100               | 192 GB | ✅     | 262K (full) | Datacenter, untested                                  |
| RTX 5080 and lower | ≤16 GB | ❌ | — | Not enough VRAM |
Older GPUs (H100, A100, RTX 4090, etc.) may run the model without `--quantization modelopt`, but they lack FP4 tensor cores, so you lose the optimized kernel path and performance is significantly worse.
## Approach
Three changes were made:
1. **Quantized** all self-attention weights from BF16 → FP4 (RTN, group_size=16, matching modelopt NVFP4 format)
2. **Updated** architecture to `Gemma4ForCausalLM` and quantization config accordingly
3. **Stripped** the vision and audio encoders

Everything else is untouched: MLP layers keep NVIDIA's calibrated FP4, `embed_tokens` stays BF16, and all norms are preserved, so all of the [nvidia/Gemma-4-31B-IT-NVFP4](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) optimizations are retained.
#### Why RTN didn't hurt quality
RTN (Round-To-Nearest) is the simplest quantization method — no calibration data, fully reproducible. It worked here because:
- FP4 with group_size=16 and per-group scaling preserves relative weight distributions well
- Self-attention weights tend to be normally distributed near zero, where the FP4 grid has finest resolution (0, 0.5, 1.0, 1.5)
- MLP layers (more sensitive to quantization) keep NVIDIA's calibrated FP4
- `embed_tokens` stays BF16, preventing noise from propagating through all layers
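The RTN step above can be sketched in plain Python. This is an illustrative toy, not the actual conversion script: it rounds each weight onto the positive E2M1 (FP4) grid with one scale per group, then dequantizes so you can see the effective values. The real NVFP4 format additionally stores FP8 per-group scales and a per-tensor scale, which this sketch omits.

```python
# Positive values representable in E2M1 FP4; negatives mirror these.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_rtn(weights, group_size=16):
    """RTN-quantize a flat list of weights to the FP4 grid with per-group
    scaling, returning the dequantized values (what the matmul sees)."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Per-group scale maps the largest magnitude onto the grid max (6.0).
        amax = max(abs(w) for w in group) or 1.0
        scale = amax / 6.0
        for w in group:
            # Round |w| / scale to the nearest representable FP4 magnitude.
            q = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
            out.append(q * scale if w >= 0 else -q * scale)
    return out

ws = [0.01, -0.2, 0.5, 1.2, -3.0, 0.0, 0.07, 2.4]
print(quantize_rtn(ws, group_size=8))
# → [0.0, -0.25, 0.5, 1.0, -3.0, 0.0, 0.0, 2.0]
```

Note how small weights (0.01, 0.07) snap to zero while values near the group's amax (-3.0) are reproduced exactly; this is the "finest resolution near zero" behavior described above.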
## License
Apache 2.0 — same as the [base model](https://ai.google.dev/gemma/docs/gemma_4_license).
## Credits
- [Google DeepMind](https://deepmind.google/models/gemma/) for Gemma 4
- [NVIDIA](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) for the modelopt NVFP4 checkpoint