---
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
library_name: transformers
base_model:
- google/gemma-4-31B-it
- nvidia/Gemma-4-31B-IT-NVFP4
pipeline_tag: text-generation
tags:
- gemma4
- gemma-4-31b-it
- nvfp4
- modelopt
- vllm
- quantized
- nvidia
- lighthouse
model-index:
- name: gemma-4-31B-it-NVFP4-turbo
results:
- task:
type: text-generation
dataset:
name: GPQA Diamond
type: Idavidrein/gpqa
config: gpqa_diamond
metrics:
- name: Accuracy
type: accuracy
value: 72.73
- task:
type: text-generation
dataset:
name: MMLU Pro
type: TIGER-Lab/MMLU-Pro
metrics:
- name: Accuracy
type: accuracy
value: 83.93
---
<div align="center">
<img src="https://huggingface.co/LilaRest/gemma-4-31B-it-NVFP4-turbo/resolve/main/banner.png">
</div>
<h1 align="center">⚡ Gemma 4 31B IT NVFP4 <i>Turbo</i></h1>
A repackaged version of [nvidia/Gemma-4-31B-IT-NVFP4](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) that is **68% smaller** in GPU memory and **~2.5× faster** than the [base model](https://huggingface.co/google/gemma-4-31B-it), while retaining **nearly identical quality** (1–3% accuracy loss). Fits on a *single* RTX 5090.
It fully leverages NVIDIA Blackwell FP4 tensor cores (RTX 5090, RTX PRO 6000, B200, and other SM 12.0+ GPUs with ≥20 GB VRAM) for **~2× higher concurrent throughput** than other quants like [prithivMLmods/gemma-4-31B-it-NVFP4](https://huggingface.co/prithivMLmods/gemma-4-31B-it-NVFP4) or [cyankiwi/gemma-4-31B-it-AWQ-4bit](https://huggingface.co/cyankiwi/gemma-4-31B-it-AWQ-4bit).
This variant is **text-only**: the vision/audio encoders and weights have been stripped. If you need multimodal support, open an issue or PR.
## Benchmark

> [!NOTE]
> RTX PRO 6000, `vllm bench` @ 1K input / 200 output tokens. See [bench.sh](/bench/bench.sh).
>
> We also ran the ***⚡ Turbo*** benchmark on an RTX 5090, where it performed identically: at 16K context, performance is not limited by GPU memory.
| | [Base model](https://huggingface.co/google/gemma-4-31B-it) | [NVIDIA quant](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) | ***⚡ Turbo*** (this model) |
|------------------|------------------------------------------------------------|--------------------------------------------------------------------|---------------------------------------------|
| GPU memory | 58.9 GiB | 31 GiB | **18.5 GiB** *(-68% base, -40% nvidia)* |
| GPQA Diamond | 75.71% | 75.46% | **72.73%** *(-2.98% base, -2.73% nvidia)* |
| MMLU Pro | 85.25% | 84.94% | **83.93%** *(-1.32% base, -1.01% nvidia)* |
| Prefill | 6352 tok/s | 11069 tok/s | **15359 tok/s** *(+142% base, +39% nvidia)* |
| Decode (single) | 24.1 tok/s | 39.2 tok/s | **51 tok/s** *(+112% base, +30% nvidia)* |
| Decode (batched) | 494 tok/s | 913 tok/s | **1244 tok/s** *(+152% base, +36% nvidia)* |
| Concurrency | 2.47 req/s | 4.56 req/s | **6.22 req/s** *(+152% base, +36% nvidia)* |
Other quants of similar size use kernel paths (compressed-tensors, Marlin) that don't leverage Blackwell's FP4 tensor cores, resulting in significantly lower prefill and concurrent throughput:
| | [prithivMLmods NVFP4](https://huggingface.co/prithivMLmods/gemma-4-31B-it-NVFP4) | [cyankiwi AWQ](https://huggingface.co/cyankiwi/gemma-4-31B-it-AWQ-4bit) | ***⚡ Turbo*** (this model) |
|------------------|----------------------------------------------------------------------------------|-------------------------------------------------------------------------|----------------------------|
| GPU memory | 19.6 GiB | 19.6 GiB | **18.5 GiB** |
| Prefill | 6647 tok/s | 6626 tok/s | **15359 tok/s** |
| Decode (single) | 64.3 tok/s | 64.4 tok/s | **51 tok/s** |
| Decode (batched) | 757 tok/s | 757 tok/s | **1244 tok/s** |
| Concurrency | 3.79 req/s | 3.78 req/s | **6.22 req/s** |
## Usage
Requirements:
- A **Blackwell GPU** (see [Compatibility](#compatibility))
- `transformers >= 5.5.0`
- `vllm >= 0.19` with CUDA 13.0
> **Note:** `pip install vllm` installs a CUDA 12 build, which doesn't support Blackwell FP4 tensor cores. Use one of the methods below.
### Docker (recommended)
We recommend using the `vllm/vllm-openai:cu130-nightly` Docker image, which ships with CUDA 13.0 and Blackwell support out of the box.
```bash
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:cu130-nightly \
--model LilaRest/gemma-4-31B-it-NVFP4-turbo \
--quantization modelopt \
--max-model-len 16384 \
--max-num-seqs 128 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.95 \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--trust-remote-code
```
> If you get `model type gemma4 not recognized`, run `pip install "transformers>=5.5.0"` inside the container.
### pip (CUDA 13.0 wheel)
```bash
pip install https://github.com/vllm-project/vllm/releases/download/v0.19.0/vllm-0.19.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl
pip install "transformers>=5.5.0"
vllm serve LilaRest/gemma-4-31B-it-NVFP4-turbo \
--quantization modelopt \
--max-model-len 16384 \
--max-num-seqs 128 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.95 \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--trust-remote-code
```
### Key flags
- `--quantization modelopt` – required; activates NVIDIA's optimized CUTLASS kernels
- `--kv-cache-dtype fp8` – halves KV cache memory on Blackwell
- `--max-model-len 16384` – maximum context length per request. See [Compatibility](#compatibility) for the max value per GPU.
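Once the server is up, any OpenAI-compatible client can query it. A minimal sketch using only the Python standard library (the endpoint and model name match the serve commands above; the prompt is just an example):

```python
import json
import urllib.request

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "LilaRest/gemma-4-31B-it-NVFP4-turbo",
    "messages": [{"role": "user", "content": "Explain NVFP4 in one sentence."}],
    "max_tokens": 200,  # cap output length to limit KV-cache pressure
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, uncomment to send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```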
## Tuning
The above benchmarks use a generic workload (1K input / 200 output tokens). You can tune vLLM flags for your specific use case:
- **High-throughput classification / short output** – Reduce `--max-model-len` and limit output tokens (`max_tokens` in the API request). Less KV cache pressure means more concurrent requests. Expect **14+ req/s** on RTX 5090 for classification workloads (~1K input, ~10 output tokens).
- **Long context** – Increase `--max-model-len` (up to ~25K on RTX 5090, ~180K on PRO 6000). Trade concurrent capacity for longer sequences.
- **Latency-sensitive** – Keep concurrency low. Single-request decode is ~51 tok/s with TTFT under 70 ms, fast enough for interactive use.
- **Batch processing** – Push `--max-num-seqs` higher and use `--request-rate inf` with `--max-concurrency` to saturate the GPU. Peak throughput is ~6.2 req/s on RTX PRO 6000 at the 1K/200 workload.
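To pick a `--max-model-len` for your GPU, a back-of-envelope KV-cache estimate helps. The dimensions below are illustrative placeholders, **not** the real Gemma 4 config; read `num_hidden_layers`, `num_key_value_heads`, and `head_dim` from the model's `config.json` before trusting the numbers:

```python
# Back-of-envelope KV-cache sizing: K and V tensors per layer per cached token.
# With --kv-cache-dtype fp8, each element takes 1 byte.
def kv_cache_gib(tokens: int, layers: int, kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 1) -> float:
    """GiB of KV cache needed for `tokens` cached tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 2**30

# Example with made-up dims: 48 layers, 8 KV heads, head_dim 128, fp8 cache
print(round(kv_cache_gib(16_384, 48, 8, 128), 2))  # → 1.5
```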
## Compatibility
**Blackwell (SM 12.0+) – full FP4 tensor core support:**
| GPU                | VRAM   | Works? | Max context | Notes                                                 |
|--------------------|--------|--------|-------------|-------------------------------------------------------|
| RTX 5090           | 32 GB  | ✅     | ~25K        | Primary target                                        |
| RTX PRO 6000       | 96 GB  | ✅     | ~180K       | Ideal for high-concurrency or long-context workloads  |
| B200               | 192 GB | ✅     | 262K (full) | Datacenter, untested                                  |
| B100               | 192 GB | ✅     | 262K (full) | Datacenter, untested                                  |
| RTX 5080 and lower | ≤16 GB | ❌     | –           | Not enough VRAM                                       |
Older GPUs (H100, A100, RTX 4090, etc.) may work without `--quantization modelopt` but they lack FP4 tensor cores, so you'll lose the optimized kernel path and performance will be significantly worse.
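A small helper (our own, not part of vLLM) makes the cutoff explicit. It takes a CUDA compute capability tuple such as the one returned by `torch.cuda.get_device_capability()`:

```python
# Decide whether the FP4 tensor-core kernel path applies, given a CUDA
# compute capability tuple (major, minor).
def fp4_supported(capability: tuple) -> bool:
    # Blackwell GPUs report SM 12.0+; earlier architectures fall back to
    # the slower kernel paths described above.
    return capability >= (12, 0)

# With a CUDA build of PyTorch installed:
# import torch
# print(fp4_supported(torch.cuda.get_device_capability(0)))
```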
## Approach
Three changes were made:
1. **Quantized** all self-attention weights from BF16 → FP4 (RTN, group_size=16, matching the modelopt NVFP4 format)
2. **Updated** the architecture to `Gemma4ForCausalLM` and the quantization config accordingly
3. **Stripped** the vision and audio encoders
Everything else is untouched: MLP layers keep NVIDIA's calibrated FP4, `embed_tokens` stays BF16, and all norms are preserved, so we retain all of the [nvidia/Gemma-4-31B-IT-NVFP4](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) optimizations.
### Why RTN didn't hurt quality
RTN (Round-To-Nearest) is the simplest quantization method: no calibration data, fully reproducible. It worked here because:
- FP4 with group_size=16 and per-group scaling preserves relative weight distributions well
- Self-attention weights tend to be normally distributed near zero, where the FP4 grid has finest resolution (0, 0.5, 1.0, 1.5)
- MLP layers (more sensitive to quantization) keep NVIDIA's calibrated FP4
- `embed_tokens` stays BF16, preventing noise from propagating through all layers
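As a concrete illustration, here is a numpy sketch of RTN onto the FP4 (E2M1) grid with group_size=16 and per-group scaling. This mirrors the idea, not NVIDIA's actual modelopt implementation:

```python
import numpy as np

# Representable FP4 (E2M1) magnitudes; the full grid is symmetric around zero.
FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_POS[:0:-1], FP4_POS])

def rtn_fp4(weights: np.ndarray, group_size: int = 16) -> np.ndarray:
    """Quantize a 1-D weight vector group by group with round-to-nearest."""
    out = np.empty_like(weights)
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Per-group scale maps the largest magnitude onto the grid max (6.0).
        scale = np.abs(group).max() / 6.0
        if scale == 0.0:
            scale = 1.0  # all-zero group: any scale works
        # Round each scaled value to the nearest representable FP4 point.
        idx = np.abs(group[:, None] / scale - FP4_GRID[None, :]).argmin(axis=1)
        out[start:start + group_size] = FP4_GRID[idx] * scale
    return out
```

Running `rtn_fp4` on a small, roughly normal weight vector shows why the method holds up: most values land in the dense low-magnitude part of the grid, where rounding error is smallest.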
## License
Apache 2.0, same as the [base model](https://ai.google.dev/gemma/docs/gemma_4_license).
## Credits
- [Google DeepMind](https://deepmind.google/models/gemma/) for Gemma 4
- [NVIDIA](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) for the modelopt NVFP4 checkpoint