---
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
library_name: transformers
base_model:
- google/gemma-4-31B-it
- nvidia/Gemma-4-31B-IT-NVFP4
pipeline_tag: text-generation
tags:
- gemma4
- gemma-4-31b-it
- nvfp4
- modelopt
- vllm
- quantized
- nvidia
- lighthouse
model-index:
- name: gemma-4-31B-it-NVFP4-turbo
  results:
  - task:
      type: text-generation
    dataset:
      name: GPQA Diamond
      type: Idavidrein/gpqa
      config: gpqa_diamond
    metrics:
    - name: Accuracy
      type: accuracy
      value: 72.73
  - task:
      type: text-generation
    dataset:
      name: MMLU Pro
      type: TIGER-Lab/MMLU-Pro
    metrics:
    - name: Accuracy
      type: accuracy
      value: 83.93
---

# ⚡ Gemma 4 31B IT NVFP4 Turbo

A repackaged [nvidia/Gemma-4-31B-IT-NVFP4](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) that is **68% smaller** in GPU memory and **~2.5× faster** than the [base model](https://huggingface.co/google/gemma-4-31B-it), while retaining **nearly identical quality** (1–3% accuracy loss). It fits on a *single* RTX 5090 (🎉) and fully leverages NVIDIA Blackwell FP4 tensor cores (RTX 5090, RTX PRO 6000, B200, and other SM 12.0+ GPUs with ≥20 GB VRAM) for **~2× higher concurrent throughput** than other quants such as [prithivMLmods/gemma-4-31B-it-NVFP4](https://huggingface.co/prithivMLmods/gemma-4-31B-it-NVFP4) or [cyankiwi/gemma-4-31B-it-AWQ-4bit](https://huggingface.co/cyankiwi/gemma-4-31B-it-AWQ-4bit).

This variant is **text-only**: the video/audio weights and encoders have been stripped. If you need video/audio support, open an issue or PR.

## Benchmark

![Benchmark chart](bench/chart/benchmark.png)

> [!NOTE]
> RTX PRO 6000, `vllm bench` @ 1K input / 200 output tokens. See [bench.sh](/bench/bench.sh).
>
> We also ran the ***⚡ Turbo*** benchmark on an RTX 5090, and it performed identically: at 16K context, performance is not limited by GPU memory.
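For orientation, a workload like the one above can be replayed against a running server with `vllm bench serve`. The invocation below is an illustrative sketch only (prompt count and concurrency are assumptions, not the measured configuration); `bench.sh` in this repo is the authoritative setup.

```bash
# Illustrative sketch of the 1K-input / 200-output benchmark workload.
# Assumes a vLLM server is already running on localhost:8000.
vllm bench serve \
  --model LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 200 \
  --num-prompts 512 \
  --request-rate inf \
  --max-concurrency 64
```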
| | [Base model](https://huggingface.co/google/gemma-4-31B-it) | [NVIDIA quant](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) | ***⚡ Turbo*** (this model) |
|------------------|------------------------------------------------------------|--------------------------------------------------------------------|---------------------------------------------|
| GPU memory | 58.9 GiB | 31 GiB | **18.5 GiB** *(-68% base, -40% nvidia)* |
| GPQA Diamond | 75.71% | 75.46% | **72.73%** *(-2.98% base, -2.73% nvidia)* |
| MMLU Pro | 85.25% | 84.94% | **83.93%** *(-1.32% base, -1.01% nvidia)* |
| Prefill | 6352 tok/s | 11069 tok/s | **15359 tok/s** *(+142% base, +39% nvidia)* |
| Decode (single) | 24.1 tok/s | 39.2 tok/s | **51 tok/s** *(+112% base, +30% nvidia)* |
| Decode (batched) | 494 tok/s | 913 tok/s | **1244 tok/s** *(+152% base, +36% nvidia)* |
| Concurrency | 2.47 req/s | 4.56 req/s | **6.22 req/s** *(+152% base, +36% nvidia)* |

Other quants of similar size use kernel paths (compressed-tensors, Marlin) that don't leverage Blackwell's FP4 tensor cores, resulting in significantly lower prefill and concurrent throughput:

| | [prithivMLmods NVFP4](https://huggingface.co/prithivMLmods/gemma-4-31B-it-NVFP4) | [cyankiwi AWQ](https://huggingface.co/cyankiwi/gemma-4-31B-it-AWQ-4bit) | ***⚡ Turbo*** (this model) |
|------------------|----------------------------------------------------------------------------------|-------------------------------------------------------------------------|----------------------------|
| GPU memory | 19.6 GiB | 19.6 GiB | **18.5 GiB** |
| Prefill | 6647 tok/s | 6626 tok/s | **15359 tok/s** |
| Decode (single) | 64.3 tok/s | 64.4 tok/s | **51 tok/s** |
| Decode (batched) | 757 tok/s | 757 tok/s | **1244 tok/s** |
| Concurrency | 3.79 req/s | 3.78 req/s | **6.22 req/s** |

## Usage

Requirements:

- A **Blackwell GPU** (see [Compatibility](#compatibility))
- `transformers >= 5.5.0`
- `vllm >= 0.19` with CUDA 13.0

> **Note:** `pip install vllm`
> installs CUDA 12, which doesn't support Blackwell FP4 tensor cores. Use one of the methods below.

### Docker (recommended)

We recommend the `vllm/vllm-openai:cu130-nightly` Docker image, which ships with CUDA 13.0 and Blackwell support out of the box.

```bash
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:cu130-nightly \
  --model LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```

> If you get `model type gemma4 not recognized`, run `pip install "transformers>=5.5.0"` inside the container.

### pip (CUDA 13.0 wheel)

```bash
pip install https://github.com/vllm-project/vllm/releases/download/v0.19.0/vllm-0.19.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl
pip install "transformers>=5.5.0"

vllm serve LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```

### Key flags

- `--quantization modelopt`: required; activates NVIDIA's optimized CUTLASS kernels
- `--kv-cache-dtype fp8`: halves KV cache memory on Blackwell
- `--max-model-len 16384`: maximum context length per request; see [Compatibility](#compatibility) for the maximum per GPU

## Tuning

The benchmarks above use a generic workload (1K input / 200 output tokens). You can tune vLLM flags for your specific use case:

- **High-throughput classification / short output**: reduce `--max-model-len` and limit output tokens (`max_tokens` in the API request). Less KV cache pressure means more concurrent requests. Expect **14+ req/s** on an RTX 5090 for classification workloads (~1K input, ~10 output tokens).
- **Long context**: increase `--max-model-len` (up to ~25K on RTX 5090, ~180K on RTX PRO 6000), trading concurrent capacity for longer sequences.
- **Latency-sensitive**: keep concurrency low. Single-request decode is ~51 tok/s with TTFT under 70 ms, fast enough for interactive use.
- **Batch processing**: push `--max-num-seqs` higher and use `--request-rate inf` with `--max-concurrency` to saturate the GPU. Peak throughput is ~6.2 req/s on RTX PRO 6000 at the 1K/200 workload.

## Compatibility

**Blackwell (SM 12.0+), full FP4 tensor core support:**

| GPU | VRAM | Works? | Max context | Notes |
|--------------------|--------|--------|-------------|------------------------------------------------------|
| RTX 5090 | 32 GB | ✅ | ~25K | Primary target |
| RTX PRO 6000 | 96 GB | ✅ | ~180K | Ideal for high-concurrency or long-context workloads |
| B200 | 192 GB | ✅ | 262K (full) | Datacenter, untested |
| B100 | 192 GB | ✅ | 262K (full) | Datacenter, untested |
| RTX 5080 and lower | ≤16 GB | ❌ | — | Not enough VRAM |

Older GPUs (H100, A100, RTX 4090, etc.) may work without `--quantization modelopt`, but they lack FP4 tensor cores, so you'll lose the optimized kernel path and performance will be significantly worse.

## Approach

Three changes were made:

1. **Quantized** all self-attention weights from BF16 → FP4 (RTN, group_size=16, matching the modelopt NVFP4 format)
2. **Updated** the architecture to `Gemma4ForCausalLM` and the quantization config accordingly
3. **Stripped** the vision and audio encoders

Everything else is untouched: MLP layers keep NVIDIA's calibrated FP4, `embed_tokens` stays BF16, and all norms are preserved, so all the [nvidia/Gemma-4-31B-IT-NVFP4](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) optimizations are retained.

#### Why RTN didn't hurt quality

RTN (Round-To-Nearest) is the simplest quantization method: no calibration data, fully reproducible.
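In NVFP4 terms, RTN amounts to scaling each group of 16 weights so that its largest magnitude maps to the largest FP4 value, then snapping every weight to the nearest point on the FP4 (E2M1) grid. A minimal numpy sketch of that step (illustrative only: real NVFP4 additionally quantizes the per-group scales to FP8 and applies a global tensor scale, both omitted here):

```python
import numpy as np

# FP4 (E2M1) representable magnitudes; NVFP4 stores one 4-bit code per
# weight plus one scale per group of 16 weights.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def rtn_fp4(w: np.ndarray, group_size: int = 16) -> np.ndarray:
    """Round-to-nearest FP4 quantize-dequantize with per-group scaling."""
    flat = w.reshape(-1, group_size)
    # Map each group's absmax onto the largest grid value (6.0).
    scale = np.abs(flat).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)  # all-zero groups stay zero
    scaled = flat / scale
    # Snap each magnitude to the nearest grid point, keeping the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    deq = np.sign(scaled) * FP4_GRID[idx] * scale
    return deq.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(256, 256)).astype(np.float32)  # Gaussian, attention-like
w_q = rtn_fp4(w)
rel_err = np.linalg.norm(w - w_q) / np.linalg.norm(w)
print(f"relative error: {rel_err:.3f}")
```

Because most Gaussian-distributed weights land in the dense lower part of the grid after per-group scaling, the relative reconstruction error stays small without any calibration.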
It worked here because:

- FP4 with group_size=16 and per-group scaling preserves relative weight distributions well
- Self-attention weights tend to be normally distributed near zero, where the FP4 grid has its finest resolution (0, 0.5, 1.0, 1.5)
- MLP layers (more sensitive to quantization) keep NVIDIA's calibrated FP4
- `embed_tokens` stays BF16, preventing quantization noise from propagating through all layers

## License

Apache 2.0, same as the [base model](https://ai.google.dev/gemma/docs/gemma_4_license).

## Credits

- [Google DeepMind](https://deepmind.google/models/gemma/) for Gemma 4
- [NVIDIA](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) for the modelopt NVFP4 checkpoint