---
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
library_name: transformers
base_model:
- google/gemma-4-31B-it
- nvidia/Gemma-4-31B-IT-NVFP4
pipeline_tag: text-generation
tags:
- gemma4
- gemma-4-31b-it
- nvfp4
- modelopt
- vllm
- quantized
- nvidia
- lighthouse
model-index:
- name: gemma-4-31B-it-NVFP4-turbo
  results:
  - task:
      type: text-generation
    dataset:
      name: GPQA Diamond
      type: Idavidrein/gpqa
      config: gpqa_diamond
    metrics:
    - name: Accuracy
      type: accuracy
      value: 72.73
  - task:
      type: text-generation
    dataset:
      name: MMLU Pro
      type: TIGER-Lab/MMLU-Pro
    metrics:
    - name: Accuracy
      type: accuracy
      value: 83.93
---

<div align="center">
  <img src="https://huggingface.co/LilaRest/gemma-4-31B-it-NVFP4-turbo/resolve/main/banner.png">
</div>
|
|
<h1 align="center">⚡ Gemma 4 31B IT NVFP4 <i>Turbo</i></h1>
|
|
A repackaged [nvidia/Gemma-4-31B-IT-NVFP4](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) that is **68% smaller** in GPU memory and **~2.5× faster** than the [base model](https://huggingface.co/google/gemma-4-31B-it), while retaining **nearly identical quality** (1–3% accuracy loss). Fits on a *single* RTX 5090.
|
|
It fully leverages NVIDIA Blackwell FP4 tensor cores (RTX 5090, RTX PRO 6000, B200, and other SM 12.0+ GPUs with ≥20 GB VRAM) for **~2× higher concurrent throughput** than other quants such as [prithivMLmods/gemma-4-31B-it-NVFP4](https://huggingface.co/prithivMLmods/gemma-4-31B-it-NVFP4) or [cyankiwi/gemma-4-31B-it-AWQ-4bit](https://huggingface.co/cyankiwi/gemma-4-31B-it-AWQ-4bit).
|
|
This variant is **text-only**: the vision/audio weights and encoders have been stripped. If you need vision/audio support, open an issue or PR.
|
|
## Benchmark
|
|
|  |
|
|
> [!NOTE]
> RTX PRO 6000, `vllm bench` @ 1K input / 200 output tokens. See [bench.sh](/bench/bench.sh).
>
> We also ran the ***⚡ Turbo*** benchmark on an RTX 5090, and it performed identically: at 16K context, performance is not limited by GPU memory.
|
|
| | [Base model](https://huggingface.co/google/gemma-4-31B-it) | [NVIDIA quant](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) | ***⚡ Turbo*** (this model) |
| |------------------|------------------------------------------------------------|--------------------------------------------------------------------|---------------------------------------------| |
| | GPU memory | 58.9 GiB | 31 GiB | **18.5 GiB** *(-68% base, -40% nvidia)* | |
| | GPQA Diamond | 75.71% | 75.46% | **72.73%** *(-2.98% base, -2.73% nvidia)* | |
| | MMLU Pro | 85.25% | 84.94% | **83.93%** *(-1.32% base, -1.01% nvidia)* | |
| | Prefill | 6352 tok/s | 11069 tok/s | **15359 tok/s** *(+142% base, +39% nvidia)* | |
| | Decode (single) | 24.1 tok/s | 39.2 tok/s | **51 tok/s** *(+112% base, +30% nvidia)* | |
| | Decode (batched) | 494 tok/s | 913 tok/s | **1244 tok/s** *(+152% base, +36% nvidia)* | |
| Concurrency | 2.47 req/s | 4.56 req/s | **6.22 req/s** *(+152% base, +36% nvidia)* |
|
|
|
|
Other quants of similar size use kernel paths (compressed-tensors, Marlin) that don't leverage Blackwell's FP4 tensor cores, resulting in significantly lower prefill and concurrent throughput:
|
|
|
|
| | [prithivMLmods NVFP4](https://huggingface.co/prithivMLmods/gemma-4-31B-it-NVFP4) | [cyankiwi AWQ](https://huggingface.co/cyankiwi/gemma-4-31B-it-AWQ-4bit) | ***⚡ Turbo*** (this model) |
| |------------------|----------------------------------------------------------------------------------|-------------------------------------------------------------------------|----------------------------| |
| | GPU memory | 19.6 GiB | 19.6 GiB | **18.5 GiB** | |
| | Prefill | 6647 tok/s | 6626 tok/s | **15359 tok/s** | |
| | Decode (single) | 64.3 tok/s | 64.4 tok/s | **51 tok/s** | |
| | Decode (batched) | 757 tok/s | 757 tok/s | **1244 tok/s** | |
| | Concurrency | 3.79 req/s | 3.78 req/s | **6.22 req/s** | |
|
|
|
|
## Usage
|
|
Requirements:
|
|
- A **Blackwell GPU** (see [Compatibility](#compatibility))
- `transformers >= 5.5.0`
- `vllm >= 0.19` built with CUDA 13.0
  > **Note:** a plain `pip install vllm` installs a CUDA 12 build, which doesn't support Blackwell FP4 tensor cores. Use one of the methods below.
|
|
### Docker (recommended)
|
|
We recommend using the `vllm/vllm-openai:cu130-nightly` Docker image, which ships with CUDA 13.0 and Blackwell support out of the box.
|
|
```bash
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:cu130-nightly \
  --model LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```
|
|
> If you get `model type gemma4 not recognized`, run `pip install "transformers>=5.5.0"` inside the container.
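Once the server is up, any OpenAI-compatible client can query it. A minimal sketch using only the Python standard library (the endpoint and sampling parameters below are illustrative defaults, not requirements of the model):

```python
import json
from urllib import request

MODEL = "LilaRest/gemma-4-31B-it-NVFP4-turbo"

def build_request(prompt: str, max_tokens: int = 200) -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to the local vLLM server and return the reply text."""
    data = json.dumps(build_request(prompt)).encode()
    req = request.Request(
        f"{base_url}/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The official `openai` Python client works the same way: point its `base_url` at `http://localhost:8000/v1`.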
|
|
### pip (CUDA 13.0 wheel)
|
|
```bash
pip install https://github.com/vllm-project/vllm/releases/download/v0.19.0/vllm-0.19.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl
pip install "transformers>=5.5.0"

vllm serve LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```
|
|
### Key flags
|
|
- `--quantization modelopt`: required; activates NVIDIA's optimized CUTLASS kernels
- `--kv-cache-dtype fp8`: halves KV cache memory on Blackwell
- `--max-model-len 16384`: maximum context length per request. See [Compatibility](#compatibility) for the maximum value per GPU.
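To see why the FP8 KV cache matters, here's a back-of-envelope sizing sketch. The layer/head dimensions below are illustrative assumptions, not the actual Gemma 4 31B config:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int) -> int:
    """KV cache size for one sequence; the factor 2 covers both K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical dims: 48 layers, 8 KV heads, head_dim 128, 16K context.
bf16 = kv_cache_bytes(48, 8, 128, 16384, 2)  # 2 bytes per element
fp8 = kv_cache_bytes(48, 8, 128, 16384, 1)   # 1 byte per element

print(f"BF16: {bf16 / 2**30:.1f} GiB, FP8: {fp8 / 2**30:.1f} GiB per sequence")
# → BF16: 3.0 GiB, FP8: 1.5 GiB per sequence
```

Halving the per-sequence KV footprint roughly doubles how many concurrent sequences fit in the same cache budget, which is where the concurrency gains come from.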
|
|
## Tuning
|
|
The above benchmarks use a generic workload (1K input / 200 output tokens). You can tune vLLM flags for your specific use case:
|
|
- **High-throughput classification / short output**: reduce `--max-model-len` and cap output tokens (`max_tokens` in the API request). Less KV cache pressure means more concurrent requests. Expect **14+ req/s** on an RTX 5090 for classification workloads (~1K input, ~10 output tokens).
- **Long context**: increase `--max-model-len` (up to ~25K on RTX 5090, ~180K on RTX PRO 6000), trading concurrent capacity for longer sequences.
- **Latency-sensitive**: keep concurrency low. Single-request decode is ~51 tok/s with TTFT under 70 ms, fast enough for interactive use.
- **Batch processing**: push `--max-num-seqs` higher and use `--request-rate inf` with `--max-concurrency` to saturate the GPU. Peak throughput is ~6.2 req/s on RTX PRO 6000 at the 1K/200 workload.
|
|
## Compatibility
|
|
**Blackwell (SM 12.0+) – full FP4 tensor core support:**
|
|
|
|
| GPU                | VRAM   | Works? | Max context | Notes                                                 |
|--------------------|--------|--------|-------------|-------------------------------------------------------|
| RTX 5090           | 32 GB  | ✅     | ~25K        | Primary target                                        |
| RTX PRO 6000       | 96 GB  | ✅     | ~180K       | Ideal for high-concurrency or long-context workloads  |
| B200               | 192 GB | ✅     | 262K (full) | Datacenter, untested                                  |
| B100               | 192 GB | ✅     | 262K (full) | Datacenter, untested                                  |
| RTX 5080 and lower | ≤16 GB | ❌     | –           | Not enough VRAM                                       |
|
|
|
|
Older GPUs (H100, A100, RTX 4090, etc.) may work without `--quantization modelopt`, but they lack FP4 tensor cores, so you'll lose the optimized kernel path and performance will be significantly worse.
|
|
## Approach
|
|
Three changes were made:
|
|
1. **Quantized** all self-attention weights from BF16 → FP4 (RTN, group_size=16, matching the modelopt NVFP4 format)
2. **Updated** the architecture to `Gemma4ForCausalLM` and the quantization config accordingly
3. **Stripped** the vision and audio encoders

Everything else is untouched: MLP layers keep NVIDIA's calibrated FP4, `embed_tokens` stays BF16, and all norms are preserved, so all the [nvidia/Gemma-4-31B-IT-NVFP4](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) optimizations are retained.
|
|
#### Why RTN didn't hurt quality
|
|
RTN (Round-To-Nearest) is the simplest quantization method: no calibration data, fully reproducible. It worked here because:
|
|
- FP4 with group_size=16 per-group scaling preserves relative weight distributions well
- Self-attention weights tend to be normally distributed near zero, where the FP4 grid is densest (0, ±0.5, ±1.0, ±1.5)
- MLP layers (more sensitive to quantization) keep NVIDIA's calibrated FP4
- `embed_tokens` stays BF16, preventing quantization noise from propagating through all layers
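The RTN step can be sketched in a few lines. This is illustrative only: the positive E2M1 value grid is real, but the scale encoding and weight packing of the actual modelopt NVFP4 format are not reproduced here.

```python
# Positive values representable in FP4 E2M1; each also exists negated.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def rtn_quantize_group(weights):
    """Round-to-nearest FP4 for one group of 16 weights sharing one scale."""
    assert len(weights) == 16
    amax = max(abs(w) for w in weights) or 1.0
    scale = amax / 6.0  # map the largest magnitude onto the grid max, 6.0
    out = []
    for w in weights:
        # Nearest grid point to the scaled magnitude, sign restored after.
        mag = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        out.append((mag if w >= 0 else -mag) * scale)  # dequantized value
    return out

# Near-zero weights land on the dense part of the grid, so the
# quantization error stays small relative to the group's max magnitude.
group = [0.01 * i - 0.08 for i in range(16)]
deq = rtn_quantize_group(group)
max_err = max(abs(a - b) for a, b in zip(group, deq))
```

No calibration data enters anywhere above, which is what makes the result fully reproducible.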
|
|
## License
|
|
Apache 2.0, the same license as the [base model](https://ai.google.dev/gemma/docs/gemma_4_license).
|
|
## Credits
|
|
- [Google DeepMind](https://deepmind.google/models/gemma/) for Gemma 4
- [NVIDIA](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) for the modelopt NVFP4 checkpoint
|
|
|
|