---

license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
library_name: transformers
base_model:
- google/gemma-4-31B-it
- nvidia/Gemma-4-31B-IT-NVFP4
pipeline_tag: text-generation
tags:
- gemma4
- gemma-4-31b-it
- nvfp4
- modelopt
- vllm
- quantized
- nvidia
- lighthouse
model-index:
- name: gemma-4-31B-it-NVFP4-turbo
  results:
  - task:
      type: text-generation
    dataset:
      name: GPQA Diamond
      type: Idavidrein/gpqa
      config: gpqa_diamond
    metrics:
    - name: Accuracy
      type: accuracy
      value: 72.73
  - task:
      type: text-generation
    dataset:
      name: MMLU Pro
      type: TIGER-Lab/MMLU-Pro
    metrics:
    - name: Accuracy
      type: accuracy
      value: 83.93

---

<div align="center">
  <img src="https://huggingface.co/LilaRest/gemma-4-31B-it-NVFP4-turbo/resolve/main/banner.png">
</div>

<h1 align="center">⚡ Gemma 4 31B IT NVFP4 <i>Turbo</i></h1>

A repackaged [nvidia/Gemma-4-31B-IT-NVFP4](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) that is **68% smaller** in GPU memory and **~2.5× faster** than the [base model](https://huggingface.co/google/gemma-4-31B-it), while retaining **nearly identical quality** (1-3% loss). Fits on a *single* RTX 5090 (🎉).

It fully leverages NVIDIA Blackwell FP4 tensor cores (RTX 5090, RTX PRO 6000, B200, and other SM 12.0+ GPUs with ≥20 GB VRAM) for **~2× higher concurrent throughput** than other quants like [prithivMLmods/gemma-4-31B-it-NVFP4](https://huggingface.co/prithivMLmods/gemma-4-31B-it-NVFP4) or [cyankiwi/gemma-4-31B-it-AWQ-4bit](https://huggingface.co/cyankiwi/gemma-4-31B-it-AWQ-4bit).

This variant is **text-only**: the vision/audio weights and encoders have been stripped. If you need vision/audio support, open an issue or PR.

## Benchmark

![Benchmark chart](bench/chart/benchmark.png)

> [!NOTE]  
> RTX PRO 6000, `vllm bench` @ 1K input / 200 output tokens. See [bench.sh](/bench/bench.sh).
>
> Note: We also ran the ***⚡ Turbo*** benchmark on an RTX 5090, and it performed identically: at 16K context, performance is not limited by GPU memory.

|                  | [Base model](https://huggingface.co/google/gemma-4-31B-it) | [NVIDIA quant](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) | ***⚡ Turbo*** (this model)                  |
|------------------|------------------------------------------------------------|--------------------------------------------------------------------|---------------------------------------------|
| GPU memory       | 58.9 GiB                                                   | 31 GiB                                                             | **18.5 GiB** *(-68% base, -40% nvidia)*     |
| GPQA Diamond     | 75.71%                                                     | 75.46%                                                             | **72.73%** *(-2.98% base, -2.73% nvidia)*   |
| MMLU Pro         | 85.25%                                                     | 84.94%                                                             | **83.93%** *(-1.32% base, -1.01% nvidia)*   |
| Prefill          | 6352 tok/s                                                 | 11069 tok/s                                                        | **15359 tok/s** *(+142% base, +39% nvidia)* |
| Decode (single)  | 24.1 tok/s                                                 | 39.2 tok/s                                                         | **51 tok/s** *(+112% base, +30% nvidia)*    |
| Decode (batched) | 494 tok/s                                                  | 913 tok/s                                                          | **1244 tok/s** *(+152% base, +36% nvidia)*  |
| Concurrency      | 2.47 req/s                                                 | 4.56 req/s                                                         | **6.22 req/s** *(+152% base, +36% nvidia)*  |


Other quants of similar size use kernel paths (compressed-tensors, Marlin) that don't leverage Blackwell's FP4 tensor cores, resulting in significantly lower prefill and concurrent throughput:


|                  | [prithivMLmods NVFP4](https://huggingface.co/prithivMLmods/gemma-4-31B-it-NVFP4) | [cyankiwi AWQ](https://huggingface.co/cyankiwi/gemma-4-31B-it-AWQ-4bit) | ***⚡ Turbo*** (this model) |
|------------------|----------------------------------------------------------------------------------|-------------------------------------------------------------------------|----------------------------|
| GPU memory       | 19.6 GiB                                                                         | 19.6 GiB                                                                | **18.5 GiB**               |
| Prefill          | 6647 tok/s                                                                       | 6626 tok/s                                                              | **15359 tok/s**            |
| Decode (single)  | 64.3 tok/s                                                                       | 64.4 tok/s                                                              | **51 tok/s**               |
| Decode (batched) | 757 tok/s                                                                        | 757 tok/s                                                               | **1244 tok/s**             |
| Concurrency      | 3.79 req/s                                                                       | 3.78 req/s                                                              | **6.22 req/s**             |


## Usage

Requirements:

- A **Blackwell GPU** (see [Compatibility](#compatibility))
- `transformers >= 5.5.0`
- `vllm >= 0.19` with CUDA 13.0
  > **Note:** `pip install vllm` installs CUDA 12, which doesn't support Blackwell FP4 tensor cores. Use one of the methods below.

### Docker (recommended)

We recommend using the `vllm/vllm-openai:cu130-nightly` Docker image, which ships with CUDA 13.0 and Blackwell support out of the box.

```bash
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:cu130-nightly \
  --model LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```

> If you get `model type gemma4 not recognized`, run `pip install "transformers>=5.5.0"` inside the container.

### pip (CUDA 13.0 wheel)

```bash
pip install https://github.com/vllm-project/vllm/releases/download/v0.19.0/vllm-0.19.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl
pip install "transformers>=5.5.0"

vllm serve LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```

### Key flags

- `--quantization modelopt`: required; activates NVIDIA's optimized CUTLASS kernels
- `--kv-cache-dtype fp8`: halves KV cache memory on Blackwell
- `--max-model-len 16384`: maximum context length per request; see [Compatibility](#compatibility) for the maximum value per GPU
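
Once the server is up (via either route above), it exposes vLLM's OpenAI-compatible API. The sketch below builds a chat-completions payload for it; the endpoint path and port are vLLM defaults, and actually sending the request requires the running server, so the send is left commented out:

```python
import json

# Hypothetical client sketch: endpoint path/port are vLLM's
# OpenAI-compatible defaults, not something this repo configures.
URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "LilaRest/gemma-4-31B-it-NVFP4-turbo",
    "messages": [
        {"role": "user", "content": "Summarize NVFP4 in one sentence."}
    ],
    "max_tokens": 200,   # short outputs keep KV cache pressure low
    "temperature": 0.7,
}

body = json.dumps(payload)
print(body)

# To actually send it (requires the running server):
# import urllib.request
# req = urllib.request.Request(
#     URL, data=body.encode(),
#     headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```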

## Tuning

The above benchmarks use a generic workload (1K input / 200 output tokens). You can tune vLLM flags for your specific use case:

- **High-throughput classification / short output**: Reduce `--max-model-len` and limit output tokens (`max_tokens` in the API request). Less KV cache pressure means more concurrent requests. Expect **14+ req/s** on an RTX 5090 for classification workloads (~1K input, ~10 output tokens).
- **Long context**: Increase `--max-model-len` (up to ~25K on an RTX 5090, ~180K on a PRO 6000), trading concurrent capacity for longer sequences.
- **Latency-sensitive**: Keep concurrency low. Single-request decode is ~51 tok/s with TTFT under 70 ms, fast enough for interactive use.
- **Batch processing**: Push `--max-num-seqs` higher and use `--request-rate inf` with `--max-concurrency` to saturate the GPU. Peak throughput is ~6.2 req/s on an RTX PRO 6000 at the 1K/200 workload.
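
The KV-cache trade-offs above come down to simple arithmetic. The sketch below uses placeholder layer/head shapes (not the real Gemma 4 31B config; read the actual values from the model's `config.json`) to show why `--kv-cache-dtype fp8` doubles how many concurrent tokens fit in a given VRAM budget:

```python
# Back-of-the-envelope KV-cache sizing. The shapes below are
# PLACEHOLDERS, not the real Gemma 4 31B config.
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    # 2 tensors (K and V) per layer, one vector per KV head
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

layers, kv_heads, head_dim = 48, 8, 128   # hypothetical shapes
fp16 = kv_bytes_per_token(layers, kv_heads, head_dim, 2)
fp8 = kv_bytes_per_token(layers, kv_heads, head_dim, 1)

print(f"fp16 KV: {fp16} B/token, fp8 KV: {fp8} B/token")
# fp8 halves per-token KV memory, so the same VRAM budget holds
# twice the concurrent tokens (context length x batch size).
```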

## Compatibility

**Blackwell (SM 12.0+) with full FP4 tensor core support:**


| GPU                | VRAM   | Works? | Max context | Notes                                                |
|--------------------|--------|--------|-------------|------------------------------------------------------|
| RTX 5090           | 32 GB  | ✅      | ~25K        | Primary target                                       |
| RTX PRO 6000       | 96 GB  | ✅      | ~180K       | Ideal for high-concurrency or long-context workloads |
| B200               | 192 GB | ✅      | 262K (full) | Datacenter, untested                                 |
| B100               | 192 GB | ✅      | 262K (full) | Datacenter, untested                                 |
| RTX 5080 and lower | ≤16 GB | ❌      | n/a         | Not enough VRAM                                      |


Older GPUs (H100, A100, RTX 4090, etc.) may work without `--quantization modelopt` but they lack FP4 tensor cores, so you'll lose the optimized kernel path and performance will be significantly worse.

## Approach

Three changes were made:

1. **Quantized** all self-attention weights from BF16 → FP4 (RTN, group_size=16, matching the modelopt NVFP4 format)
2. **Updated** architecture to `Gemma4ForCausalLM` and quantization config accordingly
3. **Stripped** the vision and audio encoders

Everything else is untouched: MLP layers keep NVIDIA's calibrated FP4, `embed_tokens` stays BF16, and all norms are preserved, so we retain all of the [nvidia/Gemma-4-31B-IT-NVFP4](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) optimizations.
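
As a rough illustration of step 1, here is a pure-Python toy of RTN with `group_size=16` and per-group scaling. It is only a sketch of the rounding step: the real modelopt NVFP4 format also stores FP8 block scales plus a global scale, which this toy omits.

```python
# Toy RTN quantizer over the positive FP4 (E2M1) grid, with one
# scale per group of 16 weights. Illustrative only -- NOT the
# modelopt implementation.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive FP4 grid

def rtn_fp4(weights, group_size=16):
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        # Map the group's absolute max onto the grid's max value, 6
        scale = max(abs(w) for w in group) / 6.0 or 1.0
        for w in group:
            mag = min(E2M1, key=lambda g: abs(abs(w) / scale - g))
            out.append(mag * scale * (1 if w >= 0 else -1))
    return out

ws = [0.01, -0.02, 0.003, 0.12, -0.05, 0.0, 0.08, -0.11,
      0.02, 0.04, -0.01, 0.06, 0.09, -0.03, 0.05, -0.07]
deq = rtn_fp4(ws)
err = max(abs(a - b) for a, b in zip(ws, deq))
print(f"max abs error: {err:.4f}")
```

With per-group scaling, the worst-case error is bounded by half the local grid spacing times the group scale, which is why RTN stays accurate when weights cluster near zero.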

### Why RTN didn't hurt quality

RTN (Round-To-Nearest) is the simplest quantization method: no calibration data, fully reproducible. It worked here because:

- FP4 with group_size=16 and per-group scaling preserves relative weight distributions well
- Self-attention weights tend to be normally distributed near zero, where the FP4 grid has finest resolution (0, 0.5, 1.0, 1.5)
- MLP layers (more sensitive to quantization) keep NVIDIA's calibrated FP4
- `embed_tokens` stays BF16, preventing noise from propagating through all layers
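
The near-zero density claim above is easy to check directly: listing the positive E2M1 grid and the gaps between adjacent representable magnitudes shows the spacing widening from 0.5 near zero to 2.0 at the tail.

```python
# Spacing between adjacent representable FP4 (E2M1) magnitudes
grid = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
gaps = [b - a for a, b in zip(grid, grid[1:])]
print(gaps)  # -> [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 2.0]
```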

## License

Apache 2.0, the same license as the [base model](https://ai.google.dev/gemma/docs/gemma_4_license).

## Credits

- [Google DeepMind](https://deepmind.google/models/gemma/) for Gemma 4
- [NVIDIA](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) for the modelopt NVFP4 checkpoint