---
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
library_name: transformers
base_model:
  - google/gemma-4-31B-it
  - nvidia/Gemma-4-31B-IT-NVFP4
pipeline_tag: text-generation
tags:
  - gemma4
  - gemma-4-31b-it
  - nvfp4
  - modelopt
  - vllm
  - quantized
  - nvidia
  - lighthouse
model-index:
  - name: gemma-4-31B-it-NVFP4-turbo
    results:
      - task:
          type: text-generation
        dataset:
          name: GPQA Diamond
          type: Idavidrein/gpqa
          config: gpqa_diamond
        metrics:
          - name: Accuracy
            type: accuracy
            value: 72.73
      - task:
          type: text-generation
        dataset:
          name: MMLU Pro
          type: TIGER-Lab/MMLU-Pro
        metrics:
          - name: Accuracy
            type: accuracy
            value: 83.93
---

⚡ Gemma 4 31B IT NVFP4 Turbo

A repackaged nvidia/Gemma-4-31B-IT-NVFP4 that is 68% smaller in GPU memory and ~2.5× faster than the base model, while retaining nearly identical quality (1-3% accuracy loss). Fits on a single RTX 5090 (🎉).

It fully leverages NVIDIA Blackwell FP4 tensor cores (RTX 5090, RTX PRO 6000, B200, and other SM 12.0+ GPUs with ≥20 GB VRAM) for ~2× higher concurrent throughput than other quants such as prithivMLmods/gemma-4-31B-it-NVFP4 or cyankiwi/gemma-4-31B-it-AWQ-4bit.

This variant is text-only: the video/audio weights and encoders have been stripped. If you need video/audio support, open an issue or PR.

Benchmark

Benchmark chart

RTX PRO 6000, vllm bench @ 1K input / 200 output tokens. See bench.sh.

Note: We also ran the ⚡ Turbo benchmark on an RTX 5090, and it performed identically: at 16K context, performance is not limited by GPU memory.

|                  | Base model | NVIDIA quant | ⚡ Turbo (this model) |
|------------------|------------|--------------|----------------------|
| GPU memory       | 58.9 GiB   | 31 GiB       | 18.5 GiB (-68% base, -40% nvidia) |
| GPQA Diamond     | 75.71%     | 75.46%       | 72.73% (-2.98% base, -2.73% nvidia) |
| MMLU Pro         | 85.25%     | 84.94%       | 83.93% (-1.32% base, -1.01% nvidia) |
| Prefill          | 6352 tok/s | 11069 tok/s  | 15359 tok/s (+142% base, +39% nvidia) |
| Decode (single)  | 24.1 tok/s | 39.2 tok/s   | 51 tok/s (+112% base, +30% nvidia) |
| Decode (batched) | 494 tok/s  | 913 tok/s    | 1244 tok/s (+152% base, +36% nvidia) |
| Concurrency      | 2.47 req/s | 4.56 req/s   | 6.22 req/s (+152% base, +36% nvidia) |

Other quants of similar size use kernel paths (compressed-tensors, Marlin) that don't leverage Blackwell's FP4 tensor cores, resulting in significantly lower prefill and concurrent throughput:

|                  | prithivMLmods NVFP4 | cyankiwi AWQ | ⚡ Turbo (this model) |
|------------------|---------------------|--------------|----------------------|
| GPU memory       | 19.6 GiB            | 19.6 GiB     | 18.5 GiB |
| Prefill          | 6647 tok/s          | 6626 tok/s   | 15359 tok/s |
| Decode (single)  | 64.3 tok/s          | 64.4 tok/s   | 51 tok/s |
| Decode (batched) | 757 tok/s           | 757 tok/s    | 1244 tok/s |
| Concurrency      | 3.79 req/s          | 3.78 req/s   | 6.22 req/s |

Usage

Requirements:

  • A Blackwell GPU (see Compatibility)
  • transformers >= 5.5.0
  • vllm >= 0.19 with CUDA 13.0

    Note: pip install vllm installs a CUDA 12 build, which doesn't support Blackwell FP4 tensor cores. Use one of the methods below.

Docker (recommended)

We recommend using the vllm/vllm-openai:cu130-nightly Docker image, which ships with CUDA 13.0 and Blackwell support out of the box.

```shell
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:cu130-nightly \
  --model LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```

If you get `model type gemma4 not recognized`, run `pip install "transformers>=5.5.0"` inside the container.
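
Once the container is up, it serves an OpenAI-compatible API on port 8000. Below is a minimal client sketch using only the Python standard library; the prompt and sampling parameters are illustrative, not recommendations:

```python
# Minimal client for the vLLM OpenAI-compatible endpoint started above.
# Assumes the server is listening on localhost:8000.
import json
from urllib import request

payload = {
    "model": "LilaRest/gemma-4-31B-it-NVFP4-turbo",
    "messages": [
        {"role": "user", "content": "Summarize FP4 quantization in one sentence."}
    ],
    "max_tokens": 200,   # matches the 200-output-token benchmark workload
    "temperature": 0.7,
}

def chat(url="http://localhost:8000/v1/chat/completions"):
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# text = chat()  # requires the server to be running
```

Any OpenAI-compatible client (for example, the official openai Python package pointed at http://localhost:8000/v1) works the same way.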

pip (CUDA 13.0 wheel)

```shell
pip install https://github.com/vllm-project/vllm/releases/download/v0.19.0/vllm-0.19.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl
pip install "transformers>=5.5.0"
```

```shell
vllm serve LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```

Key flags

  • --quantization modelopt – required; activates NVIDIA's optimized CUTLASS kernels
  • --kv-cache-dtype fp8 – halves KV cache memory on Blackwell
  • --max-model-len 16384 – maximum context length per request; see Compatibility for the max value per GPU
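
To see why the fp8 KV cache matters, here is a back-of-envelope sizing sketch. The layer/head numbers below are illustrative placeholders, not the actual model config:

```python
# Back-of-envelope KV-cache sizing. The defaults (48 layers, 8 KV heads,
# head_dim 128) are illustrative assumptions, NOT the real Gemma 4 31B config.
def kv_cache_bytes(tokens, layers=48, kv_heads=8, head_dim=128, dtype_bytes=1):
    # Two tensors (K and V) per layer, one vector per token per KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

ctx = 16_384
fp8  = kv_cache_bytes(ctx, dtype_bytes=1)   # --kv-cache-dtype fp8
bf16 = kv_cache_bytes(ctx, dtype_bytes=2)   # default 16-bit cache

print(f"fp8 : {fp8 / 2**30:.2f} GiB per 16K-token sequence")   # 1.50 GiB
print(f"bf16: {bf16 / 2**30:.2f} GiB per 16K-token sequence")  # 3.00 GiB
```

The fp8 cache is exactly half the size of the bf16 cache, which is why the flag roughly doubles how many concurrent sequences fit in the KV budget.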

Tuning

The above benchmarks use a generic workload (1K input / 200 output tokens). You can tune vLLM flags for your specific use case:

  • High-throughput classification / short output – reduce --max-model-len and limit output tokens (max_tokens in the API request). Less KV cache pressure means more concurrent requests. Expect 14+ req/s on an RTX 5090 for classification workloads (~1K input, ~10 output tokens).
  • Long context – increase --max-model-len (up to ~25K on the RTX 5090, ~180K on the PRO 6000), trading concurrent capacity for longer sequences.
  • Latency-sensitive – keep concurrency low. Single-request decode is ~51 tok/s with TTFT under 70 ms, fast enough for interactive use.
  • Batch processing – push --max-num-seqs higher and use --request-rate inf with --max-concurrency to saturate the GPU. Peak throughput is ~6.2 req/s on the RTX PRO 6000 at the 1K/200 workload.

Compatibility

Blackwell (SM 12.0+) – full FP4 tensor core support:

| GPU                | VRAM   | Works? | Max context | Notes |
|--------------------|--------|--------|-------------|-------|
| RTX 5090           | 32 GB  | ✅     | ~25K        | Primary target |
| RTX PRO 6000       | 96 GB  | ✅     | ~180K       | Ideal for high-concurrency or long-context workloads |
| B200               | 192 GB | ✅     | 262K (full) | Datacenter, untested |
| B100               | 192 GB | ✅     | 262K (full) | Datacenter, untested |
| RTX 5080 and lower | ≤16 GB | ❌     | n/a         | Not enough VRAM |

Older GPUs (H100, A100, RTX 4090, etc.) may work without --quantization modelopt, but they lack FP4 tensor cores, so you lose the optimized kernel path and performance is significantly worse.

Approach

Three changes were made:

  1. Quantized all self-attention weights from BF16 → FP4 (RTN, group_size=16, matching the modelopt NVFP4 format)
  2. Updated the architecture to Gemma4ForCausalLM and the quantization config accordingly
  3. Stripped the vision and audio encoders

Everything else is untouched: MLP layers keep NVIDIA's calibrated FP4, embed_tokens stays BF16, and all norms are preserved, so we retain all of the nvidia/Gemma-4-31B-IT-NVFP4 optimizations.

Why RTN didn't hurt quality

RTN (Round-To-Nearest) is the simplest quantization method: no calibration data, fully reproducible. It worked here because:

  • FP4 with group_size=16 and per-group scaling preserves relative weight distributions well
  • Self-attention weights tend to be normally distributed near zero, where the FP4 grid has its finest resolution (0, 0.5, 1.0, 1.5, before the spacing widens toward 6.0)
  • MLP layers (more sensitive to quantization) keep NVIDIA's calibrated FP4
  • embed_tokens stays BF16, preventing noise from propagating through all layers
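
The RTN step can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the conversion script used for this repo: it keeps everything in float64 instead of the packed NVFP4 layout, and uses plain float per-group scales:

```python
import numpy as np

# Positive half of the FP4 (E2M1) grid; the full grid is symmetric around zero.
# Note the 0.5 spacing near zero and the widening steps toward 6.0.
FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_POS[::-1], FP4_POS])

def rtn_fp4(weights, group_size=16):
    """Round-to-nearest FP4 with per-group scaling (quantize + dequantize)."""
    w = weights.reshape(-1, group_size)
    # Scale each group so its absolute max lands on the grid's max (6.0).
    scales = np.abs(w).max(axis=1, keepdims=True) / 6.0
    scales[scales == 0] = 1.0                      # avoid divide-by-zero
    # For every element, pick the nearest grid point after scaling.
    idx = np.abs(w[..., None] / scales[..., None] - FP4_GRID).argmin(axis=-1)
    q = FP4_GRID[idx] * scales                     # dequantized values
    return q.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 64))            # toy "attention weight" block
wq = rtn_fp4(w)
err = np.abs(w - wq).max()
```

Because the widest gap between adjacent grid points is 2.0 (between 4.0 and 6.0), the per-element error is bounded by one sixth of each group's absolute maximum, which is why near-zero, normally distributed weights survive RTN well.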

License

Apache 2.0 – same as the base model.

Credits