---
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
library_name: transformers
base_model:
  - google/gemma-4-31B-it
  - nvidia/Gemma-4-31B-IT-NVFP4
pipeline_tag: text-generation
tags:
  - gemma4
  - gemma-4-31b-it
  - nvfp4
  - modelopt
  - vllm
  - quantized
  - nvidia
  - lighthouse
model-index:
  - name: gemma-4-31B-it-NVFP4-turbo
    results:
      - task:
          type: text-generation
        dataset:
          name: GPQA Diamond
          type: Idavidrein/gpqa
          config: gpqa_diamond
        metrics:
          - name: Accuracy
            type: accuracy
            value: 72.73
      - task:
          type: text-generation
        dataset:
          name: MMLU Pro
          type: TIGER-Lab/MMLU-Pro
        metrics:
          - name: Accuracy
            type: accuracy
            value: 83.93
---

⚡ Gemma 4 31B IT NVFP4 Turbo

A repackaged nvidia/Gemma-4-31B-IT-NVFP4 that is 68% smaller in GPU memory and ~2.5× faster than the base model, while retaining nearly identical quality (1-3% accuracy loss). Fits on a single RTX 5090 (🎉).

It fully leverages NVIDIA Blackwell FP4 tensor cores (RTX 5090, RTX PRO 6000, B200, and other SM 12.0+ GPUs with ≥20 GB VRAM) for ~2× higher concurrent throughput than other quants such as prithivMLmods/gemma-4-31B-it-NVFP4 or cyankiwi/gemma-4-31B-it-AWQ-4bit.

This variant is text-only: the video/audio weights and encoders have been stripped. If you need video/audio support, open an issue or PR.

Benchmark

Benchmark chart

RTX PRO 6000, vllm bench @ 1K input / 200 output tokens. See bench.sh.

Note: We also ran the ⚡ Turbo benchmark on an RTX 5090, and it performed identically: at 16K context, performance is not limited by GPU memory.

|                  | Base model | NVIDIA quant | ⚡ Turbo (this model) |
|------------------|------------|--------------|----------------------|
| GPU memory       | 58.9 GiB   | 31 GiB       | 18.5 GiB (-68% base, -40% nvidia) |
| GPQA Diamond     | 75.71%     | 75.46%       | 72.73% (-2.98% base, -2.73% nvidia) |
| MMLU Pro         | 85.25%     | 84.94%       | 83.93% (-1.32% base, -1.01% nvidia) |
| Prefill          | 6352 tok/s | 11069 tok/s  | 15359 tok/s (+142% base, +39% nvidia) |
| Decode (single)  | 24.1 tok/s | 39.2 tok/s   | 51 tok/s (+112% base, +30% nvidia) |
| Decode (batched) | 494 tok/s  | 913 tok/s    | 1244 tok/s (+152% base, +36% nvidia) |
| Concurrency      | 2.47 req/s | 4.56 req/s   | 6.22 req/s (+152% base, +36% nvidia) |

Other quants of similar size use kernel paths (compressed-tensors, Marlin) that don't leverage Blackwell's FP4 tensor cores, resulting in significantly lower prefill and concurrent throughput:

|                  | prithivMLmods NVFP4 | cyankiwi AWQ | ⚡ Turbo (this model) |
|------------------|---------------------|--------------|----------------------|
| GPU memory       | 19.6 GiB            | 19.6 GiB     | 18.5 GiB |
| Prefill          | 6647 tok/s          | 6626 tok/s   | 15359 tok/s |
| Decode (single)  | 64.3 tok/s          | 64.4 tok/s   | 51 tok/s |
| Decode (batched) | 757 tok/s           | 757 tok/s    | 1244 tok/s |
| Concurrency      | 3.79 req/s          | 3.78 req/s   | 6.22 req/s |

Usage

Requirements:

  • A Blackwell GPU (see Compatibility)
  • transformers >= 5.5.0
  • vllm >= 0.19 with CUDA 13.0

    Note: pip install vllm installs a CUDA 12 build, which doesn't support Blackwell FP4 tensor cores. Use one of the methods below.

Docker (recommended)

We recommend using the vllm/vllm-openai:cu130-nightly Docker image, which ships with CUDA 13.0 and Blackwell support out of the box.

```shell
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:cu130-nightly \
  --model LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```

If you get `model type gemma4 not recognized`, run `pip install "transformers>=5.5.0"` inside the container.
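
Once the container is up, it serves an OpenAI-compatible API on port 8000. Below is a minimal client sketch using only the Python standard library; the prompt and sampling parameters are illustrative, not recommendations:

```python
# Minimal client for the vLLM OpenAI-compatible endpoint started above.
# Assumes the server is listening on localhost:8000.
import json
from urllib import request

payload = {
    "model": "LilaRest/gemma-4-31B-it-NVFP4-turbo",
    "messages": [
        {"role": "user", "content": "Summarize FP4 quantization in one sentence."}
    ],
    "max_tokens": 200,   # matches the 200-output-token benchmark workload
    "temperature": 0.7,
}

def chat(url="http://localhost:8000/v1/chat/completions"):
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# text = chat()  # requires the server to be running
```

Any OpenAI-compatible client (for example, the official openai Python package pointed at http://localhost:8000/v1) works the same way.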

pip (CUDA 13.0 wheel)

```shell
pip install https://github.com/vllm-project/vllm/releases/download/v0.19.0/vllm-0.19.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl
pip install "transformers>=5.5.0"
```

```shell
vllm serve LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```

Key flags

  • --quantization modelopt – required; activates NVIDIA's optimized CUTLASS kernels
  • --kv-cache-dtype fp8 – halves KV cache memory on Blackwell
  • --max-model-len 16384 – maximum context length per request; see Compatibility for the max value per GPU
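
To see why the fp8 KV cache matters, here is a back-of-envelope sizing sketch. The layer/head numbers below are illustrative placeholders, not the actual model config:

```python
# Back-of-envelope KV-cache sizing. The defaults (48 layers, 8 KV heads,
# head_dim 128) are illustrative assumptions, NOT the real Gemma 4 31B config.
def kv_cache_bytes(tokens, layers=48, kv_heads=8, head_dim=128, dtype_bytes=1):
    # Two tensors (K and V) per layer, one vector per token per KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

ctx = 16_384
fp8  = kv_cache_bytes(ctx, dtype_bytes=1)   # --kv-cache-dtype fp8
bf16 = kv_cache_bytes(ctx, dtype_bytes=2)   # default 16-bit cache

print(f"fp8 : {fp8 / 2**30:.2f} GiB per 16K-token sequence")   # 1.50 GiB
print(f"bf16: {bf16 / 2**30:.2f} GiB per 16K-token sequence")  # 3.00 GiB
```

The fp8 cache is exactly half the size of the bf16 cache, which is why the flag roughly doubles how many concurrent sequences fit in the KV budget.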

Tuning

The above benchmarks use a generic workload (1K input / 200 output tokens). You can tune vLLM flags for your specific use case:

  • High-throughput classification / short output – reduce --max-model-len and limit output tokens (max_tokens in the API request). Less KV cache pressure means more concurrent requests. Expect 14+ req/s on an RTX 5090 for classification workloads (~1K input, ~10 output tokens).
  • Long context – increase --max-model-len (up to ~25K on the RTX 5090, ~180K on the PRO 6000), trading concurrent capacity for longer sequences.
  • Latency-sensitive – keep concurrency low. Single-request decode is ~51 tok/s with TTFT under 70 ms, fast enough for interactive use.
  • Batch processing – push --max-num-seqs higher and use --request-rate inf with --max-concurrency to saturate the GPU. Peak throughput is ~6.2 req/s on the RTX PRO 6000 at the 1K/200 workload.

Compatibility

Blackwell (SM 12.0+) – full FP4 tensor core support:

| GPU                | VRAM   | Works? | Max context | Notes |
|--------------------|--------|--------|-------------|-------|
| RTX 5090           | 32 GB  | ✅     | ~25K        | Primary target |
| RTX PRO 6000       | 96 GB  | ✅     | ~180K       | Ideal for high-concurrency or long-context workloads |
| B200               | 192 GB | ✅     | 262K (full) | Datacenter, untested |
| B100               | 192 GB | ✅     | 262K (full) | Datacenter, untested |
| RTX 5080 and lower | ≤16 GB | ❌     | n/a         | Not enough VRAM |

Older GPUs (H100, A100, RTX 4090, etc.) may work without --quantization modelopt, but they lack FP4 tensor cores, so you lose the optimized kernel path and performance is significantly worse.

Approach

Three changes were made:

  1. Quantized all self-attention weights from BF16 → FP4 (RTN, group_size=16, matching the modelopt NVFP4 format)
  2. Updated the architecture to Gemma4ForCausalLM and the quantization config accordingly
  3. Stripped the vision and audio encoders

Everything else is untouched: MLP layers keep NVIDIA's calibrated FP4, embed_tokens stays BF16, and all norms are preserved, so we retain all of the nvidia/Gemma-4-31B-IT-NVFP4 optimizations.

Why RTN didn't hurt quality

RTN (Round-To-Nearest) is the simplest quantization method: no calibration data, fully reproducible. It worked here because:

  • FP4 with group_size=16 and per-group scaling preserves relative weight distributions well
  • Self-attention weights tend to be normally distributed near zero, where the FP4 grid has its finest resolution (0, 0.5, 1.0, 1.5, before the spacing widens toward 6.0)
  • MLP layers (more sensitive to quantization) keep NVIDIA's calibrated FP4
  • embed_tokens stays BF16, preventing noise from propagating through all layers
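
The RTN step can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the conversion script used for this repo: it keeps everything in float64 instead of the packed NVFP4 layout, and uses plain float per-group scales:

```python
import numpy as np

# Positive half of the FP4 (E2M1) grid; the full grid is symmetric around zero.
# Note the 0.5 spacing near zero and the widening steps toward 6.0.
FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_POS[::-1], FP4_POS])

def rtn_fp4(weights, group_size=16):
    """Round-to-nearest FP4 with per-group scaling (quantize + dequantize)."""
    w = weights.reshape(-1, group_size)
    # Scale each group so its absolute max lands on the grid's max (6.0).
    scales = np.abs(w).max(axis=1, keepdims=True) / 6.0
    scales[scales == 0] = 1.0                      # avoid divide-by-zero
    # For every element, pick the nearest grid point after scaling.
    idx = np.abs(w[..., None] / scales[..., None] - FP4_GRID).argmin(axis=-1)
    q = FP4_GRID[idx] * scales                     # dequantized values
    return q.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 64))            # toy "attention weight" block
wq = rtn_fp4(w)
err = np.abs(w - wq).max()
```

Because the widest gap between adjacent grid points is 2.0 (between 4.0 and 6.0), the per-element error is bounded by one sixth of each group's absolute maximum, which is why near-zero, normally distributed weights survive RTN well.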

License

Apache 2.0 – same as the base model.

Credits