gemma-4-E2B-it-NVFP4A16

Model Overview

Description

This model is a weight-only NVFP4A16 quantized version of Google's Gemma 4 E2B instruction-tuned model. Weights are quantized to the NVIDIA FP4 format with per-group scales (group size 16), while activations remain in FP16.
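As an illustrative sketch of what per-group FP4 weight quantization means (not the llmcompressor implementation), the snippet below fake-quantizes a list of weights: each group of 16 values gets one scale, chosen so the group's largest magnitude maps onto 6.0, the largest representable FP4 (E2M1) value. The scale choice and rounding rule here are simplified assumptions for illustration.

```python
# Representable magnitudes of the FP4 E2M1 format.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID = [v * s for v in FP4_VALUES for s in (1.0, -1.0)]

def quantize_group(group):
    # One scale per group: map the largest magnitude to 6.0, the FP4 max.
    amax = max(abs(x) for x in group)
    scale = amax / 6.0 if amax > 0 else 1.0
    # Round each scaled value to the nearest FP4 grid point, then dequantize.
    q = [min(FP4_GRID, key=lambda v: abs(x / scale - v)) for x in group]
    return [v * scale for v in q]

def fake_quantize(weights, group_size=16):
    # Quantize each consecutive group of `group_size` weights independently.
    out = []
    for i in range(0, len(weights), group_size):
        out.extend(quantize_group(weights[i:i + group_size]))
    return out
```

Because every group carries its own scale, a group of small weights is not crushed by a large outlier elsewhere in the tensor, which is the main benefit of group size 16 over per-channel or per-tensor scales.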

The following modules are kept in their original precision:

  • lm_head
  • Vision encoder (vision_tower)
  • Audio encoder (audio_tower)
  • Vision embedding projection (embed_vision)
  • Audio embedding projection (embed_audio)
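The recipe below expresses this skip list as a mix of exact module names and `re:`-prefixed regular expressions. As a sketch of how such patterns could be resolved against module names (the exact matching semantics here, including the use of `fullmatch`, are an assumption, not llmcompressor's actual code):

```python
import re

# Same ignore list as in the recipe below: "re:"-prefixed entries are treated
# as regular expressions, everything else as an exact module name.
IGNORE = [
    "lm_head",
    "re:.*vision_tower.*",
    "re:.*audio_tower.*",
    "re:.*embed_vision.*",
    "re:.*embed_audio.*",
]

def is_ignored(module_name: str) -> bool:
    """Return True if a module name should be kept in original precision."""
    for entry in IGNORE:
        if entry.startswith("re:"):
            if re.fullmatch(entry[3:], module_name):
                return True
        elif module_name == entry:
            return True
    return False
```

With this rule, any submodule nested under `vision_tower`, `audio_tower`, `embed_vision`, or `embed_audio` is skipped, while ordinary language-model Linear layers are quantized.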

How It Was Made

from transformers import AutoModelForImageTextToText, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "google/gemma-4-E2B-it"
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Quantize all Linear layers to NVFP4A16; keep the LM head, the vision and
# audio towers, and their embedding projections in original precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4A16",
    ignore=[
        "lm_head",
        "re:.*vision_tower.*",
        "re:.*audio_tower.*",
        "re:.*embed_vision.*",
        "re:.*embed_audio.*",
    ],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = "gemma-4-E2B-it-NVFP4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)

Evaluation

Sample generation after quantization:

Prompt: Hello my name is

Response: Hello! It's nice to meet you. What is your name?

Format: Safetensors
Model size: 5B params
Tensor types: F32, BF16, F8_E4M3, U8