gemma-4-E2B-it-NVFP4A16

Model Overview

Description

This model is a weight-only NVFP4A16 quantized version of Google's Gemma 4 E2B instruction-tuned model. Weights are quantized to the NVIDIA FP4 format with per-group scales (group size 16), while activations remain in FP16.
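As an illustrative sketch of what per-group FP4 weight quantization means (not the llmcompressor implementation), the snippet below fake-quantizes a list of weights: each group of 16 values gets one scale, chosen so the group's largest magnitude maps onto 6.0, the largest representable FP4 (E2M1) value. The scale choice and rounding rule here are simplified assumptions for illustration.

```python
# Representable magnitudes of the FP4 E2M1 format.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID = [v * s for v in FP4_VALUES for s in (1.0, -1.0)]

def quantize_group(group):
    # One scale per group: map the largest magnitude to 6.0, the FP4 max.
    amax = max(abs(x) for x in group)
    scale = amax / 6.0 if amax > 0 else 1.0
    # Round each scaled value to the nearest FP4 grid point, then dequantize.
    q = [min(FP4_GRID, key=lambda v: abs(x / scale - v)) for x in group]
    return [v * scale for v in q]

def fake_quantize(weights, group_size=16):
    # Quantize each consecutive group of `group_size` weights independently.
    out = []
    for i in range(0, len(weights), group_size):
        out.extend(quantize_group(weights[i:i + group_size]))
    return out
```

Because every group carries its own scale, a group of small weights is not crushed by a large outlier elsewhere in the tensor, which is the main benefit of group size 16 over per-channel or per-tensor scales.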

The following modules are kept in their original precision:

  • lm_head
  • Vision encoder (vision_tower)
  • Audio encoder (audio_tower)
  • Vision embedding projection (embed_vision)
  • Audio embedding projection (embed_audio)
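The recipe below expresses this skip list as a mix of exact module names and `re:`-prefixed regular expressions. As a sketch of how such patterns could be resolved against module names (the exact matching semantics here, including the use of `fullmatch`, are an assumption, not llmcompressor's actual code):

```python
import re

# Same ignore list as in the recipe below: "re:"-prefixed entries are treated
# as regular expressions, everything else as an exact module name.
IGNORE = [
    "lm_head",
    "re:.*vision_tower.*",
    "re:.*audio_tower.*",
    "re:.*embed_vision.*",
    "re:.*embed_audio.*",
]

def is_ignored(module_name: str) -> bool:
    """Return True if a module name should be kept in original precision."""
    for entry in IGNORE:
        if entry.startswith("re:"):
            if re.fullmatch(entry[3:], module_name):
                return True
        elif module_name == entry:
            return True
    return False
```

With this rule, any submodule nested under `vision_tower`, `audio_tower`, `embed_vision`, or `embed_audio` is skipped, while ordinary language-model Linear layers are quantized.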

How It Was Made

from transformers import AutoModelForImageTextToText, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "google/gemma-4-E2B-it"
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Quantize all Linear layers to NVFP4A16; keep the LM head, the vision and
# audio towers, and their embedding projections in original precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4A16",
    ignore=[
        "lm_head",
        "re:.*vision_tower.*",
        "re:.*audio_tower.*",
        "re:.*embed_vision.*",
        "re:.*embed_audio.*",
    ],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = "gemma-4-E2B-it-NVFP4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)

Evaluation

Sample generation after quantization:

Prompt: Hello my name is

Response: Hello! It's nice to meet you. What is your name?

Format: Safetensors
Model size: 5B params
Tensor types: F32, BF16, F8_E4M3, U8