# NVFP4

This model is part of the **NVFP4** collection. NVFP4 is a 4-bit floating point format introduced with the NVIDIA Blackwell GPU architecture.
This model is a weight-only NVFP4A16 quantized version of Google's Gemma 4 E2B instruction-tuned model. Weights are quantized to FP4 with per-group quantization (group size 16) using the NVIDIA NVFP4 format, while activations remain in FP16.
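To make the scheme concrete, here is a minimal pure-Python sketch of per-group FP4 fake quantization as described above. It is an illustration only, not the llm-compressor implementation: real NVFP4 stores each group's scale in FP8 E4M3 with an additional per-tensor FP32 scale, whereas this sketch uses plain floats.

```python
# Representable non-negative values of the FP4 E2M1 format used by NVFP4
# (a sign bit supplies the negative counterparts).
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16  # NVFP4 shares one scale per 16 consecutive weights

def fake_quantize_nvfp4(weights):
    """Quantize-dequantize a flat list of weights, one scale per group."""
    out = []
    for start in range(0, len(weights), GROUP_SIZE):
        group = weights[start:start + GROUP_SIZE]
        amax = max(abs(x) for x in group)
        # The per-group scale maps the largest magnitude onto 6.0, the
        # largest representable E2M1 value.  (Real NVFP4 encodes this
        # scale in FP8 E4M3; a plain float keeps the sketch simple.)
        scale = amax / 6.0 if amax > 0 else 1.0
        for x in group:
            # Round the scaled magnitude to the nearest E2M1 value.
            mag = min(FP4_E2M1, key=lambda v: abs(abs(x) / scale - v))
            out.append((1 if x >= 0 else -1) * mag * scale)
    return out

print(fake_quantize_nvfp4([1.0, 2.0, 3.0, 6.0] * 4))
```

Values already representable after scaling round-trip exactly; everything else snaps to the nearest of the eight E2M1 magnitudes, which is where the quantization error comes from.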
The following modules are kept in their original precision:
- `lm_head`
- `vision_tower`
- `audio_tower`
- `embed_vision`
- `embed_audio`

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "google/gemma-4-E2B-it"

model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4A16",
    ignore=[
        "lm_head",
        "re:.*vision_tower.*",
        "re:.*audio_tower.*",
        "re:.*embed_vision.*",
        "re:.*embed_audio.*",
    ],
)

oneshot(model=model, recipe=recipe)

model.save_pretrained("gemma-4-E2B-it-NVFP4A16", save_compressed=True)
processor.save_pretrained("gemma-4-E2B-it-NVFP4A16")
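The `ignore` list above mixes two kinds of entries: plain names, which match a module exactly, and `re:`-prefixed entries, which are regular expressions matched against the full module path. The sketch below illustrates that distinction; the matching helper and the example module paths are hypothetical, written only to show which modules the patterns would exclude.

```python
import re

# The ignore patterns from the recipe above.
IGNORE = [
    "lm_head",
    "re:.*vision_tower.*",
    "re:.*audio_tower.*",
    "re:.*embed_vision.*",
    "re:.*embed_audio.*",
]

def is_ignored(module_name):
    """Return True if a module path is excluded from quantization."""
    for pattern in IGNORE:
        if pattern.startswith("re:"):
            # "re:" entries are treated as regular expressions.
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            # Plain entries must match the name exactly.
            return True
    return False

# Hypothetical module paths, for illustration only.
print(is_ignored("lm_head"))                           # excluded
print(is_ignored("model.vision_tower.blocks.0.proj"))  # excluded
print(is_ignored("model.layers.0.mlp.up_proj"))        # quantized
```

Only the `Linear` modules that fall through every pattern are quantized, which is how the vision and audio towers stay in their original precision.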
Sample generation after quantization:

```
Prompt: Hello my name is
Response: Hello! It's nice to meet you. What is your name?
```
**Base model:** google/gemma-4-E2B-it