Gemma-4-31B_PTQ — Compressed-Tensors (vLLM runtime)
Overview
This repository provides Post-Training Quantized (PTQ) versions of google/gemma-4-31B in the compressed-tensors format, intended to run under vLLM.
This is a TRUE PTQ pipeline:
- No calibration dataset
- One-shot quantization
- Fast + scalable
Model Architecture
- 60-layer transformer (~30.7B params)
- SigLIP vision encoder (~550M params)
- Multimodal projector
- 256K context length
- Hybrid attention
Preserved using the Gemma4ForConditionalGeneration architecture class
Quant Variants
- W8A16 (INT8 weights, 16-bit activations — best quality)
- W4A16 (INT4 weights, 16-bit activations — smallest footprint)
KLD Results
Base: google/gemma-4-31B (BF16)
Dataset: wikitext-2
Context: 2048 tokens / stride 512
W8A16:
- Mean KLD: 0.013013
- Disk: 34GB
W4A16:
- Mean KLD: 0.187820
- Disk: 21GB
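For readers unfamiliar with the metric: mean KLD here is the KL divergence between the base model's and the quantized model's next-token distributions, averaged over all evaluated positions. A minimal NumPy sketch of that reduction (the real evaluation runs both models over wikitext-2 with the context/stride above; the function names are illustrative, not from this repo):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mean_kld(base_logits, quant_logits):
    """Mean per-token KL(base || quant), averaged over all positions.

    base_logits / quant_logits: arrays of shape (num_tokens, vocab_size)
    holding the two models' logits at the same positions.
    """
    p = softmax(base_logits)
    q = softmax(quant_logits)
    kld = (p * (np.log(p) - np.log(q))).sum(axis=-1)  # KL per position
    return float(kld.mean())
```

A lower score means the quantized model's predictions track the BF16 base more closely, which is why W8A16 (0.013) is labeled "best quality" relative to W4A16 (0.188).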
Quantization Details
- Method: llmcompressor.oneshot
- No calibration dataset
- Linear layers only
Ignored:
- lm_head
- vision_tower
- multi_modal_projector
- embed_vision
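A data-free one-shot pass like the one above is typically driven by an llm-compressor recipe. A hedged sketch of what such a recipe could look like for the W4A16 variant — the exact ignore patterns would need to match this checkpoint's module names, and the INT8 variant would swap in scheme W8A16:

```yaml
# Illustrative llm-compressor recipe (not the exact file used for this repo).
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]        # quantize Linear layers only
      scheme: "W4A16"            # INT4 weights, 16-bit activations
      ignore:                    # mirrors the ignore list above
        - "lm_head"
        - "re:.*vision_tower.*"
        - "re:.*multi_modal_projector.*"
        - "re:.*embed_vision.*"
```

Because no calibration data is involved, the recipe needs no dataset section; llmcompressor.oneshot applies it in a single pass over the weights.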
Setup (Required)
- Create a fresh venv
- Build vLLM from source: clone the vLLM repo, `cd vllm`, then run "VLLM_USE_PRECOMPILED=1 uv pip install --editable . --torch-backend=auto"
- Install the custom compressed-tensors fork: "git clone --branch Gemma4 --single-branch https://github.com/phaelon74/compressed-tensors.git compressed-tensors", then "cd compressed-tensors", then "pip install -e . --no-deps"
- Install transformers 5.5: "pip install -U transformers"
Run
Follow: https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html
Notes
- Data-free PTQ introduces no calibration-dataset bias
- Multimodal components (vision tower, projector) are preserved unquantized
- Requires the custom vLLM + compressed-tensors runtime described in Setup
Credits
- google/gemma-4-31B
- TheHouseOfTheDude