Gemma-4-31B_PTQ — Compressed-Tensors (vLLM runtime)
Overview
This repository provides Post-Training Quantized (PTQ) versions of google/gemma-4-31B in the compressed-tensors format, intended to run under vLLM.
This is a TRUE PTQ pipeline:
- No calibration dataset
- One-shot quantization
- Fast + scalable
Model Architecture
- 60-layer transformer (~30.7B params)
- SigLIP vision encoder (~550M params)
- Multimodal projector
- 256K context length
- Hybrid attention
Preserved using the Gemma4ForConditionalGeneration architecture class
Quant Variants
- W8A16 (INT8 weights, 16-bit activations — best quality)
- W4A16 (INT4 weights, 16-bit activations — smallest footprint)
KLD Results
Base: google/gemma-4-31B (BF16)
Dataset: wikitext-2
Context: 2048 tokens / stride 512
W8A16:
- Mean KLD: 0.013013
- Disk: 34GB
W4A16:
- Mean KLD: 0.187820
- Disk: 21GB
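For readers unfamiliar with the metric: mean KLD here is the KL divergence between the base model's and the quantized model's next-token distributions, averaged over all evaluated positions. A minimal NumPy sketch of that reduction (the real evaluation runs both models over wikitext-2 with the context/stride above; the function names are illustrative, not from this repo):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mean_kld(base_logits, quant_logits):
    """Mean per-token KL(base || quant), averaged over all positions.

    base_logits / quant_logits: arrays of shape (num_tokens, vocab_size)
    holding the two models' logits at the same positions.
    """
    p = softmax(base_logits)
    q = softmax(quant_logits)
    kld = (p * (np.log(p) - np.log(q))).sum(axis=-1)  # KL per position
    return float(kld.mean())
```

A lower score means the quantized model's predictions track the BF16 base more closely, which is why W8A16 (0.013) is labeled "best quality" relative to W4A16 (0.188).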
Quantization Details
- Method: llmcompressor.oneshot
- No calibration dataset
- Linear layers only
Ignored:
- lm_head
- vision_tower
- multi_modal_projector
- embed_vision
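A data-free one-shot pass like the one above is typically driven by an llm-compressor recipe. A hedged sketch of what such a recipe could look like for the W4A16 variant — the exact ignore patterns would need to match this checkpoint's module names, and the INT8 variant would swap in scheme W8A16:

```yaml
# Illustrative llm-compressor recipe (not the exact file used for this repo).
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]        # quantize Linear layers only
      scheme: "W4A16"            # INT4 weights, 16-bit activations
      ignore:                    # mirrors the ignore list above
        - "lm_head"
        - "re:.*vision_tower.*"
        - "re:.*multi_modal_projector.*"
        - "re:.*embed_vision.*"
```

Because no calibration data is involved, the recipe needs no dataset section; llmcompressor.oneshot applies it in a single pass over the weights.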
Setup (Required)
- Create a fresh venv
- Build vLLM from source: clone the vLLM repo, `cd vllm`, then run "VLLM_USE_PRECOMPILED=1 uv pip install --editable . --torch-backend=auto"
- Install the custom compressed-tensors fork: "git clone --branch Gemma4 --single-branch https://github.com/phaelon74/compressed-tensors.git compressed-tensors", then "cd compressed-tensors", then "pip install -e . --no-deps"
- Install transformers 5.5: "pip install -U transformers"
Run
Follow: https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html
Notes
- Data-free PTQ introduces no calibration-dataset bias
- Multimodal components (vision tower, projector) are preserved unquantized
- Requires the custom vLLM + compressed-tensors runtime described in Setup
Credits
- google/gemma-4-31B
- TheHouseOfTheDude