Gemma-4-31B_PTQ — Compressed-Tensors (vLLM runtime)

Overview

This repository provides Post-Training Quantized (PTQ) versions of google/gemma-4-31B in the compressed-tensors format, intended for serving with vLLM.

This is a TRUE, data-free PTQ pipeline:

  • No calibration dataset required
  • One-shot weight quantization
  • Fast and scalable

Model Architecture

  • 60-layer transformer (~30.7B params)
  • SigLIP vision encoder (~550M params)
  • Multimodal projector
  • 256K context length
  • Hybrid attention

The full multimodal architecture is preserved via Gemma4ForConditionalGeneration.


Quant Variants

  • W8A16 — INT8 weights, 16-bit activations (best quality)
  • W4A16 — INT4 weights, 16-bit activations (smallest footprint)
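As a rough sanity check on the variant sizes, weight bit-width dominates the checkpoint footprint. The sketch below is back-of-envelope arithmetic only; the overhead term (unquantized vision components, embeddings, and quantization scales) is an illustrative assumption, not a measured value, which is why the INT4 estimate undershoots the shipped checkpoint.

```python
# Back-of-envelope disk-size estimate for the two variants.
# LM_PARAMS comes from the card (~30.7B transformer params);
# FP16_OVERHEAD is an assumed figure for unquantized parts + scales.

def est_size_gb(n_params: float, bits_per_weight: float, overhead_gb: float) -> float:
    """Estimate checkpoint size in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9 + overhead_gb

LM_PARAMS = 30.7e9      # quantized transformer weights
FP16_OVERHEAD = 3.0     # assumption: vision tower, projector, embeddings, scales

w8 = est_size_gb(LM_PARAMS, 8, FP16_OVERHEAD)   # ~33.7 GB, near the shipped 34 GB
w4 = est_size_gb(LM_PARAMS, 4, FP16_OVERHEAD)   # ~18.4 GB; group scales etc. push the real file to 21 GB
print(round(w8, 1), round(w4, 1))
```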

KLD Results

Base model: google/gemma-4-31B (BF16)
Dataset: wikitext-2
Context: 2048 tokens, stride 512

W8A16:

  • Mean KLD: 0.013013
  • Disk: 34GB

W4A16:

  • Mean KLD: 0.187820
  • Disk: 21GB
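Mean KLD here measures how closely the quantized model's next-token distribution tracks the BF16 base: both models score the same wikitext-2 windows, and the per-token KL divergence KL(P_base || P_quant) over the vocabulary is averaged. A minimal sketch of that reduction, on toy distributions rather than real logits:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kld(base_dists, quant_dists):
    """Average per-token KL between base and quantized next-token distributions."""
    klds = [kl_divergence(p, q) for p, q in zip(base_dists, quant_dists)]
    return sum(klds) / len(klds)

# Toy check: identical distributions give zero divergence,
# a slightly perturbed one gives a small positive value.
base = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
quant = [[0.7, 0.2, 0.1], [0.45, 0.35, 0.2]]
print(mean_kld(base, base))              # 0.0
print(round(mean_kld(base, quant), 6))
```

Lower is better: the W8A16 figure (~0.013) indicates the quantized distributions stay very close to the base, while W4A16 (~0.188) trades more fidelity for size.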

Quantization Details

  • Method: llmcompressor.oneshot
  • No calibration dataset (data-free)
  • Only Linear layers are quantized

Ignored (left in original precision):

  • lm_head
  • vision_tower
  • multi_modal_projector
  • embed_vision
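The recipe above can be sketched as a llmcompressor configuration. The scheme name and ignore list are taken from this card; the loader class and exact llmcompressor API surface may differ by version (the card preserves the model via Gemma4ForConditionalGeneration), so treat this as an illustrative outline rather than the exact script used.

```python
# Configuration sketch of the data-free one-shot recipe described above.
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Loader class is illustrative; the multimodal checkpoint may need a different one.
model = AutoModelForCausalLM.from_pretrained("google/gemma-4-31B", torch_dtype="auto")

recipe = QuantizationModifier(
    targets="Linear",            # quantize Linear layers only
    scheme="W8A16",              # or "W4A16" for the smaller variant
    ignore=[                     # components kept in original precision
        "lm_head",
        "vision_tower",
        "multi_modal_projector",
        "embed_vision",
    ],
)

# No dataset argument: weights are quantized one-shot, without calibration.
oneshot(model=model, recipe=recipe, output_dir="gemma-4-31B-W8A16")
```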

Setup (Required)

  1. Create a fresh venv.
  2. Build vLLM from source: clone the vLLM repo, cd vllm, then run "VLLM_USE_PRECOMPILED=1 uv pip install --editable . --torch-backend=auto".
  3. Install the custom compressed-tensors fork: "git clone --branch Gemma4 --single-branch https://github.com/phaelon74/compressed-tensors.git compressed-tensors", then "cd compressed-tensors" and "pip install -e . --no-deps".
  4. Install transformers 5.5: "pip install transformers -U".

Run

Follow: https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html
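Once the server is running per the recipe linked above (typically launched with "vllm serve <model>"), requests go to vLLM's OpenAI-compatible endpoint (POST /v1/chat/completions). A minimal request-payload sketch, using this repository's id as the model name (adjust to whatever name the server was launched with):

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> str:
    """Build a JSON chat-completion request body for an OpenAI-compatible server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return json.dumps(payload)

body = build_chat_request("TheHouseOfTheDude/gemma-4-31B_PTQ", "Hello!")
print(body)
```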


Notes

  • Data-free PTQ introduces no calibration-dataset bias
  • Multimodal (vision) components are preserved unquantized
  • Requires the custom runtime described in Setup

Credits

  • google/gemma-4-31B
  • TheHouseOfTheDude