# Gemma 4 E2B - RotorQuant KV Cache
RotorQuant KV-cache quantization applied to google/gemma-4-E2B, dramatically reducing memory usage during inference without modifying the model weights. RotorQuant delivers 5.3x faster prefill and 28% faster decode than TurboQuant.
This repository provides the RotorQuant KV-cache configuration for Gemma 4 E2B. The model weights remain at their original precision; only the key-value cache is quantized at runtime.
## Model Specifications
| Property | Value |
|---|---|
| Base Model | google/gemma-4-E2B |
| Parameters | ~2 billion |
| Architecture | Dense transformer |
| Modality | Multimodal: image + text input, text output |
| License | Apache 2.0 |
| Quantization | RotorQuant KV-cache only (weights unchanged) |
## Quickstart
```python
from rotorquant import RotorQuantCache
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "google/gemma-4-E2B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

# Apply RotorQuant KV-cache quantization at runtime; model weights stay untouched
cache = RotorQuantCache(model)

inputs = processor(text="Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, past_key_values=cache, max_new_tokens=64)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
## What is RotorQuant?
RotorQuant is a high-performance KV-cache quantization method that achieves significantly better throughput than TurboQuant. Instead of quantizing the model weights, RotorQuant targets the memory bottleneck of the KV cache, which grows linearly with sequence length and batch size.
Key advantages over TurboQuant:
- 5.3x faster prefill
- 28% faster decode
- No weight modification -- model weights stay at original precision
- Reduced inference memory -- KV cache is compressed significantly
- Longer context windows -- fit more tokens in the same GPU memory
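To see why the KV cache is the bottleneck RotorQuant targets, it helps to estimate its size: two tensors (keys and values) per layer, each scaling linearly with sequence length and batch size. The sketch below uses illustrative placeholder values for layer count, KV heads, and head dimension — they are not the actual Gemma 4 E2B configuration.

```python
# Back-of-the-envelope KV-cache size: K and V tensors for every layer.
# Architecture numbers below are illustrative placeholders, NOT the real
# Gemma 4 E2B config.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

fp16 = kv_cache_bytes(30, 8, 128, 32_768, 1, 2)    # FP16: 2 bytes per element
int4 = kv_cache_bytes(30, 8, 128, 32_768, 1, 0.5)  # 4-bit: 0.5 bytes per element

print(f"FP16 KV cache:  {fp16 / 2**30:.2f} GiB")
print(f"4-bit KV cache: {int4 / 2**30:.2f} GiB")
```

At a 32k context the cache under these assumptions shrinks from a few GiB to under one — which is the headroom that enables longer context windows on the same GPU.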
## KV-Cache Quantization Comparison
| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| TurboQuant | Baseline | Baseline | High | arXiv:2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |
## Memory Estimates (Gemma 4 E2B)
| Precision | Approximate Size |
|---|---|
| FP16 (original) | ~4 GB |
| 8-bit quantized | ~2 GB |
| 4-bit quantized | ~1.2 GB |
| 2-bit quantized | ~0.6 GB |
Note: These estimates are for weight quantization. This repository applies KV-cache quantization only, so model weight memory remains at the precision you load the model in. The KV-cache memory savings are realized during generation.
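The table's figures follow from simple arithmetic: parameter count times bits per parameter. A quick sketch, assuming the ~2 billion parameter count from the specifications table; real checkpoints run slightly larger than the pure arithmetic because some tensors (e.g. embeddings) are typically kept at higher precision.

```python
# Rough weight-memory estimate: parameters x bits per parameter.
# 2e9 is the approximate Gemma 4 E2B parameter count from the table above;
# actual files are a bit larger due to higher-precision tensors and metadata.
params = 2e9

def weight_gb(bits_per_param):
    return params * bits_per_param / 8 / 1e9  # decimal GB

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: ~{weight_gb(bits):.1f} GB")
```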
## See Also
- google/gemma-4-E2B -- Base model
- majentik/gemma-4-E2B-TurboQuant -- TurboQuant KV-cache variant
- majentik/gemma-4-E2B-RotorQuant-MLX-8bit -- MLX 8-bit weight-quantized variant
- majentik/gemma-4-E2B-RotorQuant-MLX-4bit -- MLX 4-bit weight-quantized variant
- majentik/gemma-4-E2B-RotorQuant-MLX-2bit -- MLX 2-bit weight-quantized variant
- RotorQuant GitHub