Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is only the name; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

🧊 PolarQuant Unified: Gemma-4-31B-it-PolarQuant-Q5

Run Google's Gemma 4 31B-it on consumer GPUs with PolarQuant full-stack compression.

| Component | Method | Result |
|---|---|---|
| Weights | PolarQuant Q5 + torchao INT4 | 62.5 GB → 21.5 GB |
| KV cache | PolarQuant Q3 (Hadamard + Lloyd-Max) | 5.3x compression |

🎯 Key Results

| Metric | Value |
|---|---|
| VRAM | 21.5 GB (streaming loader) |
| Speed | 24.9 tok/s |
| Dequant time | 37 s (per-module, GPU-accelerated) |
| Compression | 62.5 GB (BF16) → 21.5 GB (2.9x) |

Note: WikiText-2 PPL is not reported. Gemma 4 is a multimodal instruct model, so raw-text perplexity is not a meaningful metric here (even the BF16 baseline scores PPL = 1002). Generation quality remains high (see examples below).

🚀 Quick Start: One-Click Colab

| Notebook | Description | GPU |
|---|---|---|
| ▶️ Inference Chat | Gradio chat UI, streaming loader | L4 / RTX 4090 (24 GB) |
| Quantization | How this model was quantized | A100 80 GB |

Streaming Loader (fits on 24 GB GPUs!)

The inference notebook uses a per-module streaming loader that never loads the full BF16 model on GPU:

  1. Load BF16 on CPU (~62 GB RAM)
  2. For each layer: move to GPU → PQ5 dequant → INT4 quantize → keep on GPU
  3. Peak VRAM: ~21.5 GB (accumulated INT4 only)
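The loop above can be sketched as follows. This is a minimal NumPy sketch of the idea, not the polarengine-vllm implementation: `pq5_dequantize`, `int4_quantize`, and the module names are illustrative placeholders.

```python
import numpy as np

def pq5_dequantize(codes, codebook):
    # Map stored 5-bit codes back to their Lloyd-Max centroids.
    return codebook[codes]

def int4_quantize(w):
    # Symmetric per-tensor INT4: scale into [-8, 7] and round.
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
codebook = np.sort(rng.normal(size=32))          # 2^5 Lloyd-Max levels
quantized = {}
for name in ["layer0.q_proj", "layer0.k_proj"]:  # one module at a time
    codes = rng.integers(0, 32, size=(16, 16))   # stored PQ5 codes
    w = pq5_dequantize(codes, codebook)          # step 2: dequant
    quantized[name] = int4_quantize(w)           # step 2: re-quantize, keep
# Peak memory is one dequantized module plus the accumulated INT4 state.
```

Because only one module is ever held in dequantized form, peak memory stays near the size of the final INT4 model rather than the BF16 checkpoint.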

πŸ† GPU Compatibility

| GPU | VRAM | Fits? |
|---|---|---|
| RTX 4090 | 24 GB | ✅ Yes (2.5 GB headroom) |
| RTX 5090 | 32 GB | ✅ Comfortable |
| L4 | 24 GB | ✅ Yes |
| A6000 | 48 GB | ✅ Plenty |
| A100 | 40/80 GB | ✅ Full headroom |
| T4 | 16 GB | ❌ Too small for 31B |

📊 KV Cache Compression

| Method | Bits | Compression | tok/s |
|---|---|---|---|
| FP16 (baseline) | 16 | 1.0x | 24.9 |
| PolarQuant Q4 | 4 | 4.0x | 24.8 |
| PolarQuant Q3 | 3 | 5.3x | 24.8 |
| PolarQuant Q2 | 2 | 8.0x | 24.8 |
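A back-of-envelope check of the Q3 row, using the Gemma 4 config stated later in this card (60 layers, 16 KV heads, head_dim 256). This ignores quantization scales and other metadata, which is presumably why the table reports 5.3x rather than the raw 16/3 ≈ 5.33x.

```python
# Per-token KV cache size: keys + values across all layers and KV heads.
layers, kv_heads, head_dim = 60, 16, 256
elems = 2 * layers * kv_heads * head_dim   # K and V elements per token

fp16_kib = elems * 16 / 8 / 1024           # FP16: 960.0 KiB per token
q3_kib = elems * 3 / 8 / 1024              # 3-bit packed: 180.0 KiB per token
ratio = fp16_kib / q3_kib                  # 16/3 ~ 5.33x before metadata
print(fp16_kib, q3_kib, round(ratio, 2))   # 960.0 180.0 5.33
```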


💬 Generation Quality

The model produces high-quality, coherent responses:

Q: Explain quantum computing in simple terms.

To understand quantum computing, you first have to understand how a regular computer works... A classical bit is like a coin lying on a table — it's either Heads (1) or Tails (0). A qubit is like a coin spinning on the table — it is effectively both at the same time. This state of being in multiple states at once is called Superposition...

Q: Write a Python function for binary search.

Correct implementation with proper docstring, O(log n) complexity, edge case handling.

Q: What causes aurora borealis?

The aurora borealis is caused by charged particles from the sun colliding with Earth's magnetic field...

🔧 Technical Details

  • Architecture: Gemma 4 (60 layers, 32 attn heads, 16 KV heads, head_dim=256)
  • Hybrid attention: Sliding window (1024) + global attention
  • Weight quantization: Hadamard rotation (128Γ—128) + Lloyd-Max Q5 + torchao INT4
  • KV cache: Hadamard rotation (256Γ—256) + Lloyd-Max Q3 + real bit-packing
  • Streaming loader: Per-module INT4 via nn.Sequential wrapper (vLLM pattern)
  • Quantized layers: 602 (text model + vision encoder projections)
  • Base model: google/gemma-4-31B-it (Apache 2.0)

📖 Citation

@article{polarquant2025,
  title={PolarQuant: Hadamard-Rotated Lloyd-Max Quantization for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2025},
  url={https://arxiv.org/abs/2603.29078}
}


πŸ™ Acknowledgements

Built on Google's Gemma 4 (Apache 2.0). Quantization by PolarQuant with torchao.


🚀 Quick Start

Install

pip install git+https://github.com/caiovicentino/polarengine-vllm.git

Load & Generate (1 line!)

from polarengine_vllm import PolarQuantModel

model = PolarQuantModel.from_pretrained("caiovicentino1/Gemma-4-31B-it-PolarQuant-Q5")
print(model.generate("Hello, how are you?", max_new_tokens=100))

With KV Cache Compression (5.3x more context)

model = PolarQuantModel.from_pretrained("caiovicentino1/Gemma-4-31B-it-PolarQuant-Q5", kv_cache_nbits=3)
# KV cache now uses 5.3x less memory — fit longer conversations!
print(model.generate("Explain quantum computing in detail.", max_new_tokens=500))

Benchmark

polarquant bench caiovicentino1/Gemma-4-31B-it-PolarQuant-Q5 --ppl --chart

Gradio Demo

polarquant demo caiovicentino1/Gemma-4-31B-it-PolarQuant-Q5 --share

📦 Method: PolarQuant

Hadamard Rotation + Lloyd-Max Optimal Centroids

Unlike GGUF's uniform quantization, PolarQuant places its quantization levels where weight density is highest: after the Hadamard rotation the weights are approximately Gaussian, and Lloyd-Max centroids are the MSE-optimal scalar codebook for that distribution.

At equal size, PolarQuant Q5 preserves cosine similarity > 0.996 to the original weights, versus ~0.99 for GGUF Q5_K_M.
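The density-aware placement can be demonstrated with a toy 1-D Lloyd-Max quantizer (iterative nearest-centroid assignment and centroid update, i.e. 1-D k-means). This is an illustrative sketch, not the repository's codebook trainer; on Gaussian-like data it beats a uniform grid at the same bit width.

```python
import numpy as np

def lloyd_max(x, nbits, iters=30):
    # Warm-start levels at data quantiles, then alternate assignment
    # and centroid-mean updates until the levels settle.
    levels = np.quantile(x, np.linspace(0, 1, 2 ** nbits))
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(levels.size):
            if np.any(idx == k):
                levels[k] = x[idx == k].mean()
    return levels, idx

rng = np.random.default_rng(0)
x = rng.normal(size=20000)               # rotated weights are near-Gaussian
levels, idx = lloyd_max(x, nbits=3)
lm_mse = np.mean((x - levels[idx]) ** 2)

grid = np.linspace(x.min(), x.max(), 8)  # uniform 3-bit quantizer
g_idx = np.abs(x[:, None] - grid[None, :]).argmin(axis=1)
assert lm_mse < np.mean((x - grid[g_idx]) ** 2)  # optimal levels win
```

The uniform grid wastes levels in the sparse Gaussian tails, while Lloyd-Max concentrates them near zero where most weights live.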

