Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). Only the name changes; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

🧊 Gemma-4-31B-Claude-Opus-HLWQ-Q5-Vision

Claude Opus distilled Gemma 4 31B + Vision on consumer GPUs.

Download: 21.8 GB (vs 62.5 GB BF16 — 2.9x compression)

| Component | Method | Result |
|---|---|---|
| Text weights | HLWQ Q5 + torchao INT4 | 21.8 GB |
| Vision encoder | BF16 (full quality) | included |
| KV Cache | HLWQ Q3 (5.3x) | longer context |
| Reasoning | Claude Opus 4.6 distilled | high-effort |

🎯 Key Results

| Metric | Value |
|---|---|
| VRAM | 22.8 GB (streaming loader) |
| Speed | ~24.9 tok/s |
| Download | 21.8 GB |
| Vision | ✅ Golden Gate Bridge |
| Compression | 2.9x |
| Quantized layers | 602 |

📊 Charts

[Charts: Compression, VRAM, Family, Context]

🏆 GPU Support

| GPU | VRAM | Fits? |
|---|---|---|
| RTX 4090 | 24 GB | Yes |
| L4 | 24 GB | Yes |
| RTX 5090 | 32 GB | Yes |
| A100 | 40-80 GB | Yes |

🚀 Quick Start

pip install polarquant[all]
polarquant chat TeichAI/gemma-4-31B-it-Claude-Opus-Distill --vision

🔬 KV Cache Compression

| Method | Bits | Compression | Max Context (4GB) |
|---|---|---|---|
| FP16 | 16 | 1.0x | 4K |
| HLWQ Q4 | 4 | 4.0x | 17K |
| HLWQ Q3 | 3 | 5.3x | 22K |
| HLWQ Q2 | 2 | 8.0x | 35K |
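
The compression column is simply 16 divided by the bit width, and the context column can be approximated from the attention shape listed under Technical Details. The sketch below is a back-of-the-envelope estimate, not the loader's actual accounting: it assumes every one of the 60 layers caches full-length keys and values (ignoring the sliding-window layers), ignores codebook and scale metadata, and reads "4GB" as 4 GiB. The function names are hypothetical.

```python
# Back-of-the-envelope KV cache sizing. Shape values come from the
# Technical Details list; treating every layer as global attention and
# ignoring codebook/scale metadata are simplifying assumptions.
NUM_LAYERS = 60
NUM_KV_HEADS = 16
HEAD_DIM = 256
BUDGET_BYTES = 4 * 1024**3  # "4GB" read as 4 GiB (assumption)


def kv_bytes_per_token(bits: int) -> float:
    """Bytes needed to cache keys and values for one token."""
    elements = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM  # factor 2: K and V
    return elements * bits / 8


def max_context_tokens(bits: int, budget_bytes: int = BUDGET_BYTES) -> int:
    """Largest whole number of tokens whose cache fits in the budget."""
    return int(budget_bytes // kv_bytes_per_token(bits))


for bits in (16, 4, 3, 2):
    ratio = 16 / bits
    print(f"{bits:>2} bits: {ratio:.1f}x, ~{max_context_tokens(bits):,} tokens")
```

Under these assumptions the estimate gives about 4,369, 17,476, 23,301 and 34,952 tokens for 16, 4, 3 and 2 bits. That is consistent with the 4K, 17K and 35K rows; the 3-bit estimate lands slightly above the table's 22K, which may reflect overhead that this sketch does not model.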

🔧 Technical Details

  • Architecture: Gemma 4 (60 layers, 32 attn heads, 16 KV heads, head_dim=256)
  • Hybrid attention: Sliding window (1024) + global attention
  • Weight quantization: Hadamard rotation (128x128) + Lloyd-Max Q5 + torchao INT4
  • KV cache: Hadamard rotation (256x256) + Lloyd-Max Q3 + real bit-packing
  • Streaming loader: Per-module INT4 via nn.Sequential wrapper — fits 24GB GPUs
  • Base model: TeichAI/gemma-4-31B-it-Claude-Opus-Distill
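
The KV cache bullet above mentions real bit-packing for 3-bit codes. As a rough illustration of what dense packing can look like, the NumPy sketch below stores eight 3-bit codes in three bytes instead of eight. The function names `pack_3bit` and `unpack_3bit` and the MSB-first bit layout are assumptions made for this example; the layout used by the actual HLWQ kernels is not described here and may differ.

```python
import numpy as np


def pack_3bit(codes: np.ndarray) -> np.ndarray:
    """Pack integer codes in [0, 7] into a dense uint8 byte stream."""
    codes = np.asarray(codes, dtype=np.uint8)
    if codes.size and codes.max() > 7:
        raise ValueError("codes must fit in 3 bits")
    # Expand each code to 8 bits (MSB first) and keep only the low 3 bits.
    bits = np.unpackbits(codes.reshape(-1, 1), axis=1)[:, -3:]
    # packbits zero-pads the final byte when the bit count is not a multiple of 8.
    return np.packbits(bits.reshape(-1))


def unpack_3bit(packed: np.ndarray, count: int) -> np.ndarray:
    """Recover `count` 3-bit codes from a packed byte stream."""
    bits = np.unpackbits(packed)[: count * 3].reshape(count, 3)
    place_values = np.array([4, 2, 1], dtype=np.uint8)
    return (bits * place_values).sum(axis=1).astype(np.uint8)


# Round-trip check on random codes (one head_dim=256 vector's worth).
codes = np.random.default_rng(0).integers(0, 8, size=256, dtype=np.uint8)
packed = pack_3bit(codes)
assert np.array_equal(unpack_3bit(packed, codes.size), codes)
print(codes.size, "codes ->", packed.size, "bytes")
```

For 256 codes this yields 96 bytes (256 × 3 / 8), compared with 512 bytes for the same vector in FP16, which is where the 5.3x figure comes from before any metadata is counted.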

📖 Citation

@article{polarquant2025,
  title={HLWQ: Hadamard-Rotated Lloyd-Max Quantization for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}

📄 Paper · 💻 GitHub · 📦 PyPI


🚀 Quick Start

Install

pip install git+https://github.com/caiovicentino/polarengine-vllm.git

Load & Generate (1 line!)

from polarengine_vllm import HLWQModel

model = HLWQModel.from_pretrained("caiovicentino1/Gemma-4-31B-Claude-Opus-HLWQ-Q5-Vision")
print(model.generate("Hello, how are you?", max_new_tokens=100))

With KV Cache Compression (5.3x more context)

from polarengine_vllm import HLWQModel

model = HLWQModel.from_pretrained("caiovicentino1/Gemma-4-31B-Claude-Opus-HLWQ-Q5-Vision", kv_cache_nbits=3)
# A 3-bit KV cache is about 5.3x smaller than FP16 (16/3), leaving room for longer conversations.
print(model.generate("Explain quantum computing in detail.", max_new_tokens=500))

Benchmark

polarquant bench caiovicentino1/Gemma-4-31B-Claude-Opus-HLWQ-Q5-Vision --ppl --chart

Gradio Demo

polarquant demo caiovicentino1/Gemma-4-31B-Claude-Opus-HLWQ-Q5-Vision --share

📦 Method: HLWQ

Hadamard Rotation + Lloyd-Max Optimal Centroids

Unlike the uniform quantization grids used by GGUF, HLWQ places quantization levels where weight density is highest. For Gaussian-distributed weights, a Lloyd-Max codebook minimizes mean squared error among scalar quantizers with the same number of levels.
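
To make the contrast concrete, here is a minimal sketch that fits a 32-level (5-bit) Lloyd-Max codebook to synthetic Gaussian samples and compares its error with a plain min-max uniform grid. It is an illustration under stated assumptions: the helper names `lloyd_max` and `quantize` are hypothetical, the data is random rather than real model weights, and the uniform grid is a generic baseline, not a reimplementation of any GGUF format.

```python
import numpy as np


def lloyd_max(samples: np.ndarray, levels: int, iters: int = 100) -> np.ndarray:
    """Fit a scalar codebook by alternating nearest-level assignment and
    centroid updates (Lloyd's algorithm in one dimension)."""
    # Start from a uniform grid; each iteration can only lower the MSE.
    centroids = np.linspace(samples.min(), samples.max(), levels)
    for _ in range(iters):
        edges = (centroids[:-1] + centroids[1:]) / 2  # decision thresholds
        idx = np.searchsorted(edges, samples)         # cell index per sample
        sums = np.bincount(idx, weights=samples, minlength=levels)
        counts = np.bincount(idx, minlength=levels)
        nonempty = counts > 0                         # leave empty cells as-is
        centroids[nonempty] = sums[nonempty] / counts[nonempty]
    return centroids


def quantize(x: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each value to its nearest entry of a sorted codebook."""
    edges = (codebook[:-1] + codebook[1:]) / 2
    return codebook[np.searchsorted(edges, x)]


rng = np.random.default_rng(0)
w = rng.standard_normal(100_000)  # synthetic stand-in for rotated weights
levels = 32                       # 5 bits

lm_codebook = lloyd_max(w, levels)
uniform_codebook = np.linspace(w.min(), w.max(), levels)

mse_lm = np.mean((w - quantize(w, lm_codebook)) ** 2)
mse_uniform = np.mean((w - quantize(w, uniform_codebook)) ** 2)
print(f"Lloyd-Max MSE: {mse_lm:.5f}, uniform MSE: {mse_uniform:.5f}")
```

Because the codebook is initialized from the same uniform grid and Lloyd's updates never increase the error, the fitted codebook ends up with levels packed more tightly near zero and a lower MSE than the uniform baseline on this data.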

At the same size, HLWQ Q5 reaches cos_sim > 0.996, compared with ~0.99 for GGUF Q5_K_M.
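
A cosine-similarity figure like the one above can be measured by quantizing, dequantizing, and comparing against the original tensor. The sketch below shows one way to do that and also illustrates why a Hadamard rotation helps when a block contains outliers. Everything here is an assumption for illustration: the weights are synthetic, `uniform_quantize` is a placeholder per-row quantizer standing in for a Lloyd-Max codebook, and the numbers it prints are not expected to match the figures reported for this model.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix via Sylvester's construction."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)  # scale so that H @ H.T is the identity


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two tensors, flattened."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def uniform_quantize(x: np.ndarray, levels: int = 32) -> np.ndarray:
    """Placeholder per-row min-max quantizer (quantize then dequantize)."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    step = (hi - lo) / (levels - 1)
    return np.round((x - lo) / step) * step + lo


BLOCK = 128  # matches the 128x128 rotation listed for weights
H = hadamard(BLOCK)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, BLOCK))
w[:, 0] *= 20.0  # inject an outlier column (synthetic)

direct = uniform_quantize(w)
rotated = uniform_quantize(w @ H) @ H.T  # rotate, quantize, rotate back

print("direct :", cosine_similarity(w, direct))
print("rotated:", cosine_similarity(w, rotated))
```

Since the rotation is orthonormal, it can be undone exactly after dequantization; it spreads the outlier's energy across the block so that the quantizer's range is no longer dominated by a single coordinate.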
