PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression
CompressedTensors INT4 of Qwen/Qwen3.6-35B-A3B via HLWQ
Run a 35B 256-expert MoE on an RTX 3060 (12 GB) with expert offloading
| Metric | Value |
|---|---|
| 📦 Format | CompressedTensors INT4 symmetric (gs=128) |
| 💾 Model size | 19.43 GB (5 shards, 62,303 tensors) |
| 📉 Compression | 72% (70.2 → 19.4 GB) |
| ⚡ Kernel | Marlin (fused dequant+matmul) |
| 🧩 Expert keys | Per-expert 2D (vLLM-native) |
| 🔢 Breakdown | 30,720 expert + 250 other linear INT4 modules (each a weight_packed + weight_scale pair) + 363 BF16 tensors |
```shell
pip install git+https://github.com/caiovicentino/vllm-expert-offload.git

vllm serve caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4 \
  --language-model-only \
  --enforce-eager \
  --moe-expert-cache-size 8
```
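Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal sketch of a chat-completions request body; the host/port (vLLM's default 8000) and sampling parameters are assumptions for illustration:

```python
import json

# vLLM serves an OpenAI-compatible API; host, port, and the sampling
# parameters below are illustrative defaults, not values from this card.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4"

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build the JSON body for a /v1/chat/completions call."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

body = build_chat_request("Explain expert offloading in one paragraph.")
print(json.dumps(body, indent=2))
```

POST this body to `BASE_URL` with any HTTP client (e.g. `curl` or `requests`).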
| GPU | VRAM | Expert Cache | Status |
|---|---|---|---|
| RTX PRO 6000 | 96 GB | all-in | ✅ Full speed |
| A100 / H100 | 80 GB | all-in | ✅ Full speed |
| RTX 4090 | 24 GB | cache=4 | ✅ ~4 GB |
| RTX 4070 Ti | 16 GB | cache=3 | ✅ ~3.5 GB |
| RTX 3060 | 12 GB | cache=2 | ✅ ~3 GB |
| RTX 3050 | 8 GB | cache=1 | ⚠️ Tight |
Benchmarked on an RTX PRO 6000 Blackwell (96 GB); serving with vLLM + Marlin will be significantly faster than these notebook benchmarks.
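The table's small cache sizes work because each INT4 expert is tiny. A rough back-of-envelope estimate, using the dimensions from the spec table below (hidden = 2048, expert intermediate = 512, group size 128) and ignoring any cache bookkeeping overhead:

```python
# Rough per-expert INT4 footprint; dims taken from the spec table,
# overhead (metadata, allocator padding) deliberately ignored.
hidden, inter, gs = 2048, 512, 128

params = 3 * hidden * inter            # gate + up + down projections
packed_bytes = params // 2             # two INT4 weights per byte
scale_bytes = (params // gs) * 2       # one BF16 scale per 128 weights
per_expert_bytes = packed_bytes + scale_bytes

print(f"~{per_expert_bytes / 2**20:.2f} MiB per expert")  # ≈ 1.55 MiB
```

At ~1.55 MiB per expert, even a few thousand cached experts stay within a consumer GPU's VRAM budget.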
| Spec | Value |
|---|---|
| Layers | 40 (30 GDN + 10 Full Attention) |
| Experts | 256 per layer (8 routed + 1 shared active per token) |
| Hidden | 2048 |
| Expert intermediate | 512 |
| Context | 262,144 tokens |
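A back-of-envelope count from the spec table shows why the expert weights dominate the 35B total (attention, embedding, and router parameters are excluded here, so the figures are approximations):

```python
# Expert-only parameter count from the spec table above.
# Attention, embeddings, and routers are deliberately excluded.
hidden = 2048
expert_inter = 512
layers = 40
experts_per_layer = 256
active_experts = 8 + 1  # 8 routed + 1 shared

params_per_expert = 3 * hidden * expert_inter  # gate + up + down
total_expert_params = layers * experts_per_layer * params_per_expert
active_expert_params = layers * active_experts * params_per_expert

print(f"total expert params:  {total_expert_params / 1e9:.1f} B")   # ≈ 32.2 B
print(f"active expert params: {active_expert_params / 1e9:.2f} B")  # ≈ 1.13 B
```

Expert weights alone account for roughly 32B of the 35B parameters, which is why quantizing and offloading the experts yields most of the savings.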
Per-expert 2D keys (gate_up_proj split → separate gate + up):
```
model.layers.{L}.mlp.experts.{E}.gate_proj.weight_packed  INT4 → int32
model.layers.{L}.mlp.experts.{E}.gate_proj.weight_scale   BF16
model.layers.{L}.mlp.experts.{E}.up_proj.weight_packed    INT4 → int32
model.layers.{L}.mlp.experts.{E}.up_proj.weight_scale     BF16
model.layers.{L}.mlp.experts.{E}.down_proj.weight_packed  INT4 → int32
model.layers.{L}.mlp.experts.{E}.down_proj.weight_scale   BF16
```
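A weight_packed / weight_scale pair can be expanded back to floats roughly as follows. This is a pure-Python sketch; the nibble order and group layout are assumptions for illustration, not necessarily the exact CompressedTensors on-disk convention:

```python
# Sketch of INT4 unpack + dequant; nibble order is an assumption.

def unpack_int4(word: int) -> list[int]:
    """Unpack one int32 word into 8 signed INT4 values."""
    vals = [(word >> (4 * i)) & 0xF for i in range(8)]
    return [v - 16 if v > 7 else v for v in vals]

def pack_int4(vals: list[int]) -> int:
    """Inverse of unpack_int4: 8 values in [-8, 7] per word."""
    word = 0
    for i, v in enumerate(vals):
        word |= (v & 0xF) << (4 * i)
    return word

def dequant_group(words: list[int], scale: float) -> list[float]:
    """Dequantize packed words that share a single group scale."""
    return [scale * q for w in words for q in unpack_int4(w)]

# Round trip on a toy group
q = [3, -7, 0, 5, -1, 7, -8, 2]
assert unpack_int4(pack_int4(q)) == q
```

In practice the Marlin kernel fuses this dequantization into the matmul, so the BF16 weights are never materialized.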
BF16 preserved: norms, GDN gates (in_proj_a/b), routers, A_log, conv1d, dt_bias, embeddings, lm_head
```
BF16 (70.2 GB)
  │
  ▼
[1] HLWQ Q5: Hadamard rotation + Lloyd-Max 5-bit
  │   (better distribution before INT4)
  ▼
[2] PQ5 dequant → BF16
  │
  ▼
[3] INT4 symmetric (gs=128): scale = absmax/7
  │
  ▼
[4] Pack 8×INT4 → int32 (CompressedTensors)
  │
  ▼
CT INT4 (19.4 GB) → Marlin → vLLM serve
```
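Step [3] above can be sketched directly: each group of 128 weights gets one scale of absmax/7, mapping the group into the symmetric code range [-7, 7] (the -8 code is left unused, a common choice in symmetric schemes):

```python
# Step [3]: per-group symmetric INT4 quantization, scale = absmax / 7.
# Group size is 128 in the model; a short toy group is used below.

def quantize_group(weights: list[float]) -> tuple[list[int], float]:
    absmax = max(abs(w) for w in weights)
    scale = absmax / 7 if absmax > 0 else 1.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q: list[int], scale: float) -> list[float]:
    return [scale * v for v in q]

group = [0.31, -0.07, 0.14, -0.28, 0.02, 0.35, -0.11, 0.19]
q, scale = quantize_group(group)
recon = dequantize_group(q, scale)
max_err = max(abs(a - b) for a, b in zip(group, recon))
assert max_err <= scale / 2  # rounding error bounded by half a step
```

Step [4] then packs eight of these signed codes into each int32 word for the CompressedTensors format.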
```bibtex
@misc{hlwq2026,
  title={HLWQ: Hadamard-Lloyd Weight Quantization for Large Language Models},
  author={Caio Vicentino},
  year={2026},
  url={https://arxiv.org/abs/2603.29078}
}
```
| Resource | Link |
|---|---|
| 📄 Paper | arXiv:2603.29078 |
| 🔧 Code | GitHub |
| 📦 PyPI | pip install polarquant |
| 🗂️ Q5 codes | Qwen3.6-35B-A3B-HLWQ-Q5 |
| 🔀 Expert Offload | vllm-expert-offload |
| 🏠 Base model | Qwen/Qwen3.6-35B-A3B |