# 🏆 EOQ v3 -- Qwen3.5-9B (PolarQuant + AWQ)
Near-lossless quantization: perplexity 6.43 on WikiText-2, only +0.06 over the FP16 baseline of 6.37 -- the best quality result in the PolarQuant family.

EOQ v3 combines PolarQuant (Hadamard rotation + Lloyd-Max quantization) with AWQ (Activation-Aware Weight Quantization) to achieve a 94% reduction in quantization error versus standard absmax. The result is practically indistinguishable from the full-precision model.
## 🎯 Key Results
| Metric | Value |
|---|---|
| Method | PolarQuant Q5 + AWQ |
| Perplexity (WikiText-2) | 6.43 |
| FP16 Baseline | 6.37 |
| Delta from FP16 | +0.06 (near-lossless!) |
| Download Size | ~5 GB (3.6x compression) |
| Load Time | 9s (~6x faster than FP16's 53s) |
| Throughput | 45.8 tok/s (identical to FP16) |
| GPU Dequant | 3.5s (one-time) |
## 📊 Quality Evolution
| Version | Technique | PPL | Delta | Improvement |
|---|---|---|---|---|
| v1 | Absmax uniform Q5 | 7.31 | +0.94 | Baseline |
| v2 | AWQ + mixed-bit | 7.05 | +0.68 | 28% better |
| v3 | PolarQuant + AWQ | 6.43 | +0.06 | 94% better |
From v1 to v3: a 94% reduction in quality loss (0.94 -> 0.06 PPL delta). PolarQuant + AWQ is the key combination.
## Cross-Model Results
| Model | FP16 PPL | EOQ v3 PPL | Delta |
|---|---|---|---|
| Qwen3.5-9B | 6.37 | 6.43 | +0.06 |
| Qwen3.5-35B-A3B (MoE) | 5.19 | 5.36 | +0.17 |
## 🔬 How It Works
EOQ v3 combines two complementary techniques:
1. AWQ (Activation-Aware Scaling)
Protects important weight channels by pre-scaling them before quantization. Channels that carry more activation energy get higher precision.
2. PolarQuant (Hadamard + Lloyd-Max)
Transforms weight blocks to Gaussian via Hadamard rotation, then applies MSE-optimal Lloyd-Max quantization.
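As a rough sketch of the activation-aware scaling idea in step 1 (the `alpha` exponent and the calibration statistic here are illustrative assumptions, not the exact recipe used in this checkpoint):

```python
import numpy as np

def awq_prescale(W, act_mag, alpha=0.5):
    """Scale important input channels up before quantization.

    W:       (out_features, in_features) weight matrix
    act_mag: (in_features,) mean |activation| per channel from calibration
    alpha:   importance exponent (illustrative value)
    """
    s = act_mag ** alpha
    s = s / s.mean()            # keep the overall weight magnitude stable
    return W * s[None, :], s    # quantize W*s; store s to undo later

def awq_undo(W_dequant, s):
    # After dequantization, divide the scales back out
    return W_dequant / s[None, :]
```

Channels scaled up by `s` incur proportionally smaller relative quantization error once `s` is divided back out at dequantization time.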
```
Original Weights
        |
        v
AWQ Pre-Scaling (scale important channels)
        |
        v
Normalize --> Hadamard Rotate
        |
        v
Lloyd-Max Quantize
        |
        v
Store: Codes + AWQ Scales + Block Norms + Centroid Table
```
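The Lloyd-Max stage uses centroids that are MSE-optimal for N(0, 1); such a table can be derived offline with Lloyd's algorithm on Gaussian samples. A rough sketch (sample count, initialization, and iteration budget are arbitrary choices, not this repo's exact procedure):

```python
import numpy as np

def lloyd_max_gaussian(n_levels, n_samples=200_000, iters=50, seed=0):
    # Lloyd's algorithm on samples from N(0, 1): alternate between
    # nearest-centroid assignment and centroid re-estimation (cluster means)
    x = np.random.default_rng(seed).normal(size=n_samples)
    c = np.linspace(-2, 2, n_levels)  # initial grid
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in range(n_levels):
            mask = idx == k
            if mask.any():
                c[k] = x[mask].mean()
    return np.sort(c)
```

For Q5 this would be run with `n_levels=32`; the resulting table is what gets stored once in the checkpoint metadata and shared by all layers.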
### Why They Combine Well
- AWQ operates on channels (column-level scaling)
- PolarQuant operates on blocks (128-element sub-vectors)
- They address orthogonal sources of error: AWQ handles channel sensitivity, PolarQuant handles within-block distribution
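A minimal sketch of the per-block PolarQuant path described above (uniform centroids stand in for the real Lloyd-Max table, which is precomputed in the actual checkpoint):

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal and symmetric, hence self-inverse

def quantize_block(block, centroids):
    # Normalize so the rotated coordinates are approximately N(0, 1)
    norm = np.linalg.norm(block) / np.sqrt(block.size)
    rotated = hadamard(block.size) @ (block / norm)
    codes = np.abs(rotated[:, None] - centroids[None, :]).argmin(axis=1)
    return codes, norm

def dequantize_block(codes, norm, centroids):
    # The same matrix undoes the rotation (self-inverse)
    return norm * (hadamard(codes.size) @ centroids[codes])
```

The Hadamard rotation mixes all 128 coordinates of a block, so even weights with outliers end up approximately Gaussian before the centroid lookup.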
## 🚀 Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "caiovicentino1/Qwen3.5-9B-EOQ-v3",
    dtype="bfloat16", device_map="auto", trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("caiovicentino1/Qwen3.5-9B-EOQ-v3")

output = model.generate(
    **tokenizer("Write a detailed explanation of neural network quantization:", return_tensors="pt").to("cuda"),
    max_new_tokens=300,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
### With torchao INT4 (for maximum speed)

```python
from torchao.quantization import quantize_, Int4WeightOnlyConfig

# After loading the EOQ v3 model (already dequantized to BF16):
quantize_(model, Int4WeightOnlyConfig(group_size=128))
# Now runs at 43+ tok/s with ~6.5 GB VRAM
```
## 🔧 Technical Details
| Component | Details |
|---|---|
| Quantization | PolarQuant Q5 + AWQ (5-bit, block_size=128) |
| AWQ | Activation-aware per-channel scaling (FP16 scales stored) |
| Rotation | 128x128 Walsh-Hadamard (self-inverse, deterministic) |
| Centroids | Pre-computed MSE-optimal for N(0,1), stored in metadata (no scipy needed) |
| Storage | Bit-packed uint8 codes + fp16 norms + fp16 AWQ scales + fp32 centroids |
| GPU Dequant | unpack -> centroid lookup -> inverse Hadamard -> scale by norm -> undo AWQ |
| Dequant Time | 3.5s (100x faster than CPU numpy) |
| Compression | 3.6x (17.9 GB -> ~5 GB) |
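Putting the dequant steps in the table together, a NumPy sketch of the layer-level reconstruction (tensor names and shapes are illustrative, and codes are assumed already unpacked from uint8):

```python
import numpy as np

def dequantize_layer(codes, norms, awq_scales, centroids, H, out_shape):
    """Sketch of: centroid lookup -> inverse Hadamard -> block norm -> undo AWQ.

    codes:      (n_blocks, block_size) integer centroid indices (already unpacked)
    norms:      (n_blocks,) per-block normalization factors
    awq_scales: (in_features,) per-channel AWQ scales
    centroids:  (n_levels,) shared Lloyd-Max centroid table
    H:          (block_size, block_size) orthonormal, symmetric Hadamard matrix
    """
    vals = centroids[codes]          # centroid lookup
    vals = vals @ H                  # inverse Hadamard rotation (H is self-inverse)
    vals = vals * norms[:, None]     # restore per-block scale
    W = vals.reshape(out_shape)      # (out_features, in_features)
    return W / awq_scales[None, :]   # undo AWQ channel scaling
```

On GPU the same steps run as batched tensor ops over all blocks at once, which is where the 100x speedup over CPU numpy comes from.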
### Storage Format

```
{layer_name}.packed      -- bit-packed uint8 quantization codes
{layer_name}.norms       -- fp16 per-block normalization factors
{layer_name}.awq_scales  -- fp16 per-channel AWQ importance scales
metadata:
  centroids        -- fp32 Lloyd-Max optimal centroid table (shared)
  bits_per_tensor  -- quantization bits (5 for Q5)
```
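For illustration, one way 5-bit codes can be bit-packed into a uint8 stream (the checkpoint's exact bit layout isn't documented here, so treat this as a plausible scheme rather than the real one):

```python
import numpy as np

def pack_5bit(codes):
    # Spread each 5-bit code (0..31) into MSB-first bits, then pack 8 bits per byte
    bits = ((codes[:, None] >> np.arange(4, -1, -1)) & 1).astype(np.uint8)
    return np.packbits(bits.ravel())

def unpack_5bit(packed, n_codes):
    # Slice off the zero-padding packbits adds, then reassemble 5-bit values
    bits = np.unpackbits(packed)[: n_codes * 5].reshape(n_codes, 5)
    return (bits.astype(np.int64) * (1 << np.arange(4, -1, -1))).sum(axis=1)
```

Packing this way stores 8 codes in every 5 bytes, which is where the 5-bit share of the 3.6x compression comes from.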
## 📊 Ablation
| Configuration | PPL | Delta |
|---|---|---|
| Absmax Q5 (baseline) | 7.31 | +0.94 |
| AWQ only | 7.05 | +0.68 |
| PolarQuant only | 6.56 | +0.19 |
| PolarQuant + AWQ | 6.43 | +0.06 |
AWQ alone reduces error by 28%. PolarQuant alone reduces error by 80%. Together they reduce error by 94% -- the effects are complementary, not redundant.
## 🔗 Links

- 📄 Paper (arXiv) -- PolarQuant: Optimal Gaussian Weight Quantization
- 💻 Code (GitHub) -- Full research codebase
- 🔌 vLLM Plugin -- Production inference integration
- 🧊 PolarQuant Q5 -- Without AWQ (simpler, slightly lower quality)
- 📊 35B MoE Version -- PPL 5.36, 4.44x compression
## 📖 Citation

```bibtex
@article{vicentino2026polarquant,
  title={PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.7424577},
  year={2026}
}
```
## 🙏 Acknowledgements
Built with PyTorch, torchao, AWQ methodology from MIT HAN Lab, and the Qwen team's open-weight models.