Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is only the name; the algorithm and the weights in this repository are unchanged.
The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.
Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).
# Huihui-Qwopus3.5-27B-abliterated — PolarQuant INT4
Native vLLM. Marlin kernel. No plugins required.
PolarQuant Q5 preprocessing produces better INT4 weights than direct quantization — stored in CompressedTensors format for native vLLM inference.
## Quick Start — vLLM (one command)
```shell
pip install vllm
vllm serve caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-PolarQuant-Q5 --language-model-only --enforce-eager
```
That's it. No plugin, no `pip install polarquant`, no custom code.
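Once the server is up, vLLM exposes an OpenAI-compatible API (by default at `http://localhost:8000/v1`). A minimal sketch of a chat-completions request using only the standard library; the endpoint URL and port are vLLM defaults, not specific to this model:

```python
import json

# Build the OpenAI-compatible chat-completions payload.
payload = {
    "model": "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-PolarQuant-Q5",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100,
}
body = json.dumps(payload).encode()

# Uncomment once `vllm serve` is running:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at the local base URL) works the same way.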
Tested results:

| GPU | Throughput |
|---|---|
| A100 80GB | 168 tok/s (9B model) |
| RTX PRO 6000 96GB | 44 tok/s (9B model) / 18 tok/s (27B model) |
## Quick Start — HuggingFace Transformers
```shell
pip install polarquant
```

```python
import polarengine_vllm  # auto-registers the format with transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-PolarQuant-Q5",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-PolarQuant-Q5",
    trust_remote_code=True,
)

inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
## Consumer GPU Compatibility
| GPU | VRAM | Works? | Expected tok/s |
|---|---|---|---|
| RTX 4090 | 24 GB | YES (tight) | ~10 |
| A100 / H100 | 80 GB | YES | ~18-50 |
| RTX PRO 6000 | 96 GB | YES | ~18 |
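The rough VRAM arithmetic behind the table, assuming 4-bit packed weights (0.5 bytes per parameter); runtime overhead for activations and the KV cache comes on top, which is why 24 GB is tight:

```python
# Back-of-envelope memory estimate for the 27B model at INT4.
params = 27e9                     # 27B parameters
weight_gb = params * 0.5 / 1e9    # 0.5 bytes/param at 4-bit -> 13.5 GB of weights
headroom_24gb = 24 - weight_gb    # what remains on an RTX 4090 for KV cache etc.
```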
## Why PolarQuant INT4 is Better
Standard INT4 (GPTQ/AWQ) quantizes weights directly, so outlier weights inflate the quantization error.
PolarQuant adds a preprocessing step:
1. Hadamard rotation — distributes weight energy uniformly (eliminates outliers)
2. Lloyd-Max Q5 — MSE-optimal scalar quantization for the resulting near-Gaussian distribution
3. Dequantize → INT4 — the cleaned weights produce better INT4 than direct quantization
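The rotate-then-quantize idea can be illustrated numerically. This is a minimal NumPy sketch, not the repository's implementation; `hadamard` and `lloyd_max` are illustrative helpers, and the matrix size, codebook size, and iteration count are arbitrary:

```python
import numpy as np

def hadamard(n):
    # Walsh-Hadamard matrix via Sylvester construction (n must be a power of 2),
    # scaled so the rotation is orthonormal.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max(x, levels=16, iters=50):
    # Lloyd-Max: alternate nearest-centroid assignment and centroid update
    # to minimize the MSE of a scalar codebook.
    c = np.quantile(x, np.linspace(0.02, 0.98, levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                c[k] = x[idx == k].mean()
    return c, idx

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W[3, 7] = 25.0                    # inject one large outlier weight
H = hadamard(64)
W_rot = H @ W                     # rotation spreads the outlier's energy
codebook, idx = lloyd_max(W_rot.ravel(), levels=16)
W_hat = H.T @ codebook[idx].reshape(W.shape)   # dequantize, rotate back
err = np.abs(W - W_hat).max()
```

After the rotation the largest-magnitude entry of `W_rot` is far smaller than the 25.0 outlier in `W`, so the scalar codebook no longer has to stretch across an extreme range.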
| Method | PPL (lower = better) |
|---|---|
| BF16 baseline | 6.37 |
| PolarQuant → INT4 | 6.56 |
| Direct INT4 | 6.68 |
Same speed as GPTQ/AWQ, better quality.
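For reference, the PPL metric in the table is perplexity: the exponential of the mean negative log-likelihood the model assigns to the reference tokens. A self-contained sketch of the computation on toy logits (the shapes and data here are arbitrary):

```python
import numpy as np

def perplexity(logits, targets):
    # logits: (seq_len, vocab) raw scores; targets: (seq_len,) reference token ids.
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # log-softmax
    nll = -logp[np.arange(len(targets)), targets]                      # per-token NLL
    return float(np.exp(nll.mean()))

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 32))
targets = rng.integers(0, 32, size=8)
ppl = perplexity(logits, targets)
```

A model that is uniformly uncertain over a vocabulary of V tokens scores PPL = V, so lower values mean the model concentrates more probability on the correct tokens.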
## Important Flags
| Flag | Why |
|---|---|
| `--language-model-only` | Qwen3.5 is multimodal — this skips the vision encoder (only the text stack is quantized) |
| `--enforce-eager` | Required on Blackwell GPUs (compute capability 12.0). Optional on A100/H100 (faster without it) |
## Links
- Paper: arxiv.org/abs/2603.29078
- GitHub: github.com/caiovicentino/polarengine-vllm
- PyPI: `pip install polarquant`
- Base model: Jackrong/Qwopus3.5-27B-v3