Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is only the name; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

Huihui-Qwopus3.5-27B-abliterated — PolarQuant INT4

Native vLLM. Marlin kernel. No plugin required.

PolarQuant Q5 preprocessing produces better INT4 weights than direct quantization; the result is stored in the CompressedTensors format for native vLLM inference.

Quick Start — vLLM (one command)

pip install vllm
vllm serve caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-PolarQuant-Q5 --language-model-only --enforce-eager

That's it. No plugin, no pip install polarquant, no custom code.
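
Once the server is up, it exposes vLLM's standard OpenAI-compatible API (by default at http://localhost:8000). A minimal client sketch, assuming the default host/port and the openai Python package; the prompt and sampling settings are illustrative only:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key
resp = client.chat.completions.create(
    model="caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-PolarQuant-Q5",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(resp.choices[0].message.content)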

Tested results:

GPU               | Throughput
A100 80GB         | 168 tok/s (9B)
RTX PRO 6000 96GB | 44 tok/s (9B) / 18 tok/s (27B)

Quick Start — HuggingFace Transformers

pip install polarquant
import polarengine_vllm  # auto-registers with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-PolarQuant-Q5", device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("caiovicentino1/Huihui-Qwopus3.5-27B-v3-abliterated-PolarQuant-Q5", trust_remote_code=True)

inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Consumer GPU Compatibility

GPU          | VRAM  | Works?      | Expected tok/s
RTX 4090     | 24 GB | YES (tight) | ~10
A100 / H100  | 80 GB | YES         | ~18-50
RTX PRO 6000 | 96 GB | YES         | ~18
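
As a rough back-of-envelope check (an estimate, not a measured footprint), 27B parameters at 4 bits come to roughly 13.5 GB of weight data before quantization scales, the KV cache, and activations are added, which is why 24 GB is workable but tight:

# Rough weight-memory estimate; assumes 27e9 params at 4 bits each and
# ignores quantization scales, KV cache, and activations (several extra GB).
params = 27e9
weight_gb = params * 4 / 8 / 1e9   # bits -> bytes -> gigabytes
print(f"~{weight_gb:.1f} GB of INT4 weights")  # ~13.5 GB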

Why PolarQuant INT4 is Better

Standard INT4 (GPTQ/AWQ) quantizes weights directly, so a few outlier weights stretch the quantization range and inflate error.

PolarQuant adds a preprocessing step:

  1. Hadamard rotation — distributes weight energy uniformly (eliminates outliers)
  2. Lloyd-Max Q5 — MSE-optimal quantization for the resulting Gaussian distribution
  3. Dequant → INT4 — the cleaned weights produce better INT4 than direct quantization

Method            | PPL (lower = better)
BF16 baseline     | 6.37
PolarQuant → INT4 | 6.56
Direct INT4       | 6.68

Same speed as GPTQ/AWQ, better quality.
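
To make the preprocessing concrete, here is a minimal, self-contained sketch of steps 1 and 2 on a toy weight matrix. It only illustrates the general Hadamard-rotation plus Lloyd-Max idea; the block size, level count, initialization, and the omitted final INT4 re-quantization are assumptions, not the code used to produce this repository:

import numpy as np
from scipy.linalg import hadamard

def hadamard_rotate(W, block=128):
    # Rotate each contiguous block of `block` input dims with H / sqrt(block)
    # (deterministic, orthonormal). Outlier energy gets spread across the block.
    H = hadamard(block) / np.sqrt(block)
    out_dim, in_dim = W.shape
    assert in_dim % block == 0
    Wb = W.reshape(out_dim, in_dim // block, block)
    return (Wb @ H.T).reshape(out_dim, in_dim)

def lloyd_max(x, levels=32, iters=50):
    # Classic Lloyd-Max scalar quantizer: alternate nearest-codeword assignment
    # and codeword update (conditional mean), which minimizes MSE at convergence.
    x = x.ravel()
    codebook = np.quantile(x, np.linspace(0.0, 1.0, levels))  # simple init
    for _ in range(iters):
        idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(levels):
            members = x[idx == k]
            if members.size:
                codebook[k] = members.mean()
    return codebook, idx

# Toy usage: a random "weight" matrix with a few injected outliers.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
W[0, :4] *= 20                                  # outliers that would hurt direct INT4
W_rot = hadamard_rotate(W)                      # near-Gaussian entries after rotation
codebook, idx = lloyd_max(W_rot, levels=32)     # 2**5 = 32 levels, i.e. "Q5"
W_q5 = codebook[idx].reshape(W_rot.shape)
print("MSE after rotation + Lloyd-Max Q5:", float(np.mean((W_rot - W_q5) ** 2)))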

Important Flags

Flag                  | Why
--language-model-only | Qwen3.5 is multimodal; this skips the vision encoder (we only quantized text)
--enforce-eager       | Required on Blackwell GPUs (cc 12.0). Optional on A100/H100 (faster without it)

Links

Base model: Qwen/Qwen3.5-27B