Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is only the name; the algorithm and the weights in this repository are unchanged.
The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.
Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).
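For readers unfamiliar with the rotation named above: a deterministic Walsh-Hadamard rotation is an orthonormal transform applied to the weights before quantization and inverted exactly afterwards. The sketch below (illustrative only, not this repository's implementation) shows the standard Sylvester construction and verifies the rotation itself is lossless.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of the n x n Walsh-Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 8
H = hadamard(n) / np.sqrt(n)   # orthonormal: H @ H.T == I

w = np.random.default_rng(0).normal(size=n)
w_rot = H @ w                  # rotate weights before quantization
w_back = H.T @ w_rot           # exact inverse rotation at dequantization time

print(np.allclose(w, w_back))  # → True: the rotation adds no error on its own
```

Because the rotation is exactly invertible, all quantization error comes from the scalar codebook step; the rotation only reshapes the weight distribution to make that step cheaper.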
# Qwopus3.5-9B-v3: GPTQ-Calibrated INT4
9B hybrid model (Qwen3.5 architecture) quantized to INT4 with GPTQ calibration. Loads natively in vLLM with the Marlin kernel; ~113 tok/s on an RTX 3090.
## NEW: PolarQuant v7 – INT4 that beats BF16
We found the optimal config: `group_size=64` + FOEM reaches 67.07% HumanEval (vs. 66.87% for BF16).

Download PolarQuant v7 (gs64+FOEM): same Marlin kernel, 8.7 GB.
| Method | HumanEval | Size | Kernel |
|---|---|---|---|
| PolarQuant v7 (gs64+FOEM) | 67.07% | 8.7 GB | Marlin |
| BF16 Base | 66.87% | 19.3 GB | n/a |
| FOEM INT4 gs128 (Arien0) | 62.80% | 8.6 GB | Marlin |
| This model (GPTQ gs128) | 60.98% | 8.6 GB | Marlin |
| Naive INT4 (old) | 55.49% | 6.5 GB | Marlin |
## This Model's Benchmarks
| Metric | GPTQ INT4 | BF16 Original | vs BF16 |
|---|---|---|---|
| HumanEval | 60.98% | 66.87% | -5.9pp (calibrated) |
| Speed | 113 tok/s | ~40 tok/s | 2.8x faster |
| Size | 8.6 GB | 18 GB | 2.1x smaller |
| WikiText-2 PPL | 6.56 | 6.37 | +0.19 |
The naive INT4 version previously scored 55.49%; GPTQ calibration improved this by +5.5pp.
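The gap between the naive and calibrated rows comes from the rounding strategy: both use the same symmetric per-group INT4 grid, but round-to-nearest (RTN) quantizes each weight independently, while GPTQ adjusts the remaining weights to compensate for accumulated rounding error. A minimal sketch of the RTN baseline (illustrative only, not the GPTQModel code path):

```python
import numpy as np

def rtn_int4(w: np.ndarray, group_size: int = 128) -> np.ndarray:
    """Naive round-to-nearest symmetric INT4: quantize then dequantize per group."""
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0  # symmetric grid spans [-7, 7]
    q = np.clip(np.round(g / scale), -8, 7)             # INT4 storage range is [-8, 7]
    return (q * scale).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=4096)
err = np.abs(w - rtn_int4(w)).max()  # per-element error is bounded by scale / 2
print(f"max RTN reconstruction error: {err:.4f}")
```

GPTQ uses the same grid and group size but solves a small least-squares correction per column, which is why it recovers several points of HumanEval at identical model size.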
## Quick Start
```shell
pip install vllm

vllm serve caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5 \
  --language-model-only \
  --enforce-eager
```
No plugins, no custom code. Just vLLM.
### Python API
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5",
    trust_remote_code=True,
    enforce_eager=True,
)

outputs = llm.generate(
    ["Write a Python function for binary search."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```
## HumanEval Evolution
| # | Method | HumanEval | Notes |
|---|---|---|---|
| 1 | Naive INT4 (RTN) | 55.49% | Round-to-nearest, no calibration |
| 2 | This model (GPTQ gs128) | 60.98% | Calibrated, desc_act=True |
| 3 | FOEM gs128 | 61.59% | +FOEM error correction |
| 4 | FOEM gs128 (Arien0) | 62.80% | Different calibration data |
| 5 | BF16 Base | 66.87% | Original unquantized |
| 6 | PolarQuant v7 gs64+FOEM | 67.07% | BEATS BF16! |
### Speed (RTX 3090, 24 GB)
Confirmed by @Arien0:
| Metric | Value |
|---|---|
| Throughput | 113 tok/s |
| Kernel | Marlin (gptq_marlin) |
| VRAM | ~8 GB |
## Architecture
| Property | Value |
|---|---|
| Base Model | Jackrong/Qwopus3.5-9B-v3 |
| Architecture | Qwen3.5 hybrid (linear attention + full attention) |
| Parameters | 9B |
| Layers | 32 (24 linear attention + 8 full attention) |
| Hidden Size | 4096 |
## Quantization Details
| Property | Value |
|---|---|
| Method | GPTQ (calibrated) |
| Tool | GPTQModel v6.0.3 |
| Bits | 4 |
| Group Size | 128 |
| Symmetric | Yes |
| desc_act | True (activation order) |
| Calibration | 512 samples from neuralmagic/LLM_compression_calibration |
| Format | GPTQ (native vLLM Marlin kernel) |
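`group_size` controls how many weights share one scale: smaller groups track local weight ranges more closely, which is why the v7 upload's `group_size=64` tends to quantize more accurately than this model's `group_size=128` at a small size cost (one extra scale per 64 weights). The snippet below is a hypothetical illustration that measures this effect on random Gaussian weights with the same symmetric INT4 grid, not the actual quantization pipeline.

```python
import numpy as np

def int4_error(w: np.ndarray, group_size: int) -> float:
    """Mean abs error of symmetric per-group INT4 quantize/dequantize."""
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    deq = np.clip(np.round(g / scale), -8, 7) * scale
    return float(np.abs(g - deq).mean())

rng = np.random.default_rng(0)
w = rng.normal(size=1 << 16)
e128 = int4_error(w, 128)  # one scale per 128 weights (this model)
e64 = int4_error(w, 64)    # one scale per 64 weights (v7 config)
print(f"gs128 mean error: {e128:.4f}, gs64 mean error: {e64:.4f}")
```

On real weight matrices the gap depends on how outlier-heavy each row is, which is exactly what the Hadamard rotation in the v7 pipeline is meant to smooth out.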
Tip: for better quality, use PolarQuant v7 with gs64+FOEM (67.07% HumanEval).
## Key Flags
| Flag | Why |
|---|---|
| `--language-model-only` | Skips the vision encoder (its 4304-dim layers are not Marlin-compatible) |
| `--enforce-eager` | Recommended for stability |
## Links
- PolarQuant v7 (gs64+FOEM), best quality: 67.07% HumanEval, beats BF16
- Paper: PolarQuant (arXiv:2603.29078)
- GitHub: polarengine-vllm
- PyPI: `pip install polarquant`
- Expert offloading: vllm-expert-offload (LFRU cache for consumer GPUs)
## Citation
```bibtex
@article{vicentino2026polarquant,
  title={PolarQuant: Hadamard-Rotated Post-Training Quantization},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}
```
## Acknowledgements
- Arien0 for independent benchmarking and HumanEval testing
- GPTQModel team for FOEM implementation
- vLLM team for Marlin kernel support