Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). Only the name changes; the algorithm and the weights in this repository are unchanged.
The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ (formerly PolarQuant) addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.
Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).
# 🍎 HLWQ MLX 4-bit -- Qwen3.5-9B

Run Qwen3.5-9B on your Mac with HLWQ quality: 19.7 tok/s in just 4.8 GB of memory.
HLWQ MLX brings the benefits of Hadamard rotation + Lloyd-Max optimal quantization to Apple Silicon via the MLX framework. Get better-than-naive-Q4 quality with native Metal acceleration.
## 🎯 Key Results
| Metric | Value |
|---|---|
| Method | HLWQ Q4 (MLX) |
| Perplexity (WikiText-2) | 6.90 |
| Throughput | 19.7 tok/s |
| Memory | 4.8 GB |
| Platform | Mac mini M4 |
| Quantization | 4-bit with Hadamard rotation |
## 📊 Cross-Platform Comparison
| Method | tok/s | Memory | PPL | Platform |
|---|---|---|---|---|
| FP16 baseline | 45.7 | 17.9 GB | 6.37 | RTX PRO 6000 |
| HLWQ Q5 + torchao | 43.1 | 6.5 GB | 6.56 | RTX PRO 6000 |
| torchao INT4 (absmax) | 43.3 | 6.3 GB | 6.68 | RTX PRO 6000 |
| HLWQ MLX Q4 | 19.7 | 4.8 GB | 6.90 | Mac mini M4 |
Runs comfortably on any Mac with 8 GB+ unified memory. No GPU required -- Metal handles everything.
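As a sanity check on the 4.8 GB figure, a back-of-the-envelope estimate works out (sketch only; the per-block fp16 scale is an assumption, and embeddings, KV cache, and runtime buffers add overhead on top of the raw weights):

```python
# Rough memory estimate for 9B parameters at 4 bits per weight.
params = 9e9               # Qwen3.5-9B parameter count (approximate)
bits_per_weight = 4
block_size = 128           # HLWQ block size
scale_bits = 16            # assumed: one fp16 scale per 128-weight block

weight_bytes = params * bits_per_weight / 8
scale_bytes = (params / block_size) * scale_bits / 8
total_gb = (weight_bytes + scale_bytes) / 1e9
print(f"~{total_gb:.1f} GB of weights")  # ~4.6 GB; the measured 4.8 GB adds runtime overhead
```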
## 🚀 Quick Start

### Option 1: mlx-lm (Easiest)

```bash
pip install mlx-lm
```
```python
from mlx_lm import load, generate

model, tokenizer = load("caiovicentino1/Qwen3.5-9B-HLWQ-MLX-4bit")

response = generate(
    model,
    tokenizer,
    prompt="Explain the theory of relativity in simple terms:",
    max_tokens=200,
    verbose=True,  # shows tok/s
)
print(response)
```
### Option 2: Command Line

```bash
mlx_lm.generate \
  --model caiovicentino1/Qwen3.5-9B-HLWQ-MLX-4bit \
  --prompt "Write a Python function to sort a list:" \
  --max-tokens 300
```
### Option 3: Chat Mode

```bash
mlx_lm.chat \
  --model caiovicentino1/Qwen3.5-9B-HLWQ-MLX-4bit \
  --max-tokens 500
```
## 🔬 Why HLWQ on MLX?

Standard MLX quantization uses simple absmax rounding. HLWQ improves on it in two ways:
- Hadamard Rotation: Transforms weight blocks to follow a Gaussian distribution using a deterministic 128x128 Walsh-Hadamard matrix
- Lloyd-Max Centroids: Uses MSE-optimal quantization levels for Gaussian data instead of uniform spacing
This combination reduces quantization error by up to 54% compared to absmax at the same bit width.
```
Original Weights --> Normalize --> Hadamard Rotate --> Lloyd-Max Q4 --> MLX Format
                                                                            |
        Metal Acceleration <-- MLX Inference <-- Dequant <------------------+
```
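The rotate-then-quantize pipeline above can be sketched end to end in NumPy. This is a hypothetical illustration, not the repository's implementation: the `hadamard` and `lloyd_max_centroids` helpers are assumptions, and the centroids are computed numerically here rather than loaded from a precomputed table.

```python
import numpy as np

def hadamard(n):
    """Normalized Sylvester Walsh-Hadamard matrix (n a power of 2); self-inverse."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max_centroids(levels=16, iters=30, samples=100_000, seed=0):
    """Approximate MSE-optimal centroids for N(0, 1) via Lloyd's algorithm."""
    x = np.random.default_rng(seed).standard_normal(samples)
    c = np.quantile(x, (np.arange(levels) + 0.5) / levels)    # quantile init
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)  # nearest centroid
        for k in range(levels):
            if np.any(idx == k):
                c[k] = x[idx == k].mean()                     # recenter on cluster
    return np.sort(c)

# One 128-weight block: normalize -> rotate -> snap to 4-bit codes -> dequantize.
H = hadamard(128)
centroids = lloyd_max_centroids()
w = np.random.default_rng(1).standard_normal(128) * 0.02      # toy weight block
scale = w.std()
rotated = H @ (w / scale)                                     # ~Gaussian coefficients
codes = np.abs(rotated[:, None] - centroids[None, :]).argmin(axis=1)  # 4-bit indices
w_hat = scale * (H @ centroids[codes])                        # H is self-inverse
rel_err = np.mean((w - w_hat) ** 2) / np.mean(w ** 2)         # small relative MSE
```

Storing `codes` (4 bits each) plus one scale per 128-weight block is what yields the roughly 4x compression over fp16.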
## 🔧 Technical Details
| Component | Details |
|---|---|
| Framework | MLX (Apple's ML framework for Apple Silicon) |
| Quantization | HLWQ Q4 (4-bit, block_size=128) |
| Rotation | 128x128 Walsh-Hadamard (self-inverse, deterministic) |
| Centroids | Pre-computed MSE-optimal for N(0,1) |
| Acceleration | Metal Performance Shaders (MPS) |
| Compatibility | Mac M1/M2/M3/M4 (8 GB+ unified memory) |
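The "self-inverse" property in the table is straightforward to verify: the Sylvester-construction Walsh-Hadamard matrix is symmetric, so after 1/sqrt(n) normalization it is its own inverse, and de-rotation needs no stored inverse matrix. A minimal NumPy check (illustrative, not from the repository):

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Walsh-Hadamard matrix, normalized by 1/sqrt(n)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

H = hadamard(128)
x = np.random.default_rng(0).standard_normal(128)
# Symmetric and orthonormal: applying the same matrix twice recovers the input.
assert np.allclose(H @ H, np.eye(128))
assert np.allclose(H @ (H @ x), x)
```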
## 💻 System Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| macOS | 13.5+ (Ventura) | 14.0+ (Sonoma) |
| Apple Silicon | M1 | M4 |
| Unified Memory | 8 GB | 16 GB |
| Python | 3.10+ | 3.11+ |
| mlx | 0.5.0+ | Latest |
## 🔗 Links

- 📄 Paper (arXiv) -- HLWQ: Optimal Gaussian Weight Quantization
- 💻 Code (GitHub) -- Full research codebase
- 🖥️ CUDA Version -- For NVIDIA GPUs (43 tok/s)
- 🔌 vLLM Plugin -- Production inference integration
## 📖 Citation

```bibtex
@article{vicentino2026polarquant,
  title={HLWQ: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}
```
## 🙏 Acknowledgements
Built with MLX by Apple, mlx-lm, and the Qwen team's open-weight models.