Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). Only the name changes; the algorithm and the weights in this repository are unchanged.
The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ (formerly PolarQuant) addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.
Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).
# 🍎 HLWQ MLX 4-bit -- Qwen3.5-9B

Run Qwen3.5-9B on your Mac with HLWQ quality: 19.7 tok/s in just 4.8 GB of memory.
HLWQ MLX brings the benefits of Hadamard rotation + Lloyd-Max optimal quantization to Apple Silicon via the MLX framework. Get better-than-naive-Q4 quality with native Metal acceleration.
## 🎯 Key Results
| Metric | Value |
|---|---|
| Method | HLWQ Q4 (MLX) |
| Perplexity (WikiText-2) | 6.90 |
| Throughput | 19.7 tok/s |
| Memory | 4.8 GB |
| Platform | Mac mini M4 |
| Quantization | 4-bit with Hadamard rotation |
## 📊 Cross-Platform Comparison
| Method | tok/s | Memory | PPL | Platform |
|---|---|---|---|---|
| FP16 baseline | 45.7 | 17.9 GB | 6.37 | RTX PRO 6000 |
| HLWQ Q5 + torchao | 43.1 | 6.5 GB | 6.56 | RTX PRO 6000 |
| torchao INT4 (absmax) | 43.3 | 6.3 GB | 6.68 | RTX PRO 6000 |
| HLWQ MLX Q4 | 19.7 | 4.8 GB | 6.90 | Mac mini M4 |
Runs comfortably on any Mac with 8 GB+ unified memory. No GPU required -- Metal handles everything.
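As a sanity check on the 4.8 GB figure, a back-of-the-envelope estimate works out (sketch only; the per-block fp16 scale is an assumption, and embeddings, KV cache, and runtime buffers add overhead on top of the raw weights):

```python
# Rough memory estimate for 9B parameters at 4 bits per weight.
params = 9e9               # Qwen3.5-9B parameter count (approximate)
bits_per_weight = 4
block_size = 128           # HLWQ block size
scale_bits = 16            # assumed: one fp16 scale per 128-weight block

weight_bytes = params * bits_per_weight / 8
scale_bytes = (params / block_size) * scale_bits / 8
total_gb = (weight_bytes + scale_bytes) / 1e9
print(f"~{total_gb:.1f} GB of weights")  # ~4.6 GB; the measured 4.8 GB adds runtime overhead
```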
## 🚀 Quick Start

### Option 1: mlx-lm (Easiest)

```bash
pip install mlx-lm
```
```python
from mlx_lm import load, generate

model, tokenizer = load("caiovicentino1/Qwen3.5-9B-HLWQ-MLX-4bit")

response = generate(
    model,
    tokenizer,
    prompt="Explain the theory of relativity in simple terms:",
    max_tokens=200,
    verbose=True,  # shows tok/s
)
print(response)
```
### Option 2: Command Line

```bash
mlx_lm.generate \
  --model caiovicentino1/Qwen3.5-9B-HLWQ-MLX-4bit \
  --prompt "Write a Python function to sort a list:" \
  --max-tokens 300
```
### Option 3: Chat Mode

```bash
mlx_lm.chat \
  --model caiovicentino1/Qwen3.5-9B-HLWQ-MLX-4bit \
  --max-tokens 500
```
## 🔬 Why HLWQ on MLX?

Standard MLX quantization uses simple absmax rounding. HLWQ improves on it in two ways:
- Hadamard Rotation: Transforms weight blocks to follow a Gaussian distribution using a deterministic 128x128 Walsh-Hadamard matrix
- Lloyd-Max Centroids: Uses MSE-optimal quantization levels for Gaussian data instead of uniform spacing
This combination reduces quantization error by up to 54% compared to absmax at the same bit width.
```
Original Weights --> Normalize --> Hadamard Rotate --> Lloyd-Max Q4 --> MLX Format
                                                                            |
        Metal Acceleration <-- MLX Inference <-- Dequant <------------------+
```
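The rotate-then-quantize pipeline above can be sketched end to end in NumPy. This is a hypothetical illustration, not the repository's implementation: the `hadamard` and `lloyd_max_centroids` helpers are assumptions, and the centroids are computed numerically here rather than loaded from a precomputed table.

```python
import numpy as np

def hadamard(n):
    """Normalized Sylvester Walsh-Hadamard matrix (n a power of 2); self-inverse."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max_centroids(levels=16, iters=30, samples=100_000, seed=0):
    """Approximate MSE-optimal centroids for N(0, 1) via Lloyd's algorithm."""
    x = np.random.default_rng(seed).standard_normal(samples)
    c = np.quantile(x, (np.arange(levels) + 0.5) / levels)    # quantile init
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)  # nearest centroid
        for k in range(levels):
            if np.any(idx == k):
                c[k] = x[idx == k].mean()                     # recenter on cluster
    return np.sort(c)

# One 128-weight block: normalize -> rotate -> snap to 4-bit codes -> dequantize.
H = hadamard(128)
centroids = lloyd_max_centroids()
w = np.random.default_rng(1).standard_normal(128) * 0.02      # toy weight block
scale = w.std()
rotated = H @ (w / scale)                                     # ~Gaussian coefficients
codes = np.abs(rotated[:, None] - centroids[None, :]).argmin(axis=1)  # 4-bit indices
w_hat = scale * (H @ centroids[codes])                        # H is self-inverse
rel_err = np.mean((w - w_hat) ** 2) / np.mean(w ** 2)         # small relative MSE
```

Storing `codes` (4 bits each) plus one scale per 128-weight block is what yields the roughly 4x compression over fp16.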
## 🔧 Technical Details
| Component | Details |
|---|---|
| Framework | MLX (Apple's ML framework for Apple Silicon) |
| Quantization | HLWQ Q4 (4-bit, block_size=128) |
| Rotation | 128x128 Walsh-Hadamard (self-inverse, deterministic) |
| Centroids | Pre-computed MSE-optimal for N(0,1) |
| Acceleration | Metal Performance Shaders (MPS) |
| Compatibility | Mac M1/M2/M3/M4 (8 GB+ unified memory) |
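The "self-inverse" property in the table is straightforward to verify: the Sylvester-construction Walsh-Hadamard matrix is symmetric, so after 1/sqrt(n) normalization it is its own inverse, and de-rotation needs no stored inverse matrix. A minimal NumPy check (illustrative, not from the repository):

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Walsh-Hadamard matrix, normalized by 1/sqrt(n)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

H = hadamard(128)
x = np.random.default_rng(0).standard_normal(128)
# Symmetric and orthonormal: applying the same matrix twice recovers the input.
assert np.allclose(H @ H, np.eye(128))
assert np.allclose(H @ (H @ x), x)
```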
## 💻 System Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| macOS | 13.5+ (Ventura) | 14.0+ (Sonoma) |
| Apple Silicon | M1 | M4 |
| Unified Memory | 8 GB | 16 GB |
| Python | 3.10+ | 3.11+ |
| mlx | 0.5.0+ | Latest |
## 🔗 Links

- 📄 Paper (arXiv) -- HLWQ: Optimal Gaussian Weight Quantization
- 💻 Code (GitHub) -- Full research codebase
- 🖥️ CUDA Version -- For NVIDIA GPUs (43 tok/s)
- 🔌 vLLM Plugin -- Production inference integration
## 📖 Citation

```bibtex
@article{vicentino2026polarquant,
  title={HLWQ: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}
```
## 🙏 Acknowledgements
Built with MLX by Apple, mlx-lm, and the Qwen team's open-weight models.