Naming notice (2026-04-10). The technique used in this model, previously published as "PolarQuant", is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is only the name; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

🍎 HLWQ MLX 4-bit -- Qwen3.5-9B

Run Qwen3.5-9B on your Mac with HLWQ quality -- 19.7 tok/s in only 4.8 GB of memory.

HLWQ MLX brings the benefits of Hadamard rotation + Lloyd-Max optimal quantization to Apple Silicon via the MLX framework. Get better-than-naive-Q4 quality with native Metal acceleration.


🎯 Key Results

| Metric | Value |
|---|---|
| Method | HLWQ Q4 (MLX) |
| Perplexity (WikiText-2) | 6.90 |
| Throughput | 19.7 tok/s |
| Memory | 4.8 GB |
| Platform | Mac mini M4 |
| Quantization | 4-bit with Hadamard rotation |

📊 Cross-Platform Comparison

(Charts: "Speed vs VRAM" and "PPL Comparison" -- the data is tabulated below.)

| Method | tok/s | Memory | PPL | Platform |
|---|---|---|---|---|
| FP16 baseline | 45.7 | 17.9 GB | 6.37 | RTX PRO 6000 |
| HLWQ Q5 + torchao | 43.1 | 6.5 GB | 6.56 | RTX PRO 6000 |
| torchao INT4 (absmax) | 43.3 | 6.3 GB | 6.68 | RTX PRO 6000 |
| HLWQ MLX Q4 | 19.7 | 4.8 GB | 6.90 | Mac mini M4 |

Runs comfortably on any Mac with 8 GB+ of unified memory. No discrete GPU required -- Metal handles acceleration on the integrated GPU.


🚀 Quick Start

Option 1: mlx-lm (Easiest)

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("caiovicentino1/Qwen3.5-9B-HLWQ-MLX-4bit")

response = generate(
    model, tokenizer,
    prompt="Explain the theory of relativity in simple terms:",
    max_tokens=200,
    verbose=True,  # shows tok/s
)
print(response)
```

Option 2: Command Line

```bash
mlx_lm.generate \
    --model caiovicentino1/Qwen3.5-9B-HLWQ-MLX-4bit \
    --prompt "Write a Python function to sort a list:" \
    --max-tokens 300
```

Option 3: Chat Mode

```bash
mlx_lm.chat \
    --model caiovicentino1/Qwen3.5-9B-HLWQ-MLX-4bit \
    --max-tokens 500
```

🔬 Why HLWQ on MLX?

Standard MLX quantization uses simple absmax rounding. HLWQ improves on this by:

  1. Hadamard Rotation: Transforms weight blocks to follow a Gaussian distribution using a deterministic 128x128 Walsh-Hadamard matrix
  2. Lloyd-Max Centroids: Uses MSE-optimal quantization levels for Gaussian data instead of uniform spacing

This combination reduces quantization error by up to 54% compared to absmax at the same bit width.

```
Original Weights --> Normalize --> Hadamard Rotate --> Lloyd-Max Q4 --> MLX Format
                                                                         |
                       Metal Acceleration <-- MLX Inference <-- Dequant <-+
```
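The rotate-then-quantize idea can be sketched in NumPy. This is an illustrative toy, not the shipped kernels: the Sylvester construction of the 128x128 Hadamard matrix, the 16-level codebook fit by plain Lloyd iteration, and the heavy-tailed synthetic weights are all assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def hadamard(n):
    # Sylvester construction of an n x n Walsh-Hadamard matrix (n a power of 2).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal and symmetric, hence self-inverse

def lloyd_max(x, levels=16, iters=50):
    # Lloyd's algorithm: alternate nearest-centroid assignment / centroid update.
    c = np.quantile(x, np.linspace(0.03, 0.97, levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                c[k] = x[idx == k].mean()
    return c

def absmax_q4(x):
    # Naive symmetric 4-bit absmax quantization (uniform levels).
    s = np.abs(x).max() / 7.0
    return np.clip(np.round(x / s), -8, 7) * s

# Heavy-tailed toy weights, standing in for an LLM layer with outliers.
w = rng.standard_t(df=3, size=(128, 128))

H = hadamard(128)
w_rot = H @ w                        # rotation makes columns near-Gaussian
c = lloyd_max(w_rot.ravel())
idx = np.abs(w_rot.ravel()[:, None] - c[None, :]).argmin(axis=1)
w_hlwq = H.T @ c[idx].reshape(w_rot.shape)   # dequantize and rotate back

mse_absmax = np.mean((w - absmax_q4(w)) ** 2)
mse_hlwq = np.mean((w - w_hlwq) ** 2)
print(f"absmax Q4 MSE: {mse_absmax:.5f}")
print(f"rotated Lloyd-Max Q4 MSE: {mse_hlwq:.5f}")
```

On outlier-heavy weights the rotated Lloyd-Max codebook gives a markedly lower reconstruction error than per-tensor absmax, which is the effect the numbers above quantify.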

🔧 Technical Details

| Component | Details |
|---|---|
| Framework | MLX (Apple's ML framework for Apple Silicon) |
| Quantization | HLWQ Q4 (4-bit, block_size=128) |
| Rotation | 128x128 Walsh-Hadamard (self-inverse, deterministic) |
| Centroids | Pre-computed MSE-optimal for N(0,1) |
| Acceleration | Metal Performance Shaders (MPS) |
| Compatibility | Mac M1/M2/M3/M4 (8 GB+ unified memory) |
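The "pre-computed MSE-optimal for N(0,1)" codebook can be reproduced offline by running Lloyd iteration against the standard-normal density itself, with no sampling. This is a sketch, not the repository's actual precompute script; the grid resolution and iteration count are arbitrary choices.

```python
import numpy as np

def gaussian_lloyd_centroids(levels=16, iters=200):
    # Lloyd iteration on a dense grid weighted by the N(0,1) density:
    # boundaries are centroid midpoints; each centroid becomes the
    # conditional mean of the Gaussian over its cell. This converges to
    # the MSE-optimal (Lloyd-Max) codebook for standard-normal data.
    x = np.linspace(-8.0, 8.0, 200_001)
    p = np.exp(-0.5 * x * x)          # unnormalized N(0,1) density
    c = np.linspace(-3.0, 3.0, levels)
    for _ in range(iters):
        b = (c[:-1] + c[1:]) / 2.0    # decision boundaries
        idx = np.searchsorted(b, x)   # cell index for each grid point
        for k in range(levels):
            m = idx == k
            c[k] = np.sum(x[m] * p[m]) / np.sum(p[m])
    return c

centroids = gaussian_lloyd_centroids()
print(np.round(centroids, 4))
```

The resulting 16-level codebook is symmetric about zero with levels packed densely near the mean, unlike the uniform spacing absmax produces; because the rotated weights are near-Gaussian, one fixed codebook serves every block.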

💻 System Requirements

| Requirement | Minimum | Recommended |
|---|---|---|
| macOS | 13.5+ (Ventura) | 14.0+ (Sonoma) |
| Apple Silicon | M1 | M4 |
| Unified Memory | 8 GB | 16 GB |
| Python | 3.10+ | 3.11+ |
| mlx | 0.5.0+ | Latest |


📖 Citation

```bibtex
@article{vicentino2026polarquant,
  title={HLWQ: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}
```

🙏 Acknowledgements

Built with MLX by Apple, mlx-lm, and the Qwen team's open-weight models.
