Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). Only the name changes; the algorithm and the weights in this repository are unchanged.
The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.
Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).
# 🧠 Gemma-4-E4B-it – HLWQ Multi (PQ1–PQ8)

All HLWQ quantization variants of Google's Gemma-4-E4B-it (108K downloads in its first 2 days).

Multimodal: text + image + audio + video | Apache 2.0 | 128K context | Edge-optimized
## 📊 All Variants

| Variant | Bits | Centroids | Download | cos_sim | Quality | Gen Test | vs GGUF |
|---|---|---|---|---|---|---|---|
| ⚠️ PQ1 | 1 | 2 | 10.1 GB | 0.7884 | Experimental | ❌ Garbage | – |
| ⭐⭐⭐ PQ2 | 2 | 4 | 10.5 GB | 0.9312 | Usable | ⚠️ Untested | Q2_K |
| ⭐⭐⭐⭐ PQ3 | 3 | 8 | 10.9 GB | 0.9766 | Good | ⚠️ 2/3 match | Q3_K_M |
| ⭐⭐⭐⭐ PQ4 | 4 | 16 | 11.3 GB | 0.9913 | Very Good | ⚠️ Near-match | Q4_K_M |
| ⭐⭐⭐⭐⭐ PQ5 | 5 | 32 | 11.7 GB | 0.9963 | Excellent | ✅ 3/3 match | Q5_K_M |
| ⭐⭐⭐⭐⭐ PQ6 | 6 | 64 | 12.1 GB | 0.9980 | Near-lossless | ✅ Verified | Q6_K |
| ⭐⭐⭐⭐⭐ PQ8 | 8 | 256 | 12.9 GB | 0.9987 | Lossless | ✅ Verified | Q8_0 |
Quality scale:

```
PQ1: ████████░░ 0.788    PQ4: ██████████ 0.991
PQ2: █████████░ 0.931    PQ5: ██████████ 0.996  ← recommended
PQ3: ██████████ 0.977    PQ8: ██████████ 0.999
```
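The cos_sim figures above are the cosine similarity between each dequantized weight tensor and its BF16 original, flattened to vectors. A minimal sketch of the metric itself (pure Python for illustration; the real evaluation runs per-layer on the repository's tensors):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flat weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical vectors score 1.0; quantization error pushes the score below 1.
original = [0.12, -0.50, 0.33, 0.07]
dequantized = [0.10, -0.48, 0.35, 0.06]
print(round(cosine_similarity(original, original), 6))
print(cosine_similarity(original, dequantized) > 0.99)
```

A score of 0.996 (PQ5) means the quantized weights point in almost exactly the same direction as the originals, which is why generation output matches.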
## ✅ Generation Quality (Verified)

Real generation tests comparing HLWQ output vs the original BF16 model on 3 prompts:

| Prompt | BF16 Original | PQ5 (5-bit) | PQ3 (3-bit) | PQ1 (1-bit) |
|---|---|---|---|---|
| "What is 2+2?" | "Four" | ✅ "Four" | ✅ "Four" | ❌ garbage |
| "Write a prime checker in Python" | Correct code | ✅ Exact match | ✅ Correct code | ❌ garbage |
| "Explain gravity briefly" | Correct explanation | ✅ Exact match | ⚠️ Different wording, coherent | ❌ garbage |
| **Score** | – | **3/3 match** | **2/3 match** | **0/3** |
### Summary

| Variant Range | Status | Recommendation |
|---|---|---|
| PQ5 – PQ8 | ✅ Verified – generation matches original | Production ready |
| PQ3 – PQ4 | ⚠️ Near-match – coherent, minor wording differences | Good for most use cases |
| PQ2 | ⚠️ Untested – cos_sim 0.93 suggests usable | Use with caution |
| PQ1 | ❌ Experimental – outputs random unicode garbage | Research only |
Recommendation: Use PQ5 (11.7 GB) for the best quality-to-size ratio. It produces outputs identical to the original BF16 model across all tested prompts.
## 🚀 Quick Start

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoTokenizer

# Choose your variant: PQ1, PQ2, PQ3, PQ4, PQ5, PQ6, PQ8
VARIANT = "PQ5"
REPO_ID = "caiovicentino1/Gemma-4-E4B-it-HLWQ-Multi"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(REPO_ID, trust_remote_code=True)

# Download and load the quantized codes plus the layers kept in BF16
codes_path = hf_hub_download(REPO_ID, f"{VARIANT}/codes.safetensors")
bf16_path = hf_hub_download(REPO_ID, f"{VARIANT}/bf16.safetensors")
codes = load_file(codes_path)
bf16_weights = load_file(bf16_path)

print(f"Loaded {VARIANT}: {len(codes)} quantized layers + {len(bf16_weights)} BF16 layers")
```
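Conceptually, `codes.safetensors` stores small integers and each quantized layer carries a Lloyd-Max codebook that maps every code back to a centroid value. The exact tensor layout depends on the polarquant format, so the following is only an illustrative sketch of the lookup step (names and shapes hypothetical):

```python
def dequantize(codes, codebook):
    """Replace each integer code with its Lloyd-Max centroid value.

    codes:    nested list of ints in [0, len(codebook))
    codebook: per-layer centroid list, e.g. 32 entries for PQ5
    """
    return [[codebook[c] for c in row] for row in codes]

# Toy 2-bit example: 4 centroids, 2x2 weight matrix
codebook = [-1.5, -0.5, 0.5, 1.5]
codes = [[0, 3], [1, 2]]
print(dequantize(codes, codebook))  # [[-1.5, 1.5], [-0.5, 0.5]]
```

In the real pipeline the inverse Hadamard rotation is applied after this lookup to recover the original weight basis.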
### Full Inference with HLWQ

```python
# pip install polarquant
import torch
from transformers import AutoTokenizer
from polarquant import HLWQModel

model = HLWQModel.from_pretrained(
    "caiovicentino1/Gemma-4-E4B-it-HLWQ-Multi",
    variant="PQ5",  # Choose: PQ1-PQ8
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(
    "caiovicentino1/Gemma-4-E4B-it-HLWQ-Multi",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is 2+2?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## 💻 Hardware Compatibility

| Variant | RTX 4060 (8GB) | RTX 4070 (12GB) | RTX 4090 (24GB) | Mac M1 (16GB) |
|---|---|---|---|---|
| PQ1 (10.1 GB) | ⚠️ tight | ✅ | ✅ | ✅ |
| PQ2 (10.5 GB) | ⚠️ tight | ✅ | ✅ | ✅ |
| PQ3 (10.9 GB) | ❌ | ✅ | ✅ | ✅ |
| PQ4 (11.3 GB) | ❌ | ✅ | ✅ | ✅ |
| PQ5 (11.7 GB) | ❌ | ✅ | ✅ | ✅ |
| PQ6 (12.1 GB) | ❌ | ✅ | ✅ | ✅ |
| PQ8 (12.9 GB) | ❌ | ⚠️ tight | ✅ | ✅ |
## 🏗️ Architecture

```
Gemma-4-E4B-it (4.5B effective, 8B with PLE embeddings)
├── 42 layers, hybrid sliding window (512) + global attention
├── Multimodal: text + image (~150M) + audio (~300M) + video
├── 128K context, 262K vocabulary
├── Per-Layer Embeddings (PLE) for edge efficiency
├── Function calling + reasoning mode
└── 386 layers quantized, rest preserved in BF16
```
## 🔬 Why HLWQ > GGUF

Lloyd-Max centroids are the MSE-optimal scalar quantizer for Gaussian-distributed weights:

- PQ3 (cos_sim 0.977) > GGUF Q3_K_M (~0.95): same size, better quality
- PQ5 (cos_sim 0.996) > GGUF Q5_K_M (~0.99): same size, better quality

GGUF uses uniform spacing between quantization levels, which is suboptimal for non-uniform weight distributions. HLWQ places centroids where the weight density is highest (the Lloyd-Max algorithm), and a Hadamard rotation spreads outliers across coordinates before quantizing, so each code carries maximum information.
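The uniform-vs-Lloyd-Max gap is easy to verify in one dimension: on Gaussian samples, Lloyd's iteration (assign each value to its nearest centroid, then move each centroid to the mean of its bucket) never increases the mean squared error of the starting uniform grid. A self-contained sketch, not the polarquant implementation:

```python
import random

def lloyd_max(samples, k, iters=50):
    """Fit k scalar centroids to samples with Lloyd's algorithm,
    starting from a uniform grid over the sample range."""
    lo, hi = min(samples), max(samples)
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for x in samples:
            j = min(range(k), key=lambda i: abs(x - centroids[i]))
            buckets[j].append(x)
        # Move each centroid to the mean of its bucket (keep empty ones).
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    return centroids

def quantize_mse(samples, centroids):
    """Mean squared error of nearest-centroid quantization."""
    return sum(min((x - c) ** 2 for c in centroids) for x in samples) / len(samples)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(5000)]  # Gaussian-like weights
k = 8  # 3-bit quantizer, as in PQ3
lo, hi = min(weights), max(weights)
uniform = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
optimal = lloyd_max(weights, k)
print(quantize_mse(weights, optimal) <= quantize_mse(weights, uniform))  # True
```

On Gaussian data the fitted centroids cluster near zero where the density is highest, which is exactly the effect the comparison above relies on.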
## 📦 Files

- `PQ1/codes.safetensors` + `bf16.safetensors` (1-bit, 10.1 GB)
- `PQ2/codes.safetensors` + `bf16.safetensors` (2-bit, 10.5 GB)
- `PQ3/codes.safetensors` + `bf16.safetensors` (3-bit, 10.9 GB)
- `PQ4/codes.safetensors` + `bf16.safetensors` (4-bit, 11.3 GB)
- `PQ5/codes.safetensors` + `bf16.safetensors` (5-bit, 11.7 GB) ← recommended
- `PQ6/codes.safetensors` + `bf16.safetensors` (6-bit, 12.1 GB)
- `PQ8/codes.safetensors` + `bf16.safetensors` (8-bit, 12.9 GB)
- `config.json` + tokenizer files
## 🔗 Links

- 📄 Paper – arXiv:2603.29078
- 💻 GitHub – HLWQ-Engine
- 📦 PyPI – `pip install polarquant`
- 🤗 Original Model
## 📖 Citation

```bibtex
@article{polarquant2026,
  title={HLWQ: Hadamard-Rotated Lloyd-Max Quantization for Large Language Models},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}
```
Quantized with HLWQ (Hadamard + Lloyd-Max optimal quantization). First multi-variant HLWQ release.
## 🧪 PQ2-Mixed: Smallest Functional Variant

MLP layers in PQ2 (4 centroids) + attention layers in PQ4 (16 centroids): smaller than pure PQ3, better quality than pure PQ2.
| Component | Bits | Centroids | cos_sim | Layers |
|---|---|---|---|---|
| MLP (gate/up/down) | 2 | 4 | 0.940 | 126 |
| Attention (Q/K/V/O) | 4 | 16 | 0.997 | 168 |
| Norms/Embed/PLE | 16 | – | 1.000 | 1836 |
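A mixed recipe like this is just a bit-width rule keyed on layer names. A sketch of the assignment above (the layer-name patterns follow Hugging Face Gemma conventions; the actual polarquant config format may differ):

```python
def bits_for(layer_name: str) -> int:
    """Assign per-layer bit-widths matching the PQ2-Mixed recipe:
    MLP projections in 2-bit, attention projections in 4-bit,
    everything else (norms, embeddings, PLE) kept in BF16 (16)."""
    if any(p in layer_name for p in ("gate_proj", "up_proj", "down_proj")):
        return 2
    if any(p in layer_name for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return 4
    return 16

print(bits_for("model.layers.0.mlp.gate_proj.weight"))     # 2
print(bits_for("model.layers.0.self_attn.q_proj.weight"))  # 4
print(bits_for("model.embed_tokens.weight"))               # 16
```

Attention projections get the extra bits because, per the table, they are far more sensitive to quantization error (cos_sim 0.997 at 4-bit vs 0.940 for MLP at 2-bit).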
### Generation Test

| Prompt | PQ2 Pure | PQ2-Mixed | PQ3 Pure |
|---|---|---|---|
| "What is 2+2?" | ⚠️ "4." | ✅ "Four" | ✅ "Four" |
| Prime checker | ⚠️ Imprecise | ⚠️ Different style | ✅ Match |
| Gravity | ⚠️ Confused | ⚠️ Simplified | ⚠️ Reformulated |
| **Score** | 0/3 | 1/3 | 2/3 |
PQ2-Mixed is the smallest variant that produces correct factual answers. Use PQ3+ for production.
**Base model:** google/gemma-4-E4B-it