Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is only the name; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a polar transformation. The two methods are technically distinct.

Existing code that loads this repository by ID continues to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

🧊 Gemma-4-E4B-it – HLWQ Multi (PQ1–PQ8)

ALL quantization variants of Google's hottest model (108K downloads in 2 days).

Multimodal: text + image + audio + video | Apache 2.0 | 128K context | Edge-optimized

📊 All Variants

| Variant | Bits | Centroids | Download | cos_sim | Quality | Gen Test | vs GGUF |
|---|---|---|---|---|---|---|---|
| ⚠️ PQ1 | 1 | 2 | 10.1 GB | 0.7884 | Experimental | ❌ Garbage | – |
| ★★★ PQ2 | 2 | 4 | 10.5 GB | 0.9312 | Usable | ⚠️ Untested | Q2_K |
| ★★★★ PQ3 | 3 | 8 | 10.9 GB | 0.9766 | Good | ⚠️ 2/3 match | Q3_K_M |
| ★★★★ PQ4 | 4 | 16 | 11.3 GB | 0.9913 | Very Good | ⚠️ Near-match | Q4_K_M |
| ★★★★★ PQ5 | 5 | 32 | 11.7 GB | 0.9963 | Excellent | ✅ 3/3 match | Q5_K_M |
| ★★★★★ PQ6 | 6 | 64 | 12.1 GB | 0.9980 | Near-lossless | ✅ Verified | Q6_K |
| ★★★★★ PQ8 | 8 | 256 | 12.9 GB | 0.9987 | Effectively lossless | ✅ Verified | Q8_0 |

Quality scale:

```
PQ1: ██░░░░░░░░ 0.788   PQ4: █████████░ 0.991
PQ2: █████████░ 0.931   PQ5: ██████████ 0.996 ← recommended
PQ3: █████████░ 0.977   PQ8: ██████████ 0.999
```
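The cos_sim figures above are cosine similarities between the original and the dequantized weights. A minimal sketch of how such a score can be computed (illustrative only, not the repository's evaluation code; the toy values are made up):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flattened weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

original = [0.12, -0.53, 0.98, -0.04, 0.31]
# A hypothetical low-bit reconstruction: values snapped to nearby centroids
dequantized = [0.10, -0.50, 1.00, -0.05, 0.30]

print(round(cosine_similarity(original, dequantized), 4))
```

A score of 1.0 means the reconstruction is exact; the PQ5 value of 0.9963 means the dequantized tensors point almost exactly in the same direction as the originals.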

✅ Generation Quality (Verified)

Real generation tests comparing HLWQ output vs the original BF16 model on 3 prompts:

| Prompt | BF16 Original | PQ5 (5-bit) | PQ3 (3-bit) | PQ1 (1-bit) |
|---|---|---|---|---|
| "What is 2+2?" | "Four" | ✅ "Four" | ✅ "Four" | ❌ garbage |
| "Write a prime checker in Python" | Correct code | ✅ Exact match | ✅ Correct code | ❌ garbage |
| "Explain gravity briefly" | Correct explanation | ✅ Exact match | ⚠️ Different wording, coherent | ❌ garbage |
| Score | – | 3/3 MATCH | 2/3 MATCH | 0/3 |
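A comparison like the one above can be scripted. The following is a hypothetical scoring harness, not the actual test code used for this release; the classification rules are an assumption:

```python
def score_match(reference, candidate):
    """Classify a candidate answer against the BF16 reference output."""
    if candidate.strip() == reference.strip():
        return "exact match"
    if reference.strip().lower() in candidate.lower():
        return "contains reference"
    return "different"

# Toy data standing in for real model generations
reference_outputs = ["Four"]
candidate_outputs = ["Four"]   # e.g. decoded output from the PQ5 variant

for ref, cand in zip(reference_outputs, candidate_outputs):
    print(score_match(ref, cand))   # exact match
```

In practice a semantic check (e.g. running the generated prime checker) is more informative than string matching, which is why the table distinguishes "Exact match" from "Correct code".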

Summary

| Variant Range | Status | Recommendation |
|---|---|---|
| PQ5–PQ8 | ✅ Verified: generation matches original | Production ready |
| PQ3–PQ4 | ⚠️ Near-match: coherent, minor wording differences | Good for most use cases |
| PQ2 | ⚠️ Untested: cos_sim 0.93 suggests usable | Use with caution |
| PQ1 | ❌ Experimental: outputs random unicode garbage | Research only |

Recommendation: Use PQ5 (11.7 GB) for the best quality-to-size ratio. It produced outputs identical to the original BF16 model on all three tested prompts.

🚀 Quick Start

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoTokenizer

# Choose your variant: PQ1, PQ2, PQ3, PQ4, PQ5, PQ6, PQ8
VARIANT = "PQ5"
REPO_ID = "caiovicentino1/Gemma-4-E4B-it-HLWQ-Multi"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(REPO_ID, trust_remote_code=True)

# Download and load the quantized codes plus the layers kept in BF16
codes_path = hf_hub_download(REPO_ID, f"{VARIANT}/codes.safetensors")
bf16_path = hf_hub_download(REPO_ID, f"{VARIANT}/bf16.safetensors")

codes = load_file(codes_path)
bf16_weights = load_file(bf16_path)

print(f"Loaded {VARIANT}: {len(codes)} quantized layers + {len(bf16_weights)} BF16 layers")
```
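The on-disk layout of `codes.safetensors` is not documented here. As an illustration only, assuming each quantized tensor stores small integer codes alongside a per-tensor centroid table (hypothetical names), reconstruction reduces to a codebook lookup:

```python
def dequantize(codes, centroids):
    """Map each integer code back to its centroid value.

    For PQ5, codes would be 5-bit integers indexing a 32-entry codebook;
    the toy 5-entry codebook below is illustrative only.
    """
    return [centroids[c] for c in codes]

centroids = [-1.2, -0.4, 0.0, 0.4, 1.2]   # toy codebook
codes = [0, 2, 4, 1, 3]                   # toy integer codes

print(dequantize(codes, centroids))       # [-1.2, 0.0, 1.2, -0.4, 0.4]
```

The HLWQ pipeline would additionally undo the Hadamard rotation after the lookup; that step is omitted here.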

Full Inference with HLWQ

```python
import torch
from transformers import AutoTokenizer

# pip install polarquant
from polarquant import HLWQModel

model = HLWQModel.from_pretrained(
    "caiovicentino1/Gemma-4-E4B-it-HLWQ-Multi",
    variant="PQ5",          # Choose: PQ1-PQ8
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(
    "caiovicentino1/Gemma-4-E4B-it-HLWQ-Multi",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is 2+2?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=128)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

💻 Hardware Compatibility

| Variant | RTX 4060 (8 GB) | RTX 4070 (12 GB) | RTX 4090 (24 GB) | Mac M1 (16 GB) |
|---|---|---|---|---|
| PQ1 (10.1 GB) | ⚠️ tight | ✅ | ✅ | ✅ |
| PQ2 (10.5 GB) | ⚠️ tight | ✅ | ✅ | ✅ |
| PQ3 (10.9 GB) | ❌ | ✅ | ✅ | ✅ |
| PQ4 (11.3 GB) | ❌ | ✅ | ✅ | ✅ |
| PQ5 (11.7 GB) | ❌ | ✅ | ✅ | ✅ |
| PQ6 (12.1 GB) | ❌ | ✅ | ✅ | ✅ |
| PQ8 (12.9 GB) | ❌ | ⚠️ tight | ✅ | ✅ |

πŸ—οΈ Architecture

Gemma-4-E4B-it (4.5B effective, 8B with PLE embeddings)
β”œβ”€β”€ 42 layers, hybrid sliding window (512) + global attention
β”œβ”€β”€ Multimodal: text + image (~150M) + audio (~300M) + video
β”œβ”€β”€ 128K context, 262K vocabulary
β”œβ”€β”€ Per-Layer Embeddings (PLE) for edge efficiency
β”œβ”€β”€ Function calling + reasoning mode
└── 386 layers quantized, rest preserved in BF16

🔬 Why HLWQ > GGUF

Lloyd-Max centroids minimize mean squared error for a given bit budget, which makes them well suited to the roughly Gaussian weight distributions of LLMs:

```
PQ3 (cos_sim 0.977) > GGUF Q3_K_M (~0.95)   (same size, better quality)
PQ5 (cos_sim 0.996) > GGUF Q5_K_M (~0.99)   (same size, better quality)
```

GGUF uses uniform spacing between quantization levels, which is suboptimal for bell-shaped distributions. HLWQ instead places centroids where the weight density is highest (the Lloyd-Max algorithm), and a Hadamard rotation spreads outliers across dimensions before quantizing, so each code carries more information.
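The Lloyd-Max step can be sketched in a few lines. This toy 1-D example fits an 8-level (3-bit) codebook to synthetic Gaussian weights and compares it against uniform spacing; the Hadamard rotation and the actual release pipeline are not shown:

```python
import random

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4000)]

def quantize_mse(data, levels):
    """Mean squared error of nearest-level scalar quantization."""
    err = 0.0
    for x in data:
        nearest = min(levels, key=lambda c: abs(x - c))
        err += (x - nearest) ** 2
    return err / len(data)

def lloyd_max(data, levels, iters=20):
    """Refine levels with Lloyd iterations (1-D k-means):
    assign each sample to its nearest level, then move each level
    to the mean of its cell. MSE is non-increasing at every step."""
    levels = list(levels)
    for _ in range(iters):
        cells = [[] for _ in levels]
        for x in data:
            i = min(range(len(levels)), key=lambda j: abs(x - levels[j]))
            cells[i].append(x)
        levels = [sum(c) / len(c) if c else l for c, l in zip(cells, levels)]
    return levels

lo, hi = min(weights), max(weights)
uniform = [lo + (hi - lo) * (i + 0.5) / 8 for i in range(8)]   # 8 levels = 3-bit
fitted = lloyd_max(weights, uniform)

print(quantize_mse(weights, fitted) <= quantize_mse(weights, uniform))  # True
```

Because the iteration starts from the uniform levels and each step is monotone, the fitted codebook is guaranteed to be at least as good as uniform spacing on this data, which is the gap the cos_sim comparison above reflects.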

📦 Files

```
PQ1/codes.safetensors + bf16.safetensors   (1-bit, 10.1 GB)
PQ2/codes.safetensors + bf16.safetensors   (2-bit, 10.5 GB)
PQ3/codes.safetensors + bf16.safetensors   (3-bit, 10.9 GB)
PQ4/codes.safetensors + bf16.safetensors   (4-bit, 11.3 GB)
PQ5/codes.safetensors + bf16.safetensors   (5-bit, 11.7 GB) ← recommended
PQ6/codes.safetensors + bf16.safetensors   (6-bit, 12.1 GB)
PQ8/codes.safetensors + bf16.safetensors   (8-bit, 12.9 GB)
config.json + tokenizer files
```


📖 Citation

```bibtex
@article{polarquant2026,
  title={HLWQ: Hadamard-Rotated Lloyd-Max Quantization for Large Language Models},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}
```

Quantized with HLWQ (Hadamard + Lloyd-Max optimal quantization). First multi-variant HLWQ release.

🧪 PQ2-Mixed: Smallest Functional Variant

MLP in PQ2 (4 centroids) + attention in PQ4 (16 centroids): smaller than pure PQ3, better quality than pure PQ2.

| Component | Bits | Centroids | cos_sim | Layers |
|---|---|---|---|---|
| MLP (gate/up/down) | 2 | 4 | 0.940 | 126 |
| Attention (Q/K/V/O) | 4 | 16 | 0.997 | 168 |
| Norms/Embed/PLE | 16 | – | 1.000 | 1836 |
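The per-component assignment above can be expressed as a simple name-based rule. This is a hypothetical sketch (the parameter-name patterns follow common transformers conventions and are an assumption, not the release code):

```python
# Bit assignment behind a PQ2-Mixed-style scheme: MLP projections at 2 bits,
# attention projections at 4 bits, everything else kept in 16-bit.
MLP_KEYS = ("gate_proj", "up_proj", "down_proj")
ATTN_KEYS = ("q_proj", "k_proj", "v_proj", "o_proj")

def bits_for(param_name):
    """Return the quantization bit-width for a parameter by name."""
    if any(k in param_name for k in MLP_KEYS):
        return 2
    if any(k in param_name for k in ATTN_KEYS):
        return 4
    return 16   # norms, embeddings, PLE: keep full precision

print(bits_for("model.layers.0.mlp.gate_proj.weight"))      # 2
print(bits_for("model.layers.0.self_attn.q_proj.weight"))   # 4
print(bits_for("model.layers.0.input_layernorm.weight"))    # 16
```

Spending the extra bits on attention while squeezing the (more numerous) MLP weights is what lets PQ2-Mixed stay below pure PQ3 in size while beating pure PQ2 in quality.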

Generation Test

| Prompt | PQ2 Pure | PQ2-Mixed | PQ3 Pure |
|---|---|---|---|
| "What is 2+2?" | ⚠️ "4." | ✅ "Four" | ✅ "Four" |
| Prime checker | ⚠️ Imprecise | ⚠️ Different style | ✅ Match |
| Gravity | ⚠️ Confused | ⚠️ Simplified | ⚠️ Reformulated |
| Score | 0/3 | 1/3 | 2/3 |

PQ2-Mixed is the smallest variant that produces correct factual answers. Use PQ3+ for production.
