Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is only the name; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a polar transformation. The two methods are technically distinct.

Existing code that loads this repository by ID continues to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

🧊 Gemma-4-E4B-it – HLWQ Multi (PQ1–PQ8)

ALL quantization variants of Google's hottest model (108K downloads in 2 days).

Multimodal: text + image + audio + video | Apache 2.0 | 128K context | Edge-optimized

📊 All Variants

| Variant | Bits | Centroids | Download | cos_sim | Quality | Gen Test | vs GGUF |
|---|---|---|---|---|---|---|---|
| ⚠️ PQ1 | 1 | 2 | 10.1 GB | 0.7884 | Experimental | ❌ Garbage | – |
| ★★★ PQ2 | 2 | 4 | 10.5 GB | 0.9312 | Usable | ⚠️ Untested | Q2_K |
| ★★★★ PQ3 | 3 | 8 | 10.9 GB | 0.9766 | Good | ⚠️ 2/3 match | Q3_K_M |
| ★★★★ PQ4 | 4 | 16 | 11.3 GB | 0.9913 | Very Good | ⚠️ Near-match | Q4_K_M |
| ★★★★★ PQ5 | 5 | 32 | 11.7 GB | 0.9963 | Excellent | ✅ 3/3 match | Q5_K_M |
| ★★★★★ PQ6 | 6 | 64 | 12.1 GB | 0.9980 | Near-lossless | ✅ Verified | Q6_K |
| ★★★★★ PQ8 | 8 | 256 | 12.9 GB | 0.9987 | Effectively lossless | ✅ Verified | Q8_0 |

Quality scale:

```
PQ1: ██░░░░░░░░ 0.788   PQ4: █████████░ 0.991
PQ2: █████████░ 0.931   PQ5: ██████████ 0.996 ← recommended
PQ3: █████████░ 0.977   PQ8: ██████████ 0.999
```
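The cos_sim figures above are cosine similarities between the original and the dequantized weights. A minimal sketch of how such a score can be computed (illustrative only, not the repository's evaluation code; the toy values are made up):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flattened weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

original = [0.12, -0.53, 0.98, -0.04, 0.31]
# A hypothetical low-bit reconstruction: values snapped to nearby centroids
dequantized = [0.10, -0.50, 1.00, -0.05, 0.30]

print(round(cosine_similarity(original, dequantized), 4))
```

A score of 1.0 means the reconstruction is exact; the PQ5 value of 0.9963 means the dequantized tensors point almost exactly in the same direction as the originals.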

✅ Generation Quality (Verified)

Real generation tests comparing HLWQ output vs the original BF16 model on 3 prompts:

| Prompt | BF16 Original | PQ5 (5-bit) | PQ3 (3-bit) | PQ1 (1-bit) |
|---|---|---|---|---|
| "What is 2+2?" | "Four" | ✅ "Four" | ✅ "Four" | ❌ garbage |
| "Write a prime checker in Python" | Correct code | ✅ Exact match | ✅ Correct code | ❌ garbage |
| "Explain gravity briefly" | Correct explanation | ✅ Exact match | ⚠️ Different wording, coherent | ❌ garbage |
| Score | – | 3/3 MATCH | 2/3 MATCH | 0/3 |
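A comparison like the one above can be scripted. The following is a hypothetical scoring harness, not the actual test code used for this release; the classification rules are an assumption:

```python
def score_match(reference, candidate):
    """Classify a candidate answer against the BF16 reference output."""
    if candidate.strip() == reference.strip():
        return "exact match"
    if reference.strip().lower() in candidate.lower():
        return "contains reference"
    return "different"

# Toy data standing in for real model generations
reference_outputs = ["Four"]
candidate_outputs = ["Four"]   # e.g. decoded output from the PQ5 variant

for ref, cand in zip(reference_outputs, candidate_outputs):
    print(score_match(ref, cand))   # exact match
```

In practice a semantic check (e.g. running the generated prime checker) is more informative than string matching, which is why the table distinguishes "Exact match" from "Correct code".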

Summary

| Variant Range | Status | Recommendation |
|---|---|---|
| PQ5–PQ8 | ✅ Verified: generation matches original | Production ready |
| PQ3–PQ4 | ⚠️ Near-match: coherent, minor wording differences | Good for most use cases |
| PQ2 | ⚠️ Untested: cos_sim 0.93 suggests usable | Use with caution |
| PQ1 | ❌ Experimental: outputs random unicode garbage | Research only |

Recommendation: Use PQ5 (11.7 GB) for the best quality-to-size ratio. It produced outputs identical to the original BF16 model on all three tested prompts.

🚀 Quick Start

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoTokenizer

# Choose your variant: PQ1, PQ2, PQ3, PQ4, PQ5, PQ6, PQ8
VARIANT = "PQ5"
REPO_ID = "caiovicentino1/Gemma-4-E4B-it-HLWQ-Multi"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(REPO_ID, trust_remote_code=True)

# Download and load the quantized codes plus the layers kept in BF16
codes_path = hf_hub_download(REPO_ID, f"{VARIANT}/codes.safetensors")
bf16_path = hf_hub_download(REPO_ID, f"{VARIANT}/bf16.safetensors")

codes = load_file(codes_path)
bf16_weights = load_file(bf16_path)

print(f"Loaded {VARIANT}: {len(codes)} quantized layers + {len(bf16_weights)} BF16 layers")
```
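The on-disk layout of `codes.safetensors` is not documented here. As an illustration only, assuming each quantized tensor stores small integer codes alongside a per-tensor centroid table (hypothetical names), reconstruction reduces to a codebook lookup:

```python
def dequantize(codes, centroids):
    """Map each integer code back to its centroid value.

    For PQ5, codes would be 5-bit integers indexing a 32-entry codebook;
    the toy 5-entry codebook below is illustrative only.
    """
    return [centroids[c] for c in codes]

centroids = [-1.2, -0.4, 0.0, 0.4, 1.2]   # toy codebook
codes = [0, 2, 4, 1, 3]                   # toy integer codes

print(dequantize(codes, centroids))       # [-1.2, 0.0, 1.2, -0.4, 0.4]
```

The HLWQ pipeline would additionally undo the Hadamard rotation after the lookup; that step is omitted here.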

Full Inference with HLWQ

```python
import torch
from transformers import AutoTokenizer

# pip install polarquant
from polarquant import HLWQModel

model = HLWQModel.from_pretrained(
    "caiovicentino1/Gemma-4-E4B-it-HLWQ-Multi",
    variant="PQ5",          # Choose: PQ1-PQ8
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(
    "caiovicentino1/Gemma-4-E4B-it-HLWQ-Multi",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is 2+2?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=128)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

💻 Hardware Compatibility

| Variant | RTX 4060 (8 GB) | RTX 4070 (12 GB) | RTX 4090 (24 GB) | Mac M1 (16 GB) |
|---|---|---|---|---|
| PQ1 (10.1 GB) | ⚠️ tight | ✅ | ✅ | ✅ |
| PQ2 (10.5 GB) | ⚠️ tight | ✅ | ✅ | ✅ |
| PQ3 (10.9 GB) | ❌ | ✅ | ✅ | ✅ |
| PQ4 (11.3 GB) | ❌ | ✅ | ✅ | ✅ |
| PQ5 (11.7 GB) | ❌ | ✅ | ✅ | ✅ |
| PQ6 (12.1 GB) | ❌ | ✅ | ✅ | ✅ |
| PQ8 (12.9 GB) | ❌ | ⚠️ tight | ✅ | ✅ |

πŸ—οΈ Architecture

Gemma-4-E4B-it (4.5B effective, 8B with PLE embeddings)
β”œβ”€β”€ 42 layers, hybrid sliding window (512) + global attention
β”œβ”€β”€ Multimodal: text + image (~150M) + audio (~300M) + video
β”œβ”€β”€ 128K context, 262K vocabulary
β”œβ”€β”€ Per-Layer Embeddings (PLE) for edge efficiency
β”œβ”€β”€ Function calling + reasoning mode
└── 386 layers quantized, rest preserved in BF16

🔬 Why HLWQ > GGUF

Lloyd-Max centroids minimize mean squared error for a given bit budget, which makes them well suited to the roughly Gaussian weight distributions of LLMs:

```
PQ3 (cos_sim 0.977) > GGUF Q3_K_M (~0.95)   (same size, better quality)
PQ5 (cos_sim 0.996) > GGUF Q5_K_M (~0.99)   (same size, better quality)
```

GGUF uses uniform spacing between quantization levels, which is suboptimal for bell-shaped distributions. HLWQ instead places centroids where the weight density is highest (the Lloyd-Max algorithm), and a Hadamard rotation spreads outliers across dimensions before quantizing, so each code carries more information.
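The Lloyd-Max step can be sketched in a few lines. This toy 1-D example fits an 8-level (3-bit) codebook to synthetic Gaussian weights and compares it against uniform spacing; the Hadamard rotation and the actual release pipeline are not shown:

```python
import random

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4000)]

def quantize_mse(data, levels):
    """Mean squared error of nearest-level scalar quantization."""
    err = 0.0
    for x in data:
        nearest = min(levels, key=lambda c: abs(x - c))
        err += (x - nearest) ** 2
    return err / len(data)

def lloyd_max(data, levels, iters=20):
    """Refine levels with Lloyd iterations (1-D k-means):
    assign each sample to its nearest level, then move each level
    to the mean of its cell. MSE is non-increasing at every step."""
    levels = list(levels)
    for _ in range(iters):
        cells = [[] for _ in levels]
        for x in data:
            i = min(range(len(levels)), key=lambda j: abs(x - levels[j]))
            cells[i].append(x)
        levels = [sum(c) / len(c) if c else l for c, l in zip(cells, levels)]
    return levels

lo, hi = min(weights), max(weights)
uniform = [lo + (hi - lo) * (i + 0.5) / 8 for i in range(8)]   # 8 levels = 3-bit
fitted = lloyd_max(weights, uniform)

print(quantize_mse(weights, fitted) <= quantize_mse(weights, uniform))  # True
```

Because the iteration starts from the uniform levels and each step is monotone, the fitted codebook is guaranteed to be at least as good as uniform spacing on this data, which is the gap the cos_sim comparison above reflects.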

📦 Files

```
PQ1/codes.safetensors + bf16.safetensors   (1-bit, 10.1 GB)
PQ2/codes.safetensors + bf16.safetensors   (2-bit, 10.5 GB)
PQ3/codes.safetensors + bf16.safetensors   (3-bit, 10.9 GB)
PQ4/codes.safetensors + bf16.safetensors   (4-bit, 11.3 GB)
PQ5/codes.safetensors + bf16.safetensors   (5-bit, 11.7 GB) ← recommended
PQ6/codes.safetensors + bf16.safetensors   (6-bit, 12.1 GB)
PQ8/codes.safetensors + bf16.safetensors   (8-bit, 12.9 GB)
config.json + tokenizer files
```


📖 Citation

```bibtex
@article{polarquant2026,
  title={HLWQ: Hadamard-Rotated Lloyd-Max Quantization for Large Language Models},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}
```

Quantized with HLWQ (Hadamard + Lloyd-Max optimal quantization). First multi-variant HLWQ release.

🧪 PQ2-Mixed: Smallest Functional Variant

MLP in PQ2 (4 centroids) + attention in PQ4 (16 centroids): smaller than pure PQ3, better quality than pure PQ2.

| Component | Bits | Centroids | cos_sim | Layers |
|---|---|---|---|---|
| MLP (gate/up/down) | 2 | 4 | 0.940 | 126 |
| Attention (Q/K/V/O) | 4 | 16 | 0.997 | 168 |
| Norms/Embed/PLE | 16 | – | 1.000 | 1836 |
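The per-component assignment above can be expressed as a simple name-based rule. This is a hypothetical sketch (the parameter-name patterns follow common transformers conventions and are an assumption, not the release code):

```python
# Bit assignment behind a PQ2-Mixed-style scheme: MLP projections at 2 bits,
# attention projections at 4 bits, everything else kept in 16-bit.
MLP_KEYS = ("gate_proj", "up_proj", "down_proj")
ATTN_KEYS = ("q_proj", "k_proj", "v_proj", "o_proj")

def bits_for(param_name):
    """Return the quantization bit-width for a parameter by name."""
    if any(k in param_name for k in MLP_KEYS):
        return 2
    if any(k in param_name for k in ATTN_KEYS):
        return 4
    return 16   # norms, embeddings, PLE: keep full precision

print(bits_for("model.layers.0.mlp.gate_proj.weight"))      # 2
print(bits_for("model.layers.0.self_attn.q_proj.weight"))   # 4
print(bits_for("model.layers.0.input_layernorm.weight"))    # 16
```

Spending the extra bits on attention while squeezing the (more numerous) MLP weights is what lets PQ2-Mixed stay below pure PQ3 in size while beating pure PQ2 in quality.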

Generation Test

| Prompt | PQ2 Pure | PQ2-Mixed | PQ3 Pure |
|---|---|---|---|
| "What is 2+2?" | ⚠️ "4." | ✅ "Four" | ✅ "Four" |
| Prime checker | ⚠️ Imprecise | ⚠️ Different style | ✅ Match |
| Gravity | ⚠️ Confused | ⚠️ Simplified | ⚠️ Reformulated |
| Score | 0/3 | 1/3 | 2/3 |

PQ2-Mixed is the smallest variant that produces correct factual answers. Use PQ3+ for production.
