Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is only the name; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

Nemotron-Cascade-2-30B-A3B — Expert Offloading + PolarQuant Q5

30B MoE model at 7.6 GB VRAM, 15+ tok/s, correct output.

VRAM Before & After

Speed vs VRAM Tradeoff

Benchmark Results

| Config | tok/s | Model VRAM | Quality |
|---|---|---|---|
| Full BF16 (baseline) | 54.5 | 92 GB | Perfect |
| Expert cache=8 (LFRU) | 16.4 | 7.6 GB | Perfect |
| Expert cache=8 (LRU) | 14.6-16.9 | 7.6 GB | Perfect |
| Expert cache=8 (patcher) | 15.6 | 38 GB* | Perfect |
| Expert cache=16 (patcher) | 19.6 | 42 GB* | Perfect |
| Expert cache=32 (patcher) | 24.4 | 48 GB* | Perfect |

*Patcher: peak VRAM 92 GB (experts loaded to GPU first). Fork: experts load directly to CPU (7.6 GB peak).

Quick Start — Fork (Recommended)

RTX 4090 / RTX 3090 / any 24+ GB GPU:

```shell
# Install (uses pre-compiled C extensions, no CUDA build needed)
VLLM_USE_PRECOMPILED=1 pip install vllm --upgrade
```

```shell
# Run
FLASHINFER_DISABLE_VERSION_CHECK=1 python -c "
from vllm import LLM, SamplingParams
llm = LLM(
    model='nvidia/Nemotron-Cascade-2-30B-A3B',
    trust_remote_code=True,
    dtype='bfloat16',
    max_model_len=4096,
    enforce_eager=True,
    moe_expert_cache_size=8,
    kernel_config={'moe_backend': 'triton'},
    gpu_memory_utilization=0.95,
)
out = llm.generate(['What is 2+3?'], SamplingParams(max_tokens=200))
print(out[0].outputs[0].text)
"
```

Cache Size Guide

| Cache | Model VRAM | Speed | Target GPU |
|---|---|---|---|
| 8 | ~7.6 GB | ~15 tok/s | RTX 4090 (24 GB) |
| 16 | ~11 GB | ~20 tok/s | RTX 4090 (24 GB) |
| 32 | ~19 GB | ~25 tok/s | RTX 4090 (24 GB) |
| 64 | ~34 GB | ~35 tok/s | A6000 (48 GB) |
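The VRAM column tracks a simple estimate. Using the weight figures quoted further down this card (58.7 GB of expert weights across 128 experts × 23 MoE layers, plus 4.4 GB of non-expert weights resident on GPU), a back-of-envelope sketch (mine, not the repo's):

```python
# Rough VRAM estimate for a given per-layer expert cache size.
# Figures from this card: 58.7 GB of expert weights over 128 experts x 23
# MoE layers (~20 MB each), plus 4.4 GB of non-expert weights on GPU.
EXPERT_GB, NON_EXPERT_GB = 58.7, 4.4
N_EXPERTS, N_MOE_LAYERS = 128, 23

def vram_estimate_gb(cache_size: int) -> float:
    per_expert_gb = EXPERT_GB / (N_EXPERTS * N_MOE_LAYERS)  # ~0.02 GB
    return NON_EXPERT_GB + cache_size * N_MOE_LAYERS * per_expert_gb

for c in (8, 16, 32, 64):
    print(f"cache={c}: ~{vram_estimate_gb(c):.1f} GB")
```

The estimate lands within roughly a gigabyte of the table's numbers; the remainder is presumably allocator overhead and activation buffers.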

Requirements

  • GPU: 24+ GB VRAM (RTX 3090/4090 or better)
  • CPU RAM: 64 GB (expert weights stored in CPU pinned memory)
  • CUDA: 12.0+
  • Python: 3.10+

Alternative: PolarQuant Q5 (Full VRAM)

For GPUs with 64+ GB VRAM (A100/H100):

```shell
pip install polarengine-vllm
polarquant-convert caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5 /tmp/model
vllm serve /tmp/model --trust-remote-code --dtype bfloat16
```

- Download: 20 GB (Q5 bit-packed, 3.15x smaller)
- Speed: 175 tok/s (vLLM native)
- PPL: 7.47 (+0.02 vs BF16 — near-lossless)

VRAM Comparison

Weight Distribution

How Expert Offloading Works

Nemotron has 128 routed experts per MoE layer (23 layers), but only 6 are active per token, so 92.9% of the model's weights are expert weights that sit idle at any given moment.

```
┌──────────────────┐     ┌──────────────────────┐
│   GPU (~8 GB)    │     │     CPU (~60 GB)     │
│                  │     │                      │
│ Non-expert:      │     │ Expert weights:      │
│  - Mamba SSM     │     │  128 experts × 23    │
│  - Attention     │     │  layers (pinned mem) │
│  - Norms/Router  │     │                      │
│                  │     └───────────┬──────────┘
│ LRU Cache:       │                 │
│  8 expert slots  │◄── H2D copy ────┘
│  (GPU buffer)    │   on cache miss
└──────────────────┘
```

Cache hit → zero transfer (fast). Cache miss → copy 1 expert (~20 MB).
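The caching scheme above can be sketched as a per-layer LRU keyed by expert ID. This is illustrative only; names like `load_to_gpu` are placeholders, not the fork's actual API:

```python
from collections import OrderedDict

# Minimal sketch of one MoE layer's LRU expert cache. `load_to_gpu`
# stands in for the pinned host-to-device copy done on a cache miss.
class ExpertLRUCache:
    def __init__(self, capacity: int, load_to_gpu):
        self.capacity = capacity
        self.load_to_gpu = load_to_gpu   # expert_id -> GPU-resident weights
        self.slots = OrderedDict()       # expert_id -> weights, LRU-ordered
        self.hits = self.misses = 0

    def get(self, expert_id):
        if expert_id in self.slots:
            self.hits += 1
            self.slots.move_to_end(expert_id)   # mark most-recently-used
        else:
            self.misses += 1
            if len(self.slots) >= self.capacity:
                self.slots.popitem(last=False)  # evict least-recently-used
            self.slots[expert_id] = self.load_to_gpu(expert_id)
        return self.slots[expert_id]

cache = ExpertLRUCache(capacity=8, load_to_gpu=lambda e: f"weights[{e}]")
for e in [3, 7, 3, 9, 3]:
    cache.get(e)
print(cache.hits, cache.misses)  # 2 hits (re-uses of expert 3), 3 misses
```

Only a cache miss pays the ~20 MB transfer; routing locality across consecutive tokens is what makes the small cache viable.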

Perplexity (WikiText-2)

| Config | PPL | Delta |
|---|---|---|
| BF16 baseline | 7.45 | — |
| Expert cache=8 | 6.09 | lossless |
| PolarQuant Q5 | 7.47 | +0.02 |

Expert offloading preserves full model quality; the apparent PPL improvement over baseline is likely measurement variance from the small 4K-token sample.

Technical Details

Fork: caiovicentino/vllm-expert-offload@nemotron-expert-offload

Based on PR #37190 by @e1n00r, rebased on current vLLM main with fixes:

  1. `_init_runner` NameError: `gate` and `shared_experts` are now stored on `self` before the method call
  2. `_init_runner` returned None: added `return self.runner`
  3. `shared_experts` AttributeError: safe `getattr`, since the attribute is not yet initialized in `super().__init__`
  4. `moe_kernel` was None when the cache is active: the kernel is now created even for CPU-resident weights
  5. Prefill overflow: warn and truncate instead of crashing when a batch needs more than `cache_size` experts

Model Architecture

  • Total: 30B params (3B active per token)
  • Layers: 52 (23 Mamba SSM + 23 MoE + 6 Attention)
  • Experts: 128 routed + 1 shared per MoE layer, top-6 routing
  • Expert weights: 58.7 GB (92.9%)
  • Non-expert weights: 4.4 GB (7.1%)

Citation

```bibtex
@article{vicentino2026polarquant,
    title={PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression},
    author={Vicentino, Caio},
    journal={arXiv preprint arXiv:2603.29078},
    year={2026},
    url={https://arxiv.org/abs/2603.29078}
}
```

🚀 Quick Start

Install

```shell
pip install git+https://github.com/caiovicentino/polarengine-vllm.git
```

Load & Generate (1 line!)

```python
from polarengine_vllm import PolarQuantModel

model = PolarQuantModel.from_pretrained("caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5")
print(model.generate("Hello, how are you?", max_new_tokens=100))
```

With KV Cache Compression (5.3x more context)

```python
model = PolarQuantModel.from_pretrained("caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5", kv_cache_nbits=3)
# KV cache now uses 5.3x less memory — fit longer conversations!
print(model.generate("Explain quantum computing in detail.", max_new_tokens=500))
```
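The 5.3x figure follows directly from the bit widths, assuming a 16-bit (BF16/FP16) KV cache baseline and ignoring the small overhead of quantization scales:

```python
# KV cache compression ratio: 16-bit baseline values stored in 3 bits.
baseline_bits, kv_cache_nbits = 16, 3
ratio = baseline_bits / kv_cache_nbits
print(f"{ratio:.1f}x")  # 5.3x
```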

Benchmark

```shell
polarquant bench caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5 --ppl --chart
```

Gradio Demo

```shell
polarquant demo caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5 --share
```

📦 Method: PolarQuant

Hadamard Rotation + Lloyd-Max Optimal Centroids

Unlike GGUF's uniform quantization grids, PolarQuant places quantization levels where the weight density is highest, which is the Lloyd-Max optimum for Gaussian-distributed neural network weights.

PolarQuant Q5 reaches cos_sim > 0.996, versus ~0.99 for GGUF Q5_K_M at the same size.
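A toy sketch of the recipe this card describes (deterministic Walsh-Hadamard rotation, then Lloyd-Max centroids fit by the classic Lloyd iteration). This is my illustration of the idea, not the released kernels or codebooks:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max(x: np.ndarray, n_levels: int = 32, iters: int = 50):
    """Lloyd iteration: 1-D k-means places levels where density is high."""
    c = np.quantile(x, np.linspace(0, 1, n_levels))  # density-aware init
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in range(n_levels):
            if np.any(idx == k):
                c[k] = x[idx == k].mean()   # centroid -> mean of its cell
    idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
    return c, idx

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))            # stand-in weight matrix
H = hadamard(64)
Wr = H @ W                                   # deterministic rotation
c, idx = lloyd_max(Wr.ravel(), n_levels=32)  # 2**5 = 32 levels for Q5
W_hat = H.T @ c[idx].reshape(Wr.shape)       # dequantize, rotate back
err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative error: {err:.3f}")
```

With 32 levels on roughly Gaussian data, the relative error comes out around 0.05, which is roughly consistent with the cos_sim > 0.996 quoted above (cos_sim ≈ 1 − err²/2 for small errors).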

🔗 Links
