Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is only the name; the algorithm and the weights in this repository are unchanged.
The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.
Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).
Nemotron-Cascade-2-30B-A3B: Expert Offloading + PolarQuant Q5
A 30B MoE model in 7.6 GB of VRAM at 15+ tok/s, with output quality intact.
Benchmark Results
| Config | tok/s | Model VRAM | Quality |
|---|---|---|---|
| Full BF16 (baseline) | 54.5 | 92 GB | Perfect |
| Expert cache=8 (LFRU) | 16.4 | 7.6 GB | Perfect |
| Expert cache=8 (LRU) | 14.6-16.9 | 7.6 GB | Perfect |
| Expert cache=8 (patcher) | 15.6 | 38 GB* | Perfect |
| Expert cache=16 (patcher) | 19.6 | 42 GB* | Perfect |
| Expert cache=32 (patcher) | 24.4 | 48 GB* | Perfect |
*Patcher: peak VRAM 92 GB (experts loaded to GPU first). Fork: experts load directly to CPU (7.6 GB peak).
Quick Start: Fork (Recommended)
RTX 4090 / RTX 3090 / any 24+ GB GPU:
```bash
# Install (uses pre-compiled C extensions, no CUDA build needed)
VLLM_USE_PRECOMPILED=1 pip install vllm --upgrade

# Run
FLASHINFER_DISABLE_VERSION_CHECK=1 python -c "
from vllm import LLM, SamplingParams

llm = LLM(
    model='nvidia/Nemotron-Cascade-2-30B-A3B',
    trust_remote_code=True,
    dtype='bfloat16',
    max_model_len=4096,
    enforce_eager=True,
    moe_expert_cache_size=8,
    kernel_config={'moe_backend': 'triton'},
    gpu_memory_utilization=0.95,
)
out = llm.generate(['What is 2+3?'], SamplingParams(max_tokens=200))
print(out[0].outputs[0].text)
"
```
Cache Size Guide
| Cache | Model VRAM | Speed | Target GPU |
|---|---|---|---|
| 8 | ~7.6 GB | ~15 tok/s | RTX 4090 (24 GB) |
| 16 | ~11 GB | ~20 tok/s | RTX 4090 (24 GB) |
| 32 | ~19 GB | ~25 tok/s | RTX 4090 (24 GB) |
| 64 | ~34 GB | ~35 tok/s | A6000 (48 GB) |
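The table above is consistent with a simple cost model, assuming the cache size means expert slots per MoE layer and each expert is ~20 MB (58.7 GB spread over 128 experts × 23 layers). A back-of-envelope sketch:

```python
# Back-of-envelope VRAM model for the cache-size table. Assumptions (from
# this card): cache size = expert slots per MoE layer, ~20 MB per expert,
# ~4.4 GB of non-expert weights resident on GPU.
MOE_LAYERS = 23
EXPERT_MB = 20           # ~58.7 GB / (128 experts * 23 layers)
NON_EXPERT_GB = 4.4

def model_vram_gb(cache_slots_per_layer: int) -> float:
    """Estimate resident model VRAM for a given per-layer cache size."""
    cache_gb = cache_slots_per_layer * MOE_LAYERS * EXPERT_MB / 1024
    return round(NON_EXPERT_GB + cache_gb, 1)

for slots in (8, 16, 32, 64):
    print(slots, model_vram_gb(slots))
```

The estimates land within about 1 GB of the measured column, which supports the per-layer interpretation of the cache size.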
Requirements
- GPU: 24+ GB VRAM (RTX 3090/4090 or better)
- CPU RAM: 64 GB (expert weights stored in CPU pinned memory)
- CUDA: 12.0+
- Python: 3.10+
Alternative: PolarQuant Q5 (Full VRAM)
For GPUs with 64+ GB VRAM (A100/H100):
```bash
pip install polarengine-vllm
polarquant-convert caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5 /tmp/model
vllm serve /tmp/model --trust-remote-code --dtype bfloat16
```
- Download: 20 GB (Q5 bit-packed, 3.15x smaller)
- Speed: 175 tok/s (vLLM native)
- PPL: 7.47 (+0.02 vs BF16, near-lossless)
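The download size and compression ratio follow from simple bit arithmetic (a sketch; it ignores per-group scales, codebook metadata, and any tensors kept at higher precision):

```python
# Size arithmetic for Q5 bit-packing (sketch; exact overheads vary).
params = 30e9
bf16_gb = params * 2 / 1e9        # 16-bit weights: ~60 GB raw
q5_gb = params * 5 / 8 / 1e9      # packed 5-bit codes: ~18.75 GB
print(round(bf16_gb, 1), round(q5_gb, 2), round(bf16_gb / q5_gb, 2))
```

Metadata pushes the real artifact to the quoted 20 GB, and measured against the ~63 GB of on-disk BF16 weights that gives the quoted 3.15x.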
How Expert Offloading Works
Nemotron has 128 routed experts per MoE layer (23 MoE layers), but only 6 are active per token, so 92.9% of the weights are expert weights that sit idle on any given step.
```
┌───────────────────┐      ┌──────────────────────┐
│  GPU (~8 GB)      │      │  CPU (~60 GB)        │
│                   │      │                      │
│  Non-expert:      │      │  Expert weights:     │
│  - Mamba SSM      │      │  128 experts × 23    │
│  - Attention      │      │  layers (pinned mem) │
│  - Norms/Router   │      │                      │
│                   │      └──────────┬───────────┘
│  LRU Cache:       │                 │
│  8 expert slots   │◄─── H2D copy ───┘
│  (GPU buffer)     │     on cache miss
└───────────────────┘
```

Cache hit → zero transfer (fast). Cache miss → copy one expert (~20 MB) from CPU to GPU.
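The cache logic can be sketched in a few lines of plain Python (hypothetical class and names; the fork's real code copies weights into a GPU buffer from CPU pinned memory on a miss):

```python
from collections import OrderedDict

# Minimal sketch of a per-layer expert LRU cache. This is an illustration,
# not the fork's implementation: the values stand in for weight tensors.
class ExpertCache:
    def __init__(self, slots):
        self.slots = slots
        self.resident = OrderedDict()   # expert_id -> device-resident weights
        self.hits = self.misses = 0

    def get(self, expert_id, cpu_weights):
        if expert_id in self.resident:
            self.hits += 1
            self.resident.move_to_end(expert_id)     # mark most recently used
        else:
            self.misses += 1
            if len(self.resident) >= self.slots:
                self.resident.popitem(last=False)    # evict least recently used
            # real code: ~20 MB non-blocking H2D copy from pinned memory here
            self.resident[expert_id] = cpu_weights[expert_id]
        return self.resident[expert_id]

cpu_experts = {i: f"expert_{i}_weights" for i in range(128)}
cache = ExpertCache(slots=2)                 # tiny cache to show eviction
for eid in [0, 1, 0, 2, 1]:
    cache.get(eid, cpu_experts)
print(cache.hits, cache.misses)              # 1 4: expert 1 was evicted by 2
```

With routing locality (the same experts recurring across consecutive tokens), the hit rate rises and decode speed approaches the cached-only configurations in the benchmark table.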
Perplexity (WikiText-2)
| Config | PPL | Delta |
|---|---|---|
| BF16 baseline | 7.45 | — |
| Expert cache=8 | 6.09 | lossless |
| PolarQuant Q5 | 7.47 | +0.02 |
Expert offloading preserves full model quality. The PPL improvement over baseline is likely due to measurement variance (4K token sample).
Technical Details
Fork: caiovicentino/vllm-expert-offload@nemotron-expert-offload
Based on PR #37190 by @e1n00r, rebased on current vLLM main with fixes:
- `_init_runner` `NameError` → `gate` and `shared_experts` stored on `self` before the method call
- `_init_runner` returns `None` → added `return self.runner`
- `shared_experts` `AttributeError` → safe `getattr` (not yet initialized in `super().__init__()`)
- `moe_kernel` is `None` when the cache is active → create the kernel even for CPU-resident weights
- Prefill overflow → warn and truncate instead of crashing when a batch needs more than `cache_size` experts
Model Architecture
- Total: 30B params (3B active per token)
- Layers: 52 (23 Mamba SSM + 23 MoE + 6 Attention)
- Experts: 128 routed + 1 shared per MoE layer, top-6 routing
- Expert weights: 58.7 GB (92.9%)
- Non-expert weights: 4.4 GB (7.1%)
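These figures are mutually consistent, as a quick cross-check shows (rounding differs slightly from the card's 92.9%):

```python
# Cross-checking the parameter budget above.
expert_gb, non_expert_gb = 58.7, 4.4
total_gb = expert_gb + non_expert_gb     # ~63.1 GB of BF16 weights
expert_share = expert_gb / total_gb      # fraction of weights that are experts
active_routed = 6 / 128                  # top-6 routing: routed experts used per token
print(round(expert_share * 100, 1), round(active_routed * 100, 1))
```

About 93% of the weights are experts, which is exactly what offloading removes from the GPU; and 6-of-128 routing is why only ~3B of the 30B parameters are active per token.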
Links
- Fork (expert offloading): github.com/caiovicentino/vllm-expert-offload
- PolarEngine (patcher + quantization): github.com/caiovicentino/polarengine-vllm
- Base model: nvidia/Nemotron-Cascade-2-30B-A3B
- vLLM PR #37190: Expert CPU offloading
Citation
```bibtex
@article{vicentino2026polarquant,
  title={PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026},
  url={https://arxiv.org/abs/2603.29078}
}
```
Quick Start
Install
```bash
pip install git+https://github.com/caiovicentino/polarengine-vllm.git
```
Load & Generate (1 line!)
```python
from polarengine_vllm import PolarQuantModel

model = PolarQuantModel.from_pretrained("caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5")
print(model.generate("Hello, how are you?", max_new_tokens=100))
```
With KV Cache Compression (5.3x more context)
```python
model = PolarQuantModel.from_pretrained(
    "caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5",
    kv_cache_nbits=3,
)
# KV cache now uses 5.3x less memory, so longer conversations fit
print(model.generate("Explain quantum computing in detail.", max_new_tokens=500))
```
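The quoted 5.3x appears to be the raw bit-width ratio of 16-bit KV entries packed to 3 bits (an assumption; scale metadata would reduce the effective ratio slightly):

```python
# Where "5.3x more context" plausibly comes from: KV cache entries
# quantized from 16 bits down to kv_cache_nbits=3.
full_bits, quant_bits = 16, 3
print(round(full_bits / quant_bits, 1))
```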
Benchmark
```bash
polarquant bench caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5 --ppl --chart
```
Gradio Demo
```bash
polarquant demo caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5 --share
```
Method: PolarQuant
Hadamard Rotation + Lloyd-Max Optimal Centroids
Unlike GGUF's uniform quantization, PolarQuant places its quantization levels where weight density is highest: the Lloyd-Max placement that minimizes mean squared error for Gaussian-distributed neural network weights.
At equal size, PolarQuant Q5 reaches cos_sim > 0.996 versus ~0.99 for GGUF Q5_K_M.
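The Lloyd-Max idea is easy to see on synthetic Gaussian weights: starting from a uniform grid, alternating nearest-level assignment with centroid updates pulls levels toward the dense region near zero and lowers quantization error. A textbook sketch (not the repository's implementation, which also applies the Hadamard rotation first):

```python
import random

# Textbook Lloyd-Max codebook fitting on synthetic Gaussian "weights".
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4000)]

def uniform_grid(samples, levels):
    lo, hi = min(samples), max(samples)
    return [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]

def fit_lloyd_max(samples, levels, iters=15):
    centroids = uniform_grid(samples, levels)
    for _ in range(iters):
        buckets = [[] for _ in range(levels)]
        for x in samples:                          # assign to nearest level
            j = min(range(levels), key=lambda k: abs(x - centroids[k]))
            buckets[j].append(x)
        centroids = [sum(b) / len(b) if b else c   # move level to bucket mean
                     for b, c in zip(buckets, centroids)]
    return sorted(centroids)

def mse(samples, centroids):
    return sum(min((x - c) ** 2 for c in centroids) for x in samples) / len(samples)

grid = uniform_grid(weights, 32)                   # 32 levels = 5 bits
codebook = fit_lloyd_max(weights, 32)
print(mse(weights, codebook) < mse(weights, grid)) # density-aware levels win
```

With 32 levels (Q5), the fitted codebook beats the uniform grid in MSE on Gaussian data, which is the advantage claimed here over GGUF's uniform levels; the Hadamard rotation's role is to make real weight distributions closer to Gaussian so this optimality applies.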