Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is only the name; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

Nemotron-Cascade-2-30B-A3B — Expert Offloading + PolarQuant Q5

30B MoE model at 7.6 GB VRAM, 15+ tok/s, correct output.

VRAM Before & After

Speed vs VRAM Tradeoff

Benchmark Results

| Config | tok/s | Model VRAM | Quality |
|---|---|---|---|
| Full BF16 (baseline) | 54.5 | 92 GB | Perfect |
| Expert cache=8 (LFRU) | 16.4 | 7.6 GB | Perfect |
| Expert cache=8 (LRU) | 14.6-16.9 | 7.6 GB | Perfect |
| Expert cache=8 (patcher) | 15.6 | 38 GB* | Perfect |
| Expert cache=16 (patcher) | 19.6 | 42 GB* | Perfect |
| Expert cache=32 (patcher) | 24.4 | 48 GB* | Perfect |

*Patcher: peak VRAM 92 GB (experts loaded to GPU first). Fork: experts load directly to CPU (7.6 GB peak).

Quick Start — Fork (Recommended)

RTX 4090 / RTX 3090 / any 24+ GB GPU:

```shell
# Install (uses pre-compiled C extensions, no CUDA build needed)
VLLM_USE_PRECOMPILED=1 pip install vllm --upgrade
```

```shell
# Run
FLASHINFER_DISABLE_VERSION_CHECK=1 python -c "
from vllm import LLM, SamplingParams
llm = LLM(
    model='nvidia/Nemotron-Cascade-2-30B-A3B',
    trust_remote_code=True,
    dtype='bfloat16',
    max_model_len=4096,
    enforce_eager=True,
    moe_expert_cache_size=8,
    kernel_config={'moe_backend': 'triton'},
    gpu_memory_utilization=0.95,
)
out = llm.generate(['What is 2+3?'], SamplingParams(max_tokens=200))
print(out[0].outputs[0].text)
"
```

Cache Size Guide

| Cache | Model VRAM | Speed | Target GPU |
|---|---|---|---|
| 8 | ~7.6 GB | ~15 tok/s | RTX 4090 (24 GB) |
| 16 | ~11 GB | ~20 tok/s | RTX 4090 (24 GB) |
| 32 | ~19 GB | ~25 tok/s | RTX 4090 (24 GB) |
| 64 | ~34 GB | ~35 tok/s | A6000 (48 GB) |
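The VRAM column tracks a simple estimate. Using the weight figures quoted further down this card (58.7 GB of expert weights across 128 experts × 23 MoE layers, plus 4.4 GB of non-expert weights resident on GPU), a back-of-envelope sketch (mine, not the repo's):

```python
# Rough VRAM estimate for a given per-layer expert cache size.
# Figures from this card: 58.7 GB of expert weights over 128 experts x 23
# MoE layers (~20 MB each), plus 4.4 GB of non-expert weights on GPU.
EXPERT_GB, NON_EXPERT_GB = 58.7, 4.4
N_EXPERTS, N_MOE_LAYERS = 128, 23

def vram_estimate_gb(cache_size: int) -> float:
    per_expert_gb = EXPERT_GB / (N_EXPERTS * N_MOE_LAYERS)  # ~0.02 GB
    return NON_EXPERT_GB + cache_size * N_MOE_LAYERS * per_expert_gb

for c in (8, 16, 32, 64):
    print(f"cache={c}: ~{vram_estimate_gb(c):.1f} GB")
```

The estimate lands within roughly a gigabyte of the table's numbers; the remainder is presumably allocator overhead and activation buffers.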

Requirements

  • GPU: 24+ GB VRAM (RTX 3090/4090 or better)
  • CPU RAM: 64 GB (expert weights stored in CPU pinned memory)
  • CUDA: 12.0+
  • Python: 3.10+

Alternative: PolarQuant Q5 (Full VRAM)

For GPUs with 64+ GB VRAM (A100/H100):

```shell
pip install polarengine-vllm
polarquant-convert caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5 /tmp/model
vllm serve /tmp/model --trust-remote-code --dtype bfloat16
```

- Download: 20 GB (Q5 bit-packed, 3.15x smaller)
- Speed: 175 tok/s (vLLM native)
- PPL: 7.47 (+0.02 vs BF16 — near-lossless)

VRAM Comparison

Weight Distribution

How Expert Offloading Works

Nemotron has 128 routed experts per MoE layer (23 layers), but only 6 are active per token, so 92.9% of the model's weights are expert weights that sit idle at any given moment.

```
┌──────────────────┐     ┌──────────────────────┐
│   GPU (~8 GB)    │     │     CPU (~60 GB)     │
│                  │     │                      │
│ Non-expert:      │     │ Expert weights:      │
│  - Mamba SSM     │     │  128 experts × 23    │
│  - Attention     │     │  layers (pinned mem) │
│  - Norms/Router  │     │                      │
│                  │     └───────────┬──────────┘
│ LRU Cache:       │                 │
│  8 expert slots  │◄── H2D copy ────┘
│  (GPU buffer)    │   on cache miss
└──────────────────┘
```

Cache hit → zero transfer (fast). Cache miss → copy 1 expert (~20 MB).
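The caching scheme above can be sketched as a per-layer LRU keyed by expert ID. This is illustrative only; names like `load_to_gpu` are placeholders, not the fork's actual API:

```python
from collections import OrderedDict

# Minimal sketch of one MoE layer's LRU expert cache. `load_to_gpu`
# stands in for the pinned host-to-device copy done on a cache miss.
class ExpertLRUCache:
    def __init__(self, capacity: int, load_to_gpu):
        self.capacity = capacity
        self.load_to_gpu = load_to_gpu   # expert_id -> GPU-resident weights
        self.slots = OrderedDict()       # expert_id -> weights, LRU-ordered
        self.hits = self.misses = 0

    def get(self, expert_id):
        if expert_id in self.slots:
            self.hits += 1
            self.slots.move_to_end(expert_id)   # mark most-recently-used
        else:
            self.misses += 1
            if len(self.slots) >= self.capacity:
                self.slots.popitem(last=False)  # evict least-recently-used
            self.slots[expert_id] = self.load_to_gpu(expert_id)
        return self.slots[expert_id]

cache = ExpertLRUCache(capacity=8, load_to_gpu=lambda e: f"weights[{e}]")
for e in [3, 7, 3, 9, 3]:
    cache.get(e)
print(cache.hits, cache.misses)  # 2 hits (re-uses of expert 3), 3 misses
```

Only a cache miss pays the ~20 MB transfer; routing locality across consecutive tokens is what makes the small cache viable.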

Perplexity (WikiText-2)

| Config | PPL | Delta |
|---|---|---|
| BF16 baseline | 7.45 | — |
| Expert cache=8 | 6.09 | lossless |
| PolarQuant Q5 | 7.47 | +0.02 |

Expert offloading preserves full model quality; the apparent PPL improvement over baseline is likely measurement variance from the small 4K-token sample.

Technical Details

Fork: caiovicentino/vllm-expert-offload@nemotron-expert-offload

Based on PR #37190 by @e1n00r, rebased on current vLLM main with fixes:

  1. `_init_runner` NameError: `gate` and `shared_experts` are now stored on `self` before the method call
  2. `_init_runner` returned None: added `return self.runner`
  3. `shared_experts` AttributeError: safe `getattr`, since the attribute is not yet initialized in `super().__init__`
  4. `moe_kernel` was None when the cache is active: the kernel is now created even for CPU-resident weights
  5. Prefill overflow: warn and truncate instead of crashing when a batch needs more than `cache_size` experts

Model Architecture

  • Total: 30B params (3B active per token)
  • Layers: 52 (23 Mamba SSM + 23 MoE + 6 Attention)
  • Experts: 128 routed + 1 shared per MoE layer, top-6 routing
  • Expert weights: 58.7 GB (92.9%)
  • Non-expert weights: 4.4 GB (7.1%)

Citation

```bibtex
@article{vicentino2026polarquant,
    title={PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression},
    author={Vicentino, Caio},
    journal={arXiv preprint arXiv:2603.29078},
    year={2026},
    url={https://arxiv.org/abs/2603.29078}
}
```

🚀 Quick Start

Install

```shell
pip install git+https://github.com/caiovicentino/polarengine-vllm.git
```

Load & Generate (1 line!)

```python
from polarengine_vllm import PolarQuantModel

model = PolarQuantModel.from_pretrained("caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5")
print(model.generate("Hello, how are you?", max_new_tokens=100))
```

With KV Cache Compression (5.3x more context)

```python
model = PolarQuantModel.from_pretrained("caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5", kv_cache_nbits=3)
# KV cache now uses 5.3x less memory — fit longer conversations!
print(model.generate("Explain quantum computing in detail.", max_new_tokens=500))
```
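The 5.3x figure follows directly from the bit widths, assuming a 16-bit (BF16/FP16) KV cache baseline and ignoring the small overhead of quantization scales:

```python
# KV cache compression ratio: 16-bit baseline values stored in 3 bits.
baseline_bits, kv_cache_nbits = 16, 3
ratio = baseline_bits / kv_cache_nbits
print(f"{ratio:.1f}x")  # 5.3x
```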

Benchmark

```shell
polarquant bench caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5 --ppl --chart
```

Gradio Demo

```shell
polarquant demo caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5 --share
```

📦 Method: PolarQuant

Hadamard Rotation + Lloyd-Max Optimal Centroids

Unlike GGUF's uniform quantization grids, PolarQuant places quantization levels where the weight density is highest, which is the Lloyd-Max optimum for Gaussian-distributed neural network weights.

PolarQuant Q5 reaches cos_sim > 0.996, versus ~0.99 for GGUF Q5_K_M at the same size.
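A toy sketch of the recipe this card describes (deterministic Walsh-Hadamard rotation, then Lloyd-Max centroids fit by the classic Lloyd iteration). This is my illustration of the idea, not the released kernels or codebooks:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max(x: np.ndarray, n_levels: int = 32, iters: int = 50):
    """Lloyd iteration: 1-D k-means places levels where density is high."""
    c = np.quantile(x, np.linspace(0, 1, n_levels))  # density-aware init
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in range(n_levels):
            if np.any(idx == k):
                c[k] = x[idx == k].mean()   # centroid -> mean of its cell
    idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
    return c, idx

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))            # stand-in weight matrix
H = hadamard(64)
Wr = H @ W                                   # deterministic rotation
c, idx = lloyd_max(Wr.ravel(), n_levels=32)  # 2**5 = 32 levels for Q5
W_hat = H.T @ c[idx].reshape(Wr.shape)       # dequantize, rotate back
err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative error: {err:.3f}")
```

With 32 levels on roughly Gaussian data, the relative error comes out around 0.05, which is roughly consistent with the cos_sim > 0.996 quoted above (cos_sim ≈ 1 − err²/2 for small errors).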

🔗 Links
