Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). Only the name changes; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).
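
As a rough illustration of the two ingredients named in the notice (a deterministic Walsh-Hadamard rotation followed by a Lloyd-Max scalar codebook), the sketch below rotates a block of weights, fits a 16-level codebook, and reconstructs the block. It is a toy under stated assumptions, not the implementation from arXiv:2603.29078: the block size of 128, the quantile initialisation, the iteration count, and all function names (`fwht`, `lloyd_max`, `quantize_block`, `dequantize_block`) are chosen here for illustration only.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def nearest(codebook, x):
    """Index of the codebook level closest to x."""
    return min(range(len(codebook)), key=lambda j: abs(codebook[j] - x))


def lloyd_max(samples, levels, iterations=25):
    """Fit a scalar codebook by alternating nearest-level assignment and centroid updates."""
    ordered = sorted(samples)
    # Assumption: start from evenly spaced quantiles of the data.
    codebook = [ordered[(2 * k + 1) * len(ordered) // (2 * levels)] for k in range(levels)]
    for _ in range(iterations):
        sums = [0.0] * levels
        counts = [0] * levels
        for x in samples:
            j = nearest(codebook, x)
            sums[j] += x
            counts[j] += 1
        # Keep the old level if no sample was assigned to it.
        codebook = [sums[j] / counts[j] if counts[j] else codebook[j] for j in range(levels)]
    return sorted(codebook)


def quantize_block(block, codebook):
    """Rotate the block, then store one codebook index per rotated value."""
    return [nearest(codebook, v) for v in fwht(block)]


def dequantize_block(indices, codebook):
    """Look up levels and rotate back; the orthonormal transform is its own inverse."""
    return fwht([codebook[i] for i in indices])


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(128)]
codebook = lloyd_max(fwht(weights), levels=16)
restored = dequantize_block(quantize_block(weights, codebook), codebook)
mse = sum((a - b) ** 2 for a, b in zip(weights, restored)) / len(weights)
print(f"16-level reconstruction MSE: {mse:.5f}")
```

Because the rotation is orthonormal, the error measured after rotating back equals the quantization error in the rotated domain, so the codebook can be fitted entirely on rotated values.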

Qwopus-MoE-35B-A3B — HLWQ INT4 CompressedTensors

35B Mixture-of-Experts quantized to INT4 — loads natively in vLLM with Marlin kernel. Zero plugins, zero custom code.

Metric	INT4 CT	INT4 + LFRU Cache	BF16 Original
Perplexity	6.56	6.56	not measured
Speed	23.6 tok/s	37.4 tok/s	16.2 tok/s
Download	19.5 GB	19.5 GB	72 GB
VRAM	~25 GB	~8 GB	~72 GB
vs BF16	1.46x faster	2.3x faster	baseline

Quick Start

Standard (vLLM native)

pip install vllm

vllm serve caiovicentino1/Qwopus-MoE-35B-A3B-HLWQ-Q5 \
  --language-model-only \
  --enforce-eager \
  --max-model-len 4096

With Expert Offloading (fits RTX 4090!)

pip install git+https://github.com/caiovicentino/vllm-expert-offload.git

vllm serve caiovicentino1/Qwopus-MoE-35B-A3B-HLWQ-Q5 \
  --language-model-only \
  --enforce-eager \
  --moe-expert-cache-size 8 \
  --max-model-len 4096

37.4 tok/s with LFRU expert cache — only 8 hot experts in GPU, rest on CPU. 1.58x faster than all-in-GPU and fits in ~8 GB VRAM.

Python API

from vllm import LLM, SamplingParams

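llm = LLM(
    model="caiovicentino1/Qwopus-MoE-35B-A3B-HLWQ-Q5",
    trust_remote_code=True,
    enforce_eager=True,
    max_model_len=4096,
    # Text-only weights: disable image/video inputs so vLLM does not profile the vision encoder.
    limit_mm_per_prompt={"image": 0, "video": 0},
)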

output = llm.generate(
    ["Explain quantum computing in simple terms."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(output[0].outputs[0].text)

Validated Deploy (April 2026, updated vLLM)

The original --language-model-only flag has been removed from recent vLLM builds. Current vLLM (0.17.1rc1+) requires a slightly different invocation because the model config declares Qwen3_5MoeForConditionalGeneration (multimodal class) even though this repo ships text-only weights.

Working command (current vLLM + fork feature branch)

pip install git+https://github.com/caiovicentino/vllm-expert-offload.git@feat-ct-moe-wna16-marlin-offload

vllm serve caiovicentino1/Qwopus-MoE-35B-A3B-HLWQ-Q5 \
  --served-model-name qwopus-ct \
  --moe-expert-cache-size 128 \
  --enforce-eager \
  --trust-remote-code \
  --max-model-len 4096 \
  --limit-mm-per-prompt '{"image":0,"video":0}' \
  --gpu-memory-utilization 0.90

Key flag changes vs older instructions:

--language-model-only → removed (flag no longer exists in current vLLM).
--limit-mm-per-prompt '{"image":0,"video":0}' → new, tells vLLM not to allocate dummy image tokens in the profile-run. Without this the engine tries to load a fake image through a vision encoder that has no weights, and crashes during KV-cache sizing.
git+.../vllm-expert-offload.git@feat-ct-moe-wna16-marlin-offload → feature branch with:
- opt-in expert LRU/LFRU cache for CompressedTensorsWNA16MarlinMoEMethod (first CT INT4 MoE method in the fork with expert offload support)
- disk-backed backing store via torch.from_file(shared=True) — zero pinned CPU RAM, OS page cache handles hot pages
- correct overflow fallback for top_k=8 models (old truncation produced silent garbage when unique experts > cache size)
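
For scripted deployments, the working command above can be assembled programmatically. The helper below is a hypothetical convenience: the name `build_serve_command` and its defaults are not part of vLLM or the fork, and it only builds an argument list with the same flags as the command shown above. Note that `--moe-expert-cache-size` exists only in the fork.

```python
import json
import shlex


def build_serve_command(model_id, cache_size=128, max_model_len=4096, gpu_util=0.90):
    """Return the argv list for the `vllm serve` invocation (hypothetical helper)."""
    # Zero image/video inputs keeps vLLM from profiling a vision encoder
    # that has no weights in this repository.
    mm_limits = json.dumps({"image": 0, "video": 0}, separators=(",", ":"))
    return [
        "vllm", "serve", model_id,
        "--served-model-name", "qwopus-ct",
        "--moe-expert-cache-size", str(cache_size),  # fork-only flag
        "--enforce-eager",
        "--trust-remote-code",
        "--max-model-len", str(max_model_len),
        "--limit-mm-per-prompt", mm_limits,
        "--gpu-memory-utilization", f"{gpu_util:.2f}",
    ]


argv = build_serve_command("caiovicentino1/Qwopus-MoE-35B-A3B-HLWQ-Q5")
print(shlex.join(argv))  # shell-quoted form, safe to paste into a terminal
```

Passing the list directly to `subprocess.run(argv)` avoids shell quoting problems with the JSON argument.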

Cache size tuning

This model is top_k=8 with 256 experts, so a single prefill forward can activate 40–60 unique experts. The cache policy is per-forward:

--moe-expert-cache-size 128: recommended. All hot experts fit, ~24 tok/s on RTX PRO 6000, ~8 GB persistent LRU buffer on GPU.
--moe-expert-cache-size 64: tighter GPU footprint, ~20 tok/s, occasional overflow fallbacks.
--moe-expert-cache-size 8: minimum footprint, ~10 tok/s. Every forward hits the overflow fallback, which is correct but copies CPU→GPU on every call. Useful on 12 GB VRAM cards if you accept the speed cost.

For top_k=2 models (Nemotron, Gemma-4-26B-A4B), --moe-expert-cache-size 8 is fast because those models activate only ~4-8 unique experts per forward.
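
The interaction between cache size and the number of unique experts per forward can be sketched with a toy simulation. This is not the fork's implementation: the class name `ExpertCache`, the eviction rule (lowest request count first, then least recently used), and the counters are assumptions made for illustration. The sketch only shows the per-forward decision described above: if the unique experts fit, load the misses; otherwise take the overflow fallback rather than truncating the set.

```python
from collections import Counter


class ExpertCache:
    """Toy per-forward expert cache with a frequency-then-recency eviction rule."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = set()    # experts currently held on the GPU
        self.hits = Counter()    # how often each expert has been requested
        self.last_used = {}      # forward step of the most recent request
        self.step = 0
        self.loads = 0           # CPU->GPU copies into the cache
        self.overflows = 0       # forwards that could not fit in the cache

    def forward(self, routed_experts):
        """Process one forward pass; return "cached" or "overflow"."""
        self.step += 1
        unique = set(routed_experts)
        for e in unique:
            self.hits[e] += 1
            self.last_used[e] = self.step
        if len(unique) > self.capacity:
            # Overflow fallback: serve this call from host memory and leave
            # the cache untouched instead of dropping experts.
            self.overflows += 1
            return "overflow"
        for e in unique - self.resident:
            if len(self.resident) >= self.capacity:
                # Never evict an expert that this forward pass still needs.
                victims = self.resident - unique
                victim = min(victims, key=lambda v: (self.hits[v], self.last_used[v]))
                self.resident.remove(victim)
            self.resident.add(e)
            self.loads += 1
        return "cached"


cache = ExpertCache(capacity=8)
first = cache.forward(range(8))     # 8 unique experts fit in the cache
second = cache.forward(range(40))   # 40 unique experts exceed it
print(first, second, cache.overflows)
```

With a capacity of 8 and 40 or more unique experts per prefill, every prefill takes the fallback path, which matches the slow-but-correct behaviour described for `--moe-expert-cache-size 8`.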

Repo files added for vLLM loading

preprocessor_config.json: stub pulled from Qwen/Qwen2-VL-7B-Instruct. vLLM's Qwen3-VL processor initialization requires this file even when images are disabled; it is never invoked during text-only inference.
tokenizer_config.json: tokenizer_class set to Qwen2TokenizerFast (was PreTrainedTokenizerFast), as required by the Qwen3-VL processor class check.

Validated on

NVIDIA RTX PRO 6000 Blackwell Server Edition (102 GB VRAM, 176 GB host RAM, 236 GB SSD).

Disk usage during deploy: ~20 GB HF cache + ~31 GB disk-backed expert cache = ~51 GB peak.
Host RAM: stable at ~7 GB used, with ~167 GB in OS page cache (the disk-backed mmap pages, reclaimable).
Smoke test: outputs coherent on 3 prompts at cache sizes 8, 64, and 128.
Informal PPL sanity check: 9.6 on a 357-token WikiText-style sample with cache=8. The authoritative PPL of 6.56 above comes from the full WikiText-2 evaluation with a different config.

Architecture

Property	Value
Base Model	samuelcardillo/Qwopus-MoE-35B-A3B
Architecture	Qwen3.5 MoE
Total Parameters	35B
Active per Token	~3B (8 of 256 experts)
Attention	Hybrid — self-attention + linear attention (Mamba-style SSM)
Layers	28 transformer layers

Quantization Details

Component	Format	Details
Expert weights (gate/up/down)	INT4	symmetric, group_size=128, pack-quantized
Shared expert (gate/up/down)	INT4	Same format
Attention (q/k/v/o_proj)	INT4	Standard Linear layers
Linear attention (in_proj_qkv/z, out_proj)	INT4	SSM projections
LayerNorms, norms	BF16	Not quantized
SSM params (A_log, conv1d, dt_bias)	BF16	Not quantized
MoE router (gate.weight)	BF16	Critical for routing
lm_head, embed_tokens	BF16	Embedding layers
in_proj_a, in_proj_b	BF16	Small dims

Format: CompressedTensors pack-quantized — industry standard, native vLLM support via Marlin kernel.

Pipeline: Base BF16 → INT4 symmetric group128 → pack INT32 → CompressedTensors safetensors
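
The pipeline above can be sketched for a single group of 128 weights. This is a minimal illustration of symmetric group quantization and nibble packing, not the exact CompressedTensors layout: the scale rule (max absolute value divided by 7), the +8 offset, the lowest-nibble-first order, and the function names are assumptions made here for clarity.

```python
GROUP_SIZE = 128


def quantize_group(weights):
    """Symmetric INT4: one scale per group, integer codes in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def pack_int4(codes):
    """Pack eight signed 4-bit codes into each 32-bit word, lowest nibble first."""
    if len(codes) % 8:
        raise ValueError("number of codes must be a multiple of 8")
    words = []
    for i in range(0, len(codes), 8):
        word = 0
        for j, c in enumerate(codes[i:i + 8]):
            word |= ((c + 8) & 0xF) << (4 * j)  # offset to unsigned 0..15
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover the signed codes."""
    return [((w >> (4 * j)) & 0xF) - 8 for w in words for j in range(8)]


def dequantize_group(codes, scale):
    """Map integer codes back to approximate weights."""
    return [c * scale for c in codes]


# Deterministic toy weights in [-0.1, 0.1] standing in for one BF16 group.
weights = [((i * 37) % 101 - 50) / 500.0 for i in range(GROUP_SIZE)]
codes, scale = quantize_group(weights)
packed = pack_int4(codes)
restored = dequantize_group(unpack_int4(packed), scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(packed), "words per group, max abs error", max_err)
```

Each group of 128 weights becomes 16 packed 32-bit words plus one scale, which is where the roughly 4x size reduction relative to BF16 comes from.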

Benchmarks

Perplexity (WikiText-2, Full Test Set)

Evaluated over the complete WikiText-2 test set (1,155 windows, 295,680 tokens):

Windows	Tokens Scored	PPL
200	51,200	6.58
500	128,000	6.41
1,000	256,000	6.58
1,155	295,680	6.56

PPL converges to 6.56 on full WikiText-2. We have not measured the BF16 baseline PPL for this specific MoE — we can only say this is a low-PPL result on the benchmark, without a direct BF16 comparison.
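
For reference, windowed perplexity is the exponential of the mean per-token negative log-likelihood across all scored tokens. The sketch below assumes the per-token losses have already been computed by the model; the function name `perplexity` and the synthetic losses are illustrative. The token counts in the table are consistent with 256 scored tokens per window (295,680 / 1,155).

```python
import math

TOKENS_PER_WINDOW = 295_680 // 1_155  # tokens scored per window, from the table


def perplexity(window_nlls):
    """window_nlls: list of windows, each a list of per-token NLLs in nats."""
    total_nll = sum(sum(window) for window in window_nlls)
    total_tokens = sum(len(window) for window in window_nlls)
    return math.exp(total_nll / total_tokens)


# Synthetic check: a constant loss of ln(6.56) per token yields PPL 6.56.
windows = [[math.log(6.56)] * TOKENS_PER_WINDOW for _ in range(3)]
ppl = perplexity(windows)
print(f"{ppl:.2f}")
```

Averaging over tokens rather than over windows keeps the result independent of how the test set is split, which is why the running values in the table can be compared directly.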

Generation Speed

Tested on NVIDIA RTX PRO 6000 Blackwell (102 GB VRAM):

Config	tok/s	Load Time	VRAM
INT4 CT (vLLM native)	23.6	~165s	~25 GB
INT4 CT + LFRU cache	37.4	~315s	~8 GB
BF16 Original	16.2	~120s	~72 GB

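In this benchmark the LFRU expert cache configuration is faster than all-in-GPU. We attribute this to better memory locality and less fragmentation: only 8 hot experts stay on the GPU instead of all 256. The Validated Deploy section reports different throughput for the current feature branch (~24 tok/s at cache size 128, ~10 tok/s at cache size 8).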

2.3x faster than BF16 with 3.7x smaller download and 9x less VRAM.
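
The headline ratios follow directly from the table. The short check below recomputes them from the reported figures; the dictionary layout and the names are illustrative only.

```python
# Reported figures from the Generation Speed table and the download sizes above.
bf16 = {"tok_s": 16.2, "download_gb": 72.0, "vram_gb": 72.0}
int4 = {"tok_s": 23.6, "download_gb": 19.5, "vram_gb": 25.0}
int4_cache = {"tok_s": 37.4, "download_gb": 19.5, "vram_gb": 8.0}

speedup_int4 = int4["tok_s"] / bf16["tok_s"]                  # INT4 vs BF16
speedup_cache = int4_cache["tok_s"] / bf16["tok_s"]           # INT4 + cache vs BF16
cache_vs_int4 = int4_cache["tok_s"] / int4["tok_s"]           # cache vs all-in-GPU
download_ratio = bf16["download_gb"] / int4["download_gb"]    # download reduction
vram_ratio = bf16["vram_gb"] / int4_cache["vram_gb"]          # VRAM reduction

for name, value in [
    ("INT4 vs BF16 speed", speedup_int4),
    ("INT4 + cache vs BF16 speed", speedup_cache),
    ("cache vs all-in-GPU speed", cache_vs_int4),
    ("download reduction", download_ratio),
    ("VRAM reduction", vram_ratio),
]:
    print(f"{name}: {value:.2f}x")
```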

HLWQ Ecosystem

All models load natively in vLLM — no plugins required:

Model	Size	Speed	PPL
Qwopus-MoE-35B INT4	19.5 GB	37.4 tok/s	6.56
Qwen3.5-27B INT4	16.2 GB	18.0 tok/s	—
Harmonic-27B INT4	16.2 GB	18.0 tok/s	—
Qwopus3.5-27B INT4	16.2 GB	18.0 tok/s	—
Qwen3.5-9B INT4	6.5 GB	168.4 tok/s	6.56
Qwopus3.5-9B INT4	6.5 GB	168.4 tok/s	—

GPU Requirements

GPU	VRAM	Config	Expected Speed
RTX 4090	24 GB	LFRU cache (`--moe-expert-cache-size 4`)	~30 tok/s
RTX 3090	24 GB	LFRU cache (`--moe-expert-cache-size 4`)	~15 tok/s
RTX PRO 6000	102 GB	LFRU cache (8 experts)	37.4 tok/s
H100 80GB	80 GB	All-in-GPU or LFRU cache	~25 tok/s
A100 80GB	80 GB	Use `--enforce-eager`	Known Marlin MoE bug (#35922)

Expert LFRU offloading via vllm-expert-offload enables running this 35B MoE on consumer GPUs (RTX 4090, 3090). Only hot experts stay in GPU, rest on CPU. Counterintuitively, this is faster than all-in-GPU due to better memory locality.

Required Flags

Flag	Why
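`--language-model-only`	Older vLLM builds only: skips vision encoder (4304 dim not Marlin-compatible). Removed in current vLLM; use `--limit-mm-per-prompt '{"image":0,"video":0}'` instead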
`--enforce-eager`	Required for Blackwell and expert cache; recommended for stability
`--moe-expert-cache-size N`	Keep N experts in GPU, rest on CPU (fork)

Technical Notes

Expert format: Per-expert 2D tensors (experts.{N}.gate_proj.weight_packed), not 3D stacked. vLLM FusedMoE handles stacking internally.
Reference format: Follows RedHatAI/Qwen3-30B-A3B-quantized.w4a16 conventions.
Marlin kernel: Fused INT4 dequant + matmul for maximum throughput.
Vision weights: Excluded (use --language-model-only on older vLLM builds, or --limit-mm-per-prompt '{"image":0,"video":0}' on current ones). The config retains vision_config for architecture resolution.

Citation

@article{vicentino2026polarquant,
  title={HLWQ: Hadamard-Rotated Post-Training Quantization},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}

Model tree for caiovicentino1/Qwopus-MoE-35B-A3B-HLWQ-Q5

Base model: Qwen/Qwen3.5-35B-A3B-Base
Finetuned: Qwen/Qwen3.5-35B-A3B
Finetuned: samuelcardillo/Qwopus-MoE-35B-A3B
Quantized: this model
