Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is only the name; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).
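To make the two ingredients named in the notice concrete, here is a minimal NumPy sketch of a Walsh-Hadamard rotation followed by a Lloyd-Max scalar codebook. It is an illustration only, not code from this repository or the paper: the block size of 128, the 16-level codebook, the random stand-in weights, and the helper names `hadamard` and `lloyd_max` are all assumptions.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction); n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def lloyd_max(x: np.ndarray, levels: int = 16, iters: int = 25) -> np.ndarray:
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    codebook = np.quantile(x, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(levels):
            members = x[idx == k]
            if members.size:
                codebook[k] = members.mean()
    return np.sort(codebook)

rng = np.random.default_rng(0)
block = 128                               # assumed rotation block size
w = rng.standard_normal((64, block))      # stand-in weights, not real model data
h = hadamard(block)

rotated = w @ h                           # deterministic rotation
codebook = lloyd_max(rotated.ravel())     # 16 levels, i.e. 4 bits per weight
idx = np.abs(rotated[..., None] - codebook).argmin(axis=-1)
w_hat = codebook[idx] @ h.T               # H is orthonormal, so its inverse is H.T

rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Because the rotation is orthonormal, the quantization error measured in the rotated space carries over unchanged to the original weights.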

Qwopus-MoE-35B-A3B — HLWQ INT4 CompressedTensors

35B Mixture-of-Experts quantized to INT4 — loads natively in vLLM with Marlin kernel. Zero plugins, zero custom code.

Benchmarks

| Metric | INT4 CT | INT4 + LFRU Cache | BF16 Original |
|---|---|---|---|
| Perplexity | 6.56 | 6.56 | not measured |
| Speed | 23.6 tok/s | 37.4 tok/s | 16.2 tok/s |
| Download | 19.5 GB | 19.5 GB | 72 GB |
| VRAM | ~25 GB | ~8 GB | ~72 GB |
| vs BF16 | 1.46x faster | 2.3x faster | baseline |

Quick Start

Standard (vLLM native)

pip install vllm

vllm serve caiovicentino1/Qwopus-MoE-35B-A3B-HLWQ-Q5 \
  --language-model-only \
  --enforce-eager \
  --max-model-len 4096

With Expert Offloading (fits RTX 4090!)

pip install git+https://github.com/caiovicentino/vllm-expert-offload.git

vllm serve caiovicentino1/Qwopus-MoE-35B-A3B-HLWQ-Q5 \
  --language-model-only \
  --enforce-eager \
  --moe-expert-cache-size 8 \
  --max-model-len 4096

37.4 tok/s with LFRU expert cache — only 8 hot experts in GPU, rest on CPU. 1.58x faster than all-in-GPU and fits in ~8 GB VRAM.

Python API

from vllm import LLM, SamplingParams

llm = LLM(
    model="caiovicentino1/Qwopus-MoE-35B-A3B-HLWQ-Q5",
    trust_remote_code=True,
    enforce_eager=True,
    max_model_len=4096,
)

output = llm.generate(
    ["Explain quantum computing in simple terms."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(output[0].outputs[0].text)

Validated Deploy (April 2026, updated vLLM)

The original --language-model-only flag has been removed from recent vLLM builds. Current vLLM (0.17.1rc1+) requires a slightly different invocation because the model config declares Qwen3_5MoeForConditionalGeneration (multimodal class) even though this repo ships text-only weights.

Working command (current vLLM + fork feature branch)

pip install git+https://github.com/caiovicentino/vllm-expert-offload.git@feat-ct-moe-wna16-marlin-offload

vllm serve caiovicentino1/Qwopus-MoE-35B-A3B-HLWQ-Q5 \
  --served-model-name qwopus-ct \
  --moe-expert-cache-size 128 \
  --enforce-eager \
  --trust-remote-code \
  --max-model-len 4096 \
  --limit-mm-per-prompt '{"image":0,"video":0}' \
  --gpu-memory-utilization 0.90
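
Once the server from the command above is running, it can be queried through vLLM's OpenAI-compatible completions endpoint. The sketch below uses only the standard library. It assumes the server listens on localhost at vLLM's default port 8000 (the command does not set a port); the helper names `build_completion_request` and `complete` are illustrative.

```python
import json
import urllib.request

def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="qwopus-ct", max_tokens=256, temperature=0.7):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,            # must match --served-model-name
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]

# Building the request does not touch the network.
req = build_completion_request("Explain quantum computing in simple terms.")
print(req.full_url)
```

Call `complete("Explain quantum computing in simple terms.")` after the server has finished loading to get the generated text back.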

Key flag changes vs older instructions:

  • --language-model-only → removed (flag no longer exists in current vLLM).
  • --limit-mm-per-prompt '{"image":0,"video":0}' → new, tells vLLM not to allocate dummy image tokens in the profile-run. Without this the engine tries to load a fake image through a vision encoder that has no weights, and crashes during KV-cache sizing.
  • git+.../vllm-expert-offload.git@feat-ct-moe-wna16-marlin-offload → feature branch with:
    • opt-in expert LRU/LFRU cache for CompressedTensorsWNA16MarlinMoEMethod (first CT INT4 MoE method in the fork with expert offload support)
    • disk-backed backing store via torch.from_file(shared=True) — zero pinned CPU RAM, OS page cache handles hot pages
    • correct overflow fallback for top_k=8 models (old truncation produced silent garbage when unique experts > cache size)

Cache size tuning

This model is top_k=8 with 256 experts, so a single prefill forward can activate 40–60 unique experts. The cache policy is per-forward:

  • --moe-expert-cache-size 128: recommended. All hot experts fit, ~24 tok/s on RTX PRO 6000, ~8 GB persistent LRU buffer on GPU.
  • --moe-expert-cache-size 64: tighter GPU footprint, ~20 tok/s, occasional overflow fallbacks.
  • --moe-expert-cache-size 8: minimum footprint, ~10 tok/s. Every forward hits the overflow fallback, which is correct but copies experts CPU→GPU on every call. Useful on a 12 GB GPU if you can accept the speed cost.

For top_k=2 models (Nemotron, Gemma-4-26B-A4B), --moe-expert-cache-size 8 is fast because those models activate only ~4-8 unique experts per forward.
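
The per-forward policy and the overflow fallback described above can be pictured with a toy model. This is a simplified sketch, not the fork's implementation: it uses plain LRU eviction rather than LFRU, tracks expert ids instead of weights, and the `ExpertCache` class and its counters are hypothetical names.

```python
from collections import OrderedDict

class ExpertCache:
    """Toy LRU cache over expert ids with a per-forward overflow fallback."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()   # expert id -> placeholder for GPU weights
        self.hits = 0
        self.loads = 0                  # CPU -> GPU copies
        self.overflows = 0

    def forward(self, expert_ids):
        unique = list(dict.fromkeys(expert_ids))   # dedupe, keep first-seen order
        if len(unique) > self.capacity:
            # Overflow: more unique experts than cache slots. Stream all of them
            # for this call and leave the cache untouched; truncating the set
            # instead would silently drop experts the router selected.
            self.overflows += 1
            self.loads += len(unique)
            return unique
        hits = [e for e in unique if e in self.resident]
        misses = [e for e in unique if e not in self.resident]
        for e in hits:                  # refresh recency first so hits are never evicted
            self.hits += 1
            self.resident.move_to_end(e)
        for e in misses:
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)   # evict least recently used
            self.resident[e] = True
            self.loads += 1
        return unique

cache = ExpertCache(capacity=4)
cache.forward([1, 2, 3])            # three cold loads
cache.forward([1, 2, 5])            # two hits, one load
cache.forward([6])                  # cache full: evicts expert 3
cache.forward(list(range(10, 20)))  # 10 unique experts > 4 slots: overflow path
print(cache.hits, cache.loads, cache.overflows, list(cache.resident))
```

With a cache smaller than the number of unique experts per forward, every call takes the overflow branch, which is why the smallest setting trades speed for footprint on this model.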

Repo files added for vLLM loading

  • preprocessor_config.json: stub pulled from Qwen/Qwen2-VL-7B-Instruct. vLLM's Qwen3-VL processor init requires this file even when images are disabled; it is never invoked during text inference.
  • tokenizer_config.json: tokenizer_class set to Qwen2TokenizerFast (was PreTrainedTokenizerFast), required by the Qwen3-VL processor class check.

Validated on

NVIDIA RTX PRO 6000 Blackwell Server Edition (102 GB VRAM, 176 GB host RAM, 236 GB SSD). Disk usage during deploy: ~20 GB HF cache + ~31 GB disk-backed expert cache = ~51 GB peak. Host RAM stable at ~7 GB used, ~167 GB in OS page cache (the disk-backed mmap pages, reclaimable). Outputs coherent on 3-prompt smoke test at cache sizes 8, 64, 128. Informal PPL sanity check: 9.6 on a 357-token wikitext-style sample with cache=8 (the authoritative PPL 6.56 above is from the full WikiText-2 evaluation with a different config).


Architecture

| Property | Value |
|---|---|
| Base Model | samuelcardillo/Qwopus-MoE-35B-A3B |
| Architecture | Qwen3.5 MoE |
| Total Parameters | 35B |
| Active per Token | ~3B (8 of 256 experts) |
| Attention | Hybrid: self-attention + linear attention (Mamba-style SSM) |
| Layers | 28 transformer layers |

Quantization Details

| Component | Format | Details |
|---|---|---|
| Expert weights (gate/up/down) | INT4 | symmetric, group_size=128, pack-quantized |
| Shared expert (gate/up/down) | INT4 | Same format |
| Attention (q/k/v/o_proj) | INT4 | Standard Linear layers |
| Linear attention (in_proj_qkv/z, out_proj) | INT4 | SSM projections |
| LayerNorms, norms | BF16 | Not quantized |
| SSM params (A_log, conv1d, dt_bias) | BF16 | Not quantized |
| MoE router (gate.weight) | BF16 | Critical for routing |
| lm_head, embed_tokens | BF16 | Embedding layers |
| in_proj_a, in_proj_b | BF16 | Small dims |

Format: CompressedTensors pack-quantized — industry standard, native vLLM support via Marlin kernel.

Pipeline: Base BF16 → INT4 symmetric group128 → pack INT32 → CompressedTensors safetensors
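
The middle steps of the pipeline (symmetric INT4 with group size 128, then packing into INT32) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the conversion script used for this repository: the scale rule max|w|/7, the +8 offset, and the low-nibble-first bit order are choices made for the example, and the exact layout CompressedTensors expects is not described in this card.

```python
import numpy as np

GROUP = 128

def quantize_int4_symmetric(w, group_size=GROUP):
    """Per-group symmetric quantization to integers in [-8, 7]."""
    rows, cols = w.shape
    assert cols % group_size == 0
    g = w.reshape(rows, cols // group_size, group_size)
    scale = np.abs(g).max(axis=-1, keepdims=True) / 7.0   # assumed scale rule
    scale = np.where(scale == 0, 1.0, scale)              # guard all-zero groups
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q.reshape(rows, cols), scale.squeeze(-1)

def pack_int4(q):
    """Pack eight 4-bit values into each int32 word (assumed low-nibble-first order)."""
    rows, cols = q.shape
    assert cols % 8 == 0
    u = (q.astype(np.int32) + 8).astype(np.uint32).reshape(rows, cols // 8, 8)
    shifts = (4 * np.arange(8)).astype(np.uint32)
    return np.bitwise_or.reduce(u << shifts, axis=-1).view(np.int32)

def unpack_int4(packed):
    """Inverse of pack_int4."""
    u = packed.view(np.uint32)[..., None]
    shifts = (4 * np.arange(8)).astype(np.uint32)
    nibbles = (u >> shifts) & np.uint32(0xF)
    return (nibbles.astype(np.int8) - 8).reshape(packed.shape[0], -1)

def dequantize(q, scale, group_size=GROUP):
    rows, cols = q.shape
    g = q.reshape(rows, cols // group_size, group_size).astype(np.float64)
    return (g * scale[..., None]).reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256))      # stand-in weights, not real model data
q, scale = quantize_int4_symmetric(w)
packed = pack_int4(q)
w_hat = dequantize(unpack_int4(packed), scale)
print(packed.shape, packed.dtype, float(np.abs(w - w_hat).max()))
```

Each int32 word holds eight weights, so a row of 256 weights packs into 32 words plus two scales (one per group of 128).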


Benchmarks

Perplexity (WikiText-2, Full Test Set)

Evaluated over the complete WikiText-2 test set (1,155 windows, 295,680 tokens):

| Windows | Tokens Scored | PPL |
|---|---|---|
| 200 | 51,200 | 6.58 |
| 500 | 128,000 | 6.41 |
| 1,000 | 256,000 | 6.58 |
| 1,155 | 295,680 | 6.56 |

PPL converges to 6.56 on full WikiText-2. We have not measured the BF16 baseline PPL for this specific MoE — we can only say this is a low-PPL result on the benchmark, without a direct BF16 comparison.
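
For reference, perplexity over a windowed evaluation is the exponential of the mean per-token negative log-likelihood across all scored tokens. The sketch below shows that aggregation. The 256 scored tokens per window follow from the table (51,200 / 200); how the windows were strided and how much context each one saw is not stated in this card, and the constant NLL values are synthetic placeholders rather than model outputs.

```python
import math

def perplexity(window_nlls):
    """exp(mean NLL) over every scored token in every window."""
    total_nll = 0.0
    total_tokens = 0
    for nlls in window_nlls:
        total_nll += sum(nlls)
        total_tokens += len(nlls)
    return math.exp(total_nll / total_tokens)

TOKENS_PER_WINDOW = 256                 # 51,200 tokens / 200 windows
NUM_WINDOWS = 1155
total_scored = NUM_WINDOWS * TOKENS_PER_WINDOW

# Synthetic check: if every token had NLL = ln(6.56), the PPL would be 6.56.
fake_windows = [[math.log(6.56)] * TOKENS_PER_WINDOW for _ in range(NUM_WINDOWS)]
ppl = perplexity(fake_windows)
print(total_scored, round(ppl, 2))
```

Averaging over tokens rather than over per-window perplexities keeps the result independent of how the tokens are grouped into windows.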

Generation Speed

Tested on NVIDIA RTX PRO 6000 Blackwell (102 GB VRAM):

| Config | tok/s | Load Time | VRAM |
|---|---|---|---|
| INT4 CT (vLLM native) | 23.6 | ~165s | ~25 GB |
| INT4 CT + LFRU cache | 37.4 | ~315s | ~8 GB |
| BF16 Original | 16.2 | ~120s | ~72 GB |

The LFRU expert cache is faster because only 8 hot experts are in GPU — better memory locality, less fragmentation across 256 experts.

2.3x faster than BF16 with 3.7x smaller download and 9x less VRAM.
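
The headline ratios can be reproduced directly from the table values; this small check uses only numbers reported in this card, with the approximate VRAM figures taken at face value.

```python
# Values copied from the Generation Speed and Benchmarks tables.
tok_per_s = {"int4": 23.6, "int4_lfru": 37.4, "bf16": 16.2}
download_gb = {"int4": 19.5, "bf16": 72.0}
vram_gb = {"int4_lfru": 8.0, "bf16": 72.0}

speedup_lfru_vs_bf16 = tok_per_s["int4_lfru"] / tok_per_s["bf16"]
speedup_int4_vs_bf16 = tok_per_s["int4"] / tok_per_s["bf16"]
speedup_lfru_vs_int4 = tok_per_s["int4_lfru"] / tok_per_s["int4"]
download_ratio = download_gb["bf16"] / download_gb["int4"]
vram_ratio = vram_gb["bf16"] / vram_gb["int4_lfru"]

print(f"{speedup_lfru_vs_bf16:.2f}x, {speedup_int4_vs_bf16:.2f}x, "
      f"{speedup_lfru_vs_int4:.2f}x, {download_ratio:.2f}x, {vram_ratio:.1f}x")
```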


HLWQ Ecosystem

All models load natively in vLLM — no plugins required:

| Model | Size | Speed | PPL |
|---|---|---|---|
| Qwopus-MoE-35B INT4 | 19.5 GB | 37.4 tok/s | 6.56 |
| Qwen3.5-27B INT4 | 16.2 GB | 18.0 tok/s | |
| Harmonic-27B INT4 | 16.2 GB | 18.0 tok/s | |
| Qwopus3.5-27B INT4 | 16.2 GB | 18.0 tok/s | |
| Qwen3.5-9B INT4 | 6.5 GB | 168.4 tok/s | 6.56 |
| Qwopus3.5-9B INT4 | 6.5 GB | 168.4 tok/s | |

GPU Requirements

| GPU | VRAM | Config | Expected Speed |
|---|---|---|---|
| RTX 4090 | 24 GB | LFRU cache (--moe-expert-cache-size 4) | ~30 tok/s |
| RTX 3090 | 24 GB | LFRU cache (--moe-expert-cache-size 4) | ~15 tok/s |
| RTX PRO 6000 | 102 GB | LFRU cache (8 experts) | 37.4 tok/s |
| H100 80GB | 80 GB | All-in-GPU or LFRU cache | ~25 tok/s |
| A100 80GB | 80 GB | Use --enforce-eager | Known Marlin MoE bug (#35922) |

Expert LFRU offloading via vllm-expert-offload enables running this 35B MoE on consumer GPUs (RTX 4090, 3090). Only hot experts stay in GPU, rest on CPU. Counterintuitively, this is faster than all-in-GPU due to better memory locality.

Required Flags

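| Flag | Why |
|---|---|
| --language-model-only | Skips vision encoder (4304 dim not Marlin-compatible). Older vLLM builds only; the flag is removed in current vLLM (see Validated Deploy) |
| --enforce-eager | Required for Blackwell and expert cache; recommended for stability |
| --moe-expert-cache-size N | Keep N experts in GPU, rest on CPU (fork) |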

Technical Notes

  • Expert format: Per-expert 2D tensors (experts.{N}.gate_proj.weight_packed), not 3D stacked. vLLM FusedMoE handles stacking internally.
  • Reference format: Follows RedHatAI/Qwen3-30B-A3B-quantized.w4a16 conventions.
  • Marlin kernel: Fused INT4 dequant + matmul for maximum throughput.
  • Vision weights: Excluded (use --language-model-only). The config retains vision_config for architecture resolution.
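
The per-expert layout in the first note can be pictured with a short NumPy sketch that gathers the 2D tensors of one projection into a single 3D array. This is not vLLM's FusedMoE loading code: the layer prefix, the tensor shapes, and the `stack_experts` helper are hypothetical, and only the `experts.{N}.gate_proj.weight_packed` naming pattern comes from this card.

```python
import numpy as np

def stack_experts(state, layer_prefix, proj, num_experts):
    """Stack per-expert 2D tensors into one (num_experts, rows, cols) array."""
    return np.stack([
        state[f"{layer_prefix}.experts.{n}.{proj}.weight_packed"]
        for n in range(num_experts)
    ])

# Tiny stand-in checkpoint: 4 experts with small packed int32 matrices.
prefix = "layers.0.mlp"                     # hypothetical layer prefix
state = {
    f"{prefix}.experts.{n}.gate_proj.weight_packed": np.full((8, 2), n, dtype=np.int32)
    for n in range(4)
}
stacked = stack_experts(state, prefix, "gate_proj", num_experts=4)
print(stacked.shape)
```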

Citation

@article{vicentino2026polarquant,
  title={HLWQ: Hadamard-Rotated Post-Training Quantization},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}