Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is only the name; the algorithm and the weights in this repository are unchanged.
The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ (formerly PolarQuant) addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.
Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).
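For intuition, here is a toy sketch of the two ingredients the new name refers to: a normalized Walsh-Hadamard rotation that Gaussianizes the weights, followed by a Lloyd-Max scalar codebook fit to the rotated values. This is an illustration of the general idea only, not the repository's actual quantization code; the matrix size and the 16-level (INT4) codebook are assumptions.

```python
import numpy as np

def hadamard(n):
    # Orthonormal Walsh-Hadamard matrix (n must be a power of two).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max(samples, levels=16, iters=50):
    # 1-D Lloyd's algorithm: alternate boundary and centroid updates.
    centers = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        bounds = (centers[:-1] + centers[1:]) / 2
        idx = np.digitize(samples, bounds)
        for k in range(levels):
            cell = samples[idx == k]
            if len(cell):
                centers[k] = cell.mean()
    return centers

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
H = hadamard(64)
Wr = H @ W                                   # deterministic rotation
codebook = lloyd_max(Wr.ravel(), levels=16)  # 16 levels = 4 bits
q = codebook[np.abs(Wr.ravel()[:, None] - codebook).argmin(axis=1)].reshape(Wr.shape)
W_hat = H.T @ q                              # undo the rotation at dequant time
err = np.mean((W - W_hat) ** 2)              # rotation is orthonormal, so this
                                             # equals the Lloyd-Max MSE on Wr
```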
Qwopus-MoE-35B-A3B — HLWQ INT4 CompressedTensors
35B Mixture-of-Experts quantized to INT4 that loads natively in vLLM with the Marlin kernel. No plugins or custom code are needed for the base path; expert offloading uses an optional fork (see below).
| Metric | INT4 CT | INT4 + LFRU Cache | BF16 Original |
|---|---|---|---|
| Perplexity | 6.56 | 6.56 | not measured |
| Speed | 23.6 tok/s | 37.4 tok/s | 16.2 tok/s |
| Download | 19.5 GB | 19.5 GB | 72 GB |
| VRAM | ~25 GB | ~8 GB | ~72 GB |
| vs BF16 | 1.46x faster | 2.3x faster | baseline |
Quick Start
Standard (vLLM native)
```shell
pip install vllm

vllm serve caiovicentino1/Qwopus-MoE-35B-A3B-HLWQ-Q5 \
  --language-model-only \
  --enforce-eager \
  --max-model-len 4096
```
Note: `--language-model-only` exists only in older vLLM builds; see the Validated Deploy section below for the current invocation.
With Expert Offloading (fits RTX 4090!)
```shell
pip install git+https://github.com/caiovicentino/vllm-expert-offload.git

vllm serve caiovicentino1/Qwopus-MoE-35B-A3B-HLWQ-Q5 \
  --language-model-only \
  --enforce-eager \
  --moe-expert-cache-size 8 \
  --max-model-len 4096
```
37.4 tok/s with LFRU expert cache — only 8 hot experts in GPU, rest on CPU. 1.58x faster than all-in-GPU and fits in ~8 GB VRAM.
Python API
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="caiovicentino1/Qwopus-MoE-35B-A3B-HLWQ-Q5",
    trust_remote_code=True,
    enforce_eager=True,
    max_model_len=4096,
)

output = llm.generate(
    ["Explain quantum computing in simple terms."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(output[0].outputs[0].text)
```
Validated Deploy (April 2026, updated vLLM)
The original `--language-model-only` flag has been removed from recent vLLM builds. Current vLLM (0.17.1rc1+) requires a slightly different invocation because the model config declares `Qwen3_5MoeForConditionalGeneration` (a multimodal class) even though this repo ships text-only weights.
Working command (current vLLM + fork feature branch)
```shell
pip install git+https://github.com/caiovicentino/vllm-expert-offload.git@feat-ct-moe-wna16-marlin-offload

vllm serve caiovicentino1/Qwopus-MoE-35B-A3B-HLWQ-Q5 \
  --served-model-name qwopus-ct \
  --moe-expert-cache-size 128 \
  --enforce-eager \
  --trust-remote-code \
  --max-model-len 4096 \
  --limit-mm-per-prompt '{"image":0,"video":0}' \
  --gpu-memory-utilization 0.90
```
Key flag changes vs older instructions:
- `--language-model-only` → removed (flag no longer exists in current vLLM).
- `--limit-mm-per-prompt '{"image":0,"video":0}'` → new; tells vLLM not to allocate dummy image tokens in the profile run. Without this the engine tries to push a fake image through a vision encoder that has no weights and crashes during KV-cache sizing.
- `git+.../vllm-expert-offload.git@feat-ct-moe-wna16-marlin-offload` → feature branch with:
  - opt-in expert LRU/LFRU cache for `CompressedTensorsWNA16MarlinMoEMethod` (the first CT INT4 MoE method in the fork with expert offload support)
  - disk-backed backing store via `torch.from_file(shared=True)` — zero pinned CPU RAM; the OS page cache handles hot pages
  - correct overflow fallback for `top_k=8` models (the old truncation produced silent garbage when unique experts exceeded the cache size)
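The disk-backed backing store can be sketched with the real `torch.from_file` API: with `shared=True` the tensor's storage is a memory-mapped file, so hot pages live in the OS page cache rather than in pinned host RAM. The sizes, file path, and layout below are hypothetical, not the fork's actual implementation.

```python
import os
import tempfile

import torch

# Hypothetical sizes; the real fork maps each expert's packed INT4 weights.
n_experts, elems_per_expert = 4, 1024

# Back the expert table with a file: shared=True mmaps it, so pages are served
# from the OS page cache instead of occupying pinned CPU RAM.
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
store = torch.from_file(
    path, shared=True, size=n_experts * elems_per_expert, dtype=torch.int32
).view(n_experts, elems_per_expert)

# Writes go straight through the mapping to the backing file.
store[2].fill_(7)

# Fetching an expert to GPU would be store[i].to("cuda", non_blocking=True);
# this CPU-only sketch just copies it out.
hot = store[2].clone()
```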
Cache size tuning
This model is top_k=8 with 256 experts, so a single prefill forward can activate 40–60 unique experts. The cache policy is per-forward:
- `--moe-expert-cache-size 128` — recommended. All hot experts fit; ~24 tok/s on RTX PRO 6000, ~8 GB persistent LRU buffer on GPU.
- `--moe-expert-cache-size 64` — tighter GPU footprint; ~20 tok/s, occasional overflow fallbacks.
- `--moe-expert-cache-size 8` — minimum footprint; ~10 tok/s, every forward hits the overflow fallback (correct, but copies CPU→GPU on every call). Useful on 12 GB VRAM if you accept the speed cost.

For top-k=2 models (Nemotron, Gemma-4-26B-A4B), `--moe-expert-cache-size 8` is fast because they activate only ~4–8 unique experts per forward.
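The per-forward policy above can be sketched as a toy cache with the overflow fallback: residents are reused, cold experts fill free slots or evict a resident this forward does not need, and when every resident is needed the extra experts are streamed from CPU instead of being silently dropped. The class and counters are hypothetical, not the fork's API.

```python
from collections import OrderedDict

class ExpertCache:
    """Toy LRU expert cache with an overflow fallback (names hypothetical)."""

    def __init__(self, cpu_store, capacity):
        self.cpu = cpu_store          # expert_id -> weights (CPU/disk resident)
        self.cap = capacity
        self.gpu = OrderedDict()      # expert_id -> weights ("GPU" residents)
        self.hits = self.misses = self.overflow = 0

    def forward(self, expert_ids):
        unique = dict.fromkeys(expert_ids)   # dedup, preserve routing order
        out = {}
        for eid in unique:
            if eid in self.gpu:              # hit: already resident
                self.gpu.move_to_end(eid)
                self.hits += 1
            elif len(self.gpu) < self.cap:   # miss with a free slot: upload
                self.gpu[eid] = self.cpu[eid]
                self.misses += 1
            else:
                # Evict the least-recent resident NOT needed this forward.
                victim = next((v for v in self.gpu if v not in unique), None)
                if victim is not None:
                    del self.gpu[victim]
                    self.gpu[eid] = self.cpu[eid]
                    self.misses += 1
                else:
                    # Every resident is needed: stream from CPU, don't cache.
                    # (Truncating here is the old silent-garbage bug.)
                    self.overflow += 1
                    out[eid] = self.cpu[eid]
                    continue
            out[eid] = self.gpu[eid]
        return out

cpu = {i: f"expert-{i}" for i in range(256)}  # stand-in for expert weights
cache = ExpertCache(cpu, capacity=8)
cache.forward([3, 7, 3, 11])  # 3 unique experts, all cold: 3 misses
cache.forward([3, 7])         # both resident now: 2 hits
```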
Repo files added for vLLM loading
- `preprocessor_config.json` — stub pulled from `Qwen/Qwen2-VL-7B-Instruct`. vLLM's Qwen3-VL processor init requires this file even when images are disabled; it is never actually invoked at text inference.
- `tokenizer_config.json` — `tokenizer_class` set to `Qwen2TokenizerFast` (was `PreTrainedTokenizerFast`), required by the Qwen3-VL processor class check.
Validated on
NVIDIA RTX PRO 6000 Blackwell Server Edition (102 GB VRAM, 176 GB host RAM, 236 GB SSD). Disk usage during deploy: ~20 GB HF cache + ~31 GB disk-backed expert cache = ~51 GB peak. Host RAM stable at ~7 GB used, ~167 GB in OS page cache (the disk-backed mmap pages, reclaimable). Outputs coherent on 3-prompt smoke test at cache sizes 8, 64, 128. Informal PPL sanity check: 9.6 on a 357-token wikitext-style sample with cache=8 (the authoritative PPL 6.56 above is from the full WikiText-2 evaluation with a different config).
Architecture
| Property | Value |
|---|---|
| Base Model | samuelcardillo/Qwopus-MoE-35B-A3B |
| Architecture | Qwen3.5 MoE |
| Total Parameters | 35B |
| Active per Token | ~3B (8 of 256 experts) |
| Attention | Hybrid — self-attention + linear attention (Mamba-style SSM) |
| Layers | 28 transformer layers |
Quantization Details
| Component | Format | Details |
|---|---|---|
| Expert weights (gate/up/down) | INT4 | symmetric, group_size=128, pack-quantized |
| Shared expert (gate/up/down) | INT4 | Same format |
| Attention (q/k/v/o_proj) | INT4 | Standard Linear layers |
| Linear attention (in_proj_qkv/z, out_proj) | INT4 | SSM projections |
| LayerNorms, norms | BF16 | Not quantized |
| SSM params (A_log, conv1d, dt_bias) | BF16 | Not quantized |
| MoE router (gate.weight) | BF16 | Critical for routing |
| lm_head, embed_tokens | BF16 | Embedding layers |
| in_proj_a, in_proj_b | BF16 | Small dims |
Format: CompressedTensors pack-quantized — industry standard, native vLLM support via Marlin kernel.
Pipeline: Base BF16 → INT4 symmetric group128 → pack INT32 → CompressedTensors safetensors
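The first two pipeline stages can be sketched in a few lines: symmetric per-group quantization (one scale per 128 weights, 4-bit codes) followed by packing eight 4-bit codes into each INT32. This is a minimal illustration; the actual CompressedTensors `weight_packed` layout may order nibbles differently.

```python
import numpy as np

def quant_int4_sym(w, group_size=128):
    # Symmetric per-group INT4: one scale per group, codes in [-7, 7].
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int32)
    return q.reshape(w.shape), scale

def pack_int32(q):
    # Eight 4-bit two's-complement codes per int32, low nibble first.
    nibbles = (q.reshape(-1, 8) & 0xF).astype(np.uint32)
    shifts = np.arange(8, dtype=np.uint32) * 4
    return np.bitwise_or.reduce(nibbles << shifts, axis=1).view(np.int32)

def unpack_int32(packed, n):
    u = packed.view(np.uint32)[:, None] >> (np.arange(8, dtype=np.uint32) * 4)
    codes = (u & 0xF).astype(np.int32)
    codes[codes >= 8] -= 16          # sign-extend the 4-bit field
    return codes.reshape(-1)[:n]

rng = np.random.default_rng(0)
w = rng.normal(size=(2, 256)).astype(np.float32)
q, scale = quant_int4_sym(w)
packed = pack_int32(q)               # 8x fewer int32 words than codes
w_hat = (unpack_int32(packed, q.size).reshape(-1, 128) * scale).reshape(w.shape)
```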
Benchmarks
Perplexity (WikiText-2, Full Test Set)
Evaluated over the complete WikiText-2 test set (1,155 windows, 295,680 tokens):
| Windows | Tokens Scored | PPL |
|---|---|---|
| 200 | 51,200 | 6.58 |
| 500 | 128,000 | 6.41 |
| 1,000 | 256,000 | 6.58 |
| 1,155 | 295,680 | 6.56 |
PPL converges to 6.56 on full WikiText-2. We have not measured the BF16 baseline PPL for this specific MoE — we can only say this is a low-PPL result on the benchmark, without a direct BF16 comparison.
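The window counts above are consistent with non-overlapping windows of 295,680 / 1,155 = 256 tokens. A minimal sketch of the running-PPL bookkeeping, where running PPL after k windows is exp(mean NLL over all tokens scored so far); the per-token NLLs would come from the model's log-probs on WikiText-2, and the constant stand-in here is only to make the sketch runnable.

```python
import numpy as np

def running_ppl(token_nlls, window=256):
    # Split the per-token NLL stream into fixed windows and report the
    # running PPL = exp(total NLL so far / tokens scored so far).
    n = len(token_nlls) // window
    scored = np.asarray(token_nlls[: n * window], dtype=np.float64)
    cum_nll = np.cumsum(scored.reshape(n, window).sum(axis=1))
    tokens = np.arange(1, n + 1) * window
    return np.exp(cum_nll / tokens)   # entry k-1 = PPL after k windows

# Constant NLL of log(6.56) nats per token gives PPL exactly 6.56 at
# every window count, mimicking the converged table above.
nll = np.full(1024, np.log(6.56))
ppl = running_ppl(nll)
```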
Generation Speed
Tested on NVIDIA RTX PRO 6000 Blackwell (102 GB VRAM):
| Config | tok/s | Load Time | VRAM |
|---|---|---|---|
| INT4 CT (vLLM native) | 23.6 | ~165s | ~25 GB |
| INT4 CT + LFRU cache | 37.4 | ~315s | ~8 GB |
| BF16 Original | 16.2 | ~120s | ~72 GB |
The LFRU expert cache is faster because only 8 hot experts are in GPU — better memory locality, less fragmentation across 256 experts.
2.3x faster than BF16 with 3.7x smaller download and 9x less VRAM.
HLWQ Ecosystem
All models load natively in vLLM — no plugins required:
| Model | Size | Speed | PPL |
|---|---|---|---|
| Qwopus-MoE-35B INT4 | 19.5 GB | 37.4 tok/s | 6.56 |
| Qwen3.5-27B INT4 | 16.2 GB | 18.0 tok/s | — |
| Harmonic-27B INT4 | 16.2 GB | 18.0 tok/s | — |
| Qwopus3.5-27B INT4 | 16.2 GB | 18.0 tok/s | — |
| Qwen3.5-9B INT4 | 6.5 GB | 168.4 tok/s | 6.56 |
| Qwopus3.5-9B INT4 | 6.5 GB | 168.4 tok/s | — |
GPU Requirements
| GPU | VRAM | Config | Expected Speed |
|---|---|---|---|
| RTX 4090 | 24 GB | LFRU cache (`--moe-expert-cache-size 4`) | ~30 tok/s |
| RTX 3090 | 24 GB | LFRU cache (`--moe-expert-cache-size 4`) | ~15 tok/s |
| RTX PRO 6000 | 102 GB | LFRU cache (8 experts) | 37.4 tok/s |
| H100 80GB | 80 GB | All-in-GPU or LFRU cache | ~25 tok/s |
| A100 80GB | 80 GB | Use `--enforce-eager` | Known Marlin MoE bug (#35922) |
Expert LFRU offloading via vllm-expert-offload enables running this 35B MoE on consumer GPUs (RTX 4090, 3090). Only hot experts stay in GPU, rest on CPU. Counterintuitively, this is faster than all-in-GPU due to better memory locality.
Required Flags
| Flag | Why |
|---|---|
| `--language-model-only` | Skips the vision encoder (4304 dim is not Marlin-compatible); removed in current vLLM, see Validated Deploy above |
| `--enforce-eager` | Required for Blackwell and the expert cache; recommended for stability |
| `--moe-expert-cache-size N` | Keeps N experts in GPU, rest on CPU (fork only) |
Technical Notes
- Expert format: per-expert 2D tensors (`experts.{N}.gate_proj.weight_packed`), not 3D stacked; vLLM FusedMoE handles the stacking internally.
- Reference format: follows `RedHatAI/Qwen3-30B-A3B-quantized.w4a16` conventions.
- Marlin kernel: fused INT4 dequant + matmul for maximum throughput.
- Vision weights: excluded (use `--language-model-only` on older vLLM). The config retains `vision_config` for architecture resolution.
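The stacking the first note describes can be sketched as follows, using the checkpoint key pattern from the note. Shapes are illustrative and vLLM's internal FusedMoE loader is not reproduced here; the point is only that per-expert 2D tensors become one 3D tensor indexed by expert id.

```python
import torch

# Per-expert 2D checkpoint tensors, keyed as in the note; shapes illustrative.
n_experts, rows, cols = 4, 8, 16
state = {
    f"experts.{n}.gate_proj.weight_packed": torch.full((rows, cols), n, dtype=torch.int32)
    for n in range(n_experts)
}

# Stack into the [num_experts, rows, cols] layout a fused MoE kernel indexes
# by expert id.
stacked = torch.stack(
    [state[f"experts.{n}.gate_proj.weight_packed"] for n in range(n_experts)],
    dim=0,
)
```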
Links
- Paper: HLWQ — Hadamard-Rotated Post-Training Quantization
- GitHub: polarengine-vllm
- Expert Offloading Fork: vllm-expert-offload — LFRU expert cache for consumer GPUs
- Base Model: samuelcardillo/Qwopus-MoE-35B-A3B
- Reference Format: RedHatAI/Qwen3-30B-A3B-quantized.w4a16
Citation
```bibtex
@article{vicentino2026polarquant,
  title   = {HLWQ: Hadamard-Rotated Post-Training Quantization},
  author  = {Vicentino, Caio},
  journal = {arXiv preprint arXiv:2603.29078},
  year    = {2026}
}
```