# 🚀 Qwen3.6-35B-A3B — HLWQ CT INT4 (vLLM-ready)

CompressedTensors INT4 quantization of Qwen/Qwen3.6-35B-A3B via HLWQ.

Run a 35B, 256-expert MoE on an RTX 3060 (12 GB) with expert offloading.

## 📊 Compression

| Metric | Value |
|---|---|
| 📦 Format | CompressedTensors INT4 symmetric (group size 128) |
| 💾 Model size | 19.43 GB (5 shards, 62,303 tensors) |
| 📉 Compression | 72% (70.2 GB → 19.4 GB) |
| ⚡ Kernel | Marlin (fused dequant + matmul) |
| 🧩 Expert keys | Per-expert 2D (vLLM-native) |
| 🔢 Breakdown | 30,720 expert INT4 + 250 other linear INT4 + 363 BF16 tensors |

## 🏎️ Quick Start

```bash
pip install git+https://github.com/caiovicentino/vllm-expert-offload.git

vllm serve caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4 \
  --language-model-only \
  --enforce-eager \
  --moe-expert-cache-size 8
```
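Once the server is up, it exposes vLLM's OpenAI-compatible API (by default on port 8000). A minimal stdlib-only sketch of a chat-completions request; the endpoint path and port are vLLM defaults, and the actual call is left commented out since it needs the server running:

```python
import json
import urllib.request

# OpenAI-compatible chat-completions payload for the vLLM server.
# The model name must match the served repo id.
payload = {
    "model": "caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4",
    "messages": [
        {"role": "user", "content": "Explain INT4 quantization in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once `vllm serve` is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```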

## 💻 GPU Compatibility

| GPU | VRAM | Expert cache | Status |
|---|---|---|---|
| RTX PRO 6000 | 96 GB | all-in | ✅ Full speed |
| A100 / H100 | 80 GB | all-in | ✅ Full speed |
| RTX 4090 | 24 GB | cache=4 | ✅ ~4 GB |
| RTX 4070 Ti | 16 GB | cache=3 | ✅ ~3.5 GB |
| RTX 3060 | 12 GB | cache=2 | ✅ ~3 GB |
| RTX 3050 | 8 GB | cache=1 | ⚠️ Tight |
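As a back-of-envelope check on why each extra cached layer costs under half a gigabyte, here is an estimate using the sizes from the Architecture table below (hidden 2048, expert intermediate 512, 256 experts per layer). This is an assumption-laden estimate, not a measurement of vllm-expert-offload's actual allocator behavior:

```python
# Rough VRAM for one layer's experts in INT4 (0.5 bytes/param) plus
# BF16 group scales (gs=128). Sizes come from the Architecture table.
hidden = 2048
intermediate = 512
experts_per_layer = 256

# gate_proj + up_proj + down_proj, each hidden x intermediate
params_per_expert = 3 * hidden * intermediate       # 3,145,728
int4_bytes = params_per_expert / 2                  # 4 bits per weight
scale_bytes = (params_per_expert / 128) * 2         # one BF16 scale per 128 weights

per_layer_gb = experts_per_layer * (int4_bytes + scale_bytes) / 1024**3
print(f"~{per_layer_gb:.2f} GB of INT4 expert weights per cached layer")
```

That ~0.4 GB per layer of experts is consistent with the roughly half-gigabyte step between adjacent cache sizes in the table above.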

## 🏎️ Benchmarks

Speed benchmark measured on an RTX PRO 6000 Blackwell (96 GB) in a notebook environment. vLLM + Marlin serving will be significantly faster than these notebook numbers.

## 🧬 Architecture

| Spec | Value |
|---|---|
| Layers | 40 (30 GDN + 10 full attention) |
| Experts | 256 per layer (8 routed + 1 shared) |
| Hidden size | 2048 |
| Expert intermediate size | 512 |
| Context length | 262,144 tokens |

## 📋 Coverage

Per-expert 2D keys (`gate_up_proj` split into separate `gate_proj` + `up_proj`):

```
model.layers.{L}.mlp.experts.{E}.gate_proj.weight_packed   INT4 → int32
model.layers.{L}.mlp.experts.{E}.gate_proj.weight_scale    BF16
model.layers.{L}.mlp.experts.{E}.up_proj.weight_packed     INT4 → int32
model.layers.{L}.mlp.experts.{E}.up_proj.weight_scale      BF16
model.layers.{L}.mlp.experts.{E}.down_proj.weight_packed   INT4 → int32
model.layers.{L}.mlp.experts.{E}.down_proj.weight_scale    BF16
```

Preserved in BF16: norms, GDN gates (`in_proj_a/b`), routers, `A_log`, `conv1d`, `dt_bias`, embeddings, `lm_head`.
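The 30,720 expert-tensor count from the Compression table falls directly out of this key template: 40 layers × 256 experts × 3 projections, each contributing one `weight_packed` (with a matching `weight_scale`). A small sketch enumerating the packed keys:

```python
# Reproduce the 30,720 expert-tensor count from the key template above:
# 40 layers x 256 experts x 3 projections (gate/up/down).
layers, experts = 40, 256
projections = ("gate_proj", "up_proj", "down_proj")

packed_keys = [
    f"model.layers.{L}.mlp.experts.{E}.{proj}.weight_packed"
    for L in range(layers)
    for E in range(experts)
    for proj in projections
]
print(len(packed_keys))  # 30720
```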

## 🔬 Pipeline

```
BF16 (70.2 GB)
    │
    ▼
[1] HLWQ Q5: Hadamard rotation + Lloyd-Max 5-bit
    │  (better weight distribution before INT4)
    ▼
[2] PQ5 dequant → BF16
    │
    ▼
[3] INT4 symmetric (gs=128): scale = absmax / 7
    │
    ▼
[4] Pack 8 × INT4 → int32 (CompressedTensors)
    │
    ▼
CT INT4 (19.4 GB) → Marlin → vLLM serve
```
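Steps [3] and [4] can be sketched in plain Python. This is a toy illustration of symmetric INT4 with `scale = absmax / 7` and nibble-packing eight values per 32-bit word; the real CompressedTensors layout and Marlin repacking are more involved:

```python
# Toy sketch of steps [3]-[4]: symmetric INT4 quantization of one group,
# then packing eight 4-bit values into each 32-bit word.
# Real CompressedTensors/Marlin layouts differ; this only shows the math.

def quantize_group(weights):
    """Symmetric INT4: scale = absmax / 7, so q lands in [-7, 7]."""
    scale = max(abs(w) for w in weights) / 7
    return [round(w / scale) for w in weights], scale

def pack_int4(q):
    """Pack 8 signed 4-bit values into one int32 (low nibble first)."""
    words = []
    for i in range(0, len(q), 8):
        word = 0
        for j, v in enumerate(q[i:i + 8]):
            word |= (v & 0xF) << (4 * j)   # two's-complement nibble
        words.append(word)
    return words

group = [0.9, -0.35, 0.12, -0.7, 0.02, 0.44, -0.15, 0.6]
q, scale = quantize_group(group)
packed = pack_int4(q)

# Round-trip: unpack, sign-extend, dequantize
unpacked = [(packed[0] >> (4 * j)) & 0xF for j in range(8)]
deq = [(v - 16 if v >= 8 else v) * scale for v in unpacked]
print(max(abs(a - b) for a, b in zip(group, deq)))  # always <= scale / 2
```

Because rounding moves each value by at most half a step and the group absmax maps exactly to 7, the per-weight error is bounded by `scale / 2`.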

## 📖 Citation

```bibtex
@misc{hlwq2026,
  title={HLWQ: Hadamard-Lloyd Weight Quantization for Large Language Models},
  author={Caio Vicentino},
  year={2026},
  url={https://arxiv.org/abs/2603.29078}
}
```

## 🔗 Links

| Resource | Link |
|---|---|
| 📄 Paper | arXiv:2603.29078 |
| 🔧 Code | GitHub |
| 📦 PyPI | pip install polarquant |
| 🗂️ Q5 codes | Qwen3.6-35B-A3B-HLWQ-Q5 |
| 🔀 Expert Offload | vllm-expert-offload |
| 🏠 Base model | Qwen/Qwen3.6-35B-A3B |