# 🚀 Qwen3.6-35B-A3B — HLWQ CT INT4 (vLLM-ready)

CompressedTensors INT4 quantization of Qwen/Qwen3.6-35B-A3B via HLWQ.

Run a 35B, 256-expert MoE on an RTX 3060 (12 GB) with expert offloading.

## 📊 Compression

| Metric | Value |
|---|---|
| 📦 Format | CompressedTensors INT4 symmetric (group size 128) |
| 💾 Model size | 19.43 GB (5 shards, 62,303 tensors) |
| 📉 Compression | 72% (70.2 GB → 19.4 GB) |
| ⚡ Kernel | Marlin (fused dequant + matmul) |
| 🧩 Expert keys | Per-expert 2D (vLLM-native) |
| 🔢 Breakdown | 30,720 expert INT4 + 250 other linear INT4 + 363 BF16 tensors |

## 🏎️ Quick Start

```bash
pip install git+https://github.com/caiovicentino/vllm-expert-offload.git

vllm serve caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4 \
  --language-model-only \
  --enforce-eager \
  --moe-expert-cache-size 8
```
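Once the server is up, it exposes vLLM's OpenAI-compatible API (by default on port 8000). A minimal stdlib-only sketch of a chat-completions request; the endpoint path and port are vLLM defaults, and the actual call is left commented out since it needs the server running:

```python
import json
import urllib.request

# OpenAI-compatible chat-completions payload for the vLLM server.
# The model name must match the served repo id.
payload = {
    "model": "caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4",
    "messages": [
        {"role": "user", "content": "Explain INT4 quantization in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once `vllm serve` is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```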

## 💻 GPU Compatibility

| GPU | VRAM | Expert cache | Status |
|---|---|---|---|
| RTX PRO 6000 | 96 GB | all-in | ✅ Full speed |
| A100 / H100 | 80 GB | all-in | ✅ Full speed |
| RTX 4090 | 24 GB | cache=4 | ✅ ~4 GB |
| RTX 4070 Ti | 16 GB | cache=3 | ✅ ~3.5 GB |
| RTX 3060 | 12 GB | cache=2 | ✅ ~3 GB |
| RTX 3050 | 8 GB | cache=1 | ⚠️ Tight |
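As a back-of-envelope check on why each extra cached layer costs under half a gigabyte, here is an estimate using the sizes from the Architecture table below (hidden 2048, expert intermediate 512, 256 experts per layer). This is an assumption-laden estimate, not a measurement of vllm-expert-offload's actual allocator behavior:

```python
# Rough VRAM for one layer's experts in INT4 (0.5 bytes/param) plus
# BF16 group scales (gs=128). Sizes come from the Architecture table.
hidden = 2048
intermediate = 512
experts_per_layer = 256

# gate_proj + up_proj + down_proj, each hidden x intermediate
params_per_expert = 3 * hidden * intermediate       # 3,145,728
int4_bytes = params_per_expert / 2                  # 4 bits per weight
scale_bytes = (params_per_expert / 128) * 2         # one BF16 scale per 128 weights

per_layer_gb = experts_per_layer * (int4_bytes + scale_bytes) / 1024**3
print(f"~{per_layer_gb:.2f} GB of INT4 expert weights per cached layer")
```

That ~0.4 GB per layer of experts is consistent with the roughly half-gigabyte step between adjacent cache sizes in the table above.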

## 🏎️ Benchmarks

Speed benchmark measured on an RTX PRO 6000 Blackwell (96 GB) in a notebook environment. vLLM + Marlin serving will be significantly faster than these notebook numbers.

## 🧬 Architecture

| Spec | Value |
|---|---|
| Layers | 40 (30 GDN + 10 full attention) |
| Experts | 256 per layer (8 routed + 1 shared) |
| Hidden size | 2048 |
| Expert intermediate size | 512 |
| Context length | 262,144 tokens |

## 📋 Coverage

Per-expert 2D keys (`gate_up_proj` split into separate `gate_proj` + `up_proj`):

```
model.layers.{L}.mlp.experts.{E}.gate_proj.weight_packed   INT4 → int32
model.layers.{L}.mlp.experts.{E}.gate_proj.weight_scale    BF16
model.layers.{L}.mlp.experts.{E}.up_proj.weight_packed     INT4 → int32
model.layers.{L}.mlp.experts.{E}.up_proj.weight_scale      BF16
model.layers.{L}.mlp.experts.{E}.down_proj.weight_packed   INT4 → int32
model.layers.{L}.mlp.experts.{E}.down_proj.weight_scale    BF16
```

Preserved in BF16: norms, GDN gates (`in_proj_a/b`), routers, `A_log`, `conv1d`, `dt_bias`, embeddings, `lm_head`.
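The 30,720 expert-tensor count from the Compression table falls directly out of this key template: 40 layers × 256 experts × 3 projections, each contributing one `weight_packed` (with a matching `weight_scale`). A small sketch enumerating the packed keys:

```python
# Reproduce the 30,720 expert-tensor count from the key template above:
# 40 layers x 256 experts x 3 projections (gate/up/down).
layers, experts = 40, 256
projections = ("gate_proj", "up_proj", "down_proj")

packed_keys = [
    f"model.layers.{L}.mlp.experts.{E}.{proj}.weight_packed"
    for L in range(layers)
    for E in range(experts)
    for proj in projections
]
print(len(packed_keys))  # 30720
```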

## 🔬 Pipeline

```
BF16 (70.2 GB)
    │
    ▼
[1] HLWQ Q5: Hadamard rotation + Lloyd-Max 5-bit
    │  (better weight distribution before INT4)
    ▼
[2] PQ5 dequant → BF16
    │
    ▼
[3] INT4 symmetric (gs=128): scale = absmax / 7
    │
    ▼
[4] Pack 8 × INT4 → int32 (CompressedTensors)
    │
    ▼
CT INT4 (19.4 GB) → Marlin → vLLM serve
```
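Steps [3] and [4] can be sketched in plain Python. This is a toy illustration of symmetric INT4 with `scale = absmax / 7` and nibble-packing eight values per 32-bit word; the real CompressedTensors layout and Marlin repacking are more involved:

```python
# Toy sketch of steps [3]-[4]: symmetric INT4 quantization of one group,
# then packing eight 4-bit values into each 32-bit word.
# Real CompressedTensors/Marlin layouts differ; this only shows the math.

def quantize_group(weights):
    """Symmetric INT4: scale = absmax / 7, so q lands in [-7, 7]."""
    scale = max(abs(w) for w in weights) / 7
    return [round(w / scale) for w in weights], scale

def pack_int4(q):
    """Pack 8 signed 4-bit values into one int32 (low nibble first)."""
    words = []
    for i in range(0, len(q), 8):
        word = 0
        for j, v in enumerate(q[i:i + 8]):
            word |= (v & 0xF) << (4 * j)   # two's-complement nibble
        words.append(word)
    return words

group = [0.9, -0.35, 0.12, -0.7, 0.02, 0.44, -0.15, 0.6]
q, scale = quantize_group(group)
packed = pack_int4(q)

# Round-trip: unpack, sign-extend, dequantize
unpacked = [(packed[0] >> (4 * j)) & 0xF for j in range(8)]
deq = [(v - 16 if v >= 8 else v) * scale for v in unpacked]
print(max(abs(a - b) for a, b in zip(group, deq)))  # always <= scale / 2
```

Because rounding moves each value by at most half a step and the group absmax maps exactly to 7, the per-weight error is bounded by `scale / 2`.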

## 📖 Citation

```bibtex
@misc{hlwq2026,
  title={HLWQ: Hadamard-Lloyd Weight Quantization for Large Language Models},
  author={Caio Vicentino},
  year={2026},
  url={https://arxiv.org/abs/2603.29078}
}
```

## 🔗 Links

| Resource | Link |
|---|---|
| 📄 Paper | arXiv:2603.29078 |
| 🔧 Code | GitHub |
| 📦 PyPI | pip install polarquant |
| 🗂️ Q5 codes | Qwen3.6-35B-A3B-HLWQ-Q5 |
| 🔀 Expert Offload | vllm-expert-offload |
| 🏠 Base model | Qwen/Qwen3.6-35B-A3B |