Qwen3.6-35B-A3B AWQ 4-bit (native)

AWQ 4-bit quantization of Qwen3.6-35B-A3B with thinking + vision preserved, optimized for AMD RDNA4 (gfx1201) inference with SGLang.

Model Details

| Field | Value |
|---|---|
| Base model | Qwen/Qwen3.6-35B-A3B |
| Architecture | Qwen3.5 MoE+DeltaNet hybrid (256 experts, top-8) + vision tower |
| Parameters | 35B total / 3B active |
| Layers | 40 (mixed full-attention + DeltaNet linear-attention) |
| Context | 262K (native), 1M (YaRN-extended) |
| Modalities | text + image + video (no audio) |
| Quantization | Native AWQ 4-bit, group_size=128, fused Triton GEMM |
| Calibration | GPTQ via llmcompressor, 256 samples × 1024 tokens, thinking_vision recipe (AM-Thinking-v1, NuminaMath, LLaVA-Instruct, ultrachat); ignore = lm_head, in_proj_a/b, mlp.gate, shared_experts.*, vision_tower, visual.*, multi_modal_projector, embed_vision |
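
For reference, the calibration step maps onto llmcompressor roughly as sketched below. This is a minimal sketch, not the exact v1 recipe: the four-source dataset mix is elided (ultrachat stands in for all of them), and the arguments follow llmcompressor's published GPTQModifier/oneshot API. Note that oneshot saves compressed-tensors output; repacking into native AWQ format is a separate step (see Performance below).

```python
# Minimal sketch of the v1 calibration recipe. Assumptions: ultrachat stands in
# for the full four-source mix; llmcompressor's GPTQModifier/oneshot API.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",  # INT4 weights, group_size=128, 16-bit activations
    ignore=[
        "lm_head",
        "re:.*in_proj_a.*",       # DeltaNet projections stay BF16 (see Notes)
        "re:.*in_proj_b.*",
        "re:.*mlp.gate$",         # MoE router gates
        "re:.*shared_experts.*",  # v1 pattern; misses singular shared_expert.* (see Notes)
        "re:vision_tower.*",      # vision stack stays BF16 throughout
        "re:visual.*",
        "re:multi_modal_projector.*",
        "re:embed_vision.*",
    ],
)

oneshot(
    model="Qwen/Qwen3.6-35B-A3B",
    dataset="ultrachat_200k",     # stand-in for the four-source recipe mix
    recipe=recipe,
    max_seq_length=1024,          # 1024 tokens per sample
    num_calibration_samples=256,  # 256 samples
)
```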

Performance (2x AMD Radeon AI PRO R9700, TP=2)

sglang.bench_serving, single user, FP8 KV cache:

| Context | TPOT (ms) | tok/s |
|---:|---:|---:|
| 128 | 46.4 | 21.6 |
| 8192 | 46.3 | 21.6 |
| 32768 | 46.4 | 21.6 |
| 65536 | 47.1 | 21.2 |
| 131072 | 48.5 | 20.6 |

Decode holds flat at ~21 tok/s through 131K context: the A3B MoE is bandwidth-bound on its 3B active parameters, so there is no attention-scaling cliff.

6× faster than the compressed-tensors variant (mattbucci/Qwen3.6-35B-A3B-AWQ-CT measured 3.6 tok/s): same calibration weights, just repacked into native AWQ format so SGLang's fused Triton AWQ GEMM runs instead of the slow ROCm CompressedTensorsWNA16TritonMoE kernel.
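
The repack itself is mostly a nibble-reordering exercise. Purely as an illustration (the real compressed-tensors-to-AWQ conversion also transposes tensor layouts and converts scales and zero-points), native AWQ packs eight INT4 values per int32 in the interleaved order the AWQ kernels expect:

```python
# Illustration of AWQ's nibble interleave: 8 unsigned INT4 values packed into
# one int32 word. The full compressed-tensors -> AWQ repack also handles
# layouts, scales, and zero-points, all omitted here.
AWQ_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]  # source index for each nibble slot

def pack_awq_word(nibbles: list[int]) -> int:
    """Pack 8 unsigned INT4 values into one int32 in AWQ's interleaved order."""
    assert len(nibbles) == 8
    word = 0
    for slot, src in enumerate(AWQ_ORDER):
        word |= (nibbles[src] & 0xF) << (4 * slot)
    return word

# Sequential inputs land interleaved: evens first, then odds.
print(hex(pack_awq_word([0, 1, 2, 3, 4, 5, 6, 7])))  # 0x75316420
```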

Notes

  • DeltaNet projections (in_proj_a, in_proj_b) are kept BF16; recurrent state accumulates INT4 error catastrophically across the 40 hybrid layers.
  • Vision tower is BF16 throughout (vision_tower.*, visual.*, multi_modal_projector, embed_vision).
  • Known calibration imperfection (v1): the recipe used re:.*shared_experts.* (plural) instead of singular, so shared_expert.{gate,up,down}_proj got INT4-quantized. A v2 with the corrected pattern is in flight; see the tracker repo for status.
  • Greedy decode (temperature=0) loops on this model; use temperature=0.7, top_p=0.95, top_k=20 per Qwen's recommendations (an explicit client-side example follows this list). SGLang picks these up automatically with sampling_defaults='model'.
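
For clients that set sampling explicitly rather than relying on the model defaults, a sketch against SGLang's OpenAI-compatible API (localhost:30000 is SGLang's default bind, assumed here; top_k rides in extra_body because it is not a standard OpenAI field):

```python
# Explicit Qwen-recommended sampling via SGLang's OpenAI-compatible API.
# Assumes a server on localhost:30000, SGLang's default port.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="mattbucci/Qwen3.6-35B-A3B-AWQ",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    temperature=0.7,
    top_p=0.95,
    extra_body={"top_k": 20},  # SGLang-specific; not a standard OpenAI field
)
print(resp.choices[0].message.content)
```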

Usage with SGLang (RDNA4)

```bash
git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
cd 2x-R9700-RDNA4-GFX1201-sglang-inference
./scripts/setup.sh
MODEL=mattbucci/Qwen3.6-35B-A3B-AWQ scripts/launch.sh qwen36-moe
```

The qwen36-moe preset auto-detects native AWQ vs compressed-tensors and picks --quantization moe_wna16 accordingly.
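
Once the server is up, a quick smoke test against SGLang's native /generate endpoint (port 30000 assumed; adjust if your launch.sh preset binds elsewhere):

```python
# Smoke test for the launched server via SGLang's native /generate endpoint.
# Assumes the default port 30000; launch.sh may bind a different one.
import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Briefly, what is AWQ quantization?",
        "sampling_params": {
            "temperature": 0.7,  # greedy decode loops on this model (see Notes)
            "top_p": 0.95,
            "top_k": 20,
            "max_new_tokens": 128,
        },
    },
    timeout=120,
)
print(resp.json()["text"])
```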

Hardware

Calibrated and benchmarked on 2× AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 64 GB total VRAM) with ROCm 7.2 + SGLang v0.5.10 + RDNA4 patches.
