# Qwen3.6-35B-A3B AWQ 4-bit (native)
AWQ 4-bit quantization of Qwen3.6-35B-A3B with thinking + vision preserved, optimized for AMD RDNA4 (gfx1201) inference with SGLang.
## Model Details
| Field | Value |
|---|---|
| Base model | Qwen/Qwen3.6-35B-A3B |
| Architecture | Qwen3.5 MoE + DeltaNet hybrid (256 experts, top-8) + vision tower |
| Parameters | 35B total / 3B active |
| Layers | 40 (mixed full-attention + DeltaNet linear-attention) |
| Context | 262K (native), 1M (YaRN-extended) |
| Modalities | text + image + video (no audio) |
| Quantization | Native AWQ 4-bit, group_size=128, fused Triton GEMM |
| Calibration | GPTQ via llmcompressor, 256 samples × 1024 tokens, thinking_vision recipe (AM-Thinking-v1, NuminaMath, LLaVA-Instruct, ultrachat); ignore = `lm_head`, `in_proj_a/b`, `mlp.gate`, `shared_experts.*`, `vision_tower`, `visual.*`, `multi_modal_projector`, `embed_vision` |
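For reference, the ignore list and group settings above correspond roughly to an llmcompressor recipe of the following shape. This is a hedged sketch: the exact regex patterns and modifier options are assumptions reconstructed from the table, not the shipped v1 recipe (which, per the note below, used `re:.*shared_experts.*` rather than the singular form).

```yaml
quant_stage:
  quant_modifiers:
    GPTQModifier:
      ignore:
        - lm_head
        - "re:.*in_proj_a.*"
        - "re:.*in_proj_b.*"
        - "re:.*mlp.gate$"
        - "re:.*shared_experts.*"
        - "re:vision_tower.*"
        - "re:visual.*"
        - "re:multi_modal_projector.*"
        - "re:embed_vision.*"
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 4
            type: int
            symmetric: true
            strategy: group
            group_size: 128
```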
## Performance (2× AMD Radeon AI PRO R9700, TP=2)
`sglang.bench_serving`, single user, FP8 KV cache:
| Context | TPOT (ms) | tok/s |
|---|---|---|
| 128 | 46.4 | 21.6 |
| 8192 | 46.3 | 21.6 |
| 32768 | 46.4 | 21.6 |
| 65536 | 47.1 | 21.2 |
| 131072 | 48.5 | 20.6 |
Decode is flat at ~21 tok/s all the way through 131K context: the A3B MoE is bandwidth-bound on the ~3B active parameters, with no attention-scaling cliff.

This is 6× faster than the compressed-tensors variant (mattbucci/Qwen3.6-35B-A3B-AWQ-CT, measured at 3.6 tok/s). Same calibration weights, just repacked into native AWQ format so SGLang's fused Triton AWQ GEMM runs instead of the slow ROCm CompressedTensorsWNA16TritonMoE kernel.
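The tok/s column is simply the reciprocal of TPOT; a quick standalone sanity check over the benchmark rows above:

```python
# Single-user decode: tok/s ~= 1000 / TPOT(ms).
# Rows from the benchmark table above: context length -> TPOT in ms.
tpot_ms = {128: 46.4, 8192: 46.3, 32768: 46.4, 65536: 47.1, 131072: 48.5}

for ctx, tpot in tpot_ms.items():
    print(f"{ctx:>6} ctx: {1000.0 / tpot:.1f} tok/s")

# Spread between shortest and longest context is under 5%, i.e. no
# attention-scaling cliff; decode stays bound on the active-param reads.
slowdown = tpot_ms[131072] / tpot_ms[128] - 1
print(f"slowdown at 131K vs 128: {slowdown:.1%}")
```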
## Notes
- DeltaNet projections (`in_proj_a`, `in_proj_b`) are kept BF16: the recurrent state accumulates INT4 error catastrophically across the 40 hybrid layers.
- The vision tower is BF16 throughout (`vision_tower.*`, `visual.*`, `multi_modal_projector`, `embed_vision`).
- Known calibration imperfection (v1): the recipe used `re:.*shared_experts.*` (plural) instead of the singular form, so `shared_expert.{gate,up,down}_proj` got INT4-quantized. A v2 with the corrected pattern is in flight; see the tracker repo for status.
- Greedy decode (temperature=0) loops on this model. Use `temperature=0.7, top_p=0.95, top_k=20` per Qwen's recommendations; SGLang picks this up automatically with `sampling_defaults='model'`.
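If you drive the server through its OpenAI-compatible endpoint rather than relying on `sampling_defaults='model'`, the recommended sampling parameters can be set explicitly per request. A minimal sketch; the endpoint URL in the comment assumes SGLang's default port 30000, which your launch script may override:

```python
import json

def build_chat_request(prompt: str) -> dict:
    """Chat-completion payload with Qwen's recommended sampling parameters.

    POST the result to the server's OpenAI-compatible endpoint, e.g.
    http://localhost:30000/v1/chat/completions (default port; an assumption).
    """
    return {
        "model": "mattbucci/Qwen3.6-35B-A3B-AWQ",
        "messages": [{"role": "user", "content": prompt}],
        # Greedy decode (temperature=0) loops on this model; use Qwen's
        # recommended sampling instead.
        "temperature": 0.7,
        "top_p": 0.95,
        "top_k": 20,
    }

print(json.dumps(build_chat_request("Explain AWQ group_size briefly."), indent=2))
```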
## Usage with SGLang (RDNA4)
```bash
git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
cd 2x-R9700-RDNA4-GFX1201-sglang-inference
./scripts/setup.sh
MODEL=mattbucci/Qwen3.6-35B-A3B-AWQ scripts/launch.sh qwen36-moe
```
The `qwen36-moe` preset auto-detects native AWQ vs compressed-tensors and picks `--quantization moe_wna16` accordingly.
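One way such auto-detection can work is by inspecting `quantization_config.quant_method` in the checkpoint's `config.json`: native AWQ exports declare `"awq"`, while llmcompressor exports declare `"compressed-tensors"`. A minimal sketch only; the helper name and the flag mapping are illustrative, not the preset's actual code:

```python
import json
from pathlib import Path

def pick_quant_flag(model_dir: str) -> str:
    """Map the checkpoint's declared quant method to an SGLang --quantization value.

    Illustrative: the real qwen36-moe preset may branch differently.
    """
    config = json.loads((Path(model_dir) / "config.json").read_text())
    method = config.get("quantization_config", {}).get("quant_method")
    if method == "awq":
        return "moe_wna16"           # native AWQ -> fused Triton AWQ GEMM path
    if method == "compressed-tensors":
        return "compressed-tensors"  # slower ROCm WNA16 MoE kernel (assumed flag value)
    raise ValueError(f"unrecognized quant_method: {method!r}")
```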
## Hardware
Calibrated and benchmarked on 2× AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 64 GB total VRAM) with ROCm 7.2 + SGLang v0.5.10 + RDNA4 patches.