# Qwen3.6-35B-A3B AWQ 4-bit (compressed-tensors)
Compressed-tensors output of the GPTQ calibration of Qwen3.6-35B-A3B, with thinking and vision preserved.
## Which variant should I download?
| Stack | Recommended | Why |
|---|---|---|
| SGLang + ROCm | Native AWQ | 6× faster decode (21.6 vs 3.6 tok/s on R9700) — fused Triton AWQ GEMM beats the ROCm CompressedTensorsWNA16TritonMoE path |
| SGLang + NVIDIA | Native AWQ | SGLang's NVIDIA CT loader doesn't replicate the BF16 fallback for the (1, H) shared_expert_gate and triggers thinking-mode repetition loops; native avoids the loader path entirely |
| vLLM / autoawq / TGI on NVIDIA | Either works | CT loaders in these engines handle the gate correctly; native is a few % faster on Marlin (see the vLLM example below the table) |
| Inspection / re-conversion | This (CT) | Raw GPTQ output from llmcompressor before AWQ repack |
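As a concrete example of the third row, a minimal vLLM launch of this CT checkpoint (a sketch: `vllm serve`, `--quantization`, and `--tensor-parallel-size` are standard vLLM options, but check that your vLLM build supports this architecture):

```bash
# Serve this compressed-tensors checkpoint with vLLM.
# quant_method is read from config.json, so the flag is optional explicitness.
vllm serve mattbucci/Qwen3.6-35B-A3B-AWQ-CT \
  --quantization compressed-tensors \
  --tensor-parallel-size 2
```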
## Model Details
| | |
|---|---|
| Base model | Qwen/Qwen3.6-35B-A3B |
| Architecture | Qwen3.5 MoE+DeltaNet hybrid (256 experts, top-8) + vision tower |
| Parameters | 35B total / 3B active |
| Format | compressed-tensors pack-quantized (W4A16, group_size=128) |
| Calibration | GPTQ via llmcompressor, 256 samples × 1024 tokens, thinking_vision recipe |
For the full ignore list and known calibration limitations, see the native variant's README.
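To inspect the pack-quantized layout before re-converting, the quantization block can be pulled from `config.json` without downloading the weights (a sketch assuming `huggingface-cli` and `jq` are installed):

```bash
# Fetch only config.json and print its quantization_config block
huggingface-cli download mattbucci/Qwen3.6-35B-A3B-AWQ-CT config.json --local-dir .
jq '.quantization_config' config.json
```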
## Convert to native AWQ
```bash
git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
cd 2x-R9700-RDNA4-GFX1201-sglang-inference
python scripts/quantize/convert_moe_ct_to_awq.py <local_path_to_this_repo> <output_dir> --group-size 128
```
Output is bit-equivalent to the native AWQ repo.
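A quick smoke test of the converted checkpoint (a sketch: `<output_dir>` is the directory from the step above; `--model-path`, `--quantization`, and `--tp` are standard `sglang.launch_server` arguments, with `--tp 2` matching the dual-GPU setup below):

```bash
# Smoke-test the converted native-AWQ checkpoint with SGLang.
# Quantization is normally auto-detected; the flag is shown for explicitness.
python -m sglang.launch_server \
  --model-path <output_dir> \
  --quantization awq \
  --tp 2
```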
## Hardware origin
Calibrated on 2× AMD Radeon AI PRO R9700 (gfx1201, RDNA4) with ROCm 7.2 + SGLang v0.5.10 + RDNA4 patches.