# Qwen3.6-35B-A3B AWQ 4-bit (compressed-tensors)

Compressed-tensors output of the GPTQ calibration of Qwen3.6-35B-A3B, with thinking and vision support preserved.

## Which variant should I download?

| Stack | Recommended | Why |
|---|---|---|
| SGLang + ROCm | Native AWQ | 6× faster decode (21.6 vs 3.6 tok/s on an R9700): the fused Triton AWQ GEMM beats the ROCm `CompressedTensorsWNA16TritonMoE` path |
| SGLang + NVIDIA | Native AWQ | SGLang's NVIDIA CT loader doesn't replicate the BF16 fallback for the `(1, H)` `shared_expert_gate` and triggers thinking-mode repetition loops; native avoids that loader path entirely |
| vLLM / autoawq / TGI on NVIDIA | Either works | CT loaders in these engines handle the gate correctly; native is a few % faster on Marlin (load sketch below) |
| Inspection / re-conversion | This repo (CT) | Raw GPTQ output from llmcompressor, before the AWQ repack |
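
If you take the vLLM route from the table above, the CT checkpoint should load with no extra flags, since vLLM reads the compressed-tensors `quantization_config` straight from this repo's `config.json`. A minimal sketch (prompt and sampling values are illustrative):

```python
# Minimal vLLM load sketch for the CT variant (this repo).
# vLLM should auto-detect the compressed-tensors quantization_config.
from vllm import LLM, SamplingParams

llm = LLM(model="mattbucci/Qwen3.6-35B-A3B-AWQ-CT")
outputs = llm.generate(
    ["Explain top-8 expert routing in two sentences."],
    SamplingParams(temperature=0.6, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```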

## Model Details

| Field | Value |
|---|---|
| Base model | Qwen/Qwen3.6-35B-A3B |
| Architecture | Qwen3.5 MoE+DeltaNet hybrid (256 experts, top-8) + vision tower |
| Parameters | 35B total / 3B active |
| Format | compressed-tensors pack-quantized (W4A16, group_size=128) |
| Calibration | GPTQ via llmcompressor, 256 samples × 1024 tokens, `thinking_vision` recipe |

For the full ignore list and known calibration limitations, see the native variant's README.
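
For orientation, this is roughly what the llmcompressor call behind the Calibration row looks like. It is a sketch only: the dataset name and ignore entries below are placeholders, and the real `thinking_vision` recipe (including the full vision/gate ignore list) lives with the conversion scripts, not here.

```python
# Sketch of the GPTQ oneshot pass; dataset and ignore list are placeholders,
# not the actual thinking_vision recipe.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",        # 4-bit weights, 16-bit activations (group_size=128)
    ignore=["lm_head"],    # real recipe also skips vision and gate modules
)

oneshot(
    model="Qwen/Qwen3.6-35B-A3B",
    dataset="open_platypus",       # placeholder calibration set
    recipe=recipe,
    max_seq_length=1024,           # 1024 tokens per sample
    num_calibration_samples=256,   # 256 samples
)
```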

## Convert to native AWQ

```bash
git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
cd 2x-R9700-RDNA4-GFX1201-sglang-inference
python scripts/quantize/convert_moe_ct_to_awq.py <local_path_to_this_repo> <output_dir> --group-size 128
```

The output is bit-equivalent to the native AWQ repo.
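
If you want to verify that claim for your own conversion, you can diff a few packed tensors against the published native AWQ weights. A minimal sketch, assuming both repos are downloaded locally; the file paths and tensor name are placeholders, and shard layouts may differ between repos:

```python
# Spot-check one packed tensor for bit-equivalence; paths and the tensor
# name below are placeholders.
import torch
from safetensors import safe_open

name = "model.layers.0.mlp.experts.0.gate_proj.qweight"  # example tensor

with safe_open("converted/model.safetensors", framework="pt") as a, \
     safe_open("native/model.safetensors", framework="pt") as b:
    assert torch.equal(a.get_tensor(name), b.get_tensor(name)), f"{name} differs"
print("spot-check passed")
```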

## Hardware origin

Calibrated on 2× AMD Radeon AI PRO R9700 (gfx1201, RDNA4) with ROCm 7.2 + SGLang v0.5.10 + RDNA4 patches.
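
To serve on the same two-GPU setup, SGLang's offline engine mirrors the `launch_server` flags. A hedged sketch: the repo id below is a hypothetical stand-in for the native AWQ variant, and `tp_size=2` assumes the two R9700s:

```python
# Offline SGLang engine sketch for a 2-GPU tensor-parallel setup.
# The model id is a placeholder for the native AWQ repo.
import sglang as sgl

llm = sgl.Engine(model_path="mattbucci/Qwen3.6-35B-A3B-AWQ", tp_size=2)
outputs = llm.generate(["Hello from RDNA4."], {"max_new_tokens": 64})
print(outputs[0]["text"])
llm.shutdown()
```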
