# Qwen3-Coder-30B-A3B-REAP AWQ 4-bit
⚠️ **VALIDATION FAILED 2026-04-29 — DO NOT USE.** An initial smoke test on the AWQ output produces gibberish (`def count_vowels(s):` → `sweat sweat aster aster…`, on both `/v1/chat/completions` and `/v1/completions`). The end-to-end pipeline (REAP+REAM merge → AWQ calibration → CT→native conversion → audit) reported success at the file-format level, but the resulting weights are unusable for inference. Likely root cause: insufficient AWQ calibration coverage (256 samples × 1024 tokens) for a 96-expert MoE post-merge, or weight corruption introduced by the REAP+REAM merge step. For a working REAP variant of this base, use `mattbucci/Qwen3-Coder-REAP-25B-A3B-AWQ` (Cerebras prune, validated, 88/300 on SWE-bench Lite). This repo will be either fixed or removed; tracked under task #52 in the RDNA4 inference repo.
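The failure mode above — looping n-grams instead of code — is easy to screen for automatically before trusting a quantized checkpoint. A minimal sketch of such a check on raw completion text; the whitespace tokenization and the 0.3 threshold are simplifications chosen for illustration, not part of the original pipeline:

```python
from collections import Counter

def looks_degenerate(text: str, max_top_token_frac: float = 0.3) -> bool:
    """Heuristic: flag output where one token dominates (e.g. 'sweat sweat aster aster...')."""
    tokens = text.split()
    if len(tokens) < 8:
        return False  # too short to judge
    top_count = Counter(tokens).most_common(1)[0][1]
    return top_count / len(tokens) > max_top_token_frac

# A healthy completion for `def count_vowels(s):` vs. the observed failure:
ok = "return sum(1 for ch in s.lower() if ch in 'aeiou')"
bad = "sweat sweat aster aster sweat sweat aster aster"
print(looks_degenerate(ok), looks_degenerate(bad))  # → False True
```

Running a handful of such prompts against the served endpoint and failing fast on any degenerate output would have caught this before upload.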
AWQ 4-bit quantization of a self-built REAP-pruned variant of Qwen3-Coder-30B-A3B-Instruct, calibrated with thinking + code data, optimized for AMD RDNA4 (gfx1201) inference with SGLang.
## Model Details

| | |
|---|---|
| Base model | Qwen/Qwen3-Coder-30B-A3B-Instruct (128 experts) |
| Architecture | Qwen3 MoE (96 experts post-prune, top-8 routing) |
| Parameters | ~25B total / ~3B active |
| Pruning method | REAP saliency + REAM grouping (REAP: arXiv:2510.13999, REAM: arXiv:2503.08009); 25% expert drop (128 → 96) |
| Pruning mix | `--saliency reap --grouping ream --mix_ratio 0.0,0.3,0.7` on c4 + math + code calibration data |
| Layers | 48 |
| Context | 256K (supported by the base model) |
| Quantization | Native AWQ 4-bit, `group_size=128`, fused Triton GEMM |
| Calibration | GPTQ via llmcompressor, 256 samples × 1024 tokens, `code_thinking` mix; `ignore=lm_head, mlp.gate, shared_expert.*` |
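The quantization settings in the table (4-bit, asymmetric, `group_size=128`) amount to per-group affine quantization of the weights. A dependency-free sketch of that arithmetic for a single group; the real AWQ flow additionally applies activation-aware scaling before this step:

```python
import random

def quant_dequant_int4(group):
    """Asymmetric 4-bit quantize/dequantize for one group of weights."""
    wmin, wmax = min(group), max(group)
    scale = (wmax - wmin) / 15.0 or 1.0      # 4-bit range: 0..15; guard all-equal groups
    zero = round(-wmin / scale)              # asymmetric zero point
    out = []
    for w in group:
        q = min(15, max(0, round(w / scale + zero)))  # quantize and clip to int4
        out.append((q - zero) * scale)                # dequantize back to float
    return out

random.seed(0)
group = [random.gauss(0, 1) for _ in range(128)]     # one group of 128 weights
deq = quant_dequant_int4(group)
max_err = max(abs(a - b) for a, b in zip(group, deq))
print(f"max abs error: {max_err:.3f}")               # bounded by ~scale/2
```

Each group of 128 weights carries its own scale and zero point, which is also why per-group metadata adds a small overhead on top of the raw 4 bits per weight.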
## Performance (2× AMD Radeon AI PRO R9700, TP=2, fp8 KV)

Benchmarks pending — will be populated once `sglang.bench_serving` runs complete. Expected to track the Cerebras REAP-25B sibling (~22 tok/s, flat across 128 → 131K context; the A3B MoE is bandwidth-bound).
## Notes
This variant was built end-to-end on the R9700 box rather than reusing a published prune:
- **Pruning** — `--saliency reap --grouping ream` via the REAM `merge.py` pipeline, dropping 32 of 128 experts to reach 96. REAP scores experts on router-aware impact; REAM groups them for clustered selection. The `mix_ratio 0.0,0.3,0.7` favors code-leaning calibration over c4.
- **Calibration** — GPTQ-AWQ on the BF16 prune via llmcompressor, 256 samples × 1024 tokens. Calibration data is the `code_thinking` mix (AM-Thinking-v1, NuminaMath-CoT, ultrachat).
- **Native AWQ conversion** — the CT (compressed-tensors) output from llmcompressor was converted to native AWQ via `convert_moe_ct_to_awq.py`. On ROCm, the AWQ Triton GEMM is 6× faster than the CT path on identical weights.
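The CT→native-AWQ conversion is, at its core, a repacking job: both formats store eight 4-bit values per 32-bit word but in different element orders. A hypothetical sketch of the idea — the two orders shown are illustrative, not the exact layouts used by `convert_moe_ct_to_awq.py`:

```python
def pack_int4(vals, order):
    """Pack eight 4-bit values (0..15) into one 32-bit word; nibble i comes from vals[order[i]]."""
    assert len(vals) == len(order) == 8
    word = 0
    for i, src in enumerate(order):
        word |= (vals[src] & 0xF) << (4 * i)
    return word

def unpack_int4(word, order):
    """Inverse of pack_int4: recover the original value layout."""
    vals = [0] * 8
    for i, src in enumerate(order):
        vals[src] = (word >> (4 * i)) & 0xF
    return vals

SEQUENTIAL = [0, 1, 2, 3, 4, 5, 6, 7]   # e.g. a simple row-major packing
INTERLEAVED = [0, 2, 4, 6, 1, 3, 5, 7]  # e.g. a GEMM-friendly interleaved packing

vals = [3, 14, 0, 7, 9, 1, 15, 6]
# Repack between layouts without touching the 4-bit payloads:
repacked = pack_int4(unpack_int4(pack_int4(vals, SEQUENTIAL), SEQUENTIAL), INTERLEAVED)
assert unpack_int4(repacked, INTERLEAVED) == vals
```

Because the payloads are untouched, a conversion like this can change kernel performance (as in the 6× CT-vs-AWQ Triton gap above) without changing the model's numerics.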
`mlp.gate` (the router) is preserved in BF16 to avoid INT4 routing artifacts. The Coder-30B-A3B base has no `shared_expert` (Qwen3MoE architecture), so this layout is simpler than in the Qwen3.5/3.6 family.
This is a different recipe from the published Cerebras REAP-25B: same 96-expert target but with REAM grouping mixed in, so expect slightly different downstream behavior even though the parameter count matches.
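At its core, the expert-drop step described above (keep 96 of 128 experts per layer by saliency) is a per-layer top-k selection over expert scores. A toy sketch with made-up scores; real REAP saliency is router-aware activation impact, and REAM additionally groups/merges experts rather than purely dropping them:

```python
import random

def prune_experts(saliency, keep):
    """Return indices of the `keep` highest-saliency experts, in original order."""
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:keep])

random.seed(0)
num_experts, keep = 128, 96                        # 25% expert drop, as in this model
saliency = [random.random() for _ in range(num_experts)]  # stand-in scores
kept = prune_experts(saliency, keep)
print(len(kept), num_experts - len(kept))          # → 96 32
```

After selection, the router's weight rows for dropped experts are removed and the remaining expert indices are renumbered, which is what the `merge.py` pipeline handles per layer.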
## Usage with SGLang
Tested on the RDNA4 inference stack (SGLang v0.5.10 + RDNA4 patches):
```bash
git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
cd 2x-R9700-RDNA4-GFX1201-sglang-inference
./scripts/setup.sh
MODEL=mattbucci/Qwen3-Coder-30B-A3B-REAP-AWQ scripts/launch.sh coder-30b-reap
```
For other inference engines, this is a standard AWQ 4-bit checkpoint (`group_size=128`, asymmetric, fused MoE) and should load with vLLM or transformers + AutoAWQ without modification.
## Hardware
Built and quantized on 2× AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 64 GB total VRAM) with ROCm 7.2 and SGLang v0.5.10 + RDNA4 patches.
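As a back-of-the-envelope check that the quantized weights fit the 2× 32 GB setup, ~25B parameters at 4 bits plus per-group scale/zero metadata (assuming 16-bit scale and zero per group of 128 — rough assumptions, not measured values) give:

```python
def awq_weight_gib(params_b, bits=4.0, group_size=128, scale_bits=16, zero_bits=16):
    """Rough weight memory in GiB for group-quantized params plus per-group metadata."""
    eff_bits = bits + (scale_bits + zero_bits) / group_size  # ~4.25 effective bits/param
    return params_b * 1e9 * eff_bits / 8 / 2**30

total = awq_weight_gib(25)  # ~25B total params, 4-bit AWQ
print(f"~{total:.1f} GiB of weights, split ~{total / 2:.1f} GiB per GPU under TP=2")
```

That leaves most of the 64 GB of VRAM for the fp8 KV cache and activations, which is what makes long-context serving on this box feasible.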