# Qwen3-Coder-30B-A3B-REAP AWQ 4-bit
⚠️ **VALIDATION FAILED 2026-04-29 — DO NOT USE.** An initial smoke test on the AWQ output produces gibberish (`def count_vowels(s):` → `sweat sweat aster aster…`, on both `/v1/chat/completions` and `/v1/completions`). The end-to-end pipeline (REAP+REAM merge → AWQ calibration → CT→native conversion → audit) reported success at the file-format level, but the resulting weights are unusable for inference. Likely root cause: insufficient AWQ calibration coverage (256 samples × 1024 tokens) for a 96-expert MoE post-merge, or weight corruption introduced by the REAP+REAM merge step. For a working REAP variant of this base, use `mattbucci/Qwen3-Coder-REAP-25B-A3B-AWQ` (Cerebras prune, validated, 88/300 on SWE-bench Lite). This repo will be either fixed or removed; tracked under task #52 in the RDNA4 inference repo.
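The failure mode above — looping n-grams instead of code — is easy to screen for automatically before trusting a quantized checkpoint. A minimal sketch of such a check on raw completion text; the whitespace tokenization and the 0.3 threshold are simplifications chosen for illustration, not part of the original pipeline:

```python
from collections import Counter

def looks_degenerate(text: str, max_top_token_frac: float = 0.3) -> bool:
    """Heuristic: flag output where one token dominates (e.g. 'sweat sweat aster aster...')."""
    tokens = text.split()
    if len(tokens) < 8:
        return False  # too short to judge
    top_count = Counter(tokens).most_common(1)[0][1]
    return top_count / len(tokens) > max_top_token_frac

# A healthy completion for `def count_vowels(s):` vs. the observed failure:
ok = "return sum(1 for ch in s.lower() if ch in 'aeiou')"
bad = "sweat sweat aster aster sweat sweat aster aster"
print(looks_degenerate(ok), looks_degenerate(bad))  # → False True
```

Running a handful of such prompts against the served endpoint and failing fast on any degenerate output would have caught this before upload.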
AWQ 4-bit quantization of a self-built REAP-pruned variant of Qwen3-Coder-30B-A3B-Instruct, calibrated with thinking + code data, optimized for AMD RDNA4 (gfx1201) inference with SGLang.
## Model Details

| | |
|---|---|
| Base model | Qwen/Qwen3-Coder-30B-A3B-Instruct (128 experts) |
| Architecture | Qwen3 MoE (96 experts post-prune, top-8 routing) |
| Parameters | ~25B total / ~3B active |
| Pruning method | REAP saliency + REAM grouping (REAP: arXiv:2510.13999, REAM: arXiv:2503.08009); 25% expert drop (128 → 96) |
| Pruning mix | `--saliency reap --grouping ream --mix_ratio 0.0,0.3,0.7` on c4 + math + code calibration data |
| Layers | 48 |
| Context | 256K (supported by the base model) |
| Quantization | Native AWQ 4-bit, `group_size=128`, fused Triton GEMM |
| Calibration | GPTQ via llmcompressor, 256 samples × 1024 tokens, `code_thinking` mix; `ignore=lm_head, mlp.gate, shared_expert.*` |
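The quantization settings in the table (4-bit, asymmetric, `group_size=128`) amount to per-group affine quantization of the weights. A dependency-free sketch of that arithmetic for a single group; the real AWQ flow additionally applies activation-aware scaling before this step:

```python
import random

def quant_dequant_int4(group):
    """Asymmetric 4-bit quantize/dequantize for one group of weights."""
    wmin, wmax = min(group), max(group)
    scale = (wmax - wmin) / 15.0 or 1.0      # 4-bit range: 0..15; guard all-equal groups
    zero = round(-wmin / scale)              # asymmetric zero point
    out = []
    for w in group:
        q = min(15, max(0, round(w / scale + zero)))  # quantize and clip to int4
        out.append((q - zero) * scale)                # dequantize back to float
    return out

random.seed(0)
group = [random.gauss(0, 1) for _ in range(128)]     # one group of 128 weights
deq = quant_dequant_int4(group)
max_err = max(abs(a - b) for a, b in zip(group, deq))
print(f"max abs error: {max_err:.3f}")               # bounded by ~scale/2
```

Each group of 128 weights carries its own scale and zero point, which is also why per-group metadata adds a small overhead on top of the raw 4 bits per weight.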
## Performance (2× AMD Radeon AI PRO R9700, TP=2, fp8 KV)

Benchmarks pending — will be populated once `sglang.bench_serving` runs complete. Expected to track the Cerebras REAP-25B sibling (~22 tok/s, flat across 128 → 131K context; the A3B MoE is bandwidth-bound).
## Notes
This variant was built end-to-end on the R9700 box rather than reusing a published prune:
- **Pruning** — `--saliency reap --grouping ream` via the REAM `merge.py` pipeline, dropping 32 of 128 experts to reach 96. REAP scores experts on router-aware impact; REAM groups them for clustered selection. The `mix_ratio 0.0,0.3,0.7` favors code-leaning calibration over c4.
- **Calibration** — GPTQ-AWQ on the BF16 prune via llmcompressor, 256 samples × 1024 tokens. Calibration data is the `code_thinking` mix (AM-Thinking-v1, NuminaMath-CoT, ultrachat).
- **Native AWQ conversion** — the CT (compressed-tensors) output from llmcompressor was converted to native AWQ via `convert_moe_ct_to_awq.py`. On ROCm, the AWQ Triton GEMM is 6× faster than the CT path on identical weights.
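The CT→native-AWQ conversion is, at its core, a repacking job: both formats store eight 4-bit values per 32-bit word but in different element orders. A hypothetical sketch of the idea — the two orders shown are illustrative, not the exact layouts used by `convert_moe_ct_to_awq.py`:

```python
def pack_int4(vals, order):
    """Pack eight 4-bit values (0..15) into one 32-bit word; nibble i comes from vals[order[i]]."""
    assert len(vals) == len(order) == 8
    word = 0
    for i, src in enumerate(order):
        word |= (vals[src] & 0xF) << (4 * i)
    return word

def unpack_int4(word, order):
    """Inverse of pack_int4: recover the original value layout."""
    vals = [0] * 8
    for i, src in enumerate(order):
        vals[src] = (word >> (4 * i)) & 0xF
    return vals

SEQUENTIAL = [0, 1, 2, 3, 4, 5, 6, 7]   # e.g. a simple row-major packing
INTERLEAVED = [0, 2, 4, 6, 1, 3, 5, 7]  # e.g. a GEMM-friendly interleaved packing

vals = [3, 14, 0, 7, 9, 1, 15, 6]
# Repack between layouts without touching the 4-bit payloads:
repacked = pack_int4(unpack_int4(pack_int4(vals, SEQUENTIAL), SEQUENTIAL), INTERLEAVED)
assert unpack_int4(repacked, INTERLEAVED) == vals
```

Because the payloads are untouched, a conversion like this can change kernel performance (as in the 6× CT-vs-AWQ Triton gap above) without changing the model's numerics.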
`mlp.gate` (the router) is preserved in BF16 to avoid INT4 routing artifacts. The Coder-30B-A3B base has no `shared_expert` (Qwen3MoE architecture), so this layout is simpler than in the Qwen3.5/3.6 family.
This is a different recipe from the published Cerebras REAP-25B: same 96-expert target but with REAM grouping mixed in, so expect slightly different downstream behavior even though the parameter count matches.
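At its core, the expert-drop step described above (keep 96 of 128 experts per layer by saliency) is a per-layer top-k selection over expert scores. A toy sketch with made-up scores; real REAP saliency is router-aware activation impact, and REAM additionally groups/merges experts rather than purely dropping them:

```python
import random

def prune_experts(saliency, keep):
    """Return indices of the `keep` highest-saliency experts, in original order."""
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:keep])

random.seed(0)
num_experts, keep = 128, 96                        # 25% expert drop, as in this model
saliency = [random.random() for _ in range(num_experts)]  # stand-in scores
kept = prune_experts(saliency, keep)
print(len(kept), num_experts - len(kept))          # → 96 32
```

After selection, the router's weight rows for dropped experts are removed and the remaining expert indices are renumbered, which is what the `merge.py` pipeline handles per layer.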
## Usage with SGLang
Tested on the RDNA4 inference stack (SGLang v0.5.10 + RDNA4 patches):
```bash
git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
cd 2x-R9700-RDNA4-GFX1201-sglang-inference
./scripts/setup.sh
MODEL=mattbucci/Qwen3-Coder-30B-A3B-REAP-AWQ scripts/launch.sh coder-30b-reap
```
For other inference engines, this is a standard AWQ 4-bit checkpoint (`group_size=128`, asymmetric, fused MoE) and should load with vLLM or transformers + AutoAWQ without modification.
## Hardware
Built and quantized on 2× AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 64 GB total VRAM) with ROCm 7.2 and SGLang v0.5.10 + RDNA4 patches.
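As a back-of-the-envelope check that the quantized weights fit the 2× 32 GB setup, ~25B parameters at 4 bits plus per-group scale/zero metadata (assuming 16-bit scale and zero per group of 128 — rough assumptions, not measured values) give:

```python
def awq_weight_gib(params_b, bits=4.0, group_size=128, scale_bits=16, zero_bits=16):
    """Rough weight memory in GiB for group-quantized params plus per-group metadata."""
    eff_bits = bits + (scale_bits + zero_bits) / group_size  # ~4.25 effective bits/param
    return params_b * 1e9 * eff_bits / 8 / 2**30

total = awq_weight_gib(25)  # ~25B total params, 4-bit AWQ
print(f"~{total:.1f} GiB of weights, split ~{total / 2:.1f} GiB per GPU under TP=2")
```

That leaves most of the 64 GB of VRAM for the fp8 KV cache and activations, which is what makes long-context serving on this box feasible.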