# Qwen3-REAP-15B-A3B-W4A16
INT4 weight-quantized version of Qwen3-REAP-15B-A3B, created by applying REAP pruning + AutoRound W4A16 quantization to Qwen3-30B-A3B.
30B → 15B (REAP 50% prune) → 8.7 GB (W4A16) | ~3B active per token
## Model Summary
| | Original | REAP Pruned | Quantized (this model) |
|---|---|---|---|
| Model | Qwen/Qwen3-30B-A3B | atbender/Qwen3-REAP-15B-A3B | Qwen3-REAP-15B-A3B-W4A16 |
| Total Parameters | ~30B | ~15B | ~15B |
| Active Parameters | ~3B | ~3B | ~3B |
| Experts per Layer | 128 | 64 | 64 |
| Experts Routed per Token | 8 | 8 | 8 |
| Hidden Layers | 48 | 48 | 48 |
| Hidden Size | 2048 | 2048 | 2048 |
| Precision | BF16 | BF16 | W4A16 (INT4 weights, FP16 activations) |
| Disk Size | ~57 GB | ~30 GB | ~8.7 GB |
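The ~8.7 GB disk size above can be sanity-checked with back-of-the-envelope arithmetic. A rough sketch (it ignores the unquantized router/embedding layers and packing metadata, which account for the remaining ~1 GB):

```python
# Rough size estimate: ~15B parameters at 4 bits each,
# plus one FP16 scale per group of 128 weights.
params = 15e9
weight_bytes = params * 4 / 8           # 4 bits per weight
scale_bytes = (params / 128) * 2        # FP16 scale per 128-weight group
total_gb = (weight_bytes + scale_bytes) / 1e9
print(f"~{total_gb:.1f} GB")            # → ~7.7 GB before FP16 layers and metadata
```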
## Compression Pipeline

### Stage 1: REAP Expert Pruning (30B → 15B)
REAP (Router-weighted Expert Activation Pruning) from Cerebras Research combines router gate statistics with expert activation norms to decide which experts to prune. Here, 50% of MoE experts were pruned globally (128 → 64 per layer, across all 48 layers).
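The scoring idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the Cerebras implementation; the function and variable names here are made up:

```python
import numpy as np

def reap_saliency(gate_weights, expert_norms):
    """Toy REAP-style score: average router gate weight times expert
    output norm, over the tokens actually routed to each expert.

    gate_weights: (tokens, experts) routing weights (0 where not routed)
    expert_norms: (tokens, experts) L2 norms of each expert's output
    """
    routed = gate_weights > 0
    contrib = gate_weights * expert_norms
    counts = np.maximum(routed.sum(axis=0), 1)  # avoid divide-by-zero
    return contrib.sum(axis=0) / counts

rng = np.random.default_rng(0)
gates = rng.random((1000, 8)) * (rng.random((1000, 8)) > 0.5)
norms = rng.random((1000, 8)) * 10
scores = reap_saliency(gates, norms)
keep = np.argsort(scores)[len(scores) // 2:]  # keep the top 50% of experts
```

The real method prunes by these saliency scores globally at a fixed compression ratio, which is what the 50% figure above corresponds to.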
REAP calibration data – 1,000 samples, packed to 2,048-token sequences:
| Source | Proportion | Dataset | Description |
|---|---|---|---|
| Agentic trajectories | 40% | togethercomputer/CoderForge-Preview | Passing SWE-agent trajectories |
| Raw code | 30% | bigcode/the-stack-smol (Python) | Python source code |
| General web text | 10% | allenai/c4 (English) | Pretraining distribution proxy |
| Broad coverage | 20% | NeelNanda/pile-10k | Mixed general text |
Pruning configuration:

```yaml
prune_method: reap
compression_ratio: 0.5
seed: 42
distance_measure: cosine
samples_per_category: 1024
model_max_length: 2048
```
### Stage 2: AutoRound W4A16 Quantization (30 GB → 8.7 GB)
AutoRound from Intel uses signed gradient descent to find optimal rounding decisions, iteratively adjusting rounding directions to minimize output error across calibration samples.
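A toy sketch of the signed-gradient rounding idea, for intuition only. The real AutoRound optimizes per transformer block with straight-through gradients and also tunes clipping ranges; everything below is simplified and the names are illustrative:

```python
import numpy as np

def signed_grad_round(w, x, scale, steps=200, lr=0.005):
    """Learn a per-weight perturbation v in [-0.5, 0.5] so that
    round(w/scale + v) minimizes output error on calibration data x.
    Gradients flow through rounding via a straight-through estimate."""
    v = np.zeros_like(w)
    y_ref = x @ w
    for _ in range(steps):
        q = np.clip(np.round(w / scale + v), -8, 7)   # INT4 range
        err = x @ (q * scale) - y_ref
        grad = x.T @ err              # grad of 0.5*||err||^2 w.r.t. dequant weight
        v = np.clip(v - lr * np.sign(grad), -0.5, 0.5)  # signed-gradient step
    return np.clip(np.round(w / scale + v), -8, 7) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 4))
x = rng.normal(size=(64, 16))             # stand-in calibration activations
scale = np.abs(w).max() / 7
tuned = signed_grad_round(w, x, scale)
```

The key point is that rounding direction is chosen to minimize *output* error on calibration data, not per-weight rounding error as in naive round-to-nearest.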
| Property | Value |
|---|---|
| Quantization tool | Intel AutoRound v0.10.2 |
| Bits | 4 (INT4 symmetric) |
| Group size | 128 |
| Format | auto_round (GPTQ-compatible packing) |
| Calibration dataset | NeelNanda/pile-10k |
| Calibration samples | 64 |
| Sequence length | 512 |
| Quantized layers | 196/197 per block (9,408/9,456 total) |
| Skipped | mlp.gate (MoE router) – kept at FP16 to preserve routing precision |
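The W4A16 scheme in the table (symmetric INT4 weights, group size 128, FP16 scales) can be illustrated with a minimal NumPy quantize/dequantize round trip. This is a sketch of the numerics, not the actual `auto_round` packing format:

```python
import numpy as np

def quantize_w4a16(w, group_size=128):
    """Group-wise symmetric INT4 quantization: one FP16 scale per
    group of 128 weights along the input dimension."""
    out_f, in_f = w.shape
    w_g = w.reshape(out_f, in_f // group_size, group_size)
    scales = np.abs(w_g).max(axis=-1, keepdims=True) / 7  # map max|w| to 7
    q = np.clip(np.round(w_g / scales), -8, 7).astype(np.int8)
    return q.reshape(out_f, in_f), scales.astype(np.float16)

def dequantize(q, scales, group_size=128):
    out_f, in_f = q.shape
    q_g = q.reshape(out_f, in_f // group_size, group_size)
    return (q_g * scales.astype(np.float32)).reshape(out_f, in_f)

w = np.random.default_rng(0).normal(size=(8, 256)).astype(np.float32)
q, s = quantize_w4a16(w)
w_hat = dequantize(q, s)
# Per-element quantization error is bounded by roughly half a scale step.
```

At inference time the INT4 weights are dequantized to FP16 and multiplied against FP16 activations, which is what the "A16" half of W4A16 refers to.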
## How to Use
### With transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "atbender/Qwen3-REAP-15B-A3B-W4A16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Write a Python function to compute fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
```
### With vLLM

```shell
vllm serve atbender/Qwen3-REAP-15B-A3B-W4A16 \
  --quantization gptq \
  --trust-remote-code
```
## How It Was Made

### The Fun Part
```python
from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "atbender/Qwen3-REAP-15B-A3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "atbender/Qwen3-REAP-15B-A3B",
    trust_remote_code=True,
)

ar = AutoRound(
    model,
    tokenizer=tokenizer,
    device="cuda",
    bits=4,
    group_size=128,
    nsamples=64,
    seqlen=512,
    batch_size=1,
)
ar.quantize_and_save("./Qwen3-REAP-15B-A3B-W4A16", format="auto_round")
```
### The Less Fun Part (Patches for Forward Compatibility)

These patches are needed for transformers 5.x – they are no-ops on 4.55 but included for reproducibility:
```python
import types, torch, transformers

# Patch 1: Conv1D removed in transformers 5.x
if not hasattr(transformers, 'pytorch_utils'):
    pytorch_utils = types.ModuleType('pytorch_utils')
    pytorch_utils.Conv1D = torch.nn.Linear
    transformers.pytorch_utils = pytorch_utils

# Patch 2: Qwen can be misdetected as multimodal
import auto_round.utils
import auto_round.autoround
auto_round.utils.is_mllm_model = lambda *args, **kwargs: False
auto_round.autoround.is_mllm_model = lambda *args, **kwargs: False

# Both patches must come BEFORE: from auto_round import AutoRound
```
## Hardware & Runtime
| Stage | Time | Hardware |
|---|---|---|
| REAP observation | ~74 min | 2× RTX A6000 48GB |
| REAP pruning | <1 sec | CPU |
| AutoRound quantization (48 blocks) | ~46 min | 2× RTX A6000 48GB |
| Total pipeline | ~2 hours | |
| Resource | Peak |
|---|---|
| VRAM | 14.94 GB |
| RAM | 33.56 GB |
## Intended Use

- Research on MoE pruning and compression techniques
- Practice run / reference for larger MoE compression (e.g., Step-3.5-Flash)
- Exploring sparsity–performance trade-offs in MoE architectures
- Local / single-GPU deployment of a compressed Qwen3 MoE variant
## Limitations

- No post-pruning fine-tuning – raw prune + quantize; some quality degradation is expected
- Aggressive compression – 50% expert removal plus 4-bit quantization is significant
- Calibration bias – REAP calibration was 70% code-focused, while quantization used general text (pile-10k)
- Not benchmarked – no formal evals run yet; contributions welcome
## Acknowledgements

- Qwen team – Qwen3-30B-A3B base model
- Cerebras Research – REAP method
- Intel – AutoRound quantization framework
- OpenMOSE – reference implementation and model card inspiration
## Citation

```bibtex
@article{lasby2025reap,
  title={REAP: Router-weighted Expert Activation Pruning for Scalable Mixture-of-Experts Compression},
  author={Lasby, Mike and others},
  year={2025},
  url={https://github.com/CerebrasResearch/reap}
}

@misc{autoround2024,
  title={AutoRound: Advanced Weight Quantization},
  author={Intel Corporation},
  year={2024},
  howpublished={\url{https://github.com/intel/auto-round}}
}
```
## License

Apache License 2.0 – same as the base Qwen3-30B-A3B model.