# Qwen3-30B-A3B-REAP-50: 50% Expert-Pruned Qwen3 MoE
This model is a 50% expert-pruned version of Qwen/Qwen3-30B-A3B, compressed using REAP (Router-weighted Expert Activation Pruning) from Cerebras Research.
REAP is a one-shot compression technique for Mixture-of-Experts (MoE) models that physically removes low-importance experts based on a saliency criterion combining router gate-values and expert activation norms. The method was published at ICLR 2026.
## What Changed
| Property | Original | Pruned |
|---|---|---|
| Total Experts per Layer | 128 | 64 |
| Active Experts per Token | 8 | 8 (unchanged) |
| Model Size on Disk | 57 GB | 30 GB |
| Safetensor Shards | 16 | 7 |
| Architecture | Qwen3MoeForCausalLM | Qwen3MoeForCausalLM (unchanged) |
| Hidden Size | 2048 | 2048 (unchanged) |
| Layers | 48 | 48 (unchanged) |
| Precision | bfloat16 | bfloat16 (unchanged) |
The pruned model is a standard Hugging Face model and can be loaded directly with `transformers`; no custom code is required.
## How REAP Works

### The Problem
MoE models like Qwen3-30B-A3B use sparsely-activated expert networks: each token is routed to only 8 of 128 available experts per layer. For any given input, most experts sit idle, and across a calibration corpus some experts turn out to contribute little overall. REAP exploits this by identifying and removing the least important experts.
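To make the sparsity concrete, here is a toy sketch of top-k routing (illustrative only; the function name and the renormalization step are assumptions for this sketch, not Qwen3's actual implementation):

```python
import numpy as np

def route_tokens(router_logits, top_k=8):
    """Toy top-k MoE routing: each token picks its top_k experts,
    then the selected gate weights are renormalized to sum to 1."""
    # Softmax over all experts.
    shifted = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Indices of the top_k highest-probability experts per token.
    topk_idx = np.argsort(probs, axis=-1)[:, -top_k:]
    topk_w = np.take_along_axis(probs, topk_idx, axis=-1)
    topk_w /= topk_w.sum(axis=-1, keepdims=True)  # renormalize over the selected set
    return topk_idx, topk_w

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 128))   # 4 tokens, 128 experts
idx, weights = route_tokens(logits)
print(idx.shape, weights.shape)      # only 8 of 128 experts fire per token
```

Each row of `idx` names the 8 experts that actually run for that token; the other 120 are skipped entirely, which is the redundancy REAP targets.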
### The REAP Saliency Criterion

REAP scores each expert using a dual criterion that captures both how often an expert is selected and how much it contributes when active:

```
REAP_score(expert_i) = mean over calibration tokens of:
    router_weight(expert_i) * activation_norm(expert_i)
```

Where:

- **Router weight** (`router_weight`): the softmax probability assigned by the gating network when selecting this expert. Higher means the router "prefers" this expert.
- **Expert activation norm** (`activation_norm`): the L2 norm of the expert's output vector. Higher means the expert produces larger (more impactful) modifications to the hidden state.

The product favors experts that are both frequently/strongly selected AND produce meaningful outputs. An expert with a high router weight but a low activation norm barely changes the hidden state despite being selected; one with a high activation norm but a low router weight is rarely used. REAP keeps the experts that matter on both dimensions.
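The criterion above can be sketched in a few lines (a toy illustration with random data; array shapes and names are assumptions for this sketch, not the REAP codebase):

```python
import numpy as np

def reap_scores(router_weights, expert_outputs):
    """Toy REAP saliency: mean over tokens of gate weight * output L2 norm.

    router_weights: (num_tokens, num_experts) gate values (0 when an expert is inactive)
    expert_outputs: (num_tokens, num_experts, hidden) per-expert output vectors
    """
    norms = np.linalg.norm(expert_outputs, axis=-1)   # (tokens, experts)
    return (router_weights * norms).mean(axis=0)      # (experts,)

rng = np.random.default_rng(42)
T, E, H = 256, 16, 32                 # tokens, experts, hidden dim (toy sizes)
gates = rng.random((T, E))
outs = rng.normal(size=(T, E, H))
scores = reap_scores(gates, outs)
keep = np.argsort(scores)[E // 2:]    # keep the top 50% of experts by saliency
print(sorted(keep.tolist()))
```

Experts with a low product on both factors end up at the bottom of the ranking and are the ones removed at a 50% compression ratio.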
Why Pruning Beats Merging
The REAP paper (ICLR 2026) demonstrates a key finding: expert pruning consistently outperforms expert merging for MoE compression on generative tasks. Merging (combining similar experts into one) degrades all participating experts, while pruning (removing entire experts) preserves the full capacity of remaining experts and the router's ability to select among them.
### The Full Pipeline

```
1. Load Model
   |
2. Attach observer hooks to every MoE layer
   |
3. Forward pass over calibration data (1024 samples)
   |-- Record router weights per expert per token
   |-- Record L2 norm of expert outputs per token
   |
4. Compute REAP saliency score for each expert
   |-- score = mean(router_weight * activation_norm)
   |
5. Rank experts by saliency score (lowest = least important)
   |
6. Prune bottom 50% of experts per layer
   |-- Remove expert modules from the ModuleList
   |-- Slice the router weight matrix to match
   |
7. Update config.json (num_experts: 128 -> 64)
   |
8. Save compressed model
```
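The pruning step itself (ranking and slicing) can be sketched as follows; this is a toy illustration with numpy arrays and strings standing in for expert modules, and the function name is an assumption, not the REAP codebase:

```python
import numpy as np

def prune_experts(router_weight, expert_modules, scores, keep_ratio=0.5):
    """Toy version of pipeline steps 5-6: rank experts by saliency,
    keep the top keep_ratio fraction, and slice the router to match.

    router_weight:  (num_experts, hidden) gating matrix, one row per expert
    expert_modules: list of per-expert objects (stand-ins for nn.Modules)
    scores:         (num_experts,) REAP saliency scores
    """
    num_keep = int(len(expert_modules) * keep_ratio)
    # argsort is ascending, so the last num_keep indices are the most salient.
    keep = np.sort(np.argsort(scores)[-num_keep:])
    pruned_router = router_weight[keep]                 # slice gate rows to match
    pruned_modules = [expert_modules[i] for i in keep]  # drop pruned experts
    return pruned_router, pruned_modules, keep

scores = np.array([0.9, 0.1, 0.5, 0.7, 0.2, 0.8, 0.3, 0.6])
router = np.zeros((8, 4))
experts = [f"expert_{i}" for i in range(8)]
r, e, keep = prune_experts(router, experts, scores)
print(len(e), r.shape)   # 4 (4, 4)
```

Because whole experts are removed and the router rows are sliced to match, the surviving experts and gating behavior are untouched, which is why this step takes under a second per model.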
## Detailed Parameters Used

### Model Configuration

| Parameter | Value | Description |
|---|---|---|
| `model_name` | `Qwen/Qwen3-30B-A3B` | Base model: 30B total params, 3B active per token |
| `num_hidden_layers` | 48 | Number of transformer layers |
| `hidden_size` | 2048 | Hidden dimension |
| `num_attention_heads` | 32 | Multi-head attention heads |
| `num_key_value_heads` | 4 | GQA key-value heads |
| `head_dim` | 128 | Per-head dimension |
| `intermediate_size` | 6144 | FFN intermediate size (shared experts) |
| `moe_intermediate_size` | 768 | Per-expert FFN intermediate size |
| `num_experts` | 128 -> 64 | Experts per MoE layer (before -> after) |
| `num_experts_per_tok` | 8 | Top-K experts activated per token (unchanged) |
| `vocab_size` | 151,936 | Vocabulary size |
| `max_position_embeddings` | 40,960 | Maximum sequence length |
| `torch_dtype` | bfloat16 | Model precision |
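For reference, the pruned checkpoint's `config.json` should contain roughly these fields; this is a sketch assembled from the table above (key names follow `Qwen3MoeConfig`), not a verbatim copy of the shipped file:

```json
{
  "architectures": ["Qwen3MoeForCausalLM"],
  "hidden_size": 2048,
  "num_hidden_layers": 48,
  "num_attention_heads": 32,
  "num_key_value_heads": 4,
  "head_dim": 128,
  "moe_intermediate_size": 768,
  "num_experts": 64,
  "num_experts_per_tok": 8,
  "vocab_size": 151936,
  "max_position_embeddings": 40960,
  "torch_dtype": "bfloat16"
}
```

The only field the pruning run rewrites is `num_experts` (128 -> 64); everything else carries over from the base model unchanged.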
### Pruning Configuration

| Parameter | Value | Description |
|---|---|---|
| `prune_method` | `reap` | REAP saliency criterion (router_weight * activation_norm) |
| `compression_ratio` | 0.50 | Remove 50% of experts (128 -> 64 per layer) |
| `seed` | 42 | Random seed for reproducibility |
| `singleton_super_experts` | `false` | Do not force high-activation outlier experts into singleton clusters |
| `singleton_outlier_experts` | `false` | Do not force outlier experts into singleton clusters |
### Observer Configuration (Activation Collection)

| Parameter | Value | Description |
|---|---|---|
| `samples_per_category` | 1024 | Number of calibration samples processed |
| `batch_size` | 1 | Samples per forward pass |
| `model_max_length` | 2048 | Maximum sequence length for calibration |
| `distance_measure` | `cosine` | Distance metric for expert similarity |
| `renormalize_router_weights` | `true` | Renormalize router logits after softmax |
| `record_pruning_metrics_only` | `true` | Only collect metrics needed for pruning (skip merging metrics) |
| `overwrite_observations` | `false` | Do not overwrite existing observation files |
### Calibration Dataset

| Parameter | Value | Description |
|---|---|---|
| `dataset_name` | `theblackcat102/evol-codealpaca-v1` | Code instruction-following dataset |
| `split` | `train` | Dataset split used |
| `shuffle` | `true` | Shuffle before sampling |
### Clustering Configuration

| Parameter | Value | Description |
|---|---|---|
| `cluster_method` | `agglomerative` | Hierarchical agglomerative clustering |
| `expert_sim` | `ttm` | Token-to-token similarity matrix for expert similarity |
| `linkage_method` | `average` | Average linkage for hierarchical clustering |
| `frequency_penalty` | `true` | Penalize frequently-used experts during clustering |
## Timing
| Phase | Duration |
|---|---|
| Model loading | ~5 seconds |
| Observer pass (1024 samples) | ~6.5 hours |
| Expert pruning (all 48 layers) | < 1 second |
| Model saving | ~26 seconds |
| Total | ~6.5 hours |
## Evaluation Results (0-shot, lm-eval-harness v0.4.11)
| Benchmark | Metric | Score |
|---|---|---|
| MMLU (57 subjects) | acc | 49.42% |
| -- Humanities | acc | 39.17% |
| -- Social Sciences | acc | 60.38% |
| -- STEM | acc | 56.68% |
| -- Other | acc | 46.73% |
| ARC Challenge | acc | 33.62% |
| ARC Challenge | acc_norm | 38.65% |
| ARC Easy | acc | 53.16% |
| ARC Easy | acc_norm | 50.51% |
| HellaSwag | acc | 37.70% |
| HellaSwag | acc_norm | 47.64% |
| BoolQ | acc | 74.22% |
| WinoGrande | acc | 58.80% |
| OpenBookQA | acc | 19.80% |
| OpenBookQA | acc_norm | 31.20% |
| RTE | acc | 58.48% |
### Evaluation Notes

- All benchmarks run at 0-shot (no few-shot examples)
- Evaluation performed on the base model (not instruction-tuned)
- Evaluated using `lm-eval-harness` v0.4.11 with the `model="hf"` backend
- Model loaded with `device_map="auto"` across 2 GPUs
## Usage

### Direct Loading with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "harryadav3/Qwen3-30B-A3B-REAP-50"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```
### Serving with vLLM

```bash
vllm serve harryadav3/Qwen3-30B-A3B-REAP-50 \
  --tensor-parallel-size 2 \
  --port 8000 \
  --trust-remote-code
```
## Reproducing This Model

```bash
# Clone REAP
git clone https://github.com/CerebrasResearch/reap.git
cd reap
git submodule init && git submodule update --recursive

# Install
uv venv .venv --seed --python 3.12
source .venv/bin/activate
uv pip install --editable . --native-tls --torch-backend auto

# Download base model
huggingface-cli download Qwen/Qwen3-30B-A3B

# Run REAP pruning
bash experiments/pruning-cli.sh \
  0,1 \
  "Qwen/Qwen3-30B-A3B" \
  "reap" \
  42 \
  0.50 \
  "theblackcat102/evol-codealpaca-v1" \
  false false false false false false false
```
## Citation

If you use this model, please cite the REAP paper:

```bibtex
@inproceedings{klasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Mike Klasby and Thao Nguyen and Robert D Nowak},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}
```
## Links
- REAP Paper: arXiv:2510.13999
- REAP Repository: github.com/CerebrasResearch/reap
- Base Model: Qwen/Qwen3-30B-A3B
- Cerebras Blog: cerebras.ai/blog/reap