Qwen3-30B-A3B-REAP-50: 50% Expert-Pruned Qwen3 MoE

This model is a 50% expert-pruned version of Qwen/Qwen3-30B-A3B, compressed using REAP (Router-weighted Expert Activation Pruning) from Cerebras Research.

REAP is a one-shot compression technique for Mixture-of-Experts (MoE) models that physically removes low-importance experts based on a saliency criterion combining router gate-values and expert activation norms. The method was published at ICLR 2026.

What Changed

| Property | Original | Pruned |
|---|---|---|
| Total Experts per Layer | 128 | 64 |
| Active Experts per Token | 8 | 8 (unchanged) |
| Model Size on Disk | 57 GB | 30 GB |
| Safetensor Shards | 16 | 7 |
| Architecture | Qwen3MoeForCausalLM | Qwen3MoeForCausalLM (unchanged) |
| Hidden Size | 2048 | 2048 (unchanged) |
| Layers | 48 | 48 (unchanged) |
| Precision | bfloat16 | bfloat16 (unchanged) |

The pruned model is a standard HuggingFace model and can be loaded directly with transformers -- no custom code required.


How REAP Works

The Problem

MoE models like Qwen3-30B-A3B use sparsely activated expert networks: each token is routed to only 8 of the 128 available experts per layer, so for any given input most experts sit idle and many are redundant. REAP exploits this redundancy by identifying and removing the least important experts.
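
To make the routing concrete, here is a minimal NumPy sketch of top-K expert selection in the style of a Qwen3 MoE layer, assuming a softmax over the router logits followed by top-8 selection with renormalization (toy random logits, not real router outputs):

```python
import numpy as np

def route_token(router_logits: np.ndarray, top_k: int = 8):
    """Pick the top-k experts for one token and renormalize their weights."""
    probs = np.exp(router_logits - router_logits.max())
    probs /= probs.sum()                   # softmax over all experts
    topk_idx = np.argsort(probs)[-top_k:]  # the 8 experts that fire
    topk_w = probs[topk_idx]
    topk_w /= topk_w.sum()                 # renormalize selected weights to sum to 1
    return topk_idx, topk_w

rng = np.random.default_rng(0)
topk_idx, topk_w = route_token(rng.normal(size=128))
# Only 8 of 128 experts process this token; the other 120 sit idle.
```

The renormalization step mirrors the `renormalize_router_weights: true` setting listed in the observer configuration below.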

The REAP Saliency Criterion

REAP scores each expert using a dual criterion that captures both how often an expert is selected and how much it contributes when active:

REAP_score(expert_i) = mean over calibration tokens of:
    router_weight(expert_i) * activation_norm(expert_i)

Where:

  • Router weight (router_weight): The softmax probability assigned by the gating network when selecting this expert. Higher means the router "prefers" this expert.
  • Expert Activation Norm (activation_norm): The L2 norm of the expert's output vector. Higher means the expert produces larger (more impactful) modifications to the hidden state.

The product favors experts that are both frequently/strongly selected AND produce meaningful outputs. An expert with a high router weight but a low activation norm contributes little even when chosen; one with a high activation norm but a low router weight is rarely chosen at all. REAP keeps the experts that matter on both dimensions.
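
A toy NumPy sketch of the criterion, using random stand-ins for the quantities an observer hook would record on each calibration token (the real pipeline uses actual router outputs and expert activations):

```python
import numpy as np

rng = np.random.default_rng(42)
num_experts, num_tokens, hidden = 8, 1000, 16  # toy sizes, not the real 128/2048

# Stand-ins for per-token observer records (random here for illustration):
router_weights = rng.random((num_tokens, num_experts))
expert_outputs = rng.normal(size=(num_tokens, num_experts, hidden))

# REAP saliency: mean over tokens of router_weight * ||expert output||_2
activation_norms = np.linalg.norm(expert_outputs, axis=-1)   # (tokens, experts)
saliency = (router_weights * activation_norms).mean(axis=0)  # one score per expert

# A 50% compression ratio keeps the top half of experts by score.
keep = np.sort(np.argsort(saliency)[num_experts // 2:])
```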

Why Pruning Beats Merging

The REAP paper (ICLR 2026) demonstrates a key finding: expert pruning consistently outperforms expert merging for MoE compression on generative tasks. Merging (combining similar experts into one) degrades all participating experts, while pruning (removing entire experts) preserves the full capacity of remaining experts and the router's ability to select among them.
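
A toy linear-expert illustration of the intuition (not the paper's actual experiment): averaging two expert weight matrices changes what both compute, while dropping one leaves the survivor exact.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 32
w1 = rng.normal(size=(hidden, hidden))  # expert 1 (toy: a single linear map)
w2 = rng.normal(size=(hidden, hidden))  # expert 2
x = rng.normal(size=hidden)

merged = (w1 + w2) / 2  # merging: average the weights of "similar" experts
kept = w1               # pruning: drop expert 2, keep expert 1 untouched

err_merge = np.linalg.norm(merged @ x - w1 @ x)  # merging distorts expert 1's output
err_prune = np.linalg.norm(kept @ x - w1 @ x)    # pruning preserves it exactly
```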

The Full Pipeline

```
1. Load Model
   |
2. Attach Observer Hooks to every MoE layer
   |
3. Forward Pass over calibration data (1024 samples)
   |-- Record router weights per expert per token
   |-- Record L2 norm of expert outputs per token
   |
4. Compute REAP saliency score for each expert
   |-- score = mean(router_weight * activation_norm)
   |
5. Rank experts by saliency score (lowest = least important)
   |
6. Prune bottom 50% of experts per layer
   |-- Remove expert modules from ModuleList
   |-- Slice router weight matrix to match
   |
7. Update config.json (num_experts: 128 -> 64)
   |
8. Save compressed model
```
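
Steps 5-7 can be sketched as follows, with a plain list and a NumPy array standing in for the real `nn.ModuleList` of experts and the router's weight matrix (toy sizes and made-up saliency scores):

```python
import numpy as np

num_experts, hidden = 8, 4  # toy sizes; the real model has 128 experts per layer

# Stand-ins for the per-layer expert modules and the router's weight matrix:
experts = [f"expert_{i}" for i in range(num_experts)]
router_weight = np.arange(num_experts * hidden, dtype=float).reshape(num_experts, hidden)

saliency = np.array([0.9, 0.1, 0.7, 0.2, 0.8, 0.3, 0.6, 0.05])  # from step 4
keep = np.sort(np.argsort(saliency)[num_experts // 2:])         # step 5: rank, keep top half

# Step 6: remove pruned expert modules and slice matching rows out of the router.
experts = [experts[i] for i in keep]
router_weight = router_weight[keep]

# Step 7: record the new expert count in the config.
config = {"num_experts": num_experts}
config["num_experts"] = len(experts)
```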

Detailed Parameters Used

Model Configuration

| Parameter | Value | Description |
|---|---|---|
| model_name | Qwen/Qwen3-30B-A3B | Base model: 30B total params, 3B active per token |
| num_hidden_layers | 48 | Number of transformer layers |
| hidden_size | 2048 | Hidden dimension |
| num_attention_heads | 32 | Multi-head attention heads |
| num_key_value_heads | 4 | GQA key-value heads |
| head_dim | 128 | Per-head dimension |
| intermediate_size | 6144 | FFN intermediate size (shared experts) |
| moe_intermediate_size | 768 | Per-expert FFN intermediate size |
| num_experts | 128 -> 64 | Experts per MoE layer (before -> after) |
| num_experts_per_tok | 8 | Top-K experts activated per token (unchanged) |
| vocab_size | 151,936 | Vocabulary size |
| max_position_embeddings | 40,960 | Maximum sequence length |
| torch_dtype | bfloat16 | Model precision |
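
These numbers let us sanity-check the 57 GB -> 30 GB reduction. Assuming each expert is a standard gated FFN (gate, up, and down projections) and ignoring attention, router, and embedding weights:

```python
# Back-of-the-envelope check that halving the experts roughly halves the model.
hidden_size = 2048
moe_intermediate_size = 768
num_layers = 48
bytes_per_param = 2  # bfloat16

# Each expert: gate_proj + up_proj (hidden -> moe_inter) and down_proj
# (moe_inter -> hidden), i.e. 3 * hidden * moe_inter weights.
params_per_expert = 3 * hidden_size * moe_intermediate_size  # ~4.7M

pruned_experts = 64 * num_layers                     # 64 removed in each of 48 layers
removed_params = params_per_expert * pruned_experts  # ~14.5B parameters
removed_gb = removed_params * bytes_per_param / 1e9  # ~29 GB
```

The rough estimate (~29 GB removed) is close to the observed 27 GB difference on disk; the remainder of the model (attention, embeddings, routers) is untouched.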

Pruning Configuration

| Parameter | Value | Description |
|---|---|---|
| prune_method | reap | REAP saliency criterion (router_weight * activation_norm) |
| compression_ratio | 0.50 | Remove 50% of experts (128 -> 64 per layer) |
| seed | 42 | Random seed for reproducibility |
| singleton_super_experts | false | Do not force high-activation "super" experts into singleton clusters |
| singleton_outlier_experts | false | Do not force outlier experts into singleton clusters |

Observer Configuration (Activation Collection)

| Parameter | Value | Description |
|---|---|---|
| samples_per_category | 1024 | Number of calibration samples processed |
| batch_size | 1 | Samples per forward pass |
| model_max_length | 2048 | Maximum sequence length for calibration |
| distance_measure | cosine | Distance metric for expert similarity |
| renormalize_router_weights | true | Renormalize router weights after the softmax |
| record_pruning_metrics_only | true | Collect only the metrics needed for pruning (skip merging metrics) |
| overwrite_observations | false | Do not overwrite existing observation files |

Calibration Dataset

| Parameter | Value | Description |
|---|---|---|
| dataset_name | theblackcat102/evol-codealpaca-v1 | Code instruction-following dataset |
| split | train | Dataset split used |
| shuffle | true | Shuffle before sampling |

Clustering Configuration

| Parameter | Value | Description |
|---|---|---|
| cluster_method | agglomerative | Hierarchical agglomerative clustering |
| expert_sim | ttm | Token-to-token similarity matrix for expert similarity |
| linkage_method | average | Average linkage for hierarchical clustering |
| frequency_penalty | true | Penalize frequently-used experts during clustering |

Timing

| Phase | Duration |
|---|---|
| Model loading | ~5 seconds |
| Observer pass (1024 samples) | ~6.5 hours |
| Expert pruning (all 48 layers) | < 1 second |
| Model saving | ~26 seconds |
| Total | ~6.5 hours |

Evaluation Results (0-shot, lm-eval-harness v0.4.11)

| Benchmark | Metric | Score |
|---|---|---|
| MMLU (57 subjects) | acc | 49.42% |
| -- Humanities | acc | 39.17% |
| -- Social Sciences | acc | 60.38% |
| -- STEM | acc | 56.68% |
| -- Other | acc | 46.73% |
| ARC Challenge | acc | 33.62% |
| ARC Challenge | acc_norm | 38.65% |
| ARC Easy | acc | 53.16% |
| ARC Easy | acc_norm | 50.51% |
| HellaSwag | acc | 37.70% |
| HellaSwag | acc_norm | 47.64% |
| BoolQ | acc | 74.22% |
| WinoGrande | acc | 58.80% |
| OpenBookQA | acc | 19.80% |
| OpenBookQA | acc_norm | 31.20% |
| RTE | acc | 58.48% |

Evaluation Notes

  • All benchmarks run at 0-shot (no few-shot examples)
  • Evaluation performed on the base model (not instruction-tuned)
  • Evaluated using lm-eval-harness v0.4.11 with model="hf" backend
  • Model loaded with device_map="auto" across 2 GPUs

Usage

Direct Loading with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "harryadav3/Qwen3-30B-A3B-REAP-50"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```

Serving with vLLM

```bash
vllm serve harryadav3/Qwen3-30B-A3B-REAP-50 \
    --tensor-parallel-size 2 \
    --port 8000 \
    --trust-remote-code
```

Reproducing This Model

```bash
# Clone REAP
git clone https://github.com/CerebrasResearch/reap.git
cd reap
git submodule init && git submodule update --recursive

# Install
uv venv .venv --seed --python 3.12
source .venv/bin/activate
uv pip install --editable . --native-tls --torch-backend auto

# Download base model
huggingface-cli download Qwen/Qwen3-30B-A3B

# Run REAP pruning
bash experiments/pruning-cli.sh \
    0,1 \
    "Qwen/Qwen3-30B-A3B" \
    "reap" \
    42 \
    0.50 \
    "theblackcat102/evol-codealpaca-v1" \
    false false false false false false false
```

Citation

If you use this model, please cite the REAP paper:

```bibtex
@inproceedings{klasby2025reap,
    title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
    author={Mike Klasby and Thao Nguyen and Robert D Nowak},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2025},
    url={https://arxiv.org/abs/2510.13999}
}
```
