𓌳 REAP𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression

Qwen3.5-24B-A3B-REAP-0.32

✨ Highlights

Introducing Qwen3.5-24B-A3B-REAP-0.32, a memory-efficient compressed variant of Qwen3.5-35B-A3B that maintains the core reasoning and coding capabilities of the architecture while being 32% lighter.

This model was created using REAP (Router-weighted Expert Activation Pruning), a novel expert pruning method that selectively removes redundant experts while preserving the router's independent control over remaining experts. Key features include:

Aggressive Compression: 32% reduction in expert count, bringing the total parameter count down to approximately 24B.
3B Active Parameters: Maintains the same computational efficiency during inference as the original model (3B parameters activated per token).
High-Precision GGUF: Includes optimized quants using an importance matrix (imatrix) and custom tensor precision recipes.
Drop-in Compatibility: Fully compatible with the latest transformers (from source) and vLLM.
Orchestration Scripts: Full pipeline available at sandeshrajbhandari/reap-qwen3.5-modal.

📋 Model Overview

Qwen3.5-24B-A3B-REAP-0.32 has the following specifications:

Base Model: Qwen/Qwen3.5-35B-A3B
Compression Method: REAP (Router-weighted Expert Activation Pruning)
Compression Ratio: 32% expert pruning
Type: Sparse Mixture-of-Experts (SMoE) Causal Language Model
Number of Parameters: ~24B total, ~3B activated per token
Number of Experts: 175 (uniformly pruned from 256)
Number of Activated Experts: 8 per token
License: Apache 2.0

📂 Repository Contents

This repository contains the following artifacts:

Qwen3.5-24B-A3B-REAP-0.32-IQ4_K_M.gguf: High-precision 4-bit quant using the Unsloth-style recipe (imatrix + Q8_0 overrides for critical tensors).
Qwen3.5-24B-A3B-REAP-0.32-IQ4_K_S.gguf: Smaller 4-bit quant variant.
Qwen3.5-24B-A3B-REAP-0.32-Q4_K_M.gguf: Naive Q4_K_M quant.
imatrix.dat: The importance matrix used for quantization.
calibration_data_v5_rc.txt: The calibration corpus used to generate the imatrix.

🚀 Deployment

Transformers

Since Qwen 3.5 MoE is a new architecture, ensure you are using the latest transformers from source:

pip install git+https://github.com/huggingface/transformers.git

vLLM

You can deploy the model directly using vLLM:

vllm serve sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32 \
    --enable-expert-parallel

GGUF (llama.cpp)

Optimized GGUF versions are available in this repository. We recommend using the IQ4_K_M variant for the best balance of size and performance.

🧩 Model Creation

How REAP Works

REAP selects experts to prune based on a saliency criterion that considers router gate values and expert activation norms. This ensures that only experts contributing minimally to the model's internal representations are removed.

Infrastructure

The project utilized Modal for high-memory compute (A100-80GB) and a custom fork of the REAP library.

Orchestration Code: reap-qwen3.5-modal
Library Fork: sandeshrajbhandari/reap

⚠️ Caveats & Future Work

Compute Constraints: Due to current memory limitations, the model was calibrated with a context length of 1024 tokens and a limited sample size.
Room for Optimization: There is significant room for improvement by using larger sample sizes and the full 2048/4096 context length. The current REAP fork for Qwen 3.5 still hits OOM on 80GB VRAM at 2048 context length during activation profiling, which is a target for future optimization.

📚 References & Resources

🔧 GGUF & Quantization Guides

📊 Benchmarks & Comparisons

🎯 Research

⚖️ License

This model is derived from Qwen3.5-35B-A3B and distributed under the Apache 2.0 License.

🧾 Citation

@article{lasby-reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

Downloads last month: 1,814

GGUF

Model size

19B params

Architecture

qwen35moe

Hardware compatibility

4-bit

Model tree for sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32-GGUF

Base model

Qwen/Qwen3.5-35B-A3B-Base

Finetuned

Qwen/Qwen3.5-35B-A3B

Quantized

(240)

this model

Paper for sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32-GGUF

REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

Paper • 2510.13999 • Published Oct 15, 2025 • 19