𓌳 REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression

Qwen3.5-24B-A3B-REAP-0.32

✨ Highlights

Introducing Qwen3.5-24B-A3B-REAP-0.32, a memory-efficient compressed variant of Qwen3.5-35B-A3B that retains the base model's core reasoning and coding capabilities while carrying roughly 32% fewer total parameters.

This model was created using REAP (Router-weighted Expert Activation Pruning), a novel expert pruning method that selectively removes redundant experts while preserving the router's independent control over the remaining experts. Key features include:

  • Aggressive Compression: 32% reduction in expert count, bringing the total parameter count down to approximately 24B.
  • 3B Active Parameters: Maintains the same computational efficiency during inference as the original model (3B parameters activated per token).
  • High-Precision GGUF: Includes optimized quants using an importance matrix (imatrix) and custom tensor precision recipes.
  • Drop-in Compatibility: Fully compatible with the latest transformers (from source) and vLLM.
  • Orchestration Scripts: Full pipeline available at sandeshrajbhandari/reap-qwen3.5-modal.

📋 Model Overview

Qwen3.5-24B-A3B-REAP-0.32 has the following specifications:

  • Base Model: Qwen/Qwen3.5-35B-A3B
  • Compression Method: REAP (Router-weighted Expert Activation Pruning)
  • Compression Ratio: 32% expert pruning
  • Type: Sparse Mixture-of-Experts (SMoE) Causal Language Model
  • Number of Parameters: ~24B total, ~3B activated per token
  • Number of Experts: 175 (uniformly pruned from 256)
  • Number of Activated Experts: 8 per token
  • License: Apache 2.0

📂 Repository Contents

This repository contains the following artifacts:

  • Qwen3.5-24B-A3B-REAP-0.32-IQ4_K_M.gguf: High-precision 4-bit quant using the Unsloth-style recipe (imatrix + Q8_0 overrides for critical tensors).
  • Qwen3.5-24B-A3B-REAP-0.32-IQ4_K_S.gguf: Smaller 4-bit quant variant.
  • Qwen3.5-24B-A3B-REAP-0.32-Q4_K_M.gguf: Naive Q4_K_M quant.
  • imatrix.dat: The importance matrix used for quantization.
  • calibration_data_v5_rc.txt: The calibration corpus used to generate the imatrix.
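
The imatrix and quant files above follow a standard llama.cpp quantization workflow. A sketch of how such artifacts are typically produced; the F16 source GGUF filename is illustrative, and the exact tensor-override recipe used for this repo is an assumption:

```shell
# Build the importance matrix from the calibration corpus
./llama-imatrix -m Qwen3.5-24B-A3B-REAP-0.32-F16.gguf \
    -f calibration_data_v5_rc.txt -o imatrix.dat

# Quantize with the imatrix, keeping critical tensors at Q8_0
./llama-quantize --imatrix imatrix.dat \
    --token-embedding-type q8_0 --output-tensor-type q8_0 \
    Qwen3.5-24B-A3B-REAP-0.32-F16.gguf \
    Qwen3.5-24B-A3B-REAP-0.32-Q4_K_M.gguf Q4_K_M
```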

🚀 Deployment

Transformers

Since Qwen 3.5 MoE is a new architecture, ensure you are using the latest transformers from source:

pip install git+https://github.com/huggingface/transformers.git
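
Once installed, the model loads through the standard Transformers API. A minimal sketch, assuming the repo id from the vLLM example and illustrative generation settings (requires downloading the full checkpoint):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Write a Python function that checks for balanced parentheses."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```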

vLLM

You can deploy the model directly using vLLM:

vllm serve sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32 \
    --enable-expert-parallel
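
vLLM exposes an OpenAI-compatible API, by default on port 8000. A sketch of querying the served model; the prompt is illustrative:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32",
        "messages": [{"role": "user", "content": "Write a quicksort in Python."}]
      }'
```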

GGUF (llama.cpp)

Optimized GGUF versions are available in this repository. We recommend using the IQ4_K_M variant for the best balance of size and performance.
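
A sketch of serving the recommended quant locally; the flags are illustrative (`-ngl 99` offloads all layers to the GPU, `-c` sets the context window), and note that IQ4_K-family quant types are an assumption to require a build that supports them, such as the ik_llama.cpp fork, rather than mainline llama.cpp:

```shell
./llama-server -m Qwen3.5-24B-A3B-REAP-0.32-IQ4_K_M.gguf -c 8192 -ngl 99
```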


🧩 Model Creation

How REAP Works

REAP selects experts to prune based on a saliency criterion that considers router gate values and expert activation norms. This ensures that only experts contributing minimally to the model's internal representations are removed.
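
A toy sketch of this criterion on synthetic data: per-expert saliency is computed as the mean, over tokens routed to that expert, of the router gate value times the L2 norm of the expert's output, and the lowest-saliency 32% of experts are pruned. Shapes and the exact averaging are assumptions, not the reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, tokens_per_expert = 8, 32

# Stand-ins for calibration activations: for each expert, the router gate
# values and expert-output L2 norms of the tokens routed to it.
gates = rng.uniform(0.0, 1.0, size=(num_experts, tokens_per_expert))
norms = rng.uniform(0.5, 2.0, size=(num_experts, tokens_per_expert))

# REAP-style saliency: average of (gate value * activation norm) per expert.
saliency = (gates * norms).mean(axis=1)

# Prune the 32% of experts with the lowest saliency.
n_prune = int(0.32 * num_experts)  # 2 of 8 experts in this toy setting
pruned = np.argsort(saliency)[:n_prune]
kept = np.setdiff1d(np.arange(num_experts), pruned)
print(f"pruned experts: {sorted(pruned.tolist())}, kept: {len(kept)}")
```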

Infrastructure

The project utilized Modal for high-memory compute (A100-80GB) and a custom fork of the REAP library.

⚠️ Caveats & Future Work

  • Compute Constraints: Due to current memory limitations, the model was calibrated with a context length of 1024 tokens and a limited sample size.
  • Room for Optimization: There is significant room for improvement by using larger sample sizes and the full 2048/4096 context length. The current REAP fork for Qwen 3.5 still hits OOM on 80GB VRAM at 2048 context length during activation profiling, which is a target for future optimization.


⚖️ License

This model is derived from Qwen3.5-35B-A3B and distributed under the Apache 2.0 License.


🧾 Citation

@article{lasby-reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}