---
language:
- en
license: gemma
tags:
- safetensors
- gemma4
- moe
- pruning
- reap
- cerebras
- expert-pruning
base_model:
- google/gemma-4-26b-a4b-it
library_name: transformers
pipeline_tag: text-generation
---

> [!TIP]
> Support this work: **[donate.sybilsolutions.ai](https://donate.sybilsolutions.ai)**
>
> REAP surfaces: [GLM](https://huggingface.co/spaces/0xSero/reap-glm-family) | [MiniMax](https://huggingface.co/spaces/0xSero/reap-minimax-family) | [Qwen](https://huggingface.co/spaces/0xSero/reap-qwen-family) | [Gemma](https://huggingface.co/spaces/0xSero/reap-gemma-family) | [Paper](https://arxiv.org/abs/2510.13999) | [Code](https://github.com/CerebrasResearch/reap) | [PR17](https://github.com/CerebrasResearch/reap/pull/17) | [Cerebras Collection](https://huggingface.co/collections/cerebras/cerebras-reap)

# Gemma 4 21B-A4B-it REAP

**20% expert-pruned** version of [google/gemma-4-26b-a4b-it](https://huggingface.co/google/gemma-4-26b-a4b-it) using **[Cerebras REAP](https://github.com/cerebras/reap)** (Router-weighted Expert Activation Pruning).

| | Original | This Model (0.20) | [0.30 variant](https://huggingface.co/0xSero/gemma-4-19b-a4b-it-REAP) |
|---|---:|---:|---:|
| **Total params** | ~26B | **21.34B** | 19.02B |
| **Experts per layer** | 128 | **103** | 90 |
| **Active params/tok** | ~4B | ~4B | ~4B |
| **Experts/tok** | 8 | 8 | 8 |
| **Format** | BF16 | BF16 | BF16 |
| **Disk size** | ~52 GB | **~43 GB** | ~36 GB |

REAP removes 20% of MoE experts (25 of 128 per layer) while preserving the model's routing behavior. The active parameter count per token is unchanged, since the router still selects 8 experts per token from the remaining pool. This yields an **~18% reduction in total disk/memory footprint**.

## How This Model Was Made

### Step 1: Calibration (Activation Observation)

We ran the full Gemma 4 26B-A4B-it model over a curated calibration dataset to record expert activation patterns.
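Observation of this kind is typically implemented with forward hooks on each layer's router. The sketch below is illustrative only (the module path and accumulator layout are assumptions; the actual observer lives in the REAP repo):

```python
import torch

# Running per-expert statistics for one layer (illustrative accumulator).
stats = {"gate_sum": None, "freq": None}

def observe_router(module, inputs, output):
    # Assume the router emits logits of shape [tokens, num_experts].
    logits = output if isinstance(output, torch.Tensor) else output[0]
    gates = torch.softmax(logits.float(), dim=-1)          # router gate values
    topk = gates.topk(8, dim=-1).indices                   # 8 active experts/token
    mask = torch.zeros_like(gates).scatter_(1, topk, 1.0)  # selection mask
    if stats["gate_sum"] is None:
        stats["gate_sum"] = torch.zeros(gates.shape[-1])
        stats["freq"] = torch.zeros(gates.shape[-1])
    stats["gate_sum"] += (gates * mask).sum(dim=0)         # router-weighted mass
    stats["freq"] += mask.sum(dim=0)                       # routing frequency

# Hypothetical hook target; the real module path depends on the architecture:
# model.model.layers[i].mlp.router.register_forward_hook(observe_router)
```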
The observer hooks capture router gate values, expert activation norms, and routing frequencies for every layer across all calibration tokens.

**Calibration dataset: 22,000 samples** drawn from 12 sources covering coding, reasoning, math, science, tool-calling, and agentic tasks:

| Category | Samples | Source Dataset |
|----------|--------:|----------------|
| Coding (general) | 1,000 | `theblackcat102/evol-codealpaca-v1` |
| Coding (additional) | 1,636 | `theblackcat102/evol-codealpaca-v1` |
| Reasoning -- code | 3,480 | `open-r1/Mixture-of-Thoughts[code]` |
| Reasoning -- math | 3,578 | `open-r1/Mixture-of-Thoughts[math]` |
| Reasoning -- science | 3,576 | `open-r1/Mixture-of-Thoughts[science]` |
| Tool calling | 1,000 | `Salesforce/xlam-function-calling-60k` |
| Agentic coding | 1,000 | `SWE-bench/SWE-smith-trajectories` |
| Biomedical QA | 800 | `qiaojin/PubMedQA[pqa_labeled]` |
| Science QA | 800 | `derek-thomas/ScienceQA` |
| Grade-school math | 4,466 | `openai/gsm8k[main]` |
| Competition math | 500 | `HuggingFaceH4/MATH-500` |
| Code correctness | 164 | `evalplus/humanevalplus` |
| **Total** | **22,000** | |

### Step 2: REAP Pruning

Using the recorded activation data, REAP scores each expert's importance per layer by combining router gate values, expert activation norms, and frequency-weighted saliency. The lowest-scoring 20% of experts (25 per layer) are removed. Router logits are renormalized post-pruning to maintain the output distribution.
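The scoring step can be sketched as follows. This is a simplified stand-in for the exact REAP saliency formula (see the paper and repo); it combines the three observed signals with an illustrative product:

```python
import torch

def reap_keep_indices(gate_sum: torch.Tensor,
                      norm_sum: torch.Tensor,
                      freq: torch.Tensor,
                      compression: float = 0.20) -> torch.Tensor:
    """Return indices of experts to keep for one layer (illustrative scoring)."""
    n_experts = freq.numel()
    picked = freq.clamp(min=1.0)
    mean_gate = gate_sum / picked         # avg router weight when the expert fires
    mean_norm = norm_sum / picked         # avg expert activation norm
    rel_freq = freq / freq.sum()          # routing frequency share
    saliency = mean_gate * mean_norm * rel_freq
    n_drop = int(n_experts * compression)       # 128 * 0.20 -> 25 experts dropped
    keep = saliency.topk(n_experts - n_drop).indices
    return keep.sort().values

# After slicing the expert weights down to the kept indices, the router's logits
# are re-softmaxed over the remaining experts, renormalizing the gate weights.
```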
### Pruning Configuration

| Parameter | Value |
|-----------|-------|
| Compression ratio | 0.20 (20% expert removal) |
| Original experts per layer | 128 |
| Remaining experts per layer | 103 |
| Pruning method | REAP |
| Distance measure | Angular (cosine) |
| Router weight renormalization | Yes |
| Seed | 42 |

## Benchmark Results

### Accuracy (generative, 0-shot, 50 samples/task, thinking enabled, vLLM 0.19, 4x RTX 3090)

Evaluated using lm-eval generative tasks with `--apply_chat_template` and `think_end_token=` to properly handle Gemma 4's thinking mode. Scores were extracted from model responses using regex matching.

| Task | Original | REAP 0.20 | REAP 0.30 |
|------|-------:|-------:|-------:|
| Elementary Math | **92%** | **90%** | 88% |
| Philosophy | **92%** | **88%** | 74% |
| World Religions | **90%** | 64% | 48% |
| College CS | 56% | **76%** | 68% |
| HS Math | 24%* | 44%* | 48%* |
| Abstract Algebra | 12%* | 28%* | 28%* |
| College Math | 16%* | 18%* | 24%* |
| GSM8K | **86%** | **84%** | -- |

*\* Tasks with significant extraction failures (the model outputs equations rather than single letters). Real accuracy is likely higher for all models.*

**Notes:**

- Gemma 4 is a **thinking model** -- it reasons internally before answering. Standard loglikelihood-based benchmarks give incorrect results because the model emits reasoning tokens before its final answer.
- GSM8K uses flexible-extract, which handles thinking output well.
- College CS and the math tasks show REAP sometimes **outperforming** the original, likely due to sampling variance at n=50.

### Generation Quality: Side-by-Side (14 prompts, temp=0.7, top_p=0.9, max 2048 tokens)

Both the original and REAP 0.20 models were tested on 14 challenging prompts across coding, math, philosophy, long-context, and repetition stress, with proper chat template formatting.
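Repetition loops of the kind tallied here are commonly flagged with a verbatim n-gram repeat check. A minimal sketch (the comparison harness itself is not published, so the window size and threshold below are illustrative):

```python
def has_loop(text: str, ngram: int = 8, max_repeats: int = 3) -> bool:
    """Flag output that repeats the same word n-gram more than max_repeats times."""
    words = text.split()
    counts: dict[tuple, int] = {}
    for i in range(len(words) - ngram + 1):
        key = tuple(words[i : i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] > max_repeats:
            return True
    return False
```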
| Domain | N | Orig AvgWords | REAP AvgWords | Orig Loop | REAP Loop | Orig Collapse | REAP Collapse |
|--------|--:|-------------:|--------------:|----------:|----------:|--------------:|--------------:|
| Coding | 3 | 670 | 648 | 0% | 0% | 0% | 0% |
| Math reasoning | 3 | 296 | 261 | 0% | 0% | 0% | 0% |
| Philosophy | 3 | 819 | 727 | 0% | 0% | 0% | 0% |
| Long context | 2 | 1210 | 854 | 50% | 0% | 0% | 0% |
| Repetition stress | 3 | 1088 | 1099 | 33% | 33% | 0% | 0% |

**12/14 clean ties, 1 REAP win (long-context), 1 mutual mild loop (sorting algorithms).** The REAP 0.20 model is essentially indistinguishable from the original on generation quality.

## Architecture

Gemma 4 uses a hybrid sliding/full-attention MoE architecture:

- **30 transformer layers**
- **Sliding attention** (window=1024) for 25 layers; **full attention** every 6th layer
- **MoE FFN** with 103 remaining experts per layer (originally 128), 8 active per token
- **Thinking model** -- uses `<|channel>thought` / `<|channel>response` channels
- **Multimodal** -- supports text and vision inputs
- **Context window:** 262,144 tokens
- **Vocab size:** 262,144

## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/gemma-4-21b-a4b-it-REAP"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

### vLLM

```bash
pip install "vllm>=0.19" "transformers>=5.0"

vllm serve 0xSero/gemma-4-21b-a4b-it-REAP \
  --tensor-parallel-size 2 \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --max-model-len 16384 \
  --trust-remote-code
```

## Citation

```bibtex
@inproceedings{lasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Lasby, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}
```

## Links

- **REAP paper:** [arxiv.org/abs/2510.13999](https://arxiv.org/abs/2510.13999)
- **REAP code:** [github.com/cerebras/reap](https://github.com/cerebras/reap)
- **30% pruned variant:** [0xSero/gemma-4-19b-a4b-it-REAP](https://huggingface.co/0xSero/gemma-4-19b-a4b-it-REAP)
- **Base model:** [google/gemma-4-26b-a4b-it](https://huggingface.co/google/gemma-4-26b-a4b-it)

## Sponsors

Thanks to our kind sponsors; this work wouldn't have been possible without them:

- Nvidia
- TNG Technology
- Lambda
- Prime Intellect
- HotAisle