---
language:
- en
license: gemma
tags:
- safetensors
- gemma4
- moe
- pruning
- reap
- cerebras
- expert-pruning
base_model:
- google/gemma-4-26b-a4b-it
library_name: transformers
pipeline_tag: text-generation
---

> [!TIP]
> Support this work: **[donate.sybilsolutions.ai](https://donate.sybilsolutions.ai)**
>
> REAP surfaces: [GLM](https://huggingface.co/spaces/0xSero/reap-glm-family) | [MiniMax](https://huggingface.co/spaces/0xSero/reap-minimax-family) | [Qwen](https://huggingface.co/spaces/0xSero/reap-qwen-family) | [Gemma](https://huggingface.co/spaces/0xSero/reap-gemma-family) | [Paper](https://arxiv.org/abs/2510.13999) | [Code](https://github.com/CerebrasResearch/reap) | [PR17](https://github.com/CerebrasResearch/reap/pull/17) | [Cerebras Collection](https://huggingface.co/collections/cerebras/cerebras-reap)

# Gemma 4 21B-A4B-it REAP

**20% expert-pruned** version of [google/gemma-4-26b-a4b-it](https://huggingface.co/google/gemma-4-26b-a4b-it) using **[Cerebras REAP](https://github.com/cerebras/reap)** (Router-weighted Expert Activation Pruning).

| | Original | This Model (0.20) | [0.30 variant](https://huggingface.co/0xSero/gemma-4-19b-a4b-it-REAP) |
|---|---:|---:|---:|
| **Total params** | ~26B | **21.34B** | 19.02B |
| **Experts per layer** | 128 | **103** | 90 |
| **Active params/tok** | ~4B | ~4B | ~4B |
| **Experts/tok** | 8 | 8 | 8 |
| **Format** | BF16 | BF16 | BF16 |
| **Disk size** | ~52 GB | **~43 GB** | ~36 GB |

REAP removes 20% of MoE experts (25 of 128 per layer) while preserving the model's routing behavior. The active parameter count per token is unchanged, since the router still selects 8 experts per token from the remaining pool. This yields an **~18% reduction in total disk/memory footprint**.

## How This Model Was Made

### Step 1: Calibration (Activation Observation)

We ran the full Gemma 4 26B-A4B-it model over a curated calibration dataset to record expert activation patterns.
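Observation of this kind is typically implemented with forward hooks on each layer's router. The sketch below is illustrative only (the module path and accumulator layout are assumptions; the actual observer lives in the REAP repo):

```python
import torch

# Running per-expert statistics for one layer (illustrative accumulator).
stats = {"gate_sum": None, "freq": None}

def observe_router(module, inputs, output):
    # Assume the router emits logits of shape [tokens, num_experts].
    logits = output if isinstance(output, torch.Tensor) else output[0]
    gates = torch.softmax(logits.float(), dim=-1)          # router gate values
    topk = gates.topk(8, dim=-1).indices                   # 8 active experts/token
    mask = torch.zeros_like(gates).scatter_(1, topk, 1.0)  # selection mask
    if stats["gate_sum"] is None:
        stats["gate_sum"] = torch.zeros(gates.shape[-1])
        stats["freq"] = torch.zeros(gates.shape[-1])
    stats["gate_sum"] += (gates * mask).sum(dim=0)         # router-weighted mass
    stats["freq"] += mask.sum(dim=0)                       # routing frequency

# Hypothetical hook target; the real module path depends on the architecture:
# model.model.layers[i].mlp.router.register_forward_hook(observe_router)
```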
The observer hooks capture router gate values, expert activation norms, and routing frequencies for every layer across all calibration tokens.

**Calibration dataset: 22,000 samples** drawn from 12 sources covering coding, reasoning, math, science, tool-calling, and agentic tasks:

| Category | Samples | Source Dataset |
|----------|--------:|----------------|
| Coding (general) | 1,000 | `theblackcat102/evol-codealpaca-v1` |
| Coding (additional) | 1,636 | `theblackcat102/evol-codealpaca-v1` |
| Reasoning -- code | 3,480 | `open-r1/Mixture-of-Thoughts[code]` |
| Reasoning -- math | 3,578 | `open-r1/Mixture-of-Thoughts[math]` |
| Reasoning -- science | 3,576 | `open-r1/Mixture-of-Thoughts[science]` |
| Tool calling | 1,000 | `Salesforce/xlam-function-calling-60k` |
| Agentic coding | 1,000 | `SWE-bench/SWE-smith-trajectories` |
| Biomedical QA | 800 | `qiaojin/PubMedQA[pqa_labeled]` |
| Science QA | 800 | `derek-thomas/ScienceQA` |
| Grade-school math | 4,466 | `openai/gsm8k[main]` |
| Competition math | 500 | `HuggingFaceH4/MATH-500` |
| Code correctness | 164 | `evalplus/humanevalplus` |
| **Total** | **22,000** | |

### Step 2: REAP Pruning

Using the recorded activation data, REAP scores each expert's importance per layer by combining router gate values, expert activation norms, and frequency-weighted saliency. The lowest-scoring 20% of experts (25 per layer) are removed. Router logits are renormalized post-pruning to maintain the output distribution.
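The scoring step can be sketched as follows. This is a simplified stand-in for the exact REAP saliency formula (see the paper and repo); it combines the three observed signals with an illustrative product:

```python
import torch

def reap_keep_indices(gate_sum: torch.Tensor,
                      norm_sum: torch.Tensor,
                      freq: torch.Tensor,
                      compression: float = 0.20) -> torch.Tensor:
    """Return indices of experts to keep for one layer (illustrative scoring)."""
    n_experts = freq.numel()
    picked = freq.clamp(min=1.0)
    mean_gate = gate_sum / picked         # avg router weight when the expert fires
    mean_norm = norm_sum / picked         # avg expert activation norm
    rel_freq = freq / freq.sum()          # routing frequency share
    saliency = mean_gate * mean_norm * rel_freq
    n_drop = int(n_experts * compression)       # 128 * 0.20 -> 25 experts dropped
    keep = saliency.topk(n_experts - n_drop).indices
    return keep.sort().values

# After slicing the expert weights down to the kept indices, the router's logits
# are re-softmaxed over the remaining experts, renormalizing the gate weights.
```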
### Pruning Configuration

| Parameter | Value |
|-----------|-------|
| Compression ratio | 0.20 (20% expert removal) |
| Original experts per layer | 128 |
| Remaining experts per layer | 103 |
| Pruning method | REAP |
| Distance measure | Angular (cosine) |
| Router weight renormalization | Yes |
| Seed | 42 |

## Benchmark Results

### Accuracy (generative, 0-shot, 50 samples/task, thinking enabled, vLLM 0.19, 4x RTX 3090)

Evaluated using lm-eval generative tasks with `--apply_chat_template` and `think_end_token=` to properly handle Gemma 4's thinking mode. Scores were extracted from model responses using regex matching.

| Task | Original | REAP 0.20 | REAP 0.30 |
|------|-------:|-------:|-------:|
| Elementary Math | **92%** | **90%** | 88% |
| Philosophy | **92%** | **88%** | 74% |
| World Religions | **90%** | 64% | 48% |
| College CS | 56% | **76%** | 68% |
| HS Math | 24%* | 44%* | 48%* |
| Abstract Algebra | 12%* | 28%* | 28%* |
| College Math | 16%* | 18%* | 24%* |
| GSM8K | **86%** | **84%** | -- |

*\* Tasks with significant extraction failures (the model outputs equations rather than single letters). Real accuracy is likely higher for all models.*

**Notes:**

- Gemma 4 is a **thinking model** -- it reasons internally before answering. Standard loglikelihood-based benchmarks give incorrect results because the model emits reasoning tokens before its final answer.
- GSM8K uses flexible-extract, which handles thinking output well.
- College CS and the math tasks show REAP sometimes **outperforming** the original, likely due to sampling variance at n=50.

### Generation Quality: Side-by-Side (14 prompts, temp=0.7, top_p=0.9, max 2048 tokens)

Both the original and REAP 0.20 models were tested on 14 challenging prompts across coding, math, philosophy, long-context, and repetition stress, with proper chat template formatting.
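Repetition loops of the kind tallied here are commonly flagged with a verbatim n-gram repeat check. A minimal sketch (the comparison harness itself is not published, so the window size and threshold below are illustrative):

```python
def has_loop(text: str, ngram: int = 8, max_repeats: int = 3) -> bool:
    """Flag output that repeats the same word n-gram more than max_repeats times."""
    words = text.split()
    counts: dict[tuple, int] = {}
    for i in range(len(words) - ngram + 1):
        key = tuple(words[i : i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] > max_repeats:
            return True
    return False
```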
| Domain | N | Orig AvgWords | REAP AvgWords | Orig Loop | REAP Loop | Orig Collapse | REAP Collapse |
|--------|--:|-------------:|--------------:|----------:|----------:|--------------:|--------------:|
| Coding | 3 | 670 | 648 | 0% | 0% | 0% | 0% |
| Math reasoning | 3 | 296 | 261 | 0% | 0% | 0% | 0% |
| Philosophy | 3 | 819 | 727 | 0% | 0% | 0% | 0% |
| Long context | 2 | 1210 | 854 | 50% | 0% | 0% | 0% |
| Repetition stress | 3 | 1088 | 1099 | 33% | 33% | 0% | 0% |

**12/14 clean ties, 1 REAP win (long-context), 1 mutual mild loop (sorting algorithms).** The REAP 0.20 model is essentially indistinguishable from the original on generation quality.

## Architecture

Gemma 4 uses a hybrid sliding/full-attention MoE architecture:

- **30 transformer layers**
- **Sliding attention** (window=1024) for 25 layers; **full attention** every 6th layer
- **MoE FFN** with 103 remaining experts per layer (originally 128), 8 active per token
- **Thinking model** -- uses `<|channel>thought` / `<|channel>response` channels
- **Multimodal** -- supports text and vision inputs
- **Context window:** 262,144 tokens
- **Vocab size:** 262,144

## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/gemma-4-21b-a4b-it-REAP"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

### vLLM

```bash
pip install "vllm>=0.19" "transformers>=5.0"

vllm serve 0xSero/gemma-4-21b-a4b-it-REAP \
  --tensor-parallel-size 2 \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --max-model-len 16384 \
  --trust-remote-code
```

## Citation

```bibtex
@inproceedings{lasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Lasby, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}
```

## Links

- **REAP paper:** [arxiv.org/abs/2510.13999](https://arxiv.org/abs/2510.13999)
- **REAP code:** [github.com/cerebras/reap](https://github.com/cerebras/reap)
- **30% pruned variant:** [0xSero/gemma-4-19b-a4b-it-REAP](https://huggingface.co/0xSero/gemma-4-19b-a4b-it-REAP)
- **Base model:** [google/gemma-4-26b-a4b-it](https://huggingface.co/google/gemma-4-26b-a4b-it)

## Sponsors

Thanks to our kind sponsors; this work wouldn't have been possible without them:

- Nvidia
- TNG Technology
- Lambda
- Prime Intellect
- HotAisle