Gemma-4 21B-A4B-it REAP - GGUF

GGUF quantized versions of 0xSero/gemma-4-21b-a4b-it-REAP.

Model Description

This is a 20% expert-pruned version of Google's Gemma-4 26B-A4B-it, produced with Cerebras' REAP (Router-weighted Expert Activation Pruning) method.

Key Specifications

| Metric | Original (26B) | This Model (21B) |
|---|---|---|
| Total params | ~26B | 21.34B |
| Experts/layer | 128 | 103 |
| Active params/token | ~4B | ~4B |
| Disk size | ~52GB | ~43GB |

REAP removes 20% of MoE experts (25 of 128 per layer) while preserving the model's routing behavior. The active parameter count per token is unchanged since the router still selects 8 experts per token from the remaining pool.
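The selection criterion can be illustrated with a small sketch. REAP scores experts by a router-weighted activation saliency and drops the lowest-scoring ones; the exact Cerebras implementation differs, so the scoring function and names below (`saliency`, `prune_experts`) are illustrative assumptions, not the actual REAP code.

```python
# Illustrative sketch of router-weighted expert pruning (NOT the actual
# Cerebras REAP implementation). Each expert's saliency is approximated
# as the sum over tokens of (router gate weight * expert output norm);
# the lowest-scoring 20% of experts are dropped.

def saliency(gate_weights, output_norms):
    """gate_weights, output_norms: per-token values for one expert."""
    return sum(g * n for g, n in zip(gate_weights, output_norms))

def prune_experts(scores, prune_frac=0.20):
    """Return indices of experts to KEEP, dropping the lowest prune_frac."""
    n_prune = int(len(scores) * prune_frac)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[n_prune:])

# Toy example: 10 experts, expert i has saliency proportional to i.
scores = [saliency([0.1 * i], [1.0]) for i in range(10)]
kept = prune_experts(scores)  # drops the 2 least-used experts
print(kept)  # → [2, 3, 4, 5, 6, 7, 8, 9]
```

With 128 experts per layer and `prune_frac=0.20`, `int(128 * 0.20)` gives 25, matching the 25-of-128 figure above.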

Architecture

  • 30 transformer layers
  • Sliding-window attention (window=1024) on 25 layers; full attention on every 6th layer
  • MoE FFN with 103 experts per layer, 8 active per token
  • Thinking model -- uses <|channel>thought / <|channel>response channels
  • Multimodal -- supports text and vision inputs
  • Context window: 262,144 tokens
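The attention interleaving above can be sketched as a quick check, assuming "every 6th layer" means layers 6, 12, 18, 24, and 30 (1-indexed) use full attention:

```python
# Sketch of the attention-layer pattern described above, assuming
# every 6th layer (1-indexed) uses full attention and the rest use a
# 1024-token sliding window.

N_LAYERS = 30

def attention_kind(layer):  # layer is 1-indexed
    return "full" if layer % 6 == 0 else "sliding_1024"

kinds = [attention_kind(l) for l in range(1, N_LAYERS + 1)]
print(kinds.count("sliding_1024"), kinds.count("full"))  # → 25 5
```

This recovers the 25 sliding / 5 full split implied by the 30-layer count.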

Available Quantizations

| Filename | Quant Type | Size | Description |
|---|---|---|---|
| gemma-4-21b-a4b-it-REAP.gguf | BF16 | ~43GB | Full precision, best quality |
| gemma-4-21b-a4b-it-REAP-Q8_0.gguf | Q8_0 | ~23GB | High quality |
| gemma-4-21b-a4b-it-REAP-Q5_K_M.gguf | Q5_K_M | ~15GB | Balanced (recommended) |
| gemma-4-21b-a4b-it-REAP-Q4_K_M.gguf | Q4_K_M | ~13GB | Good quality, smaller |
| gemma-4-21b-a4b-it-REAP-Q3_K_M.gguf | Q3_K_M | ~10GB | Smallest |
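The file sizes above roughly follow (total params × bits per weight) / 8. The bits-per-weight figures in this sketch are approximate averages commonly quoted for llama.cpp quant types (an assumption, not exact spec values), so treat the estimates as ballpark:

```python
# Rough GGUF size estimate: params * bits_per_weight / 8 bytes.
# Bits-per-weight values are approximate averages for llama.cpp quant
# types (assumption, not exact spec values).

PARAMS = 21.34e9  # total parameter count of this model

BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
}

def est_gb(quant):
    """Estimated file size in GB for a given quant type."""
    return PARAMS * BITS_PER_WEIGHT[quant] / 8 / 1e9

for q in BITS_PER_WEIGHT:
    print(f"{q}: ~{est_gb(q):.0f} GB")
```

The estimates land within about 1 GB of the sizes in the table, which is useful when deciding whether a quant fits in available RAM or VRAM.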

Usage with llama.cpp

# Download a quantized model
wget https://huggingface.co/Ayodele01/gemma-4-21b-a4b-it-REAP-GGUF/resolve/main/gemma-4-21b-a4b-it-REAP-Q5_K_M.gguf

# Run with llama.cpp
./llama-cli -m gemma-4-21b-a4b-it-REAP-Q5_K_M.gguf \
  -p "Write a quicksort in Python." \
  -n 2048

Usage with Ollama

Create a Modelfile:

FROM ./gemma-4-21b-a4b-it-REAP-Q5_K_M.gguf

TEMPLATE """<bos><start_of_turn>user
{{ .Prompt }}<end_of_turn>
<start_of_turn>model
"""

PARAMETER stop "<end_of_turn>"
PARAMETER temperature 0.7

Then:

ollama create gemma4-21b-reap -f Modelfile
ollama run gemma4-21b-reap

Benchmark Results (from the original REAP model)

| Task | Original (26B) | REAP 21B |
|---|---|---|
| Elementary Math | 92% | 90% |
| Philosophy | 92% | 88% |
| GSM8K | 86% | 84% |

Generation quality is "essentially indistinguishable from the original" according to the REAP authors.

License

This model is released under the Gemma License.

Credits

  • Base model: Google's Gemma-4 26B-A4B-it
  • Expert pruning: Cerebras REAP (Router-weighted Expert Activation Pruning)
  • Pruned model: 0xSero/gemma-4-21b-a4b-it-REAP
