# Gemma 4 19B-A4B-it REAP – GGUF Quantizations

GGUF quantizations of 0xSero/gemma-4-19b-a4b-it-REAP, a 30% expert-pruned Gemma 4 built with Cerebras REAP.

## Available Quantizations

| File | Quant | Size | BPW | Use Case |
|---|---|---|---|---|
| gemma-4-19b-reap-Q4_K_M.gguf | Q4_K_M | 12 GB | 5.32 | Recommended. Best quality/size tradeoff. |
| gemma-4-19b-reap-Q3_K_S.gguf | Q3_K_S | 8.4 GB | 3.89 | Fits 12 GB cards. Slight quality loss on technical terms. |
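As a rough sanity check on the table, GGUF file size is approximately total parameter count times bits per weight. A quick sketch, assuming ~19B total parameters as in the model name (exact figures differ by a few percent due to metadata overhead and GB-vs-GiB reporting):

```python
def gguf_size_gb(params: float, bpw: float) -> float:
    """Rough GGUF file size: parameters * bits-per-weight / 8, in decimal GB."""
    return params * bpw / 8 / 1e9

PARAMS = 19e9  # assumed total parameter count (19B, per the model name)

q4 = gguf_size_gb(PARAMS, 5.32)  # ~12.6 GB, in the ballpark of the listed 12 GB
q3 = gguf_size_gb(PARAMS, 3.89)  # ~9.2 GB, in the ballpark of the listed 8.4 GB
print(f"Q4_K_M ~ {q4:.1f} GB, Q3_K_S ~ {q3:.1f} GB")
```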

## Performance (AMD RX 9070 XT, 16GB VRAM, Vulkan)

| Metric | Original 26B Q4_K_M | REAP 19B Q4_K_M |
|---|---|---|
| Speed | 17 tok/s | 130 tok/s |
| VRAM @ 2k ctx | 99% | 76% |
| Max context @ 16GB | ~4k | 65k+ |
| Quality | Baseline | Indistinguishable on coding/reasoning/synthesis |

The ~7.6x speedup (17 to 130 tok/s) comes from crossing the VRAM comfort threshold: at 76% usage the GPU runs without memory pressure.

## Quick Start

### llama.cpp

```bash
# Download
hf download vsark/gemma-4-19b-a4b-it-REAP-GGUF gemma-4-19b-reap-Q4_K_M.gguf

# Run (IMPORTANT: use --reasoning off for Gemma 4)
llama-server \
    --model gemma-4-19b-reap-Q4_K_M.gguf \
    --n-gpu-layers 99 \
    --ctx-size 16384 \
    --reasoning off \
    --host 127.0.0.1 --port 8012
```
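Once running, llama-server exposes an OpenAI-compatible HTTP API. A minimal sketch of a chat request against the server started above (the `model` value is illustrative; llama-server serves whatever model it loaded):

```python
import json
from urllib import request

# Build an OpenAI-style chat completion request for the local server
payload = {
    "model": "gemma-4-19b-reap",  # name is not used for routing by llama-server
    "messages": [
        {"role": "user", "content": "Explain MoE expert pruning in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

req = request.Request(
    "http://127.0.0.1:8012/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, send it and read the reply:
# resp = json.load(request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```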

### Ollama

```bash
# Create a Modelfile
echo 'FROM ./gemma-4-19b-reap-Q4_K_M.gguf
PARAMETER num_ctx 16384' > Modelfile

ollama create gemma4-19b-reap -f Modelfile
ollama run gemma4-19b-reap
```

## Important Notes

- Use `--reasoning off` with llama-server, or the model tries to emit thinking tokens.
- Requires a recent llama.cpp build with gemma4 architecture support (older builds fail to load the model).
- Speculative decoding hurts at these speeds (31 tok/s vs 130 tok/s without it); don't use it.
- Q3_K_S introduces occasional spelling errors on technical terms (e.g., "Affinity" becomes "Affity"). Use Q4_K_M for production.

## About REAP

REAP removes 30% of the MoE experts (38 of 128 per layer) while keeping the same 8 active experts per token, so the active parameter count is unchanged (~4B/token). The pruned experts were the least used, scored by router gate values and activation norms across 22,000 calibration samples.
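The actual REAP saliency criterion is more involved, but the core idea of the paragraph above can be sketched as: score each expert by its average gate weight times its expert-output norm over calibration data, then drop the lowest-scoring 30% per layer. All names and the random calibration data below are illustrative:

```python
import numpy as np

N_EXPERTS, PRUNE_FRAC = 128, 0.30

def select_experts_to_keep(gate_weights, output_norms, prune_frac=PRUNE_FRAC):
    """gate_weights: (samples, experts) router probabilities;
    output_norms: (samples, experts) norms of each expert's output.
    Returns sorted indices of the experts that survive pruning."""
    saliency = (gate_weights * output_norms).mean(axis=0)  # per-expert score
    n_prune = int(len(saliency) * prune_frac)              # 38 of 128 at 30%
    keep = np.argsort(saliency)[n_prune:]                  # drop lowest scorers
    return np.sort(keep)

rng = np.random.default_rng(0)
gates = rng.dirichlet(np.ones(N_EXPERTS), size=1000)   # fake calibration routing
norms = rng.uniform(0.5, 2.0, size=(1000, N_EXPERTS))  # fake expert output norms
kept = select_experts_to_keep(gates, norms)
print(len(kept))  # 90 experts survive per layer
```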

See the original model card for full details and benchmarks.

## Conversion

```bash
python3 convert_hf_to_gguf.py ./gemma-4-19b-reap-bf16/ --outfile gemma-4-19b-reap-F16.gguf --outtype f16
llama-quantize gemma-4-19b-reap-F16.gguf gemma-4-19b-reap-Q4_K_M.gguf Q4_K_M
llama-quantize gemma-4-19b-reap-F16.gguf gemma-4-19b-reap-Q3_K_S.gguf Q3_K_S
```

## Verification

SHA256 checksums. If you reproduce the conversion from 0xSero's BF16 weights using the same llama.cpp version, you should get identical hashes:

```
77c579174b559a3d1812a8d8c03fa3a3ba514acc1ba54cd634ceaa783375d156  gemma-4-19b-reap-Q4_K_M.gguf
ed8e9acefa96cb97c00aaf0a79f2fe95893713148cf28d5cc8f3dad1d71dda6f  gemma-4-19b-reap-Q3_K_S.gguf
```
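To check a download programmatically rather than by eyeballing hashes, a small helper (`verify_sha256` is illustrative, not part of any tool mentioned here):

```python
import hashlib

def verify_sha256(path: str, expected: str, chunk_size: int = 1 << 20) -> bool:
    """Stream the file in 1 MiB chunks and compare against the expected digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest() == expected.lower()

# Example usage against the Q4_K_M file:
# verify_sha256("gemma-4-19b-reap-Q4_K_M.gguf",
#               "77c579174b559a3d1812a8d8c03fa3a3ba514acc1ba54cd634ceaa783375d156")
```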

## Credits
