# Gemma 4 19B-A4B-it REAP – GGUF Quantizations

GGUF quantizations of 0xSero/gemma-4-19b-a4b-it-REAP, a 30% expert-pruned Gemma 4 built with Cerebras REAP.

## Available Quantizations

| File | Quant | Size | BPW | Use Case |
|---|---|---|---|---|
| gemma-4-19b-reap-Q4_K_M.gguf | Q4_K_M | 12 GB | 5.32 | Recommended. Best quality/size tradeoff. |
| gemma-4-19b-reap-Q3_K_S.gguf | Q3_K_S | 8.4 GB | 3.89 | Fits 12 GB cards. Slight quality loss on technical terms. |
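As a rough sanity check on the table, GGUF file size is approximately total parameter count times bits per weight. A quick sketch, assuming ~19B total parameters as in the model name (exact figures differ by a few percent due to metadata overhead and GB-vs-GiB reporting):

```python
def gguf_size_gb(params: float, bpw: float) -> float:
    """Rough GGUF file size: parameters * bits-per-weight / 8, in decimal GB."""
    return params * bpw / 8 / 1e9

PARAMS = 19e9  # assumed total parameter count (19B, per the model name)

q4 = gguf_size_gb(PARAMS, 5.32)  # ~12.6 GB, in the ballpark of the listed 12 GB
q3 = gguf_size_gb(PARAMS, 3.89)  # ~9.2 GB, in the ballpark of the listed 8.4 GB
print(f"Q4_K_M ~ {q4:.1f} GB, Q3_K_S ~ {q3:.1f} GB")
```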

## Performance (AMD RX 9070 XT, 16GB VRAM, Vulkan)

| Metric | Original 26B Q4_K_M | REAP 19B Q4_K_M |
|---|---|---|
| Speed | 17 tok/s | 130 tok/s |
| VRAM @ 2k ctx | 99% | 76% |
| Max context @ 16GB | ~4k | 65k+ |
| Quality | Baseline | Indistinguishable on coding/reasoning/synthesis |

The ~7.6x speedup (17 to 130 tok/s) comes from crossing the VRAM comfort threshold: at 76% usage the GPU runs without memory pressure.

## Quick Start

### llama.cpp

```bash
# Download
hf download vsark/gemma-4-19b-a4b-it-REAP-GGUF gemma-4-19b-reap-Q4_K_M.gguf

# Run (IMPORTANT: use --reasoning off for Gemma 4)
llama-server \
    --model gemma-4-19b-reap-Q4_K_M.gguf \
    --n-gpu-layers 99 \
    --ctx-size 16384 \
    --reasoning off \
    --host 127.0.0.1 --port 8012
```
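Once running, llama-server exposes an OpenAI-compatible HTTP API. A minimal sketch of a chat request against the server started above (the `model` value is illustrative; llama-server serves whatever model it loaded):

```python
import json
from urllib import request

# Build an OpenAI-style chat completion request for the local server
payload = {
    "model": "gemma-4-19b-reap",  # name is not used for routing by llama-server
    "messages": [
        {"role": "user", "content": "Explain MoE expert pruning in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

req = request.Request(
    "http://127.0.0.1:8012/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, send it and read the reply:
# resp = json.load(request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```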

### Ollama

```bash
# Create a Modelfile
echo 'FROM ./gemma-4-19b-reap-Q4_K_M.gguf
PARAMETER num_ctx 16384' > Modelfile

ollama create gemma4-19b-reap -f Modelfile
ollama run gemma4-19b-reap
```

## Important Notes

- Use `--reasoning off` with llama-server, or the model tries to emit thinking tokens.
- Requires a recent llama.cpp build with gemma4 architecture support (older builds fail to load the model).
- Speculative decoding hurts at these speeds (31 tok/s vs 130 tok/s without it); don't use it.
- Q3_K_S introduces occasional spelling errors on technical terms (e.g., "Affinity" becomes "Affity"). Use Q4_K_M for production.

## About REAP

REAP removes 30% of the MoE experts (38 of 128 per layer) while keeping the same 8 active experts per token, so the active parameter count is unchanged (~4B/token). The pruned experts were the least used, scored by router gate values and activation norms across 22,000 calibration samples.
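The actual REAP saliency criterion is more involved, but the core idea of the paragraph above can be sketched as: score each expert by its average gate weight times its expert-output norm over calibration data, then drop the lowest-scoring 30% per layer. All names and the random calibration data below are illustrative:

```python
import numpy as np

N_EXPERTS, PRUNE_FRAC = 128, 0.30

def select_experts_to_keep(gate_weights, output_norms, prune_frac=PRUNE_FRAC):
    """gate_weights: (samples, experts) router probabilities;
    output_norms: (samples, experts) norms of each expert's output.
    Returns sorted indices of the experts that survive pruning."""
    saliency = (gate_weights * output_norms).mean(axis=0)  # per-expert score
    n_prune = int(len(saliency) * prune_frac)              # 38 of 128 at 30%
    keep = np.argsort(saliency)[n_prune:]                  # drop lowest scorers
    return np.sort(keep)

rng = np.random.default_rng(0)
gates = rng.dirichlet(np.ones(N_EXPERTS), size=1000)   # fake calibration routing
norms = rng.uniform(0.5, 2.0, size=(1000, N_EXPERTS))  # fake expert output norms
kept = select_experts_to_keep(gates, norms)
print(len(kept))  # 90 experts survive per layer
```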

See the original model card for full details and benchmarks.

## Conversion

```bash
python3 convert_hf_to_gguf.py ./gemma-4-19b-reap-bf16/ --outfile gemma-4-19b-reap-F16.gguf --outtype f16
llama-quantize gemma-4-19b-reap-F16.gguf gemma-4-19b-reap-Q4_K_M.gguf Q4_K_M
llama-quantize gemma-4-19b-reap-F16.gguf gemma-4-19b-reap-Q3_K_S.gguf Q3_K_S
```

## Verification

SHA256 checksums. If you reproduce the conversion from 0xSero's BF16 weights using the same llama.cpp version, you should get identical hashes:

```
77c579174b559a3d1812a8d8c03fa3a3ba514acc1ba54cd634ceaa783375d156  gemma-4-19b-reap-Q4_K_M.gguf
ed8e9acefa96cb97c00aaf0a79f2fe95893713148cf28d5cc8f3dad1d71dda6f  gemma-4-19b-reap-Q3_K_S.gguf
```
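To check a download programmatically rather than by eyeballing hashes, a small helper (`verify_sha256` is illustrative, not part of any tool mentioned here):

```python
import hashlib

def verify_sha256(path: str, expected: str, chunk_size: int = 1 << 20) -> bool:
    """Stream the file in 1 MiB chunks and compare against the expected digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest() == expected.lower()

# Example usage against the Q4_K_M file:
# verify_sha256("gemma-4-19b-reap-Q4_K_M.gguf",
#               "77c579174b559a3d1812a8d8c03fa3a3ba514acc1ba54cd634ceaa783375d156")
```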

## Credits
