# Gemma 4 19B-A4B-it REAP – GGUF Quantizations

GGUF quantizations of 0xSero/gemma-4-19b-a4b-it-REAP – a 30% expert-pruned Gemma 4 built with Cerebras REAP.
## Available Quantizations
| File | Quant | Size | BPW | Use Case |
|---|---|---|---|---|
| `gemma-4-19b-reap-Q4_K_M.gguf` | Q4_K_M | 12 GB | 5.32 | **Recommended.** Best quality/size tradeoff. |
| `gemma-4-19b-reap-Q3_K_S.gguf` | Q3_K_S | 8.4 GB | 3.89 | Fits 12 GB cards. Slight quality loss on technical terms. |
## Performance (AMD RX 9070 XT, 16 GB VRAM, Vulkan)
| Metric | Original 26B Q4_K_M | REAP 19B Q4_K_M |
|---|---|---|
| Speed | 17 tok/s | 130 tok/s |
| VRAM @ 2k ctx | 99% | 76% |
| Max context @ 16GB | ~4k | 65k+ |
| Quality | Baseline | Indistinguishable on coding/reasoning/synthesis |
The ~7.6x speedup comes from crossing the VRAM comfort threshold: at 76% usage the GPU runs without memory pressure.
## Quick Start

### llama.cpp
```bash
# Download
hf download vsark/gemma-4-19b-a4b-it-REAP-GGUF gemma-4-19b-reap-Q4_K_M.gguf

# Run (IMPORTANT: use --reasoning off for Gemma 4)
llama-server \
  --model gemma-4-19b-reap-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --ctx-size 16384 \
  --reasoning off \
  --host 127.0.0.1 --port 8012
```
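Once llama-server is running, you can query it over its OpenAI-compatible HTTP API. A minimal sketch (the prompt and `max_tokens` value are arbitrary placeholders):

```shell
# Send a chat completion request to the local llama-server
# via llama.cpp's OpenAI-compatible endpoint.
curl -s http://127.0.0.1:8012/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Summarize what a MoE router does in two sentences."}],
        "max_tokens": 128
      }'
```

The response is a standard chat-completion JSON object; any OpenAI-compatible client pointed at `http://127.0.0.1:8012/v1` will also work.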
### Ollama

```bash
# Create a Modelfile
echo 'FROM ./gemma-4-19b-reap-Q4_K_M.gguf
PARAMETER num_ctx 16384' > Modelfile

ollama create gemma4-19b-reap -f Modelfile
ollama run gemma4-19b-reap
```
## Important Notes

- Use `--reasoning off` with llama-server, or the model tries to emit thinking tokens.
- Requires a recent llama.cpp build with `gemma4` architecture support (older builds fail).
- Speculative decoding hurts at these speeds (31 tok/s vs 130 without); don't use it.
- Q3_K_S introduces occasional spelling errors on technical terms (e.g., "Affinity" → "Affity"). Use Q4_K_M for production.
## About REAP
REAP removes 30% of MoE experts (38 of 128 per layer) while keeping the same 8 active experts per token. Active parameter count is unchanged (~4B/token). The pruned experts were the least-used ones based on router gate values and activation norms across 22,000 calibration samples.
See the original model card for full details and benchmarks.
## Conversion
```bash
python3 convert_hf_to_gguf.py ./gemma-4-19b-reap-bf16/ --outfile gemma-4-19b-reap-F16.gguf --outtype f16
llama-quantize gemma-4-19b-reap-F16.gguf gemma-4-19b-reap-Q4_K_M.gguf Q4_K_M
llama-quantize gemma-4-19b-reap-F16.gguf gemma-4-19b-reap-Q3_K_S.gguf Q3_K_S
```
## Verification

SHA256 checksums: if you reproduce the conversion from 0xSero's BF16 weights using the same llama.cpp version, you should get identical hashes:
```text
77c579174b559a3d1812a8d8c03fa3a3ba514acc1ba54cd634ceaa783375d156  gemma-4-19b-reap-Q4_K_M.gguf
ed8e9acefa96cb97c00aaf0a79f2fe95893713148cf28d5cc8f3dad1d71dda6f  gemma-4-19b-reap-Q3_K_S.gguf
```
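To verify a download, put the published checksum lines into a file and run `sha256sum -c` (GNU coreutils; on macOS use `shasum -a 256 -c` instead). A sketch, assuming both GGUFs sit in the current directory:

```shell
# Check downloaded GGUFs against the published hashes.
# Note: sha256sum's check format expects two spaces between hash and filename.
cat > SHA256SUMS <<'EOF'
77c579174b559a3d1812a8d8c03fa3a3ba514acc1ba54cd634ceaa783375d156  gemma-4-19b-reap-Q4_K_M.gguf
ed8e9acefa96cb97c00aaf0a79f2fe95893713148cf28d5cc8f3dad1d71dda6f  gemma-4-19b-reap-Q3_K_S.gguf
EOF
sha256sum -c SHA256SUMS
```

On success, each file prints a line ending in `OK`; a non-zero exit status means the download is corrupt or incomplete.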
## Credits
- REAP pruning: 0xSero
- Base model: Google Gemma 4
- GGUF conversion + benchmarks: vsark