GGUF Q4_K_M + Q3_K_S Results — RX 9070 XT 16GB Vulkan — 128 tok/s, 65k context
Hey @0xSero — back again. Just finished testing the 19B (0.30) variant. The results are even more dramatic than the 21B.
Setup: llama.cpp (Vulkan backend), RX 9070 XT 16GB, Arch Linux
Conversion Pipeline
Same as before — works cleanly with recent llama.cpp:
```bash
hf download 0xSero/gemma-4-19b-a4b-it-REAP --local-dir ./gemma-4-19b-reap-bf16
python3 convert_hf_to_gguf.py ./gemma-4-19b-reap-bf16/ --outfile gemma-4-19b-reap-F16.gguf --outtype f16
llama-quantize gemma-4-19b-reap-F16.gguf gemma-4-19b-reap-Q4_K_M.gguf Q4_K_M
llama-quantize gemma-4-19b-reap-F16.gguf gemma-4-19b-reap-Q3_K_S.gguf Q3_K_S
```
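If you want a quick sanity check before deploying, a minimal smoke test on the quantized file looks something like this (standard llama-cli flags; the prompt is just a placeholder):

```bash
# Optional smoke test of the quantized model before deploying
# (adjust -ngl to your VRAM; the prompt is just a placeholder)
llama-cli -m gemma-4-19b-reap-Q4_K_M.gguf -ngl 99 -n 64 \
  -p "Summarize the idea behind expert pruning in two sentences."
```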
Quant Sizes
| Quant | Original 26B | REAP 21B (0.20) | REAP 19B (0.30) |
|---|---|---|---|
| Q4_K_M | 16 GB | 13 GB | 12 GB |
| Q3_K_S | 12 GB | 9.5 GB | 8.4 GB |
Speed + VRAM Results (Q4_K_M, ctx 2048, ubatch 64, KV q8_0)
| Metric | Original 26B | REAP 21B | REAP 19B |
|---|---|---|---|
| tok/s (research) | 16.7 | 18.1 | 124.6 |
| tok/s (synthesis) | 17.1 | 17.9 | 126.8 |
| tok/s (coding) | 17.1 | 17.8 | 128.3 |
| VRAM @ 2k ctx | 99% | 99% | 76% |
Yes, that's roughly 7x faster. The 19B Q4_K_M (12 GB) sits at 76% VRAM on a 16GB card, so the whole model plus KV cache fits with real headroom and zero memory pressure. The 21B and original both hit 99% VRAM, and once you're that close to the limit throughput tanks on Vulkan/ROCm (most likely because allocations start spilling into host-visible memory over PCIe).
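For anyone who wants to reproduce these numbers, a llama-bench invocation along these lines should land in the same ballpark. The prompt/gen token counts are placeholders, not my actual pipeline prompts; `-ub 64` matches the ubatch setting above, and the quantized V cache needs flash attention:

```bash
# Ballpark reproduction of the Q4_K_M numbers above (token counts are placeholders)
# -ub 64 matches the ubatch setting; -fa 1 is needed for the quantized V cache
llama-bench -m gemma-4-19b-reap-Q4_K_M.gguf -ngl 99 -ub 64 \
  -fa 1 -ctk q8_0 -ctv q8_0 -p 512 -n 128
```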
Context Window Scaling (Q4_K_M, 16GB VRAM)
| Context | Original 26B | REAP 21B | REAP 19B |
|---|---|---|---|
| 2k | Yes (99% VRAM) | Yes (99%) | Yes (76%) |
| 4k | Barely | Yes | Yes |
| 8k | No | Yes | Yes |
| 16k | No | No | Yes (77%) |
| 32k | No | No | Yes (78%) |
| 65k | No | No | Yes (78%) |
The 19B runs 65k context on 16GB VRAM with q4_0 KV cache. The original can barely fit 4k. This completely changes what's possible with local inference on consumer GPUs.
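A sketch of what a 65k launch with q4_0 KV cache looks like (quantized V cache requires flash attention, and the exact flash-attention flag syntax varies a bit between llama.cpp builds):

```bash
# 65k context with 4-bit KV cache on the 19B Q4_K_M
# (flash attention is required when the V cache is quantized;
#  flag syntax may differ slightly depending on your llama.cpp build)
llama-server -m gemma-4-19b-reap-Q4_K_M.gguf -ngl 99 \
  --ctx-size 65536 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --flash-attn on
```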
Quality Assessment
Tested on 3 production task types (same as my 21B tests — automated knowledge synthesis pipeline):
- Structured knowledge notes — proper frontmatter, LaTeX equations, citations, section hierarchy. Quality indistinguishable from original and 21B.
- Cross-domain synthesis — clean mechanistic explanations, comparative tables. Slightly more concise but equally insightful.
- Python code generation — proper type hints, docstrings, jitter implementation, edge case handling. Equivalent quality.
No looping. No vocabulary collapse. The 30% pruning is invisible on reasoning/coding/synthesis tasks. Knowledge-intensive trivia (world religions, philosophy facts) takes a hit per your benchmarks, but for structured generation workloads this is a non-issue.
The Insight: VRAM Comfort Zone
The biggest finding isn't the 31% size reduction itself — it's that crossing the threshold from 99% to 76% VRAM usage unlocks a 7x throughput improvement on AMD Vulkan. The 21B (13GB Q4_K_M) was still at 99% on 16GB and only got 5-8% faster than the original. The 19B (12GB) breaks into comfortable territory and everything flies.
This makes the 19B dramatically more useful than the 21B on 16GB cards. The extra 10 percentage points of pruning (0.20 to 0.30) have a disproportionate real-world impact.
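If you want to see where your own setup lands relative to that threshold, one way to watch VRAM headroom while the server is running is to poll the amdgpu sysfs counters (the card index is an assumption and may differ on your machine):

```bash
# Poll VRAM usage once per second while llama-server is running
# (amdgpu-specific sysfs nodes; card1 may be card0 on your system)
while sleep 1; do
  awk 'NR==1 {used=$1} NR==2 {printf "VRAM: %.1f%% used\n", used*100/$1}' \
    /sys/class/drm/card1/device/mem_info_vram_used \
    /sys/class/drm/card1/device/mem_info_vram_total
done
```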
Q3_K_S Also Done
Made Q3_K_S variants of both models:
- REAP 21B Q3_K_S: 9.5 GB
- REAP 19B Q3_K_S: 8.4 GB
The 19B Q3_K_S at 8.4GB should be incredible on 12GB cards. Happy to share all GGUF files if you want to host them.
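Untested guess for 12GB cards, but something like this should be a sane starting point (shrink the context size if you hit the VRAM ceiling):

```bash
# Untested starting point for a 12GB card with the 8.4 GB Q3_K_S
# (q8_0 KV keeps the cache small; reduce --ctx-size if it spills)
llama-server -m gemma-4-19b-reap-Q3_K_S.gguf -ngl 99 \
  --ctx-size 16384 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on
```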
Bottom Line
The 19B REAP is the sweet spot for 16GB consumer GPUs. 128 tok/s, 65k context, zero quality regression on structured tasks. Swapping my production server over today. This is genuinely game-changing for local inference.
Hardware: AMD RX 9070 XT (16GB VRAM), Ryzen 5 7600X, Arch Linux, llama.cpp (Vulkan)
Workload: 24/7 automated knowledge synthesis pipeline
Update: GGUF quants are now uploaded!
- Q4_K_M + Q3_K_S: vsark/gemma-4-19b-a4b-it-REAP-GGUF
SHA256 checksums included in the repo for verification. Download and run:
```bash
hf download vsark/gemma-4-19b-a4b-it-REAP-GGUF gemma-4-19b-reap-Q4_K_M.gguf --local-dir .
llama-server --model gemma-4-19b-reap-Q4_K_M.gguf -ngl 99 --reasoning off --ctx-size 16384
```
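To verify against the published checksums after downloading, compute the hash locally and compare it with the value listed in the repo:

```bash
# Compute the local hash and compare it against the SHA256 listed in the repo
sha256sum gemma-4-19b-reap-Q4_K_M.gguf
```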
Also uploaded the 21B variant: vsark/gemma-4-21b-a4b-it-REAP-GGUF
Hey, I don't understand this. How is it possible that 65K context only uses 2% more VRAM? Doesn't that much context normally require significantly more VRAM?