GGUF Q4_K_M + Q3_K_S Results — RX 9070 XT 16GB Vulkan — 128 tok/s, 65k context
Hey @0xSero — back again. Just finished testing the 19B (0.30) variant. The results are even more dramatic than the 21B.
Setup: llama.cpp (Vulkan backend), RX 9070 XT 16GB, Arch Linux
Conversion Pipeline
Same as before — works cleanly with recent llama.cpp:
```bash
hf download 0xSero/gemma-4-19b-a4b-it-REAP --local-dir ./gemma-4-19b-reap-bf16
python3 convert_hf_to_gguf.py ./gemma-4-19b-reap-bf16/ --outfile gemma-4-19b-reap-F16.gguf --outtype f16
llama-quantize gemma-4-19b-reap-F16.gguf gemma-4-19b-reap-Q4_K_M.gguf Q4_K_M
llama-quantize gemma-4-19b-reap-F16.gguf gemma-4-19b-reap-Q3_K_S.gguf Q3_K_S
```
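If you want a quick sanity check before deploying, a minimal smoke test on the quantized file looks something like this (standard llama-cli flags; the prompt is just a placeholder):

```bash
# Optional smoke test of the quantized model before deploying
# (adjust -ngl to your VRAM; the prompt is just a placeholder)
llama-cli -m gemma-4-19b-reap-Q4_K_M.gguf -ngl 99 -n 64 \
  -p "Summarize the idea behind expert pruning in two sentences."
```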
Quant Sizes
| Quant | Original 26B | REAP 21B (0.20) | REAP 19B (0.30) |
|---|---|---|---|
| Q4_K_M | 16 GB | 13 GB | 12 GB |
| Q3_K_S | 12 GB | 9.5 GB | 8.4 GB |
Speed + VRAM Results (Q4_K_M, ctx 2048, ubatch 64, KV q8_0)
| Metric | Original 26B | REAP 21B | REAP 19B |
|---|---|---|---|
| tok/s (research) | 16.7 | 18.1 | 124.6 |
| tok/s (synthesis) | 17.1 | 17.9 | 126.8 |
| tok/s (coding) | 17.1 | 17.8 | 128.3 |
| VRAM @ 2k ctx | 99% | 99% | 76% |
Yes, that's roughly 7x faster. The 19B Q4_K_M (12 GB) sits at 76% VRAM on a 16GB card, so the whole model plus KV cache fits with real headroom and zero memory pressure. The 21B and original both hit 99% VRAM, and once you're that close to the limit throughput tanks on Vulkan/ROCm (most likely because allocations start spilling into host-visible memory over PCIe).
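For anyone who wants to reproduce these numbers, a llama-bench invocation along these lines should land in the same ballpark. The prompt/gen token counts are placeholders, not my actual pipeline prompts; `-ub 64` matches the ubatch setting above, and the quantized V cache needs flash attention:

```bash
# Ballpark reproduction of the Q4_K_M numbers above (token counts are placeholders)
# -ub 64 matches the ubatch setting; -fa 1 is needed for the quantized V cache
llama-bench -m gemma-4-19b-reap-Q4_K_M.gguf -ngl 99 -ub 64 \
  -fa 1 -ctk q8_0 -ctv q8_0 -p 512 -n 128
```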
Context Window Scaling (Q4_K_M, 16GB VRAM)
| Context | Original 26B | REAP 21B | REAP 19B |
|---|---|---|---|
| 2k | Yes (99% VRAM) | Yes (99%) | Yes (76%) |
| 4k | Barely | Yes | Yes |
| 8k | No | Yes | Yes |
| 16k | No | No | Yes (77%) |
| 32k | No | No | Yes (78%) |
| 65k | No | No | Yes (78%) |
The 19B runs 65k context on 16GB VRAM with q4_0 KV cache. The original can barely fit 4k. This completely changes what's possible with local inference on consumer GPUs.
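A sketch of what a 65k launch with q4_0 KV cache looks like (quantized V cache requires flash attention, and the exact flash-attention flag syntax varies a bit between llama.cpp builds):

```bash
# 65k context with 4-bit KV cache on the 19B Q4_K_M
# (flash attention is required when the V cache is quantized;
#  flag syntax may differ slightly depending on your llama.cpp build)
llama-server -m gemma-4-19b-reap-Q4_K_M.gguf -ngl 99 \
  --ctx-size 65536 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --flash-attn on
```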
Quality Assessment
Tested on 3 production task types (same as my 21B tests — automated knowledge synthesis pipeline):
- Structured knowledge notes — proper frontmatter, LaTeX equations, citations, section hierarchy. Quality indistinguishable from original and 21B.
- Cross-domain synthesis — clean mechanistic explanations, comparative tables. Slightly more concise but equally insightful.
- Python code generation — proper type hints, docstrings, jitter implementation, edge case handling. Equivalent quality.
No looping. No vocabulary collapse. The 30% pruning is invisible on reasoning/coding/synthesis tasks. Knowledge-intensive trivia (world religions, philosophy facts) takes a hit per your benchmarks, but for structured generation workloads this is a non-issue.
The Insight: VRAM Comfort Zone
The biggest finding isn't the 31% size reduction itself — it's that crossing the threshold from 99% to 76% VRAM usage unlocks a 7x throughput improvement on AMD Vulkan. The 21B (13GB Q4_K_M) was still at 99% on 16GB and only got 5-8% faster than the original. The 19B (12GB) breaks into comfortable territory and everything flies.
This makes the 19B dramatically more useful than the 21B on 16GB cards. The extra 10 percentage points of pruning (0.20 to 0.30) have a disproportionate real-world impact.
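If you want to see where your own setup lands relative to that threshold, one way to watch VRAM headroom while the server is running is to poll the amdgpu sysfs counters (the card index is an assumption and may differ on your machine):

```bash
# Poll VRAM usage once per second while llama-server is running
# (amdgpu-specific sysfs nodes; card1 may be card0 on your system)
while sleep 1; do
  awk 'NR==1 {used=$1} NR==2 {printf "VRAM: %.1f%% used\n", used*100/$1}' \
    /sys/class/drm/card1/device/mem_info_vram_used \
    /sys/class/drm/card1/device/mem_info_vram_total
done
```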
Q3_K_S Also Done
Made Q3_K_S variants of both models:
- REAP 21B Q3_K_S: 9.5 GB
- REAP 19B Q3_K_S: 8.4 GB
The 19B Q3_K_S at 8.4GB should be incredible on 12GB cards. Happy to share all GGUF files if you want to host them.
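Untested guess for 12GB cards, but something like this should be a sane starting point (shrink the context size if you hit the VRAM ceiling):

```bash
# Untested starting point for a 12GB card with the 8.4 GB Q3_K_S
# (q8_0 KV keeps the cache small; reduce --ctx-size if it spills)
llama-server -m gemma-4-19b-reap-Q3_K_S.gguf -ngl 99 \
  --ctx-size 16384 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on
```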
Bottom Line
The 19B REAP is the sweet spot for 16GB consumer GPUs. 128 tok/s, 65k context, zero quality regression on structured tasks. Swapping my production server over today. This is genuinely game-changing for local inference.
Hardware: AMD RX 9070 XT (16GB VRAM), Ryzen 5 7600X, Arch Linux, llama.cpp (Vulkan)
Workload: 24/7 automated knowledge synthesis pipeline
Update: GGUF quants are now uploaded!
- Q4_K_M + Q3_K_S: vsark/gemma-4-19b-a4b-it-REAP-GGUF
SHA256 checksums included in the repo for verification. Download and run:
```bash
hf download vsark/gemma-4-19b-a4b-it-REAP-GGUF gemma-4-19b-reap-Q4_K_M.gguf --local-dir .
llama-server --model gemma-4-19b-reap-Q4_K_M.gguf -ngl 99 --reasoning off --ctx-size 16384
```
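To verify against the published checksums after downloading, compute the hash locally and compare it with the value listed in the repo:

```bash
# Compute the local hash and compare it against the SHA256 listed in the repo
sha256sum gemma-4-19b-reap-Q4_K_M.gguf
```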
Also uploaded the 21B variant: vsark/gemma-4-21b-a4b-it-REAP-GGUF
Hey, I don't understand this. How is it possible that 65K context only uses 2% more VRAM? Doesn't that much context normally require significantly more VRAM?