Apr 11 chat template causes ~7.5s template rendering overhead per request in llama.cpp
The updated chat template baked into the Apr 11 GGUF is dramatically slower to render in llama.cpp's Jinja engine, adding ~7.5 seconds of CPU-bound overhead per request on long conversations. KV cache and GPU prompt processing are not affected; the bottleneck is entirely in template rendering before inference begins.
Environment
- llama.cpp b8757 (`ghcr.io/ggml-org/llama.cpp:server-cuda`, latest)
- Model: `gemma-4-26B-A4B-it-BF16` on NVIDIA RTX PRO 6000 Blackwell
- Server flags: `-ngl 999 -np 1 --reasoning off`
- No `--mmproj`
- Test payload: 633 messages (~6,113 tokens after template rendering)
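For reference, a payload of this shape can be generated with a short sketch like the one below. The message text, roles, and endpoint parameters are placeholders, not the actual test conversation:

```python
def build_payload(n_messages: int = 633) -> dict:
    """Build a long multi-turn payload for /v1/chat/completions.

    Roles alternate user/assistant after an initial system message;
    the message bodies are placeholder text, not the real conversation.
    """
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for i in range(n_messages - 1):
        role = "user" if i % 2 == 0 else "assistant"
        messages.append({"role": role, "content": f"Turn {i}: placeholder text."})
    return {"messages": messages, "max_tokens": 1, "stream": False}

payload = build_payload()
print(len(payload["messages"]))  # 633
```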
Comparison: Old (Apr 7) vs New (Apr 11) GGUF
Both GGUFs produce identical token counts (~6,113 tokens from 633 messages). KV cache works correctly in both cases. The difference is purely in template rendering time.
Apr 7 GGUF (old template):
```
=== Cold request (633 messages) ===
real    0m2.231s   <-- total wall clock

=== Warm request (identical payload, cache hit) ===
real    0m0.225s   <-- cache works, 1 token processed

=== Server timing (cold) ===
prompt eval time = 2019.90 ms / 6113 tokens (0.33 ms per token, 3026.39 t/s)
total time       = 2019.90 ms / 6114 tokens

=== Server timing (warm) ===
prompt eval time = 16.88 ms / 1 tokens
total time       = 16.88 ms / 2 tokens
```
Wall clock (2.2s) ≈ server total time (2.0s), so template overhead is minimal (~200ms).
Apr 11 GGUF (new template):
```
=== Cold request (633 messages) ===
real    0m9.535s   <-- total wall clock

=== Warm request (identical payload, cache hit) ===
real    0m7.486s   <-- cache hits but still 7.5s wall clock

=== Incremental request (634 messages) ===
real    0m8.178s

=== Server timing (cold) ===
prompt eval time = 1992.66 ms / 6114 tokens (GPU processing similar)
total time       = 1992.66 ms / 6114 tokens

=== Server timing (warm, 1 token from cache) ===
prompt eval time = 19.39 ms / 1 tokens
total time       = 19.39 ms / 2 tokens
```
Server-reported total time is ~2s (same as with the old template), but wall clock is 7.5-9.5s. The **~7.5 seconds** between wall clock and server timing is spent in Jinja template rendering before inference begins, and this overhead is paid on every request regardless of KV cache state.
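The gap can be made explicit by subtracting the server-reported total from the wall clock for each request (numbers taken from the Apr 11 timings above):

```python
# Wall-clock vs server-reported totals from the Apr 11 GGUF runs.
timings = {
    "cold": {"wall_s": 9.535, "server_s": 1.993},
    "warm": {"wall_s": 7.486, "server_s": 0.019},
}

for name, t in timings.items():
    overhead = t["wall_s"] - t["server_s"]
    print(f"{name}: ~{overhead:.1f}s spent outside inference (template rendering)")
```

Both cold and warm requests land at roughly the same ~7.5s residual, which is what points at a per-request cost independent of the KV cache.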
Server logs confirm KV cache is working (warm request):
```
slot update_slots: id 0 | task 6 | need to evaluate at least 1 token (n_past = 6113, task.n_tokens() = 6113)
slot update_slots: id 0 | task 6 | n_past was set to 6112
slot update_slots: id 0 | task 6 | prompt processing done, n_tokens = 6113, batch.n_tokens = 1
```
Root cause
The new template introduces more complex Jinja logic per message (reasoning-content replay and `<|channel>thought` markers, as noted by https://github.com/asf0/gemma4_jinja/). llama.cpp's Jinja engine appears to scale poorly with this added per-message complexity × message count: ~12 ms per message with the new template vs ~0.3 ms with the old one (roughly 40× slower per message).
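Under that model, rendering cost grows linearly with conversation length. A back-of-the-envelope check against the 633-message payload (the 12 ms and 0.3 ms figures are the estimates above, not separately measured constants):

```python
OLD_MS_PER_MSG = 0.3   # estimated from the Apr 7 wall-clock gap
NEW_MS_PER_MSG = 12.0  # estimated from the Apr 11 wall-clock gap

def render_overhead_s(n_messages: int, ms_per_msg: float) -> float:
    """Projected template-rendering overhead, assuming linear scaling."""
    return n_messages * ms_per_msg / 1000.0

print(render_overhead_s(633, OLD_MS_PER_MSG))  # ~0.19 s
print(render_overhead_s(633, NEW_MS_PER_MSG))  # ~7.6 s
```

The projected ~7.6s for the new template matches the observed ~7.5s wall-clock gap, consistent with a linear per-message rendering cost.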
Impact
For multi-turn applications (voice assistants, agents with tool use), conversations grow to hundreds of messages quickly. At 633 messages:
- Old template: 225ms warm, 2.2s cold → usable
- New template: 7.5s warm, 9.5s cold → unusable
Workaround
Pinned to the Apr 7 GGUF. The `<|channel>thought` tag leakage from the old template can be stripped in application code.
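A minimal sketch of the stripping step, assuming leaked reasoning opens with a `<|channel>thought` marker and runs until the next `<|channel>` marker or end of string (the exact tag grammar is an assumption; adjust the pattern to whatever the old template actually emits):

```python
import re

# Assumed grammar: "<|channel>thought" opens a reasoning span that runs
# until the next "<|channel>" marker or the end of the string.
THOUGHT_SPAN = re.compile(r"<\|channel>thought.*?(?=<\|channel>|\Z)", re.DOTALL)

def strip_thought_leakage(text: str) -> str:
    """Remove leaked reasoning spans from model output."""
    return THOUGHT_SPAN.sub("", text).strip()

sample = "<|channel>thought Let me think...<|channel>final Hello there."
print(strip_thought_leakage(sample))  # <|channel>final Hello there.
```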
Related
- https://github.com/ggml-org/llama.cpp/issues/21468 (Gemma 4 cache reuse issues)
- https://github.com/asf0/gemma4_jinja/ (community template fix stripping the problematic sections)