Apr 11 chat template causes ~7.5s template rendering overhead per request in llama.cpp

#27
by btdeviant - opened

The updated chat template baked into the Apr 11 GGUF is dramatically slower to render in llama.cpp's Jinja engine, adding ~7.5 seconds of CPU-bound overhead per request on long conversations. KV cache and GPU prompt processing are not affected; the bottleneck is entirely in template rendering before inference begins.

Environment

  • llama.cpp b8757 (ghcr.io/ggml-org/llama.cpp:server-cuda, latest)
  • gemma-4-26B-A4B-it-BF16 on NVIDIA RTX PRO 6000 Blackwell
  • Server flags: -ngl 999 -np 1 --reasoning off
  • No --mmproj
  • Test payload: 633 messages (~6,113 tokens after template rendering)
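The exact test payload isn't included in this report; the following is a minimal sketch of a comparable long multi-turn conversation (message contents, roles, and the `max_tokens` setting are placeholders and won't reproduce the ~6,113-token count exactly):

```python
# Hypothetical sketch of a 633-message chat-completions payload.
# Real message contents and lengths will differ from this toy version.
def build_payload(n_messages: int = 633) -> dict:
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for i in range(n_messages - 1):
        role = "user" if i % 2 == 0 else "assistant"
        messages.append({"role": role, "content": f"turn {i}"})
    return {"messages": messages, "max_tokens": 1}

payload = build_payload()
print(len(payload["messages"]))  # -> 633
```

POSTing the same payload twice lets you separate the cold run (full prompt processing) from the warm run (KV cache hit), which is the comparison made below.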

Comparison: Old (Apr 7) vs New (Apr 11) GGUF

Both GGUFs produce identical token counts (~6,113 tokens from 633 messages). KV cache works correctly in both cases. The difference is purely in template rendering time.

Apr 7 GGUF (old template):

=== Cold request (633 messages) ===
real    0m2.231s          <-- total wall clock

=== Warm request (identical payload, cache hit) ===
real    0m0.225s          <-- cache works, 1 token processed

=== Server timing (cold) ===
prompt eval time =    2019.90 ms /  6113 tokens (0.33 ms per token, 3026.39 t/s)
total time       =    2019.90 ms /  6114 tokens

=== Server timing (warm) ===
prompt eval time =      16.88 ms /     1 tokens
total time       =      16.88 ms /     2 tokens

Wall clock (2.2s) ≈ server total time (2.0s). Minimal template overhead (~200ms).

Apr 11 GGUF (new template):

=== Cold request (633 messages) ===
real    0m9.535s          <-- total wall clock

=== Warm request (identical payload, cache hit) ===
real    0m7.486s          <-- cache hits but still 7.5s wall clock

=== Incremental request (634 messages) ===
real    0m8.178s

=== Server timing (cold) ===
prompt eval time =    1992.66 ms /  6114 tokens (GPU processing similar)
total time       =    1992.66 ms /  6114 tokens

=== Server timing (warm, 1 token from cache) ===
prompt eval time =      19.39 ms /     1 tokens
total time       =      19.39 ms /     2 tokens

Server total time is 2s (same as old), but wall clock is 7.5-9.5s. The **7.5 seconds** between wall clock and server timing is spent in Jinja template rendering before inference begins. This overhead is paid on every request regardless of KV cache state.
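The gap can be computed directly from the two timing sources; a quick sketch using the Apr 11 measurements quoted above:

```python
# Isolate non-inference overhead (template rendering, HTTP, parsing) by
# subtracting the server's reported total time from end-to-end wall clock.
def non_inference_overhead_s(wall_clock_s: float, server_total_ms: float) -> float:
    return wall_clock_s - server_total_ms / 1000.0

cold = non_inference_overhead_s(9.535, 1992.66)  # Apr 11 cold run
warm = non_inference_overhead_s(7.486, 19.39)    # Apr 11 warm run
print(f"cold: {cold:.2f}s, warm: {warm:.2f}s")  # -> cold: 7.54s, warm: 7.47s
```

Both runs land at roughly 7.5s of overhead, consistent with the rendering cost being independent of KV cache state.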

Server logs confirm KV cache is working (warm request):

slot update_slots: id  0 | task 6 | need to evaluate at least 1 token (n_past = 6113, task.n_tokens() = 6113)
slot update_slots: id  0 | task 6 | n_past was set to 6112
slot update_slots: id  0 | task 6 | prompt processing done, n_tokens = 6113, batch.n_tokens = 1

Root cause

The new template introduces more complex Jinja logic per message (reasoning content replay, <|channel>thought markers, as noted by https://github.com/asf0/gemma4_jinja/). llama.cpp's Jinja engine appears to scale poorly with this added complexity × message count: ~12ms per message on the new template vs ~0.3ms per message on the old template (roughly 40x slower per message).
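The per-message figures follow from dividing each template's overhead by the message count; a back-of-envelope sketch using the ~0.2s (old) and ~7.5s (new) overheads measured above:

```python
# Back-of-envelope per-message rendering cost over the 633-message payload.
def per_message_ms(total_overhead_s: float, n_messages: int = 633) -> float:
    return total_overhead_s * 1000.0 / n_messages

old_ms = per_message_ms(0.2)  # old (Apr 7) template overhead
new_ms = per_message_ms(7.5)  # new (Apr 11) template overhead
print(f"old: {old_ms:.2f} ms/msg, new: {new_ms:.2f} ms/msg, ~{new_ms / old_ms:.0f}x")
# -> old: 0.32 ms/msg, new: 11.85 ms/msg, ~38x
```

This is in line with the roughly 40x per-message slowdown cited above; the exact ratio depends on how the ~200ms old-template overhead is measured.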

Impact

For multi-turn applications (voice assistants, agents with tool use), conversations grow to hundreds of messages quickly. At 633 messages:

  • Old template: ~0.2s warm, 2.2s cold (usable)
  • New template: 7.5s warm, 9.5s cold (unusable)

Workaround

Pinned to the Apr 7 GGUF. The <|channel>thought tag leakage from the old template can be stripped in application code.
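A minimal sketch of such application-side stripping; the exact marker format the model emits is an assumption here (based on the <|channel>thought tag mentioned in this report), so the regex will need adjusting to match real output:

```python
import re

# Hypothetical cleanup for the old template's tag leakage: drop a
# <|channel>thought section up to the next blank line (or end of text).
# The delimiter choice is an assumption about the leaked format.
THOUGHT_RE = re.compile(r"<\|channel>thought.*?(?=\n\n|\Z)", re.DOTALL)

def strip_thought_tags(text: str) -> str:
    return THOUGHT_RE.sub("", text).strip()

print(strip_thought_tags("<|channel>thought internal reasoning...\n\nFinal answer."))
# -> Final answer.
```

Responses without the leaked tag pass through unchanged, so the filter is safe to apply unconditionally.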
