Prompt cache not working correctly

#4
by guiopen - opened

It never caches the last interaction. For example, when I use it as an agent with a lot of rules and tools, the first interaction does the prompt processing of all the tools, so in the second interaction I would expect the response to come much faster since that prefix would already be cached. But the model still processes a lot of tokens before responding; it seems to never cache its last response. I think the problem is related to this:

"No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final output part and does not need to include the thinking content. It is implemented in the provided chat template in Jinja2. However, for frameworks that do not directly use the Jinja2 chat template, it is up to the developers to ensure that the best practice is followed."

The behaviour persists with thinking disabled too, probably because the template removes the empty `<think></think>` tags from the last turn when it becomes history, which changes the prompt prefix and forces the model to reprocess everything.
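To make the suspected mechanism concrete, here is a minimal sketch (the `render` function and the `<|role|>` markers are hypothetical stand-ins for the actual Jinja2 chat template, not the real template): if the template strips thinking tags from historical assistant turns, the re-rendered prompt in turn 2 no longer matches what was cached in turn 1, so prefix caching only covers everything before the assistant's last response.

```python
import re

def render(history, strip_thinking=True):
    # Hypothetical minimal renderer mimicking the documented behaviour:
    # thinking content is removed from *historical* assistant turns.
    parts = []
    for role, text in history:
        if role == "assistant" and strip_thinking:
            # Drop everything between <think> and </think>, even empty tags
            text = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)
        parts.append(f"<|{role}|>{text}")
    return "".join(parts)

def common_prefix_len(a, b):
    # Length of the shared prefix: a proxy for how much of the
    # KV cache from the previous request can be reused.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Turn 1: what the server actually cached (raw output keeps empty think tags)
turn1_raw = [("system", "big tool spec..."), ("user", "hi"),
             ("assistant", "<think></think>Hello!")]
prompt_cached = "".join(f"<|{r}|>{t}" for r, t in turn1_raw)

# Turn 2: the template re-renders history with thinking stripped
turn2 = turn1_raw + [("user", "next question")]
prompt_new = render(turn2)

# The prompts diverge right at the last assistant turn, so only the
# prefix before it (system prompt + tools) stays cached.
print(common_prefix_len(prompt_cached, prompt_new), "of", len(prompt_cached))
```

Under this assumption the system prompt and tool definitions would still hit the cache, but the final assistant turn and everything after it would be reprocessed every time, which matches the slowdown described above.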
