Prompt cache not working correctly

#4
by guiopen - opened

It never caches the last interaction. For example, when I use it as an agent with a lot of rules and tools, the first interaction does the prompt processing of all the tools, so in the second interaction I would expect the response to come much faster since that prefix would already be cached. But the model still processes a lot of tokens before responding; it seems to never cache its last response. I think the problem is related to this:

"No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final output part and does not need to include the thinking content. It is implemented in the provided chat template in Jinja2. However, for frameworks that do not directly use the Jinja2 chat template, it is up to the developers to ensure that the best practice is followed."

The behaviour persists with thinking disabled too, probably because the template removes the empty `<think></think>` tags from the last turn when it becomes history, which changes the prompt prefix and forces the model to reprocess everything.
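To make the suspected mechanism concrete, here is a minimal sketch (the `render` function and the `<|role|>` markers are hypothetical stand-ins for the actual Jinja2 chat template, not the real template): if the template strips thinking tags from historical assistant turns, the re-rendered prompt in turn 2 no longer matches what was cached in turn 1, so prefix caching only covers everything before the assistant's last response.

```python
import re

def render(history, strip_thinking=True):
    # Hypothetical minimal renderer mimicking the documented behaviour:
    # thinking content is removed from *historical* assistant turns.
    parts = []
    for role, text in history:
        if role == "assistant" and strip_thinking:
            # Drop everything between <think> and </think>, even empty tags
            text = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)
        parts.append(f"<|{role}|>{text}")
    return "".join(parts)

def common_prefix_len(a, b):
    # Length of the shared prefix: a proxy for how much of the
    # KV cache from the previous request can be reused.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Turn 1: what the server actually cached (raw output keeps empty think tags)
turn1_raw = [("system", "big tool spec..."), ("user", "hi"),
             ("assistant", "<think></think>Hello!")]
prompt_cached = "".join(f"<|{r}|>{t}" for r, t in turn1_raw)

# Turn 2: the template re-renders history with thinking stripped
turn2 = turn1_raw + [("user", "next question")]
prompt_new = render(turn2)

# The prompts diverge right at the last assistant turn, so only the
# prefix before it (system prompt + tools) stays cached.
print(common_prefix_len(prompt_cached, prompt_new), "of", len(prompt_cached))
```

Under this assumption the system prompt and tool definitions would still hit the cache, but the final assistant turn and everything after it would be reprocessed every time, which matches the slowdown described above.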
