llama.cpp prompt re-analysis issue
Hi Qwen Team,
can you please work with the llama.cpp team on how to get past this: "[42155] slot update_slots: id 0 | task 0 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)"? This is causing a massive slowdown: a small change in the prompt head, or a change in agent role, forces re-reading of a large context and slows down the agentic flow.
This is a flaw in the architecture; unfortunately, no fix will change that. Hybrid RNN models like Qwen 3.5 can't make proper use of context shifting: as soon as the prompt prefix changes, the entire prompt has to be reprocessed.
I really love Qwen 3.5 otherwise, but because of that it is not really practical to use, IMO.
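To make the difference concrete, here is a minimal sketch (illustrative only, not llama.cpp's actual data structures) of why a per-token KV cache can survive a prefix edit while a fused recurrent state cannot. The function names and token lists are invented for the example:

```python
def common_prefix_len(a, b):
    """Number of leading tokens two prompts share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def reusable_tokens_kv_cache(cached_prompt, new_prompt):
    # A plain transformer KV cache keeps one entry per token, so everything
    # up to the first differing token is kept; only the tail is recomputed.
    return common_prefix_len(cached_prompt, new_prompt)

def reusable_tokens_recurrent(cached_prompt, new_prompt):
    # A recurrent/SWA hybrid folds the whole prefix into one fixed-size
    # state. If any cached token changed, that state is invalid and nothing
    # can be salvaged -- the full prompt is reprocessed.
    if cached_prompt == new_prompt[: len(cached_prompt)]:
        return len(cached_prompt)  # exact prefix match: state still valid
    return 0

old = ["<sys>", "You", "are", "agent", "A", ".", "Task", ":", "..."]
new = ["<sys>", "You", "are", "agent", "B", ".", "Task", ":", "..."]

print(reusable_tokens_kv_cache(old, new))   # 4 tokens reusable
print(reusable_tokens_recurrent(old, new))  # 0 -> full re-processing
```

This is also why an append-only prompt (same system prefix, new turns only added at the end) keeps the recurrent state fully reusable, while editing the agent role near the head invalidates everything.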
Oh, that's bad... Is there a guideline for how much of the initial prompt has to stay the same, or would putting in a small draft model help?