llama.cpp prompt re-analysis issue
Hi Qwen Team,
can you please work with the llama.cpp team on how to get past this: "[42155] slot update_slots: id 0 | task 0 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)"? This is causing a massive slowdown: a small change in the prompt head, or a change in agent role, forces re-reading of a large context and slows down the agentic flow.
This is a flaw in the architecture; unfortunately, no fix will change that. Hybrid RNN models like Qwen 3.5 can't make proper use of context shifting: as soon as the prompt prefix changes, the entire prompt has to be reprocessed.
I really love Qwen 3.5 otherwise, but because of that it is not really practical to use, IMO.
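To make the difference concrete, here is a minimal sketch (illustrative only, not llama.cpp's actual data structures) of why a per-token KV cache can survive a prefix edit while a fused recurrent state cannot. The function names and token lists are invented for the example:

```python
def common_prefix_len(a, b):
    """Number of leading tokens two prompts share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def reusable_tokens_kv_cache(cached_prompt, new_prompt):
    # A plain transformer KV cache keeps one entry per token, so everything
    # up to the first differing token is kept; only the tail is recomputed.
    return common_prefix_len(cached_prompt, new_prompt)

def reusable_tokens_recurrent(cached_prompt, new_prompt):
    # A recurrent/SWA hybrid folds the whole prefix into one fixed-size
    # state. If any cached token changed, that state is invalid and nothing
    # can be salvaged -- the full prompt is reprocessed.
    if cached_prompt == new_prompt[: len(cached_prompt)]:
        return len(cached_prompt)  # exact prefix match: state still valid
    return 0

old = ["<sys>", "You", "are", "agent", "A", ".", "Task", ":", "..."]
new = ["<sys>", "You", "are", "agent", "B", ".", "Task", ":", "..."]

print(reusable_tokens_kv_cache(old, new))   # 4 tokens reusable
print(reusable_tokens_recurrent(old, new))  # 0 -> full re-processing
```

This is also why an append-only prompt (same system prefix, new turns only added at the end) keeps the recurrent state fully reusable, while editing the agent role near the head invalidates everything.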
Oh, that's bad... Is there a guideline for how much of the initial prompt has to stay the same, or would putting in a small draft model help?