Prefill speed degradation

#2
by daibuzizai - opened

After testing, ik_llama.cpp shows low running efficiency: prefill speed is seriously degraded, down to only about 25% of the original.

That's because most of these quants keep everything except the conditional (routed) experts in Q8. So components like attention are a bit heavier to compute, but should degrade less over longer contexts. That's the theory, at least.
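To make the "attention is a bit heavier" point concrete, here is a back-of-the-envelope sketch comparing per-token weight traffic for attention tensors kept at Q8_0 versus a hypothetical Q4_0 version. The bits-per-weight values follow from the block layouts (32 weights plus an fp16 scale per block); the layer dimensions are made-up placeholders, not taken from any particular model.

```python
# Illustrative arithmetic only: how much more data must be read per token
# when attention projections are stored in Q8_0 instead of Q4_0.

Q8_0_BPW = 8.5  # 32 * 8-bit weights + 16-bit scale per block of 32
Q4_0_BPW = 4.5  # 32 * 4-bit weights + 16-bit scale per block of 32

def attn_weight_bytes(hidden_dim: int, n_layers: int, bpw: float) -> float:
    """Bytes of attention weights (Q, K, V, O projections) read per token."""
    params_per_layer = 4 * hidden_dim * hidden_dim  # four d x d projections
    return params_per_layer * n_layers * bpw / 8    # bits -> bytes

# Placeholder sizes, chosen only for illustration.
hidden_dim, n_layers = 4096, 32

q8 = attn_weight_bytes(hidden_dim, n_layers, Q8_0_BPW)
q4 = attn_weight_bytes(hidden_dim, n_layers, Q4_0_BPW)

print(f"Q8_0 attention weights: {q8 / 2**30:.2f} GiB read per token")
print(f"Q4_0 attention weights: {q4 / 2**30:.2f} GiB read per token")
print(f"Q8_0 incurs {q8 / q4:.2f}x the memory traffic of Q4_0")
```

The ratio is just 8.5/4.5, roughly 1.9x, regardless of model size; the trade-off is that Q8 loses far less precision, which is why these mixes accept heavier attention in exchange for quality that holds up over long contexts.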
