Prefill speed degradation
#2
by daibuzizai - opened
After testing, ik_llama.cpp runs with low efficiency: prefill speed is severely degraded, down to only about 25% of the original.
That's because most of these quants keep everything except the conditional (routed) experts in Q8. Tensors like attention are therefore a bit heavier to compute, which slows prefill, but quality should degrade less over longer contexts. That's the theory, at least.
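The size/compute tradeoff can be made concrete with a rough back-of-the-envelope estimate. This sketch uses illustrative assumptions (a ~90% routed-expert parameter share typical of MoE models, Q8_0 at 8.5 bits per weight, and Q4_K at roughly 4.5 bits per weight as a stand-in for the expert quant); none of these numbers come from the thread itself.

```python
# Rough estimate of effective bits-per-weight for a mixed quant where
# everything except the conditional (routed) experts is kept at Q8_0.
# The parameter split and bpw values are illustrative assumptions,
# not measurements of any particular model.

Q8_0_BPW = 8.5   # 32-weight blocks: 32*8 bits + a 16-bit scale -> 8.5 bpw
Q4_K_BPW = 4.5   # approximate bits per weight for Q4_K

def effective_bpw(expert_frac: float) -> float:
    """Average bpw when routed experts use Q4_K and the rest stays Q8_0."""
    return expert_frac * Q4_K_BPW + (1.0 - expert_frac) * Q8_0_BPW

# If routed experts hold ~90% of parameters, overall model size is
# dominated by the expert quant, while the Q8_0 attention/shared
# tensors add per-token compute cost during prefill.
print(f"{effective_bpw(0.90):.2f} bpw")
```

The point of the arithmetic: because the routed experts dominate the parameter count, keeping the remaining ~10% at Q8 barely changes file size, but those Q8 tensors (attention in particular) are touched on every token, which is where the extra prefill cost comes from.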