Infinite generation after llama.cpp release b8699
Hi,
after updating to release b8699, which supports attention rotation for heterogeneous iSWA (PR #21513), I'm getting infinite token generation.
I saw that Unsloth (https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/discussions/20) is updating their GGUFs.
Do you plan to update your quants?
Thank you in advance
Can you give an example of a command that's failing? I updated to the latest llama.cpp, ran the Q4_K_M I just downloaded, and have no issues.
Thank you for answering. I found out that the cause is V cache quantization to Q8_0.
I was using BF16 for the K cache and Q8_0 for the V cache, and that combination errors out after upgrading to b8699, both with your model and with Unsloth's new ones.
Using BF16 for both works correctly as before, so it's probably a regression in llama.cpp.
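For reference, here is a sketch of the two configurations I compared, using llama.cpp's `--cache-type-k` / `--cache-type-v` flags (the model path and prompt are illustrative, not the exact ones I ran):

```shell
# Working combination: BF16 for both K and V cache (same behavior as before b8699)
llama-cli -m ./gemma-Q4_K_M.gguf \
  --cache-type-k bf16 --cache-type-v bf16 \
  -p "Hello"

# Failing combination after b8699: BF16 K cache with Q8_0 V cache
# (infinite token generation with this model)
llama-cli -m ./gemma-Q4_K_M.gguf \
  --cache-type-k bf16 --cache-type-v q8_0 \
  -p "Hello"
```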