<unused49> infinite generation after llama.cpp release b8699

#2
by fakezeta - opened

Hi,

after updating to release b8699, which supports attention rotation for heterogeneous iSWA (PR #21513), I'm getting infinite token generation.
I saw on the Unsloth side (https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/discussions/20) that they are updating their GGUFs.

Do you plan to update your quants?

Thank you in advance

Can you give an example of a command that's failing? I updated to the latest llama.cpp and ran the Q4_K_M I just downloaded, and I have no issues.

Thank you for answering. I found out that the cause is V cache quantization to Q8_0.
I was using BF16 for the K cache and Q8_0 for the V cache, and that combination gives an error after upgrading to b8699, both with your model and with Unsloth's new ones.
Using BF16 for both works correctly as before, so this is probably a regression in llama.cpp.
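For reference, here is a sketch of the two cache configurations being compared, using llama.cpp's `--cache-type-k` / `--cache-type-v` flags. The model filename and port are placeholders, not the exact command from this report:

```shell
# Combination that fails after b8699: BF16 K cache + Q8_0 V cache
# (model path and port are illustrative placeholders)
./llama-server -m ./gemma-Q4_K_M.gguf \
  --cache-type-k bf16 --cache-type-v q8_0 \
  --port 8080

# Working as before: BF16 for both K and V caches
./llama-server -m ./gemma-Q4_K_M.gguf \
  --cache-type-k bf16 --cache-type-v bf16 \
  --port 8080
```

Keeping both cache types identical (BF16/BF16) avoids the issue, which points at the mixed K/V quantization path rather than the GGUF itself.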
