Do we need new quants? There are a few fixes:
"
kv-cache : support attention rotation for heterogeneous iSWA https://github.com/ggml-org/llama.cpp/pull/21513
CUDA: check for buffer overlap before fusing - CRITICAL fixes tokens https://github.com/ggml-org/llama.cpp/pull/21566
vocab : add byte token handling to BPE detokenizer for Gemma4 https://github.com/ggml-org/llama.cpp/pull/21488
convert : set "add bos" == True for Gemma 4 https://github.com/ggml-org/llama.cpp/pull/21500
common : add gemma 4 specialized parser https://github.com/ggml-org/llama.cpp/pull/21418
llama-model: read final_logit_softcapping for Gemma 4 https://github.com/ggml-org/llama.cpp/pull/21390
llama: add custom newline split for Gemma 4 https://github.com/ggml-org/llama.cpp/pull/21406
"
I don't think so; look at my upload date. My GGUFs were uploaded 2 days ago, and the changes you mention are older than that.
This list is dumb; half of it is inference/server bugfixes. But the tokenizer changes sadly fuck with GGUFs in an indirect way: the imatrix file needs to be redone, and by extension the GGUF.
Not sure about this one, but your 31B version, probably.
the imatrix file needs to be redone, and by extension the GGUF.
Not sure about this one, but your 31B version, probably.
None of my GGUFs use imatrix; I only do standard quants with no imatrix. Knowing that, does my 31B still need to be redone?
I don't think so, but that's slightly above my paygrade.
(that said, now that I know your quants don't use imatrix, which you probably should specify somewhere, as it's quite rare nowadays, I'll remake them anyway :D)
(that said, now that I know your quants don't use imatrix, which you probably should specify somewhere, as it's quite rare nowadays, I'll remake them anyway :D)
There is a metadata entry in the GGUF file that lists the imatrix dataset if an imatrix was applied to the quant.
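To illustrate that metadata check, here is a minimal sketch that parses just the GGUF header and string-valued metadata keys by hand and looks for an imatrix marker. It assumes the `quantize.imatrix.*` key prefix (e.g. `quantize.imatrix.dataset`) that llama.cpp's quantize tool writes, as I recall it; the blob at the bottom is a synthetic stand-in for a real GGUF file, and the parser deliberately handles only string values, not the full GGUF type system.

```python
import struct

def gguf_metadata_string_keys(data: bytes) -> dict:
    # GGUF header: magic "GGUF", uint32 version, uint64 tensor count, uint64 KV count.
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    assert magic == b"GGUF", "not a GGUF file"
    off = struct.calcsize("<4sIQQ")
    meta = {}
    for _ in range(n_kv):
        (klen,) = struct.unpack_from("<Q", data, off); off += 8
        key = data[off:off + klen].decode(); off += klen
        (vtype,) = struct.unpack_from("<I", data, off); off += 4
        if vtype != 8:  # sketch only handles string values (type 8); stop otherwise
            break
        (vlen,) = struct.unpack_from("<Q", data, off); off += 8
        meta[key] = data[off:off + vlen].decode(); off += vlen
    return meta

def kv_string(key: str, val: str) -> bytes:
    # Encode one string KV pair: uint64 key length, key, uint32 type tag 8,
    # uint64 value length, value.
    k, v = key.encode(), val.encode()
    return (struct.pack("<Q", len(k)) + k
            + struct.pack("<I", 8)
            + struct.pack("<Q", len(v)) + v)

# Synthetic GGUF blob (version 3, zero tensors, one KV) standing in for a real file.
blob = struct.pack("<4sIQQ", b"GGUF", 3, 0, 1) \
     + kv_string("quantize.imatrix.dataset", "calibration.txt")

meta = gguf_metadata_string_keys(blob)
imatrix_used = any(k.startswith("quantize.imatrix.") for k in meta)
print(imatrix_used)  # True -> this quant was made with an imatrix
```

On a real file you would read the bytes from disk instead of building `blob`; if no `quantize.imatrix.*` key shows up in the metadata, the quant is a plain (non-imatrix) one.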
Okay, I redid the GGUFs, just finished uploading them.
The GGUFs for gemma-4-E4B-it, gemma-4-26B-A4B-it and gemma-4-31B-it have now all been redone and re-uploaded.
Big effort but likely useless: https://github.com/ggml-org/llama.cpp/pull/21500/commits/4e19abc52b275f547d2b9968095cc599c6e2e2e2 - it would work OK anyway.
Big effort but likely useless
Could you explain?
In the above commit they added a workaround so old GGUFs are handled by llama.cpp properly. Sorry for the late post; I only had time to check this now. I was wondering myself whether I have to regenerate.
That's for the BOS token handling; it's just inference changes and metadata. (btw I really hate that backends feel like they should manage the BOS token themselves. Now we have toggles at 3 levels, file/backend/frontend, issues with each and every new release, and go explain what's what to the end-user, or maintain consistency as a middleware.)
Out of this whole changelog, only one entry is a possibly related real GGUF change (the CUDA one, which outputs broken tokens on rare occasions), and only GGUFs done with imatrix are affected. I had to read all that stuff to make sure.
Yeah, right for imatrixed ones.
The GGUFs for Gemma 4 E4B don't work at all now; I tried multiple times. The safetensors are fine, but the GGUFs refuse to load. I guess it's an issue with llama.cpp? Gemma 4 31B and 26B seem to work with no issues, however.
No idea why it doesn't work, but anyway, Google updated the chat_templates, so there's another reason to reupload new quants :P