Apr 11: Updated with Google chat template fixes + more
Hey everyone, we’ve updated the quants again to include all of Google’s official chat template fixes (which fixed/improved tool-calling), along with the latest llama.cpp fixes.
We know there has been a lot of re-downloading lately, and we appreciate your patience. We push updates whenever fixes become available so that you always have the latest, best-performing quants.
NVIDIA is working on the CUDA 13.2 issue. Until it is fixed, do not use CUDA 13.2.
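For anyone debugging template issues locally: the chat template is a Jinja template embedded in the GGUF, and llama.cpp renders your message list through it before tokenizing. A minimal sketch of that rendering with jinja2 (the template below is illustrative only, not Google's actual fixed template):

```python
# Sketch of chat-template rendering (illustrative Gemma-style template,
# NOT Google's actual fixed one; llama.cpp does this step internally).
from jinja2 import Template

TEMPLATE = (
    "{% for m in messages %}"
    "<start_of_turn>{{ m.role }}\n{{ m.content }}<end_of_turn>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<start_of_turn>model\n{% endif %}"
)

def render_chat(messages, add_generation_prompt=True):
    """Render a message list into the prompt string the model is fed."""
    return Template(TEMPLATE).render(
        messages=messages, add_generation_prompt=add_generation_prompt
    )

print(render_chat([{"role": "user", "content": "Hi"}]))
```

A bug in this rendering step (a wrong turn marker, a missing tool-call section, and so on) changes what the model sees for every request, which is why template fixes can improve tool calling without touching the weights.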
I am getting a bunch of "unused" tokens now on Q8_K_XL when using --cpu-moe or --n-cpu-moe.
Does not happen with the bartowski quants.
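For context, `--cpu-moe` and `--n-cpu-moe` offload MoE expert tensors to the CPU to save VRAM. An example invocation (model filename, layer counts, and values are placeholders for your own setup, assuming a recent llama.cpp build):

```shell
# Placeholder paths/values; --n-cpu-moe N keeps the expert tensors of the
# first N layers on the CPU, --cpu-moe keeps all of them there.
./llama-server \
  -m gemma-Q8_K_XL.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 20
```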
Do you have an example screenshot comparing the two models?
Did you update llama.cpp?
Still seems broken to me? I'm getting this (it wasn't even this bad before; previously the model would just randomly stop making tool calls, even after it had decided it needed to continue):
This fixed the same issue for me: https://github.com/asf0/gemma4_jinja/
Thank you!
Did you try updating llama.cpp? Many users, like the one below, say they no longer encounter it now that llama.cpp fixed it. It's not an issue with our quants; see: https://github.com/ggml-org/llama.cpp/issues/21321
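Updating usually means grabbing a prebuilt release or pulling and rebuilding from source; a typical rebuild, assuming a CUDA build checked out from git:

```shell
# Pull the latest fixes and rebuild (CUDA build shown; drop -DGGML_CUDA=ON
# for a CPU-only build).
git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```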
Thank you!
Is the unused token issue fixed now?

I am having the same issue. With the previous quants the model would always forget its reasoning content after each tool call; with these quants (Q5_K_XL) the tool calls break completely most of the time (as shown in the image). I am using the latest llama.cpp release (https://github.com/ggml-org/llama.cpp/releases/tag/b8763).
Thanks for your work!
That actually did fix it. Thanks!
llama.cpp was updated right before the testing that resulted in that screenshot, but I will update again and report back
Updated llama.cpp again, and now the behavior is closer to what I expect; it doesn't emit those weird extra tokens anymore.
But whereas before it would decide it needed further tool calling, fail to call a tool after it stopped thinking, and go straight into answering the user's query before completing research, it now attempts a tool call, messes it up, and stops right there (this was at the end of its third round of thinking, with 4 successful web search tool calls in between):
, and
The whole run was just this:
On the next run it actually managed to call web search successfully 14 times and come up with an answer, but its research was still incomplete, and it knew it: in its last thinking block it had already listed a number of queries it still needed to check before answering, then searched one of those queries and went straight to answering without finishing research.
I'm starting to wonder if this model is just bad. It also consistently fails to use the web fetch tool I have provided, and despite running with the recommended parameters, it keeps repeating itself over and over in thinking blocks (reiterating what it has already found and said before). Qwen3.5 35B is way better at research than this model: it actually recognizes it has a web fetch tool, and instead of searching for every spec of every phone individually, it can search for the link to each phone's full specs page and fetch that in its entirety, which Gemma never does despite being given the same tools and the same system prompt.
Yeah, the token leak issue is fixed for me, though.
But the model just can't work with opencode or claude-code: terrible results.
