Apr 11: Updated with Google chat template fixes + more
Hey everyone, we’ve updated the quants again to include all of Google’s official chat template fixes (which fixed/improved tool-calling), along with the latest llama.cpp fixes.
We know there has been a lot of re-downloading lately, and we appreciate your patience. We push updates whenever fixes become available so that you always have the latest, best-performing quants.
NVIDIA is working on the CUDA 13.2 issue. Until it is fixed, do not use CUDA 13.2.
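For anyone debugging template issues locally: the chat template is a Jinja template embedded in the GGUF, and llama.cpp renders your message list through it before tokenizing. A minimal sketch of that rendering with jinja2 (the template below is illustrative only, not Google's actual fixed template):

```python
# Sketch of chat-template rendering (illustrative Gemma-style template,
# NOT Google's actual fixed one; llama.cpp does this step internally).
from jinja2 import Template

TEMPLATE = (
    "{% for m in messages %}"
    "<start_of_turn>{{ m.role }}\n{{ m.content }}<end_of_turn>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<start_of_turn>model\n{% endif %}"
)

def render_chat(messages, add_generation_prompt=True):
    """Render a message list into the prompt string the model is fed."""
    return Template(TEMPLATE).render(
        messages=messages, add_generation_prompt=add_generation_prompt
    )

print(render_chat([{"role": "user", "content": "Hi"}]))
```

A bug in this rendering step (a wrong turn marker, a missing tool-call section, and so on) changes what the model sees for every request, which is why template fixes can improve tool calling without touching the weights.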
I am getting a bunch of "unused" tokens now on Q8_K_XL when using --cpu-moe or --n-cpu-moe.
Does not happen with the bartowski quants.
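For context, `--cpu-moe` and `--n-cpu-moe` offload MoE expert tensors to the CPU to save VRAM. An example invocation (model filename, layer counts, and values are placeholders for your own setup, assuming a recent llama.cpp build):

```shell
# Placeholder paths/values; --n-cpu-moe N keeps the expert tensors of the
# first N layers on the CPU, --cpu-moe keeps all of them there.
./llama-server \
  -m gemma-Q8_K_XL.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 20
```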
Do you have an example screenshot comparing the two models?
Did you update llama.cpp?
Still seems broken to me? I'm getting this (it wasn't even this bad before; previously the model would just randomly stop making tool calls, even after it had decided it needed to continue):
This fixed the same issue for me: https://github.com/asf0/gemma4_jinja/
Thank you!
Did you try updating llama.cpp? Many users, like the one below, say they no longer encounter it now that llama.cpp fixed it. It's not an issue with our quants; see: https://github.com/ggml-org/llama.cpp/issues/21321
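Updating usually means grabbing a prebuilt release or pulling and rebuilding from source; a typical rebuild, assuming a CUDA build checked out from git:

```shell
# Pull the latest fixes and rebuild (CUDA build shown; drop -DGGML_CUDA=ON
# for a CPU-only build).
git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```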
Thank you!
Is the unused token issue fixed now?

I am having the same issue. With the previous quants the model would always forget its reasoning content after each tool call; with these quants (Q5_K_XL) the tool calls break completely most of the time (as shown in the image). I am using the latest llama.cpp release (https://github.com/ggml-org/llama.cpp/releases/tag/b8763).
Thanks for your work!
That actually did fix it. Thanks!
llama.cpp was updated right before the testing that resulted in that screenshot, but I will update again and report back
Updated llama.cpp again, and now the behavior is closer to what I expect; it doesn't emit those weird extra tokens anymore.
But whereas before it would decide it needed further tool calling, fail to call a tool after it stopped thinking, and go straight into answering the user's query before completing research, it now attempts a tool call, messes it up, and stops right there (this was at the end of its third round of thinking, with 4 successful web search tool calls in between):
, and
The whole run was just this:
On the next run it actually managed to call web search successfully 14 times and come up with an answer, but its research was still incomplete, and it knew it: in its last thinking block it had already listed a number of queries it still needed to check before answering, then searched one of those queries and went straight to answering without finishing research.
I'm starting to wonder if this model is just bad. It also consistently fails to use the web fetch tool I have provided, and despite running with the recommended parameters, it keeps repeating itself over and over in thinking blocks (reiterating what it has already found and said before). Qwen3.5 35B is way better at research than this model: it actually recognizes it has a web fetch tool, and instead of searching for every spec of every phone individually, it can search for the link to each phone's full specs page and fetch that in its entirety, which Gemma never does despite being given the same tools and the same system prompt.
Yeah, the token leak issue is fixed for me, though.
But the model just can't work with opencode or claude-code: terrible results.
