Getting stuck in thinking
Hello,
This model is the first Gemma MoE I've seen pass the 15 Tool Tests, so I was excited to use it in my project. The project goes through hundreds of entries and fact-checks them against live web searches via Tavily. I've tested many models, and this is also the only one that gets stuck: it simply takes many minutes without responding, as if it were in an internal loop. Using --reasoning off solves the issue, but then it loses some of its wit, i.e. it scores lower on my internal tests than other models, such as the CelesteImperia one, and it no longer passes stevibe's test.
FYI, I am using q8_0.gguf with 262k context.
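For context, the loop is roughly this shape (a minimal sketch, not my actual code; `fact_check`, `fake_search`, and the keyword-overlap "verdict" are all stand-ins — the real pipeline calls Tavily and has the LLM judge each entry against the snippets):

```python
# Minimal sketch of a fact-checking loop (hypothetical names; the real
# project uses a live Tavily search and an LLM judge, not keyword overlap).

def fact_check(entry: str, search_fn) -> dict:
    """Check one entry against web snippets returned by search_fn."""
    results = search_fn(entry)  # stand-in for a live web search call
    snippets = [r["content"] for r in results]
    # The real pipeline asks the model to judge the entry against the
    # snippets; a trivial substring match stands in for that verdict here.
    supported = any(entry.lower() in s.lower() for s in snippets)
    return {"entry": entry, "supported": supported, "evidence": snippets}

def fake_search(query: str):
    # Stand-in for the live search client so the sketch runs offline.
    return [{"content": "The Eiffel Tower is in Paris."}]

print(fact_check("eiffel tower", fake_search)["supported"])  # → True
```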
Yeah, I noticed this issue as well. I am actively looking into it. Just confirming: the issue is that the model will sometimes not exit its thinking, and its entire response will be parsed into reasoning content, correct?
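If it helps anyone reproduce it, the symptom is easy to spot in the OpenAI-compatible response: llama-server splits the parsed thinking into a `reasoning_content` field on the message, so a stuck generation looks like an empty `content` with everything in `reasoning_content` (a quick sketch; field names follow llama.cpp's server API as I understand it):

```python
# Heuristic check for the "stuck in thinking" symptom on a chat
# completion message dict from an OpenAI-compatible llama-server.

def looks_stuck(message: dict) -> bool:
    content = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning_content") or "").strip()
    # Stuck runs have no final answer: everything landed in reasoning.
    return not content and bool(reasoning)

ok = {"content": "Paris.", "reasoning_content": "The user asks..."}
stuck = {"content": "", "reasoning_content": "Let me think... " * 500}
print(looks_stuck(ok), looks_stuck(stuck))  # → False True
```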
I also ran into issues with this model. What card are you using? Are you using llama.cpp? Which flags do you use to launch the model?
Could you confirm you are using the latest llama.cpp (released 6 hours ago)? I see the only note for the release is "fix: common/gemma4 : handle parsing edge cases". If you could try that out and let me know, that would be a huge help. I am already looking into potential chat-template bugs amongst other things. The most likely issue is that the model started to forget to close its tags. Nothing some DPO alignment can't fix. Same with the tendency to open triple backticks for bulleted lists in its reasoning.
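A cheap way to confirm the unclosed-tag theory on raw outputs (a sketch; the exact tag names depend on the template, so treat `<think>`/`</think>` as assumptions):

```python
# Flag raw generations whose thinking tag was opened but never closed
# (tag names are template-dependent; these are placeholders).

def has_unclosed_think(text: str,
                       open_tag: str = "<think>",
                       close_tag: str = "</think>") -> bool:
    return text.count(open_tag) > text.count(close_tag)

print(has_unclosed_think("<think>hmm</think>ok"))  # → False
print(has_unclosed_think("<think>hmm, let me loop forever"))  # → True
```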
I am running on RTX 5090.
my flags:
llama-server \
--model $LLM_DIR/TeichAI/gemma-4-26B-A4B-it-Claude-Opus-Distill-v2-GGUF/gemma-4-26B-A4B-it-Claude-Opus-Distill.q8_0.gguf \
--alias "gemma-4-26B-A4B-Claude-Opus-Distill" \
--temp 0.2 \
--top-p 0.95 \
--top-k 64 \
--ctx-size 262144 \
-ngl 99 \
--port 8001 \
--host 0.0.0.0 \
--reasoning off \
-np 2 -fa on \
-b 4096 -ub 1024 \
--cache-type-k q8_0 --cache-type-v q8_0
I just tried again with the latest release (b8783). It still didn't fix the issue.
Oh, it just core dumped, but a later run using another model also core dumped, so that's not related to this model.
Ok will look into a fix more in depth today, thanks
Alright. I will be monitoring this thread in case you need me to test anything.
Realistically, anyone can test it :P
In the coming moments I'll be trying some stuff and uploading it to my personal account to test.
- I will test with a stricter template (I just did, but results looked pretty similar; feel free to test it if you'd like)
- I will replace the tokenizer config with the one from Google (doing this now)
- I will roll back 1 epoch to see if this eliminates the artifacts (honestly not super confident this will work, though it should help a bit with the other jank)
Not sure if you have tested the 31B v1/v2; do they have these issues as well? I am suspecting this is a MoE training issue.
https://huggingface.co/armand0e/test-2-26B-A4B-template-fixes-GGUF
Deleted the first attempt. Please try out this GGUF when you get a chance; it uses the official Google tokenizer. So far I haven't hit a cut-off issue, though other issues are still present. I've triple-checked all my scripts and can't seem to find the issue.
I've just tested the 31B. The issue is not there; it ran without failing for 1,502s. BUT, this is way more than the other variants take to complete the same task. They usually finish in under 400s. This one took longer and used more web-search calls.
Going to test your new version now.
Oh, wow. The new version just slammed through the task. It actually scored better than the 31B, finished in 171s, and didn't get stuck. Before, it was getting stuck 100% of the time.
That's great, great, great news. Bad news for me because I need to redo the GGUFs XD, but good news overall.
If you are running your own tool-calling evaluations, please feel free to share your results compared to other models!
The tool-calling evaluation I am running is against live web searches and some manually verified data, but it is still unstable; even the same model scores differently across runs. Still, your model is the fastest one that scores well. If I am able to make it more stable, I will share the results. The other model that scores well is https://huggingface.co/CelesteImperia/Gemma-4-26B-MoE-GGUF (both MoE and dense).
Do you believe your dense model needs updates, too? I think it is unexpected that it took 25 minutes to finish a task your MoE version finished in 3 minutes, with comparable scoring.
I will update all the gemma4 models with these template and tokenizer-config fixes just to be safe. It definitely won't hurt.
Though as for the speed issue, hardware will definitely play a big role here. The MoE only has ~4B active parameters compared to the 31B dense model.
I know, but I am running on a 5090, and the dense models from CelesteImperia / unsloth finish in under 7 minutes.
Sheesh. I'll look into it, but I don't have the slightest clue, tbh.
Here are the scores and timings for running this job on a larger dataset. Lower points are better here. So it is the same score as another dense model, but it took 7x longer to finish. Qwopus dense took even longer.
This model
MoE Q8, KV q8: 72 points, 1,621s
MoE Q8, KV q8: 70 points, 1,582s
Dense Q6, KV q8: 63 points, 8,301s
CelesteImperia
MoE Q8, KV q8: 74 points, 1,165s
MoE Q8, KV q8: 71 points, 1,087s
Dense Q6, KV q8: 62 points, 1,169s
MoE Q8, KV f16: 71 points, 2,351s
Qwopus
Dense Q8: 67 points, 11,835s
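The quoted ratios check out if the dense Q6 time is read as 8,301s (a quick sanity-check sketch; the variable names are mine):

```python
# Sanity-check the "7x longer" claim from the timings above (seconds).
teichai_dense = 8301   # this model, Dense Q6, KV q8
celeste_dense = 1169   # CelesteImperia, Dense Q6, KV q8
teichai_moe = 1621     # this model, MoE Q8, KV q8 (slower run)
qwopus_dense = 11835   # Qwopus, Dense Q8

print(round(teichai_dense / celeste_dense, 1))  # → 7.1 (same score, ~7x slower)
print(round(qwopus_dense / teichai_moe, 1))     # → 7.3 (Qwopus dense vs this MoE)
```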
The jumps in speed when you move off our models are really noticeable. I wonder if it's due to faulty day-1 unsloth support. Not sure what could be causing this, but impressive numbers nonetheless!
Does that comparison to Qwopus indicate that the KV cache for Qwopus wasn't even quantized?
@khronnuz Sorry for troubling you again, but I just realized I had used the gemma4 chat template from before 6 days ago (without the tool-calling patches they made). If possible, I'd appreciate another test with this model, mainly to make sure the changes didn't influence anything poorly. Again, sorry for the trouble; I've really got to stop touching this stuff late at night.
https://huggingface.co/armand0e/test-gemma4-template-updates-GGUF
Sure thing, will test the Q8 version of it right now. Will let you know once it finishes.
Oh boy, there are more issues: the tokenizer config was outdated, with the wrong regex.
The new version was actually taking way longer than the others. It had not finished yet and was leaking tool calls as text messages, so I aborted it.
Wow! Plot twist, fr. Well, thanks for your help. I guess I will leave the old template and tokenizer config as they are.