
Getting stuck in thinking

#1
by khronnuz - opened

Hello,

This model is the first Gemma MoE I've seen pass the 15 Tool Tests, so I was excited to use it in my project. The project goes through hundreds of entries and fact-checks them against live web searches via Tavily. I've tested many models, and this is also the only one that gets stuck: it simply takes many minutes without responding, as if it is in an internal loop. Using --reasoning off solves the issue, but then it loses some of its wit, i.e. it scores lower on my internal tests than other models, such as the CelesteImperia one, and it no longer passes stevibe's test.

FYI, I am using q8_0.gguf with 262k context.

TeichAI org
β€’
edited 9 days ago

Yea, I noticed this issue as well. I am actively looking into it. Just confirming: the issue is that the model will sometimes not exit its thinking, and its entire response will be parsed into reasoning content, correct?
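For illustration, a minimal sketch of how that failure mode looks to a parser that splits reasoning from visible content; the tag names here are assumptions, not the actual gemma4 format:

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Split a raw completion into (reasoning, content).

    If the model never emits the closing tag, everything after the
    opening tag becomes reasoning and the visible content is empty,
    which is exactly the "stuck in thinking" symptom.
    """
    if open_tag not in text:
        return "", text
    _, _, rest = text.partition(open_tag)
    reasoning, _, content = rest.partition(close_tag)
    return reasoning.strip(), content.strip()

# Healthy completion: tag closed, content survives.
print(split_reasoning("<think>check sources</think>The claim is true."))
# Broken completion: tag never closed, visible content is empty.
print(split_reasoning("<think>check sources... still checking"))
```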

I also ran into issues with this model. What card are you using? Are you using llama.cpp? Which flags do you use to launch the model?

TeichAI org

Could you confirm you are using the latest llama.cpp (released 6 hours ago)? I see the only note for the release is "fix: common/gemma4 : handle parsing edge cases". If you could try that out and let me know, that would be a huge help. I am already looking into potential chat template bugs, among other things. The most likely issue is that the model started to forget to close its tags; nothing some DPO alignment can't fix. Same with the tendency to open triple backticks for bulleted lists in its reasoning.
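A quick way to screen sampled completions for both artifacts (the reasoning-tag names are again an assumption, not the model's confirmed format):

```python
def find_artifacts(text, open_tag="<think>", close_tag="</think>"):
    """Flag completions with an unclosed reasoning tag or an odd
    number of code fences (a dangling triple backtick)."""
    fence = "`" * 3  # triple backtick, built indirectly to keep this snippet paste-safe
    issues = []
    if text.count(open_tag) > text.count(close_tag):
        issues.append("unclosed reasoning tag")
    if text.count(fence) % 2 == 1:
        issues.append("dangling triple backticks")
    return issues
```

Running this over a few hundred sampled generations would give a rough rate for each artifact before and after a fix.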

I am running on an RTX 5090.

my flags:

```shell
llama-server \
    --model $LLM_DIR/TeichAI/gemma-4-26B-A4B-it-Claude-Opus-Distill-v2-GGUF/gemma-4-26B-A4B-it-Claude-Opus-Distill.q8_0.gguf \
    --alias "gemma-4-26B-A4B-Claude-Opus-Distill" \
    --temp 0.2 \
    --top-p 0.95 \
    --top-k 64 \
    --ctx-size 262144 \
    -ngl 99 \
    --port 8001 \
    --host 0.0.0.0 \
    --reasoning off \
    -np 2 -fa on \
    -b 4096 -ub 1024 \
    --cache-type-k q8_0 --cache-type-v q8_0
```
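Separate from the flags above, one client-side mitigation while the model can hang: wrap each request in a hard timeout so a single stuck generation doesn't stall the whole batch. A minimal sketch (the pipeline integration is my assumption, not part of the setup above):

```python
import concurrent.futures
import time

def with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn in a worker thread and give up after timeout_s seconds.

    Lets a fact-checking loop skip an entry whose request hangs in an
    internal reasoning loop instead of blocking the whole batch.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, *args, **kwargs).result(timeout=timeout_s)
    finally:
        pool.shutdown(wait=False)  # don't block on a hung worker thread

# A fast call returns normally; a hung one raises concurrent.futures.TimeoutError.
print(with_timeout(lambda: "ok", timeout_s=1.0))
```

A caveat of this approach: the hung worker thread is abandoned, not killed, so the underlying request should also carry its own network timeout.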

I just tried again with the latest release (b8783). It still didn't fix the issue.

Oh, it just core dumped, but a later run using another model also core dumped, so it's not related to this model.

TeichAI org

Ok will look into a fix more in depth today, thanks


Alright. I will be monitoring this thread in case you need me to test anything.

TeichAI org

Realistically, anyone can test it :P

TeichAI org

In the coming moments I'll be trying some things and uploading them to my personal account to test.

  1. I will test with a stricter template (I just did, but results looked pretty similar; feel free to test if you'd like).
  2. I will replace the tokenizer config with the one from Google (doing this now).
  3. I will roll back 1 epoch to see if this eliminates the artifacts (honestly not super confident this will work, though it should help a bit with the other jank).

Not sure if you have tested the 31B v1/v2; do they have these issues as well? I suspect this is a MoE training issue.
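Step 2 above (swapping in Google's tokenizer config) could be sketched like this; the filename, directory layout, and backup step are illustrative assumptions, not the actual script:

```python
import shutil
from pathlib import Path

def swap_tokenizer_config(finetune_dir, official_dir, name="tokenizer_config.json"):
    """Copy the official tokenizer config over the fine-tune's copy,
    keeping a .bak of the original so the swap is easy to revert."""
    src = Path(official_dir) / name
    dst = Path(finetune_dir) / name
    if dst.exists():
        shutil.copy2(dst, dst.with_name(dst.name + ".bak"))
    shutil.copy2(src, dst)
    return dst
```

Done before GGUF conversion, this keeps the quantized artifacts consistent with the upstream tokenizer.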

TeichAI org
β€’
edited 8 days ago

https://huggingface.co/armand0e/test-2-26B-A4B-template-fixes-GGUF

Deleted the first attempt. Please try out this GGUF when you get a chance; it uses Google's official tokenizer. So far I haven't hit a cut-off issue, though other issues are still present. I've triple-checked all my scripts and can't seem to find the issue.

I've just tested the 31B. The issue is not there: it ran without failing for 1,502s. BUT this is way more than the other variants take to complete the same task; they usually finish in under 400s. This one took longer and used more web-search calls.
Going to test your new version now.

Oh, wow. The new version just slammed through the task; it actually scored better than the 31B and finished in 171s. It didn't get stuck, whereas before it was getting stuck 100% of the time.

TeichAI org

That's great, great, great news. Bad news for me because I need to redo the GGUFs XD, but good news overall.

TeichAI org

If you are running your own tool-calling evaluations, please feel free to share your results compared to other models!

The tool-calling evaluation I am running is against live web searches and some manually verified data, but it is still unstable: even the same model scores differently across runs. Still, your model is the fastest one that scores well. If I manage to make it more stable, I will share the results. The other model that scores well is https://huggingface.co/CelesteImperia/Gemma-4-26B-MoE-GGUF (both MoE and dense).
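One cheap way to make a noisy eval more comparable across models is to report the mean and spread over repeated runs instead of a single score; a minimal sketch:

```python
from statistics import mean, stdev

def summarize_runs(scores):
    """Collapse repeated eval runs into (mean, spread) so run-to-run
    noise is reported next to the headline number."""
    m = mean(scores)
    s = stdev(scores) if len(scores) > 1 else 0.0
    return round(m, 1), round(s, 1)

# e.g. two runs of the same model scoring 72 and 70 points:
print(summarize_runs([72, 70]))
```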

Do you believe your dense model needs updates, too? It seems unexpected that it took 25 minutes to finish a task your MoE version finished in 3 minutes, with comparable scores.

TeichAI org

I will update all the gemma4 models with these template and tokenizer config fixes just to be safe. It definitely won't hurt.

Though as for the speed issue, hardware will definitely play a big role here: the MoE only has ~4B active parameters, compared to the 31B dense model.

I know, but I am running on a 5090, and the dense models from CelesteImperia / unsloth finish in under 7 minutes.

TeichAI org

Sheesh. I'll look into it, but I don't have the slightest clue, tbh.

Here are the scores and timings for running this job on a larger dataset. Lower points are better here. So it is the same score as another dense model, but it took 7x longer to finish. The Qwopus dense took even longer.

TeichAI

MoE Q8 - kv q8 - points: 72 - 1,621s
MoE Q8 - kv q8 - points: 70 - 1,582s
Dense Q6 - kv q8 - points: 63 - 8,3012s

CelesteImperia

MoE Q8 - kv q8 - points: 74 - 1,165s
MoE Q8 - kv q8 - points: 71 - 1,087s
Dense Q6 - kv q8 - points: 62 - 1,169s
MoE Q8 - kv f16 - points: 71 - 2,351s

Qwopus

Dense Q8 - points: 67 - 11,835s
TeichAI org
β€’
edited 8 days ago


The jumps in speed when you move off our models are really noticeable. I wonder if it's due to faulty day-1 unsloth support. Not sure what could be causing this, but impressive numbers nonetheless!

Does that Qwopus comparison indicate that its KV cache wasn't even quantized?
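For rough intuition on why KV-cache precision matters at 262k context: cache size scales linearly with bytes per element, so q8_0 roughly halves it versus f16. A back-of-the-envelope estimator; the architecture numbers below are illustrative assumptions, NOT the real gemma-4 config:

```python
def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Approximate K + V cache: 2 tensors x layers x context x KV heads x head dim."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

# Illustrative architecture only -- not the real model config.
CTX, LAYERS, KV_HEADS, HEAD_DIM = 262144, 48, 8, 128
f16 = kv_cache_bytes(CTX, LAYERS, KV_HEADS, HEAD_DIM, 2)
q8 = kv_cache_bytes(CTX, LAYERS, KV_HEADS, HEAD_DIM, 1)  # ignores q8_0 scale overhead
print(f16 // 2**30, q8 // 2**30)  # GiB at full context under these made-up numbers
```

Under these made-up numbers the f16 cache is exactly twice the q8 one, which is the kind of gap that shows up as both memory and bandwidth at long context.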

TeichAI org
β€’
edited 7 days ago

@khronnuz Sorry for troubling you again, but I just realized I had used the gemma4 chat template from before 6 days ago (without the tool-calling patches they made). If possible, I'd appreciate another test with this model (mainly to make sure the changes didn't influence anything poorly). Again, sorry for the trouble; I've really got to stop touching this stuff late at night 😥

https://huggingface.co/armand0e/test-gemma4-template-updates-GGUF

Sure thing, will test the Q8 version of it right now. Will let you know once it finishes.

TeichAI org

Oh boy, there are more issues: the tokenizer config was outdated, with the wrong regex.

The new version was actually taking way longer than the others. It had not finished yet and was leaking tool calls as text messages, so I aborted it.
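A cheap guard for this leakage symptom is to screen the visible content for a bare tool-call object before accepting it; the "name"/"arguments" field names are an assumption about the tool-call JSON shape, not the model's documented format:

```python
import json

def leaks_tool_call(content):
    """Heuristic check: did a tool call leak into the visible text?

    Treats a bare JSON object with "name" and "arguments" keys as a
    tool call that should have been routed to tool_calls instead.
    """
    stripped = content.strip()
    if not stripped.startswith("{"):
        return False
    try:
        obj = json.loads(stripped)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and "name" in obj and "arguments" in obj
```

Entries flagged this way can be retried or skipped rather than fact-checked against garbage.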

TeichAI org

Wow! Plot twist, fr. Well, thanks for your help. I guess I will keep the old template and tokenizer config.
