Thinking process (v1.0)

#1
by szwedek - opened

I use qwen3.6, but v1.0 doesn't seem to work with the thinking process in Hermes and Opencode.
Is this a known issue?

szwedek changed discussion title from Thinking process to Thinking process (v1.0)

Which behaviour did you experienced exactly?
So far I tested Cline, OpenHands and nanobot on v1.0, Toolcalls were fine and no parsing problems.
Coded project run fine.

I did noticed a a bit too long thinking processes from time to time, I need to look into that detail more. Need to test if the thinking_budget is correctly taken into account.

Or do you mean the thinking switch/toggle?

Actually I don't see a thinking process. The same issue as here - https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates/discussions/25#6a0cdd59948e2fe7a091a764

the same issue

image

Im almost sure thats not the templates fault.
I checked your other thread aswell, there you had a screenshot showing problems with OpenCode, while it works fine for me with my template ein OpenCode.

I bet its your parameters, if not than app settings.

Which parameters do you use to start llama.cpp / ik_llama?
Edit: sry didnt sleeped yet ^^
You posted your config their already.

I see you use speculative decoding.
It might be the issue but there are also other parameters I didnt used yet.

Could you start with a minimal version only containing this and tell me if it works?

--model /models/Qwen3.6-27B-Q4_K_M-uc-mtp-v2.gguf
--alias "Qwen3.6 27B"
--temp 0.6
--top-p 0.95
--min-p 0.00
--top-k 20
--port 8080
--host 0.0.0.0
--fit off
--ctx-size 200000
--presence-penalty 0.0
--repeat-penalty 1.0
--jinja
--chat-template-file /models/TEMPLATE_NAME.jinja
--reasoning-budget 8192
--split-mode tensor

Screenshot_20260521_093752
Screenshot_20260521_093949

I use the built-in webui for llama. This template skips the reasoning process. Froggeric's template works ok. I used same llama params for both tests.
My params:
-m Qwen3.6-35B-A3B-UD-Q5_K_M.gguf-c 128000 -fa 1 --fit 1 -b 2048 -ub 512 --no-mmap -ctk q8_0 -ctv q8_0 --jinja --chat-template-kwargs "{\"preserve_thinking\":true}" --no-mmproj -np 1 --reasoning-budget 4096 --metrics --cache-ram 0 --chat-template-file template.jinja --presence-penalty 0.0 --repeat-penalty 1.0

Thanks for the heads up. You were absolutely right. It was the template.
It was a simple fix actually, but easy to overlook. Version 1.1 is now released and also includes some minor enhancements, such as an updated strategy for handling three consecutive tool call failures.

Let me know what you think

v1.1 fixed the issue. I noticed another one. When I've tried to compare two jinja templates then opencode didn't recognize </think> tag.
For other prompts, thinking block closes correctly.
Screenshot_20260521_145623

The issue with llama.cpp (and maybe some few other tools im not aware of) not triggering thinking was easy to solve in v1.1.
Talking about OpenCodes thinking tag output its another case. Thinking tags and the process within displayed as normal text is a known bug in current OpenCode versions. The problem here is that the model (Qwen in this case) needs to output the thinking in some way. Usually tools will strip this, so for example now with v1.1 you see the thinking block correctly containing the thinking process when you use llama.cpp server for example.

But OpenCode doesnt do any of thinking tag stripping nor has it an option to enable it. The only solution I know of right now would be to switch to another Tool (Cline for example) or you disable thinking, but thats not the best output quality then you could get. I might could look into this deeper but considering its an open bug on OpenCode since around 4 months I dont think there is an easy fix. MiniMax 2.5 for example gets displayed the same way in OpenCode. You can see the Bugreport here: https://github.com/anomalyco/opencode/issues/11439

There might be an way to supress thinking output completly but still let the model think, however, you would need to pass that as chat kwarg with llama.cpp startup parameters then I think...
So it would be the situation: either output thinking text or dont. No choose per Tool, only per Server you run currently. So llama.cpp server for example wouldnt get a notice about any thinking but in the background it would still think. In OpenCode the output would look clean then though...

It might however be possible to build a filter, so you could enter some specific construct like into systemprompt of opencode and the chat template would map this then to the models output behaviour...

However, this would need some additional time to research and was actually not exactly on my current schedule...

I just want to confirm that v1.1 works with Hermes without any issues.

edit. sorry, it doesn't work. I have a tool loop:
May 21 15:51:07 ai python[732479]: [Tool loop warning: repeated_exact_failure_warning; count=79; cronjob has failed 79 times with identical arguments. This looks like a lo
May 21 15:51:11 ai python[732479]: WARNING agent.tool_executor: Tool cronjob returned error (0.00s): {"error": "schedule is required for create", "success": false}
May 21 15:51:11 ai python[732479]: [Tool loop warning: repeated_exact_failure_warning; count=80; cronjob has failed 80 times with identical arguments. This looks like a lo
May 21 15:51:15 ai python[732479]: WARNING agent.tool_executor: Tool cronjob returned error (0.00s): {"error": "schedule is required for create", "success": false}
May 21 15:51:15 ai python[732479]: [Tool loop warning: repeated_exact_failure_warning; count=81; cronjob has failed 81 times with identical arguments. This looks like a lo
May 21 15:51:19 ai python[732479]: WARNING agent.tool_executor: Tool cronjob returned error (0.00s): {"error": "schedule is required for create", "success": false}
May 21 15:51:19 ai python[732479]: [Tool loop warning: repeated_exact_failure_warning; count=82; cronjob has failed 82 times with identical arguments. This looks like a lo
May 21 15:51:23 ai python[732479]: WARNING agent.tool_executor: Tool cronjob returned error (0.00s): {"error": "schedule is required for create", "success": false}
May 21 15:51:23 ai python[732479]: [Tool loop warning: repeated_exact_failure_warning; count=83; cronjob has failed 83 times with identical arguments. This looks like a lo
May 21 15:51:27 ai python[732479]: WARNING agent.tool_executor: Tool cronjob returned error (0.00s): {"error": "schedule is required for create", "success": false}
May 21 15:51:27 ai python[732479]: [Tool loop warning: repeated_exact_failure_warning; count=84; cronjob has failed 84 times with identical arguments. This looks like a lo
May 21 15:51:32 ai python[732479]: WARNING agent.tool_executor: Tool cronjob returned error (0.00s): {"error": "schedule is required for create", "success": false}
May 21 15:51:32 ai python[732479]: [Tool loop warning: repeated_exact_failure_warning; count=85; cronjob has failed 85 times with identical arguments. This looks like a lo
May 21 15:51:36 ai python[732479]: WARNING agent.tool_executor: Tool cronjob returned error (0.00s): {"error": "schedule is required for create", "success": false}
May 21 15:51:36 ai python[732479]: [Tool loop warning: repeated_exact_failure_warning; count=86; cronjob has failed 86 times with identical arguments. This looks like a lo
May 21 15:51:40 ai python[732479]: WARNING agent.tool_executor: Tool cronjob returned error (0.00s): {"error": "schedule is required for create", "success": false}

Tools in Hermes work with foggeric's template.

I just want to confirm that v1.1 works with Hermes without any issues.

Excellent :)
Thats great to hear.

If you encounter some other bug like an loop or something just send me a message here. v1.1 should handle tool call loops nicely (lended this one from froggerics and other community templates).

I did a bit research for the OpenCode problem and the thing is, chat template is a oneway street. So you send a request, it gets passed to the llama server, which restructures the input you just sent (from Hermes for example) accordingly to the template and passes it then to the llm. but from the LLM output it usually gets directly back to the origin. There are two possible ways to reach that goal anyway: either filter it through a middlemen (for example litellm or openrouter) or with an llama.cpp server flag so the server itself strips it before sending the output back.

Heureka I found it :D

If you start your llama server with
--reasoning-format deepseek
flag, it will strip any thinking blocks from the models output but still think. Tested it, works with opencode.

The only downside is that you dont see thinking block content anymore in other tools then as long as you use that flag.
But it works

Edit: I will look into the looping issue.

Edit2: The root cause is clear, if any tool call errors errors raised they need all be catched and fed back into the llm, injecting "try another thing".
This will be fixed, no big problem. Just need a bit prepare time to make sure its also organized and not just pushed in somewhere ^^
There will be a fix, latest tomorrow

v1.1.2 is now out fixing every tool calling issues by implementing a new logic.
Please try it out and tell me if any errors still happen.

Atleast for me, it worked correctly. I tried intentionally multiple times wrong tool calls, it was directly catched and the model continued.
Also upgraded my tools and skills aswell created a expandable python testsuite currently containing 65 tests.
Now also that sometimes happened "stale" behaviour I noticed sometimes in hours long agentic coding runs should be gone.

Let me know what you think

Unfortunatelly, Hermes is still looping with tools. Same error with cronjob.

1.2 much better, but some times it uses tool call inside thinking process

image

@gena-br
Added minimal but strict systemprompt addition to stop this from happening.
It should now dont happen again or atleast much much less.
If its still happening though, even from time to time, tell me, im looking into it again and refine it.

@szwedek
Every text passed over tools to LLM is now correctly capsulated into JSON (which is capsulated into modelspecific XML tags, just how Qwen was trained), it should work now. Was a oneliner but important.

Please try out v1.1.5

The loop still occurs in Hermes using V1.1.5

@szwedek
Which version of Hermes do you use?

Does this also happen with other templates (like froggeric v16)?

Which LLM do you use as base? Stock Qwen3.5/Qwen3.6 or modified (finetuned) HermesAgent Qwen3 Version?

@szwedek
Please post your answer here:
https://huggingface.co/StableQuant/Qwen-Templates-Rebuild-Project/discussions/4

I will install this tool on my own VM then and test it out manually. Excpect some time needed for manual testing (around 2-3 days).
I will update you there once it has been resolved and manually verified by me to be working.

@gena-br
Im leaving this Discussion open for around 48h, if you still see this behaviour happening we can keep discussion it here, else I will close this discussion then to organize discussions a bit more theme-specific.
If needed you can open then anytime a new discussion for the toolcall inside thinking behaviour

Sign up or log in to comment