Still some tool calling issues
Hi!
I'm afraid we still have tool calling issues. It happens quite often, I'd say roughly 15% of the time, especially in the first turns of multi-turn agentic coding sessions.
Here is an example of a failed tool call, where llama.cpp obviously didn't catch the call:
2026/03/02 15:01:57
─── ASSEMBLED RESPONSE id=chatcmpl-BlufyKJKgbDOnIjwFhRJWxgts1Bdipgk model=Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf finish_reason=stop
 └── [reasoning_content]
Now let me check the launch.json file:
Command line parameters for llama-server:
llama-server --model /Users/vox/Developer/Models/models/Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf
--port $PORT
--host 127.0.0.1
--n-gpu-layers -1
--threads 16
--parallel 1
--ctx-size 131072
--jinja
--temp 0.6
--top-p 0.95
--min-p 0.00
--top-k 20
--repeat-penalty 1.0
--presence-penalty 0.0
-b 2048
-ub 512
-fa 1
It appears that the model is emitting the call before closing the reasoning content with </think>.
I'll try lowering the temperature to something like 0.3 to see if it mitigates the issue. Disabling thinking (/no_think or chat kwargs) obviously solves the issue.
I don't know if that's related, but sometimes my tool calls come as part of the reasoning_content, like:
[Thinking] The grep command failed. Let me try a different pattern to find link elements. The role might be formatted differently.
<tool_call>
<function=shell_command>
<parameter=opts>
{"command": "cat /tmp/snapshot.json"}
</parameter>
</function>
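For what it's worth, a client-side workaround could scan reasoning_content for stray <tool_call> blocks like the one above and convert them into OpenAI-style tool calls. A minimal sketch: the XML tag format is copied from the quoted output, but the regexes and function names are my own assumptions, not anything llama.cpp or the client ships:

```python
import json
import re

# Matches the Qwen-style XML tool-call block quoted above, tolerating
# a missing </tool_call> (the model often stops before closing it):
# <tool_call><function=NAME><parameter=NAME>VALUE</parameter>...</function>
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([\w.-]+)>(.*?)(?:</function>\s*(?:</tool_call>)?|\Z)",
    re.DOTALL,
)
PARAM_RE = re.compile(r"<parameter=([\w.-]+)>\s*(.*?)\s*</parameter>", re.DOTALL)

def extract_stray_tool_calls(reasoning: str):
    """Pull tool calls that leaked into reasoning_content.

    Returns (cleaned_reasoning, calls) where calls mimics the OpenAI
    tool_calls structure. Parameter values that parse as JSON are kept
    as JSON; everything else is passed through as a plain string.
    """
    calls = []

    def _capture(m):
        args = {}
        for name, raw in PARAM_RE.findall(m.group(2)):
            try:
                args[name] = json.loads(raw)
            except json.JSONDecodeError:
                args[name] = raw
        calls.append({
            "type": "function",
            "function": {"name": m.group(1), "arguments": json.dumps(args)},
        })
        return ""  # strip the block from the reasoning text

    cleaned = TOOL_CALL_RE.sub(_capture, reasoning)
    return cleaned.strip(), calls
```

On the example above this would recover a shell_command call with the JSON under "opts" intact, and leave only the actual thinking text in reasoning_content.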
Qwen3.5-35B-A3B-UD-Q8_K_XL
llama.cpp b8187
my llama.cpp settings are all default
Completely related. The model emits the tool call without properly closing the reasoning section, so this call is not handled by llama.cpp's Jinja framework and is therefore not converted into a proper JSON tool_calls section in the deltas of the OpenAI-compatible API answer. The actual behavior of the request initiator (probably the agent calling the OAI endpoint in the first place) may vary, but it often ends up displaying the tool-call syntax generated by the model as literal reasoning content (which it is, technically, just not in the correct part of the answer).
yea these models are completely useless until this is solved. the default ones work fine, use those.
the weird part is that when I use it in opencode it seems to handle it just fine, as if that tool call misplacement never happens. I've searched the opencode codebase for signs of handling tool_call inside the reasoning content and it doesn't seem to do that. So I don't get what it does differently 🤔
I've noticed the exact same behavior with Opencode: it managed to recover from the model's failure. VS Code, on the contrary, is on a 15-20% « hit 'Try again' » trend.
As it is very unlikely that Opencode actually implements a safety net for the very exotic way the Qwen model family emits tool calls (sometimes with JSON parameters wrapped inside XML tags), I'd assume that when it receives the equivalent of an empty response (only "reasoning_content" deltas, without an actual "content" or "tool_calls" section) it silently does the equivalent of VS Code's Try again option and sends a "Carry on"/"Go ahead"/whatever message to the model, which hopefully emits the correct tool call at that point.
Same issue, I had to disable reasoning so I can use it for coding
I'm in the process of collecting a DPO dataset to fine-tune the model, but this takes time, as at least 150 to 200 samples are needed for the tuning to be effective.
In the meantime you could inject instructions into your prompt/system prompt telling the model to properly close its reasoning before emitting any tool call. But these kinds of instructions tend to lose adherence as the attention window slides and they are pushed back in the KV cache during multi-turn agentic sessions.
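For example, something along these lines prepended to every request; the reminder wording is just a suggestion, and the </think> tag is assumed from the Qwen chat template:

```python
# Assumed wording; the </think> tag is Qwen's reasoning close tag.
REMINDER = (
    "Before emitting any <tool_call>, always close your reasoning with "
    "</think> first. Never place tool calls inside the thinking section."
)

def with_reminder(messages: list[dict]) -> list[dict]:
    """Prepend a system message carrying the reminder, or extend an
    existing one, without mutating the caller's list."""
    if messages and messages[0].get("role") == "system":
        head = dict(messages[0])
        head["content"] = head["content"] + "\n\n" + REMINDER
        return [head] + messages[1:]
    return [{"role": "system", "content": REMINDER}] + messages
```

Re-applying this on every turn (rather than only once at session start) may also help against the adherence decay mentioned above, since the reminder stays near the front of the context.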
You may also try to lower temperature a bit.
Hoping @unsloth will take a look at this, but I guess they're working on more pressing matters.