Qwen3.5 Tool Calling Template
Having issues integrating Qwen3.5 with crush. Tool Calling is broken out of the box.
Does anyone have a fixed chat template already?
If you're running a mainline quant, you definitely want to be using pwilkin's autoparser branch; instructions and links here: https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF#vibe-coding . I've had the best luck with this so far.
If you are running ik_llama.cpp, there is still some work to be done on caching context for the Qwen35Moe and Qwen3Next Gated Delta Net models. A PR was merged less than an hour ago: https://github.com/ikawrakow/ik_llama.cpp/pull/1295 though I still just hit an error:
terminate called after throwing an instance of 'std::runtime_error'
what(): Unexpected empty grammar stack after accepting piece: =list (40972)
I've been testing with opencode, e.g. put an opencode.json file in the working directory, then run opencode and /connect to the model:
{
  "$schema": "https://opencode.ai/config.json",
  "share": "disabled",
  "autoupdate": false,
  "experimental": {
    "openTelemetry": false
  },
  "tools": {
    "websearch": true
  },
  "disabled_providers": ["exa"],
  "provider": {
    "LMstudio": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "ik_llama.cpp (local)",
      "options": {
        "baseURL": "http://localhost:8080/v1",
        "timeout": 99999999999
      },
      "models": {
        "Kimi-K2.5-Q4_X": {
          "name": "Kimi-K2.5-Q4_X"
        }
      }
    }
  }
}
Oh, we're still at the custom branch stage! I thought Qwen3-Next-80B already paved the way. Thanks for the tip!
The telemetry setting on by default is giving me trust issues with all these tools...
hrm, that would be like ~5.3BPW so a little larger than an IQ4_K but smaller than the IQ5_K...
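(For anyone following along, BPW here is just total file bits divided by parameter count. A quick sketch; the file size below is a made-up placeholder for illustration, not a measurement of any actual quant:)

```python
def bpw(file_size_gib: float, n_params_billions: float) -> float:
    """Rough bits-per-weight estimate: total file bits / total parameters."""
    return file_size_gib * 1024**3 * 8 / (n_params_billions * 1e9)

# e.g. a hypothetical ~245 GiB file for a 397B-parameter model:
print(round(bpw(245.0, 397.0), 2))  # -> 5.3
```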
Yeah, Qwen3-Next only landed in ik a couple days ago due to challenges handling Gated Delta Net; it is a more complex arch.
And Qwen3-Next did pave the way for Qwen35MoE, but that way still has some bumps haha...
Interestingly my Qwen3.5 MoE quant seems to be working now on ik_llama.cpp, but the smaller Qwen3-Coder-Next is what threw that error for me above.
> The telemetry setting on by default is giving me trust issues with all these tools...
RIGHT?!... I only realized that this morning, and found some config to attempt to disable some of it at least.. haha
I learned about it in general from this post: https://www.reddit.com/r/LocalLLaMA/comments/1rar6md/qwen_code_a_powerful_opensource_coding_agent_no/
Seems to be vibe coding okay with the above opencode.json running ik_llama.cpp on PR https://github.com/ikawrakow/ik_llama.cpp/pull/1296 . No mmproj support on ik yet for this model, and I'm not sure the best way to configure prompt cache stuff here yet either.
model=/mnt/raid/models/ubergarm/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-Q3_K-00001-of-00005.gguf
./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/Qwen3.5-397B-A17B \
    --ctx-size 163840 \
    -ger \
    --merge-qkv \
    -sm layer \
    -ngl 999 \
    -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10)\.ffn_(gate|up|down)_exps.*=CUDA0" \
    -ot "blk\.(49|50|51|52|53|54|55|56|57|58|59)\.ffn_(gate|up|down)_exps.*=CUDA1" \
    -ub 4096 -b 4096 \
    --cpu-moe \
    --threads 24 \
    --host 127.0.0.1 \
    --port 8080 \
    --parallel 1 \
    --no-mmap \
    --jinja \
    --cache-ram 65536 \
    --prompt-cache-all
Though I did hit a "CUDA error: out of memory" after running for a few minutes, so I backed off the extra layer offload a little.
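Those -ot overrides are just regexes over layer indices, so backing off means trimming layers from each range. A tiny helper (purely illustrative, not part of llama.cpp) to regenerate them:

```python
def ot_rule(layers, device: str) -> str:
    """Build a llama.cpp -ot tensor-override regex that pins the FFN
    expert tensors of the given layers to one device."""
    alt = "|".join(str(i) for i in layers)
    return rf"blk\.({alt})\.ffn_(gate|up|down)_exps.*={device}"

print(ot_rule(range(0, 11), "CUDA0"))  # matches the first rule above
print(ot_rule(range(0, 10), "CUDA0"))  # one fewer layer after the OOM
```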
I applied this diff based on https://github.com/ggml-org/llama.cpp/pull/19635, and it seems to work fine for Qwen3.5:
diff --git a/common/chat.cpp b/common/chat.cpp
index 34a48ea5..2ff52259 100644
--- a/common/chat.cpp
+++ b/common/chat.cpp
@@ -1225,6 +1225,18 @@ static common_chat_params common_chat_params_init_qwen3_coder_xml(const common_c
     data.prompt = apply(tmpl, params);
     data.format = COMMON_CHAT_FORMAT_QWEN3_CODER_XML;
 
+    // Qwen3.5, Nemotron Nano 3 and Step-3.5-Flash use the Qwen3 Coder tool calling with thinking
+    bool supports_reasoning = (tmpl.source().find("<think>") != std::string::npos);
+
+    // Handle thinking tags appropriately based on inputs.enable_thinking
+    if (supports_reasoning && string_ends_with(data.prompt, "<think>\n")) {
+        if (!params.enable_thinking) {
+            data.prompt += "</think>";
+        } else {
+            data.thinking_forced_open = true;
+        }
+    }
+
     data.preserved_tokens = {
         "<tool_call>",
         "</tool_call>",
@@ -1234,16 +1246,20 @@ static common_chat_params common_chat_params_init_qwen3_coder_xml(const common_c
         "</parameter>",
     };
 
+    if (supports_reasoning) {
+        data.preserved_tokens.insert(data.preserved_tokens.end(), {"<think>", "</think>"});
+    }
+
     // build grammar for tool call
     static const xml_tool_call_format form {
-        /* form.scope_start = */ "<tool_call>\n",
-        /* form.tool_start  = */ "<function=",
+        /* form.scope_start = */ "",
+        /* form.tool_start  = */ "\n<tool_call>\n<function=",
         /* form.tool_sep    = */ ">\n",
         /* form.key_start   = */ "<parameter=",
         /* form.key_val_sep = */ ">\n",
-        /* form.val_end     = */ "\n</parameter>\n",
-        /* form.tool_end    = */ "</function>\n",
-        /* form.scope_end   = */ "</tool_call>",
+        /* form.val_end     = */ "\n</parameter>",
+        /* form.tool_end    = */ "\n</function>\n</tool_call>",
+        /* form.scope_end   = */ "",
     };
 
     build_grammar_xml_tool_call(data, params.tools, form);
@@ -2064,9 +2080,7 @@ static common_chat_params common_chat_templates_apply_jinja(
     // Detect via explicit XML markers unique to Qwen3-Coder to avoid false positives in other templates.
     // Require presence of <tool_call>, <function=...>, and <parameter=...> blocks.
     if (src.find("<tool_call>") != std::string::npos &&
-        src.find("<function>") != std::string::npos &&
         src.find("<function=") != std::string::npos &&
-        src.find("<parameters>") != std::string::npos &&
         src.find("<parameter=") != std::string::npos) {
         return common_chat_params_init_qwen3_coder_xml(tmpl, params);
     }
The xml_tool_call_format has to be modified to allow the model to make multiple function calls in a row.
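To see why, here is roughly what two back-to-back calls serialize to once the <tool_call> wrapper lives in tool_start/tool_end instead of scope_start/scope_end (a Python mock-up of the grammar's output shape, not actual llama.cpp code):

```python
def render_call(name: str, params: dict) -> str:
    """Mimic one tool call under the patched xml_tool_call_format:
    tool_start + name + tool_sep, then key/value fields, then tool_end."""
    out = "\n<tool_call>\n<function=" + name + ">\n"
    for key, val in params.items():
        out += "<parameter=" + key + ">\n" + val + "\n</parameter>"
    out += "\n</function>\n</tool_call>"
    return out

# With empty scope markers, calls simply concatenate, so the model can
# emit several complete <tool_call> blocks in a row:
two = render_call("read", {"path": "a.txt"}) + render_call("read", {"path": "b.txt"})
print(two.count("<tool_call>"))  # -> 2
```

With the original format, scope_start/scope_end wrapped the whole sequence in a single <tool_call> pair, so a second <function=...> block could not open its own wrapper.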
The chat template itself is fine. We can optionally modify it to support both "interleaved thinking" and "preserved thinking", by adding a clear_thinking flag similar to how it is done in GLM-4.7 / GLM-5.