Qwen3.5 Tool Calling Template

#5
by anikifoss - opened

Having issues integrating Qwen3.5 with Crush: tool calling is broken out of the box.

Does anyone have a fixed chat template already?

@anikifoss

If you're running a mainline quant, you'll definitely want to use pwilkin's autoparser branch; instructions and links here: https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF#vibe-coding . I've had the best luck with this so far.

If you are running ik_llama.cpp, there is still some work to be done on caching context for the Qwen35Moe and Qwen3Next Gated Delta Net models. A PR was merged less than an hour ago: https://github.com/ikawrakow/ik_llama.cpp/pull/1295 , though I still just got an error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Unexpected empty grammar stack after accepting piece: =list (40972)

I've been testing with opencode e.g. put an opencode.json file in the directory then run opencode and /connect to the model:

{
    "$schema": "https://opencode.ai/config.json",
    "share": "disabled",
    "autoupdate": false,
    "experimental": {
        "openTelemetry": false
    },
    "tools": {
        "websearch": true
    },
    "disabled_providers": ["exa"],
    "provider": {
        "LMstudio": {
            "npm": "@ai-sdk/openai-compatible",
            "name": "ik_llama.cpp (local)",
            "options": {
                "baseURL": "http://localhost:8080/v1",
                "timeout": 99999999999
            },
            "models": {
                "Kimi-K2.5-Q4_X": {
                    "name": "Kimi-K2.5-Q4_X"
                }
            }
        }
    }
}

Oh, we're still at the custom branch stage! I thought Qwen3-Next-80B already paved the way. Thanks for the tip!

The telemetry setting on by default is giving me trust issues with all these tools...

hrm, that would be ~5.3 BPW, so a little larger than an IQ4_K but smaller than the IQ5_K...

Yeah, Qwen3-Next only landed in ik a couple days ago due to challenges with the Gated Delta Net handling; it is a more complex arch.

And Qwen3-Next did pave the way for Qwen35MoE, but that way still has some bumps haha...

Interestingly my Qwen3.5 MoE quant seems to be working now on ik_llama.cpp, but the smaller Qwen3-Coder-Next is what threw that error for me above.

> The telemetry setting on by default is giving me trust issues with all these tools...

RIGHT?!... I only realized that this morning, and found some config to attempt to disable at least some of it... haha

I learned about it in general from this post: https://www.reddit.com/r/LocalLLaMA/comments/1rar6md/qwen_code_a_powerful_opensource_coding_agent_no/

Seems to be vibe coding okay with the above opencode.json running ik_llama.cpp on PR https://github.com/ikawrakow/ik_llama.cpp/pull/1296 . No mmproj support on ik yet for this model, and I'm not sure the best way to configure prompt cache stuff here yet either.

model=/mnt/raid/models/ubergarm/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-Q3_K-00001-of-00005.gguf

./build/bin/llama-server \
    --model "$model"\
    --alias ubergarm/Qwen3.5-397B-A17B \
    --ctx-size 163840 \
    -ger \
    --merge-qkv \
    -sm layer \
    -ngl 999 \
    -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10)\.ffn_(gate|up|down)_exps.*=CUDA0" \
    -ot "blk\.(49|50|51|52|53|54|55|56|57|58|59)\.ffn_(gate|up|down)_exps.*=CUDA1" \
    -ub 4096 -b 4096 \
    --cpu-moe \
    --threads 24 \
    --host 127.0.0.1 \
    --port 8080 \
    --parallel 1 \
    --no-mmap \
    --jinja \
    --cache-ram 65536 \
    --prompt-cache-all

Though I did hit a CUDA error: out of memory after running a few minutes, so I backed off the extra layer offload a little.

I applied this diff based on https://github.com/ggml-org/llama.cpp/pull/19635, and it seems to work fine for Qwen3.5:

diff --git a/common/chat.cpp b/common/chat.cpp
index 34a48ea5..2ff52259 100644
--- a/common/chat.cpp
+++ b/common/chat.cpp
@@ -1225,6 +1225,18 @@ static common_chat_params common_chat_params_init_qwen3_coder_xml(const common_c
     data.prompt = apply(tmpl, params);
     data.format = COMMON_CHAT_FORMAT_QWEN3_CODER_XML;
 
+    // Qwen3.5, Nemotron Nano 3 and Step-3.5-Flash use the Qwen3 Coder tool calling with thinking
+    bool supports_reasoning = (tmpl.source().find("<think>") != std::string::npos);
+
+    // Handle thinking tags appropriately based on inputs.enable_thinking
+    if (supports_reasoning && string_ends_with(data.prompt, "<think>\n")) {
+        if (!params.enable_thinking) {
+            data.prompt += "</think>";
+        } else {
+            data.thinking_forced_open = true;
+        }
+    }
+
     data.preserved_tokens = {
         "<tool_call>",
         "</tool_call>",
@@ -1234,16 +1246,20 @@ static common_chat_params common_chat_params_init_qwen3_coder_xml(const common_c
         "</parameter>",
     };
 
+    if (supports_reasoning) {
+        data.preserved_tokens.insert(data.preserved_tokens.end(), {"<think>", "</think>"});
+    }
+
     // build grammar for tool call
     static const xml_tool_call_format form {
-        /* form.scope_start = */ "<tool_call>\n",
-        /* form.tool_start  = */ "<function=",
+        /* form.scope_start = */ "",
+        /* form.tool_start  = */ "\n<tool_call>\n<function=",
         /* form.tool_sep    = */ ">\n",
         /* form.key_start   = */ "<parameter=",
         /* form.key_val_sep = */ ">\n",
-        /* form.val_end     = */ "\n</parameter>\n",
-        /* form.tool_end    = */ "</function>\n",
-        /* form.scope_end   = */ "</tool_call>",
+        /* form.val_end     = */ "\n</parameter>",
+        /* form.tool_end    = */ "\n</function>\n</tool_call>",
+        /* form.scope_end   = */ "",
     };
     build_grammar_xml_tool_call(data, params.tools, form);
 
@@ -2064,9 +2080,7 @@ static common_chat_params common_chat_templates_apply_jinja(
     // Detect via explicit XML markers unique to Qwen3-Coder to avoid false positives in other templates.
     // Require presence of <tool_call>, <function=...>, and <parameter=...> blocks.
     if (src.find("<tool_call>") != std::string::npos &&
-        src.find("<function>") != std::string::npos &&
         src.find("<function=") != std::string::npos &&
-        src.find("<parameters>") != std::string::npos &&
         src.find("<parameter=") != std::string::npos) {
         return common_chat_params_init_qwen3_coder_xml(tmpl, params);
     }

The xml_tool_call_format has to be modified to allow the model to make multiple function calls in a row.

The chat template itself is fine. We can optionally modify it to support both "interleaved thinking" and "preserved thinking" by adding a clear_thinking flag, similar to how it is done in GLM-4.7 / GLM-5.
