Tool calling

#1
by dehnhaide - opened

Sorry if I'm asking a stupid question, but I haven't checked on exl3 development progress lately: does tool calling work in known TUIs like Claude Code or Opencode?

And thanks for the quants! πŸ™

I plan to test EXL3 with this PR https://github.com/theroyallab/tabbyAPI/pull/413

I've seen your and remichu_sm's stance on the tool-calling transition on the Exllama Discord! Interesting! I'll try the PR and report back...

I can confirm tool calls work with #413 and Qwen3.5

Using `git clone https://github.com/devnen/tabbyAPI.git -b full-tool-calling-support`, but still no luck (in Opencode) with "Qwen3.5-397B-A17B-exl3". Any idea what I'm doing wrong?
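One way to narrow down whether Opencode or the server is at fault is to send a tool-enabled request directly to tabbyAPI's OpenAI-compatible endpoint and inspect the raw response. A minimal sketch of such a request body (the model name comes from this thread; the weather tool and its schema are made up for illustration):

```python
import json

# Minimal tools payload for POSTing to /v1/chat/completions, e.g. with
# curl or the openai client. The tool definition here is a hypothetical
# example, not anything from tabbyAPI or Opencode.
payload = {
    "model": "Qwen3.5-397B-A17B-exl3",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",
    "stream": False,
}
print(json.dumps(payload, indent=2))
```

If the non-streaming response contains a populated `tool_calls` array, the server side is working and the problem is in the client integration.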

Screenshot from 2026-03-22 11-22-32

This is my Docker build

# Use an official CUDA runtime with Ubuntu as a parent image
FROM nvidia/cuda:12.8.1-runtime-ubuntu24.04

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    curl \
    ca-certificates \
    python3.12 \
    python3-pip \
    python3.12-venv \
    python3.12-dev \
    git \
    && rm -rf /var/lib/apt/lists/*

# Create a virtual environment
RUN python3 -m venv /opt/venv

# Activate the venv and set the PATH
ENV PATH="/opt/venv/bin:$PATH"

# Upgrade pip
RUN pip install --no-cache-dir --upgrade pip

# Set the working directory in the container
WORKDIR /app

# Clone the tabbyAPI repository. Commit 0d1a8ba (the fix for proper reasoning support) conflicts with PR #413, so a manual patch is applied below
RUN git clone https://github.com/theroyallab/tabbyAPI.git /app
# RUN git checkout -b app 803ca5c

# Configure git user (required for merge)
RUN git config --global user.email "docker@tabbyapi.local" && \
    git config --global user.name "Docker Build"

# Fetch and merge PR #413 - Tool-calling
# conflict resolution '--strategy-option theirs'
# RUN git fetch origin pull/413/head:pr-413 && \
#     git merge --strategy-option theirs pr-413

COPY reasoning_tool_call_pr413.patch reasoning_tool_call_pr413.patch
RUN git apply reasoning_tool_call_pr413.patch

# Install packages specified in pyproject.toml cu12, extras
# RUN pip install --no-cache-dir .[cu12,extras]
RUN pip install --no-cache-dir .[cu12]

# Triton needs `apt-get install python3.12-dev` for <Python.h>
RUN pip install triton flash-linear-attention

# causal-conv1d cannot be built from source here; the build fails in
# PyTorch's cpp_extension.py with a 404 error,
# similar to https://github.com/Dao-AILab/causal-conv1d/issues/4
RUN pip install https://github.com/Dao-AILab/causal-conv1d/releases/download/v1.6.1.post4/causal_conv1d-1.6.1+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl

# Make port 5000 available to the world outside this container
EXPOSE 5000

# Set the entry point
ENTRYPOINT ["python3"]

# Run main.py when the container launches
CMD ["main.py"]
chat_template: Qwen3.5.jinja
{# TabbyAPI Metadata #}
{%- set tool_call_format = "xml" -%}
{%- set tool_start = "<tool_call>" -%}
{%- set tool_end = "</tool_call>" -%}
{%- set stop_strings = ["<|im_start|>", "<|im_end|>"] -%}

{%- set image_count = namespace(value=0) %}
{%- set video_count = namespace(value=0) %}
{%- macro render_content(content, do_vision_count, is_system_content=false) %}
    {%- if content is string %}
        {{- content }}
    {%- elif content is iterable and content is not mapping %}
        {%- for item in content %}
            {%- if 'image' in item or 'image_url' in item or item.type == 'image' %}
                {%- if is_system_content %}
                    {{- raise_exception('System message cannot contain images.') }}
                {%- endif %}
                {%- if do_vision_count %}
                    {%- set image_count.value = image_count.value + 1 %}
                {%- endif %}
                {%- if add_vision_id %}
                    {{- 'Picture ' ~ image_count.value ~ ': ' }}
                {%- endif %}
                {{- '<|vision_start|><|image_pad|><|vision_end|>' }}
            {%- elif 'video' in item or item.type == 'video' %}
                {%- if is_system_content %}
                    {{- raise_exception('System message cannot contain videos.') }}
                {%- endif %}
                {%- if do_vision_count %}
                    {%- set video_count.value = video_count.value + 1 %}
                {%- endif %}
                {%- if add_vision_id %}
                    {{- 'Video ' ~ video_count.value ~ ': ' }}
                {%- endif %}
                {{- '<|vision_start|><|video_pad|><|vision_end|>' }}
            {%- elif 'text' in item %}
                {{- item.text }}
            {%- else %}
                {{- raise_exception('Unexpected item type in content.') }}
            {%- endif %}
        {%- endfor %}
    {%- elif content is none or content is undefined %}
        {{- '' }}
    {%- else %}
        {{- raise_exception('Unexpected content type.') }}
    {%- endif %}
{%- endmacro %}
{%- if not messages %}
    {{- raise_exception('No messages provided.') }}
{%- endif %}
{%- if tools and tools is iterable and tools is not mapping %}
    {{- '<|im_start|>system\n' }}
    {{- "# Tools\n\nYou have access to the following functions:\n\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>" }}
    {{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }}
    {%- if messages[0].role == 'system' %}
        {%- set content = render_content(messages[0].content, false, true)|trim %}
        {%- if content %}
            {{- '\n\n' + content }}
        {%- endif %}
    {%- endif %}
    {{- '<|im_end|>\n' }}
{%- else %}
    {%- if messages[0].role == 'system' %}
        {%- set content = render_content(messages[0].content, false, true)|trim %}
        {{- '<|im_start|>system\n' + content + '<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
    {%- set index = (messages|length - 1) - loop.index0 %}
    {%- if ns.multi_step_tool and message.role == "user" %}
        {%- set content = render_content(message.content, false)|trim %}
        {%- if not(content.startswith('<tool_response>') and content.endswith('</tool_response>')) %}
            {%- set ns.multi_step_tool = false %}
            {%- set ns.last_query_index = index %}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if ns.multi_step_tool %}
    {{- raise_exception('No user query found in messages.') }}
{%- endif %}
{%- for message in messages %}
    {%- set content = render_content(message.content, true)|trim %}
    {%- if message.role == "system" %}
        {%- if not loop.first %}
            {{- raise_exception('System message must be at the beginning.') }}
        {%- endif %}
    {%- elif message.role == "user" %}
        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {%- set reasoning_content = '' %}
        {%- if message.reasoning_content is string %}
            {%- set reasoning_content = message.reasoning_content %}
        {%- else %}
            {%- if '</think>' in content %}
                {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
                {%- set content = content.split('</think>')[-1].lstrip('\n') %}
            {%- endif %}
        {%- endif %}
        {%- set reasoning_content = reasoning_content|trim %}
        {%- if loop.index0 > ns.last_query_index %}
            {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content + '\n</think>\n\n' + content }}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- endif %}
        {%- if message.tool_calls and message.tool_calls is iterable and message.tool_calls is not mapping %}
            {%- for tool_call in message.tool_calls %}
                {%- if tool_call.function is defined %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {%- if loop.first %}
                    {%- if content|trim %}
                        {{- '\n\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
                    {%- else %}
                        {{- '<tool_call>\n<function=' + tool_call.name + '>\n' }}
                    {%- endif %}
                {%- else %}
                    {{- '\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
                {%- endif %}
                {%- if tool_call.arguments is defined %}
                    {%- for args_name, args_value in tool_call.arguments|items %}
                        {{- '<parameter=' + args_name + '>\n' }}
                        {%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
                        {{- args_value }}
                        {{- '\n</parameter>\n' }}
                    {%- endfor %}
                {%- endif %}
                {{- '</function>\n</tool_call>' }}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if loop.previtem and loop.previtem.role != "tool" %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- content }}
        {{- '\n</tool_response>' }}
        {%- if not loop.last and loop.nextitem.role != "tool" %}
            {{- '<|im_end|>\n' }}
        {%- elif loop.last %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- else %}
        {{- raise_exception('Unexpected message role.') }}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- if enable_thinking is defined and enable_thinking is false %}
        {{- '<think>\n\n</think>\n\n' }}
    {%- else %}
        {{- '<think>\n' }}
    {%- endif %}
{%- endif %}
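For reference, the XML tool-call format this template instructs the model to emit can be recovered with a small regex-based parser. This is a simplified sketch for inspecting raw model output by hand, not tabbyAPI's actual parsing code:

```python
import re

# Matches the <tool_call>/<function=...> wrapper defined in the template above
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# Matches each <parameter=name>value</parameter> block inside a function
PARAM_RE = re.compile(r"<parameter=([^>]+)>\n?(.*?)\n?</parameter>", re.DOTALL)


def parse_xml_tool_calls(text: str) -> list[dict]:
    """Extract XML tool-call blocks into {name, arguments} dicts."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        calls.append({"name": name, "arguments": dict(PARAM_RE.findall(body))})
    return calls


sample = (
    "<tool_call>\n<function=get_weather>\n"
    "<parameter=city>\nParis\n</parameter>\n"
    "</function>\n</tool_call>"
)
print(parse_xml_tool_calls(sample))
# -> [{'name': 'get_weather', 'arguments': {'city': 'Paris'}}]
```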

Sampler (using Qwen-recommended settings)

temperature:
  override: 0.7
  force: false
top_p:
  override: 0.8
  force: false
top_k:
  override: 20
  force: false
# min_p:
#   override: 0
#   force: false
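My understanding of these override fields (an assumption based on tabbyAPI's sampler-override behavior, not verified against its source) is that `override` supplies a default value and `force` decides whether it also replaces a client-supplied value:

```python
# Sketch of the assumed override/force semantics for the sampler config
# above. 'override' acts as a default; 'force: true' would win even when
# the client request supplies its own value.
def apply_override(request_value, override, force):
    if force or request_value is None:
        return override
    return request_value


assert apply_override(None, 0.7, False) == 0.7  # default applied
assert apply_override(1.0, 0.7, False) == 1.0   # client wins when not forced
assert apply_override(1.0, 0.7, True) == 0.7    # forced override wins
```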
`reasoning_tool_call_pr413.patch`
diff --git a/backends/exllamav3/model.py b/backends/exllamav3/model.py
index a6d7968..b49b303 100644
--- a/backends/exllamav3/model.py
+++ b/backends/exllamav3/model.py
@@ -1021,6 +1021,7 @@ class ExllamaV3Container(BaseModelContainer):
             max_rq_tokens=self.max_rq_tokens,
             filters=grammar_handler.filters,
         )
+        self.active_job_ids[request_id] = job
 
         generated_tokens = 0
         full_response = ""
@@ -1038,8 +1039,21 @@ class ExllamaV3Container(BaseModelContainer):
                 if chunk:
                     chunk_tokens = result.get("token_ids", self.tokenizer.encode(chunk))
                     full_response += chunk
+
+                    # Extract token IDs as a plain list for downstream consumers
                     if isinstance(chunk_tokens, torch.Tensor):
+                        token_id_list = chunk_tokens.flatten().tolist()
                         generated_tokens += chunk_tokens.size(dim=0)
+                    elif isinstance(chunk_tokens, tuple):
+                        first = chunk_tokens[0]
+                        if isinstance(first, torch.Tensor):
+                            token_id_list = first.flatten().tolist()
+                        else:
+                            token_id_list = list(first)
+                        generated_tokens += len(token_id_list)
+                    else:
+                        token_id_list = list(chunk_tokens)
+                        generated_tokens += len(token_id_list)
 
                     # Increase penalty range to generated token amount
                     # TODO:
@@ -1049,6 +1063,7 @@ class ExllamaV3Container(BaseModelContainer):
                     generation = {
                         "request_id": request_id,
                         "text": chunk,
+                        "token_ids": token_id_list,
                         "prompt_tokens": context_len,
                         "generated_tokens": generated_tokens,
                         "offset": len(full_response),
@@ -1069,8 +1084,6 @@ class ExllamaV3Container(BaseModelContainer):
 
                     yield finish_chunk
                     break
-            # Assign the active job to the request ID
-            self.active_job_ids[request_id] = job
 
         except asyncio.CancelledError:
             await job.cancel()
diff --git a/common/templating.py b/common/templating.py
index cc0cceb..dda06d8 100644
--- a/common/templating.py
+++ b/common/templating.py
@@ -12,6 +12,7 @@ from jinja2 import Template, TemplateError
 from jinja2.ext import loopcontrols
 from jinja2.sandbox import ImmutableSandboxedEnvironment
 from loguru import logger
+from markupsafe import Markup
 from packaging import version
 
 
@@ -24,12 +25,17 @@ class TemplateLoadError(Exception):
     pass
 
 
+VALID_TOOL_CALL_FORMATS = {"json", "xml", "auto"}
+
+
 @dataclass
 class TemplateMetadata:
     """Represents the parsed metadata from a template."""
 
     stop_strings: List[str] = field(default_factory=list)
     tool_start: Optional[str] = None
+    tool_end: Optional[str] = None
+    tool_call_format: str = "json"
 
 
 class PromptTemplate:
@@ -46,6 +52,22 @@ class PromptTemplate:
     )
     metadata: Optional[TemplateMetadata] = None
 
+    @staticmethod
+    def _tojson_compat(value, indent=None, ensure_ascii=True):
+        """Compatibility JSON filter for chat templates.
+
+        Some model templates call ``tojson(ensure_ascii=False)`` while the
+        bundled Jinja filter may not accept that keyword in sandboxed mode.
+        """
+        return Markup(
+            json.dumps(
+                value,
+                indent=indent,
+                ensure_ascii=ensure_ascii,
+                separators=(",", ": "),
+            )
+        )
+
     async def extract_metadata(self, template_vars: dict):
         """
         Returns deserialized template metadata from a chat template.
@@ -76,6 +98,22 @@ class PromptTemplate:
             if isinstance(template_module.tool_start, str):
                 template_metadata.tool_start = template_module.tool_start
 
+        if hasattr(template_module, "tool_end"):
+            if isinstance(template_module.tool_end, str):
+                template_metadata.tool_end = template_module.tool_end
+
+        if hasattr(template_module, "tool_call_format"):
+            fmt = template_module.tool_call_format
+            if isinstance(fmt, str) and fmt in VALID_TOOL_CALL_FORMATS:
+                template_metadata.tool_call_format = fmt
+                logger.debug(f"Template tool_call_format: {fmt}")
+            else:
+                logger.warning(
+                    f"Invalid tool_call_format '{fmt}' in template, "
+                    f"defaulting to 'json'. "
+                    f"Valid values: {VALID_TOOL_CALL_FORMATS}"
+                )
+
         self.metadata = template_metadata
         return template_metadata
 
@@ -107,6 +145,7 @@ class PromptTemplate:
 
         self.environment.globals["strftime_now"] = strftime_now
         self.environment.globals["raise_exception"] = raise_exception
+        self.environment.filters["tojson"] = self._tojson_compat
 
         return self.environment.from_string(template_str)
 
diff --git a/endpoints/OAI/types/chat_completion.py b/endpoints/OAI/types/chat_completion.py
index ca311dd..05aec35 100644
--- a/endpoints/OAI/types/chat_completion.py
+++ b/endpoints/OAI/types/chat_completion.py
@@ -4,7 +4,7 @@ from typing import Literal, Union, List, Optional, Dict
 from uuid import uuid4
 
 from endpoints.OAI.types.common import UsageStats, CommonCompletionRequest
-from endpoints.OAI.types.tools import ToolSpec, ToolCall
+from endpoints.OAI.types.tools import NamedToolChoice, ToolSpec, ToolCall
 
 
 class ChatCompletionLogprob(BaseModel):
@@ -72,6 +72,10 @@ class ChatCompletionRequest(CommonCompletionRequest):
 
     tools: Optional[List[ToolSpec]] = None
     functions: Optional[List[Dict]] = None
+    tool_choice: Optional[
+        Union[Literal["none", "auto", "required"], NamedToolChoice]
+    ] = None
+    parallel_tool_calls: Optional[bool] = True
 
     # Chat completions requests do not have a BOS token preference. Backend
     # respects the tokenization config for the individual model.
diff --git a/endpoints/OAI/types/tools.py b/endpoints/OAI/types/tools.py
index b5b9611..1e57266 100644
--- a/endpoints/OAI/types/tools.py
+++ b/endpoints/OAI/types/tools.py
@@ -1,5 +1,5 @@
 from pydantic import BaseModel, Field
-from typing import Dict, Literal
+from typing import Dict, Literal, Optional
 from uuid import uuid4
 
 
@@ -28,8 +28,28 @@ class Tool(BaseModel):
 
 
 class ToolCall(BaseModel):
-    """Represents an OAI tool description."""
+    """Represents an OAI tool call.
+
+    The ``index`` field is optional so it can be omitted in non-streaming
+    responses (where OpenAI does not include it) via ``exclude_none=True``,
+    while being set explicitly for streaming deltas where it is required
+    by strict validators like the Vercel AI SDK.
+    """
 
-    id: str = Field(default_factory=lambda: str(uuid4()).replace("-", "")[:9])
+    id: str = Field(default_factory=lambda: f"call_{uuid4().hex[:24]}")
     function: Tool
     type: Literal["function"] = "function"
+    index: Optional[int] = None
+
+
+class NamedToolFunction(BaseModel):
+    """Represents a named function reference for tool_choice."""
+
+    name: str
+
+
+class NamedToolChoice(BaseModel):
+    """Represents a named tool choice (forces a specific function call)."""
+
+    function: NamedToolFunction
+    type: Literal["function"] = "function"
diff --git a/endpoints/OAI/utils/chat_completion.py b/endpoints/OAI/utils/chat_completion.py
index fee51a6..a95d00d 100644
--- a/endpoints/OAI/utils/chat_completion.py
+++ b/endpoints/OAI/utils/chat_completion.py
@@ -1,6 +1,7 @@
 """Chat completion utilities for OAI server."""
 
 import asyncio
+import json
 import pathlib
 from asyncio import CancelledError
 from typing import List, Optional
@@ -28,6 +29,7 @@ from endpoints.OAI.types.chat_completion import (
     ChatCompletionStreamChoice,
 )
 from endpoints.OAI.types.common import UsageStats
+from endpoints.OAI.types.tools import NamedToolChoice, ToolCall
 from endpoints.OAI.utils.completion import _parse_gen_request_id, _stream_collector
 from endpoints.OAI.utils.tools import ToolCallProcessor, TOOL_CALL_SCHEMA
 
@@ -65,9 +67,27 @@ def _start_in_reasoning_mode(prompt: str) -> bool:
     num_end_tokens = prompt.count(model.container.reasoning_end_token)
     return num_start_tokens == num_end_tokens + 1
 
+def _serialize_stream_chunk(chunk) -> str:
+    """Serialize a streaming chunk with OpenAI-compatible field handling.
+
+    Uses exclude_none=True to strip irrelevant null fields (tool_calls,
+    tool_call_id, logprobs, usage) while ensuring finish_reason is always
+    present on each choice (as null when not set), matching OpenAI's
+    observed streaming behavior.
+    """
+    d = chunk.model_dump(exclude_none=True)
+    for choice in d.get("choices", []):
+        if "finish_reason" not in choice:
+            choice["finish_reason"] = None
+    return json.dumps(d, ensure_ascii=False)
+
 
 def _create_response(
-    request_id: str, generations: List[dict], model_name: Optional[str]
+    request_id: str,
+    generations: List[dict],
+    model_name: Optional[str],
+    tool_call_format: str = "json",
+    tool_choice=None,
 ):
     """Create a chat completion response from the provided text."""
 
@@ -84,9 +104,39 @@ def _create_response(
                 role="assistant", content=unwrap(generation.get("text"), "")
             )
 
-        tool_calls = generation["tool_calls"]
-        if tool_calls:
-            message.tool_calls = ToolCallProcessor.from_json(tool_calls)
+        tool_calls_raw = generation.get("tool_calls")
+        if tool_calls_raw:
+            parsed = ToolCallProcessor.parse(tool_calls_raw, format=tool_call_format)
+            if parsed and isinstance(tool_choice, NamedToolChoice):
+                parsed = ToolCallProcessor.filter_by_name(
+                    parsed, tool_choice.function.name
+                )
+            if parsed:
+                message.tool_calls = parsed
+            else:
+                logger.warning(
+                    "Tool call text present but parsing returned no results "
+                    f"(format={tool_call_format})"
+                )
+
+        # Fallback: detect bare XML tool calls in content that were not
+        # caught by the two-pass system (model never emitted tool_start)
+        if (
+            tool_call_format in ("xml", "auto")
+            and not message.tool_calls
+            and message.content
+            and "<function=" in message.content
+        ):
+            logger.warning(
+                "Fallback: Detected bare XML function blocks in content "
+                "(tool_start was likely not emitted by model)"
+            )
+            remaining, parsed = ToolCallProcessor.extract_content_and_tools(
+                message.content
+            )
+            if parsed:
+                message.tool_calls = parsed
+                message.content = remaining if remaining else None
 
         logprob_response = None
 
@@ -157,7 +207,12 @@ def _create_stream_chunk(
     is_usage_chunk: bool = False,
     is_reasoning_chunk: bool = False,
 ):
-    """Create a chat completion stream chunk from the provided text."""
+    """Create a chat completion stream chunk from the provided text.
+
+    Note: Tool-call streaming is handled separately by
+    _build_tool_call_chunks() which emits the proper three-phase
+    OpenAI-standard chunk sequence.
+    """
 
     index = generation.get("index")
     choices = []
@@ -178,20 +233,10 @@ def _create_stream_chunk(
             total_time=generation.get("total_time"),
         )
     elif "finish_reason" in generation:
-        # Get the finish reason from the generation
         finish_reason = generation.get("finish_reason")
-        choice = ChatCompletionStreamChoice(index=index, finish_reason=finish_reason)
-
-        # lets check if we have tool calls since we are at the end of the generation
-        # Mark finish_reason as tool_calls since this is the last chunk
-        if "tool_calls" in generation:
-            tool_calls = generation["tool_calls"]
-            message = ChatCompletionMessage(
-                tool_calls=ToolCallProcessor.from_json(tool_calls)
-            )
-            choice.delta = message
-            choice.finish_reason = "tool_calls"
-
+        choice = ChatCompletionStreamChoice(
+            index=index, finish_reason=finish_reason, delta={}
+        )
         choices.append(choice)
     else:
         message = (
@@ -241,6 +286,68 @@ def _create_stream_chunk(
     return chunk
 
 
+def _build_tool_call_chunks(
+    tool_calls: List[ToolCall],
+    request_id: str,
+    model_name: str,
+) -> List[ChatCompletionStreamChunk]:
+    """Build the OpenAI-standard streaming sequence for tool calls.
+
+    Emits two chunks:
+      1. Tool-call chunk: role="assistant", complete tool_calls with
+         index/id/type/name/arguments (all data in one chunk).
+      2. Finish chunk: empty delta, finish_reason="tool_calls".
+
+    Complete arguments are sent in a single chunk rather than streamed
+    incrementally, which is valid per OpenAI's spec (clients concatenate
+    argument strings across deltas) and maximizes compatibility with
+    clients that may not implement multi-chunk tool-call assembly.
+
+    The tool_calls are placed directly into a ChatCompletionMessage
+    (not a raw dict) so Pydantic validates them as ToolCall objects
+    with the index field preserved (ToolCall declares index as Optional[int]).
+    """
+    chunk_id = f"chatcmpl-{request_id}"
+
+    # Set index on each tool call for streaming
+    for idx, tc in enumerate(tool_calls):
+        tc.index = idx
+
+    # Chunk 1: Complete tool call data
+    tool_call_message = ChatCompletionMessage(
+        role="assistant",
+        tool_calls=tool_calls,
+    )
+    tool_chunk = ChatCompletionStreamChunk(
+        id=chunk_id,
+        choices=[
+            ChatCompletionStreamChoice(
+                index=0,
+                delta=tool_call_message,
+                finish_reason=None,
+            )
+        ],
+        model=model_name,
+    )
+
+    # Chunk 2: Finish signal
+    # Use model_construct to prevent Pydantic's smart Union from
+    # coercing the empty dict {} into ChatCompletionMessage(role="user")
+    finish_choice = ChatCompletionStreamChoice.model_construct(
+        index=0,
+        delta={},
+        finish_reason="tool_calls",
+        logprobs=None,
+    )
+    finish_chunk = ChatCompletionStreamChunk(
+        id=chunk_id,
+        choices=[finish_choice],
+        model=model_name,
+    )
+
+    return [tool_chunk, finish_chunk]
+
+
 async def _append_template_metadata(data: ChatCompletionRequest, template_vars: dict):
     """Adding metadata is a one-time process."""
 
@@ -285,6 +392,24 @@ async def format_messages_with_template(
 
         message_dicts.append(message.model_dump(exclude_none=True))
 
+    # Pre-template: convert tool_call arguments from JSON strings to dicts.
+    # OpenAI-compatible clients (Kilo, Roo, etc.) send arguments as JSON
+    # strings per the OAI spec, but Qwen3-Coder's template calls
+    # .items() on arguments which requires a dict/mapping.
+    for msg in message_dicts:
+        if msg.get("tool_calls"):
+            for tc in msg["tool_calls"]:
+                func = tc.get("function", {})
+                args = func.get("arguments")
+                if isinstance(args, str):
+                    try:
+                        func["arguments"] = json.loads(args)
+                    except (json.JSONDecodeError, ValueError):
+                        logger.warning(
+                            "Failed to parse tool_call arguments JSON "
+                            "string to dict, keeping as string"
+                        )
+
     # Get all special tokens
     special_tokens_dict = model.container.get_special_tokens()
 
@@ -367,6 +492,7 @@ async def stream_generate_chat_completion(
     gen_queue = asyncio.Queue()
     gen_tasks: List[asyncio.Task] = []
     tool_start = model.container.prompt_template.metadata.tool_start
+    tool_call_format = model.container.prompt_template.metadata.tool_call_format
     disconnect_task = asyncio.create_task(request_disconnect_loop(request))
 
     try:
@@ -401,13 +527,26 @@ async def stream_generate_chat_completion(
 
         # Consumer loop
         while True:
+            # Fast path: items already queued β€” no task overhead
+            if not gen_queue.empty():
+                generation = gen_queue.get_nowait()
+            else:
+                # Slow path: queue empty β€” race get against disconnect
+                get_task = asyncio.create_task(gen_queue.get())
+                done, _ = await asyncio.wait(
+                    [get_task, disconnect_task],
+                    return_when=asyncio.FIRST_COMPLETED,
+                )
+                if disconnect_task in done:
+                    get_task.cancel()
+                    raise CancelledError()
+                generation = get_task.result()
+
             if disconnect_task.done():
                 raise CancelledError()
 
-            generation = await gen_queue.get()
-
             # Handle options if a tool model is present
-            if tool_start:
+            if tool_start and data.tool_choice != "none":
                 if "stop_str" in generation:
                     generations = await generate_tool_calls(
                         prompt,
@@ -419,6 +558,50 @@ async def stream_generate_chat_completion(
 
                     # Only one generation present in this case
                     generation = generations[0]
+
+                    # Emit proper three-phase tool-call streaming sequence
+                    if "tool_calls" in generation:
+                        tool_calls_raw = generation["tool_calls"]
+                        parsed = ToolCallProcessor.parse(
+                            tool_calls_raw, format=tool_call_format
+                        )
+                        if parsed and isinstance(data.tool_choice, NamedToolChoice):
+                            parsed = ToolCallProcessor.filter_by_name(
+                                parsed, data.tool_choice.function.name
+                            )
+                        if parsed:
+                            for tc_chunk in _build_tool_call_chunks(
+                                parsed,
+                                request.state.id,
+                                model_path.name,
+                            ):
+                                yield _serialize_stream_chunk(tc_chunk)
+
+                            # Handle completion and usage after tool calls
+                            if (
+                                all(task.done() for task in gen_tasks)
+                                and gen_queue.empty()
+                            ):
+                                if (
+                                    data.stream_options
+                                    and data.stream_options.include_usage
+                                ):
+                                    usage_chunk = _create_stream_chunk(
+                                        request.state.id,
+                                        generation,
+                                        model_path.name,
+                                        is_usage_chunk=True,
+                                    )
+                                    yield _serialize_stream_chunk(usage_chunk)
+
+                                logger.info(
+                                    "Finished chat completion streaming "
+                                    f"request {request.state.id}"
+                                )
+                                yield "[DONE]"
+                                break
+                            continue
+
                 elif "text" in generation:
                     current_generation_text += generation["text"]
 
@@ -445,7 +628,7 @@ async def stream_generate_chat_completion(
                 model_path.name,
                 is_reasoning_chunk=is_reasoning_chunk,
             )
-            yield response.model_dump_json()
+            yield _serialize_stream_chunk(response)
 
             # Check if all tasks are completed
             if all(task.done() for task in gen_tasks) and gen_queue.empty():
@@ -457,7 +640,7 @@ async def stream_generate_chat_completion(
                         model_path.name,
                         is_usage_chunk=True,
                     )
-                    yield usage_chunk.model_dump_json()
+                    yield _serialize_stream_chunk(usage_chunk)
 
                 logger.info(
                     f"Finished chat completion streaming request {request.state.id}"
@@ -468,13 +651,14 @@ async def stream_generate_chat_completion(
     except CancelledError:
         # Get out if the request gets disconnected
 
-        if not abort_event.is_set():
-            abort_event.set()
-            handle_request_disconnect("Chat completion generation cancelled by user.")
+        handle_request_disconnect("Chat completion generation cancelled by user.")
     except Exception:
         yield get_generator_error(
             "Chat completion aborted. Please check the server console."
         )
+    finally:
+        abort_event.set()
+        disconnect_task.cancel()
 
 
 async def generate_chat_completion(
@@ -486,6 +670,7 @@ async def generate_chat_completion(
 ):
     gen_tasks: List[asyncio.Task] = []
     tool_start = model.container.prompt_template.metadata.tool_start
+    tool_call_format = model.container.prompt_template.metadata.tool_call_format
 
     try:
         logger.info(f"Received chat completion request {request.state.id}")
@@ -507,12 +692,21 @@ async def generate_chat_completion(
         generations = await asyncio.gather(*gen_tasks)
 
         # Check all the generations and see if a tool call is required
-        if tool_start:
+        force_tool_pass = data.tool_choice == "required" or isinstance(
+            data.tool_choice, NamedToolChoice
+        )
+        if tool_start or force_tool_pass:
             generations = await generate_tool_calls(
                 prompt, embeddings, data, generations, request
             )
 
-        response = _create_response(request.state.id, generations, model_path.name)
+        response = _create_response(
+            request.state.id,
+            generations,
+            model_path.name,
+            tool_call_format=tool_call_format,
+            tool_choice=data.tool_choice,
+        )
 
         logger.info(f"Finished chat completion request {request.state.id}")
 
@@ -537,24 +731,72 @@ async def generate_tool_calls(
 ):
     gen_tasks: List[asyncio.Task] = []
     tool_start = model.container.prompt_template.metadata.tool_start
+    tool_call_format = model.container.prompt_template.metadata.tool_call_format
+    tool_choice = data.tool_choice
+
+    if tool_choice == "none":
+        return generations
 
     # Tracks which generations asked for a tool call
     tool_idx: List[int] = []
 
     # Copy to make sure the parent JSON schema doesn't get modified
     tool_data = data.model_copy(deep=True)
-    tool_data.json_schema = TOOL_CALL_SCHEMA
+
+    if tool_call_format in ("xml", "auto"):
+        # XML / auto mode: let the model generate its natural output
+        # without JSON schema constraint
+        logger.debug(
+            f"generate_tool_calls: Using '{tool_call_format}' mode "
+            f"(no JSON schema constraint)"
+        )
+
+        # Remove tool_start from stop strings so the model can emit
+        # multiple sequential <tool_call> blocks without stopping early
+        if (
+            tool_start
+            and isinstance(tool_data.stop, list)
+            and tool_start in tool_data.stop
+        ):
+            tool_data.stop = [s for s in tool_data.stop if s != tool_start]
+            logger.debug(
+                f"generate_tool_calls: Removed '{tool_start}' from "
+                f"second-pass stop strings"
+            )
+    else:
+        # JSON mode: constrained generation (existing behavior)
+        tool_data.json_schema = TOOL_CALL_SCHEMA
 
     for idx, gen in enumerate(generations):
-        if gen["stop_str"] != tool_start:
+        stop_str = gen.get("stop_str")
+        should_generate = stop_str == tool_start
+
+        # Force tool generation if tool_choice requires it
+        if not should_generate and (
+            tool_choice == "required" or isinstance(tool_choice, NamedToolChoice)
+        ):
+            should_generate = True
+
+        if not should_generate:
             continue
 
-        logger.info(f"Detected tool call in chat completion request {request.state.id}")
+        logger.info(
+            f"Detected tool call in chat completion request "
+            f"{request.state.id} (format={tool_call_format})"
+        )
 
-        # Append the existing generation text if present
+        # Build per-generation prompt (avoid mutating shared prompt)
+        tool_prompt = prompt
         precursor_text = gen.get("full_text")
         if precursor_text:
-            prompt = prompt + precursor_text
+            tool_prompt = tool_prompt + precursor_text
+
+        # For XML/auto mode: append tool_start back to prompt.
+        # The stop string was consumed by the first pass and not included
+        # in full_text, but the model expects to continue after <tool_call>.
+        # Include a trailing newline to match the canonical template format.
+        if tool_call_format in ("xml", "auto") and tool_start:
+            tool_prompt = tool_prompt + tool_start + "\n"
 
         gen_request_id = gen.get("request_id")
         tool_request_id = f"{gen_request_id}-tool"
@@ -563,7 +805,7 @@ async def generate_tool_calls(
             asyncio.create_task(
                 model.container.generate(
                     tool_request_id,
-                    prompt,
+                    tool_prompt,
                     tool_data,
                     mm_embeddings=embeddings,
                 )
@@ -577,6 +819,12 @@ async def generate_tool_calls(
 
         # Map tool calls to their appropriate generation
         for gen_idx, tool_call in zip(tool_idx, tool_calls, strict=True):
-            generations[gen_idx]["tool_calls"] = tool_call["text"]
+            raw_text = tool_call["text"]
+
+            if tool_call_format in ("xml", "auto"):
+                # Prepend tool_start to reconstruct complete XML for parser
+                raw_text = tool_start + "\n" + raw_text
+
+            generations[gen_idx]["tool_calls"] = raw_text
 
     return generations
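
For context, the second-pass bookkeeping above can be sketched in isolation. This is a minimal illustration (helper names are mine, not from the PR) of the key invariant: the first pass stops on `tool_start`, which is consumed and therefore absent from `full_text`, so the second pass must re-append it to the prompt and later prepend it to the raw output before parsing.

```python
# Illustrative sketch of the XML second-pass prompt/output reconstruction;
# TOOL_START matches the Qwen-style marker used in the patch.
TOOL_START = "<tool_call>"


def build_second_pass_prompt(prompt, full_text, tool_start=TOOL_START):
    """Append any precursor text, then restore the consumed stop marker."""
    if full_text:
        prompt = prompt + full_text
    # Trailing newline matches the canonical template format
    return prompt + tool_start + "\n"


def reconstruct_raw_xml(second_pass_text, tool_start=TOOL_START):
    """Prepend tool_start so the XML parser sees a complete block."""
    return tool_start + "\n" + second_pass_text


prompt = build_second_pass_prompt("<|im_start|>assistant\n", "Let me check.")
print(prompt.endswith("<tool_call>\n"))  # True
```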
diff --git a/endpoints/OAI/utils/completion.py b/endpoints/OAI/utils/completion.py
index f66d381..c11a25b 100644
--- a/endpoints/OAI/utils/completion.py
+++ b/endpoints/OAI/utils/completion.py
@@ -225,11 +225,24 @@ async def stream_generate_completion(
 
         # Consumer loop
         while True:
+            # Fast path: items already queued - no task overhead
+            if not gen_queue.empty():
+                generation = gen_queue.get_nowait()
+            else:
+                # Slow path: queue empty - race get against disconnect
+                get_task = asyncio.create_task(gen_queue.get())
+                done, _ = await asyncio.wait(
+                    [get_task, disconnect_task],
+                    return_when=asyncio.FIRST_COMPLETED,
+                )
+                if disconnect_task in done:
+                    get_task.cancel()
+                    raise CancelledError()
+                generation = get_task.result()
+
             if disconnect_task.done():
                 raise CancelledError()
 
-            generation = await gen_queue.get()
-
             # Stream collector will push an exception to the queue if it fails
             if isinstance(generation, Exception):
                 raise generation
@@ -245,15 +258,16 @@ async def stream_generate_completion(
     except CancelledError:
         # Get out if the request gets disconnected
 
-        if not abort_event.is_set():
-            abort_event.set()
-            handle_request_disconnect(
-                f"Completion generation {request.state.id} cancelled by user."
-            )
+        handle_request_disconnect(
+            f"Completion generation {request.state.id} cancelled by user."
+        )
     except Exception:
         yield get_generator_error(
             f"Completion {request.state.id} aborted. Please check the server console."
         )
+    finally:
+        abort_event.set()
+        disconnect_task.cancel()
 
 
 async def generate_completion(
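
The consumer-loop change above follows a common asyncio pattern: drain already-queued chunks directly, and only when the queue is empty race `queue.get()` against the disconnect watcher so a dropped client cancels generation promptly. A self-contained sketch of that pattern (the queue contents and watcher here are stand-ins):

```python
import asyncio


async def next_item_or_disconnect(gen_queue, disconnect_task):
    # Fast path: consume queued items without creating an extra task
    if not gen_queue.empty():
        return gen_queue.get_nowait()
    # Slow path: race the queue read against the disconnect watcher
    get_task = asyncio.create_task(gen_queue.get())
    done, _ = await asyncio.wait(
        [get_task, disconnect_task], return_when=asyncio.FIRST_COMPLETED
    )
    if disconnect_task in done:
        get_task.cancel()
        raise asyncio.CancelledError()
    return get_task.result()


async def demo():
    queue = asyncio.Queue()
    await queue.put("chunk-1")
    # Stand-in for the real disconnect watcher; never fires in this demo
    disconnect = asyncio.create_task(asyncio.sleep(60))
    item = await next_item_or_disconnect(queue, disconnect)
    disconnect.cancel()
    return item


print(asyncio.run(demo()))  # chunk-1
```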
diff --git a/endpoints/OAI/utils/tools.py b/endpoints/OAI/utils/tools.py
index c1ebded..05eaf14 100644
--- a/endpoints/OAI/utils/tools.py
+++ b/endpoints/OAI/utils/tools.py
@@ -1,8 +1,11 @@
+"""Tool call processing utilities for OAI server."""
+
 import json
+import re
 from loguru import logger
-from typing import List
+from typing import Any, List, Tuple
 
-from endpoints.OAI.types.tools import ToolCall
+from endpoints.OAI.types.tools import ToolCall, Tool
 
 
 TOOL_CALL_SCHEMA = {
@@ -27,24 +30,480 @@ TOOL_CALL_SCHEMA = {
     },
 }
 
+# ---------------------------------------------------------------------------
+# XML parsing regex patterns
+# Derived from vLLM's Qwen3CoderToolParser and the official Qwen parser.
+# These handle both complete and partially-closed tags.
+# ---------------------------------------------------------------------------
+
+# Matches complete <tool_call>...</tool_call> blocks
+TOOL_CALL_BLOCK_RE = re.compile(
+    r"<tool_call>(.*?)</tool_call>",
+    re.DOTALL,
+)
+
+# Matches <function=NAME>BODY</function> blocks
+FUNCTION_RE = re.compile(
+    r"<function=(.*?)>(.*?)</function>",
+    re.DOTALL,
+)
+
+# Matches <parameter=KEY>VALUE up to a terminator
+# Terminates on: </parameter>, next <parameter=, </function>, or <tool_call>
+PARAMETER_RE = re.compile(
+    r"<parameter=(.*?)>(.*?)"
+    r"(?:</parameter>|(?=<parameter=)|(?=</function>)|(?=<tool_call>))",
+    re.DOTALL,
+)
+
+# Think block patterns
+THINK_BLOCK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
+THINK_UNCLOSED_RE = re.compile(r"<think>(?!.*</think>).*$", re.DOTALL)
+
+# Markdown code fence patterns
+CODE_FENCE_RE = re.compile(r"^```(?:json)?\s*", re.MULTILINE)
+CODE_FENCE_END_RE = re.compile(r"\s*```\s*$", re.MULTILINE)
+
+
+def _strip_think_blocks(text: str) -> str:
+    """Strip <think>...</think> blocks from text.
+
+    Handles both complete and unclosed blocks (quantization can cause
+    the model to never close a think tag).
+    """
+    original = text
+
+    # Complete blocks first
+    text = THINK_BLOCK_RE.sub("", text)
+
+    # Unclosed block (think started but never closed - strip to end)
+    text = THINK_UNCLOSED_RE.sub("", text)
+
+    if text != original:
+        if THINK_UNCLOSED_RE.search(original):
+            logger.warning(
+                "XML Parser: Stripped unclosed <think> block "
+                "(possible quantization degradation)"
+            )
+        else:
+            logger.debug("XML Parser: Stripped <think> block(s) from output")
+
+    return text
+
+
+def _coerce_param_value(raw: str) -> Any:
+    """Coerce a raw parameter value string to the appropriate Python type.
+
+    Strategy (safe, no eval()):
+      1. Strip leading/trailing newlines (official template emits \\n
+         after opening tag and before closing tag).
+      2. Try json.loads - handles objects, arrays, numbers, bools, null.
+      3. Fall back to plain string.
+    """
+    # Strip template-inserted newlines around values
+    if raw.startswith("\n"):
+        raw = raw[1:]
+    if raw.endswith("\n"):
+        raw = raw[:-1]
+
+    stripped = raw.strip()
+
+    # Empty string
+    if not stripped:
+        return ""
+
+    # Try JSON parse (handles objects, arrays, numbers, booleans, null)
+    try:
+        return json.loads(stripped)
+    except (json.JSONDecodeError, ValueError):
+        pass
+
+    # Fall back to string - never eval()
+    return stripped
+
 
 class ToolCallProcessor:
+
+    # ------------------------------------------------------------------
+    # JSON normalization helpers
+    # ------------------------------------------------------------------
+
+    @staticmethod
+    def _normalize_tool_calls(raw) -> list:
+        """Normalize model-emitted tool call payloads into OAI-like objects.
+
+        Accepted forms:
+        - [{"type":"function","function":{"name":...,"arguments":{...}}}]
+        - [{"name":...,"arguments":{...}}]
+        - {"name":...,"arguments":{...}}
+        """
+        if isinstance(raw, dict):
+            raw = [raw]
+        if not isinstance(raw, list):
+            raise ValueError("tool_calls payload is not list/dict")
+
+        normalized: list = []
+        for item in raw:
+            if not isinstance(item, dict):
+                continue
+
+            if "function" in item and isinstance(item["function"], dict):
+                fn = item["function"]
+                name = fn.get("name")
+                arguments = fn.get("arguments", {})
+            else:
+                name = item.get("name")
+                arguments = item.get("arguments", {})
+
+            if name is None:
+                continue
+
+            if isinstance(arguments, str):
+                try:
+                    arguments = json.loads(arguments)
+                except json.JSONDecodeError:
+                    arguments = {"input": arguments}
+
+            normalized.append(
+                {
+                    "type": "function",
+                    "function": {
+                        "name": name,
+                        "arguments": arguments if isinstance(arguments, dict) else {},
+                    },
+                }
+            )
+        return normalized
+
+    @staticmethod
+    def _safe_json_loads(payload: str) -> list:
+        """Best-effort JSON parse for model-emitted tool payloads.
+
+        Handles: clean JSON, markdown-fenced JSON, JSON substrings in
+        surrounding text, flat {name, arguments} dicts, and single objects.
+        """
+        # Direct parse
+        try:
+            return ToolCallProcessor._normalize_tool_calls(json.loads(payload))
+        except (json.JSONDecodeError, ValueError):
+            pass
+
+        # Clean up common model artifacts (markdown fences, whitespace)
+        cleaned = payload.strip()
+        cleaned = CODE_FENCE_RE.sub("", cleaned)
+        cleaned = CODE_FENCE_END_RE.sub("", cleaned)
+        cleaned = cleaned.strip()
+
+        # Try cleaned
+        try:
+            return ToolCallProcessor._normalize_tool_calls(json.loads(cleaned))
+        except (json.JSONDecodeError, ValueError):
+            pass
+
+        # Find JSON array substring
+        start = cleaned.find("[")
+        end = cleaned.rfind("]")
+        if start != -1 and end != -1 and end > start:
+            try:
+                return ToolCallProcessor._normalize_tool_calls(
+                    json.loads(cleaned[start : end + 1])
+                )
+            except (json.JSONDecodeError, ValueError):
+                pass
+
+        # Find JSON object substring
+        obj_start = cleaned.find("{")
+        obj_end = cleaned.rfind("}")
+        if obj_start != -1 and obj_end != -1 and obj_end > obj_start:
+            try:
+                return ToolCallProcessor._normalize_tool_calls(
+                    json.loads(cleaned[obj_start : obj_end + 1])
+                )
+            except (json.JSONDecodeError, ValueError):
+                pass
+
+        raise json.JSONDecodeError(
+            "Could not extract valid JSON from payload", payload, 0
+        )
+
+    # ------------------------------------------------------------------
+    # JSON parsing
+    # ------------------------------------------------------------------
+
     @staticmethod
     def from_json(tool_calls_str: str) -> List[ToolCall]:
-        """Postprocess tool call JSON to a parseable class"""
+        """Postprocess tool call JSON to a parseable class.
 
-        tool_calls = json.loads(tool_calls_str)
+        Handles clean JSON arrays, markdown-fenced output, flat dicts,
+        and other common model output variations via _safe_json_loads.
+        """
+        logger.debug(f"JSON Parser: Parsing tool calls ({len(tool_calls_str)} chars)")
+
+        tool_calls = ToolCallProcessor._safe_json_loads(tool_calls_str)
         for tool_call in tool_calls:
             tool_call["function"]["arguments"] = json.dumps(
                 tool_call["function"]["arguments"]
             )
 
-        return [ToolCall(**tool_call) for tool_call in tool_calls]
+        result = [ToolCall(**tool_call) for tool_call in tool_calls]
+        logger.debug(f"JSON Parser: Successfully parsed {len(result)} tool call(s)")
+        return result
+
+    # ------------------------------------------------------------------
+    # XML parsing (Qwen3-Coder / GLM-4.5 style)
+    # ------------------------------------------------------------------
 
     @staticmethod
-    def dump(tool_calls: List[ToolCall]) -> List[dict]:
+    def from_xml(raw_text: str) -> List[ToolCall]:
+        """Parse Qwen3-Coder XML-format tool calls into ToolCall objects.
+
+        Handles:
+          - Wrapped: <tool_call><function=name>...</function></tool_call>
+          - Bare: <function=name>...</function> (missing wrapper)
+          - Multiple sequential tool call blocks
+          - <think> blocks (stripped)
+          - Multi-line parameter values
+          - Missing </parameter> closing tags
+        """
+        logger.debug(f"XML Parser: Parsing tool calls ({len(raw_text)} chars)")
+
+        # Stage 1: Strip think blocks
+        text = _strip_think_blocks(raw_text)
+
+        # Stage 2: Check for incomplete XML at end (generation cutoff)
+        stripped_end = text.rstrip()
+        if stripped_end.endswith(("<", "</", "<parameter", "<function")):
+            logger.warning(
+                f"XML Parser: Detected incomplete XML tag at end: "
+                f"...{stripped_end[-80:]}"
+            )
+            text = re.sub(r"<[^>]*$", "", text)
+
+        # Stage 3: Extract function blocks
+        # First, find all wrapped <tool_call>...</tool_call> blocks
+        wrapped_positions = [
+            (m.start(), m.end()) for m in TOOL_CALL_BLOCK_RE.finditer(text)
+        ]
+
+        # Collect function blocks from inside wrapped regions
+        function_blocks = []
+        for match in TOOL_CALL_BLOCK_RE.finditer(text):
+            inner = match.group(1)
+            for func_match in FUNCTION_RE.finditer(inner):
+                function_blocks.append((func_match.group(1), func_match.group(2)))
+
+        # Find bare <function> blocks NOT inside any wrapped region
+        for func_match in FUNCTION_RE.finditer(text):
+            pos = func_match.start()
+            is_wrapped = any(start <= pos < end for start, end in wrapped_positions)
+            if not is_wrapped:
+                logger.debug(
+                    "XML Parser: Found bare <function> block without "
+                    "<tool_call> wrapper"
+                )
+                function_blocks.append((func_match.group(1), func_match.group(2)))
+
+        if not function_blocks:
+            logger.warning("XML Parser: No <function=...> blocks found")
+            return []
+
+        # Stage 4: Parse each function block into a ToolCall
+        tool_calls = []
+        for func_name_raw, func_body in function_blocks:
+            func_name = func_name_raw.strip()
+
+            # Extract parameters
+            params = {}
+            for param_match in PARAMETER_RE.finditer(func_body):
+                key = param_match.group(1).strip()
+                value_raw = param_match.group(2)
+                value = _coerce_param_value(value_raw)
+                params[key] = value
+
+            arguments_json = json.dumps(params, ensure_ascii=False)
+
+            tool_call = ToolCall(
+                function=Tool(name=func_name, arguments=arguments_json)
+            )
+            tool_calls.append(tool_call)
+
+        logger.debug(f"XML Parser: Successfully parsed {len(tool_calls)} tool call(s)")
+        return tool_calls
+
+    # ------------------------------------------------------------------
+    # Auto-detect parsing (JSON β†’ JSON-in-tool_call β†’ XML)
+    # ------------------------------------------------------------------
+
+    @staticmethod
+    def from_auto(raw_text: str) -> List[ToolCall]:
+        """Auto-detect format and parse.
+
+        Tries in order:
+          1. Pure JSON (standard TabbyAPI / Llama)
+          2. JSON inside <tool_call> wrappers (Qwen3-Instruct style)
+          3. XML with <function=...> tags (Qwen3-Coder style)
         """
-        Convert ToolCall objects to a list of dictionaries.
+        logger.debug("Auto Parser: Attempting format auto-detection")
+
+        # Attempt 1: Pure JSON array
+        try:
+            result = ToolCallProcessor.from_json(raw_text)
+            logger.debug("Auto Parser: Detected JSON format")
+            return result
+        except (json.JSONDecodeError, ValueError, KeyError) as e:
+            logger.debug(f"Auto Parser: Not JSON ({e}), trying next format")
+
+        # Attempt 2: JSON inside <tool_call> wrappers (Qwen3-Instruct)
+        try:
+            all_tool_calls = []
+            for match in TOOL_CALL_BLOCK_RE.finditer(raw_text):
+                inner = match.group(1).strip()
+                if inner.startswith("{") or inner.startswith("["):
+                    parsed = json.loads(inner)
+                    if isinstance(parsed, dict):
+                        parsed = [parsed]
+                    if isinstance(parsed, list):
+                        for tc in parsed:
+                            name = tc.get("name", "")
+                            arguments = tc.get("arguments", {})
+                            if not isinstance(arguments, str):
+                                arguments = json.dumps(arguments)
+                            all_tool_calls.append(
+                                ToolCall(function=Tool(name=name, arguments=arguments))
+                            )
+            if all_tool_calls:
+                logger.debug(
+                    "Auto Parser: Detected JSON-inside-tool_call "
+                    f"format ({len(all_tool_calls)} call(s))"
+                )
+                return all_tool_calls
+        except (json.JSONDecodeError, ValueError, KeyError) as e:
+            logger.debug(f"Auto Parser: Not JSON-in-tool_call ({e}), trying XML")
+
+        # Attempt 3: XML format (Qwen3-Coder style)
+        result = ToolCallProcessor.from_xml(raw_text)
+        if result:
+            logger.debug("Auto Parser: Detected XML format")
+        else:
+            logger.warning("Auto Parser: All format detection attempts failed")
+        return result
+
+    # ------------------------------------------------------------------
+    # Dispatcher
+    # ------------------------------------------------------------------
+
+    @staticmethod
+    def parse(tool_calls_str: str, format: str = "json") -> List[ToolCall]:
+        """Dispatch tool call parsing to the appropriate format handler.
+
+        Args:
+            tool_calls_str: Raw tool call text from model generation.
+            format: One of ``"json"``, ``"xml"``, ``"auto"``.
+
+        Returns:
+            List of parsed ToolCall objects.  Empty list on parse failure
+            (never raises).
+        """
+        try:
+            if format == "xml":
+                return ToolCallProcessor.from_xml(tool_calls_str)
+            elif format == "auto":
+                return ToolCallProcessor.from_auto(tool_calls_str)
+            else:
+                return ToolCallProcessor.from_json(tool_calls_str)
+        except Exception as e:
+            logger.error(
+                f"ToolCallProcessor.parse: Failed to parse tool calls "
+                f"(format={format}): {e}"
+            )
+            return []
+
+    # ------------------------------------------------------------------
+    # Filtering
+    # ------------------------------------------------------------------
+
+    @staticmethod
+    def filter_by_name(
+        tool_calls: List[ToolCall], function_name: str
+    ) -> List[ToolCall]:
+        """Filter parsed tool calls to only those matching a function name."""
+        filtered = [tc for tc in tool_calls if tc.function.name == function_name]
+        if not filtered:
+            logger.warning(
+                f"filter_by_name: No tool calls matched '{function_name}' "
+                f"(had {len(tool_calls)} call(s))"
+            )
+        return filtered
+
+    # ------------------------------------------------------------------
+    # Content / tool-call separation
+    # ------------------------------------------------------------------
+
+    @staticmethod
+    def extract_content_and_tools(
+        raw_text: str,
+    ) -> Tuple[str, List[ToolCall]]:
+        """Separate plain text content from XML tool call blocks.
+
+        Used when the model mixes reasoning text with tool calls, e.g.:
+        ``"I'll help with that: <tool_call><function=...>...``
+
+        Returns:
+            Tuple of (remaining_content, tool_calls).
+        """
+        text = _strip_think_blocks(raw_text)
+
+        # Collect all XML regions to exclude from content
+        xml_regions = []
+
+        # Wrapped tool call blocks
+        for match in TOOL_CALL_BLOCK_RE.finditer(text):
+            xml_regions.append((match.start(), match.end()))
+
+        # Bare function blocks not inside wrappers
+        for match in FUNCTION_RE.finditer(text):
+            pos = match.start()
+            is_wrapped = any(start <= pos < end for start, end in xml_regions)
+            if not is_wrapped:
+                xml_regions.append((match.start(), match.end()))
+
+        # Sort and extract content (everything outside XML regions)
+        xml_regions.sort()
+        content_parts = []
+        last_end = 0
+        for start, end in xml_regions:
+            if start > last_end:
+                part = text[last_end:start].strip()
+                if part:
+                    content_parts.append(part)
+            last_end = end
+        if last_end < len(text):
+            part = text[last_end:].strip()
+            if part:
+                content_parts.append(part)
+
+        content = " ".join(content_parts).strip()
+
+        # Parse tool calls from the full text
+        tool_calls = ToolCallProcessor.from_xml(text)
+
+        logger.debug(
+            f"extract_content_and_tools: Found {len(tool_calls)} tool "
+            f"call(s), content={'yes' if content else 'no'} "
+            f"({len(content)} chars)"
+        )
+
+        return content, tool_calls
+
+    # ------------------------------------------------------------------
+    # Serialisation helpers (unchanged from original)
+    # ------------------------------------------------------------------
+
+    @staticmethod
+    def dump(tool_calls: List[ToolCall]) -> List[dict]:
+        """Convert ToolCall objects to a list of dictionaries.
 
         Args:
             tool_calls (List[ToolCall]): List of ToolCall objects to convert
@@ -65,8 +524,7 @@ class ToolCallProcessor:
 
     @staticmethod
     def to_json(tool_calls: List[ToolCall]) -> str:
-        """
-        Convert ToolCall objects to JSON string representation.
+        """Convert ToolCall objects to JSON string representation.
 
         Args:
             tool_calls (List[ToolCall]): List of ToolCall objects to convert
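
To make the parameter extraction above concrete, here is a standalone sketch: the regex is copied from the diff, and `coerce()` approximates `_coerce_param_value` (strip template-inserted newlines, try JSON, fall back to a plain string, never `eval()`). Note how the second parameter below is missing its `</parameter>` close tag and is still captured via the `</function>` lookahead.

```python
import json
import re

# Regex copied from the patch: value ends at an explicit close tag or at
# the start of the next structural tag (lookaheads consume nothing).
PARAMETER_RE = re.compile(
    r"<parameter=(.*?)>(.*?)"
    r"(?:</parameter>|(?=<parameter=)|(?=</function>)|(?=<tool_call>))",
    re.DOTALL,
)


def coerce(raw):
    """Approximation of _coerce_param_value: newline strip, JSON, string."""
    raw = raw.removeprefix("\n").removesuffix("\n")
    stripped = raw.strip()
    if not stripped:
        return ""
    try:
        return json.loads(stripped)  # numbers, bools, objects, arrays
    except (json.JSONDecodeError, ValueError):
        return stripped  # plain string fallback, never eval()


body = (
    "<parameter=city>\nParis\n</parameter>\n"
    "<parameter=days>\n3\n"   # missing </parameter>: terminated
    "</function>"             # by the following </function> lookahead
)
params = {m.group(1).strip(): coerce(m.group(2)) for m in PARAMETER_RE.finditer(body)}
print(params)  # {'city': 'Paris', 'days': 3}
```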
diff --git a/templates/tool_calls/qwen3_coder.jinja b/templates/tool_calls/qwen3_coder.jinja
new file mode 100644
index 0000000..1527274
--- /dev/null
+++ b/templates/tool_calls/qwen3_coder.jinja
@@ -0,0 +1,123 @@
+{# TabbyAPI Metadata #}
+{%- set tool_call_format = "xml" -%}
+{%- set tool_start = "<tool_call>" -%}
+{%- set tool_end = "</tool_call>" -%}
+{%- set stop_strings = ["<|im_start|>", "<|im_end|>"] -%}
+
+{% macro render_extra_keys(json_dict, handled_keys) %}
+    {%- if json_dict is mapping %}
+        {%- for json_key in json_dict if json_key not in handled_keys %}
+            {%- if json_dict[json_key] is string %}
+                {{-'\n<' ~ json_key ~ '>' ~ (json_dict[json_key] | string) ~ '</' ~ json_key ~ '>' }}
+            {%- else %}
+                {{- '\n<' ~ json_key ~ '>' ~ (json_dict[json_key] | tojson | safe) ~ '</' ~ json_key ~ '>' }}
+            {%- endif %}
+        {%- endfor %}
+    {%- endif %}
+{%- endmacro %}
+
+{%- if messages[0]["role"] == "system" %}
+    {%- set system_message = messages[0]["content"] %}
+    {%- set loop_messages = messages[1:] %}
+{%- else %}
+    {%- set loop_messages = messages %}
+{%- endif %}
+
+{%- if not tools is defined %}
+    {%- set tools = [] %}
+{%- endif %}
+
+{%- if system_message is defined %}
+    {{- "<|im_start|>system\n" + system_message }}
+{%- else %}
+    {%- if tools is iterable and tools | length > 0 %}
+        {{- "<|im_start|>system\nYou are Qwen, a helpful AI assistant that can interact with a computer to solve tasks." }}
+    {%- endif %}
+{%- endif %}
+{%- if tools is iterable and tools | length > 0 %}
+    {{- "\n\n# Tools\n\nYou have access to the following functions:\n\n" }}
+    {{- "<tools>" }}
+    {%- for tool in tools %}
+        {%- if tool.function is defined %}
+            {%- set tool = tool.function %}
+        {%- endif %}
+        {{- "\n<function>\n<name>" ~ tool.name ~ "</name>" }}
+        {%- if tool.description is defined %}
+            {{- '\n<description>' ~ (tool.description | trim) ~ '</description>' }}
+        {%- endif %}
+        {{- '\n<parameters>' }}
+        {%- if tool.parameters is defined and tool.parameters is mapping and tool.parameters.properties is defined and tool.parameters.properties is mapping %}
+            {%- for param_name, param_fields in tool.parameters.properties|items %}
+                {{- '\n<parameter>' }}
+                {{- '\n<name>' ~ param_name ~ '</name>' }}
+                {%- if param_fields.type is defined %}
+                    {{- '\n<type>' ~ (param_fields.type | string) ~ '</type>' }}
+                {%- endif %}
+                {%- if param_fields.description is defined %}
+                    {{- '\n<description>' ~ (param_fields.description | trim) ~ '</description>' }}
+                {%- endif %}
+                {%- set handled_keys = ['name', 'type', 'description'] %}
+                {{- render_extra_keys(param_fields, handled_keys) }}
+                {{- '\n</parameter>' }}
+            {%- endfor %}
+        {%- endif %}
+        {%- set handled_keys = ['type', 'properties'] %}
+        {{- render_extra_keys(tool.parameters, handled_keys) }}
+        {{- '\n</parameters>' }}
+        {%- set handled_keys = ['type', 'name', 'description', 'parameters'] %}
+        {{- render_extra_keys(tool, handled_keys) }}
+        {{- '\n</function>' }}
+    {%- endfor %}
+    {{- "\n</tools>" }}
+    {{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }}
+{%- endif %}
+{%- if system_message is defined %}
+    {{- '<|im_end|>\n' }}
+{%- else %}
+    {%- if tools is iterable and tools | length > 0 %}
+        {{- '<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- for message in loop_messages %}
+    {%- if message.role == "assistant" and message.tool_calls is defined and message.tool_calls is iterable and message.tool_calls | length > 0 %}
+        {{- '<|im_start|>' + message.role }}
+        {%- if message.content is defined and message.content is string and message.content | trim | length > 0 %}
+            {{- '\n' + message.content | trim + '\n' }}
+        {%- endif %}
+        {%- for tool_call in message.tool_calls %}
+            {%- if tool_call.function is defined %}
+                {%- set tool_call = tool_call.function %}
+            {%- endif %}
+            {{- '\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
+            {%- if tool_call.arguments is defined %}
+                {%- for args_name, args_value in tool_call.arguments|items %}
+                    {{- '<parameter=' + args_name + '>\n' }}
+                    {%- set args_value = args_value if args_value is string else args_value | tojson | safe %}
+                    {{- args_value }}
+                    {{- '\n</parameter>\n' }}
+                {%- endfor %}
+            {%- endif %}
+            {{- '</function>\n</tool_call>' }}
+        {%- endfor %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "user" or message.role == "system" or message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if loop.previtem and loop.previtem.role != "tool" %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- message.content }}
+        {{- '\n</tool_response>' }}
+        {%- if not loop.last and loop.nextitem.role != "tool" %}
+            {{- '<|im_end|>\n' }}
+        {%- elif loop.last %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- else %}
+        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>\n' }}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}
\ No newline at end of file

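For anyone debugging tool calls by hand, the wire format the template above emits (a `<function=...>` block nested inside `<tool_call>` tags, with non-string argument values JSON-encoded) can be sketched in plain Python. This is just an illustration of the format, not tabbyAPI code; the function name `format_tool_call` is made up:

```python
import json

def format_tool_call(name, arguments):
    # Mirrors the Jinja loop in the patch: each argument becomes a
    # <parameter=...> element, and non-string values go through tojson.
    lines = ["<tool_call>", f"<function={name}>"]
    for key, value in arguments.items():
        if not isinstance(value, str):
            value = json.dumps(value)
        lines += [f"<parameter={key}>", value, "</parameter>"]
    lines += ["</function>", "</tool_call>"]
    return "\n".join(lines)
```

If the model's raw output doesn't match this shape exactly (e.g. text after `</tool_call>`, or a missing `<function=...>` wrapper), the parser on the tabby side won't pick the call up, which is one thing worth checking when a client like Opencode shows no tool activity.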
Sorry guys, I totally missed this topic.

If it's still relevant, you can also use my fork of tabby for tool support with Qwen/GLM/MiniMax models (the last one isn't well tested, tbh).

I've added info about this fork to the readme:
https://huggingface.co/NeuroSenko/Qwen3.5-397B-A17B-exl3#tool-calls-support-for-qwenglm-models

UPDATE: I've got it to work with mratsim's solution (I'd initially missed the custom chat template!).
However, the quant I'm using (Qwen3.5-397B-A17B-exl3 ---> 3.5bpw_opt) behaves oddly: the output contains lots of Chinese characters...