Tool calling
Sorry if I'm asking a stupid question, I haven't checked on exl3 development progress lately, but is tool calling working in known TUIs like Claude Code or Opencode?
And thanks for the quants!
I've seen your and remichu_sm's stance on the tool-calling transition on the ExLlama Discord! Interesting! I will try the PR and report back...
I can confirm tool calls work with #413 and Qwen3.5
Using git clone https://github.com/devnen/tabbyAPI.git -b full-tool-calling-support, but still no luck (in Opencode) with "Qwen3.5-397B-A17B-exl3". Any idea what I'm doing wrong?
This is my Docker build
# Use an official CUDA runtime with Ubuntu as a parent image
FROM nvidia/cuda:12.8.1-runtime-ubuntu24.04
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
curl \
ca-certificates \
python3.12 \
python3-pip \
python3.12-venv \
python3.12-dev \
git \
&& rm -rf /var/lib/apt/lists/*
# Create a virtual environment
RUN python3 -m venv /opt/venv
# Activate the venv and set the PATH
ENV PATH="/opt/venv/bin:$PATH"
# Upgrade pip
RUN pip install --no-cache-dir --upgrade pip
# Set the working directory in the container
WORKDIR /app
# Clone the tabbyAPI repository. Commit 0d1a8ba (the fix for proper reasoning
# support) conflicts with PR #413, so apply a manual patch instead
RUN git clone https://github.com/theroyallab/tabbyAPI.git /app
# RUN git checkout -b app 803ca5c
# Configure git user (required for merge)
RUN git config --global user.email "docker@tabbyapi.local" && \
git config --global user.name "Docker Build"
# Fetch and merge PR #413 - Tool-calling
# conflict resolution '--strategy-option theirs'
# RUN git fetch origin pull/413/head:pr-413 && \
# git merge --strategy-option theirs pr-413
COPY reasoning_tool_call_pr413.patch reasoning_tool_call_pr413.patch
RUN git apply reasoning_tool_call_pr413.patch
# Install packages specified in pyproject.toml cu12, extras
# RUN pip install --no-cache-dir .[cu12,extras]
RUN pip install --no-cache-dir .[cu12]
# Triton needs `apt-get install python3.12-dev` for <Python.h>
RUN pip install triton flash-linear-attention
# causal-conv1d is impossible to compile by itself: it fails in PyTorch's
# cpp_extension.py with a 404 error,
# similar to https://github.com/Dao-AILab/causal-conv1d/issues/4,
# so install the prebuilt wheel instead
RUN pip install https://github.com/Dao-AILab/causal-conv1d/releases/download/v1.6.1.post4/causal_conv1d-1.6.1+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
# Make port 5000 available to the world outside this container
EXPOSE 5000
# Set the entry point
ENTRYPOINT ["python3"]
# Run main.py when the container launches
CMD ["main.py"]
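One way to narrow down where tool calling breaks is to bypass the TUI and POST a raw tool-calling request to the container's /v1/chat/completions endpoint. A minimal sketch of such a request body (the get_weather tool is a made-up example; only the model name is taken from the thread):

```python
import json

# Minimal tool-calling request body for an OpenAI-compatible endpoint.
# The get_weather tool below is a hypothetical example, not from this thread.
payload = {
    "model": "Qwen3.5-397B-A17B-exl3",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",
    "stream": False,
}

body = json.dumps(payload, indent=2)
print(body[:40])
```

If the response's choices[0].message contains a tool_calls array and finish_reason is "tool_calls", the server side works and the problem is in the client configuration.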
{# TabbyAPI Metadata #}
{%- set tool_call_format = "xml" -%}
{%- set tool_start = "<tool_call>" -%}
{%- set tool_end = "</tool_call>" -%}
{%- set stop_strings = ["<|im_start|>", "<|im_end|>"] -%}
{%- set image_count = namespace(value=0) %}
{%- set video_count = namespace(value=0) %}
{%- macro render_content(content, do_vision_count, is_system_content=false) %}
{%- if content is string %}
{{- content }}
{%- elif content is iterable and content is not mapping %}
{%- for item in content %}
{%- if 'image' in item or 'image_url' in item or item.type == 'image' %}
{%- if is_system_content %}
{{- raise_exception('System message cannot contain images.') }}
{%- endif %}
{%- if do_vision_count %}
{%- set image_count.value = image_count.value + 1 %}
{%- endif %}
{%- if add_vision_id %}
{{- 'Picture ' ~ image_count.value ~ ': ' }}
{%- endif %}
{{- '<|vision_start|><|image_pad|><|vision_end|>' }}
{%- elif 'video' in item or item.type == 'video' %}
{%- if is_system_content %}
{{- raise_exception('System message cannot contain videos.') }}
{%- endif %}
{%- if do_vision_count %}
{%- set video_count.value = video_count.value + 1 %}
{%- endif %}
{%- if add_vision_id %}
{{- 'Video ' ~ video_count.value ~ ': ' }}
{%- endif %}
{{- '<|vision_start|><|video_pad|><|vision_end|>' }}
{%- elif 'text' in item %}
{{- item.text }}
{%- else %}
{{- raise_exception('Unexpected item type in content.') }}
{%- endif %}
{%- endfor %}
{%- elif content is none or content is undefined %}
{{- '' }}
{%- else %}
{{- raise_exception('Unexpected content type.') }}
{%- endif %}
{%- endmacro %}
{%- if not messages %}
{{- raise_exception('No messages provided.') }}
{%- endif %}
{%- if tools and tools is iterable and tools is not mapping %}
{{- '<|im_start|>system\n' }}
{{- "# Tools\n\nYou have access to the following functions:\n\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>" }}
{{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }}
{%- if messages[0].role == 'system' %}
{%- set content = render_content(messages[0].content, false, true)|trim %}
{%- if content %}
{{- '\n\n' + content }}
{%- endif %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- else %}
{%- if messages[0].role == 'system' %}
{%- set content = render_content(messages[0].content, false, true)|trim %}
{{- '<|im_start|>system\n' + content + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" %}
{%- set content = render_content(message.content, false)|trim %}
{%- if not(content.startswith('<tool_response>') and content.endswith('</tool_response>')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if ns.multi_step_tool %}
{{- raise_exception('No user query found in messages.') }}
{%- endif %}
{%- for message in messages %}
{%- set content = render_content(message.content, true)|trim %}
{%- if message.role == "system" %}
{%- if not loop.first %}
{{- raise_exception('System message must be at the beginning.') }}
{%- endif %}
{%- elif message.role == "user" %}
{{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{%- set reasoning_content = '' %}
{%- if message.reasoning_content is string %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
{%- set content = content.split('</think>')[-1].lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- set reasoning_content = reasoning_content|trim %}
{%- if loop.index0 > ns.last_query_index %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content + '\n</think>\n\n' + content }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- if message.tool_calls and message.tool_calls is iterable and message.tool_calls is not mapping %}
{%- for tool_call in message.tool_calls %}
{%- if tool_call.function is defined %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{%- if loop.first %}
{%- if content|trim %}
{{- '\n\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
{%- else %}
{{- '<tool_call>\n<function=' + tool_call.name + '>\n' }}
{%- endif %}
{%- else %}
{{- '\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
{%- endif %}
{%- if tool_call.arguments is defined %}
{%- for args_name, args_value in tool_call.arguments|items %}
{{- '<parameter=' + args_name + '>\n' }}
{%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
{{- args_value }}
{{- '\n</parameter>\n' }}
{%- endfor %}
{%- endif %}
{{- '</function>\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.previtem and loop.previtem.role != "tool" %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- content }}
{{- '\n</tool_response>' }}
{%- if not loop.last and loop.nextitem.role != "tool" %}
{{- '<|im_end|>\n' }}
{%- elif loop.last %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- else %}
{{- raise_exception('Unexpected message role.') }}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- if enable_thinking is defined and enable_thinking is false %}
{{- '<think>\n\n</think>\n\n' }}
{%- else %}
{{- '<think>\n' }}
{%- endif %}
{%- endif %}
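The template above instructs the model to wrap calls in <tool_call>/<function=...>/<parameter=...> tags rather than JSON. A rough sketch of how that format parses back into name/arguments pairs (the regexes here are simplified for illustration; the PR's actual parser also tolerates unclosed tags and other edge cases):

```python
import re

# Example model output in the XML tool-call format the template requests.
sample = (
    "<tool_call>\n<function=get_weather>\n"
    "<parameter=city>\nParis\n</parameter>\n"
    "</function>\n</tool_call>"
)

# Simplified patterns for fully closed tags only.
FUNCTION_RE = re.compile(r"<function=(.*?)>(.*?)</function>", re.DOTALL)
PARAMETER_RE = re.compile(r"<parameter=(.*?)>(.*?)</parameter>", re.DOTALL)

calls = []
for name, body in FUNCTION_RE.findall(sample):
    # Strip only the newlines the template format inserts around values
    args = {k.strip(): v.strip("\n") for k, v in PARAMETER_RE.findall(body)}
    calls.append({"name": name.strip(), "arguments": args})

print(calls)  # [{'name': 'get_weather', 'arguments': {'city': 'Paris'}}]
```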
Sampler (using Qwen's recommended settings)
temperature:
override: 0.7
force: false
top_p:
override: 0.8
force: false
top_k:
override: 20
force: false
# min_p:
# override: 0
# force: false
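For context on the override/force fields: override supplies a preset default, while force: true would make it win even over a client-supplied value. A hedged sketch of that resolution logic (illustrative only, not tabbyAPI's actual implementation):

```python
# Sketch of sampler override/force resolution -- illustrative, not tabbyAPI's code.
preset = {
    "temperature": {"override": 0.7, "force": False},
    "top_p": {"override": 0.8, "force": False},
    "top_k": {"override": 20, "force": False},
}

def resolve(preset: dict, request_params: dict) -> dict:
    """Apply preset defaults; a forced override beats a client value."""
    resolved = dict(request_params)
    for key, rule in preset.items():
        if rule["force"] or key not in resolved:
            resolved[key] = rule["override"]
    return resolved

# Client sets temperature explicitly; top_p/top_k fall back to the preset.
print(resolve(preset, {"temperature": 1.0}))
```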
diff --git a/backends/exllamav3/model.py b/backends/exllamav3/model.py
index a6d7968..b49b303 100644
--- a/backends/exllamav3/model.py
+++ b/backends/exllamav3/model.py
@@ -1021,6 +1021,7 @@ class ExllamaV3Container(BaseModelContainer):
max_rq_tokens=self.max_rq_tokens,
filters=grammar_handler.filters,
)
+ self.active_job_ids[request_id] = job
generated_tokens = 0
full_response = ""
@@ -1038,8 +1039,21 @@ class ExllamaV3Container(BaseModelContainer):
if chunk:
chunk_tokens = result.get("token_ids", self.tokenizer.encode(chunk))
full_response += chunk
+
+ # Extract token IDs as a plain list for downstream consumers
if isinstance(chunk_tokens, torch.Tensor):
+ token_id_list = chunk_tokens.flatten().tolist()
generated_tokens += chunk_tokens.size(dim=0)
+ elif isinstance(chunk_tokens, tuple):
+ first = chunk_tokens[0]
+ if isinstance(first, torch.Tensor):
+ token_id_list = first.flatten().tolist()
+ else:
+ token_id_list = list(first)
+ generated_tokens += len(token_id_list)
+ else:
+ token_id_list = list(chunk_tokens)
+ generated_tokens += len(token_id_list)
# Increase penalty range to generated token amount
# TODO:
@@ -1049,6 +1063,7 @@ class ExllamaV3Container(BaseModelContainer):
generation = {
"request_id": request_id,
"text": chunk,
+ "token_ids": token_id_list,
"prompt_tokens": context_len,
"generated_tokens": generated_tokens,
"offset": len(full_response),
@@ -1069,8 +1084,6 @@ class ExllamaV3Container(BaseModelContainer):
yield finish_chunk
break
- # Assign the active job to the request ID
- self.active_job_ids[request_id] = job
except asyncio.CancelledError:
await job.cancel()
diff --git a/common/templating.py b/common/templating.py
index cc0cceb..dda06d8 100644
--- a/common/templating.py
+++ b/common/templating.py
@@ -12,6 +12,7 @@ from jinja2 import Template, TemplateError
from jinja2.ext import loopcontrols
from jinja2.sandbox import ImmutableSandboxedEnvironment
from loguru import logger
+from markupsafe import Markup
from packaging import version
@@ -24,12 +25,17 @@ class TemplateLoadError(Exception):
pass
+VALID_TOOL_CALL_FORMATS = {"json", "xml", "auto"}
+
+
@dataclass
class TemplateMetadata:
"""Represents the parsed metadata from a template."""
stop_strings: List[str] = field(default_factory=list)
tool_start: Optional[str] = None
+ tool_end: Optional[str] = None
+ tool_call_format: str = "json"
class PromptTemplate:
@@ -46,6 +52,22 @@ class PromptTemplate:
)
metadata: Optional[TemplateMetadata] = None
+ @staticmethod
+ def _tojson_compat(value, indent=None, ensure_ascii=True):
+ """Compatibility JSON filter for chat templates.
+
+ Some model templates call ``tojson(ensure_ascii=False)`` while the
+ bundled Jinja filter may not accept that keyword in sandboxed mode.
+ """
+ return Markup(
+ json.dumps(
+ value,
+ indent=indent,
+ ensure_ascii=ensure_ascii,
+ separators=(",", ": "),
+ )
+ )
+
async def extract_metadata(self, template_vars: dict):
"""
Returns deserialized template metadata from a chat template.
@@ -76,6 +98,22 @@ class PromptTemplate:
if isinstance(template_module.tool_start, str):
template_metadata.tool_start = template_module.tool_start
+ if hasattr(template_module, "tool_end"):
+ if isinstance(template_module.tool_end, str):
+ template_metadata.tool_end = template_module.tool_end
+
+ if hasattr(template_module, "tool_call_format"):
+ fmt = template_module.tool_call_format
+ if isinstance(fmt, str) and fmt in VALID_TOOL_CALL_FORMATS:
+ template_metadata.tool_call_format = fmt
+ logger.debug(f"Template tool_call_format: {fmt}")
+ else:
+ logger.warning(
+ f"Invalid tool_call_format '{fmt}' in template, "
+ f"defaulting to 'json'. "
+ f"Valid values: {VALID_TOOL_CALL_FORMATS}"
+ )
+
self.metadata = template_metadata
return template_metadata
@@ -107,6 +145,7 @@ class PromptTemplate:
self.environment.globals["strftime_now"] = strftime_now
self.environment.globals["raise_exception"] = raise_exception
+ self.environment.filters["tojson"] = self._tojson_compat
return self.environment.from_string(template_str)
diff --git a/endpoints/OAI/types/chat_completion.py b/endpoints/OAI/types/chat_completion.py
index ca311dd..05aec35 100644
--- a/endpoints/OAI/types/chat_completion.py
+++ b/endpoints/OAI/types/chat_completion.py
@@ -4,7 +4,7 @@ from typing import Literal, Union, List, Optional, Dict
from uuid import uuid4
from endpoints.OAI.types.common import UsageStats, CommonCompletionRequest
-from endpoints.OAI.types.tools import ToolSpec, ToolCall
+from endpoints.OAI.types.tools import NamedToolChoice, ToolSpec, ToolCall
class ChatCompletionLogprob(BaseModel):
@@ -72,6 +72,10 @@ class ChatCompletionRequest(CommonCompletionRequest):
tools: Optional[List[ToolSpec]] = None
functions: Optional[List[Dict]] = None
+ tool_choice: Optional[
+ Union[Literal["none", "auto", "required"], NamedToolChoice]
+ ] = None
+ parallel_tool_calls: Optional[bool] = True
# Chat completions requests do not have a BOS token preference. Backend
# respects the tokenization config for the individual model.
diff --git a/endpoints/OAI/types/tools.py b/endpoints/OAI/types/tools.py
index b5b9611..1e57266 100644
--- a/endpoints/OAI/types/tools.py
+++ b/endpoints/OAI/types/tools.py
@@ -1,5 +1,5 @@
from pydantic import BaseModel, Field
-from typing import Dict, Literal
+from typing import Dict, Literal, Optional
from uuid import uuid4
@@ -28,8 +28,28 @@ class Tool(BaseModel):
class ToolCall(BaseModel):
- """Represents an OAI tool description."""
+ """Represents an OAI tool call.
+
+ The ``index`` field is optional so it can be omitted in non-streaming
+ responses (where OpenAI does not include it) via ``exclude_none=True``,
+ while being set explicitly for streaming deltas where it is required
+ by strict validators like the Vercel AI SDK.
+ """
- id: str = Field(default_factory=lambda: str(uuid4()).replace("-", "")[:9])
+ id: str = Field(default_factory=lambda: f"call_{uuid4().hex[:24]}")
function: Tool
type: Literal["function"] = "function"
+ index: Optional[int] = None
+
+
+class NamedToolFunction(BaseModel):
+ """Represents a named function reference for tool_choice."""
+
+ name: str
+
+
+class NamedToolChoice(BaseModel):
+ """Represents a named tool choice (forces a specific function call)."""
+
+ function: NamedToolFunction
+ type: Literal["function"] = "function"
diff --git a/endpoints/OAI/utils/chat_completion.py b/endpoints/OAI/utils/chat_completion.py
index fee51a6..a95d00d 100644
--- a/endpoints/OAI/utils/chat_completion.py
+++ b/endpoints/OAI/utils/chat_completion.py
@@ -1,6 +1,7 @@
"""Chat completion utilities for OAI server."""
import asyncio
+import json
import pathlib
from asyncio import CancelledError
from typing import List, Optional
@@ -28,6 +29,7 @@ from endpoints.OAI.types.chat_completion import (
ChatCompletionStreamChoice,
)
from endpoints.OAI.types.common import UsageStats
+from endpoints.OAI.types.tools import NamedToolChoice, ToolCall
from endpoints.OAI.utils.completion import _parse_gen_request_id, _stream_collector
from endpoints.OAI.utils.tools import ToolCallProcessor, TOOL_CALL_SCHEMA
@@ -65,9 +67,27 @@ def _start_in_reasoning_mode(prompt: str) -> bool:
num_end_tokens = prompt.count(model.container.reasoning_end_token)
return num_start_tokens == num_end_tokens + 1
+def _serialize_stream_chunk(chunk) -> str:
+ """Serialize a streaming chunk with OpenAI-compatible field handling.
+
+ Uses exclude_none=True to strip irrelevant null fields (tool_calls,
+ tool_call_id, logprobs, usage) while ensuring finish_reason is always
+ present on each choice (as null when not set), matching OpenAI's
+ observed streaming behavior.
+ """
+ d = chunk.model_dump(exclude_none=True)
+ for choice in d.get("choices", []):
+ if "finish_reason" not in choice:
+ choice["finish_reason"] = None
+ return json.dumps(d, ensure_ascii=False)
+
def _create_response(
- request_id: str, generations: List[dict], model_name: Optional[str]
+ request_id: str,
+ generations: List[dict],
+ model_name: Optional[str],
+ tool_call_format: str = "json",
+ tool_choice=None,
):
"""Create a chat completion response from the provided text."""
@@ -84,9 +104,39 @@ def _create_response(
role="assistant", content=unwrap(generation.get("text"), "")
)
- tool_calls = generation["tool_calls"]
- if tool_calls:
- message.tool_calls = ToolCallProcessor.from_json(tool_calls)
+ tool_calls_raw = generation.get("tool_calls")
+ if tool_calls_raw:
+ parsed = ToolCallProcessor.parse(tool_calls_raw, format=tool_call_format)
+ if parsed and isinstance(tool_choice, NamedToolChoice):
+ parsed = ToolCallProcessor.filter_by_name(
+ parsed, tool_choice.function.name
+ )
+ if parsed:
+ message.tool_calls = parsed
+ else:
+ logger.warning(
+ "Tool call text present but parsing returned no results "
+ f"(format={tool_call_format})"
+ )
+
+ # Fallback: detect bare XML tool calls in content that were not
+ # caught by the two-pass system (model never emitted tool_start)
+ if (
+ tool_call_format in ("xml", "auto")
+ and not message.tool_calls
+ and message.content
+ and "<function=" in message.content
+ ):
+ logger.warning(
+ "Fallback: Detected bare XML function blocks in content "
+ "(tool_start was likely not emitted by model)"
+ )
+ remaining, parsed = ToolCallProcessor.extract_content_and_tools(
+ message.content
+ )
+ if parsed:
+ message.tool_calls = parsed
+ message.content = remaining if remaining else None
logprob_response = None
@@ -157,7 +207,12 @@ def _create_stream_chunk(
is_usage_chunk: bool = False,
is_reasoning_chunk: bool = False,
):
- """Create a chat completion stream chunk from the provided text."""
+ """Create a chat completion stream chunk from the provided text.
+
+ Note: Tool-call streaming is handled separately by
+ _build_tool_call_chunks() which emits the proper three-phase
+ OpenAI-standard chunk sequence.
+ """
index = generation.get("index")
choices = []
@@ -178,20 +233,10 @@ def _create_stream_chunk(
total_time=generation.get("total_time"),
)
elif "finish_reason" in generation:
- # Get the finish reason from the generation
finish_reason = generation.get("finish_reason")
- choice = ChatCompletionStreamChoice(index=index, finish_reason=finish_reason)
-
- # lets check if we have tool calls since we are at the end of the generation
- # Mark finish_reason as tool_calls since this is the last chunk
- if "tool_calls" in generation:
- tool_calls = generation["tool_calls"]
- message = ChatCompletionMessage(
- tool_calls=ToolCallProcessor.from_json(tool_calls)
- )
- choice.delta = message
- choice.finish_reason = "tool_calls"
-
+ choice = ChatCompletionStreamChoice(
+ index=index, finish_reason=finish_reason, delta={}
+ )
choices.append(choice)
else:
message = (
@@ -241,6 +286,68 @@ def _create_stream_chunk(
return chunk
+def _build_tool_call_chunks(
+ tool_calls: List[ToolCall],
+ request_id: str,
+ model_name: str,
+) -> List[ChatCompletionStreamChunk]:
+ """Build the OpenAI-standard streaming sequence for tool calls.
+
+ Emits two chunks:
+ 1. Tool-call chunk: role="assistant", complete tool_calls with
+ index/id/type/name/arguments (all data in one chunk).
+ 2. Finish chunk: empty delta, finish_reason="tool_calls".
+
+ Complete arguments are sent in a single chunk rather than streamed
+ incrementally, which is valid per OpenAI's spec (clients concatenate
+ argument strings across deltas) and maximizes compatibility with
+ clients that may not implement multi-chunk tool-call assembly.
+
+ The tool_calls are placed directly into a ChatCompletionMessage
+ (not a raw dict) so Pydantic validates them as ToolCall objects
+ with the index field preserved (ToolCall declares index as Optional[int]).
+ """
+ chunk_id = f"chatcmpl-{request_id}"
+
+ # Set index on each tool call for streaming
+ for idx, tc in enumerate(tool_calls):
+ tc.index = idx
+
+ # Chunk 1: Complete tool call data
+ tool_call_message = ChatCompletionMessage(
+ role="assistant",
+ tool_calls=tool_calls,
+ )
+ tool_chunk = ChatCompletionStreamChunk(
+ id=chunk_id,
+ choices=[
+ ChatCompletionStreamChoice(
+ index=0,
+ delta=tool_call_message,
+ finish_reason=None,
+ )
+ ],
+ model=model_name,
+ )
+
+ # Chunk 2: Finish signal
+ # Use model_construct to prevent Pydantic's smart Union from
+ # coercing the empty dict {} into ChatCompletionMessage(role="user")
+ finish_choice = ChatCompletionStreamChoice.model_construct(
+ index=0,
+ delta={},
+ finish_reason="tool_calls",
+ logprobs=None,
+ )
+ finish_chunk = ChatCompletionStreamChunk(
+ id=chunk_id,
+ choices=[finish_choice],
+ model=model_name,
+ )
+
+ return [tool_chunk, finish_chunk]
+
+
async def _append_template_metadata(data: ChatCompletionRequest, template_vars: dict):
"""Adding metadata is a one-time process."""
@@ -285,6 +392,24 @@ async def format_messages_with_template(
message_dicts.append(message.model_dump(exclude_none=True))
+ # Pre-template: convert tool_call arguments from JSON strings to dicts.
+ # OpenAI-compatible clients (Kilo, Roo, etc.) send arguments as JSON
+ # strings per the OAI spec, but Qwen3-Coder's template calls
+ # .items() on arguments which requires a dict/mapping.
+ for msg in message_dicts:
+ if msg.get("tool_calls"):
+ for tc in msg["tool_calls"]:
+ func = tc.get("function", {})
+ args = func.get("arguments")
+ if isinstance(args, str):
+ try:
+ func["arguments"] = json.loads(args)
+ except (json.JSONDecodeError, ValueError):
+ logger.warning(
+ "Failed to parse tool_call arguments JSON "
+ "string to dict, keeping as string"
+ )
+
# Get all special tokens
special_tokens_dict = model.container.get_special_tokens()
@@ -367,6 +492,7 @@ async def stream_generate_chat_completion(
gen_queue = asyncio.Queue()
gen_tasks: List[asyncio.Task] = []
tool_start = model.container.prompt_template.metadata.tool_start
+ tool_call_format = model.container.prompt_template.metadata.tool_call_format
disconnect_task = asyncio.create_task(request_disconnect_loop(request))
try:
@@ -401,13 +527,26 @@ async def stream_generate_chat_completion(
# Consumer loop
while True:
+ # Fast path: items already queued - no task overhead
+ if not gen_queue.empty():
+ generation = gen_queue.get_nowait()
+ else:
+ # Slow path: queue empty - race get against disconnect
+ get_task = asyncio.create_task(gen_queue.get())
+ done, _ = await asyncio.wait(
+ [get_task, disconnect_task],
+ return_when=asyncio.FIRST_COMPLETED,
+ )
+ if disconnect_task in done:
+ get_task.cancel()
+ raise CancelledError()
+ generation = get_task.result()
+
if disconnect_task.done():
raise CancelledError()
- generation = await gen_queue.get()
-
# Handle options if a tool model is present
- if tool_start:
+ if tool_start and data.tool_choice != "none":
if "stop_str" in generation:
generations = await generate_tool_calls(
prompt,
@@ -419,6 +558,50 @@ async def stream_generate_chat_completion(
# Only one generation present in this case
generation = generations[0]
+
+ # Emit proper three-phase tool-call streaming sequence
+ if "tool_calls" in generation:
+ tool_calls_raw = generation["tool_calls"]
+ parsed = ToolCallProcessor.parse(
+ tool_calls_raw, format=tool_call_format
+ )
+ if parsed and isinstance(data.tool_choice, NamedToolChoice):
+ parsed = ToolCallProcessor.filter_by_name(
+ parsed, data.tool_choice.function.name
+ )
+ if parsed:
+ for tc_chunk in _build_tool_call_chunks(
+ parsed,
+ request.state.id,
+ model_path.name,
+ ):
+ yield _serialize_stream_chunk(tc_chunk)
+
+ # Handle completion and usage after tool calls
+ if (
+ all(task.done() for task in gen_tasks)
+ and gen_queue.empty()
+ ):
+ if (
+ data.stream_options
+ and data.stream_options.include_usage
+ ):
+ usage_chunk = _create_stream_chunk(
+ request.state.id,
+ generation,
+ model_path.name,
+ is_usage_chunk=True,
+ )
+ yield _serialize_stream_chunk(usage_chunk)
+
+ logger.info(
+ "Finished chat completion streaming "
+ f"request {request.state.id}"
+ )
+ yield "[DONE]"
+ break
+ continue
+
elif "text" in generation:
current_generation_text += generation["text"]
@@ -445,7 +628,7 @@ async def stream_generate_chat_completion(
model_path.name,
is_reasoning_chunk=is_reasoning_chunk,
)
- yield response.model_dump_json()
+ yield _serialize_stream_chunk(response)
# Check if all tasks are completed
if all(task.done() for task in gen_tasks) and gen_queue.empty():
@@ -457,7 +640,7 @@ async def stream_generate_chat_completion(
model_path.name,
is_usage_chunk=True,
)
- yield usage_chunk.model_dump_json()
+ yield _serialize_stream_chunk(usage_chunk)
logger.info(
f"Finished chat completion streaming request {request.state.id}"
@@ -468,13 +651,14 @@ async def stream_generate_chat_completion(
except CancelledError:
# Get out if the request gets disconnected
- if not abort_event.is_set():
- abort_event.set()
- handle_request_disconnect("Chat completion generation cancelled by user.")
+ handle_request_disconnect("Chat completion generation cancelled by user.")
except Exception:
yield get_generator_error(
"Chat completion aborted. Please check the server console."
)
+ finally:
+ abort_event.set()
+ disconnect_task.cancel()
async def generate_chat_completion(
@@ -486,6 +670,7 @@ async def generate_chat_completion(
):
gen_tasks: List[asyncio.Task] = []
tool_start = model.container.prompt_template.metadata.tool_start
+ tool_call_format = model.container.prompt_template.metadata.tool_call_format
try:
logger.info(f"Received chat completion request {request.state.id}")
@@ -507,12 +692,21 @@ async def generate_chat_completion(
generations = await asyncio.gather(*gen_tasks)
# Check all the generations and see if a tool call is required
- if tool_start:
+ force_tool_pass = data.tool_choice == "required" or isinstance(
+ data.tool_choice, NamedToolChoice
+ )
+ if tool_start or force_tool_pass:
generations = await generate_tool_calls(
prompt, embeddings, data, generations, request
)
- response = _create_response(request.state.id, generations, model_path.name)
+ response = _create_response(
+ request.state.id,
+ generations,
+ model_path.name,
+ tool_call_format=tool_call_format,
+ tool_choice=data.tool_choice,
+ )
logger.info(f"Finished chat completion request {request.state.id}")
@@ -537,24 +731,72 @@ async def generate_tool_calls(
):
gen_tasks: List[asyncio.Task] = []
tool_start = model.container.prompt_template.metadata.tool_start
+ tool_call_format = model.container.prompt_template.metadata.tool_call_format
+ tool_choice = data.tool_choice
+
+ if tool_choice == "none":
+ return generations
# Tracks which generations asked for a tool call
tool_idx: List[int] = []
# Copy to make sure the parent JSON schema doesn't get modified
tool_data = data.model_copy(deep=True)
- tool_data.json_schema = TOOL_CALL_SCHEMA
+
+ if tool_call_format in ("xml", "auto"):
+ # XML / auto mode: let the model generate its natural output
+ # without JSON schema constraint
+ logger.debug(
+ f"generate_tool_calls: Using '{tool_call_format}' mode "
+ f"(no JSON schema constraint)"
+ )
+
+ # Remove tool_start from stop strings so the model can emit
+ # multiple sequential <tool_call> blocks without stopping early
+ if (
+ tool_start
+ and isinstance(tool_data.stop, list)
+ and tool_start in tool_data.stop
+ ):
+ tool_data.stop = [s for s in tool_data.stop if s != tool_start]
+ logger.debug(
+ f"generate_tool_calls: Removed '{tool_start}' from "
+ f"second-pass stop strings"
+ )
+ else:
+ # JSON mode: constrained generation (existing behavior)
+ tool_data.json_schema = TOOL_CALL_SCHEMA
for idx, gen in enumerate(generations):
- if gen["stop_str"] != tool_start:
+ stop_str = gen.get("stop_str")
+ should_generate = stop_str == tool_start
+
+ # Force tool generation if tool_choice requires it
+ if not should_generate and (
+ tool_choice == "required" or isinstance(tool_choice, NamedToolChoice)
+ ):
+ should_generate = True
+
+ if not should_generate:
continue
- logger.info(f"Detected tool call in chat completion request {request.state.id}")
+ logger.info(
+ f"Detected tool call in chat completion request "
+ f"{request.state.id} (format={tool_call_format})"
+ )
- # Append the existing generation text if present
+ # Build per-generation prompt (avoid mutating shared prompt)
+ tool_prompt = prompt
precursor_text = gen.get("full_text")
if precursor_text:
- prompt = prompt + precursor_text
+ tool_prompt = tool_prompt + precursor_text
+
+ # For XML/auto mode: append tool_start back to prompt.
+ # The stop string was consumed by the first pass and not included
+ # in full_text, but the model expects to continue after <tool_call>.
+ # Include a trailing newline to match the canonical template format.
+ if tool_call_format in ("xml", "auto") and tool_start:
+ tool_prompt = tool_prompt + tool_start + "\n"
gen_request_id = gen.get("request_id")
tool_request_id = f"{gen_request_id}-tool"
@@ -563,7 +805,7 @@ async def generate_tool_calls(
asyncio.create_task(
model.container.generate(
tool_request_id,
- prompt,
+ tool_prompt,
tool_data,
mm_embeddings=embeddings,
)
@@ -577,6 +819,12 @@ async def generate_tool_calls(
# Map tool calls to their appropriate generation
for gen_idx, tool_call in zip(tool_idx, tool_calls, strict=True):
- generations[gen_idx]["tool_calls"] = tool_call["text"]
+ raw_text = tool_call["text"]
+
+ if tool_call_format in ("xml", "auto"):
+ # Prepend tool_start to reconstruct complete XML for parser
+ raw_text = tool_start + "\n" + raw_text
+
+ generations[gen_idx]["tool_calls"] = raw_text
return generations
diff --git a/endpoints/OAI/utils/completion.py b/endpoints/OAI/utils/completion.py
index f66d381..c11a25b 100644
--- a/endpoints/OAI/utils/completion.py
+++ b/endpoints/OAI/utils/completion.py
@@ -225,11 +225,24 @@ async def stream_generate_completion(
# Consumer loop
while True:
+ # Fast path: items already queued - no task overhead
+ if not gen_queue.empty():
+ generation = gen_queue.get_nowait()
+ else:
+ # Slow path: queue empty - race get against disconnect
+ get_task = asyncio.create_task(gen_queue.get())
+ done, _ = await asyncio.wait(
+ [get_task, disconnect_task],
+ return_when=asyncio.FIRST_COMPLETED,
+ )
+ if disconnect_task in done:
+ get_task.cancel()
+ raise CancelledError()
+ generation = get_task.result()
+
if disconnect_task.done():
raise CancelledError()
- generation = await gen_queue.get()
-
# Stream collector will push an exception to the queue if it fails
if isinstance(generation, Exception):
raise generation
@@ -245,15 +258,16 @@ async def stream_generate_completion(
except CancelledError:
# Get out if the request gets disconnected
- if not abort_event.is_set():
- abort_event.set()
- handle_request_disconnect(
- f"Completion generation {request.state.id} cancelled by user."
- )
+ handle_request_disconnect(
+ f"Completion generation {request.state.id} cancelled by user."
+ )
except Exception:
yield get_generator_error(
f"Completion {request.state.id} aborted. Please check the server console."
)
+ finally:
+ abort_event.set()
+ disconnect_task.cancel()
async def generate_completion(
diff --git a/endpoints/OAI/utils/tools.py b/endpoints/OAI/utils/tools.py
index c1ebded..05eaf14 100644
--- a/endpoints/OAI/utils/tools.py
+++ b/endpoints/OAI/utils/tools.py
@@ -1,8 +1,11 @@
+"""Tool call processing utilities for OAI server."""
+
import json
+import re
from loguru import logger
-from typing import List
+from typing import Any, List, Tuple
-from endpoints.OAI.types.tools import ToolCall
+from endpoints.OAI.types.tools import ToolCall, Tool
TOOL_CALL_SCHEMA = {
@@ -27,24 +30,480 @@ TOOL_CALL_SCHEMA = {
},
}
+# ---------------------------------------------------------------------------
+# XML parsing regex patterns
+# Derived from vLLM's Qwen3CoderToolParser and the official Qwen parser.
+# These handle both complete and partially-closed tags.
+# ---------------------------------------------------------------------------
+
+# Matches complete <tool_call>...</tool_call> blocks
+TOOL_CALL_BLOCK_RE = re.compile(
+ r"<tool_call>(.*?)</tool_call>",
+ re.DOTALL,
+)
+
+# Matches <function=NAME>BODY</function> blocks
+FUNCTION_RE = re.compile(
+ r"<function=(.*?)>(.*?)</function>",
+ re.DOTALL,
+)
+
+# Matches <parameter=KEY>VALUE</terminator>
+# Terminates on: </parameter>, next <parameter=, </function>, or <tool_call>
+PARAMETER_RE = re.compile(
+ r"<parameter=(.*?)>(.*?)"
+ r"(?:</parameter>|(?=<parameter=)|(?=</function>)|(?=<tool_call>))",
+ re.DOTALL,
+)
+
+# Think block patterns
+THINK_BLOCK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
+THINK_UNCLOSED_RE = re.compile(r"<think>(?!.*</think>).*$", re.DOTALL)
+
+# Markdown code fence patterns
+CODE_FENCE_RE = re.compile(r"^```(?:json)?\s*", re.MULTILINE)
+CODE_FENCE_END_RE = re.compile(r"\s*```\s*$", re.MULTILINE)
+
+
+def _strip_think_blocks(text: str) -> str:
+ """Strip <think>...</think> blocks from text.
+
+ Handles both complete and unclosed blocks (quantization can cause
+ the model to never close a think tag).
+ """
+ original = text
+
+ # Complete blocks first
+ text = THINK_BLOCK_RE.sub("", text)
+
+ # Unclosed block (think started but never closed → strip to end)
+ text = THINK_UNCLOSED_RE.sub("", text)
+
+ if text != original:
+ if THINK_UNCLOSED_RE.search(original):
+ logger.warning(
+ "XML Parser: Stripped unclosed <think> block "
+ "(possible quantization degradation)"
+ )
+ else:
+ logger.debug("XML Parser: Stripped <think> block(s) from output")
+
+ return text
+
+
+def _coerce_param_value(raw: str) -> Any:
+ """Coerce a raw parameter value string to the appropriate Python type.
+
+ Strategy (safe, no eval()):
+ 1. Strip leading/trailing newlines (official template emits \\n
+ after opening tag and before closing tag).
+ 2. Try json.loads → handles objects, arrays, numbers, bools, null.
+ 3. Fall back to plain string.
+ """
+ # Strip template-inserted newlines around values
+ if raw.startswith("\n"):
+ raw = raw[1:]
+ if raw.endswith("\n"):
+ raw = raw[:-1]
+
+ stripped = raw.strip()
+
+ # Empty string
+ if not stripped:
+ return ""
+
+ # Try JSON parse (handles objects, arrays, numbers, booleans, null)
+ try:
+ return json.loads(stripped)
+ except (json.JSONDecodeError, ValueError):
+ pass
+
+ # Fall back to string → never eval()
+ return stripped
+
class ToolCallProcessor:
+
+ # ------------------------------------------------------------------
+ # JSON normalization helpers
+ # ------------------------------------------------------------------
+
+ @staticmethod
+ def _normalize_tool_calls(raw) -> list:
+ """Normalize model-emitted tool call payloads into OAI-like objects.
+
+ Accepted forms:
+ - [{"type":"function","function":{"name":...,"arguments":{...}}}]
+ - [{"name":...,"arguments":{...}}]
+ - {"name":...,"arguments":{...}}
+ """
+ if isinstance(raw, dict):
+ raw = [raw]
+ if not isinstance(raw, list):
+ raise ValueError("tool_calls payload is not list/dict")
+
+ normalized: list = []
+ for item in raw:
+ if not isinstance(item, dict):
+ continue
+
+ if "function" in item and isinstance(item["function"], dict):
+ fn = item["function"]
+ name = fn.get("name")
+ arguments = fn.get("arguments", {})
+ else:
+ name = item.get("name")
+ arguments = item.get("arguments", {})
+
+ if name is None:
+ continue
+
+ if isinstance(arguments, str):
+ try:
+ arguments = json.loads(arguments)
+ except json.JSONDecodeError:
+ arguments = {"input": arguments}
+
+ normalized.append(
+ {
+ "type": "function",
+ "function": {
+ "name": name,
+ "arguments": arguments if isinstance(arguments, dict) else {},
+ },
+ }
+ )
+ return normalized
+
+ @staticmethod
+ def _safe_json_loads(payload: str) -> list:
+ """Best-effort JSON parse for model-emitted tool payloads.
+
+ Handles: clean JSON, markdown-fenced JSON, JSON substrings in
+ surrounding text, flat {name, arguments} dicts, and single objects.
+ """
+ # Direct parse
+ try:
+ return ToolCallProcessor._normalize_tool_calls(json.loads(payload))
+ except (json.JSONDecodeError, ValueError):
+ pass
+
+ # Clean up common model artifacts (markdown fences, whitespace)
+ cleaned = payload.strip()
+ cleaned = CODE_FENCE_RE.sub("", cleaned)
+ cleaned = CODE_FENCE_END_RE.sub("", cleaned)
+ cleaned = cleaned.strip()
+
+ # Try cleaned
+ try:
+ return ToolCallProcessor._normalize_tool_calls(json.loads(cleaned))
+ except (json.JSONDecodeError, ValueError):
+ pass
+
+ # Find JSON array substring
+ start = cleaned.find("[")
+ end = cleaned.rfind("]")
+ if start != -1 and end != -1 and end > start:
+ try:
+ return ToolCallProcessor._normalize_tool_calls(
+ json.loads(cleaned[start : end + 1])
+ )
+ except (json.JSONDecodeError, ValueError):
+ pass
+
+ # Find JSON object substring
+ obj_start = cleaned.find("{")
+ obj_end = cleaned.rfind("}")
+ if obj_start != -1 and obj_end != -1 and obj_end > obj_start:
+ try:
+ return ToolCallProcessor._normalize_tool_calls(
+ json.loads(cleaned[obj_start : obj_end + 1])
+ )
+ except (json.JSONDecodeError, ValueError):
+ pass
+
+ raise json.JSONDecodeError(
+ "Could not extract valid JSON from payload", payload, 0
+ )
+
+ # ------------------------------------------------------------------
+ # JSON parsing
+ # ------------------------------------------------------------------
+
@staticmethod
def from_json(tool_calls_str: str) -> List[ToolCall]:
- """Postprocess tool call JSON to a parseable class"""
+ """Postprocess tool call JSON to a parseable class.
- tool_calls = json.loads(tool_calls_str)
+ Handles clean JSON arrays, markdown-fenced output, flat dicts,
+ and other common model output variations via _safe_json_loads.
+ """
+ logger.debug(f"JSON Parser: Parsing tool calls ({len(tool_calls_str)} chars)")
+
+ tool_calls = ToolCallProcessor._safe_json_loads(tool_calls_str)
for tool_call in tool_calls:
tool_call["function"]["arguments"] = json.dumps(
tool_call["function"]["arguments"]
)
- return [ToolCall(**tool_call) for tool_call in tool_calls]
+ result = [ToolCall(**tool_call) for tool_call in tool_calls]
+ logger.debug(f"JSON Parser: Successfully parsed {len(result)} tool call(s)")
+ return result
+
+ # ------------------------------------------------------------------
+ # XML parsing (Qwen3-Coder / GLM-4.5 style)
+ # ------------------------------------------------------------------
@staticmethod
- def dump(tool_calls: List[ToolCall]) -> List[dict]:
+ def from_xml(raw_text: str) -> List[ToolCall]:
+ """Parse Qwen3-Coder XML-format tool calls into ToolCall objects.
+
+ Handles:
+ - Wrapped: <tool_call><function=name>...</function></tool_call>
+ - Bare: <function=name>...</function> (missing wrapper)
+ - Multiple sequential tool call blocks
+ - <think> blocks (stripped)
+ - Multi-line parameter values
+ - Missing </parameter> closing tags
+ """
+ logger.debug(f"XML Parser: Parsing tool calls ({len(raw_text)} chars)")
+
+ # Stage 1: Strip think blocks
+ text = _strip_think_blocks(raw_text)
+
+ # Stage 2: Check for incomplete XML at end (generation cutoff)
+ stripped_end = text.rstrip()
+ if stripped_end.endswith(("<", "</", "<parameter", "<function")):
+ logger.warning(
+ f"XML Parser: Detected incomplete XML tag at end: "
+ f"...{stripped_end[-80:]}"
+ )
+ text = re.sub(r"<[^>]*$", "", text)
+
+ # Stage 3: Extract function blocks
+ # First, find all wrapped <tool_call>...</tool_call> blocks
+ wrapped_positions = [
+ (m.start(), m.end()) for m in TOOL_CALL_BLOCK_RE.finditer(text)
+ ]
+
+ # Collect function blocks from inside wrapped regions
+ function_blocks = []
+ for match in TOOL_CALL_BLOCK_RE.finditer(text):
+ inner = match.group(1)
+ for func_match in FUNCTION_RE.finditer(inner):
+ function_blocks.append((func_match.group(1), func_match.group(2)))
+
+ # Find bare <function> blocks NOT inside any wrapped region
+ for func_match in FUNCTION_RE.finditer(text):
+ pos = func_match.start()
+ is_wrapped = any(start <= pos < end for start, end in wrapped_positions)
+ if not is_wrapped:
+ logger.debug(
+ "XML Parser: Found bare <function> block without "
+ "<tool_call> wrapper"
+ )
+ function_blocks.append((func_match.group(1), func_match.group(2)))
+
+ if not function_blocks:
+ logger.warning("XML Parser: No <function=...> blocks found")
+ return []
+
+ # Stage 4: Parse each function block into a ToolCall
+ tool_calls = []
+ for func_name_raw, func_body in function_blocks:
+ func_name = func_name_raw.strip()
+
+ # Extract parameters
+ params = {}
+ for param_match in PARAMETER_RE.finditer(func_body):
+ key = param_match.group(1).strip()
+ value_raw = param_match.group(2)
+ value = _coerce_param_value(value_raw)
+ params[key] = value
+
+ arguments_json = json.dumps(params, ensure_ascii=False)
+
+ tool_call = ToolCall(
+ function=Tool(name=func_name, arguments=arguments_json)
+ )
+ tool_calls.append(tool_call)
+
+ logger.debug(f"XML Parser: Successfully parsed {len(tool_calls)} tool call(s)")
+ return tool_calls
+
+ # ------------------------------------------------------------------
+ # Auto-detect parsing (JSON → JSON-in-tool_call → XML)
+ # ------------------------------------------------------------------
+
+ @staticmethod
+ def from_auto(raw_text: str) -> List[ToolCall]:
+ """Auto-detect format and parse.
+
+ Tries in order:
+ 1. Pure JSON (standard TabbyAPI / Llama)
+ 2. JSON inside <tool_call> wrappers (Qwen3-Instruct style)
+ 3. XML with <function=...> tags (Qwen3-Coder style)
"""
- Convert ToolCall objects to a list of dictionaries.
+ logger.debug("Auto Parser: Attempting format auto-detection")
+
+ # Attempt 1: Pure JSON array
+ try:
+ result = ToolCallProcessor.from_json(raw_text)
+ logger.debug("Auto Parser: Detected JSON format")
+ return result
+ except (json.JSONDecodeError, ValueError, KeyError) as e:
+ logger.debug(f"Auto Parser: Not JSON ({e}), trying next format")
+
+ # Attempt 2: JSON inside <tool_call> wrappers (Qwen3-Instruct)
+ try:
+ all_tool_calls = []
+ for match in TOOL_CALL_BLOCK_RE.finditer(raw_text):
+ inner = match.group(1).strip()
+ if inner.startswith("{") or inner.startswith("["):
+ parsed = json.loads(inner)
+ if isinstance(parsed, dict):
+ parsed = [parsed]
+ if isinstance(parsed, list):
+ for tc in parsed:
+ name = tc.get("name", "")
+ arguments = tc.get("arguments", {})
+ if isinstance(arguments, dict):
+ arguments = json.dumps(arguments)
+ elif not isinstance(arguments, str):
+ arguments = json.dumps(arguments)
+ all_tool_calls.append(
+ ToolCall(function=Tool(name=name, arguments=arguments))
+ )
+ if all_tool_calls:
+ logger.debug(
+ "Auto Parser: Detected JSON-inside-tool_call "
+ f"format ({len(all_tool_calls)} call(s))"
+ )
+ return all_tool_calls
+ except (json.JSONDecodeError, ValueError, KeyError) as e:
+ logger.debug(f"Auto Parser: Not JSON-in-tool_call ({e}), trying XML")
+
+ # Attempt 3: XML format (Qwen3-Coder style)
+ result = ToolCallProcessor.from_xml(raw_text)
+ if result:
+ logger.debug("Auto Parser: Detected XML format")
+ else:
+ logger.warning("Auto Parser: All format detection attempts failed")
+ return result
+
+ # ------------------------------------------------------------------
+ # Dispatcher
+ # ------------------------------------------------------------------
+
+ @staticmethod
+ def parse(tool_calls_str: str, format: str = "json") -> List[ToolCall]:
+ """Dispatch tool call parsing to the appropriate format handler.
+
+ Args:
+ tool_calls_str: Raw tool call text from model generation.
+ format: One of ``"json"``, ``"xml"``, ``"auto"``.
+
+ Returns:
+ List of parsed ToolCall objects. Empty list on parse failure
+ (never raises).
+ """
+ try:
+ if format == "xml":
+ return ToolCallProcessor.from_xml(tool_calls_str)
+ elif format == "auto":
+ return ToolCallProcessor.from_auto(tool_calls_str)
+ else:
+ return ToolCallProcessor.from_json(tool_calls_str)
+ except Exception as e:
+ logger.error(
+ f"ToolCallProcessor.parse: Failed to parse tool calls "
+ f"(format={format}): {e}"
+ )
+ return []
+
+ # ------------------------------------------------------------------
+ # Filtering
+ # ------------------------------------------------------------------
+
+ @staticmethod
+ def filter_by_name(
+ tool_calls: List[ToolCall], function_name: str
+ ) -> List[ToolCall]:
+ """Filter parsed tool calls to only those matching a function name."""
+ filtered = [tc for tc in tool_calls if tc.function.name == function_name]
+ if not filtered:
+ logger.warning(
+ f"filter_by_name: No tool calls matched '{function_name}' "
+ f"(had {len(tool_calls)} call(s))"
+ )
+ return filtered
+
+ # ------------------------------------------------------------------
+ # Content / tool-call separation
+ # ------------------------------------------------------------------
+
+ @staticmethod
+ def extract_content_and_tools(
+ raw_text: str,
+ ) -> Tuple[str, List[ToolCall]]:
+ """Separate plain text content from XML tool call blocks.
+
+ Used when the model mixes reasoning text with tool calls, e.g.:
+ ``"I'll help with that: <tool_call><function=...>...``
+
+ Returns:
+ Tuple of (remaining_content, tool_calls).
+ """
+ text = _strip_think_blocks(raw_text)
+
+ # Collect all XML regions to exclude from content
+ xml_regions = []
+
+ # Wrapped tool call blocks
+ for match in TOOL_CALL_BLOCK_RE.finditer(text):
+ xml_regions.append((match.start(), match.end()))
+
+ # Bare function blocks not inside wrappers
+ for match in FUNCTION_RE.finditer(text):
+ pos = match.start()
+ is_wrapped = any(start <= pos < end for start, end in xml_regions)
+ if not is_wrapped:
+ xml_regions.append((match.start(), match.end()))
+
+ # Sort and extract content (everything outside XML regions)
+ xml_regions.sort()
+ content_parts = []
+ last_end = 0
+ for start, end in xml_regions:
+ if start > last_end:
+ part = text[last_end:start].strip()
+ if part:
+ content_parts.append(part)
+ last_end = end
+ if last_end < len(text):
+ part = text[last_end:].strip()
+ if part:
+ content_parts.append(part)
+
+ content = " ".join(content_parts).strip()
+
+ # Parse tool calls from the full text
+ tool_calls = ToolCallProcessor.from_xml(text)
+
+ logger.debug(
+ f"extract_content_and_tools: Found {len(tool_calls)} tool "
+ f"call(s), content={'yes' if content else 'no'} "
+ f"({len(content)} chars)"
+ )
+
+ return content, tool_calls
+
+ # ------------------------------------------------------------------
+ # Serialisation helpers (unchanged from original)
+ # ------------------------------------------------------------------
+
+ @staticmethod
+ def dump(tool_calls: List[ToolCall]) -> List[dict]:
+ """Convert ToolCall objects to a list of dictionaries.
Args:
tool_calls (List[ToolCall]): List of ToolCall objects to convert
@@ -65,8 +524,7 @@ class ToolCallProcessor:
@staticmethod
def to_json(tool_calls: List[ToolCall]) -> str:
- """
- Convert ToolCall objects to JSON string representation.
+ """Convert ToolCall objects to JSON string representation.
Args:
tool_calls (List[ToolCall]): List of ToolCall objects to convert
diff --git a/templates/tool_calls/qwen3_coder.jinja b/templates/tool_calls/qwen3_coder.jinja
new file mode 100644
index 0000000..1527274
--- /dev/null
+++ b/templates/tool_calls/qwen3_coder.jinja
@@ -0,0 +1,123 @@
+{# TabbyAPI Metadata #}
+{%- set tool_call_format = "xml" -%}
+{%- set tool_start = "<tool_call>" -%}
+{%- set tool_end = "</tool_call>" -%}
+{%- set stop_strings = ["<|im_start|>", "<|im_end|>"] -%}
+
+{% macro render_extra_keys(json_dict, handled_keys) %}
+ {%- if json_dict is mapping %}
+ {%- for json_key in json_dict if json_key not in handled_keys %}
+ {%- if json_dict[json_key] is string %}
+ {{-'\n<' ~ json_key ~ '>' ~ (json_dict[json_key] | string) ~ '</' ~ json_key ~ '>' }}
+ {%- else %}
+ {{- '\n<' ~ json_key ~ '>' ~ (json_dict[json_key] | tojson | safe) ~ '</' ~ json_key ~ '>' }}
+ {%- endif %}
+ {%- endfor %}
+ {%- endif %}
+{%- endmacro %}
+
+{%- if messages[0]["role"] == "system" %}
+ {%- set system_message = messages[0]["content"] %}
+ {%- set loop_messages = messages[1:] %}
+{%- else %}
+ {%- set loop_messages = messages %}
+{%- endif %}
+
+{%- if not tools is defined %}
+ {%- set tools = [] %}
+{%- endif %}
+
+{%- if system_message is defined %}
+ {{- "<|im_start|>system\n" + system_message }}
+{%- else %}
+ {%- if tools is iterable and tools | length > 0 %}
+ {{- "<|im_start|>system\nYou are Qwen, a helpful AI assistant that can interact with a computer to solve tasks." }}
+ {%- endif %}
+{%- endif %}
+{%- if tools is iterable and tools | length > 0 %}
+ {{- "\n\n# Tools\n\nYou have access to the following functions:\n\n" }}
+ {{- "<tools>" }}
+ {%- for tool in tools %}
+ {%- if tool.function is defined %}
+ {%- set tool = tool.function %}
+ {%- endif %}
+ {{- "\n<function>\n<name>" ~ tool.name ~ "</name>" }}
+ {%- if tool.description is defined %}
+ {{- '\n<description>' ~ (tool.description | trim) ~ '</description>' }}
+ {%- endif %}
+ {{- '\n<parameters>' }}
+ {%- if tool.parameters is defined and tool.parameters is mapping and tool.parameters.properties is defined and tool.parameters.properties is mapping %}
+ {%- for param_name, param_fields in tool.parameters.properties|items %}
+ {{- '\n<parameter>' }}
+ {{- '\n<name>' ~ param_name ~ '</name>' }}
+ {%- if param_fields.type is defined %}
+ {{- '\n<type>' ~ (param_fields.type | string) ~ '</type>' }}
+ {%- endif %}
+ {%- if param_fields.description is defined %}
+ {{- '\n<description>' ~ (param_fields.description | trim) ~ '</description>' }}
+ {%- endif %}
+ {%- set handled_keys = ['name', 'type', 'description'] %}
+ {{- render_extra_keys(param_fields, handled_keys) }}
+ {{- '\n</parameter>' }}
+ {%- endfor %}
+ {%- endif %}
+ {%- set handled_keys = ['type', 'properties'] %}
+ {{- render_extra_keys(tool.parameters, handled_keys) }}
+ {{- '\n</parameters>' }}
+ {%- set handled_keys = ['type', 'name', 'description', 'parameters'] %}
+ {{- render_extra_keys(tool, handled_keys) }}
+ {{- '\n</function>' }}
+ {%- endfor %}
+ {{- "\n</tools>" }}
+ {{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }}
+{%- endif %}
+{%- if system_message is defined %}
+ {{- '<|im_end|>\n' }}
+{%- else %}
+ {%- if tools is iterable and tools | length > 0 %}
+ {{- '<|im_end|>\n' }}
+ {%- endif %}
+{%- endif %}
+{%- for message in loop_messages %}
+ {%- if message.role == "assistant" and message.tool_calls is defined and message.tool_calls is iterable and message.tool_calls | length > 0 %}
+ {{- '<|im_start|>' + message.role }}
+ {%- if message.content is defined and message.content is string and message.content | trim | length > 0 %}
+ {{- '\n' + message.content | trim + '\n' }}
+ {%- endif %}
+ {%- for tool_call in message.tool_calls %}
+ {%- if tool_call.function is defined %}
+ {%- set tool_call = tool_call.function %}
+ {%- endif %}
+ {{- '\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
+ {%- if tool_call.arguments is defined %}
+ {%- for args_name, args_value in tool_call.arguments|items %}
+ {{- '<parameter=' + args_name + '>\n' }}
+ {%- set args_value = args_value if args_value is string else args_value | tojson | safe %}
+ {{- args_value }}
+ {{- '\n</parameter>\n' }}
+ {%- endfor %}
+ {%- endif %}
+ {{- '</function>\n</tool_call>' }}
+ {%- endfor %}
+ {{- '<|im_end|>\n' }}
+ {%- elif message.role == "user" or message.role == "system" or message.role == "assistant" %}
+ {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+ {%- elif message.role == "tool" %}
+ {%- if loop.previtem and loop.previtem.role != "tool" %}
+ {{- '<|im_start|>user' }}
+ {%- endif %}
+ {{- '\n<tool_response>\n' }}
+ {{- message.content }}
+ {{- '\n</tool_response>' }}
+ {%- if not loop.last and loop.nextitem.role != "tool" %}
+ {{- '<|im_end|>\n' }}
+ {%- elif loop.last %}
+ {{- '<|im_end|>\n' }}
+ {%- endif %}
+ {%- else %}
+ {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>\n' }}
+ {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+ {{- '<|im_start|>assistant\n' }}
+{%- endif %}
\ No newline at end of file
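For anyone curious what the XML path in this patch actually matches: below is a minimal, self-contained sketch using the same three regex patterns from the diff. This is not the tabbyAPI code itself — it covers only the happy path (wrapped `<tool_call>` blocks), skipping the patch's extra handling for bare `<function>` blocks, `<think>` stripping, and truncated tags. The sample input is invented in the format the `qwen3_coder.jinja` template emits.

```python
import json
import re

# Regex patterns mirroring those in the patch above
TOOL_CALL_BLOCK_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=(.*?)>(.*?)</function>", re.DOTALL)
PARAMETER_RE = re.compile(
    r"<parameter=(.*?)>(.*?)"
    r"(?:</parameter>|(?=<parameter=)|(?=</function>)|(?=<tool_call>))",
    re.DOTALL,
)


def parse_xml_tool_calls(text: str) -> list:
    """Return [(function_name, {param: value, ...}), ...]."""
    calls = []
    for block in TOOL_CALL_BLOCK_RE.finditer(text):
        for func in FUNCTION_RE.finditer(block.group(1)):
            params = {}
            for param in PARAMETER_RE.finditer(func.group(2)):
                raw = param.group(2).strip()
                try:
                    # JSON coercion: numbers/bools/objects; never eval()
                    value = json.loads(raw)
                except (json.JSONDecodeError, ValueError):
                    value = raw  # plain-string fallback
                params[param.group(1).strip()] = value
            calls.append((func.group(1).strip(), params))
    return calls


# Invented sample in the Qwen3-Coder tool-call format
sample = (
    "<tool_call>\n<function=get_weather>\n"
    "<parameter=city>\nBerlin\n</parameter>\n"
    "<parameter=days>\n3\n</parameter>\n"
    "</function>\n</tool_call>"
)
print(parse_xml_tool_calls(sample))
# → [('get_weather', {'city': 'Berlin', 'days': 3})]
```

Note how `days` comes back as an int (the `json.loads` coercion step) while `city` falls back to a plain string — same strategy as `_coerce_param_value` in the patch.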
Sorry guys, I totally missed this topic.
If it's still relevant, you can also use my fork of tabby for Qwen/GLM/MiniMax tool support (the last one isn't well tested, tbh).
I've added info about this fork to the readme:
https://huggingface.co/NeuroSenko/Qwen3.5-397B-A17B-exl3#tool-calls-support-for-qwenglm-models
UPDATE: I've got it to work with mratsim's solution (I'd missed the custom chat template initially!).
However, the model quant I use (Qwen3.5-397B-A17B-exl3 ---> 3.5bpw_opt) is weird: lots of Chinese characters in the output...
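For anyone else debugging the Opencode path: a quick way to take the TUI out of the loop is to send a minimal OpenAI-style request with a tool definition straight to tabbyAPI and check whether `tool_calls` comes back populated. A sketch of such a request body follows — only the model name comes from this thread; the `get_weather` tool and its parameters are made up for illustration.

```python
import json

# Hypothetical tool definition; only the model name is from this thread.
payload = {
    "model": "Qwen3.5-397B-A17B-exl3",
    "messages": [
        {"role": "user", "content": "What's the weather in Berlin?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string"},
                        "days": {"type": "integer"},
                    },
                    "required": ["city"],
                },
            },
        }
    ],
}

# POST this as JSON to /v1/chat/completions on your tabbyAPI instance,
# with the qwen3_coder.jinja prompt template selected for the model.
body = json.dumps(payload, indent=2)
print(body[:80])
```

If the raw response shows the model emitting `<tool_call>` XML but `tool_calls` stays empty, the parser/template pairing is the likely culprit — which was exactly the issue above (the custom chat template wasn't selected).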