Qwen3.5 / Qwen3.6 Jinja2 Chat Template: Implementation Writeup

File: qwen3_5-template.jinja
Validation: validate_template.py (17 fixtures, 0 failures)
Bugs fixed: BUG-001 through BUG-006


Table of Contents

  1. Why a New Template?
  2. Research Basis
  3. Model Format Fundamentals
  4. Implementation Premises
  5. enable_thinking Behavior
  6. Tool Call Rendering
  7. Bug Analysis and Fixes
  8. Template Architecture
  9. Test Coverage
  10. Tool Ecosystem Compatibility

1. Why a New Template?

The official Qwen3.5/3.6 chat template (as shipped with the HuggingFace model checkpoints) contains at least six correctness bugs that cause silent failures in production agent loops. These bugs were independently reported across GitHub issues, HuggingFace discussions, Reddit threads, and llama.cpp/vLLM bug trackers between early 2025 and mid-2026.

An analysis of approximately five widely-used community replacement templates showed that each one fixed a different subset of the bugs while introducing new ones. None were derived systematically from the model's training format as documented in the official technical report.

This template was written from scratch, grounded in:

  • Qwen3 Technical Report (arXiv:2505.09388): authoritative description of the training format, thinking mechanism, and tool-calling protocol.
  • Mid-Think Paper (arXiv:2601.07036): phase structure of reasoning chains and budget-stop format.
  • Hermes tool-call format spec (Nous Research / NousHermes): the XML-based tool-call format on which Qwen3 tool-calling is modelled.
  • Community bug reports and vLLM/llama.cpp/Ollama source code analysis.

2. Research Basis

2.1 Qwen3 Technical Report (arXiv:2505.09388)

Key facts extracted for template construction:

  • No BOS token. The model was trained without one; inserting one degrades output.
  • <think> and </think> are regular BPE text tokens, not special tokens. Tokenizer ID 151644 = <|im_start|>, 151645 = <|im_end|>.
  • Non-thinking mode is implemented by prepending an empty think block to the assistant generation: <think>\n\n</think>\n\n. The report states explicitly: "For non-thinking mode samples, we retain an empty thinking block in the assistant's response. This design ensures internal format consistency."
  • /think and /no_think are plain text suffixes in user messages, not special tokens. The model was fine-tuned to follow the last such flag encountered in a multi-turn conversation.

2.2 Vocab and Tokenizer Notes

Token            ID       Note
<|endoftext|>   151643   End-of-document / pad fallback
<|im_start|>    151644   Begin-of-turn
<|im_end|>      151645   End-of-turn, eos_token

Qwen3.5/3.6 both use a padded vocabulary of 248,320 entries; tokens above 151,646 are padding with no semantics. The tokenizer class is Qwen2Tokenizer (BBPE, no <unk>).

2.3 Tool-Call Format Origin

Qwen3 tool-calling uses the Hermes-2 XML format (NousResearch):

<tool_call>
{"name": "function_name", "arguments": {"key": "value"}}
</tool_call>

This is identical to vLLM's hermes parser target and is the format recognised by Ollama's parseTag() heuristic (first text node following .ToolCalls).


3. Model Format Fundamentals

3.1 ChatML Base Structure

Every conversation is encoded as a sequence of turns delimited by im-start/end control tokens. No newline appears before <|im_end|>.

<|im_start|>system
{system_content}<|im_end|>
<|im_start|>user
{user_content}<|im_end|>
<|im_start|>assistant
<think>
{thinking}
</think>

{response}<|im_end|>

The blank line between </think> and the response is mandatory. The model was trained on this exact whitespace layout.

3.2 Non-Thinking Prefill (Character-Exact)

The non-thinking generation prefix is exactly 19 characters:

<think>\n\n</think>\n\n

Decomposed: <think> (7) + \n (1) + \n (1) + </think> (8) + \n (1) + \n (1) = 19. Any deviation (extra space, missing newline) moves the model off its training distribution.
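
The count can be checked mechanically. A minimal Python sketch follows; the constant name NON_THINKING_PREFILL is illustrative and does not come from the template itself:

```python
# Hypothetical constant name; the string is the prefill described above.
NON_THINKING_PREFILL = "<think>\n\n</think>\n\n"

# The six fragments from the decomposition, in order.
parts = ["<think>", "\n", "\n", "</think>", "\n", "\n"]
assert "".join(parts) == NON_THINKING_PREFILL

prefill_length = sum(len(p) for p in parts)  # 7 + 1 + 1 + 8 + 1 + 1
print(prefill_length)
```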

3.3 Think-Block Scope Rules

Turn type                                                       Think-block treatment
Historical assistant turn (non-last, no tool_calls)             Strip entirely: split('</think>')[-1].lstrip('\n')
Historical assistant turn (has tool_calls)                      Preserve: think block is part of the tool-call format
Last assistant turn in history (add_generation_prompt=False)    Preserve verbatim
Last assistant turn, no existing think, enable_thinking=False   Inject <think>\n\n</think>\n\n prefix
Generation prompt, enable_thinking=True                         No prefix: model generates its own <think>
Generation prompt, enable_thinking=False                        Inject <think>\n\n</think>\n\n prefix

4. Implementation Premises

4.1 Single Namespace Object

All mutable template state lives in one ns namespace object, avoiding Jinja2's scoping trap (variables set inside {% for %} blocks are not visible outside without a namespace):

{%- set ns = namespace(
    enable_thinking=true,
    image_count=0,
    video_count=0
) -%}

4.2 Pre-Scan Before Rendering

The template performs a full pre-scan of all messages before emitting any output. This is necessary because /no_think or /think can appear in any user message, and the final flag determines the generation prompt behaviour. A single-pass loop that both renders and tracks flags would have to look ahead, which Jinja2 cannot do.

{%- for i in range(messages | length) -%}
  {%- if messages[i].role == 'user' -%}
    {%- set _u = messages[i].content if messages[i].content is string else '' -%}
    {%- if _u.rstrip().endswith('/no_think') -%}
      {%- set ns.enable_thinking = false -%}
    {%- elif _u.rstrip().endswith('/think') -%}
      {%- set ns.enable_thinking = true -%}
    {%- endif -%}
  {%- endif -%}
{%- endfor -%}

4.3 Separate {{ }} Blocks for tojson Output

Jinja2's tojson filter returns a Markup object (already HTML-safe). When a Markup value is Python-concatenated with a plain string using +, Jinja2 auto-escapes the plain string and produces double-encoded output (&quot;, &#34;, etc.). This is BUG-003.

The fix is to never concatenate tojson output with plain strings inside a Jinja2 expression. Each fragment is emitted through its own {{ }} block:

{# WRONG: triggers HTML-escaping of the plain string #}
{{- '{"name": ' + tc.function.name | tojson + '}' -}}

{# CORRECT: separate blocks, no Python concatenation #}
{{- '{"name": ' -}}{{- tc.function.name | tojson -}}{{- '}' -}}

4.4 System Message Collection Phase

Multiple system messages are merged into a single <|im_start|>system turn with \n\n as separator (BUG-004 fix). This is done as a separate pre-pass (Section 4 in the template), so the main loop can unconditionally skip all role == 'system' messages.

The user's system content always appears before the tools block in the system turn, matching the training format.
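
The merge step can be sketched in plain Python. This is an illustration of the described behaviour, not the template's code: the function name is hypothetical, and skipping empty or non-string content is an assumption of this sketch.

```python
def merge_system_content(messages):
    """Join all system/developer message contents with a blank line."""
    parts = []
    for message in messages:
        if message.get("role") not in ("system", "developer"):
            continue
        content = message.get("content")
        # Assumption: only non-empty string content contributes to the merge.
        if isinstance(content, str) and content:
            parts.append(content)
    return "\n\n".join(parts)


merged = merge_system_content([
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Hi"},
    {"role": "system", "content": "Answer in English."},
])
print(repr(merged))
```

The main loop can then skip every system message, because their content has already been collected into one string.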

4.5 Tool Normalisation

Some frameworks pass tool definitions with a top-level function key ({"type": "function", "function": {...}}), while others pass the function schema directly ({"name": ..., "parameters": ...}). The template normalises all entries to the canonical form before serialisation:

{%- if tool.function is defined -%}
  {%- set ns_tb.list = ns_tb.list + [tool] -%}
{%- else -%}
  {%- set ns_tb.list = ns_tb.list + [{"type": "function", "function": tool}] -%}
{%- endif -%}
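
For readers more at home in Python, the same normalisation might look like the sketch below. The function name normalise_tools is hypothetical; only the two input shapes described above are handled.

```python
def normalise_tools(tools):
    """Wrap bare function schemas so every entry has the canonical shape."""
    normalised = []
    for tool in tools or []:
        if "function" in tool:
            # Already {"type": "function", "function": {...}}.
            normalised.append(tool)
        else:
            # Bare schema such as {"name": ..., "parameters": ...}.
            normalised.append({"type": "function", "function": tool})
    return normalised


bare = {"name": "get_weather", "parameters": {"type": "object"}}
wrapped = {"type": "function", "function": bare}
print(normalise_tools([bare, wrapped]))
```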

5. enable_thinking Behavior

5.1 Resolution Priority (Highest to Lowest)

  1. /no_think or /think text suffix in the last user message that contains one. This is the highest priority because it represents the most recent explicit user intent and mirrors the model's fine-tuning data.
  2. enable_thinking template variable passed at render time (e.g., via tokenizer.apply_chat_template(..., enable_thinking=False)).
  3. Default value of true (thinking on by default, consistent with the model's training distribution).
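
The three priority levels can be expressed as one small Python function. This is a sketch of the resolution order only; the function name and the use of None for "variable not passed" are assumptions of the example.

```python
from typing import Optional


def resolve_enable_thinking(messages, enable_thinking: Optional[bool] = None) -> bool:
    """Return the effective thinking mode for a conversation."""
    # Priority 3 (default true), overridden by priority 2 (render-time variable).
    resolved = True if enable_thinking is None else enable_thinking
    # Priority 1: the last /no_think or /think suffix in any user message wins.
    for message in messages:
        if message.get("role") != "user":
            continue
        content = message.get("content")
        text = content.rstrip() if isinstance(content, str) else ""
        if text.endswith("/no_think"):
            resolved = False
        elif text.endswith("/think"):
            resolved = True
    return resolved
```

Because the loop visits every user message in order, a later flag overwrites an earlier one, which mirrors the "last flag wins" rule.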

5.2 Generation Prompt Behaviour

When add_generation_prompt=True:

enable_thinking=True  →  <|im_start|>assistant\n
                         (model generates <think> itself)

enable_thinking=False →  <|im_start|>assistant\n<think>\n\n</think>\n\n
                         (forces non-thinking mode by pre-filling empty block)
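
As a Python sketch (the function name is illustrative), the two cases reduce to a single conditional suffix:

```python
def generation_prompt(enable_thinking: bool) -> str:
    """Build the text appended when add_generation_prompt=True."""
    prompt = "<|im_start|>assistant\n"
    if not enable_thinking:
        # Pre-fill an empty think block to force non-thinking mode.
        prompt += "<think>\n\n</think>\n\n"
    return prompt


print(repr(generation_prompt(True)))
print(repr(generation_prompt(False)))
```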

5.3 Last-History-Turn Behaviour (add_generation_prompt=False)

When the conversation ends with an assistant message and no generation prompt is requested (typical when scoring a complete conversation or when the assistant message is being appended to the prompt for continuation):

  • Think block present: preserved verbatim regardless of enable_thinking.
  • No think block, enable_thinking=True: content left as-is (historical turns are already stripped; the last one is the current generation context).
  • No think block, enable_thinking=False: inject <think>\n\n</think>\n\n before the content.
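
The three cases above can be sketched as follows. The function name is hypothetical, and detecting an existing think block by looking for '</think>' in the content is an assumption of this sketch; the template may use a different test.

```python
def prepare_last_assistant_content(content: str, enable_thinking: bool) -> str:
    """Apply the last-history-turn rules to an assistant message body."""
    if "</think>" in content:
        # Think block present: keep it verbatim in either mode.
        return content
    if enable_thinking:
        # No think block, thinking enabled: leave content as-is.
        return content
    # No think block, thinking disabled: inject the empty block.
    return "<think>\n\n</think>\n\n" + content
```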

5.4 Historical Think-Block Stripping (BUG-001)

The official template collapses think blocks in historical turns to <think>\n\n</think> instead of removing them. In a long agentic loop this produces an ever-growing sequence of empty think blocks that degrades prompt quality ("prompt poisoning").

The correct operation is full removal:

# Python equivalent
content = content.split('</think>')[-1].lstrip('\n') if '</think>' in content else content

{# Jinja2 equivalent #}
{%- if '</think>' in _ac -%}
  {%- set _ac = _ac.split('</think>')[-1].lstrip('\n') -%}
{%- endif -%}

Exception: turns that also carry tool_calls keep their think block intact. The model is trained to produce thinking before tool invocations, and stripping the think block from a historical tool-call turn would misrepresent the prompt.
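
Combining the stripping rule with the tool-call exception gives a small helper like the one below. It is a sketch: the function name and the has_tool_calls parameter are illustrative, not taken from the template.

```python
def strip_historical_think(content: str, has_tool_calls: bool) -> str:
    """Remove the think block from a historical assistant turn."""
    if has_tool_calls or "</think>" not in content:
        # Tool-call turns keep their reasoning; plain content is untouched.
        return content
    # Keep only the text after the last closing tag, minus leading newlines.
    return content.split("</think>")[-1].lstrip("\n")


turn = "<think>\nCheck the forecast first.\n</think>\n\nIt will rain."
print(repr(strip_historical_think(turn, has_tool_calls=False)))
```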


6. Tool Call Rendering

6.1 System Turn Tool Block Format

The exact text injected into the system message when tools are present matches the Qwen3 Hermes training format:

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "...", ...}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

All text, including the instruction sentences, is literal and must not be modified. The model was trained on this exact phrasing.

6.2 Assistant Tool-Call Block

Each tool call is rendered as:

<tool_call>
{"name": "function_name", "arguments": {JSON_OBJECT}}
</tool_call>

Multiple parallel calls appear as consecutive blocks separated by \n:

<tool_call>
{"name": "f1", "arguments": {...}}
</tool_call>
<tool_call>
{"name": "f2", "arguments": {...}}
</tool_call><|im_end|>

Note: the final </tool_call> is immediately followed by <|im_end|> with no intervening newline. This matches the training format.
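
The layout can be reproduced with a short Python sketch. Here json.dumps stands in for the template's tojson filter, whose escaping of some characters may differ; the function name is hypothetical, and arguments are assumed to be objects (string arguments are covered in Section 6.3).

```python
import json


def render_tool_calls(tool_calls) -> str:
    """Render parallel tool calls as newline-separated <tool_call> blocks."""
    blocks = []
    for call in tool_calls:
        function = call["function"]
        payload = (
            '{"name": ' + json.dumps(function["name"])
            + ', "arguments": ' + json.dumps(function["arguments"]) + "}"
        )
        blocks.append("<tool_call>\n" + payload + "\n</tool_call>")
    # No newline between the final </tool_call> and <|im_end|>.
    return "\n".join(blocks) + "<|im_end|>"


calls = [
    {"function": {"name": "f1", "arguments": {"a": 1}}},
    {"function": {"name": "f2", "arguments": {"b": "x"}}},
]
print(render_tool_calls(calls))
```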

6.3 Arguments: String vs Object (BUG-006)

Some frameworks (notably older OpenAI-compatible clients and some streaming implementations) serialise tool-call arguments as a JSON string ("{\"location\": \"Berlin\"}") rather than as an object ({"location": "Berlin"}). The template handles both:

{%- if tc.function.arguments is string -%}
  {{- ', "arguments": ' + tc.function.arguments -}}
{%- else -%}
  {{- ', "arguments": ' -}}{{- tc.function.arguments | tojson -}}
{%- endif -%}

When arguments are already a string they are passed through as-is (the caller is responsible for valid JSON). When they are a dict/object, tojson serialises them correctly including Unicode escaping and quote escaping.

This arrangement also prevents the triple-quote crash (BUG-006). Templates that build the arguments JSON by string interpolation break when a value contains a quote sequence such as """. Because tojson emits a properly escaped JSON literal, the object path cannot fail this way.
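
The dual path is easy to mirror in Python. In this sketch json.dumps again stands in for tojson, and the function name is illustrative:

```python
import json


def render_arguments(arguments) -> str:
    """Return the JSON text for a tool call's arguments field."""
    if isinstance(arguments, str):
        # Already serialised by the caller; passed through unchanged.
        return arguments
    # Dict/object: serialise, which escapes embedded quote characters.
    return json.dumps(arguments)


as_string = render_arguments('{"location": "Berlin"}')
as_object = render_arguments({"location": "Berlin"})
tricky = render_arguments({"code": '"""'})
print(as_string == as_object, tricky)
```

The value containing three double quotes survives a round trip through json.loads, which is the property the object path relies on.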

6.4 Tool Results

Tool results are wrapped in a user turn using <tool_response>:

<|im_start|>user
<tool_response>
{result_content}
</tool_response><|im_end|>

Consecutive tool-response messages are merged into a single user turn: the template checks whether the previous message's role was also tool and suppresses the <|im_start|>user\n header if so.
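
The grouping logic can be sketched in Python as below. The function name is hypothetical. Two details are assumptions of this sketch rather than statements from the text above: merged responses are separated by a single newline, and each turn is followed by a newline after <|im_end|>.

```python
def render_tool_responses(messages) -> str:
    """Render role=tool messages, merging consecutive ones into one user turn."""
    out = []
    for i, message in enumerate(messages):
        if message["role"] != "tool":
            continue
        prev_is_tool = i > 0 and messages[i - 1]["role"] == "tool"
        next_is_tool = i + 1 < len(messages) and messages[i + 1]["role"] == "tool"
        if not prev_is_tool:
            # Open the user turn only for the first response in a run.
            out.append("<|im_start|>user")
        out.append("\n<tool_response>\n" + message["content"] + "\n</tool_response>")
        if not next_is_tool:
            # Close the turn only after the last response in a run.
            out.append("<|im_end|>\n")
    return "".join(out)


history = [{"role": "tool", "content": "A"}, {"role": "tool", "content": "B"}]
print(render_tool_responses(history))
```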


7. Bug Analysis and Fixes

BUG-001: Historical Think Blocks Leaked (CRITICAL)

Symptom: In multi-turn conversations with enable_thinking=True, every historical assistant message retains a collapsed <think>\n\n</think> block. Over many turns the prompt accumulates dozens of empty think blocks, degrading model performance.

Root cause: Official template strips think content but leaves the surrounding <think> tags.

Fix: Strip the entire block by splitting on </think> and taking the tail:

{%- set _ac = _ac.split('</think>')[-1].lstrip('\n') -%}

Tests: T10, T13, T16


BUG-002: KeyError on content=None / Missing content Key (HIGH)

Symptom: When an assistant message contains only tool_calls and no content (or content=None, which is the OpenAI convention for pure tool-call responses), the template throws UndefinedError or KeyError.

Root cause: Official template accesses message.content directly.

Fix: Guard the access:

{%- if message.content is defined and message.content is string -%}
  {%- set _ac = message.content -%}
{%- elif message.content is defined and message.content is iterable ... -%}
  {%- set _ac = render_content(message.content) -%}
{%- else -%}
  {%- set _ac = '' -%}
{%- endif -%}

Tests: T04, T11
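
The BUG-002 guard corresponds to the following Python sketch. The function name is hypothetical, and the list branch (joining the text of text-type parts) is only a simplified stand-in for the template's render_content macro, whose full behaviour is not shown here.

```python
def assistant_text(message) -> str:
    """Read assistant content without failing on None or a missing key."""
    content = message.get("content")  # None when the key is absent
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        # Assumption: keep only the text of parts typed as "text".
        return "".join(
            part.get("text", "")
            for part in content
            if isinstance(part, dict) and part.get("type") == "text"
        )
    return ""


tool_only = {"role": "assistant", "tool_calls": [{"function": {"name": "f"}}]}
print(repr(assistant_text(tool_only)))
```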


BUG-003: Markup HTML-Escaping in Tool JSON (MEDIUM)

Symptom: Tool definitions or tool-call arguments with characters like <, >, &, or " appear HTML-escaped in the rendered prompt (&lt;, &gt;, &amp;, &#34;). This causes the model to misread the tool schema.

Root cause: tojson returns a Jinja2 Markup object. When Markup is concatenated with a plain Python string using + inside a Jinja2 expression, the plain string is auto-escaped and then concatenated with the already-safe Markup value.

Fix: Never use + to join tojson output with plain strings. Emit each fragment through a separate {{ }} block:

{# Every fragment in its own block #}
{{- '{"name": ' -}}{{- tc.function.name | tojson -}}

Tests: T03, T04, T12


BUG-004: Multiple System Messages Not Handled (MEDIUM)

Symptom: Frameworks such as Open WebUI send more than one role: system message. The official template either crashes or emits multiple system turns, both of which confuse the model.

Root cause: No merging logic for multiple system messages.

Fix: Pre-scan all messages and concatenate system content with \n\n:

{%- if ns_sys.content == '' -%}
  {%- set ns_sys.content = _c -%}
{%- else -%}
  {%- set ns_sys.content = ns_sys.content + '\n\n' + _c -%}
{%- endif -%}

Tests: T02, T14


BUG-005: Wrong Non-Thinking Prefill Whitespace (LOW-MEDIUM)

Symptom: Non-thinking mode produces a think block with incorrect whitespace, moving the model off its training distribution and causing output quality degradation or refusal to honour the non-thinking instruction.

Root cause: The official template uses <think>\n</think>\n\n (missing the second newline inside the block), which does not match the format described in the technical report.

Fix: Use the exact 19-character sequence:

<think>\n\n</think>\n\n

Tests: T08, T17


BUG-006: Triple-Quote Crash on Python String Arguments (MEDIUM)

Symptom: Jinja2 raises a TemplateSyntaxError or produces garbled output when tool-call arguments contain triple-quote sequences (""" or ''') because the template previously embedded argument values using Python string literal concatenation.

Root cause: Some community templates build the tool-call JSON via string interpolation ('{"arguments": "' + args + '"}'), which breaks for argument values containing quote characters.

Fix: Use tojson for all non-string arguments (produces well-formed JSON) and pass string arguments through unchanged (caller provides valid JSON strings):

{%- if tc.function.arguments is string -%}
  {{- ', "arguments": ' + tc.function.arguments -}}
{%- else -%}
  {{- ', "arguments": ' -}}{{- tc.function.arguments | tojson -}}
{%- endif -%}

Tests: T12


8. Template Architecture

The template is divided into eight clearly delimited sections, each with a comment header:

Section 1  render_content macro
           Handles str / list (image/video/text) / None → plain text.
           Increments ns.image_count / ns.video_count for vision tokens.

Section 2  Namespace initialisation
           Single ns object; enable_thinking defaults to true.

Section 3  Pre-scan
           Walk all user messages; last /no_think or /think wins.

Section 4  Collect system content
           Merge all system / developer messages with \n\n.

Section 5  Build tools list
           Normalise every tool to {"type":"function","function":{...}}.

Section 6  Output system turn
           Emit one <|im_start|>system turn (user content + tools block).

Section 7  Main message loop
           7a  system/developer  → skip (already emitted)
           7b  user              → render with vision support
           7c  assistant         → render with think-block logic + tool_calls
           7d  tool              → group into user turns
           7e  unknown role      → raise_exception

Section 8  Generation prompt
           enable_thinking=True  → bare <|im_start|>assistant\n
           enable_thinking=False → add <think>\n\n</think>\n\n prefix

Design Decisions

No default system prompt. Unlike some community templates, this template does not inject a default system prompt when none is provided. The model performs well without one, and injecting one would cause conflicts for applications that rely on the system prompt being exactly what they set.

No BOS token. The Qwen3 family was trained without a BOS token. Adding one would consume a context window slot unnecessarily and may harm performance.

No <|endoftext|> in conversation. This token is reserved for end-of-document signalling in the pre-training phase, not for conversation boundaries.


9. Test Coverage

The 17 test fixtures in validate_template.py cover:

ID   Scenario                                            Key assertion
T01  Simple user/assistant, no system, no tools          Exact ChatML output
T02  System message                                      System turn before user turn
T03  Tools defined, enable_thinking=True                 Tools block in system; no prefill
T04  Tool call, content=None                             No crash; <tool_call> present
T05  Parallel tool calls                                 </tool_call>\n<tool_call> separator
T06  Tool result (role=tool)                             `<
T07  enable_thinking=True generation prompt              No think prefix emitted
T08  enable_thinking=False generation prompt             Exact 19-char prefill
T09  /no_think flag in user message                      Non-thinking prefill applied
T10  Historical think blocks                             Fully stripped, not collapsed
T11  Missing content key on assistant                    No KeyError / UndefinedError
T12  Special chars in arguments                          Correctly JSON-escaped
T13  Historical tool-call turn with think                Think block preserved
T14  Multiple system messages                            Merged with \n\n; single system turn
T15  Parallel tool responses                             Both inside single user turn
T16  Last history turn with existing think               Preserved verbatim
T17  Last history turn, no think, enable_thinking=False  Prefill injected

Run the suite:

cd /workspace/project/qwen3_5-template
python validate_template.py
# Expected: 17 passed, 0 failed

10. Tool Ecosystem Compatibility

An analysis of 51 tool-calling frameworks and inference backends was conducted to verify that the template's output is consumable by the broadest possible set of tools. Key findings:

10.1 OpenAI JSON Format Dominance

31 of the 51 analysed tools use the OpenAI-compatible JSON function-call API (Group A). These tools pass tool definitions as a tools array and receive tool calls back as message.tool_calls objects. The template's input format is fully compatible with this convention.

Notable Group A members: OpenHands, LangChain, LangGraph, LiteLLM, CrewAI, Pydantic AI, Open WebUI, LibreChat, LM Studio, LlamaIndex, AutoGen.

10.2 Inference Server Compatibility

Backend    Compatibility note
vLLM       Uses the hermes tool parser for Qwen models, matching this template's <tool_call> format exactly.
llama.cpp  Recognises <tool_call> via the --jinja flag + chat template loading. Note: --jinja disables GBNF grammar (Issue #12204).
Ollama     Auto-detects the tool-call tag via parseTag(), which reads the first text node after .ToolCalls in the Go template tree; <tool_call> is one of the three known tags.
LM Studio  Passes tool definitions as the tools API field; receives tool calls in message.tool_calls.
TabbyAPI   Full OpenAI-compatible API; correct chat template is the only requirement.

10.3 Non-Native Tool-Calling Frameworks

Three framework groups (Cline/Roo Code XML, OpenCode <parameter>, Aider SEARCH/REPLACE) do not use the OpenAI tool-calling API at all. They inject their own tool descriptions into the system prompt and parse the model's text output directly. These frameworks do not interact with the chat template's tool-calling sections β€” they send no tools array and the template therefore emits no tool block.

10.4 Arguments as JSON String

Several frameworks (notably some streaming clients and older OpenAI SDK versions) serialise tool_calls[].function.arguments as a JSON string rather than a parsed object. The template's dual-path arguments handling (Section 6.3) accommodates both cases transparently.


Generated as part of the fix/qwen3-template-bugs implementation.