# Qwen3.5 / Qwen3.6 Jinja2 Chat Template: Implementation Writeup

**File:** `qwen3_5-template.jinja`
**Validation:** `validate_template.py` (17 fixtures, 0 failures)
**Bugs fixed:** BUG-001 through BUG-006

---

## Table of Contents

1. [Why a New Template?](#1-why-a-new-template)
2. [Research Basis](#2-research-basis)
3. [Model Format Fundamentals](#3-model-format-fundamentals)
4. [Implementation Premises](#4-implementation-premises)
5. [enable_thinking Behavior](#5-enable_thinking-behavior)
6. [Tool Call Rendering](#6-tool-call-rendering)
7. [Bug Analysis and Fixes](#7-bug-analysis-and-fixes)
8. [Template Architecture](#8-template-architecture)
9. [Test Coverage](#9-test-coverage)
10. [Tool Ecosystem Compatibility](#10-tool-ecosystem-compatibility)

---

## 1. Why a New Template?

The official Qwen3.5/3.6 chat template (as shipped with the HuggingFace model checkpoints) contains at least six correctness bugs that cause silent failures in production agent loops. These bugs were independently reported across GitHub issues, HuggingFace discussions, Reddit threads, and llama.cpp/vLLM bug trackers between early 2025 and mid-2026.

An analysis of approximately five widely used community replacement templates showed that each one fixed a different subset of the bugs while introducing new ones. None were derived systematically from the model's training format as documented in the official technical report.

This template was written from scratch, grounded in:

- **Qwen3 Technical Report** (arXiv:2505.09388): authoritative description of the training format, thinking mechanism, and tool-calling protocol.
- **Mid-Think Paper** (arXiv:2601.07036): phase structure of reasoning chains and budget-stop format.
- **Hermes tool-call format spec** (Nous Research / NousHermes): the XML-based tool-call format on which Qwen3 tool-calling is modelled.
- Community bug reports and vLLM/llama.cpp/Ollama source code analysis.

---

## 2. Research Basis

### 2.1 Qwen3 Technical Report (arXiv:2505.09388)

Key facts extracted for template construction:

- No BOS token. The model was trained without one; inserting one degrades output.
- `<think>` and `</think>` are **regular BPE text tokens**, not special tokens. Tokenizer ID 151644 = `<|im_start|>`, 151645 = `<|im_end|>`.
- Non-thinking mode is implemented by prepending an **empty think block** to the assistant generation: `<think>\n\n</think>\n\n`. The report states explicitly: *"For non-thinking mode samples, we retain an empty thinking block in the assistant's response. This design ensures internal format consistency."*
- `/think` and `/no_think` are plain text suffixes in user messages, not special tokens. The model was fine-tuned to follow the **last** such flag encountered in a multi-turn conversation.

### 2.2 Vocab and Tokenizer Notes

```
Token            ID       Note
<|endoftext|>    151643   End-of-document / pad fallback
<|im_start|>     151644   Begin-of-turn
<|im_end|>       151645   End-of-turn, eos_token
```

Qwen3.5/3.6 both use a padded vocabulary of 248,320 entries; tokens above 151,646 are padding with no semantics. The tokenizer class is `Qwen2Tokenizer` (BBPE, no BOS token).

### 2.3 Tool-Call Format Origin

Qwen3 tool-calling uses the **Hermes-2 XML format** (NousResearch):

```
<tool_call>
{"name": "function_name", "arguments": {"key": "value"}}
</tool_call>
```

This is identical to vLLM's `hermes` parser target and is the format recognised by Ollama's `parseTag()` heuristic (first text node following `.ToolCalls`).

---

## 3. Model Format Fundamentals

### 3.1 ChatML Base Structure

Every conversation is encoded as a sequence of turns delimited by im-start/end control tokens. No newline appears before `<|im_end|>`.

```
<|im_start|>system
{system_content}<|im_end|>
<|im_start|>user
{user_content}<|im_end|>
<|im_start|>assistant
<think>
{thinking}
</think>

{response}<|im_end|>
```

The blank line between `</think>` and the response is mandatory. The model was trained on this exact whitespace layout.
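The layout above can be mirrored in a few lines of Python. This is only a sketch for checking whitespace by hand, not the template itself; the helper names `render_turn`, `render_assistant`, and `render_conversation` are hypothetical, and the newline after each `<|im_end|>` is assumed from the line breaks in the layout shown.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_turn(role: str, content: str) -> str:
    """Render one ChatML turn; the content runs straight into <|im_end|>."""
    return f"{IM_START}{role}\n{content}{IM_END}\n"


def render_assistant(thinking: str, response: str) -> str:
    """Assistant turn: think block, one blank line, then the response."""
    body = f"<think>\n{thinking}\n</think>\n\n{response}"
    return render_turn("assistant", body)


def render_conversation(system: str, user: str, thinking: str, response: str) -> str:
    """Concatenate system, user, and assistant turns in order."""
    return (
        render_turn("system", system)
        + render_turn("user", user)
        + render_assistant(thinking, response)
    )


if __name__ == "__main__":
    # repr() makes every newline visible for character-level comparison.
    print(repr(render_conversation("Be brief.", "Hi", "Greet back.", "Hello!")))
```

Printing with `repr()` is a convenient way to compare a rendered prompt against the expected layout, since stray or missing newlines are otherwise easy to overlook.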
### 3.2 Non-Thinking Prefill (Character-Exact)

The non-thinking generation prefix is exactly 19 characters, where each `\n` is a single newline character:

```
<think>\n\n</think>\n\n
```

Decomposed: `<think>` (7) + `\n` (1) + `\n` (1) + `</think>` (8) + `\n` (1) + `\n` (1) = 19. Any deviation (extra space, missing newline) moves the model off its training distribution.

### 3.3 Think-Block Scope Rules

| Turn type | Think-block treatment |
|---|---|
| Historical assistant turn (non-last, no tool_calls) | **Strip entirely**: `split('</think>')[-1].lstrip('\n')` |
| Historical assistant turn (has tool_calls) | **Preserve**: think block is part of the tool-call format |
| Last assistant turn in history (`add_generation_prompt=False`) | **Preserve verbatim** |
| Last assistant turn, no existing think, `enable_thinking=False` | **Inject** `<think>\n\n</think>\n\n` prefix |
| Generation prompt, `enable_thinking=True` | **No prefix**: model generates its own `<think>` |
| Generation prompt, `enable_thinking=False` | **Inject** `<think>\n\n</think>\n\n` prefix |

---

## 4. Implementation Premises

### 4.1 Single Namespace Object

All mutable template state lives in one `ns` namespace object, avoiding Jinja2's scoping trap (variables set inside `{% for %}` blocks are not visible outside without a namespace):

```jinja2
{%- set ns = namespace(
    enable_thinking=true,
    image_count=0,
    video_count=0
) -%}
```

### 4.2 Pre-Scan Before Rendering

The template performs a full pre-scan of all messages before emitting any output. This is necessary because `/no_think` or `/think` can appear in any user message, and the final flag determines the generation prompt behaviour. A single-pass loop that both renders and tracks flags would have to look ahead, which Jinja2 cannot do.
```jinja2
{%- for i in range(messages | length) -%}
    {%- if messages[i].role == 'user' -%}
        {#- Only plain-string content is scanned for flags -#}
        {%- set _u = messages[i].content if messages[i].content is string else '' -%}
        {%- if _u.rstrip().endswith('/no_think') -%}
            {%- set ns.enable_thinking = false -%}
        {%- elif _u.rstrip().endswith('/think') -%}
            {%- set ns.enable_thinking = true -%}
        {%- endif -%}
    {%- endif -%}
{%- endfor -%}
```

### 4.3 Separate `{{ }}` Blocks for `tojson` Output

Jinja2's `tojson` filter returns a `Markup` object (already HTML-safe). When a `Markup` value is concatenated with a plain string using `+`, the plain string is HTML-escaped, which puts entities such as `&quot;` and `&#34;` into the rendered prompt. This is BUG-003.

The fix is to never concatenate `tojson` output with plain strings inside a Jinja2 expression. Each fragment is emitted through its own `{{ }}` block:

```jinja2
{# WRONG: triggers HTML-escaping of the plain string #}
{{- '{"name": ' + tc.function.name | tojson + '}' -}}

{# CORRECT: separate blocks, no Python concatenation #}
{{- '{"name": ' -}}{{- tc.function.name | tojson -}}{{- '}' -}}
```

### 4.4 System Message Collection Phase

Multiple system messages are merged into a single `<|im_start|>system` turn with `\n\n` as separator (BUG-004 fix). This is done as a separate pre-pass (Section 4 in the template), so the main loop can unconditionally skip all `role == 'system'` messages.

The user's system content always appears **before** the tools block in the system turn, matching the training format.

### 4.5 Tool Normalisation

Some frameworks pass tool definitions with a top-level `function` key (`{"type": "function", "function": {...}}`), while others pass the function schema directly (`{"name": ..., "parameters": ...}`).
The template normalises all entries to the canonical form before serialisation:

```jinja2
{%- if tool.function is defined -%}
    {%- set ns_tb.list = ns_tb.list + [tool] -%}
{%- else -%}
    {%- set ns_tb.list = ns_tb.list + [{"type": "function", "function": tool}] -%}
{%- endif -%}
```

---

## 5. `enable_thinking` Behavior

### 5.1 Resolution Priority (Highest to Lowest)

1. **`/no_think` or `/think` text suffix** in the last user message that contains one. This is the highest priority because it represents the most recent explicit user intent and mirrors the model's fine-tuning data.
2. **`enable_thinking` template variable** passed at render time (e.g., via `tokenizer.apply_chat_template(..., enable_thinking=False)`).
3. **Default value** of `true` (thinking on by default, consistent with the model's training distribution).

### 5.2 Generation Prompt Behaviour

When `add_generation_prompt=True`:

```
enable_thinking=True  → <|im_start|>assistant\n
                        (model generates <think> itself)
enable_thinking=False → <|im_start|>assistant\n<think>\n\n</think>\n\n
                        (forces non-thinking mode by pre-filling empty block)
```

### 5.3 Last-History-Turn Behaviour (add_generation_prompt=False)

When the conversation ends with an assistant message and no generation prompt is requested (typical when scoring a complete conversation or when the assistant message is being appended to the prompt for continuation):

- **Think block present:** preserved verbatim regardless of `enable_thinking`.
- **No think block, `enable_thinking=True`:** content left as-is (historical turns are already stripped; the last one is the current generation context).
- **No think block, `enable_thinking=False`:** inject `<think>\n\n</think>\n\n` before the content.

### 5.4 Historical Think-Block Stripping (BUG-001)

The official template collapses think blocks in historical turns to `<think>\n\n</think>` instead of removing them. In a long agentic loop this produces an ever-growing sequence of empty think blocks that degrades prompt quality ("prompt poisoning").
The correct operation is full removal:

```python
# Python equivalent: keep only the text after the last </think>
content = content.split('</think>')[-1].lstrip('\n') if '</think>' in content else content
```

```jinja2
{# Jinja2 equivalent #}
{%- if '</think>' in _ac -%}
    {%- set _ac = _ac.split('</think>')[-1].lstrip('\n') -%}
{%- endif -%}
```

**Exception:** turns that also carry `tool_calls` keep their think block intact. The model is trained to produce thinking before tool invocations, and stripping the think block from a historical tool-call turn would misrepresent the prompt.

---

## 6. Tool Call Rendering

### 6.1 System Turn Tool Block Format

The exact text injected into the system message when tools are present matches the Qwen3 Hermes training format:

```
# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "...", ...}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
```

All text, including the instruction sentences, is literal and must not be modified. The model was trained on this exact phrasing.

### 6.2 Assistant Tool-Call Block

Each tool call is rendered as:

```
<tool_call>
{"name": "function_name", "arguments": {JSON_OBJECT}}
</tool_call>
```

Multiple parallel calls appear as consecutive blocks separated by `\n`:

```
<tool_call>
{"name": "f1", "arguments": {...}}
</tool_call>
<tool_call>
{"name": "f2", "arguments": {...}}
</tool_call><|im_end|>
```

Note: the final `</tool_call>` is immediately followed by `<|im_end|>` with no intervening newline. This matches the training format.

### 6.3 Arguments: String vs Object (BUG-006)

Some frameworks (notably older OpenAI-compatible clients and some streaming implementations) serialise tool-call arguments as a JSON string (`"{\"location\": \"Berlin\"}"`) rather than as an object (`{"location": "Berlin"}`).
The template handles both:

```jinja2
{%- if tc.function.arguments is string -%}
    {#- Already-serialised JSON: pass through unchanged -#}
    {{- ', "arguments": ' + tc.function.arguments -}}
{%- else -%}
    {#- Dict/object: serialise in its own block (see BUG-003) -#}
    {{- ', "arguments": ' -}}{{- tc.function.arguments | tojson -}}
{%- endif -%}
```

When arguments are already a string they are passed through as-is (the caller is responsible for valid JSON). When they are a dict/object, `tojson` serialises them correctly, including Unicode escaping and quote escaping.

This arrangement also prevents the `"""` crash (BUG-006): Python triple-quoted strings inside Jinja2 template strings would crash the Jinja2 parser if the arguments dict happened to contain a value like `"""`. Because `tojson` produces a proper JSON string literal, the crash cannot occur.

### 6.4 Tool Results

Tool results are wrapped in a user turn using `<tool_response>`:

```
<|im_start|>user
<tool_response>
{result_content}
</tool_response><|im_end|>
```

Consecutive tool-response messages are merged into a single user turn: the template checks whether the previous message's role was also `tool` and suppresses the `<|im_start|>user\n` header if so.

---

## 7. Bug Analysis and Fixes

### BUG-001: Historical Think Blocks Leaked (CRITICAL)

**Symptom:** In multi-turn conversations with `enable_thinking=True`, every historical assistant message retains a collapsed `<think>\n\n</think>` block. Over many turns the prompt accumulates dozens of empty think blocks, degrading model performance.

**Root cause:** The official template strips the think content but leaves the surrounding `<think>` tags.

**Fix:** Strip the entire block by splitting on `</think>` and taking the tail:

```jinja2
{%- set _ac = _ac.split('</think>')[-1].lstrip('\n') -%}
```

**Tests:** T10, T13, T16

---

### BUG-002: KeyError on content=None / Missing content Key (HIGH)

**Symptom:** When an assistant message contains only `tool_calls` and no `content` (or `content=None`, which is the OpenAI convention for pure tool-call responses), the template throws `UndefinedError` or `KeyError`.

**Root cause:** The official template accesses `message.content` directly.

**Fix:** Guard the access:

```jinja2
{%- if message.content is defined and message.content is string -%}
    {%- set _ac = message.content -%}
{%- elif message.content is defined and message.content is iterable ... -%}
    {%- set _ac = render_content(message.content) -%}
{%- else -%}
    {#- content missing or None: fall back to an empty string -#}
    {%- set _ac = '' -%}
{%- endif -%}
```

**Tests:** T04, T11

---

### BUG-003: Markup HTML-Escaping in Tool JSON (MEDIUM)

**Symptom:** Tool definitions or tool-call arguments with characters like `<`, `>`, `&`, or `"` appear HTML-escaped in the rendered prompt (`&lt;`, `&gt;`, `&amp;`, `&quot;`). This causes the model to misread the tool schema.

**Root cause:** `tojson` returns a Jinja2 `Markup` object. When `Markup` is concatenated with a plain Python string using `+` inside a Jinja2 expression, the plain string is auto-escaped and then concatenated with the already-safe `Markup` value.

**Fix:** Never use `+` to join `tojson` output with plain strings. Emit each fragment through a separate `{{ }}` block:

```jinja2
{# Every fragment in its own block #}
{{- '{"name": ' -}}{{- tc.function.name | tojson -}}
```

**Tests:** T03, T04, T12

---

### BUG-004: Multiple System Messages Not Handled (MEDIUM)

**Symptom:** Frameworks such as Open WebUI send more than one `role: system` message. The official template either crashes or emits multiple system turns, both of which confuse the model.

**Root cause:** No merging logic for multiple system messages.

**Fix:** Pre-scan all messages and concatenate system content with `\n\n`:

```jinja2
{%- if ns_sys.content == '' -%}
    {%- set ns_sys.content = _c -%}
{%- else -%}
    {%- set ns_sys.content = ns_sys.content + '\n\n' + _c -%}
{%- endif -%}
```

**Tests:** T02, T14

---

### BUG-005: Wrong Non-Thinking Prefill Whitespace (LOW-MEDIUM)

**Symptom:** Non-thinking mode produces a think block with incorrect whitespace, moving the model off its training distribution and causing output quality degradation or refusal to honour the non-thinking instruction.

**Root cause:** The official template uses `<think>\n</think>\n\n` (missing the second newline inside the block), which does not match the format described in the technical report.

**Fix:** Use the exact 19-character sequence:

```
<think>\n\n</think>\n\n
```

**Tests:** T08, T17

---

### BUG-006: Triple-Quote Crash on Python String Arguments (MEDIUM)

**Symptom:** Jinja2 raises a `TemplateSyntaxError` or produces garbled output when tool-call arguments contain triple-quote sequences (`"""` or `'''`) because the template previously embedded argument values using Python string literal concatenation.

**Root cause:** Some community templates build the tool-call JSON via string interpolation (`'{"arguments": "' + args + '"}'`), which breaks for argument values containing quote characters.

**Fix:** Use `tojson` for all non-string arguments (produces well-formed JSON) and pass string arguments through unchanged (caller provides valid JSON strings):

```jinja2
{%- if tc.function.arguments is string -%}
    {{- ', "arguments": ' + tc.function.arguments -}}
{%- else -%}
    {{- ', "arguments": ' -}}{{- tc.function.arguments | tojson -}}
{%- endif -%}
```

**Tests:** T12

---

## 8. Template Architecture

The template is divided into eight clearly delimited sections, each with a comment header:

```
Section 1  render_content macro
           Handles str / list (image/video/text) / None → plain text.
           Increments ns.image_count / ns.video_count for vision tokens.

Section 2  Namespace initialisation
           Single ns object; enable_thinking defaults to true.

Section 3  Pre-scan
           Walk all user messages; last /no_think or /think wins.

Section 4  Collect system content
           Merge all system / developer messages with \n\n.

Section 5  Build tools list
           Normalise every tool to {"type":"function","function":{...}}.

Section 6  Output system turn
           Emit one <|im_start|>system turn (user content + tools block).

Section 7  Main message loop
           7a  system/developer → skip (already emitted)
           7b  user             → render with vision support
           7c  assistant        → render with think-block logic + tool_calls
           7d  tool             → group into user turns
           7e  unknown role     → raise_exception

Section 8  Generation prompt
           enable_thinking=True  → bare <|im_start|>assistant\n
           enable_thinking=False → add <think>\n\n</think>\n\n prefix
```

### Design Decisions

**No default system prompt.** Unlike some community templates, this template does not inject a default system prompt when none is provided. The model performs well without one, and injecting one would cause conflicts for applications that rely on the system prompt being exactly what they set.

**No BOS token.** The Qwen3 family was trained without a BOS token. Adding one would consume a context window slot unnecessarily and may harm performance.

**No `<|endoftext|>` in conversation.** This token is reserved for end-of-document signalling in the pre-training phase, not for conversation boundaries.

---

## 9. Test Coverage

The 17 test fixtures in `validate_template.py` cover:

| ID | Scenario | Key assertion |
|---|---|---|
| T01 | Simple user/assistant, no system, no tools | Exact ChatML output |
| T02 | System message | System turn before user turn |
| T03 | Tools defined, `enable_thinking=True` | Tools block in system; no prefill |
| T04 | Tool call, `content=None` | No crash; `<tool_call>` present |
| T05 | Parallel tool calls | `\n` separator |
| T06 | Tool result (role=tool) | `<\|im_start\|>user\n<tool_response>` |
| T07 | `enable_thinking=True` generation prompt | No think prefix emitted |
| T08 | `enable_thinking=False` generation prompt | Exact 19-char prefill |
| T09 | `/no_think` flag in user message | Non-thinking prefill applied |
| T10 | Historical think blocks | Fully stripped, not collapsed |
| T11 | Missing `content` key on assistant | No KeyError / UndefinedError |
| T12 | Special chars in arguments | Correctly JSON-escaped |
| T13 | Historical tool-call turn with think | Think block preserved |
| T14 | Multiple system messages | Merged with `\n\n`; single system turn |
| T15 | Parallel tool responses | Both inside single user turn |
| T16 | Last history turn with existing think | Preserved verbatim |
| T17 | Last history turn, no think, `enable_thinking=False` | Prefill injected |

Run the suite:

```bash
cd /workspace/project/qwen3_5-template
python validate_template.py
# Expected: 17 passed, 0 failed
```

---

## 10. Tool Ecosystem Compatibility

An analysis of 51 tool-calling frameworks and inference backends was conducted to verify that the template's output is consumable by the broadest possible set of tools. Key findings:

### 10.1 OpenAI JSON Format Dominance

31 of the 51 analysed tools use the **OpenAI-compatible JSON function-call API** (Group A). These tools pass tool definitions as a `tools` array and receive tool calls back as `message.tool_calls` objects. The template's input format is fully compatible with this convention.
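For illustration, a Group A request might be shaped like the sketch below. Everything specific in it is hypothetical: the `get_weather` function, its schema, the `call_0` id, and the `arguments_as_object` helper are invented for this example and are not part of the template. The helper shows one way a caller could read `arguments` whether a framework delivers it as a JSON string or as an already-parsed object.

```python
import json

# Hypothetical tool definition in the OpenAI-compatible "tools" array shape.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # illustrative name only
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }
]

# Hypothetical conversation: a pure tool-call assistant turn (content=None)
# followed by the tool result.
messages = [
    {"role": "user", "content": "What is the weather in Berlin?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_0",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    # Some clients send arguments as a JSON string like this.
                    "arguments": "{\"location\": \"Berlin\"}",
                },
            }
        ],
    },
    {"role": "tool", "tool_call_id": "call_0", "content": "{\"temp_c\": 18}"},
]


def arguments_as_object(tool_call: dict):
    """Return the call's arguments as a Python object, parsing JSON strings."""
    args = tool_call["function"]["arguments"]
    return json.loads(args) if isinstance(args, str) else args


if __name__ == "__main__":
    print(arguments_as_object(messages[1]["tool_calls"][0]))
```

The assistant turn deliberately carries `content: None`, which is the input shape exercised by the BUG-002 fix.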

Notable Group A members: OpenHands, LangChain, LangGraph, LiteLLM, CrewAI, Pydantic AI, Open WebUI, LibreChat, LM Studio, LlamaIndex, AutoGen.

### 10.2 Inference Server Compatibility

| Backend | Compatibility note |
|---|---|
| **vLLM** | Uses the `hermes` tool parser for Qwen models, matching this template's `<tool_call>` format exactly. |
| **llama.cpp** | Recognises `<tool_call>` via the `--jinja` flag + chat template loading. Note: `--jinja` disables GBNF grammar (Issue #12204). |
| **Ollama** | Auto-detects the tool-call tag via `parseTag()`, which reads the first text node after `.ToolCalls` in the Go template tree; `<tool_call>` is one of the three known tags. |
| **LM Studio** | Passes tool definitions as the `tools` API field; receives tool calls in `message.tool_calls`. |
| **TabbyAPI** | Full OpenAI-compatible API; correct chat template is the only requirement. |

### 10.3 Non-Native Tool-Calling Frameworks

Three framework groups (Cline/Roo Code XML, OpenCode ``, Aider SEARCH/REPLACE) do not use the OpenAI tool-calling API at all. They inject their own tool descriptions into the system prompt and parse the model's text output directly. These frameworks do not interact with the chat template's tool-calling sections: they send no `tools` array, so the template emits no tool block.

### 10.4 Arguments as JSON String

Several frameworks (notably some streaming clients and older OpenAI SDK versions) serialise `tool_calls[].function.arguments` as a JSON string rather than a parsed object. The template's dual-path arguments handling (Section 6.3) accommodates both cases transparently.

---

*Generated as part of the `fix/qwen3-template-bugs` implementation.*