# Qwen3.5 / Qwen3.6 Jinja2 Chat Template — Implementation Writeup

**File:** `qwen3_5-template.jinja`  
**Validation:** `validate_template.py` (17 fixtures, 0 failures)  
**Bugs fixed:** BUG-001 through BUG-006  

---

## Table of Contents

1. [Why a New Template?](#1-why-a-new-template)
2. [Research Basis](#2-research-basis)
3. [Model Format Fundamentals](#3-model-format-fundamentals)
4. [Implementation Premises](#4-implementation-premises)
5. [enable_thinking Behavior](#5-enable_thinking-behavior)
6. [Tool Call Rendering](#6-tool-call-rendering)
7. [Bug Analysis and Fixes](#7-bug-analysis-and-fixes)
8. [Template Architecture](#8-template-architecture)
9. [Test Coverage](#9-test-coverage)
10. [Tool Ecosystem Compatibility](#10-tool-ecosystem-compatibility)

---

## 1. Why a New Template?

The official Qwen3.5/3.6 chat template (as shipped with the HuggingFace model
checkpoints) contains at least six correctness bugs that cause silent failures in
production agent loops. These bugs were independently reported across GitHub
issues, HuggingFace discussions, Reddit threads, and llama.cpp/vLLM bug trackers
between early 2025 and mid-2026.

An analysis of about five widely used community replacement templates showed
that each one fixed a different subset of the bugs while introducing new ones.
None was derived systematically from the model's training format as documented
in the official technical report.

This template was written from scratch, grounded in:

- **Qwen3 Technical Report** (arXiv:2505.09388) — authoritative description of
  the training format, thinking mechanism, and tool-calling protocol.
- **Mid-Think Paper** (arXiv:2601.07036) — phase structure of reasoning chains and
  budget-stop format.
- **Hermes tool-call format spec** (Nous Research / NousHermes) — the XML-based
  tool-call format on which Qwen3 tool-calling is modelled.
- Community bug reports and vLLM/llama.cpp/Ollama source code analysis.

---

## 2. Research Basis

### 2.1 Qwen3 Technical Report (arXiv:2505.09388)

Key facts extracted for template construction:

- No BOS token. The model was trained without one; inserting one degrades output.
- `<think>` and `</think>` are **regular BPE text tokens**, not special tokens.
  Tokenizer ID 151644 = `<|im_start|>`, 151645 = `<|im_end|>`.
- Non-thinking mode is implemented by prepending an **empty think block** to the
  assistant generation: `<think>\n\n</think>\n\n`. The report states explicitly:
  *"For non-thinking mode samples, we retain an empty thinking block in the
  assistant's response. This design ensures internal format consistency."*
- `/think` and `/no_think` are plain text suffixes in user messages, not special
  tokens. The model was fine-tuned to follow the **last** such flag encountered in
  a multi-turn conversation.

### 2.2 Vocab and Tokenizer Notes

```
Token            ID       Note
<|endoftext|>   151643   End-of-document / pad fallback
<|im_start|>    151644   Begin-of-turn
<|im_end|>      151645   End-of-turn, eos_token
```

Qwen3.5/3.6 both use a padded vocabulary of 248,320 entries; tokens above 151,646
are padding with no semantics. The tokenizer class is `Qwen2Tokenizer` (BBPE,
no `<unk>`).

### 2.3 Tool-Call Format Origin

Qwen3 tool-calling uses the **Hermes-2 XML format** (NousResearch):

```
<tool_call>
{"name": "function_name", "arguments": {"key": "value"}}
</tool_call>
```

This is identical to vLLM's `hermes` parser target and is the format recognised
by Ollama's `parseTag()` heuristic (first text node following `.ToolCalls`).

---

## 3. Model Format Fundamentals

### 3.1 ChatML Base Structure

Every conversation is encoded as a sequence of turns delimited by the
`<|im_start|>` and `<|im_end|>` control tokens. No newline appears before
`<|im_end|>`.

```
<|im_start|>system
{system_content}<|im_end|>
<|im_start|>user
{user_content}<|im_end|>
<|im_start|>assistant
<think>
{thinking}
</think>

{response}<|im_end|>
```

The blank line between `</think>` and the response is mandatory. The model was
trained on this exact whitespace layout.
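
As a rough illustration of the layout above, the following Python sketch assembles a conversation string by hand. The function name `render_chatml` and the newline emitted after each `<|im_end|>` are assumptions made for this example; the actual rendering is done by the Jinja2 template.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages):
    """Render [{'role': ..., 'content': ...}] as ChatML text (illustrative only)."""
    parts = []
    for message in messages:
        # Role goes on the header line; content sits directly before
        # <|im_end|> with no newline in between.
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    return "".join(parts)


example = render_chatml([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
    # The blank line between </think> and the response is part of the content.
    {"role": "assistant", "content": "<think>\nGreet back.\n</think>\n\nHello!"},
])
print(example)
```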

### 3.2 Non-Thinking Prefill (Character-Exact)

The non-thinking generation prefix is exactly 19 characters:

```
<think>\n\n</think>\n\n
```

Decomposed: `<think>` (7) + `\n` (1) + `\n` (1) + `</think>` (8) + `\n` (1) +
`\n` (1) = 19. Any deviation (extra space, missing newline) moves the model off
its training distribution.
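
The character count can be checked mechanically. The sketch below rebuilds the prefix from its six pieces; the constant and helper names are hypothetical and exist only for this example.

```python
NON_THINKING_PREFILL = "<think>\n\n</think>\n\n"

# The six pieces from the decomposition above, in order.
pieces = ["<think>", "\n", "\n", "</think>", "\n", "\n"]
piece_lengths = [len(piece) for piece in pieces]
total_length = sum(piece_lengths)


def is_exact_prefill(prefix):
    """Return True only for the character-exact non-thinking prefix."""
    return prefix == NON_THINKING_PREFILL


print(piece_lengths, total_length)
```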

### 3.3 Think-Block Scope Rules

| Turn type | Think-block treatment |
|---|---|
| Historical assistant turn (non-last, no tool_calls) | **Strip entirely** — `split('</think>')[-1].lstrip('\n')` |
| Historical assistant turn (has tool_calls) | **Preserve** — think block is part of the tool-call format |
| Last assistant turn in history (`add_generation_prompt=False`) | **Preserve verbatim** |
| Last assistant turn, no existing think, `enable_thinking=False` | **Inject** `<think>\n\n</think>\n\n` prefix |
| Generation prompt, `enable_thinking=True` | **No prefix** — model generates its own `<think>` |
| Generation prompt, `enable_thinking=False` | **Inject** `<think>\n\n</think>\n\n` prefix |
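
The history-related rows of the table can be read as a small decision function. The Python sketch below is one possible reading, not the template's code: the function name and its parameters are hypothetical, `is_last` is assumed to mean "last assistant turn with `add_generation_prompt=False`", and the presence of `</think>` is assumed to be the test for an existing think block.

```python
EMPTY_THINK = "<think>\n\n</think>\n\n"


def treat_assistant_content(content, *, is_last, has_tool_calls, enable_thinking):
    """Apply the think-block scope rules to one assistant turn from the history."""
    has_think = "</think>" in content
    if not is_last:
        if has_tool_calls or not has_think:
            # Tool-call turns keep their think block; plain turns need no change.
            return content
        # Historical turn without tool calls: drop the whole think block.
        return content.split("</think>")[-1].lstrip("\n")
    if has_think:
        return content  # last turn: preserve verbatim
    if not enable_thinking:
        return EMPTY_THINK + content  # last turn, non-thinking: inject empty block
    return content


stripped = treat_assistant_content(
    "<think>\nplan\n</think>\n\nDone.",
    is_last=False, has_tool_calls=False, enable_thinking=True,
)
print(repr(stripped))
```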

---

## 4. Implementation Premises

### 4.1 Single Namespace Object

All mutable template state lives in one `ns` namespace object, avoiding
Jinja2's scoping trap (variables set inside `{% for %}` blocks are not visible
outside without a namespace):

```jinja2
{%- set ns = namespace(
    enable_thinking=true,
    image_count=0,
    video_count=0
) -%}
```

### 4.2 Pre-Scan Before Rendering

The template performs a full pre-scan of all messages before emitting any output.
`/no_think` or `/think` can appear in any user message, and the last such flag
determines both the generation prompt and the treatment of the final assistant
turn. Resolving the flag up front keeps the rendering loop free of look-ahead
logic, which is awkward to express in Jinja2.

```jinja2
{%- for i in range(messages | length) -%}
  {%- if messages[i].role == 'user' -%}
    {%- set _u = messages[i].content if messages[i].content is string else '' -%}
    {%- if _u.rstrip().endswith('/no_think') -%}
      {%- set ns.enable_thinking = false -%}
    {%- elif _u.rstrip().endswith('/think') -%}
      {%- set ns.enable_thinking = true -%}
    {%- endif -%}
  {%- endif -%}
{%- endfor -%}
```

### 4.3 Separate `{{ }}` Blocks for `tojson` Output

Jinja2's `tojson` filter returns a `Markup` object (already marked HTML-safe).
When a `Markup` value is concatenated with a plain string using `+`, the plain
string is HTML-escaped before the two are joined, so quotes and angle brackets in
the literal fragment come out as entities (`&#34;`, `&lt;`, etc.). This is BUG-003.

The fix is to never concatenate `tojson` output with plain strings inside a
Jinja2 expression. Each fragment is emitted through its own `{{ }}` block:

```jinja2
{# WRONG — triggers HTML-escaping of the plain string #}
{{- '{"name": ' + tc.function.name | tojson + '}' -}}

{# CORRECT — separate blocks, no Python concatenation #}
{{- '{"name": ' -}}{{- tc.function.name | tojson -}}{{- '}' -}}
```
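
The escaping mechanism can be reproduced without Jinja2. The sketch below uses a tiny stand-in class, `SafeStr`, which is hypothetical and only imitates how a `Markup`-style string escapes plain operands of `+`; it is not the MarkupSafe implementation.

```python
import json


class SafeStr(str):
    """Minimal imitation of a Markup-style string (illustration only)."""

    @staticmethod
    def _escape(value):
        if isinstance(value, SafeStr):
            return str(value)  # already marked safe, leave untouched
        return (
            str(value)
            .replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace('"', "&#34;")
            .replace("'", "&#39;")
        )

    def __add__(self, other):
        # safe + plain: the plain right-hand operand is escaped.
        return SafeStr(str(self) + SafeStr._escape(other))

    def __radd__(self, other):
        # plain + safe: the plain left-hand operand is escaped.
        return SafeStr(SafeStr._escape(other) + str(self))


name_json = SafeStr(json.dumps("get_weather"))

# Concatenation with + escapes the literal fragment.
wrong = '{"name": ' + name_json + '}'

# Emitting each fragment separately (here: join) never calls __add__.
right = "".join(['{"name": ', name_json, '}'])

print(wrong)
print(right)
```

Joining the fragments is the Python analogue of emitting each one through its own `{{ }}` block: no operand ever passes through the escaping `+` operator.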

### 4.4 System Message Collection Phase

Multiple system messages are merged into a single `<|im_start|>system` turn
with `\n\n` as separator (BUG-004 fix). This is done as a separate pre-pass
(Section 4 in the template), so the main loop can unconditionally skip all
`role == 'system'` messages.

The user's system content always appears **before** the tools block in the
system turn, matching the training format.
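
A Python sketch of the merge step follows. The function name is hypothetical, and skipping empty or non-string content is an assumption made for this example rather than documented template behaviour.

```python
def merge_system_content(messages):
    """Join the content of all system messages with a blank line between them."""
    merged = ""
    for message in messages:
        if message.get("role") != "system":
            continue
        content = message.get("content")
        if not isinstance(content, str) or content == "":
            continue  # assumption: nothing to merge for empty/non-string content
        merged = content if merged == "" else merged + "\n\n" + content
    return merged


merged_example = merge_system_content([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
    {"role": "system", "content": "Answer briefly."},
])
print(repr(merged_example))
```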

### 4.5 Tool Normalisation

Some frameworks pass tool definitions with a top-level `function` key
(`{"type": "function", "function": {...}}`), while others pass the function
schema directly (`{"name": ..., "parameters": ...}`). The template normalises
all entries to the canonical form before serialisation:

```jinja2
{%- if tool.function is defined -%}
  {%- set ns_tb.list = ns_tb.list + [tool] -%}
{%- else -%}
  {%- set ns_tb.list = ns_tb.list + [{"type": "function", "function": tool}] -%}
{%- endif -%}
```
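
The same normalisation, sketched in Python for readers who want to pre-process tool lists outside the template. The function name `normalise_tools` is hypothetical.

```python
def normalise_tools(tools):
    """Wrap bare function schemas so every entry has the canonical shape."""
    normalised = []
    for tool in tools or []:
        if "function" in tool:
            normalised.append(tool)  # already {"type": "function", "function": {...}}
        else:
            normalised.append({"type": "function", "function": tool})
    return normalised


bare = {"name": "get_weather", "parameters": {"type": "object", "properties": {}}}
wrapped = {"type": "function", "function": bare}
normalised_example = normalise_tools([bare, wrapped])
print(normalised_example)
```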

---

## 5. `enable_thinking` Behavior

### 5.1 Resolution Priority (Highest to Lowest)

1. **`/no_think` or `/think` text suffix** in the last user message that contains
   one. This is the highest priority because it represents the most recent
   explicit user intent and mirrors the model's fine-tuning data.
2. **`enable_thinking` template variable** passed at render time (e.g., via
   `tokenizer.apply_chat_template(..., enable_thinking=False)`).
3. **Default value** of `true` (thinking on by default, consistent with the model's
   training distribution).
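
The three priority levels can be sketched as a single Python function. This is an illustration of the resolution order only: the function name is hypothetical, and `None` is used here to stand for "variable not passed at render time".

```python
def resolve_enable_thinking(messages, enable_thinking=None):
    """Resolve the effective thinking mode for a conversation."""
    # Levels 3 and 2: default of True, overridden by the render-time variable.
    resolved = True if enable_thinking is None else enable_thinking
    # Level 1: the last /no_think or /think suffix in any user message wins.
    for message in messages:
        if message.get("role") != "user":
            continue
        content = message.get("content")
        text = content if isinstance(content, str) else ""
        stripped = text.rstrip()
        if stripped.endswith("/no_think"):
            resolved = False
        elif stripped.endswith("/think"):
            resolved = True
    return resolved


conversation = [
    {"role": "user", "content": "Summarise this. /no_think"},
    {"role": "assistant", "content": "Summary."},
    {"role": "user", "content": "Now explain why. /think  "},
]
print(resolve_enable_thinking(conversation, enable_thinking=False))
```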

### 5.2 Generation Prompt Behaviour

When `add_generation_prompt=True`:

```
enable_thinking=True  →  <|im_start|>assistant\n
                         (model generates <think> itself)

enable_thinking=False →  <|im_start|>assistant\n<think>\n\n</think>\n\n
                         (forces non-thinking mode by pre-filling empty block)
```
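
Expressed as a Python helper (the name `generation_prompt` is hypothetical), the two cases look like this:

```python
def generation_prompt(enable_thinking):
    """Return the text appended when add_generation_prompt=True."""
    prompt = "<|im_start|>assistant\n"
    if not enable_thinking:
        # Pre-fill the empty think block to force non-thinking mode.
        prompt += "<think>\n\n</think>\n\n"
    return prompt


print(repr(generation_prompt(True)))
print(repr(generation_prompt(False)))
```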

### 5.3 Last-History-Turn Behaviour (`add_generation_prompt=False`)

When the conversation ends with an assistant message and no generation prompt
is requested — typical when scoring a complete conversation or when the
assistant message is being appended to the prompt for continuation:

- **Think block present:** preserved verbatim regardless of `enable_thinking`.
- **No think block, `enable_thinking=True`:** content left as-is (historical turns
  are already stripped; the last one is the current generation context).
- **No think block, `enable_thinking=False`:** inject `<think>\n\n</think>\n\n`
  before the content.

### 5.4 Historical Think-Block Stripping (BUG-001)

The official template collapses think blocks in historical turns to
`<think>\n\n</think>` instead of removing them. In a long agentic loop this
produces an ever-growing sequence of empty think blocks that degrades prompt
quality ("prompt poisoning").

The correct operation is full removal:

```python
# Python equivalent
content = content.split('</think>')[-1].lstrip('\n') if '</think>' in content else content
```

```jinja2
{# Jinja2 equivalent #}
{%- if '</think>' in _ac -%}
  {%- set _ac = _ac.split('</think>')[-1].lstrip('\n') -%}
{%- endif -%}
```

**Exception:** turns that also carry `tool_calls` keep their think block intact.
The model is trained to produce thinking before tool invocations, and stripping
the think block from a historical tool-call turn would misrepresent the prompt.

---

## 6. Tool Call Rendering

### 6.1 System Turn Tool Block Format

The exact text injected into the system message when tools are present matches
the Qwen3 Hermes training format:

```
# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "...", ...}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
```

All text — including the instruction sentences — is literal and must not be
modified. The model was trained on this exact phrasing.

### 6.2 Assistant Tool-Call Block

Each tool call is rendered as:

```
<tool_call>
{"name": "function_name", "arguments": {JSON_OBJECT}}
</tool_call>
```

Multiple parallel calls appear as consecutive blocks separated by `\n`:

```
<tool_call>
{"name": "f1", "arguments": {...}}
</tool_call>
<tool_call>
{"name": "f2", "arguments": {...}}
</tool_call><|im_end|>
```

Note: the final `</tool_call>` is immediately followed by `<|im_end|>` with no
intervening newline. This matches the training format.
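
A Python sketch of this layout is shown below. It assumes the arguments are already parsed objects and uses `json.dumps` as a stand-in for the template's `tojson` filter, whose character escaping and key ordering may differ. The function name is hypothetical.

```python
import json


def render_tool_calls(tool_calls):
    """Render OpenAI-style tool calls as consecutive <tool_call> blocks."""
    blocks = []
    for call in tool_calls:
        function = call["function"]
        payload = json.dumps(
            {"name": function["name"], "arguments": function["arguments"]},
            ensure_ascii=False,
        )
        blocks.append(f"<tool_call>\n{payload}\n</tool_call>")
    # Blocks are separated by "\n"; <|im_end|> follows the last one directly.
    return "\n".join(blocks) + "<|im_end|>"


rendered_calls = render_tool_calls([
    {"function": {"name": "f1", "arguments": {"x": 1}}},
    {"function": {"name": "f2", "arguments": {}}},
])
print(rendered_calls)
```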

### 6.3 Arguments: String vs Object (BUG-006)

Some frameworks (notably older OpenAI-compatible clients and some streaming
implementations) serialise tool-call arguments as a JSON string
(`"{\"location\": \"Berlin\"}"`) rather than as an object
(`{"location": "Berlin"}`). The template handles both:

```jinja2
{%- if tc.function.arguments is string -%}
  {{- ', "arguments": ' + tc.function.arguments -}}
{%- else -%}
  {{- ', "arguments": ' -}}{{- tc.function.arguments | tojson -}}
{%- endif -%}
```

When arguments are already a string they are passed through as-is (the caller
is responsible for valid JSON). When they are a dict/object, `tojson` serialises
them correctly including Unicode escaping and quote escaping.

This arrangement also prevents the `"""` failure (BUG-006): templates that
assemble the arguments JSON by string interpolation break when a value contains
quote sequences such as `"""`. Because `tojson` emits a properly escaped JSON
string literal, such values cannot corrupt the output.
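
The dual-path handling can be mirrored in Python to see why quote-heavy values stay well-formed. In this sketch `json.dumps` stands in for `tojson`, and the helper name `render_arguments` is hypothetical.

```python
import json


def render_arguments(arguments):
    """Return the ', "arguments": ...' fragment for a tool call."""
    if isinstance(arguments, str):
        # Already a JSON string: passed through unchanged (caller must ensure validity).
        return ', "arguments": ' + arguments
    # Parsed object: serialise it, which escapes embedded quotes.
    return ', "arguments": ' + json.dumps(arguments)


tricky = {"code": 'print("""hi""")'}
fragment = '{"name": "run"' + render_arguments(tricky) + "}"
round_tripped = json.loads(fragment)

passthrough = render_arguments('{"location": "Berlin"}')
print(fragment)
print(passthrough)
```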

### 6.4 Tool Results

Tool results are wrapped in a user turn using `<tool_response>`:

```
<|im_start|>user
<tool_response>
{result_content}
</tool_response><|im_end|>
```

Consecutive tool-response messages are merged into a single user turn — the
template checks whether the previous message's role was also `tool` and
suppresses the `<|im_start|>user\n` header if so.
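
The grouping logic can be sketched in Python as follows. The function name is hypothetical, and two details are assumptions made for the example: consecutive responses are separated by a single `\n`, and `<|im_end|>` is followed by a newline.

```python
def render_tool_messages(messages):
    """Render tool messages, merging consecutive ones into one user turn."""
    out = []
    for index, message in enumerate(messages):
        if message["role"] != "tool":
            continue
        prev_is_tool = index > 0 and messages[index - 1]["role"] == "tool"
        next_is_tool = (
            index + 1 < len(messages) and messages[index + 1]["role"] == "tool"
        )
        if not prev_is_tool:
            out.append("<|im_start|>user\n")  # header only for the first response
        out.append(f"<tool_response>\n{message['content']}\n</tool_response>")
        # Close the turn only after the last response of the run.
        out.append("\n" if next_is_tool else "<|im_end|>\n")
    return "".join(out)


grouped = render_tool_messages([
    {"role": "tool", "content": "A"},
    {"role": "tool", "content": "B"},
])
print(grouped)
```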

---

## 7. Bug Analysis and Fixes

### BUG-001 — Historical Think Blocks Leaked (CRITICAL)

**Symptom:** In multi-turn conversations with `enable_thinking=True`, every
historical assistant message retains a collapsed `<think>\n\n</think>` block.
Over many turns the prompt accumulates dozens of empty think blocks, degrading
model performance.

**Root cause:** The official template strips the think content but leaves the
surrounding `<think>` tags in place.

**Fix:** Strip the entire block by splitting on `</think>` and taking the tail:

```jinja2
{%- set _ac = _ac.split('</think>')[-1].lstrip('\n') -%}
```

**Tests:** T10, T13, T16

---

### BUG-002 — KeyError on content=None / Missing content Key (HIGH)

**Symptom:** When an assistant message contains only `tool_calls` and no `content`
(or `content=None`, which is the OpenAI convention for pure tool-call responses),
the template throws `UndefinedError` or `KeyError`.

**Root cause:** The official template accesses `message.content` directly.

**Fix:** Guard the access:

```jinja2
{%- if message.content is defined and message.content is string -%}
  {%- set _ac = message.content -%}
{%- elif message.content is defined and message.content is iterable ... -%}
  {%- set _ac = render_content(message.content) -%}
{%- else -%}
  {%- set _ac = '' -%}
{%- endif -%}
```

**Tests:** T04, T11

---

### BUG-003 — Markup HTML-Escaping in Tool JSON (MEDIUM)

**Symptom:** Tool definitions or tool-call arguments with characters like `<`, `>`,
`&`, or `"` appear HTML-escaped in the rendered prompt (`&lt;`, `&gt;`, `&amp;`,
`&#34;`). This causes the model to misread the tool schema.

**Root cause:** `tojson` returns a Jinja2 `Markup` object. When `Markup` is
concatenated with a plain Python string using `+` inside a Jinja2 expression,
the plain string is auto-escaped and then concatenated with the already-safe
`Markup` value.

**Fix:** Never use `+` to join `tojson` output with plain strings. Emit each
fragment through a separate `{{ }}` block:

```jinja2
{# Every fragment in its own block #}
{{- '{"name": ' -}}{{- tc.function.name | tojson -}}
```

**Tests:** T03, T04, T12

---

### BUG-004 — Multiple System Messages Not Handled (MEDIUM)

**Symptom:** Frameworks such as Open WebUI send more than one `role: system`
message. The official template either crashes or emits multiple system turns,
both of which confuse the model.

**Root cause:** No merging logic for multiple system messages.

**Fix:** Pre-scan all messages and concatenate system content with `\n\n`:

```jinja2
{%- if ns_sys.content == '' -%}
  {%- set ns_sys.content = _c -%}
{%- else -%}
  {%- set ns_sys.content = ns_sys.content + '\n\n' + _c -%}
{%- endif -%}
```

**Tests:** T02, T14

---

### BUG-005 — Wrong Non-Thinking Prefill Whitespace (LOW-MEDIUM)

**Symptom:** Non-thinking mode produces a think block with incorrect whitespace,
moving the model off its training distribution and causing output quality
degradation or refusal to honour the non-thinking instruction.

**Root cause:** The official template uses `<think>\n</think>\n\n` (missing the
second newline inside the block), which does not match the format described in
the technical report.

**Fix:** Use the exact 19-character sequence:

```
<think>\n\n</think>\n\n
```

**Tests:** T08, T17

---

### BUG-006 — Triple-Quote Crash on Python String Arguments (MEDIUM)

**Symptom:** Jinja2 raises a `TemplateSyntaxError` or produces garbled output when
tool-call arguments contain triple-quote sequences (`"""` or `'''`) because the
template previously embedded argument values using Python string literal
concatenation.

**Root cause:** Some community templates build the tool-call JSON via string
interpolation (`'{"arguments": "' + args + '"}'`), which breaks for argument
values containing quote characters.

**Fix:** Use `tojson` for all non-string arguments (produces well-formed JSON) and
pass string arguments through unchanged (caller provides valid JSON strings):

```jinja2
{%- if tc.function.arguments is string -%}
  {{- ', "arguments": ' + tc.function.arguments -}}
{%- else -%}
  {{- ', "arguments": ' -}}{{- tc.function.arguments | tojson -}}
{%- endif -%}
```

**Tests:** T12

---

## 8. Template Architecture

The template is divided into eight clearly delimited sections, each with a
comment header:

```
Section 1  render_content macro
           Handles str / list (image/video/text) / None → plain text.
           Increments ns.image_count / ns.video_count for vision tokens.

Section 2  Namespace initialisation
           Single ns object; enable_thinking defaults to true.

Section 3  Pre-scan
           Walk all user messages; last /no_think or /think wins.

Section 4  Collect system content
           Merge all system / developer messages with \n\n.

Section 5  Build tools list
           Normalise every tool to {"type":"function","function":{...}}.

Section 6  Output system turn
           Emit one <|im_start|>system turn (user content + tools block).

Section 7  Main message loop
           7a  system/developer  → skip (already emitted)
           7b  user              → render with vision support
           7c  assistant         → render with think-block logic + tool_calls
           7d  tool              → group into user turns
           7e  unknown role      → raise_exception

Section 8  Generation prompt
           enable_thinking=True  → bare <|im_start|>assistant\n
           enable_thinking=False → add <think>\n\n</think>\n\n prefix
```

### Design Decisions

**No default system prompt.** Unlike some community templates, this template does
not inject a default system prompt when none is provided. The model performs well
without one, and injecting one would cause conflicts for applications that rely on
the system prompt being exactly what they set.

**No BOS token.** The Qwen3 family was trained without a BOS token. Adding one
would consume a context window slot unnecessarily and may harm performance.

**No `<|endoftext|>` in conversation.** This token is reserved for
end-of-document signalling in the pre-training phase, not for conversation
boundaries.

---

## 9. Test Coverage

The 17 test fixtures in `validate_template.py` cover:

| ID | Scenario | Key assertion |
|---|---|---|
| T01 | Simple user/assistant, no system, no tools | Exact ChatML output |
| T02 | System message | System turn before user turn |
| T03 | Tools defined, `enable_thinking=True` | Tools block in system; no prefill |
| T04 | Tool call, `content=None` | No crash; `<tool_call>` present |
| T05 | Parallel tool calls | `</tool_call>\n<tool_call>` separator |
| T06 | Tool result (role=tool) | `<|im_start|>user\n<tool_response>` |
| T07 | `enable_thinking=True` generation prompt | No think prefix emitted |
| T08 | `enable_thinking=False` generation prompt | Exact 19-char prefill |
| T09 | `/no_think` flag in user message | Non-thinking prefill applied |
| T10 | Historical think blocks | Fully stripped, not collapsed |
| T11 | Missing `content` key on assistant | No KeyError / UndefinedError |
| T12 | Special chars in arguments | Correctly JSON-escaped |
| T13 | Historical tool-call turn with think | Think block preserved |
| T14 | Multiple system messages | Merged with `\n\n`; single system turn |
| T15 | Parallel tool responses | Both inside single user turn |
| T16 | Last history turn with existing think | Preserved verbatim |
| T17 | Last history turn, no think, `enable_thinking=False` | Prefill injected |

Run the suite:

```bash
cd /workspace/project/qwen3_5-template
python validate_template.py
# Expected: 17 passed, 0 failed
```

---

## 10. Tool Ecosystem Compatibility

An analysis of 51 tool-calling frameworks and inference backends was conducted to
verify that the template's output is consumable by the broadest possible set of
tools. Key findings:

### 10.1 OpenAI JSON Format Dominance

31 of the 51 analysed tools use the **OpenAI-compatible JSON function-call API**
(Group A). These tools pass tool definitions as a `tools` array and receive tool
calls back as `message.tool_calls` objects. The template's input format is fully
compatible with this convention.

Notable Group A members: OpenHands, LangChain, LangGraph, LiteLLM, CrewAI,
Pydantic AI, Open WebUI, LibreChat, LM Studio, LlamaIndex, AutoGen.

### 10.2 Inference Server Compatibility

| Backend | Compatibility note |
|---|---|
| **vLLM** | Uses the `hermes` tool parser for Qwen models, matching this template's `<tool_call>` format exactly. |
| **llama.cpp** | Recognises `<tool_call>` via the `--jinja` flag + chat template loading. Note: `--jinja` disables GBNF grammar (Issue #12204). |
| **Ollama** | Auto-detects the tool-call tag via `parseTag()` which reads the first text node after `.ToolCalls` in the Go template tree — `<tool_call>` is one of the three known tags. |
| **LM Studio** | Passes tool definitions as the `tools` API field; receives tool calls in `message.tool_calls`. |
| **TabbyAPI** | Full OpenAI-compatible API; correct chat template is the only requirement. |

### 10.3 Non-Native Tool-Calling Frameworks

Three framework groups (Cline/Roo Code XML, OpenCode `<parameter>`, Aider
SEARCH/REPLACE) do not use the OpenAI tool-calling API at all. They inject their
own tool descriptions into the system prompt and parse the model's text output
directly. These frameworks do not interact with the chat template's tool-calling
sections — they send no `tools` array and the template therefore emits no tool
block.

### 10.4 Arguments as JSON String

Several frameworks (notably some streaming clients and older OpenAI SDK versions)
serialise `tool_calls[].function.arguments` as a JSON string rather than a parsed
object. The template's dual-path arguments handling (Section 6.3) accommodates
both cases transparently.

---

*Generated as part of the `fix/qwen3-template-bugs` implementation.*