Instructions to use StableQuant/Qwen-Templates-Rebuild-Project with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use StableQuant/Qwen-Templates-Rebuild-Project with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="StableQuant/Qwen-Templates-Rebuild-Project")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("StableQuant/Qwen-Templates-Rebuild-Project", dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use StableQuant/Qwen-Templates-Rebuild-Project with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "StableQuant/Qwen-Templates-Rebuild-Project"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "StableQuant/Qwen-Templates-Rebuild-Project",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/StableQuant/Qwen-Templates-Rebuild-Project

SGLang

How to use StableQuant/Qwen-Templates-Rebuild-Project with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "StableQuant/Qwen-Templates-Rebuild-Project" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "StableQuant/Qwen-Templates-Rebuild-Project",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "StableQuant/Qwen-Templates-Rebuild-Project" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "StableQuant/Qwen-Templates-Rebuild-Project",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use StableQuant/Qwen-Templates-Rebuild-Project with Docker Model Runner:
```
docker model run hf.co/StableQuant/Qwen-Templates-Rebuild-Project
```

openhands commited on 4 days ago

Commit

4d2228f

1 Parent(s): 96721e2

Update to v1.1 template structure

Browse files

Files changed (2) hide show

v1.0_rebuild_qwen3.5_and_3.6_template.jinja → qwen3-5_6-template_v1.1.jinja +72 -11
v1.0_writeup.md +0 -646

v1.0_rebuild_qwen3.5_and_3.6_template.jinja → qwen3-5_6-template_v1.1.jinja RENAMED Viewed

@@ -28,13 +28,18 @@
 {#- ===== SECTION 2: NAMESPACE INITIALISATION =====
      Single ns object for all mutable state.
-     enable_thinking default=true; overridden by template parameter (BUG-003 fix).
 -#}
 {%- set ns = namespace(
     enable_thinking=true,
     image_count=0,
     video_count=0
 ) -%}
 {%- if enable_thinking is defined -%}
   {%- if enable_thinking -%}
     {%- set ns.enable_thinking = true -%}
@@ -43,28 +48,49 @@
   {%- endif -%}
 {%- endif -%}
 {#- ===== SECTION 3: PRE-SCAN =====
      Track last /no_think or /think flag in user messages.
      The model follows the last flag encountered in multi-turn conversations.
 -#}
 {%- for i in range(messages | length) -%}
-  {%- if messages[i].role == 'user' -%}
-    {%- set _u = messages[i].content if messages[i].content is string else '' -%}
     {%- if _u.rstrip().endswith('/no_think') -%}
       {%- set ns.enable_thinking = false -%}
     {%- elif _u.rstrip().endswith('/think') -%}
       {%- set ns.enable_thinking = true -%}
     {%- endif -%}
   {%- endif -%}
 {%- endfor -%}
 {#- ===== SECTION 4: COLLECT SYSTEM CONTENT =====
      Merge all system/developer messages with \n\n separator (BUG-004 fix).
 -#}
 {%- set ns_sys = namespace(content='') -%}
 {%- for msg in messages -%}
   {%- if msg.role == 'system' or msg.role == 'developer' -%}
     {%- set _c = render_content(msg.content | default('')) | trim -%}
     {%- if _c -%}
       {%- if ns_sys.content == '' -%}
         {%- set ns_sys.content = _c -%}
@@ -127,7 +153,9 @@
   {#- 7c: Assistant messages -#}
   {%- elif message.role == 'assistant' -%}
-    {#- Safely extract content as string — guard against absent key (BUG-002 fix) -#}
     {%- if message.content is defined and message.content is string -%}
       {%- set _ac = message.content -%}
     {%- elif message.content is defined and message.content is iterable and message.content is not mapping -%}
@@ -136,6 +164,15 @@
       {%- set _ac = '' -%}
     {%- endif -%}
     {#- Collect tool_calls if present -#}
     {%- set _tc = message.tool_calls if message.tool_calls is defined and message.tool_calls else [] -%}
@@ -154,15 +191,27 @@
     {#- Think-block handling (BUG-001 fix + last-turn preservation):
         - Tool-call turns   : never strip (think block is part of the tool-call format)
         - Last-history turn : preserve; inject non-thinking prefill when absent
-        - Historical turns  : strip the think block -#}
     {%- if not _tc -%}
       {%- if _is_last_hist -%}
         {%- if '<think>' not in _ac and not ns.enable_thinking -%}
           {%- set _ac = '<think>\n\n</think>\n\n' + _ac -%}
         {%- endif -%}
       {%- else -%}
         {%- if '</think>' in _ac -%}
-          {%- set _ac = _ac.split('</think>')[-1].lstrip('\n') -%}
         {%- endif -%}
       {%- endif -%}
     {%- endif -%}
@@ -205,8 +254,10 @@
     {%- endif -%}
     {{- '<tool_response>\n' -}}
     {{- message.content | default('') -}}
-    {{- '\n</tool_response>' -}}
-    {%- if _next_role != 'tool' -%}
       {{- '<|im_end|>\n' -}}
     {%- endif -%}
@@ -218,12 +269,22 @@
 {%- endfor -%}
 {#- ===== SECTION 8: GENERATION PROMPT =====
-     enable_thinking=True  → no prefill (model generates <think> itself)
-     enable_thinking=False → exact 19-char non-thinking prefill (BUG-005 fix)
 -#}
 {%- if add_generation_prompt -%}
   {{- '<|im_start|>assistant\n' -}}
-  {%- if not ns.enable_thinking -%}
     {{- '<think>\n\n</think>\n\n' -}}
   {%- endif -%}
 {%- endif -%}

 {#- ===== SECTION 2: NAMESPACE INITIALISATION =====
      Single ns object for all mutable state.
+     enable_thinking  default=true  (BUG-003 fix)
+     preserve_thinking default=true: when false, suppresses think-block output in
+                       generation prompt and overrides enable_thinking to false.
+                       Passed via --chat-template-kwargs {"preserve_thinking":false}.
 -#}
 {%- set ns = namespace(
     enable_thinking=true,
     image_count=0,
     video_count=0
 ) -%}
+{#- Resolve enable_thinking kwarg -#}
 {%- if enable_thinking is defined -%}
   {%- if enable_thinking -%}
     {%- set ns.enable_thinking = true -%}
   {%- endif -%}
 {%- endif -%}
+{#- Resolve preserve_thinking kwarg.
+    preserve_thinking=false  => force non-thinking mode (same as enable_thinking=false).
+    preserve_thinking=true   => default, no override (thinking controlled by enable_thinking).
+    When not defined         => default, no override.
+-#}
+{%- if preserve_thinking is defined and not preserve_thinking -%}
+  {%- set ns.enable_thinking = false -%}
+{%- endif -%}
 {#- ===== SECTION 3: PRE-SCAN =====
      Track last /no_think or /think flag in user messages.
+     Also scan system messages for <|think_off|> / <|think_on|> markers
+     (allows apps to control thinking mode via system prompt injection).
      The model follows the last flag encountered in multi-turn conversations.
 -#}
 {%- for i in range(messages | length) -%}
+  {%- set _msg = messages[i] -%}
+  {%- if _msg.role == 'user' -%}
+    {%- set _u = _msg.content if _msg.content is string else '' -%}
     {%- if _u.rstrip().endswith('/no_think') -%}
       {%- set ns.enable_thinking = false -%}
     {%- elif _u.rstrip().endswith('/think') -%}
       {%- set ns.enable_thinking = true -%}
     {%- endif -%}
+  {%- elif _msg.role == 'system' or _msg.role == 'developer' -%}
+    {%- set _s = _msg.content if _msg.content is string else '' -%}
+    {%- if '<|think_off|>' in _s -%}
+      {%- set ns.enable_thinking = false -%}
+    {%- elif '<|think_on|>' in _s -%}
+      {%- set ns.enable_thinking = true -%}
+    {%- endif -%}
   {%- endif -%}
 {%- endfor -%}
 {#- ===== SECTION 4: COLLECT SYSTEM CONTENT =====
      Merge all system/developer messages with \n\n separator (BUG-004 fix).
+     <|think_off|> / <|think_on|> markers are stripped from output.
 -#}
 {%- set ns_sys = namespace(content='') -%}
 {%- for msg in messages -%}
   {%- if msg.role == 'system' or msg.role == 'developer' -%}
     {%- set _c = render_content(msg.content | default('')) | trim -%}
+    {%- set _c = _c | replace('<|think_off|>', '') | replace('<|think_on|>', '') | trim -%}
     {%- if _c -%}
       {%- if ns_sys.content == '' -%}
         {%- set ns_sys.content = _c -%}
   {#- 7c: Assistant messages -#}
   {%- elif message.role == 'assistant' -%}
+    {#- Safely extract content as string — guard against absent key (BUG-002 fix).
+        Also support message.reasoning_content as an explicit think-block source
+        (used by some frameworks that store thinking separately from content). -#}
     {%- if message.content is defined and message.content is string -%}
       {%- set _ac = message.content -%}
     {%- elif message.content is defined and message.content is iterable and message.content is not mapping -%}
       {%- set _ac = '' -%}
     {%- endif -%}
+    {#- Reconstruct content from reasoning_content + content when the framework
+        stores thinking separately (e.g. OpenAI-style reasoning_content field).
+        Only apply when no think-block already present in _ac. -#}
+    {%- if message.reasoning_content is defined and message.reasoning_content is string
+        and message.reasoning_content | trim
+        and '<think>' not in _ac -%}
+      {%- set _ac = '<think>\n' + message.reasoning_content | trim + '\n</think>\n\n' + _ac -%}
+    {%- endif -%}
     {#- Collect tool_calls if present -#}
     {%- set _tc = message.tool_calls if message.tool_calls is defined and message.tool_calls else [] -%}
     {#- Think-block handling (BUG-001 fix + last-turn preservation):
         - Tool-call turns   : never strip (think block is part of the tool-call format)
         - Last-history turn : preserve; inject non-thinking prefill when absent
+        - Historical turns  : strip using fuzzy end-tag matching to handle
+                              </think>, </thinking>, </ think>, </think > variants -#}
     {%- if not _tc -%}
       {%- if _is_last_hist -%}
         {%- if '<think>' not in _ac and not ns.enable_thinking -%}
           {%- set _ac = '<think>\n\n</think>\n\n' + _ac -%}
         {%- endif -%}
       {%- else -%}
+        {#- Fuzzy end-tag detection for historical turn stripping -#}
+        {%- set _think_end = '' -%}
         {%- if '</think>' in _ac -%}
+          {%- set _think_end = '</think>' -%}
+        {%- elif '</thinking>' in _ac -%}
+          {%- set _think_end = '</thinking>' -%}
+        {%- elif '</ think>' in _ac -%}
+          {%- set _think_end = '</ think>' -%}
+        {%- elif '</think >' in _ac -%}
+          {%- set _think_end = '</think >' -%}
+        {%- endif -%}
+        {%- if _think_end -%}
+          {%- set _ac = _ac.split(_think_end)[-1].lstrip('\n') -%}
         {%- endif -%}
       {%- endif -%}
     {%- endif -%}
     {%- endif -%}
     {{- '<tool_response>\n' -}}
     {{- message.content | default('') -}}
+    {%- if _next_role == 'tool' -%}
+      {{- '\n</tool_response>\n' -}}
+    {%- else -%}
+      {{- '\n</tool_response>' -}}
       {{- '<|im_end|>\n' -}}
     {%- endif -%}
 {%- endfor -%}
 {#- ===== SECTION 8: GENERATION PROMPT =====
+     enable_thinking=True  → open <think>\n prefill so llama.cpp reasoning-budget
+                             and other inference engines can hook into the think-stream.
+                             The model continues generating inside the open block.
+     enable_thinking=False → exact non-thinking prefill: <think>\n\n</think>\n\n
+                             (19-char closed block, BUG-005 fix)
+     NOTE: The <think>\n opener is EPHEMERAL — it lives only in the generation
+     prompt, never in chat history. Historical think-block stripping (BUG-001)
+     is handled in Section 7c and is entirely unaffected by this change.
+     No context poisoning risk.
 -#}
 {%- if add_generation_prompt -%}
   {{- '<|im_start|>assistant\n' -}}
+  {%- if ns.enable_thinking -%}
+    {{- '<think>\n' -}}
+  {%- else -%}
     {{- '<think>\n\n</think>\n\n' -}}
   {%- endif -%}
 {%- endif -%}

v1.0_writeup.md DELETED Viewed

@@ -1,646 +0,0 @@
-# Qwen3.5 / Qwen3.6 Jinja2 Chat Template — Implementation Writeup
-**File:** `qwen3_5-template.jinja`
-**Validation:** `validate_template.py` (17 fixtures, 0 failures)
-**Bugs fixed:** BUG-001 through BUG-006
----
-## Table of Contents
-1. [Why a New Template?](#1-why-a-new-template)
-2. [Research Basis](#2-research-basis)
-3. [Model Format Fundamentals](#3-model-format-fundamentals)
-4. [Implementation Premises](#4-implementation-premises)
-5. [enable_thinking Behavior](#5-enable_thinking-behavior)
-6. [Tool Call Rendering](#6-tool-call-rendering)
-7. [Bug Analysis and Fixes](#7-bug-analysis-and-fixes)
-8. [Template Architecture](#8-template-architecture)
-9. [Test Coverage](#9-test-coverage)
-10. [Tool Ecosystem Compatibility](#10-tool-ecosystem-compatibility)
----
-## 1. Why a New Template?
-The official Qwen3.5/3.6 chat template (as shipped with the HuggingFace model
-checkpoints) contains at least six correctness bugs that cause silent failures in
-production agent loops. These bugs were independently reported across GitHub
-issues, HuggingFace discussions, Reddit threads, and llama.cpp/vLLM bug trackers
-between early 2025 and mid-2026.
-An analysis of approximately five widely-used community replacement templates
-showed that each one fixed a different subset of the bugs while introducing new
-ones. None were derived systematically from the model's training format as
-documented in the official technical report.
-This template was written from scratch, grounded in:
-- **Qwen3 Technical Report** (arXiv:2505.09388) — authoritative description of
-  the training format, thinking mechanism, and tool-calling protocol.
-- **Mid-Think Paper** (arXiv:2601.07036) — phase structure of reasoning chains and
-  budget-stop format.
-- **Hermes tool-call format spec** (Nous Research / NousHermes) — the XML-based
-  tool-call format on which Qwen3 tool-calling is modelled.
-- Community bug reports and vLLM/llama.cpp/Ollama source code analysis.
----
-## 2. Research Basis
-### 2.1 Qwen3 Technical Report (arXiv:2505.09388)
-Key facts extracted for template construction:
-- No BOS token. The model was trained without one; inserting one degrades output.
-- `<think>` and `</think>` are **regular BPE text tokens**, not special tokens.
-  Tokenizer ID 151644 = `<|im_start|>`, 151645 = `<|im_end|>`.
-- Non-thinking mode is implemented by prepending an **empty think block** to the
-  assistant generation: `<think>\n\n</think>\n\n`. The report states explicitly:
-  *"For non-thinking mode samples, we retain an empty thinking block in the
-  assistant's response. This design ensures internal format consistency."*
-- `/think` and `/no_think` are plain text suffixes in user messages, not special
-  tokens. The model was fine-tuned to follow the **last** such flag encountered in
-  a multi-turn conversation.
-### 2.2 Vocab and Tokenizer Notes
-```
-Token            ID       Note
-<|endoftext|>   151643   End-of-document / pad fallback
-<|im_start|>    151644   Begin-of-turn
-<|im_end|>      151645   End-of-turn, eos_token
-```
-Qwen3.5/3.6 both use a padded vocabulary of 248,320 entries; tokens above 151,646
-are padding with no semantics. The tokenizer class is `Qwen2Tokenizer` (BBPE,
-no `<unk>`).
-### 2.3 Tool-Call Format Origin
-Qwen3 tool-calling uses the **Hermes-2 XML format** (NousResearch):
-```
-<tool_call>
-{"name": "function_name", "arguments": {"key": "value"}}
-</tool_call>
-```
-This is identical to vLLM's `hermes` parser target and is the format recognised
-by Ollama's `parseTag()` heuristic (first text node following `.ToolCalls`).
----
-## 3. Model Format Fundamentals
-### 3.1 ChatML Base Structure
-Every conversation is encoded as a sequence of turns delimited by im-start/end
-control tokens. No newline appears before `<|im_end|>`.
-```
-<|im_start|>system
-{system_content}<|im_end|>
-<|im_start|>user
-{user_content}<|im_end|>
-<|im_start|>assistant
-<think>
-{thinking}
-</think>
-{response}<|im_end|>
-```
-The blank line between `</think>` and the response is mandatory. The model was
-trained on this exact whitespace layout.
-### 3.2 Non-Thinking Prefill (Character-Exact)
-The non-thinking generation prefix is exactly 19 characters:
-```
-<think>\n\n</think>\n\n
-```
-Decomposed: `<think>` (7) + `\n` (1) + `\n` (1) + `</think>` (8) + `\n` (1) +
-`\n` (1) = 19. Any deviation (extra space, missing newline) moves the model off
-its training distribution.
-### 3.3 Think-Block Scope Rules
-| Turn type | Think-block treatment |
-|---|---|
-| Historical assistant turn (non-last, no tool_calls) | **Strip entirely** — `split('</think>')[-1].lstrip('\n')` |
-| Historical assistant turn (has tool_calls) | **Preserve** — think block is part of the tool-call format |
-| Last assistant turn in history (`add_generation_prompt=False`) | **Preserve verbatim** |
-| Last assistant turn, no existing think, `enable_thinking=False` | **Inject** `<think>\n\n</think>\n\n` prefix |
-| Generation prompt, `enable_thinking=True` | **No prefix** — model generates its own `<think>` |
-| Generation prompt, `enable_thinking=False` | **Inject** `<think>\n\n</think>\n\n` prefix |
----
-## 4. Implementation Premises
-### 4.1 Single Namespace Object
-All mutable template state lives in one `ns` namespace object, avoiding
-Jinja2's scoping trap (variables set inside `{% for %}` blocks are not visible
-outside without a namespace):
-```jinja2
-{%- set ns = namespace(
-    enable_thinking=true,
-    image_count=0,
-    video_count=0
-) -%}
-```
-### 4.2 Pre-Scan Before Rendering
-The template performs a full pre-scan of all messages before emitting any output.
-This is necessary because `/no_think` or `/think` can appear in any user message,
-and the final flag determines the generation prompt behaviour. A single-pass loop
-that both renders and tracks flags would have to look ahead, which Jinja2 cannot
-do.
-```jinja2
-{%- for i in range(messages | length) -%}
-  {%- if messages[i].role == 'user' -%}
-    {%- set _u = messages[i].content if messages[i].content is string else '' -%}
-    {%- if _u.rstrip().endswith('/no_think') -%}
-      {%- set ns.enable_thinking = false -%}
-    {%- elif _u.rstrip().endswith('/think') -%}
-      {%- set ns.enable_thinking = true -%}
-    {%- endif -%}
-  {%- endif -%}
-{%- endfor -%}
-```
-### 4.3 Separate `{{ }}` Blocks for `tojson` Output
-Jinja2's `tojson` filter returns a `Markup` object (already HTML-safe). When a
-`Markup` value is Python-concatenated with a plain string using `+`, Jinja2
-auto-escapes the plain string and produces double-encoded output (`&quot;`,
-`&#34;`, etc.). This is BUG-003.
-The fix is to never concatenate `tojson` output with plain strings inside a
-Jinja2 expression. Each fragment is emitted through its own `{{ }}` block:
-```jinja2
-{# WRONG — triggers HTML-escaping of the plain string #}
-{{- '{"name": ' + tc.function.name | tojson + '}' -}}
-{# CORRECT — separate blocks, no Python concatenation #}
-{{- '{"name": ' -}}{{- tc.function.name | tojson -}}{{- '}' -}}
-```
-### 4.4 System Message Collection Phase
-Multiple system messages are merged into a single `<|im_start|>system` turn
-with `\n\n` as separator (BUG-004 fix). This is done as a separate pre-pass
-(Section 4 in the template), so the main loop can unconditionally skip all
-`role == 'system'` messages.
-The user's system content always appears **before** the tools block in the
-system turn, matching the training format.
-### 4.5 Tool Normalisation
-Some frameworks pass tool definitions with a top-level `function` key
-(`{"type": "function", "function": {...}}`), while others pass the function
-schema directly (`{"name": ..., "parameters": ...}`). The template normalises
-all entries to the canonical form before serialisation:
-```jinja2
-{%- if tool.function is defined -%}
-  {%- set ns_tb.list = ns_tb.list + [tool] -%}
-{%- else -%}
-  {%- set ns_tb.list = ns_tb.list + [{"type": "function", "function": tool}] -%}
-{%- endif -%}
-```
----
-## 5. `enable_thinking` Behavior
-### 5.1 Resolution Priority (Highest to Lowest)
-1. **`/no_think` or `/think` text suffix** in the last user message that contains
-   one. This is the highest priority because it represents the most recent
-   explicit user intent and mirrors the model's fine-tuning data.
-2. **`enable_thinking` template variable** passed at render time (e.g., via
-   `tokenizer.apply_chat_template(..., enable_thinking=False)`).
-3. **Default value** of `true` (thinking on by default, consistent with the model's
-   training distribution).
-### 5.2 Generation Prompt Behaviour
-When `add_generation_prompt=True`:
-```
-enable_thinking=True  →  <|im_start|>assistant\n
-                         (model generates <think> itself)
-enable_thinking=False →  <|im_start|>assistant\n<think>\n\n</think>\n\n
-                         (forces non-thinking mode by pre-filling empty block)
-```
-### 5.3 Last-History-Turn Behaviour (add_generation_prompt=False)
-When the conversation ends with an assistant message and no generation prompt
-is requested — typical when scoring a complete conversation or when the
-assistant message is being appended to the prompt for continuation:
-- **Think block present:** preserved verbatim regardless of `enable_thinking`.
-- **No think block, `enable_thinking=True`:** content left as-is (historical turns
-  are already stripped; the last one is the current generation context).
-- **No think block, `enable_thinking=False`:** inject `<think>\n\n</think>\n\n`
-  before the content.
-### 5.4 Historical Think-Block Stripping (BUG-001)
-The official template collapses think blocks in historical turns to
-`<think>\n\n</think>` instead of removing them. In a long agentic loop this
-produces an ever-growing sequence of empty think blocks that degrades prompt
-quality ("prompt poisoning").
-The correct operation is full removal:
-```python
-# Python equivalent
-content = content.split('</think>')[-1].lstrip('\n') if '</think>' in content else content
-```
-```jinja2
-{# Jinja2 equivalent #}
-{%- if '</think>' in _ac -%}
-  {%- set _ac = _ac.split('</think>')[-1].lstrip('\n') -%}
-{%- endif -%}
-```
-**Exception:** turns that also carry `tool_calls` keep their think block intact.
-The model is trained to produce thinking before tool invocations, and stripping
-the think block from a historical tool-call turn would misrepresent the prompt.
----
-## 6. Tool Call Rendering
-### 6.1 System Turn Tool Block Format
-The exact text injected into the system message when tools are present matches
-the Qwen3 Hermes training format:
-```
-# Tools
-You may call one or more functions to assist with the user query.
-You are provided with function signatures within <tools></tools> XML tags:
-<tools>
-{"type": "function", "function": {"name": "...", ...}}
-</tools>
-For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
-<tool_call>
-{"name": <function-name>, "arguments": <args-json-object>}
-</tool_call>
-```
-All text — including the instruction sentences — is literal and must not be
-modified. The model was trained on this exact phrasing.
-### 6.2 Assistant Tool-Call Block
-Each tool call is rendered as:
-```
-<tool_call>
-{"name": "function_name", "arguments": {JSON_OBJECT}}
-</tool_call>
-```
-Multiple parallel calls appear as consecutive blocks separated by `\n`:
-```
-<tool_call>
-{"name": "f1", "arguments": {...}}
-</tool_call>
-<tool_call>
-{"name": "f2", "arguments": {...}}
-</tool_call><|im_end|>
-```
-Note: the final `</tool_call>` is immediately followed by `<|im_end|>` with no
-intervening newline. This matches the training format.
-### 6.3 Arguments: String vs Object (BUG-006)
-Some frameworks (notably older OpenAI-compatible clients and some streaming
-implementations) serialise tool-call arguments as a JSON string
-(`"{\"location\": \"Berlin\"}"`) rather than as an object
-(`{"location": "Berlin"}`). The template handles both:
-```jinja2
-{%- if tc.function.arguments is string -%}
-  {{- ', "arguments": ' + tc.function.arguments -}}
-{%- else -%}
-  {{- ', "arguments": ' -}}{{- tc.function.arguments | tojson -}}
-{%- endif -%}
-```
-When arguments are already a string they are passed through as-is (the caller
-is responsible for valid JSON). When they are a dict/object, `tojson` serialises
-them correctly including Unicode escaping and quote escaping.
-This arrangement also prevents the `"""` crash (BUG-006): Python triple-quoted
-strings inside Jinja2 template strings would crash the Jinja2 parser if the
-arguments dict happened to contain a value like `"""`. By using `tojson`
-(which produces a proper JSON string literal) the crash cannot occur.
-### 6.4 Tool Results
-Tool results are wrapped in a user turn using `<tool_response>`:
-```
-<|im_start|>user
-<tool_response>
-{result_content}
-</tool_response><|im_end|>
-```
-Consecutive tool-response messages are merged into a single user turn — the
-template checks whether the previous message's role was also `tool` and
-suppresses the `<|im_start|>user\n` header if so.
----
-## 7. Bug Analysis and Fixes
-### BUG-001 — Historical Think Blocks Leaked (CRITICAL)
-**Symptom:** In multi-turn conversations with `enable_thinking=True`, every
-historical assistant message retains a collapsed `<think>\n\n</think>` block.
-Over many turns the prompt accumulates dozens of empty think blocks, degrading
-model performance.
-**Root cause:** Official template strips think content but leaves the surrounding
-`<think>` tags.
-**Fix:** Strip the entire block by splitting on `</think>` and taking the tail:
-```jinja2
-{%- set _ac = _ac.split('</think>')[-1].lstrip('\n') -%}
-```
-**Tests:** T10, T13, T16
----
-### BUG-002 — KeyError on content=None / Missing content Key (HIGH)
-**Symptom:** When an assistant message contains only `tool_calls` and no `content`
-(or `content=None`, which is the OpenAI convention for pure tool-call responses),
-the template throws `UndefinedError` or `KeyError`.
-**Root cause:** Official template accesses `message.content` directly.
-**Fix:** Guard the access:
-```jinja2
-{%- if message.content is defined and message.content is string -%}
-  {%- set _ac = message.content -%}
-{%- elif message.content is defined and message.content is iterable ... -%}
-  {%- set _ac = render_content(message.content) -%}
-{%- else -%}
-  {%- set _ac = '' -%}
-{%- endif -%}
-```
-**Tests:** T04, T11
----
-### BUG-003 — Markup HTML-Escaping in Tool JSON (MEDIUM)
-**Symptom:** Tool definitions or tool-call arguments with characters like `<`, `>`,
-`&`, or `"` appear HTML-escaped in the rendered prompt (`&lt;`, `&gt;`, `&amp;`,
-`&#34;`). This causes the model to misread the tool schema.
-**Root cause:** `tojson` returns a Jinja2 `Markup` object. When `Markup` is
-concatenated with a plain Python string using `+` inside a Jinja2 expression,
-the plain string is auto-escaped and then concatenated with the already-safe
-`Markup` value.
-**Fix:** Never use `+` to join `tojson` output with plain strings. Emit each
-fragment through a separate `{{ }}` block:
-```jinja2
-{# Every fragment in its own block #}
-{{- '{"name": ' -}}{{- tc.function.name | tojson -}}
-```
-**Tests:** T03, T04, T12
----
-### BUG-004 — Multiple System Messages Not Handled (MEDIUM)
-**Symptom:** Frameworks such as Open WebUI send more than one `role: system`
-message. The official template either crashes or emits multiple system turns,
-both of which confuse the model.
-**Root cause:** No merging logic for multiple system messages.
-**Fix:** Pre-scan all messages and concatenate system content with `\n\n`:
-```jinja2
-{%- if ns_sys.content == '' -%}
-  {%- set ns_sys.content = _c -%}
-{%- else -%}
-  {%- set ns_sys.content = ns_sys.content + '\n\n' + _c -%}
-{%- endif -%}
-```
-**Tests:** T02, T14
----
-### BUG-005 — Wrong Non-Thinking Prefill Whitespace (LOW-MEDIUM)
-**Symptom:** Non-thinking mode produces a think block with incorrect whitespace,
-moving the model off its training distribution and causing output quality
-degradation or refusal to honour the non-thinking instruction.
-**Root cause:** The official template uses `<think>\n</think>\n\n` (missing the
-second newline inside the block), which does not match the format described in
-the technical report.
-**Fix:** Use the exact 19-character sequence:
-```
-<think>\n\n</think>\n\n
-```
-**Tests:** T08, T17
----
-### BUG-006 — Triple-Quote Crash on Python String Arguments (MEDIUM)
-**Symptom:** Jinja2 raises a `TemplateSyntaxError` or produces garbled output when
-tool-call arguments contain triple-quote sequences (`"""` or `'''`) because the
-template previously embedded argument values using Python string literal
-concatenation.
-**Root cause:** Some community templates build the tool-call JSON via string
-interpolation (`'{"arguments": "' + args + '"}'`), which breaks for argument
-values containing quote characters.
-**Fix:** Use `tojson` for all non-string arguments (produces well-formed JSON) and
-pass string arguments through unchanged (caller provides valid JSON strings):
-```jinja2
-{%- if tc.function.arguments is string -%}
-  {{- ', "arguments": ' + tc.function.arguments -}}
-{%- else -%}
-  {{- ', "arguments": ' -}}{{- tc.function.arguments | tojson -}}
-{%- endif -%}
-```
-**Tests:** T12
----
-## 8. Template Architecture
-The template is divided into eight clearly delimited sections, each with a
-comment header:
-```
-Section 1  render_content macro
-           Handles str / list (image/video/text) / None → plain text.
-           Increments ns.image_count / ns.video_count for vision tokens.
-Section 2  Namespace initialisation
-           Single ns object; enable_thinking defaults to true.
-Section 3  Pre-scan
-           Walk all user messages; last /no_think or /think wins.
-Section 4  Collect system content
-           Merge all system / developer messages with \n\n.
-Section 5  Build tools list
-           Normalise every tool to {"type":"function","function":{...}}.
-Section 6  Output system turn
-           Emit one <|im_start|>system turn (user content + tools block).
-Section 7  Main message loop
-           7a  system/developer  → skip (already emitted)
-           7b  user              → render with vision support
-           7c  assistant         → render with think-block logic + tool_calls
-           7d  tool              → group into user turns
-           7e  unknown role      → raise_exception
-Section 8  Generation prompt
-           enable_thinking=True  → bare <|im_start|>assistant\n
-           enable_thinking=False → add <think>\n\n</think>\n\n prefix
-```
-### Design Decisions
-**No default system prompt.** Unlike some community templates, this template does
-not inject a default system prompt when none is provided. The model performs well
-without one, and injecting one would cause conflicts for applications that rely on
-the system prompt being exactly what they set.
-**No BOS token.** The Qwen3 family was trained without a BOS token. Adding one
-would consume a context window slot unnecessarily and may harm performance.
-**No `<|endoftext|>` in conversation.** This token is reserved for
-end-of-document signalling in the pre-training phase, not for conversation
-boundaries.
----
-## 9. Test Coverage
-The 17 test fixtures in `validate_template.py` cover:
-| ID | Scenario | Key assertion |
-|---|---|---|
-| T01 | Simple user/assistant, no system, no tools | Exact ChatML output |
-| T02 | System message | System turn before user turn |
-| T03 | Tools defined, `enable_thinking=True` | Tools block in system; no prefill |
-| T04 | Tool call, `content=None` | No crash; `<tool_call>` present |
-| T05 | Parallel tool calls | `</tool_call>\n<tool_call>` separator |
-| T06 | Tool result (role=tool) | `<|im_start|>user\n<tool_response>` |
-| T07 | `enable_thinking=True` generation prompt | No think prefix emitted |
-| T08 | `enable_thinking=False` generation prompt | Exact 19-char prefill |
-| T09 | `/no_think` flag in user message | Non-thinking prefill applied |
-| T10 | Historical think blocks | Fully stripped, not collapsed |
-| T11 | Missing `content` key on assistant | No KeyError / UndefinedError |
-| T12 | Special chars in arguments | Correctly JSON-escaped |
-| T13 | Historical tool-call turn with think | Think block preserved |
-| T14 | Multiple system messages | Merged with `\n\n`; single system turn |
-| T15 | Parallel tool responses | Both inside single user turn |
-| T16 | Last history turn with existing think | Preserved verbatim |
-| T17 | Last history turn, no think, `enable_thinking=False` | Prefill injected |
-Run the suite:
-```bash
-cd /workspace/project/qwen3_5-template
-python validate_template.py
-# Expected: 17 passed, 0 failed
-```
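To make the T06 and T15 assertions concrete, here is a minimal Python sketch of grouping consecutive tool results into one user turn. `render_tool_turns` is a hypothetical helper, and the newline separators and closing `<|im_end|>` are assumptions; only the `<|im_start|>user\n<tool_response>` opening and the single-turn grouping come from the table above.

```python
from itertools import groupby


def render_tool_turns(messages):
    """Render each run of consecutive role='tool' messages as one user turn."""
    turns = []
    for is_tool, run in groupby(messages, key=lambda m: m["role"] == "tool"):
        if not is_tool:
            continue  # other roles are handled by other branches of the loop
        body = "\n".join(
            f"<tool_response>\n{m['content']}\n</tool_response>" for m in run
        )
        turns.append(f"<|im_start|>user\n{body}<|im_end|>\n")
    return turns


# Two parallel tool results following one assistant turn.
history = [
    {"role": "assistant", "content": None},
    {"role": "tool", "content": "22C"},
    {"role": "tool", "content": "sunny"},
]
turns = render_tool_turns(history)
```

Here both results land inside a single user turn rather than two separate ones.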
----
-## 10. Tool Ecosystem Compatibility
-We analysed 51 tool-calling frameworks and inference backends to verify that the
-template's output can be consumed by the broadest possible set of tools.
-Key findings:
-### 10.1 OpenAI JSON Format Dominance
-31 of the 51 analysed tools use the **OpenAI-compatible JSON function-call API**
-(Group A). These tools pass tool definitions as a `tools` array and receive tool
-calls back as `message.tool_calls` objects. The template's input format is fully
-compatible with this convention.
-Notable Group A members: OpenHands, LangChain, LangGraph, LiteLLM, CrewAI,
-Pydantic AI, Open WebUI, LibreChat, LM Studio, LlamaIndex, AutoGen.
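For reference, the shapes involved look roughly like the Python sketch below. The tool names and the `normalise_tool` helper are hypothetical, and the bare definition without an envelope is assumed here for illustration; it mirrors the normalisation step described in the template architecture rather than reproducing the template's actual code.

```python
def normalise_tool(tool):
    """Wrap a bare function definition in the OpenAI-style envelope."""
    if tool.get("type") == "function" and "function" in tool:
        return tool  # already in the expected shape
    return {"type": "function", "function": tool}


# Request side: tool definitions travel in a `tools` array.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
    # A bare definition without the envelope (assumed input variant).
    {"name": "get_time", "parameters": {"type": "object", "properties": {}}},
]

# Response side: tool calls come back on the assistant message.
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Oslo"}'},
        }
    ],
}

normalised = [normalise_tool(t) for t in tools]
```

After normalisation every entry has the same `{"type": "function", "function": {...}}` shape, so downstream rendering needs only one code path.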
-### 10.2 Inference Server Compatibility
-| Backend | Compatibility note |
-|---|---|
-| **vLLM** | Uses the `hermes` tool parser for Qwen models, matching this template's `<tool_call>` format exactly. |
-| **llama.cpp** | Recognises `<tool_call>` via the `--jinja` flag + chat template loading. Note: `--jinja` disables GBNF grammar (Issue #12204). |
-| **Ollama** | Auto-detects the tool-call tag via `parseTag()` which reads the first text node after `.ToolCalls` in the Go template tree — `<tool_call>` is one of the three known tags. |
-| **LM Studio** | Passes tool definitions as the `tools` API field; receives tool calls in `message.tool_calls`. |
-| **TabbyAPI** | Full OpenAI-compatible API; correct chat template is the only requirement. |
-### 10.3 Non-Native Tool-Calling Frameworks
-Three framework groups (Cline/Roo Code XML, OpenCode `<parameter>`, Aider
-SEARCH/REPLACE) do not use the OpenAI tool-calling API at all. They inject their
-own tool descriptions into the system prompt and parse the model's text output
-directly. These frameworks do not interact with the chat template's tool-calling
-sections — they send no `tools` array and the template therefore emits no tool
-block.
-### 10.4 Arguments as JSON String
-Several frameworks (notably some streaming clients and older OpenAI SDK versions)
-serialise `tool_calls[].function.arguments` as a JSON string rather than a parsed
-object. The template's dual-path arguments handling (Section 6.3) accommodates
-both cases transparently.
----
-*Generated as part of the `fix/qwen3-template-bugs` implementation.*