Upload README-v8.md

7efa9ee verified 4 days ago

10.7 kB

license: apache-2.0
tags:
  - jinja
  - chat-template
  - qwen
  - qwen3.5
  - qwen3.6
  - lm-studio
  - mlx
  - llama.cpp
  - tool-calling
  - thinking

Fixed jinja chat templates for Qwen 3.5 & 3.6

2026-05-07 Update: Fixed 8th bug: Mid-conversation system messages no longer crash the template. Compatibility restored for agent frameworks (OpenCode, Docker Agent, oh-my-pi). Re-engineered Jinja string parsing for C++ engine stability.

These are drop-in Jinja templates that fix rendering errors, token waste, and missing features in the official Qwen chat templates.

They are tested to work across LM Studio, llama.cpp, vLLM, MLX, oMLX, and any engine that supports HuggingFace Jinja templates.

Why you need this

The official Qwen templates contain restrictions and Python-specific Jinja logic that break usage on many inference engines and agent frameworks.

Here are the 8 bugs this template fixes:

Problem	Impact	Fix
1. Tool calls fail on C++ engines	The `\|items` filter doesn't exist in `minijinja` (LM Studio, llama.cpp, MLX). Tool calls instantly crash the template.	Rewritten for strict C++ engine compatibility.
2. Mid-conversation system crash	Frameworks injecting mid-conversation steering instructions trigger a hard crash.	Native, chronological rendering for system messages anywhere.
3. `developer` role rejected	Modern APIs send the developer role; the official template rejects it.	Added full support for `"developer"`.
4. Empty thinking blocks spam	Every past turn gets wrapped in empty `<think></think>` tags, wasting context and breaking caching.	Dynamic length checks and history visibility logic.
5. No way to toggle thinking	The user is restricted to the model defaults.	Intercepts `<\|think_off\|>` and `<\|think_on\|>` tags natively.
6. Qwen 3.6 `</thinking>` hallucination	Model sometimes generates `</thinking>` instead of `</think>`, permanently breaking the parser.	Advanced tag detection and stream recovery.
7. No-user-query crash	`raise_exception` crashes agentic loops, system-only contexts, or `/reset` flows.	Graceful fallback scanning mechanism.
8. Unclosed thinking before tool call	Model calls a tool without closing its reasoning, bleeding XML tags into tool parsers.	Auto-injects closing tags before tool boundaries.

Quick install

Choose your environment and update the template:

LM Studio

Open your Qwen model in the right-side panel.
Scroll down to Prompt Template.
Replace the template with the contents of qwen3.5/chat_template.jinja or qwen3.6/chat_template.jinja.
Click Save.

llama.cpp / koboldcpp

--jinja --chat-template-file qwen3.6/chat_template.jinja

vLLM / TextGen

Replace the "chat_template" string in your tokenizer_config.json with the raw file contents.

oMLX

Overwrite chat_template.jinja in your local model directory. Load with --jinja. Remove any chat_template_kwargs overrides because the template handles everything internally.

Which file do I use?

Template File	Supported Models
`qwen3.5/chat_template.jinja`	Qwen3.5-35B-A3B, Qwen3.5-32B, Qwen3.5-14B, and all Qwen 3.5 variants.
`qwen3.6/chat_template.jinja`	Qwen3.6-27B, Qwen3.6-35B-A3B, and all Qwen 3.6 variants.

Note: The 3.6 template is a superset. It additionally handles preserve_thinking, </thinking> hallucination recovery, and interrupted thought streams. If you are on 3.6, always use the 3.6 file.

The thinking toggle

You can control the model reasoning behavior. Insert <|think_on|> or <|think_off|> anywhere in your system or user prompt.

The template natively intercepts the tag, removes it from the final context so the model never sees it, and flips the reasoning mode instantly.

Fast answer, no reasoning:

System: You are a coding assistant. <|think_off|>
User: What's 2+2?

Deep reasoning:

System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.

(The tag syntax uses Qwen's control-token delimiters to guarantee it will never collide with legitimate text or file paths, unlike earlier community templates that used /think)

Pre-installed models

If you are using one of the following models, you already have an older version of this template installed.

Technical Details of the 8 Fixes

1. Tool calls on C++ engines

The official template iterates tool call arguments with |items: {%- for key, value in tool_call.arguments|items %}

Python's Jinja supports |items. C++ runtimes (LM Studio, llama.cpp, MLX) do not, which produces a rendering error. This template uses direct dictionary key lookups instead. It also replaces is sequence with is iterable, removes Python-only |safe wrappers, and handles arguments returned as raw strings.

2. Mid-conversation system messages crash

The official template hard-crashes if a system or developer message appears anywhere except the first position. This breaks agentic frameworks (Codex CLI, Docker Agent, oh-my-pi, OpenCode) that inject steering instructions mid-conversation. The fix natively renders these messages chronologically to preserve LLM recency bias while enforcing strict image-blocking checks.

3. `developer` role

The OpenAI-compatible API spec sends message.role == "developer" for system-level instructions. The official Qwen template throws an exception. Both templates here accept "developer" and map it properly.

4. Empty thinking blocks

The official template wraps every past assistant turn in thinking tags, even when empty. When there is no reasoning content, those tags waste context tokens and break prefix caching. The 3.5 template checks reasoning_content before emitting. The 3.6 template checks reasoning_content|trim|length > 0 and ties history visibility to the <|think_off|> override.

5. `</thinking>` hallucination (Qwen 3.6 only)

The Qwen 3.6 model sometimes generates </thinking> instead of the expected </think>. The official parser splits on </think > only and fails. The 3.6 template detects which closing tag was actually used and splits dynamically. It also handles interrupted generation by rescuing incomplete streams.

6. Arguments serialization

The official template serializes argument values with |tojson unconditionally, failing when the value is already a string. The fixed templates check the type first. Strings pass through as-is, and everything else goes through |tojson.

7. Auto-close unclosed thinking before tool calls

The model sometimes starts a thinking block and immediately calls a tool without emitting the closing tag. The official template lets the unclosed thinking tag bleed into the tool call. The fixed templates detect this pattern and safely auto-inject the closing tag using standard Jinja split operations to guarantee 100% C++ compatibility.

8. No-user-query exception

The official template scans the message list in reverse. If all messages are tool results, or there are no user messages, it fires raise_exception('No user query found...') and hard-crashes. The fix replaces the exception with a graceful fallback {%- set ns.last_query_index = messages|length - 1 %}, enabling agentic tool-calling chains to function perfectly.

Comparison: Qwen 3.5 templates

Feature	Official	LuffyTheFox	mod-ellary	Pneuny	This
Tool arguments	Fails	Fixed	Missing	Fixed	Fixed
`\|safe` removed	Fails	Fixed	Missing	Fixed	Fixed
`developer` role	Missing	Missing	Missing	Missing	Added
Thinking toggle	None	None	`/think` (system only)	None	`<\|think_off\|>` anywhere
Empty think in history	Broken	Broken	Tags omitted	Broken	Fixed
Mid-conversation system	Crashes	Crashes	Crashes	Crashes	Fixed
Clean instructions	Yes	Yes	Yes	Injects text	Yes
No-user-query crash	Crashes	Crashes	Crashes	Crashes	Graceful fallback
Auto-close thinking	Not handled	Not handled	Not handled	Not handled	Auto-injects close tag

Comparison: Qwen 3.6 template

Feature	Official	This
Tool arguments	Fails (`\|items`)	Fixed
`\|safe` removed	Fails	Fixed
`developer` role	Missing	Added
Thinking toggle	None	`<\|think_off\|>` anywhere
`preserve_thinking`	Spams empty blocks	Dynamic length checks
Mid-conversation system	Crashes	Fixed
`</thinking>` hallucination	Fails	Detected and handled
Interrupted streams	Broken tags	Rescued
Auto-close thinking before tool	Not handled	Auto-injects close tag
No-user-query crash	Crashes	Graceful fallback

Authorship

Role	Author
Original models	Alibaba Cloud (Qwen team)
Template fixes	froggeric

License

Apache-2.0, inherited from Qwen.