froggeric's picture
Upload README-v8.md
7efa9ee verified
metadata
license: apache-2.0
tags:
  - jinja
  - chat-template
  - qwen
  - qwen3.5
  - qwen3.6
  - lm-studio
  - mlx
  - llama.cpp
  - tool-calling
  - thinking

Fixed jinja chat templates for Qwen 3.5 & 3.6

2026-05-07 Update: Fixed 8th bug: Mid-conversation system messages no longer crash the template. Compatibility restored for agent frameworks (OpenCode, Docker Agent, oh-my-pi). Re-engineered Jinja string parsing for C++ engine stability.

These are drop-in Jinja templates that fix rendering errors, token waste, and missing features in the official Qwen chat templates.

They are tested to work across LM Studio, llama.cpp, vLLM, MLX, oMLX, and any engine that supports HuggingFace Jinja templates.


Why you need this

The official Qwen templates contain restrictions and Python-specific Jinja logic that break usage on many inference engines and agent frameworks.

Here are the 8 bugs this template fixes:

Problem Impact Fix
1. Tool calls fail on C++ engines The |items filter doesn't exist in minijinja (LM Studio, llama.cpp, MLX). Tool calls instantly crash the template. Rewritten for strict C++ engine compatibility.
2. Mid-conversation system crash Frameworks injecting mid-conversation steering instructions trigger a hard crash. Native, chronological rendering for system messages anywhere.
3. developer role rejected Modern APIs send the developer role; the official template rejects it. Added full support for "developer".
4. Empty thinking blocks spam Every past turn gets wrapped in empty <think></think> tags, wasting context and breaking caching. Dynamic length checks and history visibility logic.
5. No way to toggle thinking The user is restricted to the model defaults. Intercepts <|think_off|> and <|think_on|> tags natively.
6. Qwen 3.6 </thinking> hallucination Model sometimes generates </thinking> instead of </think>, permanently breaking the parser. Advanced tag detection and stream recovery.
7. No-user-query crash raise_exception crashes agentic loops, system-only contexts, or /reset flows. Graceful fallback scanning mechanism.
8. Unclosed thinking before tool call Model calls a tool without closing its reasoning, bleeding XML tags into tool parsers. Auto-injects closing tags before tool boundaries.

Quick install

Choose your environment and update the template:

LM Studio

  1. Open your Qwen model in the right-side panel.
  2. Scroll down to Prompt Template.
  3. Replace the template with the contents of qwen3.5/chat_template.jinja or qwen3.6/chat_template.jinja.
  4. Click Save.

llama.cpp / koboldcpp

--jinja --chat-template-file qwen3.6/chat_template.jinja

vLLM / TextGen

Replace the "chat_template" string in your tokenizer_config.json with the raw file contents.

oMLX

Overwrite chat_template.jinja in your local model directory. Load with --jinja. Remove any chat_template_kwargs overrides because the template handles everything internally.


Which file do I use?

Template File Supported Models
qwen3.5/chat_template.jinja Qwen3.5-35B-A3B, Qwen3.5-32B, Qwen3.5-14B, and all Qwen 3.5 variants.
qwen3.6/chat_template.jinja Qwen3.6-27B, Qwen3.6-35B-A3B, and all Qwen 3.6 variants.

Note: The 3.6 template is a superset. It additionally handles preserve_thinking, </thinking> hallucination recovery, and interrupted thought streams. If you are on 3.6, always use the 3.6 file.


The thinking toggle

You can control the model reasoning behavior. Insert <|think_on|> or <|think_off|> anywhere in your system or user prompt.

The template natively intercepts the tag, removes it from the final context so the model never sees it, and flips the reasoning mode instantly.

Fast answer, no reasoning:

System: You are a coding assistant. <|think_off|>
User: What's 2+2?

Deep reasoning:

System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.

(The tag syntax uses Qwen's control-token delimiters to guarantee it will never collide with legitimate text or file paths, unlike earlier community templates that used /think)


Pre-installed models

If you are using one of the following models, you already have an older version of this template installed.


Technical Details of the 8 Fixes

1. Tool calls on C++ engines

The official template iterates tool call arguments with |items: {%- for key, value in tool_call.arguments|items %}

Python's Jinja supports |items. C++ runtimes (LM Studio, llama.cpp, MLX) do not, which produces a rendering error. This template uses direct dictionary key lookups instead. It also replaces is sequence with is iterable, removes Python-only |safe wrappers, and handles arguments returned as raw strings.

2. Mid-conversation system messages crash

The official template hard-crashes if a system or developer message appears anywhere except the first position. This breaks agentic frameworks (Codex CLI, Docker Agent, oh-my-pi, OpenCode) that inject steering instructions mid-conversation. The fix natively renders these messages chronologically to preserve LLM recency bias while enforcing strict image-blocking checks.

3. developer role

The OpenAI-compatible API spec sends message.role == "developer" for system-level instructions. The official Qwen template throws an exception. Both templates here accept "developer" and map it properly.

4. Empty thinking blocks

The official template wraps every past assistant turn in thinking tags, even when empty. When there is no reasoning content, those tags waste context tokens and break prefix caching. The 3.5 template checks reasoning_content before emitting. The 3.6 template checks reasoning_content|trim|length > 0 and ties history visibility to the <|think_off|> override.

5. </thinking> hallucination (Qwen 3.6 only)

The Qwen 3.6 model sometimes generates </thinking> instead of the expected </think>. The official parser splits on </think > only and fails. The 3.6 template detects which closing tag was actually used and splits dynamically. It also handles interrupted generation by rescuing incomplete streams.

6. Arguments serialization

The official template serializes argument values with |tojson unconditionally, failing when the value is already a string. The fixed templates check the type first. Strings pass through as-is, and everything else goes through |tojson.

7. Auto-close unclosed thinking before tool calls

The model sometimes starts a thinking block and immediately calls a tool without emitting the closing tag. The official template lets the unclosed thinking tag bleed into the tool call. The fixed templates detect this pattern and safely auto-inject the closing tag using standard Jinja split operations to guarantee 100% C++ compatibility.

8. No-user-query exception

The official template scans the message list in reverse. If all messages are tool results, or there are no user messages, it fires raise_exception('No user query found...') and hard-crashes. The fix replaces the exception with a graceful fallback {%- set ns.last_query_index = messages|length - 1 %}, enabling agentic tool-calling chains to function perfectly.

Comparison: Qwen 3.5 templates
Feature Official LuffyTheFox mod-ellary Pneuny This
Tool arguments Fails Fixed Missing Fixed Fixed
|safe removed Fails Fixed Missing Fixed Fixed
developer role Missing Missing Missing Missing Added
Thinking toggle None None /think (system only) None <|think_off|> anywhere
Empty think in history Broken Broken Tags omitted Broken Fixed
Mid-conversation system Crashes Crashes Crashes Crashes Fixed
Clean instructions Yes Yes Yes Injects text Yes
No-user-query crash Crashes Crashes Crashes Crashes Graceful fallback
Auto-close thinking Not handled Not handled Not handled Not handled Auto-injects close tag
Comparison: Qwen 3.6 template
Feature Official This
Tool arguments Fails (|items) Fixed
|safe removed Fails Fixed
developer role Missing Added
Thinking toggle None <|think_off|> anywhere
preserve_thinking Spams empty blocks Dynamic length checks
Mid-conversation system Crashes Fixed
</thinking> hallucination Fails Detected and handled
Interrupted streams Broken tags Rescued
Auto-close thinking before tool Not handled Auto-injects close tag
No-user-query crash Crashes Graceful fallback

Authorship

Role Author
Original models Alibaba Cloud (Qwen team)
Template fixes froggeric

License

Apache-2.0, inherited from Qwen.