The chat template is so complicated that even Gemma 4 itself has no idea how to parse it

#53
by alexcardo - opened

I'm unsuccessfully trying to connect the model to my working tools such as OpenWebUI and OpenCode. People have noticed that your chat template is too complicated. All the reports about double CoT and multiple useless tool calls... all these issues come from this. Can you please clarify how to deal with your chat template, since only you understand this logic?

Even your (I have no doubt) smartest model, Gemma 4 31B, can't figure out how to deal with this template.

The model returns content. Based on the template, the model will likely include tokens like <|channel>thought\n<channel|> or <|think|>.
Currently, the backend does:
content = delta.get("content", "")
full_content += content
if content:
    yield f"data: {json.dumps(chunk)}\n\n"
It just forwards the chunk from vLLM as is.
The frontend then receives this and does:
const renderMessageContent = (content: string) => {
  const parts = content.split(/(<thought>.*?<\/thought>)/s);
  // ...
But the model is using <|channel>thought\n<channel|> (or similar) according to the template.
To separate reasoning from the answer:

  1. Backend Transformation: The backend should detect the thinking delimiters from vLLM and wrap them in <thought> and </thought> tags before sending them to the frontend. This way the frontend's existing renderMessageContent will work; alternatively, we can update the frontend to handle the raw tokens.
  2. Frontend Update: If we don't want to change the backend, we must update the frontend's renderMessageContent to recognize the model's actual thinking tokens.
    Given the template:
  • The prompt adds <|channel>thought\n<channel|> if enable_thinking is false?
    Wait, let me re-read the template:
    {%- if add_generation_prompt -%}
    {%- if ns.prev_message_type != 'tool_response' -%}
    {{- '<|turn>model\n' -}}
    {%- endif -%}
    {%- if not enable_thinking | default(false) -%}
    {{- '<|channel>thought\n<channel|>' -}}
    {%- endif -%}
    {%- endif -%}
    This is confusing. If enable_thinking is FALSE, it adds a "thought" channel? Usually, it's the other way around.
    Let's look at strip_thinking:
    {%- macro strip_thinking(text) -%}
    {%- set ns = namespace(result='') -%}
    {%- for part in text.split('<channel|>') -%}
    {%- if '<|channel>' in part -%}
    {%- set ns.result = ns.result + part.split('<|channel>')[0] -%}
    {%- else -%}
    {%- set ns.result = ns.result + part -%}
    {%- endif -%}
    {%- endfor -%}
    {{- ns.result | trim -}}
    {%- endmacro -%}
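    To sanity-check that reading of the macro, here is a quick Python re-implementation of its split logic (my own sketch, not code from the repo):

```python
def strip_thinking(text: str) -> str:
    """Python equivalent of the Jinja strip_thinking macro:
    drop everything from <|channel> up to the next <channel|>."""
    result = ""
    for part in text.split("<channel|>"):
        if "<|channel>" in part:
            # keep only what comes before the thinking marker
            result += part.split("<|channel>")[0]
        else:
            result += part
    return result.strip()

print(strip_thinking(
    "Answer 1<|channel>thought\nsome reasoning<channel|> Answer 2"
))  # prints: Answer 1 Answer 2
```

    Running it on sample strings confirms the interpretation below: everything between <|channel> and the next <channel|> is discarded, and the remainder is the visible answer.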
    It splits by <channel|> and then if <|channel> is in the part, it takes the part before <|channel>.
    This looks like a very specific tokenization scheme.
    If I want to "separate the reasoning mode from the response mode", I should probably make the backend wrap the thinking part in something the frontend can easily identify, like <thought>...</thought>.
    The model output will probably look like:
    <|channel>thought\nSome reasoning here...<|channel>Actual answer here
    So the backend should:
  1. Keep track of whether it is currently in "thinking mode".
  2. When it sees the start of a thinking block, emit <thought>.
  3. When it sees the end of a thinking block, emit </thought>.
    Wait, if the model is using vLLM, vLLM might be stripping these tokens or providing them as part of the content.
    If enable_thinking is true, the system turn starts with <|think|>.
    Let's look at the template again.
    strip_thinking is used when rendering model messages into the prompt for the next turn. It's NOT used for the final output to the user. The final output is the raw content from the model.
    So the model will output <|channel>thought\n...<|channel>....
    Plan:
  1. Update the backend (main.py) to parse the stream for thinking delimiters and replace them with <thought> and </thought>.
  2. Specifically, we need to identify the start and end tokens for thinking. Based on the template, it seems to be <|channel>thought\n<channel|> or similar. Actually, the template uses <channel|> as a separator and <|channel> as a marker.
    Wait, let's look at:
    {%- for part in text.split('<channel|>') -%}
    {%- if '<|channel>' in part -%}
    {%- set ns.result = ns.result + part.split('<|channel>')[0] -%}
    This means if the text is A <channel|> B <|channel> C <channel|> D, it splits into ['A ', ' B <|channel> C ', ' D'].
    For the second part, it splits by <|channel> and takes B .
    So the result is A B D. It removes the part starting from <|channel> until the next <channel|>.
    So thinking is contained between <|channel> and <channel|>.
    Wait, the generation prompt says:
    {{- '<|channel>thought\n<channel|>' -}}
    This looks like it's forcing the model to start with a thought.
    So the thinking block is: <|channel>thought\n ... <channel|> (or something similar).
    Let's propose a plan:
  1. Modify backend/main.py to implement a stateful stream processor.
  2. When the stream contains <|channel>thought, start a <thought> block.
  3. When the stream contains <channel|>, end the block with </thought>.
  4. This way the frontend (which already handles <thought>...</thought>) will render it correctly.
    I should also verify what the actual tokens are. Since I can't run the model, I have to rely on the template.
    The template uses:
  • <|channel>thought\n<channel|> in the generation prompt.
  • split('<channel|>') and split('<|channel>') in strip_thinking.
    This strongly suggests that thinking blocks are delimited by <|channel> and <channel|>.
    Wait, if the generation prompt is <|channel>thought\n<channel|>, then the model's response starts after that.
    The model might then produce more content, and then maybe another <|channel>...<channel|> block?
    No, usually reasoning comes first.
    If the prompt ends with <|channel>thought\n<channel|>, the model starts responding.
    If the model's response is Reasoning... <|channel>Answer, then the reasoning is what follows the prompt.
    But wait, the strip_thinking macro removes the part starting with <|channel>.
    {%- if '<|channel>' in part -%}
    {%- set ns.result = ns.result + part.split('<|channel>')[0] -%}
    This means if the content is Hello <|channel> thought stuff <channel|> World, the result is Hello World.
    So the thinking is <|channel> ... <channel|>.
    So:
  • Start token: <|channel>
  • End token: <channel|>
    But wait, the prompt is <|channel>thought\n<channel|>.
    If the prompt already contains the end token <channel|>, then the model's response is the answer?
    No, the prompt is what's sent to the model. The model's response is what we get back.
    If the prompt is: ... <|turn>model\n <|channel>thought\n<channel|>
    Then the model will start generating.
    If the model then generates Thinking... <|channel>Answer, then the "Thinking..." part is outside of a <|channel> block?
    Let's re-read strip_thinking very carefully.
    {%- macro strip_thinking(text) -%}
    {%- set ns = namespace(result='') -%}
    {%- for part in text.split('<channel|>') -%}
    {%- if '<|channel>' in part -%}
    {%- set ns.result = ns.result + part.split('<|channel>')[0] -%}
    {%- else -%}
    {%- set ns.result = ns.result + part -%}
    {%- endif -%}
    {%- endfor -%}
    {{- ns.result | trim -}}
    {%- endmacro -%}
    If text = Thought 1 <|channel> Thought 2 <channel|> Answer 1 <|channel> Thought 3 <channel|> Answer 2
  1. split('<channel|>') -> ['Thought 1 <|channel> Thought 2 ', ' Answer 1 <|channel> Thought 3 ', ' Answer 2']
  2. Part 1: contains <|channel>, so result += 'Thought 1 '
  3. Part 2: contains <|channel>, so result += ' Answer 1 '
  4. Part 3: no <|channel>, so result += ' Answer 2'
    Result: Thought 1 Answer 1 Answer 2.
    So it removes everything from <|channel> to <channel|>.
    This means the answer is what's not in a <|channel>...<channel|> block.
    The thinking is what's inside the <|channel>...<channel|> block.
    But wait, the prompt is <|channel>thought\n<channel|>.
    If the prompt ends with <channel|>, then the model starts generating.
    If the model generates Thinking... <|channel>Answer, then Thinking... is outside the block.
    This is contradictory. Let's look at the prompt again.
    {%- if not enable_thinking | default(false) -%}
    {{- '<|channel>thought\n<channel|>' -}}
    {%- endif -%}
    If enable_thinking is FALSE, it adds this. Why would it add a thought channel if thinking is disabled?
    Maybe enable_thinking in the template means "don't force the thought channel because it's already handled by the model's internal thinking process"?
    Anyway, the goal is to "separate reasoning mode from response mode".
    The most robust way is to find the delimiters.
    If the model outputs:
    <|channel>thought\nI should start by...<channel|>The answer is 42.
    Then we want:
    <thought>I should start by...</thought>The answer is 42.
    So the delimiters are <|channel> and <channel|>.
    Wait, the prompt: <|channel>thought\n<channel|>
    If the model sees this, it might continue from the <channel|> token.
    Actually, look at strip_thinking again.
    It splits by <channel|>.
    Then it looks for <|channel> in the part.
    If it finds <|channel>, it takes the part before it.
    Example: Ans1 <|channel> Thought1 <channel|> Ans2 <|channel> Thought2 <channel|> Ans3
    Parts: ['Ans1 <|channel> Thought1 ', ' Ans2 <|channel> Thought2 ', ' Ans3']
  1. Ans1
  2. Ans2
  3. Ans3
    Result: Ans1 Ans2 Ans3.
    Correct. Thinking is between <|channel> and <channel|>.
    So the plan is:
  1. Backend: In main.py, when streaming from vLLM, maintain a buffer.
  2. Check for <|channel> and <channel|>.
  3. When <|channel> is encountered, emit <thought>.
  4. When <channel|> is encountered, emit </thought>.
    Wait, I should check if there's any other token like <|think|> mentioned in the template.
    Yes: {%- if enable_thinking is defined and enable_thinking -%}{{- '<|think|>' -}}{%- endif -%}
    This is at the very top of the system turn.
    But for the model's response, the strip_thinking macro is the key.
    Let's refine the backend implementation.
    We can't just do a simple replace because the tokens might be split across chunks.
    We need a small buffer.
    Plan for Backend:
  1. Modify run_vllm_loop in main.py.
  2. Introduce a thinking_active boolean.
  3. Process the content stream to detect <|channel> and <channel|>.
  4. Emit <thought> and </thought> accordingly.
    Wait, I should also consider if the model outputs <|channel>thought\n and the user wants to see "thought" as the header.
    The template says <|channel>thought\n<channel|>.
    If the model outputs <|channel>thought\n...<channel|>, we can just replace <|channel>thought\n with <thought> and <channel|> with </thought>.
    One more thing: The frontend's renderMessageContent uses:
    const parts = content.split(/(<thought>.*?<\/thought>)/s);
    This is a regex that captures the tags.
    It then checks if (part.startsWith('<thought>') && part.endsWith('</thought>')).
    This means the frontend expects the tags to be exactly <thought> and </thought>.
    Refined Plan:
  1. Modify backend/main.py:
    • In run_vllm_loop, replace <|channel>thought\n with <thought>.
    • Replace <channel|> with </thought>.
    • Use a buffer to handle tokens split across chunks.
      Let's double check the tokens again.
      Template: {{- '<|channel>thought\n<channel|>' -}}
      strip_thinking uses split('<channel|>') and split('<|channel>').
      So the start token is <|channel> and the end token is <channel|>.
      I will implement this in the backend.
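    A minimal sketch of that stateful stream processor, assuming the delimiters really are <|channel>thought\n and <channel|> as read from the template (the class and method names here are hypothetical, not taken from main.py):

```python
START = "<|channel>thought\n"  # assumed start-of-thinking marker
END = "<channel|>"             # assumed end-of-thinking marker


class ThinkingRewriter:
    """Rewrites <|channel>thought\n ... <channel|> into <thought>...</thought>,
    buffering a short tail so markers split across stream chunks still match."""

    def __init__(self) -> None:
        self.buf = ""
        self.thinking = False

    def feed(self, chunk: str) -> str:
        self.buf += chunk
        out = ""
        while True:
            marker = END if self.thinking else START
            idx = self.buf.find(marker)
            if idx != -1:
                # full marker found: emit preceding text plus the replacement tag
                out += self.buf[:idx]
                out += "</thought>" if self.thinking else "<thought>"
                self.buf = self.buf[idx + len(marker):]
                self.thinking = not self.thinking
            else:
                # flush everything except a tail that could be a partial marker
                safe = max(0, len(self.buf) - (len(marker) - 1))
                out += self.buf[:safe]
                self.buf = self.buf[safe:]
                return out

    def flush(self) -> str:
        """Emit whatever is still buffered once the stream ends."""
        out, self.buf = self.buf, ""
        return out
```

    In the streaming loop, each vLLM delta would pass through feed() before being sent to the frontend, with one final flush() at end of stream, e.g.:

```python
rewriter = ThinkingRewriter()
chunks = ["<|chan", "nel>thought\nreason", "ing<chan", "nel|>Answer"]
text = "".join(rewriter.feed(c) for c in chunks) + rewriter.flush()
print(text)  # prints: <thought>reasoning</thought>Answer
```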

vLLM already has a merged commit that is supposed to fix some of these issues.

https://github.com/vllm-project/vllm/pull/39027
