Improvement (50% token reduction via the tool)

#8
by Elsephire - opened

Hello, i’m using a modified version of this template. This more compact tool is faster and provides a less noisy context for the LLM. I think it’s a powerful improvement over your original template: https://gist.github.com/webel/3c2cef9671119d71fc902d0c301db4eb

Elsephire changed discussion title from improvement (50% token reduction with tool) to Improvement (50% token reduction via the tool)

Very cool! Thank you. I have incorporate in v10. Please try.

I tested v10, and it's looping and overthinking. I don't understand why. I've published my custom template (a fusion of the Unsloth template and a compact tool) if you'd like to check it out.
https://huggingface.co/Elsephire/Qwen3.6-template-jinja/blob/main/qwen3.6-unsloth-and-compact-tools.jinja

Still in the process of validating, but a quick AI check gave me this:


v9 vs v10 Comparison: Key Differences

Tool Rendering (The "Compaction" Change)

Version Tool Format Token Usage (8 tools)
v9 (line 68-69) `{{- tool tojson }}` — full JSON schema dump
v10 (lines 68-107) Compact one-liners: `remember(text: string, room?: general prefs)` + optional schema dump only for array/object types

v10 "Compaction" Changes

  1. Lines 68-107: Replaced {{- tool | tojson }} with manual property iteration, rendering typed one-liners
  2. Lines 100-107: Schema dump only for array|object types (not always)
  3. Line 93: Added has_tools flag to ns_flags namespace

Potential Issues in v10 Templates

Issue 1: Forced Thinking on Post-Query Tool Rounds (High Severity)

Location: chat_template-v10.jinja:157 (same as v9 line 157)

{%- if loop.index0 > ns.last_query_index %}
    {{- '🤖' + message.role + '\n</think>\n\n' + content }}
{%- else %}
    {{- '🤖' + message.role + '\n' + content }}
{%- endif %}

Problem: The two-pass algorithm (lines 93-106) finds the last user query index. Any assistant message after that index (i.e., intermediate tool-call responses in multi-step agentic flows) gets forced thinking injection. This means every tool-use round triggers unnecessary </think> blocks.

Effect: Overthinking during agentic loops, wasted tokens, increased latency per tool round.

Issue 2: Empty Thinking Block When Thinking Disabled (Medium Severity)

Location: chat_template-v10.jinja:214-215

{%- if ns_flags.enable_thinking is false %}
    {{- '</think>\n\n' }}
{%- else %}
    {{- '<think>\n' }}
{%- endif %}

Problem: When enable_thinking is explicitly false, the template outputs <think>\n\n</think>\n\n — an empty thinking block. This:

  1. Wastes 2 special tokens per generation start
  2. May conditionally encourage the model to always reason, even for non-reasoning tasks

v9 behavior: Identical (line 214-218). This is a pre-existing issue carried forward.

Issue 3: Schema Dump Still Present for Complex Types (Low Severity)

Location: chat_template-v10.jinja:100-107

{%- if fn.parameters is defined and fn.parameters is mapping %}
    {%- for pname in props %}
        {%- set pdef = props[pname] %}
        {%- if pdef.type is defined and (pdef.type == 'array' or pdef.type == 'object') %}
            {{- '\n  - ' ~ pname ~ ' schema: ' ~ pdef | tojson }}
        {%- endif %}
    {%- endfor %}
{%- endif %}

Problem: For tools with array or object type parameters, the full JSON schema is still dumped via tojson. This partially undermines the token savings goal. If all tool parameters are complex types, you get nearly the same token usage as v9.

Issue 4: has_tools Flag Not Used for Generation Prompt (Low Severity)

Location: chat_template-v10.jinja:216-217

The ns_flags.has_tools flag is set at line 65 but only used at line 216 to inject an additional reminder about function call format. This is redundant since the instruction block (lines 109-114) already contains the same reminder. It adds ~80 tokens of duplication.


Summary Table

Issue Severity v9 Has It? v10 Has It?
Forced thinking on post-query assistants High Yes (line 157) Yes (line 157) — unchanged from v9
Empty thinking block when disabled Medium Yes (line 214-218) Yes (line 214-215) — unchanged from v9
Schema dump for complex types N/A No (always full JSON) Yes (partial, only array/object) — new in v10
Redundant function call reminder Low No Yes (line 216-217) — new in v10

Key Finding: The forced-thinking issue and empty-thinking-block issue are pre-existing from v9, not introduced by v10's compaction changes. The v10 "compaction" itself is functionally correct and does not introduce new logic bugs — it only changes how tools are rendered in the prompt.

If you want to fix the forced-thinking issue, you would need to remove the loop.index0 > ns.last_query_index conditional (lines 157-161) and always use standard assistant message rendering without forced </think> injection.

Sign up or log in to comment