chat_template.jinja should honor OpenAI-standard `reasoning_effort` field

#12
by keyang-lite - opened

Hi NVIDIA team 👋

The chat template that ships with this repo (chat_template.jinja) exposes two reasoning-control knobs via chat_template_kwargs:

  • low_effort (bool): appends {reasoning effort: low} to the user message
  • enable_thinking (bool): emits an empty <think></think> in the generation prompt

Both work well. The problem: the template does not read the OpenAI-standard reasoning_effort field (values "none" | "low" | "medium" | "high"), which is the standard way clients ask reasoning models to reduce thinking. vLLM (and any other server that speaks the OpenAI chat API) forwards reasoning_effort into the Jinja environment automatically, but since this template never references the variable, it falls through and the model does full thinking every time.

Net effect for any OpenAI-API client: reasoning_effort: "low" is silently a no-op, and reasoning_effort: "none" produces deceptive output (the model still generates a full CoT; vLLM hides it from the response body, so the client sees reasoning: null and assumes the request was cheap, while paying full thinking cost). Full receipts below.
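For concreteness, here is a sketch of the request body such a client sends; the model name is a placeholder, and the prompt matches the reproducer below. reasoning_effort is a top-level field of the chat-completions request, which vLLM forwards into the Jinja environment when rendering the template; today the template simply never reads it:

```python
import json

# Hypothetical request an OpenAI-API client would send (model name is a placeholder).
payload = {
    "model": "<served-model-name>",
    "messages": [{"role": "user", "content": "What is the capital city of France?"}],
    "max_tokens": 1024,
    "temperature": 0,
    "reasoning_effort": "low",  # silently a no-op with the current template
}
print(json.dumps(payload))
```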

Reproducer

Serving this model in vllm/vllm-openai:v0.19.0 with --reasoning-parser nemotron_v3 --trust-remote-code, I sent the prompt "What is the capital city of France?" with max_tokens=1024, temperature=0 in five modes, 3 runs each:

| # | Mode | reasoning_chars | content_chars | completion_tokens | wall_ms |
|---|------|----------------:|--------------:|------------------:|--------:|
| 1 | no flag | 1146 | 368 | 372 | 3084 |
| 2 | reasoning_effort="low" | 985 | 350 | 310 | 2576 |
| 3 | reasoning_effort="none" | 0 | 373 | 315 | 2620 |
| 4 | kwargs.low_effort=true | 45 | 37 | 22 | 217 |
| 5 | kwargs.enable_thinking=false | 0 | 36 | 9 | 108 |
  • Row 2 is within run-to-run noise of Row 1. reasoning_effort: "low" does not meaningfully shorten generation.
  • Row 3 shows reasoning: null, but completion_tokens and wall time match full-thinking Row 1: the model generated a full CoT that was simply hidden. Compare Row 5, where reasoning: null correctly correlates with 9 tokens and 108 ms.
  • Rows 4 and 5 (the existing chat_template_kwargs knobs) are the only modes that actually reduce generation, by 17× and 41× respectively. These correspond to the Jinja branches at chat_template.jinja:180 (appends {reasoning effort: low} to the user message when low_effort) and chat_template.jinja:204-208 (emits an empty <think></think> in the generation prompt when not enable_thinking).
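For comparison, the two modes that do work today (rows 4 and 5) are reached through vLLM's chat_template_kwargs request field, which the OpenAI Python SDK can pass via extra_body. A sketch, with the client call left illustrative since it needs a running server:

```python
# vLLM merges `chat_template_kwargs` into the Jinja environment when rendering
# the chat template, so these reach the template's native knobs directly.
row4_extra = {"chat_template_kwargs": {"low_effort": True}}          # row 4
row5_extra = {"chat_template_kwargs": {"enable_thinking": False}}    # row 5

# Illustrative call via the OpenAI SDK (requires a running vLLM server):
# client.chat.completions.create(
#     model="<served-model-name>",
#     messages=[{"role": "user", "content": "What is the capital city of France?"}],
#     extra_body=row5_extra,
# )
print(row4_extra, row5_extra)
```

The catch, of course, is that extra_body is a vLLM-specific escape hatch; generic OpenAI-API clients only know how to send reasoning_effort, which is why the template should honor it.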

Proposed fix

A short Jinja addition, inserted right after the existing variable definitions at the top of chat_template.jinja:

```jinja
{%- set enable_thinking = enable_thinking if enable_thinking is defined else True %}
{%- set low_effort = low_effort if low_effort is defined else False %}
{%- set truncate_history_thinking = truncate_history_thinking if truncate_history_thinking is defined else True %}

{# Honor OpenAI-standard reasoning_effort field as a fallback when the native flags are unset. #}
{%- if reasoning_effort is defined and reasoning_effort is not none %}
    {%- if reasoning_effort == "none" %}
        {%- set enable_thinking = False %}
    {%- elif reasoning_effort == "low" %}
        {%- set low_effort = True %}
    {%- endif %}
{%- endif %}
```

Mapping rationale:

  • reasoning_effort == "none" → enable_thinking = False (model doesn't think at all)
  • reasoning_effort == "low" → low_effort = True (model thinks briefly)
  • reasoning_effort == "medium" | "high" → no change (the default is already full thinking)

The patch is defensive: it only sets flags when reasoning_effort is explicitly provided, and it only ever lowers effort, so an explicit chat_template_kwargs.enable_thinking=false or low_effort=true (applied first) is never undone.
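The fallback is small enough to sanity-check in isolation. A minimal sketch with the jinja2 library, using a stand-in template that contains only the proposed logic plus a probe line (hypothetical, not the real chat_template.jinja):

```python
from jinja2 import Environment

# Stand-in template: the proposed fallback plus a probe line exposing the flags.
PATCH = """\
{%- set enable_thinking = enable_thinking if enable_thinking is defined else True -%}
{%- set low_effort = low_effort if low_effort is defined else False -%}
{%- if reasoning_effort is defined and reasoning_effort is not none -%}
    {%- if reasoning_effort == "none" -%}
        {%- set enable_thinking = False -%}
    {%- elif reasoning_effort == "low" -%}
        {%- set low_effort = True -%}
    {%- endif -%}
{%- endif -%}
thinking={{ enable_thinking }} low={{ low_effort }}"""

tmpl = Environment().from_string(PATCH)

print(tmpl.render())                          # thinking=True low=False
print(tmpl.render(reasoning_effort="none"))   # thinking=False low=False
print(tmpl.render(reasoning_effort="low"))    # thinking=True low=True
print(tmpl.render(reasoning_effort="high"))   # thinking=True low=False
# An explicit native flag survives: reasoning_effort never raises effort.
print(tmpl.render(enable_thinking=False, reasoning_effort="high"))  # thinking=False low=False
```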

Why fix it here rather than in vLLM

This bug also affects any other OpenAI-compatible inference server that renders the HF chat template (SGLang, TGI, llama.cpp server, etc.). Fixing it in the template fixes it everywhere at once. I've filed a companion vLLM-side issue for teams that can't wait for a model card update, but the template is the right long-term home for the fix.

Happy to open a PR against the repo if that's the preferred workflow. Thanks! 🙏

keyang-lite changed discussion status to closed
