chat_template.jinja should honor OpenAI-standard `reasoning_effort` field
Hi NVIDIA team 👋
The chat template that ships with this repo (`chat_template.jinja`) exposes two reasoning-control knobs via `chat_template_kwargs`:

- `low_effort` (bool): appends `{reasoning effort: low}` to the user message
- `enable_thinking` (bool): emits an empty `<think></think>` in the generation prompt
Both work well. The problem: the template never reads the OpenAI-standard `reasoning_effort` field (values `"none" | "low" | "medium" | "high"`), which is the standard way clients ask reasoning models to reduce thinking. vLLM (and any other server that speaks the OpenAI chat API) forwards `reasoning_effort` into the Jinja environment automatically, but since this template never references the variable, it falls through and the model does full thinking every time.
Net effect for any OpenAI-API client: `reasoning_effort: "low"` is silently a no-op, and `reasoning_effort: "none"` produces deceptive output: the model still generates a full CoT, vLLM merely hides it from the response body, so the client sees `reasoning: null` and assumes the request was cheap while paying full thinking cost. Full receipts below.
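To make the fall-through concrete, here is a minimal jinja2 sketch (the template string is a stand-in, not the real `chat_template.jinja`): variables passed to `render()` that the template never references are silently ignored, which is exactly what happens to `reasoning_effort` today.

```python
from jinja2 import Template

# Stand-in for a chat template that never references reasoning_effort.
tpl = Template("User: {{ message }}")

baseline = tpl.render(message="What is the capital city of France?")
with_flag = tpl.render(message="What is the capital city of France?",
                       reasoning_effort="low")

# The unreferenced variable falls through: identical prompt either way.
assert baseline == with_flag
```

The same silent fall-through happens server-side when vLLM renders the real template.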
Reproducer
Serving this model in `vllm/vllm-openai:v0.19.0` with `--reasoning-parser nemotron_v3 --trust-remote-code`, I sent the prompt "What is the capital city of France?" with `max_tokens=1024`, `temperature=0` in five modes, 3 runs each:
| Mode | reasoning_chars | content_chars | completion_tokens | wall_ms |
|---|---|---|---|---|
| 1. no flag | 1146 | 368 | 372 | 3084 |
| 2. `reasoning_effort="low"` | 985 | 350 | 310 | 2576 |
| 3. `reasoning_effort="none"` | 0 | 373 | 315 | 2620 |
| 4. `kwargs.low_effort=true` | 45 | 37 | 22 | 217 |
| 5. `kwargs.enable_thinking=false` | 0 | 36 | 9 | 108 |
- Row 2 is within run-to-run noise of Row 1: `reasoning_effort: "low"` does not meaningfully shorten generation.
- Row 3 shows `reasoning: null`, but `completion_tokens` and wall time match full-thinking Row 1: the model generated a full CoT that was then hidden. Compare Row 5, where `reasoning: null` correctly correlates with 9 tokens and 108 ms.
- Rows 4 and 5 (the existing `chat_template_kwargs` knobs) are the only modes that actually reduce generation, and they reduce it by 17× and 41× respectively. These correspond to the Jinja branches at `chat_template.jinja:180` (appends `{reasoning effort: low}` to the user message when `low_effort`) and `chat_template.jinja:204-208` (emits an empty `<think></think>` in the generation prompt when `not enable_thinking`).
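The reduction factors quoted above come straight from the `completion_tokens` column and can be sanity-checked with one division each:

```python
# completion_tokens from the table above
full_thinking = 372  # row 1: no flag
low_effort = 22      # row 4: kwargs.low_effort=true
no_thinking = 9      # row 5: kwargs.enable_thinking=false

print(f"{full_thinking / low_effort:.1f}x")   # 16.9x, quoted as 17x
print(f"{full_thinking / no_thinking:.1f}x")  # 41.3x, quoted as 41x
```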
Proposed fix
A short Jinja addition, placed right after the existing variable defaults at the top of `chat_template.jinja` (shown here for context):
```jinja
{%- set enable_thinking = enable_thinking if enable_thinking is defined else True %}
{%- set low_effort = low_effort if low_effort is defined else False %}
{%- set truncate_history_thinking = truncate_history_thinking if truncate_history_thinking is defined else True %}
{# Honor OpenAI-standard reasoning_effort field as a fallback when the native flags are unset. #}
{%- if reasoning_effort is defined and reasoning_effort is not none %}
{%- if reasoning_effort == "none" %}
{%- set enable_thinking = False %}
{%- elif reasoning_effort == "low" %}
{%- set low_effort = True %}
{%- endif %}
{%- endif %}
```
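The mapping can be unit-tested in isolation with jinja2. This is a standalone sketch: the `thinking=... low=...` probe line at the end is only there to observe the resulting flags and is not part of the proposed patch.

```python
from jinja2 import Template

# The proposed reasoning_effort block, plus a probe line to read the flags.
snippet = """\
{%- set enable_thinking = enable_thinking if enable_thinking is defined else True %}
{%- set low_effort = low_effort if low_effort is defined else False %}
{%- if reasoning_effort is defined and reasoning_effort is not none %}
{%- if reasoning_effort == "none" %}
{%- set enable_thinking = False %}
{%- elif reasoning_effort == "low" %}
{%- set low_effort = True %}
{%- endif %}
{%- endif %}
thinking={{ enable_thinking }} low={{ low_effort }}"""

tpl = Template(snippet)
assert tpl.render().strip() == "thinking=True low=False"                          # default: full thinking
assert tpl.render(reasoning_effort="none").strip() == "thinking=False low=False"  # thinking disabled
assert tpl.render(reasoning_effort="low").strip() == "thinking=True low=True"     # brief thinking
assert tpl.render(reasoning_effort="high").strip() == "thinking=True low=False"   # no change
assert tpl.render(reasoning_effort=None).strip() == "thinking=True low=False"     # null is ignored
```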
Mapping rationale:
- `reasoning_effort == "none"` → `enable_thinking = False` (model doesn't think at all)
- `reasoning_effort == "low"` → `low_effort = True` (model thinks briefly)
- `reasoning_effort == "medium" | "high"` → no change (default is already full thinking)
The patch is defensive: it only touches the flags when `reasoning_effort` is explicitly provided and non-null. One precedence note: because the block runs after the native defaults are applied, an explicit `reasoning_effort` overrides `chat_template_kwargs.low_effort` / `enable_thinking` when both are supplied; callers who want the native knobs to win should simply omit `reasoning_effort`.
Why fix it here rather than in vLLM
This bug also affects any other OpenAI-compatible inference server that renders the HF chat template (SGLang, TGI, llama.cpp server, etc.). Fixing it in the template fixes it everywhere at once. I've filed a companion vLLM-side issue for teams that can't wait for a model card update, but the template is the right long-term home for the fix.
Happy to open a PR against the repo if that's the preferred workflow. Thanks! 🙏