chat_template.jinja should honor OpenAI-standard `reasoning_effort` field

#12
by keyang-lite - opened

Hi NVIDIA team 👋

The chat template that ships with this repo (chat_template.jinja) exposes two reasoning-control knobs via chat_template_kwargs:

  • low_effort (bool): appends {reasoning effort: low} to the user message
  • enable_thinking (bool): emits an empty <think></think> in the generation prompt

Both work well. The problem: the template does not read the OpenAI-standard reasoning_effort field (values "none" | "low" | "medium" | "high"), which is the standard way clients ask reasoning models to reduce thinking. vLLM (and any other server that speaks the OpenAI chat API) forwards reasoning_effort into the Jinja environment automatically, but since this template never references the variable, it falls through and the model does full thinking every time.

Net effect for any OpenAI-API client: reasoning_effort: "low" is silently a no-op, and reasoning_effort: "none" produces deceptive output (the model still generates a full CoT; vLLM hides it from the response body, so the client sees reasoning: null and assumes the request was cheap, while paying full thinking cost). Full receipts below.
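For concreteness, here is a sketch of the request body such a client sends; the model name is a placeholder, and the prompt matches the reproducer below. reasoning_effort is a top-level field of the chat-completions request, which vLLM forwards into the Jinja environment when rendering the template; today the template simply never reads it:

```python
import json

# Hypothetical request an OpenAI-API client would send (model name is a placeholder).
payload = {
    "model": "<served-model-name>",
    "messages": [{"role": "user", "content": "What is the capital city of France?"}],
    "max_tokens": 1024,
    "temperature": 0,
    "reasoning_effort": "low",  # silently a no-op with the current template
}
print(json.dumps(payload))
```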

Reproducer

Serving this model in vllm/vllm-openai:v0.19.0 with --reasoning-parser nemotron_v3 --trust-remote-code, I sent the prompt "What is the capital city of France?" with max_tokens=1024, temperature=0 in five modes, 3 runs each:

| # | Mode | reasoning_chars | content_chars | completion_tokens | wall_ms |
|---|------|----------------:|--------------:|------------------:|--------:|
| 1 | no flag | 1146 | 368 | 372 | 3084 |
| 2 | reasoning_effort="low" | 985 | 350 | 310 | 2576 |
| 3 | reasoning_effort="none" | 0 | 373 | 315 | 2620 |
| 4 | kwargs.low_effort=true | 45 | 37 | 22 | 217 |
| 5 | kwargs.enable_thinking=false | 0 | 36 | 9 | 108 |
  • Row 2 is within run-to-run noise of Row 1. reasoning_effort: "low" does not meaningfully shorten generation.
  • Row 3 shows reasoning: null, but completion_tokens and wall time match full-thinking Row 1: the model generated a full CoT that was simply hidden. Compare Row 5, where reasoning: null correctly correlates with 9 tokens and 108 ms.
  • Rows 4 and 5 (the existing chat_template_kwargs knobs) are the only modes that actually reduce generation, by 17× and 41× respectively. These correspond to the Jinja branches at chat_template.jinja:180 (appends {reasoning effort: low} to the user message when low_effort) and chat_template.jinja:204-208 (emits an empty <think></think> in the generation prompt when not enable_thinking).
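For comparison, the two modes that do work today (rows 4 and 5) are reached through vLLM's chat_template_kwargs request field, which the OpenAI Python SDK can pass via extra_body. A sketch, with the client call left illustrative since it needs a running server:

```python
# vLLM merges `chat_template_kwargs` into the Jinja environment when rendering
# the chat template, so these reach the template's native knobs directly.
row4_extra = {"chat_template_kwargs": {"low_effort": True}}          # row 4
row5_extra = {"chat_template_kwargs": {"enable_thinking": False}}    # row 5

# Illustrative call via the OpenAI SDK (requires a running vLLM server):
# client.chat.completions.create(
#     model="<served-model-name>",
#     messages=[{"role": "user", "content": "What is the capital city of France?"}],
#     extra_body=row5_extra,
# )
print(row4_extra, row5_extra)
```

The catch, of course, is that extra_body is a vLLM-specific escape hatch; generic OpenAI-API clients only know how to send reasoning_effort, which is why the template should honor it.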

Proposed fix

A short Jinja addition, inserted right after the existing variable definitions at the top of chat_template.jinja:

```jinja
{%- set enable_thinking = enable_thinking if enable_thinking is defined else True %}
{%- set low_effort = low_effort if low_effort is defined else False %}
{%- set truncate_history_thinking = truncate_history_thinking if truncate_history_thinking is defined else True %}

{# Honor OpenAI-standard reasoning_effort field as a fallback when the native flags are unset. #}
{%- if reasoning_effort is defined and reasoning_effort is not none %}
    {%- if reasoning_effort == "none" %}
        {%- set enable_thinking = False %}
    {%- elif reasoning_effort == "low" %}
        {%- set low_effort = True %}
    {%- endif %}
{%- endif %}
```

Mapping rationale:

  • reasoning_effort == "none" → enable_thinking = False (model doesn't think at all)
  • reasoning_effort == "low" → low_effort = True (model thinks briefly)
  • reasoning_effort == "medium" | "high" → no change (the default is already full thinking)

The patch is defensive: it only sets flags when reasoning_effort is explicitly provided, and it only ever lowers effort, so an explicit chat_template_kwargs.enable_thinking=false or low_effort=true (applied first) is never undone.
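The fallback is small enough to sanity-check in isolation. A minimal sketch with the jinja2 library, using a stand-in template that contains only the proposed logic plus a probe line (hypothetical, not the real chat_template.jinja):

```python
from jinja2 import Environment

# Stand-in template: the proposed fallback plus a probe line exposing the flags.
PATCH = """\
{%- set enable_thinking = enable_thinking if enable_thinking is defined else True -%}
{%- set low_effort = low_effort if low_effort is defined else False -%}
{%- if reasoning_effort is defined and reasoning_effort is not none -%}
    {%- if reasoning_effort == "none" -%}
        {%- set enable_thinking = False -%}
    {%- elif reasoning_effort == "low" -%}
        {%- set low_effort = True -%}
    {%- endif -%}
{%- endif -%}
thinking={{ enable_thinking }} low={{ low_effort }}"""

tmpl = Environment().from_string(PATCH)

print(tmpl.render())                          # thinking=True low=False
print(tmpl.render(reasoning_effort="none"))   # thinking=False low=False
print(tmpl.render(reasoning_effort="low"))    # thinking=True low=True
print(tmpl.render(reasoning_effort="high"))   # thinking=True low=False
# An explicit native flag survives: reasoning_effort never raises effort.
print(tmpl.render(enable_thinking=False, reasoning_effort="high"))  # thinking=False low=False
```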

Why fix it here rather than in vLLM

This bug also affects any other OpenAI-compatible inference server that renders the HF chat template (SGLang, TGI, llama.cpp server, etc.). Fixing it in the template fixes it everywhere at once. I've filed a companion vLLM-side issue for teams that can't wait for a model card update, but the template is the right long-term home for the fix.

Happy to open a PR against the repo if that's the preferred workflow. Thanks! 🙏

keyang-lite changed discussion status to closed
