
Fix streaming output when enable_thinking is disabled

#29
by Kwindla - opened


Problem

The current nano_v3_reasoning_parser.py correctly handles the enable_thinking: false flag for non-streaming requests, but streaming requests still route content to the wrong field.

When using vLLM with streaming enabled and thinking disabled:

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)

Current behavior: Content appears in delta.reasoning_content instead of delta.content

Expected behavior: Content should appear in delta.content (since thinking is disabled)

Root Cause

The existing extract_reasoning method handles the field swap for non-streaming responses, but the streaming path uses extract_reasoning_streaming from the parent DeepSeekR1ReasoningParser, which doesn't know about the enable_thinking flag.

Solution

Override extract_reasoning_streaming to swap the fields when thinking is disabled, matching the behavior of the non-streaming path.

Changes

  1. Add __init__ to capture enable_thinking state at parser initialization
  2. Add extract_reasoning_streaming override to swap fields in streaming mode
  3. Add docstring explaining the parser's purpose
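The core of the change is the field swap itself. Here is a minimal standalone sketch of that logic; `DeltaMessage` below is a stand-in dataclass for illustration, not vLLM's actual class, and the function name is invented for this example:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeltaMessage:
    # Stand-in for vLLM's streaming delta object; illustrative only.
    reasoning_content: Optional[str] = None
    content: Optional[str] = None

def swap_when_thinking_disabled(delta: Optional[DeltaMessage],
                                enable_thinking: bool) -> Optional[DeltaMessage]:
    """When thinking is disabled, move any text the parent parser placed in
    reasoning_content over to content, mirroring the non-streaming path."""
    if delta is None or enable_thinking:
        return delta
    if delta.reasoning_content is not None:
        delta.content = (delta.content or "") + delta.reasoning_content
        delta.reasoning_content = None
    return delta
```

In the real parser this swap would run on each delta returned by the parent's streaming method, using the `enable_thinking` state captured in `__init__`.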

Testing

Tested with vLLM v0.1.dev on NVIDIA DGX Spark (GB10) with both streaming and non-streaming requests:

# Streaming with thinking disabled - now works correctly
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/path/to/model",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": true,
        "chat_template_kwargs": {"enable_thinking": false}
    }'

Content now correctly appears in delta.content for all streaming chunks.
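One quick way to sanity-check each chunk is to inspect the delta fields directly; the payload below is illustrative (field names follow the OpenAI-compatible streaming format, the values are made up for this example):

```python
import json

# Example streaming chunk after the fix (illustrative payload).
chunk = '{"choices": [{"delta": {"content": "Hello!", "reasoning_content": null}}]}'
delta = json.loads(chunk)["choices"][0]["delta"]

# With enable_thinking disabled, text should arrive in delta.content
# and reasoning_content should stay empty.
assert delta["content"] == "Hello!"
assert delta["reasoning_content"] is None
```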

Kwindla changed pull request status to open

@Kwindla This indeed fixes the non-reasoning/streaming issue, but breaks tool calling.

I can't get any IDE to use tools after adding this parser.

Any ideas?
