Fix streaming output when enable_thinking is disabled
## Problem

The current `nano_v3_reasoning_parser.py` correctly handles the `enable_thinking: false` flag for non-streaming requests, but streaming requests still route content to the wrong field.
When using vLLM with streaming enabled and thinking disabled:
```python
from openai import OpenAI

# Assumes a local vLLM server; "EMPTY" is the conventional placeholder
# API key for vLLM's OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
```
**Current behavior:** content appears in `delta.reasoning_content` instead of `delta.content`.
**Expected behavior:** content appears in `delta.content` (since thinking is disabled).
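With the fix in place, a client accumulates the answer from `delta.content` across chunks. A minimal sketch of that accumulation, using stand-in chunk objects in place of a live server response (the chunk shape mirrors the OpenAI-compatible streaming format; `collect_answer` and the fake chunks are illustrative, not part of the patch):

```python
from types import SimpleNamespace

def collect_answer(chunks) -> str:
    """Concatenate delta.content across streaming chunks."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        # With enable_thinking=False, streamed text should land here,
        # not in delta.reasoning_content.
        if delta.content:
            parts.append(delta.content)
    return "".join(parts)

# Fake chunks standing in for the server's streamed response.
fake_chunks = [
    SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=t, reasoning_content=None))]
    )
    for t in ["Hel", "lo"]
]
print(collect_answer(fake_chunks))  # -> Hello
```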
## Root Cause

The existing `extract_reasoning` method handles the field swap for non-streaming responses, but the streaming path uses `extract_reasoning_streaming` from the parent `DeepSeekR1ReasoningParser`, which doesn't know about the `enable_thinking` flag.
## Solution

Override `extract_reasoning_streaming` to swap the fields when thinking is disabled, matching the behavior of the non-streaming path.
## Changes

- Add `__init__` to capture `enable_thinking` state at parser initialization
- Add `extract_reasoning_streaming` override to swap fields in streaming mode
- Add a docstring explaining the parser's purpose
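The shape of the change can be sketched as below. The class and method names follow this PR's description, but the signatures are simplified stand-ins for vLLM's actual reasoning-parser interface (the real override delegates to the parent class and handles partial `<think>` tags), so treat this as an illustration of the field swap, not the patch itself:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeltaMessage:
    """Simplified stand-in for vLLM's streaming delta message."""
    content: Optional[str] = None
    reasoning_content: Optional[str] = None

class NanoV3ReasoningParser:
    """Routes streamed text to reasoning_content or content depending on
    whether thinking is enabled for the request."""

    def __init__(self, enable_thinking: bool = True):
        # Capture the flag once at parser initialization.
        self.enable_thinking = enable_thinking

    def _parent_streaming(self, delta_text: str) -> DeltaMessage:
        # Stand-in for the parent DeepSeekR1-style parser, which treats
        # streamed text before </think> as reasoning.
        return DeltaMessage(reasoning_content=delta_text)

    def extract_reasoning_streaming(self, delta_text: str) -> DeltaMessage:
        delta = self._parent_streaming(delta_text)
        if not self.enable_thinking and delta.reasoning_content is not None:
            # Thinking is disabled: the text is the answer, not reasoning.
            delta = DeltaMessage(content=delta.reasoning_content)
        return delta

parser = NanoV3ReasoningParser(enable_thinking=False)
print(parser.extract_reasoning_streaming("Hello").content)  # -> Hello
```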
## Testing

Tested with vLLM v0.1.dev on NVIDIA DGX Spark (GB10) with both streaming and non-streaming requests:
```bash
# Streaming with thinking disabled - now works correctly
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/model",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true,
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```
Content now correctly appears in `delta.content` for all streaming chunks.
---

@kwondla This indeed fixes the non-reasoning/streaming issue, but it breaks tool calling: I can't get any IDE to use tools after adding this parser. Any ideas?