Thinking Blocks Error
Does anyone else experience this issue? The LLM doesn't generate <think>/</think> blocks. It does something like this:
THINKING PROCESS:
....example text
</think>
How do I solve this? Prompt injection? I feel like the model should be able to handle this by default.
How are you running the model? Your inference server should already parse the response and put what starts with
Thinking Process:
[...]
in reasoning_content, and its answer in content.
Maybe check whether your server command includes the correct reasoning parser. If you set it to "auto" (or didn't set it at all), try updating your inference server; for a model release like this, sglang, vllm, etc. should all have added support for it.
At least with llama.cpp, I personally correctly get the expected behavior:
Thinking part:
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":"Thinking"}}],"created":1772093206,"id":"chatcmpl-PYHrFtxQsSAXboFq7NgCGxOkbZ55GzGS","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8157-2943210c1","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":12,"prompt_ms":196.066,"prompt_per_token_ms":16.338833333333334,"prompt_per_second":61.20388032601267,"predicted_n":1,"predicted_ms":0.001,"predicted_per_token_ms":0.001,"predicted_per_second":1000000.0}}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":" Process"}}],"created":1772093206,"id":"chatcmpl-PYHrFtxQsSAXboFq7NgCGxOkbZ55GzGS","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8157-2943210c1","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":12,"prompt_ms":196.066,"prompt_per_token_ms":16.338833333333334,"prompt_per_second":61.20388032601267,"predicted_n":2,"predicted_ms":38.147,"predicted_per_token_ms":19.0735,"predicted_per_second":52.4287624190631}}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":":"}}],"created":1772093207,"id":"chatcmpl-PYHrFtxQsSAXboFq7NgCGxOkbZ55GzGS","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8157-2943210c1","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":12,"prompt_ms":196.066,"prompt_per_token_ms":16.338833333333334,"prompt_per_second":61.20388032601267,"predicted_n":3,"predicted_ms":79.392,"predicted_per_token_ms":26.464,"predicted_per_second":37.787182587666265}}
[...]
Answer part:
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"Hey"}}],"created":1772093220,"id":"chatcmpl-PYHrFtxQsSAXboFq7NgCGxOkbZ55GzGS","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8157-2943210c1","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":12,"prompt_ms":196.066,"prompt_per_token_ms":16.338833333333334,"prompt_per_second":61.20388032601267,"predicted_n":405,"predicted_ms":13295.955,"predicted_per_token_ms":32.82951851851852,"predicted_per_second":30.460391901145876}}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" there"}}],"created":1772093220,"id":"chatcmpl-PYHrFtxQsSAXboFq7NgCGxOkbZ55GzGS","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8157-2943210c1","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":12,"prompt_ms":196.066,"prompt_per_token_ms":16.338833333333334,"prompt_per_second":61.20388032601267,"predicted_n":406,"predicted_ms":13328.65,"predicted_per_token_ms":32.829187192118226,"predicted_per_second":30.460699320636373}}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"!"}}],"created":1772093220,"id":"chatcmpl-PYHrFtxQsSAXboFq7NgCGxOkbZ55GzGS","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8157-2943210c1","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":12,"prompt_ms":196.066,"prompt_per_token_ms":16.338833333333334,"prompt_per_second":61.20388032601267,"predicted_n":407,"predicted_ms":13361.324,"predicted_per_token_ms":32.828805896805896,"predicted_per_second":30.461053111203647}}
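For anyone consuming the stream by hand rather than through a client library, the split above is straightforward to reproduce: each SSE chunk carries its text in either delta.reasoning_content or delta.content. A minimal sketch, using abbreviated chunks modeled on the log above (real chunks carry more fields, which this ignores):

```python
import json

def split_stream(sse_lines):
    """Accumulate reasoning text and answer text from OpenAI-style SSE chunks."""
    reasoning, answer = [], []
    for line in sse_lines:
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        delta = json.loads(line[len("data: "):])["choices"][0]["delta"]
        if delta.get("reasoning_content"):
            reasoning.append(delta["reasoning_content"])
        if delta.get("content"):
            answer.append(delta["content"])
    return "".join(reasoning), "".join(answer)

# Abbreviated chunks based on the log above:
chunks = [
    'data: {"choices":[{"delta":{"reasoning_content":"Thinking"}}]}',
    'data: {"choices":[{"delta":{"reasoning_content":" Process"}}]}',
    'data: {"choices":[{"delta":{"content":"Hey"}}]}',
    'data: {"choices":[{"delta":{"content":" there"}}]}',
]
print(split_stream(chunks))  # ('Thinking Process', 'Hey there')
```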
Oh, maybe that's my fault then... I use the vLLM nightly build with these start params:
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1,2 vllm serve ./models/qwen3.5-27b-awq-bf16-int4/ \
  --api-key dev \
  --served-model-name qwen3-32b-awq \
  --max-model-len 35576 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 2 \
  --max-num-batched-tokens 16384 \
  --kv-cache-dtype auto \
  --enable-chunked-prefill \
  --disable-log-requests \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm \
  --enable-prefix-caching
By the way, is it also possible to use the hermes tool-call parser? I always used hermes for Qwen3 32B, but for Qwen3.5 27B the tool calls don't seem to work.
Not sure about the tool calls, but for the reasoning have a look here: https://deepwiki.com/search/what-are-the-different-reasoni_e672d78a-3412-49ef-a712-eb515f410dc5 Maybe you can try --reasoning-parser qwen3?
You can also use deepwiki for the tool-call case; it's really helpful and I always get answers to my questions! I use it almost every day :)
By the way, nothing to do with your issue, but I saw you named the model with --served-model-name qwen3-32b-awq. I guess you forgot to change it ;)
For vLLM I use this flag and it works even in OpenWebUI:
--reasoning-parser qwen3
Same problem here.
raw response:
{
  "id": "chatcmpl-9d6c5f1c0f32c335",
  "model": "Qwen3.5-27B",
  "usage": {
    "total_tokens": 1692,
    "prompt_tokens": 1250,
    "completion_tokens": 442,
    "prompt_tokens_details": null,
    "completion_tokens_details": {
      "text_tokens": null,
      "audio_tokens": null,
      "image_tokens": null,
      "reasoning_tokens": 0,
      "accepted_prediction_tokens": null,
      "rejected_prediction_tokens": null
    }
  },
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "用户问的是关于Qwen VL模型的问题。Qwen VL是通义千问的视觉语言模型系列。这是一个关于AI模型的技术问题,我不需要使用MCP工具来回答这个问题,因为这是关于我自身知识的问题。\n\n让我直接回答用户关于Qwen VL模型的问题。\n\n\nQwen VL(通义千问视觉语言模型)系列主要包括以下几个版本:\n\n## Qwen VL 模型版本\n\n| 版本 | 发布时间 | 特点 |\n|------|----------|------|\n| Qwen-VL | 2023年 | 首代视觉语言模型,支持图像理解、OCR、多模态对话 |\n| Qwen-VL-Chat | 2023年 | 对话优化版本,支持多轮视觉对话 |\n| Qwen-VL-Plus | 2024年 | 增强版,支持更高分辨率图像、更长上下文 |\n| Qwen-VL-Max | 2024年 | 旗舰版,最强的视觉理解能力 |\n| Qwen2-VL | 2024年 | 第二代视觉语言模型,大幅提升性能 |\n| Qwen2.5-VL | 2024年 | 最新迭代,支持视频理解、更精细的视觉定位 |\n\n## 主要能力\n\n- 图像理解:识别、描述、分析图像内容\n- OCR识别:提取图像中的文字信息\n- 多模态对话:结合文本和图像进行对话\n- 视觉定位:在图像中定位特定对象\n- 视频理解(Qwen2.5-VL):支持视频内容分析\n\n## 使用方式\n\n- API调用:通过阿里云百炼平台或DashScope API\n- 本地部署:部分版本可在Hugging Face获取开源权重\n- 模型尺寸:提供多种参数量版本(0.5B、7B、72B等)\n\n您想了解某个具体版本的详细信息或使用方式吗?",
        "tool_calls": null,
        "function_call": null
      },
      "finish_reason": "stop"
    }
  ],
  "created": 1773548162,
  "system_fingerprint": null
}
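A quick programmatic way to catch this misconfiguration: when no reasoning parser is active, the thinking text lands inside content and usage.completion_tokens_details.reasoning_tokens stays at 0, exactly as in the response above. A small sketch (field names taken from that response):

```python
def reasoning_parsed(response: dict) -> bool:
    """True if the server separated the reasoning from the final answer."""
    usage = response.get("usage") or {}
    details = usage.get("completion_tokens_details") or {}
    return (details.get("reasoning_tokens") or 0) > 0

# The response above reports reasoning_tokens: 0 -> parser likely not configured
broken = {"usage": {"completion_tokens_details": {"reasoning_tokens": 0}}}
print(reasoning_parsed(broken))  # False
```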
vllm --reasoning-parser qwen3
@hehe3838 In addition to llama.cpp, I also set up vllm to run the model, and couldn't reproduce.
Can you share the exact command you run the server with along with the API request?
docker.yml
services:
  qwen35:
    image: vllm/vllm-openai:nightly
    container_name: qwen35-27b-4x4090
    restart: unless-stopped
    runtime: nvidia
    # ipc: host
    environment:
      # GPU and distributed configuration
      - NVIDIA_VISIBLE_DEVICES=0,1,2,3
      - NCCL_DEBUG=WARN
      - NCCL_IB_DISABLE=1
      - NCCL_P2P_DISABLE=1
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - NCCL_MIN_NCHANNELS=4
      # API and security
      - VLLM_API_KEY=xxxxx
    volumes:
      - ./models:/models:ro
      - ./cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    shm_size: '16gb'
    entrypoint: ["vllm", "serve"]
    command:
      - "/models/Qwen3.5-27B"
      # Basic service configuration
      - "--served-model-name=Qwen3.5-27B"
      - "--port=8000"
      - "--host=0.0.0.0"
      # Parallelism and precision strategy
      - "--tensor-parallel-size=4"
      - "--dtype=half"
      - "--kv-cache-dtype=auto"
      # VRAM and context management
      - "--gpu-memory-utilization=0.85"
      - "--max-model-len=32768"
      - "--max-num-seqs=8"
      # Qwen3.5-specific parsers
      - "--reasoning-parser=qwen3"
      - "--enable-auto-tool-choice"
      - "--tool-call-parser=qwen3_coder"
      # Performance optimizations (recommended for production)
      - "--enable-prefix-caching"
      - "--enable-chunked-prefill"
      # Multimodal support (add --language-model-only to save VRAM if vision is not needed)
      - "--mm-processor-cache-type=shm"
      # Security and compatibility
      - "--trust-remote-code"
curl request:
-d '{
"model": "Qwen3.5-27B",
"messages": [{"role": "user", "content": "介绍下莎士比亚和他的作品"}],
"max_tokens": 4096,
"temperature": 1.0,
"stream": true,
"chat_template_kwargs": {"enable_thinking": true}
}'
What if you pass "chat_template_kwargs": {"enable_thinking": true} as "extra_body": {"chat_template_kwargs": {"enable_thinking": true}} instead?
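For reference, the openai Python client rejects unknown keyword arguments, so non-standard fields like chat_template_kwargs have to be forwarded through extra_body rather than passed at the top level as in raw curl. A minimal sketch of the difference (model name and message taken from the request above, translated):

```python
import json

# Body of the raw curl request: chat_template_kwargs sits at the top level.
curl_body = {
    "model": "Qwen3.5-27B",
    "messages": [{"role": "user", "content": "Introduce Shakespeare and his works"}],
    "chat_template_kwargs": {"enable_thinking": True},
}

# With the openai Python client, the same field goes through extra_body, e.g.:
#   client.chat.completions.create(model=..., messages=..., extra_body=extra)
extra = {"chat_template_kwargs": curl_body.pop("chat_template_kwargs")}
print(json.dumps(extra))
```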
I suspect it's currently running in instruct mode, because the "thinking part" of the returned response looks different from the usual reasoning_content, which tends to be more structured than what we see in your returned content (usually something like Thinking Process:\n\n1. **Analyze the Input:**\n[...]).
I also have a similar serving issue with vllm. The following is the execution information.
docker run -d --name vllm-qwen3.5-35b-gptq \
  --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /etc/localtime:/etc/localtime:ro \
  -v /etc/timezone:/etc/timezone:ro \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 \
  --tensor-parallel-size 2 \
  --quantization gptq_marlin \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --dtype float16 \
  --kv-cache-dtype fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml
Resolved. The issue was that the reasoning parser was not configured. Please add the following option:
--reasoning-parser qwen3
