Thinking Blocks Error
Does anyone else experience this issue? The LLM doesn't generate <think>/</think> blocks. It does something like this:
THINKING PROCESS:
....example text
</think>
How do I solve this? Prompt injection? I feel like the model should be able to handle this by default.
How are you running the model? Your inference server should already parse the response and put what starts with
Thinking Process:
[...]
in reasoning_content, and its answer in content.
Maybe check whether your server command includes the correct reasoning parser. If you set it to "auto" (or didn't set it at all), try updating your inference server; for a model release like this, sglang, vllm, etc. should all have added support for it.
At least with llama.cpp, I personally correctly get the expected behavior:
Thinking part:
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":"Thinking"}}],"created":1772093206,"id":"chatcmpl-PYHrFtxQsSAXboFq7NgCGxOkbZ55GzGS","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8157-2943210c1","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":12,"prompt_ms":196.066,"prompt_per_token_ms":16.338833333333334,"prompt_per_second":61.20388032601267,"predicted_n":1,"predicted_ms":0.001,"predicted_per_token_ms":0.001,"predicted_per_second":1000000.0}}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":" Process"}}],"created":1772093206,"id":"chatcmpl-PYHrFtxQsSAXboFq7NgCGxOkbZ55GzGS","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8157-2943210c1","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":12,"prompt_ms":196.066,"prompt_per_token_ms":16.338833333333334,"prompt_per_second":61.20388032601267,"predicted_n":2,"predicted_ms":38.147,"predicted_per_token_ms":19.0735,"predicted_per_second":52.4287624190631}}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":":"}}],"created":1772093207,"id":"chatcmpl-PYHrFtxQsSAXboFq7NgCGxOkbZ55GzGS","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8157-2943210c1","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":12,"prompt_ms":196.066,"prompt_per_token_ms":16.338833333333334,"prompt_per_second":61.20388032601267,"predicted_n":3,"predicted_ms":79.392,"predicted_per_token_ms":26.464,"predicted_per_second":37.787182587666265}}
[...]
Answer part:
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"Hey"}}],"created":1772093220,"id":"chatcmpl-PYHrFtxQsSAXboFq7NgCGxOkbZ55GzGS","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8157-2943210c1","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":12,"prompt_ms":196.066,"prompt_per_token_ms":16.338833333333334,"prompt_per_second":61.20388032601267,"predicted_n":405,"predicted_ms":13295.955,"predicted_per_token_ms":32.82951851851852,"predicted_per_second":30.460391901145876}}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" there"}}],"created":1772093220,"id":"chatcmpl-PYHrFtxQsSAXboFq7NgCGxOkbZ55GzGS","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8157-2943210c1","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":12,"prompt_ms":196.066,"prompt_per_token_ms":16.338833333333334,"prompt_per_second":61.20388032601267,"predicted_n":406,"predicted_ms":13328.65,"predicted_per_token_ms":32.829187192118226,"predicted_per_second":30.460699320636373}}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"!"}}],"created":1772093220,"id":"chatcmpl-PYHrFtxQsSAXboFq7NgCGxOkbZ55GzGS","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8157-2943210c1","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":12,"prompt_ms":196.066,"prompt_per_token_ms":16.338833333333334,"prompt_per_second":61.20388032601267,"predicted_n":407,"predicted_ms":13361.324,"predicted_per_token_ms":32.828805896805896,"predicted_per_second":30.461053111203647}}
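For anyone consuming the stream by hand rather than through a client library, the split above is straightforward to reproduce: each SSE chunk carries its text in either delta.reasoning_content or delta.content. A minimal sketch, using abbreviated chunks modeled on the log above (real chunks carry more fields, which this ignores):

```python
import json

def split_stream(sse_lines):
    """Accumulate reasoning text and answer text from OpenAI-style SSE chunks."""
    reasoning, answer = [], []
    for line in sse_lines:
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        delta = json.loads(line[len("data: "):])["choices"][0]["delta"]
        if delta.get("reasoning_content"):
            reasoning.append(delta["reasoning_content"])
        if delta.get("content"):
            answer.append(delta["content"])
    return "".join(reasoning), "".join(answer)

# Abbreviated chunks based on the log above:
chunks = [
    'data: {"choices":[{"delta":{"reasoning_content":"Thinking"}}]}',
    'data: {"choices":[{"delta":{"reasoning_content":" Process"}}]}',
    'data: {"choices":[{"delta":{"content":"Hey"}}]}',
    'data: {"choices":[{"delta":{"content":" there"}}]}',
]
print(split_stream(chunks))  # ('Thinking Process', 'Hey there')
```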
Oh, maybe that's my fault then... I use the vLLM nightly build with these start params:
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1,2 vllm serve ./models/qwen3.5-27b-awq-bf16-int4/ \
  --api-key dev \
  --served-model-name qwen3-32b-awq \
  --max-model-len 35576 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 2 \
  --max-num-batched-tokens 16384 \
  --kv-cache-dtype auto \
  --enable-chunked-prefill \
  --disable-log-requests \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm \
  --enable-prefix-caching
By the way, is it also possible to use the hermes tool-call parser? I always used hermes for Qwen3 32B, but for Qwen3.5 27B the tool calls don't seem to work.
Not sure about the tool calls, but for the reasoning have a look here: https://deepwiki.com/search/what-are-the-different-reasoni_e672d78a-3412-49ef-a712-eb515f410dc5 Maybe you can try --reasoning-parser qwen3?
You can also use deepwiki for the tool-call case; it's really helpful and I always get answers to my questions! I use it almost every day :)
By the way, nothing to do with your issue, but I saw you named the model with --served-model-name qwen3-32b-awq. I guess you forgot to change it ;)
For vLLM I use this flag and it works even in OpenWebUI:
--reasoning-parser qwen3
Same problem here.
raw response:
{
  "id": "chatcmpl-9d6c5f1c0f32c335",
  "model": "Qwen3.5-27B",
  "usage": {
    "total_tokens": 1692,
    "prompt_tokens": 1250,
    "completion_tokens": 442,
    "prompt_tokens_details": null,
    "completion_tokens_details": {
      "text_tokens": null,
      "audio_tokens": null,
      "image_tokens": null,
      "reasoning_tokens": 0,
      "accepted_prediction_tokens": null,
      "rejected_prediction_tokens": null
    }
  },
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "用户问的是关于Qwen VL模型的问题。Qwen VL是通义千问的视觉语言模型系列。这是一个关于AI模型的技术问题,我不需要使用MCP工具来回答这个问题,因为这是关于我自身知识的问题。\n\n让我直接回答用户关于Qwen VL模型的问题。\n\n\nQwen VL(通义千问视觉语言模型)系列主要包括以下几个版本:\n\n## Qwen VL 模型版本\n\n| 版本 | 发布时间 | 特点 |\n|------|----------|------|\n| Qwen-VL | 2023年 | 首代视觉语言模型,支持图像理解、OCR、多模态对话 |\n| Qwen-VL-Chat | 2023年 | 对话优化版本,支持多轮视觉对话 |\n| Qwen-VL-Plus | 2024年 | 增强版,支持更高分辨率图像、更长上下文 |\n| Qwen-VL-Max | 2024年 | 旗舰版,最强的视觉理解能力 |\n| Qwen2-VL | 2024年 | 第二代视觉语言模型,大幅提升性能 |\n| Qwen2.5-VL | 2024年 | 最新迭代,支持视频理解、更精细的视觉定位 |\n\n## 主要能力\n\n- 图像理解:识别、描述、分析图像内容\n- OCR识别:提取图像中的文字信息\n- 多模态对话:结合文本和图像进行对话\n- 视觉定位:在图像中定位特定对象\n- 视频理解(Qwen2.5-VL):支持视频内容分析\n\n## 使用方式\n\n- API调用:通过阿里云百炼平台或DashScope API\n- 本地部署:部分版本可在Hugging Face获取开源权重\n- 模型尺寸:提供多种参数量版本(0.5B、7B、72B等)\n\n您想了解某个具体版本的详细信息或使用方式吗?",
        "tool_calls": null,
        "function_call": null
      },
      "finish_reason": "stop"
    }
  ],
  "created": 1773548162,
  "system_fingerprint": null
}
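A quick programmatic way to catch this misconfiguration: when no reasoning parser is active, the thinking text lands inside content and usage.completion_tokens_details.reasoning_tokens stays at 0, exactly as in the response above. A small sketch (field names taken from that response):

```python
def reasoning_parsed(response: dict) -> bool:
    """True if the server separated the reasoning from the final answer."""
    usage = response.get("usage") or {}
    details = usage.get("completion_tokens_details") or {}
    return (details.get("reasoning_tokens") or 0) > 0

# The response above reports reasoning_tokens: 0 -> parser likely not configured
broken = {"usage": {"completion_tokens_details": {"reasoning_tokens": 0}}}
print(reasoning_parsed(broken))  # False
```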
vllm --reasoning-parser qwen3
@hehe3838 In addition to llama.cpp, I also set up vllm to run the model, and couldn't reproduce.
Can you share the exact command you run the server with along with the API request?
docker.yml
services:
  qwen35:
    image: vllm/vllm-openai:nightly
    container_name: qwen35-27b-4x4090
    restart: unless-stopped
    runtime: nvidia
    # ipc: host
    environment:
      # GPU and distributed configuration
      - NVIDIA_VISIBLE_DEVICES=0,1,2,3
      - NCCL_DEBUG=WARN
      - NCCL_IB_DISABLE=1
      - NCCL_P2P_DISABLE=1
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - NCCL_MIN_NCHANNELS=4
      # API and security
      - VLLM_API_KEY=xxxxx
    volumes:
      - ./models:/models:ro
      - ./cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    shm_size: '16gb'
    entrypoint: ["vllm", "serve"]
    command:
      - "/models/Qwen3.5-27B"
      # Basic service configuration
      - "--served-model-name=Qwen3.5-27B"
      - "--port=8000"
      - "--host=0.0.0.0"
      # Parallelism and precision strategy
      - "--tensor-parallel-size=4"
      - "--dtype=half"
      - "--kv-cache-dtype=auto"
      # VRAM and context management
      - "--gpu-memory-utilization=0.85"
      - "--max-model-len=32768"
      - "--max-num-seqs=8"
      # Qwen3.5-specific parsers
      - "--reasoning-parser=qwen3"
      - "--enable-auto-tool-choice"
      - "--tool-call-parser=qwen3_coder"
      # Performance optimizations (recommended for production)
      - "--enable-prefix-caching"
      - "--enable-chunked-prefill"
      # Multimodal support (add --language-model-only to save VRAM if vision is not needed)
      - "--mm-processor-cache-type=shm"
      # Security and compatibility
      - "--trust-remote-code"
curl request:
-d '{
"model": "Qwen3.5-27B",
"messages": [{"role": "user", "content": "介绍下莎士比亚和他的作品"}],
"max_tokens": 4096,
"temperature": 1.0,
"stream": true,
"chat_template_kwargs": {"enable_thinking": true}
}'
What if you pass "chat_template_kwargs": {"enable_thinking": true} as "extra_body": {"chat_template_kwargs": {"enable_thinking": true}} instead?
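For reference, the openai Python client rejects unknown keyword arguments, so non-standard fields like chat_template_kwargs have to be forwarded through extra_body rather than passed at the top level as in raw curl. A minimal sketch of the difference (model name and message taken from the request above, translated):

```python
import json

# Body of the raw curl request: chat_template_kwargs sits at the top level.
curl_body = {
    "model": "Qwen3.5-27B",
    "messages": [{"role": "user", "content": "Introduce Shakespeare and his works"}],
    "chat_template_kwargs": {"enable_thinking": True},
}

# With the openai Python client, the same field goes through extra_body, e.g.:
#   client.chat.completions.create(model=..., messages=..., extra_body=extra)
extra = {"chat_template_kwargs": curl_body.pop("chat_template_kwargs")}
print(json.dumps(extra))
```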
I suspect it's currently running in instruct mode, because the "thinking part" of the returned response looks different from the usual reasoning_content, which tends to be more structured than what we see in your returned content (usually something like Thinking Process:\n\n1. **Analyze the Input:**\n[...]).
I also have a similar serving issue with vllm. The following is the execution information.
docker run -d --name vllm-qwen3.5-35b-gptq \
  --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /etc/localtime:/etc/localtime:ro \
  -v /etc/timezone:/etc/timezone:ro \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 \
  --tensor-parallel-size 2 \
  --quantization gptq_marlin \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --dtype float16 \
  --kv-cache-dtype fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml
Resolved. The issue was that the reasoning parser was not configured. Please add the following option:
--reasoning-parser qwen3
