Bug: Model does not produce <think></think> tokens when enable_thinking=true
When serving Qwen3-35B-A3B via llama.cpp with enable_thinking=true passed through --chat-template-kwargs, the model outputs all of its reasoning inside the final answer text rather than wrapping it in <think></think> tags. Adding the following system prompt forces correct behavior as a workaround:
"Always wrap your internal reasoning inside <think></think> tags before writing your final answer."
This suggests the model is not reliably conditioned to produce thinking tokens at inference time without explicit prompting, even when the chat template correctly injects the opening <think> token at the generation prompt.
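For reference, the workaround can be exercised against llama.cpp's OpenAI-compatible /v1/chat/completions endpoint. The sketch below only builds the request body; the model name and the chat_template_kwargs request field are assumptions based on the setup described above, and the system-prompt text is the workaround quoted earlier.

```python
import json

def build_request(user_msg: str, force_thinking_prompt: bool = True) -> dict:
    """Build a chat-completions request body mirroring the reported setup.

    The system prompt is the workaround that forces <think></think> output;
    sampling parameters match the environment section of this report.
    """
    messages = []
    if force_thinking_prompt:
        messages.append({
            "role": "system",
            "content": ("Always wrap your internal reasoning inside "
                        "<think></think> tags before writing your final answer."),
        })
    messages.append({"role": "user", "content": user_msg})
    return {
        "model": "Qwen3-35B-A3B",  # assumed served-model name
        "messages": messages,
        # Passed per-request; assumed to be honored the same way as the
        # server-side --chat-template-kwargs flag.
        "chat_template_kwargs": {"enable_thinking": True},
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
    }

body = build_request("Explain the bug.")
print(json.dumps(body, indent=2))
```

Comparing responses with force_thinking_prompt=True and False should show whether the workaround alone restores the <think></think> delimiters.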
Expected: Model emits <think>, reasons, emits </think>, then writes the final answer.
Actual: Model outputs reasoning and answer together as plain text with no delimiters.
Environment:
Model: unsloth/Qwen3-35B-A3B-GGUF, MXFP4_MOE quant
Backend: llama.cpp server
--jinja + custom --chat-template-file
enable_thinking=true via --chat-template-kwargs
Sampling: temp=0.7, top-p=0.8, top-k=20
Suspected cause: top-k=20 may exclude the </think> token from the candidate set during long reasoning chains, preventing the model from closing the block and transitioning to the final answer.
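The suspected mechanism can be illustrated with a toy sketch (this is not llama.cpp's actual sampler code; token names and logit values are invented). A hard top-k cutoff zeroes out any token ranked below position k, so if </think> consistently ranks just outside the window, it can never be sampled at that step:

```python
def top_k_filter(logits: dict, k: int) -> dict:
    """Keep the k highest-logit tokens; mask all others to -inf."""
    cutoff = sorted(logits.values(), reverse=True)[k - 1]
    return {t: (v if v >= cutoff else float("-inf")) for t, v in logits.items()}

# 25 filler tokens with logits 100, 99, ..., 76, plus "</think>" at 75,
# i.e. ranked 26th -- just outside a top-20 window.
logits = {f"tok{i}": float(100 - i) for i in range(25)}
logits["</think>"] = 75.0

filtered = top_k_filter(logits, k=20)
print(filtered["</think>"])  # -inf: the close tag cannot be sampled this step
```

If this is the cause, raising top-k (or relying on top-p alone) during reasoning should let the model close the block; in practice, though, the model never opening <think> at all points more at template or conditioning issues than at sampling.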
The same behavior occurs when mmproj-F16.gguf is loaded.