Bogus template?
Hi!
Trying to use this model with paroquant[mlx] via swival shows issues with thinking sections: the closing </think> tag appears without a matching opening <think> tag:
$ uv run swival --provider generic --base-url http://127.0.0.1:8000 --model qwen "Compute 2+2"
LLM responded in 53.0s finish_reason=stop
✓ Agent finished: 1 turns
The user wants me to compute 2+2, which is a simple arithmetic operation that equals 4.
</think>
2 + 2 = **4**
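For reference, the malformed output above can be detected mechanically. A minimal sketch (plain Python; tag names taken from the log above, the helper name is mine) that checks for a </think> close without a matching <think> open and repairs it by prepending the missing tag:

```python
def repair_think_tags(text: str) -> str:
    """If the response closes a think block that was never opened,
    prepend the missing opening tag so downstream parsers see a
    well-formed thinking section."""
    opens = text.count("<think>")
    closes = text.count("</think>")
    if closes > opens:
        # The model (or template) emitted the reasoning without the
        # opening tag, as in the log above: add one per missing open.
        text = "<think>\n" * (closes - opens) + text
    return text

broken = "The user wants me to compute 2+2...\n</think>\n2 + 2 = **4**"
fixed = repair_think_tags(broken)
```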
Also, adding enable_thinking = false to the chat template doesn't disable thinking, while it does with other inference tools and quants.
Did I miss something?
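My understanding (an assumption based on the upstream Qwen3 chat template, not on this quant's actual template) is that when enable_thinking is false, the template pre-fills an empty think block in the prompt so generation starts after the closing tag. A minimal sketch of that intended behavior:

```python
def build_prompt(user_msg: str, enable_thinking: bool = True) -> str:
    """Sketch of how the upstream Qwen3-style chat template handles
    enable_thinking (an assumption, not this quant's template).
    With thinking disabled, an empty think block is pre-filled so the
    model skips straight to the answer."""
    prompt = f"<|im_start|>user\n{user_msg}<|im_end|>\n<|im_start|>assistant\n"
    if not enable_thinking:
        # Pre-filled empty thinking section: nothing left to "think".
        prompt += "<think>\n\n</think>\n\n"
    return prompt
```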
The chat template was copied directly from the unquantized model, so it shouldn't cause the problem. Could you check the behavior of the official Qwen models using the following command?
mlx_vlm.server --model Qwen/Qwen3.5-0.8B --port 8000
If the behavior is the same, then the problem lies in mlx upstream or in the agent framework you're using. Let me know if the unquantized model works and the ParoQuant one doesn't.
The mlx-community/Qwen3.5-0.8B-MLX-4bit model works perfectly.
The original Qwen/Qwen3.5-0.8B also doesn't have that issue.
Maybe your quants are based on an older version? I think there have been a couple of revisions.
I've reproduced the issue and confirmed that it's due to a non-standard API implementation in mlx-vlm. As a temporary fix, please add --llm-only to the ParoQuant server arguments to force using mlx-lm rather than mlx-vlm.
Even with --llm-only, tool calling seems to be broken, albeit differently:
Calling model openai/z-lab/Qwen3.5-27B-PARO with max_tokens=32768, top_p=1.0
LLM responded in 9.1s finish_reason=stop
✓ Agent finished: 1 turns
<tool_call>
<function=grep>
<parameter=pattern>
fastlr|FastLR
</parameter>
<parameter=case_insensitive>
true
</parameter>
</function>
</tool_call>
<tool_call>
<function=list_files>
<parameter=pattern>
**/*fastlr*
</parameter>
</function>
</tool_call>
<tool_call>
<function=list_files>
<parameter=pattern>
**/services/**
</parameter>
</function>
</tool_call>
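For anyone hitting the same parsing failure: the raw output above can be extracted with a small regex-based parser. This is a hedged sketch of the format shown in the log (tag names taken from the output above, not from qwen_agent's actual implementation):

```python
import re

def parse_tool_calls(text: str):
    """Extract tool calls from the <tool_call>/<function=...>/<parameter=...>
    format shown above. Returns a list of (function_name, {param: value})."""
    calls = []
    for block in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL):
        m = re.search(r"<function=([^>]+)>", block)
        if not m:
            continue  # malformed block: no function tag
        params = {
            k: v.strip()
            for k, v in re.findall(
                r"<parameter=([^>]+)>(.*?)</parameter>", block, re.DOTALL
            )
        }
        calls.append((m.group(1), params))
    return calls
```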
Yes, I can reproduce the exact problem above with swival, but the official qwen_agent package by the Qwen team parses the model's response correctly, as in https://github.com/z-lab/paroquant/blob/main/paroquant/cli/agent.py. I wonder if there's a compatibility issue with swival. We'll look into this.
It's been fixed in the latest version. Please upgrade the package with:
pip install --upgrade "paroquant[mlx]"
Feel free to re-open this issue if you encounter any problems.
Awesome, thank you!
Have you had any further problems with the current quant? Would you recommend it?