Bogus template?
Hi!
Trying to use this model with paroquant[mlx] via swival shows issues with thinking sections: the closing </think> tag appears without a matching opening <think> tag:
$ uv run swival --provider generic --base-url http://127.0.0.1:8000 --model qwen "Compute 2+2"
LLM responded in 53.0s finish_reason=stop
✓ Agent finished: 1 turns
The user wants me to compute 2+2, which is a simple arithmetic operation that equals 4.
</think>
2 + 2 = **4**
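For reference, the malformed output above can be detected mechanically. A minimal sketch (plain Python; tag names taken from the log above, the helper name is mine) that checks for a </think> close without a matching <think> open and repairs it by prepending the missing tag:

```python
def repair_think_tags(text: str) -> str:
    """If the response closes a think block that was never opened,
    prepend the missing opening tag so downstream parsers see a
    well-formed thinking section."""
    opens = text.count("<think>")
    closes = text.count("</think>")
    if closes > opens:
        # The model (or template) emitted the reasoning without the
        # opening tag, as in the log above: add one per missing open.
        text = "<think>\n" * (closes - opens) + text
    return text

broken = "The user wants me to compute 2+2...\n</think>\n2 + 2 = **4**"
fixed = repair_think_tags(broken)
```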
Also, adding enable_thinking = false to the chat template doesn't disable thinking, while it does with other inference tools and quants.
Did I miss something?
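My understanding (an assumption based on the upstream Qwen3 chat template, not on this quant's actual template) is that when enable_thinking is false, the template pre-fills an empty think block in the prompt so generation starts after the closing tag. A minimal sketch of that intended behavior:

```python
def build_prompt(user_msg: str, enable_thinking: bool = True) -> str:
    """Sketch of how the upstream Qwen3-style chat template handles
    enable_thinking (an assumption, not this quant's template).
    With thinking disabled, an empty think block is pre-filled so the
    model skips straight to the answer."""
    prompt = f"<|im_start|>user\n{user_msg}<|im_end|>\n<|im_start|>assistant\n"
    if not enable_thinking:
        # Pre-filled empty thinking section: nothing left to "think".
        prompt += "<think>\n\n</think>\n\n"
    return prompt
```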
The chat template was copied directly from the unquantized model, so it shouldn't cause the problem. Could you check the behavior of the official Qwen models using the following command?
mlx_vlm.server --model Qwen/Qwen3.5-0.8B --port 8000
If the behavior is the same, then the problem lies in mlx upstream or in the agent framework you're using. Let me know if the unquantized model works and the ParoQuant one doesn't.
The mlx-community/Qwen3.5-0.8B-MLX-4bit model works perfectly.
The original Qwen/Qwen3.5-0.8B also doesn't have that issue.
Maybe your quants are based on an older version? I think there have been a couple of revisions.
I've reproduced the issue and confirmed that it's due to a non-standard API implementation in mlx-vlm. As a temporary fix, please add --llm-only to the ParoQuant server arguments to force using mlx-lm rather than mlx-vlm.
Even with --llm-only, tool calling seems to be broken, albeit differently:
Calling model openai/z-lab/Qwen3.5-27B-PARO with max_tokens=32768, top_p=1.0
LLM responded in 9.1s finish_reason=stop
✓ Agent finished: 1 turns
<tool_call>
<function=grep>
<parameter=pattern>
fastlr|FastLR
</parameter>
<parameter=case_insensitive>
true
</parameter>
</function>
</tool_call>
<tool_call>
<function=list_files>
<parameter=pattern>
**/*fastlr*
</parameter>
</function>
</tool_call>
<tool_call>
<function=list_files>
<parameter=pattern>
**/services/**
</parameter>
</function>
</tool_call>
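For anyone hitting the same parsing failure: the raw output above can be extracted with a small regex-based parser. This is a hedged sketch of the format shown in the log (tag names taken from the output above, not from qwen_agent's actual implementation):

```python
import re

def parse_tool_calls(text: str):
    """Extract tool calls from the <tool_call>/<function=...>/<parameter=...>
    format shown above. Returns a list of (function_name, {param: value})."""
    calls = []
    for block in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL):
        m = re.search(r"<function=([^>]+)>", block)
        if not m:
            continue  # malformed block: no function tag
        params = {
            k: v.strip()
            for k, v in re.findall(
                r"<parameter=([^>]+)>(.*?)</parameter>", block, re.DOTALL
            )
        }
        calls.append((m.group(1), params))
    return calls
```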
Yes, I can reproduce the exact problem above with swival, but the official qwen_agent package by the Qwen team parses the model's response correctly, as in https://github.com/z-lab/paroquant/blob/main/paroquant/cli/agent.py. I wonder if there's a compatibility issue with swival. We'll look into this.
It's been fixed in the latest version. Please upgrade the package with:
pip install --upgrade "paroquant[mlx]"
Feel free to re-open this issue if you encounter any problems.
Awesome, thank you!
Have you had any further problems with the current quant? Would you recommend it?