Great model, small problem
I keep getting stuck in thinking loops where the model emits long runs of ???????????
I think this is a bug in llama.cpp and I've opened a ticket, but it could be the model as well; I'm honestly not sure. Otherwise, great model. It's trading blows with large models such as Behemoth, and I'd argue it's actually better with thinking enabled.
Without thinking it's... not great.
With Qwen3.5, this issue mainly comes from its chat-template design. To unlock the model's full performance, you have to deploy it with a professional inference framework such as SGLang or vLLM, using the qwen3 reasoning template together with the qwen3_coder tool-calling template (this qwen3_coder tool template is important). Otherwise, when handling long contexts, the model can produce artifacts like "!!!!!!!!" or "??????????", or fall into infinite repetition.
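For reference, here is a minimal launch sketch of the setup described above. The exact flag names, their accepted values, and the model path are assumptions based on how SGLang exposes reasoning and tool-call parsers; verify them against `python -m sglang.launch_server --help` for your installed version.

```shell
# Sketch: serving Qwen3.5 on SGLang with the qwen3 reasoning template
# and the qwen3_coder tool-calling template, as the post recommends.
# --model-path is a hypothetical model id; flag names are assumptions,
# check --help on your SGLang version before relying on them.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-27B \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --context-length 131072 \
  --host 0.0.0.0 \
  --port 30000
```

The two parser flags are the key part: they tell the server how to split the model's thinking block from its final answer and how to parse tool calls, which is what the post claims prevents the "!!!!!!!!"/"??????????" garbage on long contexts.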
This is hard to accept, especially for users with constrained resources, who are exactly the ones who rely on llama.cpp. For now, the best we can do is use ktransformers (kt-kernel + sglangKT) to experiment with the Qwen3.5 MoE models; we still can't comfortably deploy the promising Qwen3.5 27B dense model on llama.cpp.
"Hey, I gave llama.cpp another shot and it seems like Qwen 3.5 is actually performing much better now (no more annoying 'wait' prompts or infinite loops). I’ve only done some basic testing so far—just simple greetings and a few questions. I haven’t confirmed yet if it still spits out '????????' when handling 100k long context, but it’s definitely time to get excited about llama.cpp again!" ---gemini
