Very good so far!
Even after squeezing it into 64GB of memory, it really holds the Claude-style reasoning; so far it has only slipped back into Qwen3.5-style reasoning twice. The tone of the model is also really nice. It appears that a lot of the overt enthusiasm and sycophancy was trained out during the fine-tuning, which is great news and aligns with how Claude usually writes. Much better than the tune from timteh673.
In a couple of random accuracy tests, the model either got the answer 100% correct with orders of magnitude less reasoning, or failed in a way where I could actually trace the error and see how it arrived at that conclusion, which beats a straight-up nonsensical hallucination.
Using the built-in llama.cpp MCP implementation, I ran a couple of prompts that required web search and fetching site content, which it handled well, even double-checking sources, something I didn't see in base Qwen3.5. It also handled Context7 perfectly, zero-shotting all of my (admittedly not that complex) coding tests.
For now it seems to fix the biggest pain points of Qwen3.5 while leaving intact, or maybe even improving, what was already good. I'll keep testing (given how reasoning works, I do wonder what the consequences are of training directly on reasoning traces as opposed to letting the model figure it out within constraints!), but it looks like I can finally replace much of what Perplexity was doing for me with a single, local model.
Thank you for this tune!
Thanks for the detailed feedback — really appreciate you putting it through so many tests.
Good to hear the tone and reasoning style are closer to what you’d expect.
Your observation about the failure modes is also very helpful — making errors more interpretable (instead of pure hallucinations) was part of the goal.
Interesting point about MCP + web search and the source double-checking. That wasn’t explicitly enforced, so it’s likely emerging behavior.
And agreed on the reasoning traces question — there are definitely trade-offs there, and I’m still exploring that direction.
Thanks again for sharing your results; looking forward to hearing more as you continue testing. :)