Qwen3.5 122B on Strix Halo
Since qwen3.5 was re-released, will this one need to be? I had issues with unsloth's quants, but yours have been superb so far on my Strix Halo.
Please do what you did here for glm4.7-flash, gpt-oss-120b, qwen3.5-35b, and others. You are the master IMO.
> Since qwen3.5 was re-released, will this one need to be?
Qwen3.5 was not re-released; that was a problem with unsloth's quantization method that doesn't affect my own. I use bartowski's calibration data and fixed tensor types.
> glm4.7-flash, gpt-oss-120b, and qwen3.5-35b
The homogeneous Q8_0 size of all of these models is under 100GiB, so there's not much point in making a mixed quant like I did here; prompt processing (PP) would probably end up worse.
Optimizing for token generation (TG) is purely a memory-bandwidth issue, so you just want to run as few bits as you're comfortable with, probably ~Q5_K_S or the re-released unsloth dynamic quants.
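For reference, producing a homogeneous low-bit quant like that from an F16/BF16 GGUF is a single llama-quantize call. A minimal sketch (file names are placeholders; the imatrix step is optional but recommended at these bit rates, and `calibration.txt` stands in for a calibration set such as bartowski's):

```shell
# Optional: build an importance matrix from calibration data first
llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix

# Quantize everything homogeneously to Q5_K_S using the imatrix
llama-quantize --imatrix model.imatrix model-f16.gguf model-Q5_K_S.gguf Q5_K_S
```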
I actually did end up re-quantizing it to dequantize the SSM alpha/beta layers like bartowski did; it only adds ~40MB.
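For anyone wanting to reproduce that, recent llama.cpp builds let you override individual tensor types at quantization time with `--tensor-type`. A sketch, assuming the SSM alpha/beta tensors match an `ssm_a`/`ssm_b` pattern (check your model's actual tensor names first, they vary by architecture):

```shell
# List tensor names to find the SSM alpha/beta tensors
gguf-dump model-f16.gguf | grep ssm

# Keep those tensors at F32 while quantizing the rest to Q8_0
llama-quantize --tensor-type "ssm_a=f32" --tensor-type "ssm_b=f32" \
  model-f16.gguf model-Q8_0.gguf Q8_0
```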
I can't understand why your version of this isn't more popular; I've had way more success (speed etc.) with it than with unsloth's. Oddly, I am running it with llama.cpp Vulkan and not ROCm. I can't seem to get it to load with the Fedora 43 ROCm stack :/ ... I'll try harder later.
I have mentioned it in the strix halo discord multiple times.
Thank you for your focus on Strix Halo. Could you also publish a recommended llama-server command?
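In the meantime, a minimal starting point on a Strix Halo box with the Vulkan backend might look like the sketch below; the model filename is a placeholder and the context size should be tuned to available memory:

```shell
# -ngl 99: offload all layers to the iGPU
# -c: context length; lower it if the model plus KV cache doesn't fit
llama-server \
  -m qwen3.5-122b.gguf \
  -ngl 99 \
  -c 32768 \
  --host 0.0.0.0 --port 8080
```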