Qwen3.5 122B on Strix Halo
Since qwen3.5 was re-released, will this one need to be? I had issues with unsloth's quants, but yours have been superb so far on my Strix Halo.
Please do what you did here for glm4.7-flash, gpt-oss-120b, qwen3.5-35b, and others. You are the master IMO.
> Since qwen3.5 was re-released, will this one need to be?
Qwen3.5 was not re-released; that was a problem with unsloth's quantization method that doesn't affect my own. I use bartowski's calibration data and fixed tensor types.
> glm4.7-flash, gpt-oss-120b, and qwen3.5-35b
The homogeneous Q8_0 size of all of these models is under 100GiB, so there's not much point in making a mixed quant like I did here; prompt processing (PP) would probably end up worse.
Optimizing for token generation (TG) is purely a memory-bandwidth issue, so you just want to run as few bits as you're comfortable with, probably ~Q5_K_S or the re-released unsloth dynamic quants.
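For reference, producing a homogeneous low-bit quant like that from an F16/BF16 GGUF is a single llama-quantize call. A minimal sketch (file names are placeholders; the imatrix step is optional but recommended at these bit rates, and `calibration.txt` stands in for a calibration set such as bartowski's):

```shell
# Optional: build an importance matrix from calibration data first
llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix

# Quantize everything homogeneously to Q5_K_S using the imatrix
llama-quantize --imatrix model.imatrix model-f16.gguf model-Q5_K_S.gguf Q5_K_S
```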
I actually did end up re-quantizing it to dequantize the SSM alpha/beta layers like bartowski did; it only adds ~40MB.
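For anyone wanting to reproduce that, recent llama.cpp builds let you override individual tensor types at quantization time with `--tensor-type`. A sketch, assuming the SSM alpha/beta tensors match an `ssm_a`/`ssm_b` pattern (check your model's actual tensor names first, they vary by architecture):

```shell
# List tensor names to find the SSM alpha/beta tensors
gguf-dump model-f16.gguf | grep ssm

# Keep those tensors at F32 while quantizing the rest to Q8_0
llama-quantize --tensor-type "ssm_a=f32" --tensor-type "ssm_b=f32" \
  model-f16.gguf model-Q8_0.gguf Q8_0
```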
I can't understand why your version of this isn't more popular; I've had way more success (speed etc.) with it than with unsloth's. Oddly, I am running it with llama.cpp Vulkan and not ROCm. I can't seem to get it to load with the Fedora 43 ROCm stack :/ ... I'll try harder later.
I have mentioned it in the strix halo discord multiple times.
Thank you for your focus on Strix Halo. Could you also publish a recommended llama-server command?
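In the meantime, a minimal starting point on a Strix Halo box with the Vulkan backend might look like the sketch below; the model filename is a placeholder and the context size should be tuned to available memory:

```shell
# -ngl 99: offload all layers to the iGPU
# -c: context length; lower it if the model plus KV cache doesn't fit
llama-server \
  -m qwen3.5-122b.gguf \
  -ngl 99 \
  -c 32768 \
  --host 0.0.0.0 --port 8080
```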