Qwen3.5 GGUF Evaluation Results
Third-party results from Benjamin Marie:
"I tested Unsloth's UD Q4 and Q3 GGUF quantizations of Qwen3.5-397B-A17B and they both performed very well.
In my runs, I didn't observe a meaningful difference between the original weights and Q3 (less than 1 point of accuracy difference, so only a ~3.5% relative error increase).
You can cut on the order of ~500 GB of memory footprint while seeing little to no practical degradation (at least on the tasks I tried)."
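One plausible reading of those numbers: a sub-1-point drop works out to ~3.5% relative error when the baseline score is around 28 points. The scores below are hypothetical, for illustration only, not taken from the benchmark above:

```shell
# Hypothetical accuracies (points); a <1-point drop from ~28.5 -> ~3.5% relative change.
baseline=28.5
quantized=27.5
awk -v b="$baseline" -v q="$quantized" \
    'BEGIN { printf "~%.1f%% relative error increase\n", (b - q) / b * 100 }'
```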
Note that the 3-bit quant scored slightly higher than the 4-bit; this is within the normal margin of error.
Has anyone done similar testing for the 122b, 35ba35, or 27b yet?
I noticed that after the recent update, UD-TQ1_0 has been removed. Are you planning to upload it in the future?
UD-TQ1_0 is the only version that can run on a system with 128 GB of unified RAM. I would really appreciate it if you could put it back (ideally with updated quantization).
@dzupin there's a new UD-IQ2_XXS which should fit in 128 GB. I'm currently downloading it and will run some benchmarks to compare against ubergarm's smol-IQ2_XS: https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discussions/8
Apparently the IQ2_M also fits in 128 GB, so I'm going to test that one too.
@danielhanchen the chart in this thread is outdated, right? I remember looking a while back and the IQ2_M was too big for 128 GB. Did you guys run another benchmark? I'm interested in the relative error increase for the new IQ2_M and IQ2_XXS.
My experience was that on my 128 GB of unified RAM, only UD-TQ1_0 could run the full 256K context without additional KV-cache quantization or shrinking the context window. And I still had some spare RAM for my other apps to run.
@dzupin are you on Apple silicon? If so, you can increase usable VRAM to 125 GB. I was able to fit ubergarm's smol-IQ2_XS with full 256K context and it worked great.
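For reference, on Apple silicon the GPU wired-memory limit (what llama.cpp can use as "VRAM") can be raised with a sysctl. A sketch assuming macOS Sonoma or later; the 128000 value (MiB, = 125 GiB) is just one choice and the setting resets on reboot:

```shell
# Raise the GPU wired-memory cap to ~125 GiB (value is in MiB).
# Requires sudo; does not persist across reboots.
sudo sysctl iogpu.wired_limit_mb=128000
```

Leave enough headroom for the OS itself, or the machine can lock up under memory pressure, as noted below.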
Note that I don't use this device for anything else, so it sits in a pre-login state idling at around 2-3 GB of RAM.
Thanks for sharing. I've gone up to 120 GB VRAM on my Mac Studio but always get a little nervous, as I've locked it up before. 125 GB is wild!
I am using the UD-IQ2_M one across two devices via RPC (Mac Studio + RTX 5090). Performance is very good -- around 170 t/s prompt processing and 20 t/s inference. Quality is excellent so far.
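For anyone wanting to reproduce a two-device setup like this: llama.cpp ships an `rpc-server` binary (built with `-DGGML_RPC=ON`), and the main host points at it with `--rpc`. A minimal sketch; the hostname, port, model path, and layer/context values are placeholders, not the poster's actual configuration:

```shell
# On the secondary device (e.g. the RTX 5090 box): start the RPC worker.
./rpc-server --host 0.0.0.0 --port 50052

# On the primary device (e.g. the Mac Studio): run inference, splitting
# the model across the local backend and the remote RPC worker.
./llama-cli -m Qwen3.5-397B-A17B-UD-IQ2_M.gguf \
    --rpc 192.168.1.42:50052 -ngl 99 -c 262144
```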
