Quant comparison versus "smol-IQ2_XS"

#5
by coder543 - opened

This has a mainline llama.cpp quant: https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF

The "smol-IQ2_XS" quant is listed at 122GB, which is 8GB smaller than the smallest 2-bit quant here on unsloth/Qwen3.5-397B-A17B-GGUF.

I'm curious if anyone can shed light on what's going on. Is this quant better than one of the 1-bit quants here? How is it so much smaller?

122GB is small enough that I can run it with 32k context on a single DGX Spark, which isn't amazing, but it is something.
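For a rough sense of why 32k context still fits alongside ~122GB of weights, here is a back-of-envelope KV-cache estimate. The config numbers below are purely illustrative placeholders, not the real Qwen3.5 architecture:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elt=2):
    """Back-of-envelope KV cache size: K and V tensors per layer,
    per KV head, per context position, at f16 (2 bytes) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt / 2**30

# illustrative numbers only, not the actual model config:
print(round(kv_cache_gib(n_layers=60, n_kv_heads=8, head_dim=128, ctx=32768), 1))  # 7.5
```

If the model has SSM layers (the quant recipes mention ssm tensors), only part of the layers carry an attention KV cache, so the real footprint would be smaller still; either way the weights dominate the budget.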

Ran a controlled perplexity check with llama-perplexity using the same settings across all three variants: the same text file, full GPU offload (-ngl all), ctx=512, chunks=16, ppl-stride=0.

Scores:

| Quant | Perplexity (lower is better) |
|---|---|
| UD-IQ1_M | 1.1903 ± 0.0117 |
| smol-IQ2_XS | 1.1916 ± 0.0117 |
| UD-TQ1_0 | 1.1989 ± 0.0122 |
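Those error bars overlap heavily; a quick hypothetical significance check (assuming roughly Gaussian, independent errors) shows the top two are well within noise of each other:

```python
import math

def distinguishable(p1, e1, p2, e2, z=1.96):
    """True if the gap between two perplexity estimates exceeds
    z combined standard errors (~95% level). Rough heuristic only."""
    return abs(p1 - p2) > z * math.sqrt(e1 ** 2 + e2 ** 2)

# UD-IQ1_M vs smol-IQ2_XS, using the scores above:
print(distinguishable(1.1903, 0.0117, 1.1916, 0.0117))  # False: within noise
```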

Not definitive, but still interesting. The IQ1_M quant is slightly smaller than smol-IQ2_XS on disk, and the IQ1_M quant generally seems to behave much better with multimodal inputs, apart from some strange warnings in the log for this single-image input message:

image slice encoded in 879 ms
decoding image batch 1/3, n_tokens_batch = 512
find_slot: non-consecutive token position 518 after 517 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 517 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
image decoded (batch 1/3) in 4388 ms
decoding image batch 2/3, n_tokens_batch = 512
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
image decoded (batch 2/3) in 4497 ms
decoding image batch 3/3, n_tokens_batch = 461
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 77 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 128 new tokens
find_slot: non-consecutive token position 518 after 518 for sequence 0 with 77 new tokens
image decoded (batch 3/3) in 4183 ms

Currently downloading IQ1_M to run the same benchmarks I ran here: https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discussions/8

About the strange messages you see: they seem to happen with every model in the Qwen 3.5 family. Even when running the bf16 35B, I see these messages when processing images.

@coder543

I'm curious if anyone can shed light on what's going on.

The TL;DR is that @AesSedai and I are making custom quants optimized for MoE models; e.g., for mainline llama.cpp, check out: https://huggingface.co/AesSedai/Qwen3.5-397B-A17B-GGUF

Is this quant better than one of the 1-bit quants here? How is it so much smaller?

Yes, my smol-IQ2_XS is potentially "better" than the 1-bit quants here, given that I preserve the attn/shexp/ssm tensors better than the typical UD recipes, which more closely match the standard mainline llama.cpp dynamic quantizations. You can look inside the quants and see some smaller ~4ish BPW quantization types used, e.g. ssm_out@q4_K: https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF/blob/main/UD-IQ2_M/Qwen3.5-397B-A17B-UD-IQ2_M-00002-of-00004.gguf
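To make the "recipe" idea concrete, per-tensor type selection can be sketched like this; the tensor-name keys and type assignments below are illustrative placeholders, not the actual smol-IQ2_XS recipe:

```python
def pick_quant_type(tensor_name: str) -> str:
    """Illustrative per-tensor type selection for an MoE model:
    keep attention / shared-expert / SSM tensors at higher bits,
    push the bulk routed-expert FFN weights down to ~2 bits."""
    high_precision = ("attn", "shexp", "ssm")
    if any(key in tensor_name for key in high_precision):
        return "q4_K"      # ~4 BPW for quality-sensitive tensors
    if "exps" in tensor_name:
        return "iq2_xs"    # ~2 BPW for routed expert FFNs (most of the size)
    return "q6_K"          # embeddings, output head, norms, etc.

print(pick_quant_type("blk.0.ssm_out.weight"))        # q4_K
print(pick_quant_type("blk.3.ffn_down_exps.weight"))  # iq2_xs
```

Because the routed experts dominate the parameter count, this is how a quant can be smaller on disk overall while still spending more bits where it matters.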

Of course, everything is a trade-off, so find what works best for your specific hardware backend, inference engine, and workload.

Ran a controlled perplexity check with llama-perplexity using the same settings across all three variants: same text file: full GPU offload (-ngl all), ctx=512, chunks=16, ppl-stride=0.

Thanks for showing some of your workflow. 16 chunks at 512 ctx is not long enough to get a good estimate, imo. I ran the quants in question and updated my model card with the results:

https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/resolve/main/images/perplexity.png

Here is an example of my workflow: https://huggingface.co/ubergarm/MiniMax-M2.5-GGUF/discussions/3#698f7ebf2aa648f3b77a1262

Of course, perplexity isn't the whole story; you can also test KLD and use various corpora, similar to what AesSedai shows.
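For anyone unfamiliar with the KLD test: it compares the quantized model's next-token distribution against a full-precision reference, position by position. A minimal sketch of the per-position statistic (not llama.cpp's exact implementation):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats between two next-token distributions:
    P from the reference (e.g. bf16) model, Q from the quantized one.
    eps guards against log(0) for zero-probability entries."""
    return sum(pi * math.log(max(pi, eps) / max(qi, eps))
               for pi, qi in zip(p, q))

# identical distributions give zero divergence:
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
```

Averaging this over many positions in a corpus gives a sharper picture of quantization damage than perplexity alone, since it catches distribution shifts that leave the top token unchanged.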

Also, you can see some similar results for Qwen3-Coder-Next here, where unsloth did a pretty good job with the smallest quant (note that TQ1_0 does not contain any actual TQ1_0 tensors; they just use that naming-slug convention incorrectly):

https://huggingface.co/ubergarm/Qwen3-Coder-Next-GGUF/resolve/main/images/perplexity.png

Anyway, a 128GB DGX Spark is pretty nice: with the CUDA backend you can run most any quantization type, including the newer SOTA quants available only in ik_llama.cpp.

Have fun with whatever you choose to run!

Cheers!

Unsloth AI org

Ran a controlled perplexity check with llama-perplexity using the same settings across all three variants: the same text file, full GPU offload (-ngl all), ctx=512, chunks=16, ppl-stride=0.

Scores:

| Quant | Perplexity (lower is better) |
|---|---|
| UD-IQ1_M | 1.1903 ± 0.0117 |
| smol-IQ2_XS | 1.1916 ± 0.0117 |
| UD-TQ1_0 | 1.1989 ± 0.0122 |


Thank you for testing!
