Suggestion for Qwen 3.5 (SSM-Hybrid) quantization recipe
Hi @mradermacher , thank you for the massive amount of GGUF quants you provide to the community!
I’ve been analyzing the different variants for Qwen 3.5 and noticed that the current recipes treat this model like a standard Transformer, quantizing the SSM layers (ssm_alpha, ssm_beta, ssm_out) down to low-bit formats.
Since Qwen 3.5 is a hybrid architecture, it is uniquely sensitive to the precision of these SSM components. According to Unsloth's benchmarks ( https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks ), compressing these specific layers to 4-bit or lower significantly degrades the model's long-context logic and reasoning, whereas keeping them in F16 or Q8_0 preserves over 99% of the original performance.
Given how influential your quants are, would you consider creating a "specialized" recipe for Qwen 3.5 models that protects the SSM "brains" (e.g., keeping ssm_alpha/beta in F16 and ssm_out in Q8_0), perhaps even by offsetting the size increase with slightly more compression on the less critical FFN layers or embeddings?
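For concreteness, here is one way the recipe could be expressed as a pattern-to-type map. This is only a hypothetical sketch in Python: the glob patterns, the FFN fallback type (Q3_K), and the default (Q4_K_M) are my assumptions for illustration, not a description of your actual quantization pipeline.

```python
from fnmatch import fnmatch

# Hypothetical recipe: protect the SSM tensors named above, compress the
# rest more aggressively to offset the size increase. First match wins.
RECIPE = [
    ("*.ssm_alpha*", "F16"),    # SSM state dynamics: keep full half precision
    ("*.ssm_beta*",  "F16"),
    ("*.ssm_out*",   "Q8_0"),   # SSM output projection: 8-bit is enough
    ("*.ffn_*",      "Q3_K"),   # illustrative: squeeze less critical FFN layers
    ("*",            "Q4_K_M"), # everything else: a default low-bit quant
]

def quant_type(tensor_name: str) -> str:
    """Return the quant type of the first pattern matching a tensor name."""
    for pattern, qtype in RECIPE:
        if fnmatch(tensor_name, pattern):
            return qtype
    return "Q4_K_M"

print(quant_type("blk.12.ssm_alpha.weight"))  # F16
print(quant_type("blk.12.ffn_up.weight"))     # Q3_K
```

If I'm remembering correctly, recent llama.cpp builds expose per-tensor overrides on `llama-quantize` (a `--tensor-type` style option), so a map like this could translate directly into command-line flags; apologies if I have the exact interface wrong.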