Missing about 50~55GB of Q3?

#7
by lingyezhixing - opened

Now there are ~35GB, ~45GB, ~65GB, but not 55GB



Ahh yes, I believe I understand. You would like a quantized model that fits in roughly 55GiB of RAM+VRAM? I assume you are using ik_llama.cpp, yes? Also check out the new -fdn 512 feature, which speeds up qwen35moe and qwen3next by fusing the gated delta net for faster token generation.
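A hedged sketch of how the flag might appear on an ik_llama.cpp server command line. Only `-fdn 512` itself comes from the comment above; the binary name, model path, and the other flags are placeholders to adapt to your own setup.

```shell
# Hypothetical ik_llama.cpp invocation -- everything except -fdn 512 is a
# placeholder; check ./llama-server --help on your build for exact flags.
./llama-server \
    -m /models/Qwen3.5-397B-A17B-IQ3_KS.gguf \
    -c 32768 \
    -ngl 99 \
    -fdn 512    # fuse the gated delta net for faster token generation
```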

I'll consider adding another quant that will work well in 64GB total RAM+VRAM then. It has been a busy stretch with so many new models, and I have limited public quota here. Do you know where else people find my models?

Thanks!

This comment translated by ubergarm/Qwen3.5-397B-A17B Q3_K

It's not necessary at all; I'm just tentatively offering a suggestion rather than demanding it. Please don't feel any responsibility or pressure.

Regarding where to find your models: I mostly use Hugging Face, and sometimes I also use ModelScope (I hope I haven't misunderstood the question).

I don't know that I'd use it over the Unsloth UD-Q3_K_XL, but for what it's worth, I do fall in the 55GB range and would have downloaded the model for testing purposes.

I made my own quant from BF16; I like it so far.
It's 53GB in size and has a sub-3% perplexity increase.
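For reference, "sub 3% perplexity" presumably means the quant's perplexity stays within 3% of the BF16 baseline. A minimal sketch of that computation, with made-up perplexity values (the numbers below are purely illustrative, not measurements from this model):

```python
# Relative perplexity increase of a quant vs its BF16 baseline.
# Both values are hypothetical, chosen only to illustrate the formula.
ppl_bf16 = 4.50    # hypothetical baseline perplexity
ppl_quant = 4.62   # hypothetical quantized-model perplexity
delta_pct = 100 * (ppl_quant - ppl_bf16) / ppl_bf16
print(f"relative perplexity increase: {delta_pct:.2f}%")
```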

```
ffn_gate_exps.weight: IQ3_KS: 48 tensors (38.65B elements)
ffn_up_exps.weight: IQ3_KS: 48 tensors (38.65B elements)
ffn_down_exps.weight: IQ3_KS: 48 tensors (38.65B elements)
attn_qkv.weight: Q8_0: 36 tensors (1.36B elements)
attn_gate.weight: Q8_0: 36 tensors (905.97M elements)
ssm_out.weight: Q8_0: 36 tensors (905.97M elements)
output.weight: Q8_0: 1 tensors (762.84M elements)
token_embd.weight: Q8_0: 1 tensors (762.84M elements)
attn_q.weight: Q8_0: 12 tensors (603.98M elements)
attn_output.weight: Q8_0: 12 tensors (301.99M elements)
ffn_down_shexp.weight: Q8_0: 48 tensors (150.99M elements)
ffn_gate_shexp.weight: Q8_0: 48 tensors (150.99M elements)
ffn_up_shexp.weight: Q8_0: 48 tensors (150.99M elements)
ffn_gate_inp.weight: F32: 48 tensors (37.75M elements)
attn_k.weight: Q8_0: 12 tensors (18.87M elements)
attn_v.weight: Q8_0: 12 tensors (18.87M elements)
ssm_alpha.weight: BF16: 36 tensors (7.08M elements)
ssm_beta.weight: BF16: 36 tensors (7.08M elements)
// Rest are small vectors in F32
```
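A hedged sketch of how a recipe like this could be reproduced with ik_llama.cpp's `llama-quantize` using per-tensor overrides. The `--custom-q` flag and its comma-separated `regex=type` syntax are an assumption from memory, not taken from this thread; verify against `./llama-quantize --help` on your build before relying on it.

```shell
# Hypothetical per-tensor quant override sketch. Expert FFN tensors go to
# iq3_ks, ssm_alpha/ssm_beta stay at bf16, and q8_0 (the final positional
# argument) is assumed to be the fallback type for everything else.
./llama-quantize \
    --custom-q "ffn_(gate|up|down)_exps=iq3_ks,ssm_(alpha|beta)=bf16" \
    Qwen3.5-397B-A17B-BF16.gguf \
    Qwen3.5-397B-A17B-smol-IQ3_KS.gguf \
    q8_0
```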

@igor255

Nice! That looks like a solid recipe; it's exactly what I'd make for a smol-IQ3_KS!

Yes, the trend seems to be to keep those ssm tensors at full-quality bf16 (or f32, if your specific backend is actually faster with that), but don't downcast them to f16 as it could cause clipping. I'm not sure where the original research is that suggests even q8_0 affects long-context performance, and I'm not sure I buy that, but it isn't a lot of size difference so bf16 seems fine.
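To put "it isn't a lot of size difference" in numbers, here's a back-of-the-envelope comparison using the element counts from the recipe above. The q8_0 layout (32 int8 weights plus one fp16 scale, 34 bytes per 32-element block) is the standard GGUF format; the rest is plain arithmetic.

```python
# Size of ssm_alpha + ssm_beta (7.08M elements each, per the recipe above)
# stored at bf16 vs q8_0. q8_0 averages 34/32 = 1.0625 bytes per element.
elements = 2 * 7.08e6                 # ssm_alpha + ssm_beta combined
bf16_mb = elements * 2 / 1e6          # bf16: 2 bytes per element
q8_0_mb = elements * (34 / 32) / 1e6  # q8_0: 1.0625 bytes per element
print(f"bf16 {bf16_mb:.1f} MB vs q8_0 {q8_0_mb:.1f} MB "
      f"(delta {bf16_mb - q8_0_mb:.1f} MB)")
```

About 13 MB of difference on a ~53GB file, i.e. negligible.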

Feel free to upload it to HF. If you need any guidance, I just shared some tips with @tarruda here: https://www.reddit.com/r/LocalLLaMA/comments/1rotwhr/comment/o9uxvlc/ as he just uploaded a new option for the big size here: https://huggingface.co/tarruda/Qwen3.5-397B-A17B-heretic-smol-IQ2_XS-GGUF

Cheers and happy weekend y'all!
