Why 35B-INT4 smaller than 27B-INT4

#10
by andynoodles - opened

BF16 size
Qwen/Qwen3.5-27B - 55.6GB
Qwen/Qwen3.5-35B-A3B - 71.9GB

INT4 size
Qwen/Qwen3.5-27B-GPTQ-Int4 - 30.3GB
Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 - 24.5GB
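A quick back-of-the-envelope check on the sizes quoted above makes the oddity concrete (the numbers are just the GB figures from this thread):

```python
# Compression ratios from the sizes quoted above (GB).
sizes_bf16 = {"Qwen3.5-27B": 55.6, "Qwen3.5-35B-A3B": 71.9}
sizes_int4 = {"Qwen3.5-27B": 30.3, "Qwen3.5-35B-A3B": 24.5}

for name in sizes_bf16:
    ratio = sizes_int4[name] / sizes_bf16[name]
    print(f"{name}: {ratio:.2f}x of BF16 size")

# A clean BF16 -> INT4 conversion should land near 0.25x plus some
# overhead for quantization scales/zeros. The 35B-A3B ratio (~0.34x)
# is in that ballpark; the 27B ratio (~0.54x) suggests a large chunk
# of its weights were not quantized at all.
```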

anyone know what's happening?

andynoodles changed discussion title from Why 35B-INT4 smaller then 27B-INT4 to Why 35B-INT4 smaller than 27B-INT4

MoE vs dense

The catch is:

Before quantization:
Qwen3.5-35B-A3B > Qwen3.5-27B

After FP8 quantization:
Qwen3.5-35B-A3B > Qwen3.5-27B

After INT4 quantization:
Qwen3.5-35B-A3B < Qwen3.5-27B

My question is: why does INT4 quantization shrink the 35B-A3B model to the point that it is smaller than the 27B?

Answer:
The attention layers aren't quantized; they kept them in BF16 for some reason. For example:
model.language_model.layers.0.linear_attn.in_proj_qkv.weight
model.language_model.layers.3.self_attn.q_proj.weight
are shown in the model.safetensors.index.json file for the 27B INT4 model.
For GPTQ quantization, the tensor name suffix should be .qweight, not just .weight.
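You can verify this yourself by tallying the tensor-name suffixes in a locally downloaded copy of the index file. A minimal sketch (the function name and local path are my own, not from the repo):

```python
import json
from collections import Counter

def tensor_suffix_counts(index_path):
    """Tally the last name component of every tensor listed in a
    safetensors index file, e.g. 'weight' vs GPTQ's 'qweight'."""
    with open(index_path) as f:
        index = json.load(f)
    # weight_map maps each tensor name to the shard file that holds it.
    return Counter(name.rsplit(".", 1)[-1] for name in index["weight_map"])
```

Running this on the 27B INT4 index should show plenty of plain `weight` entries on the attention projections (left in BF16) alongside the `qweight`/`qzeros`/`scales` triples that GPTQ emits for quantized linears.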

Ref:
https://huggingface.co/Qwen/Qwen3.5-27B-GPTQ-Int4/blob/main/model.safetensors.index.json

That sounds like a bug?
