Why is 35B-INT4 smaller than 27B-INT4?
BF16 size
Qwen/Qwen3.5-27B - 55.6GB
Qwen/Qwen3.5-35B-A3B - 71.9GB
INT4 size
Qwen/Qwen3.5-27B-GPTQ-Int4 - 30.3GB
Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 - 24.5GB
anyone know what's happening?
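A quick back-of-envelope check on the numbers above (a sketch, using the sizes reported in this post) shows the two models compress very differently. A genuine BF16-to-INT4 conversion should shrink a checkpoint by roughly 3-3.5x (16 bits down to ~4.5 bits including scales/zeros), so the 27B's ratio is suspiciously low:

```python
# Reported checkpoint sizes in GB, taken from the post above.
sizes = {
    "Qwen3.5-27B":     {"bf16": 55.6, "int4": 30.3},
    "Qwen3.5-35B-A3B": {"bf16": 71.9, "int4": 24.5},
}

for name, s in sizes.items():
    ratio = s["bf16"] / s["int4"]
    print(f"{name}: {ratio:.2f}x compression")
# Qwen3.5-27B: 1.83x compression
# Qwen3.5-35B-A3B: 2.93x compression
```

1.83x for the dense 27B means a large fraction of its weights must still be stored at full BF16 width, which matches the answer below.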
MoE vs dense
The catch is:
Before quantization:
Qwen3.5-35B-A3B > Qwen3.5-27B
After FP8 quantization:
Qwen3.5-35B-A3B > Qwen3.5-27B
After INT4 quantization:
Qwen3.5-35B-A3B < Qwen3.5-27B
My question is: why does INT4 quantization shrink the 35B-A3B model to the point where it's smaller than the 27B?
Answer:
The attention layers in the 27B model aren't quantized; they were kept in BF16 for some reason. You can see tensors like
model.language_model.layers.0.linear_attn.in_proj_qkv.weight
model.language_model.layers.3.self_attn.q_proj.weight
in the model.safetensors.index.json file for the 27B INT4 model. For GPTQ quantization, the tensor name should end in .qweight, not plain .weight.
Ref:
https://huggingface.co/Qwen/Qwen3.5-27B-GPTQ-Int4/blob/main/model.safetensors.index.json
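You can verify this yourself by tallying tensor-name suffixes from the index's weight_map. Here is a sketch over a small illustrative subset (tensor names copied from the answer above; in a real check you'd load the full model.safetensors.index.json from the repo). GPTQ-packed layers appear as .qweight plus .qzeros/.scales, while anything still ending in plain .weight was left in BF16:

```python
from collections import Counter

# Illustrative subset of a GPTQ-Int4 model.safetensors.index.json
# "weight_map" (tensor name -> shard file). GPTQ-packed weights end in
# .qweight (with companion .qzeros/.scales); tensors left unquantized
# in BF16 still end in plain .weight.
weight_map = {
    "model.language_model.layers.0.linear_attn.in_proj_qkv.weight": "model-00001.safetensors",
    "model.language_model.layers.3.self_attn.q_proj.weight": "model-00001.safetensors",
    "model.language_model.layers.0.mlp.gate_proj.qweight": "model-00001.safetensors",
    "model.language_model.layers.0.mlp.gate_proj.qzeros": "model-00001.safetensors",
    "model.language_model.layers.0.mlp.gate_proj.scales": "model-00001.safetensors",
}

# Count how many tensors end in each suffix.
suffix_counts = Counter(name.rsplit(".", 1)[-1] for name in weight_map)
print(suffix_counts)
```

Run against the full index file, every attention projection showing up under plain .weight (rather than .qweight) would confirm those layers were never quantized.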
That sounds like a bug?