Comparing with Official GPTQ-Int4 quantized model?

#6
by haili-tian - opened

This quantization approach is essentially consistent with the officially released GPTQ-Int4: only the routed experts are quantized, while the rest remain in BF16/FP16.

May I ask:

  1. Where do the original model weights come from: Qwen3.5's BF16 model, GPTQ-Int4, or something else (e.g., one of Unsloth's quantized GGUFs)?
  2. Have benchmarks been conducted comparing them (MXFP4_MOE_BF16/MXFP4_MOE_FP16) with the Qwen3.5 GPTQ-Int4 model?

I used Unsloth's BF16 model for it. AFAIK there are no benchmarks against GPTQ; I only run llama.cpp on my machine.
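For anyone wanting a rough comparison themselves, a common approach is to run llama.cpp's `llama-perplexity` tool on the same evaluation text for each quant and compare scores. This is a sketch, not a full benchmark; the model filenames below are hypothetical placeholders, and a GPTQ-Int4 checkpoint would first need converting to GGUF before it could be measured this way.

```shell
# Hypothetical filenames -- substitute your actual GGUF files.
# Run perplexity on the same eval text for each quant; lower is better.
./llama-perplexity -m Qwen3.5-BF16.gguf      -f wiki.test.raw
./llama-perplexity -m Qwen3.5-MXFP4_MOE.gguf -f wiki.test.raw
```

Comparing against the BF16 source quantifies the loss introduced by the MXFP4_MOE quantization itself; a GPTQ-Int4 comparison would additionally depend on how that checkpoint was converted.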
