Comparing with Official GPTQ-Int4 quantized model?
#6
by haili-tian - opened
This quantization approach is essentially the same as the officially released GPTQ-Int4: only the routed experts are quantized, while the rest of the weights remain in BF16/FP16.
May I ask:
- Where do the original weights come from: Qwen3.5's BF16 model, the GPTQ-Int4 model, or something else (e.g., one of Unsloth's quantized GGUFs)?
- Have any benchmarks been run comparing these quants (MXFP4_MOE_BF16/MXFP4_MOE_FP16) with the Qwen3.5 GPTQ-Int4 model?
I used Unsloth's BF16 model as the source. AFAIK there are no benchmarks against GPTQ; I only run llama.cpp on my machine.