Performance of this model is one of the best

#13
by Geximus - opened

Hi @cpatonn ,
I wanted to share my experience after extensive testing of your Qwen3-Next-80B-A3B-Instruct-AWQ-4bit quantization.
Testing methodology
I compared several quantization approaches for Qwen3-Next-80B-A3B:

Original BF16 (baseline)
FP8 quantization
GPTQ-Int4
Your AWQ W4A16 (compressed-tensors)
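The W4A16 scheme in the list above means 4-bit weights with 16-bit activations. At its core this is group-wise symmetric quantization of the weight matrix; here is a minimal NumPy sketch of that idea (the group size of 128 and the symmetric rounding are assumptions for illustration, not the exact AWQ/compressed-tensors recipe, which also uses activation-aware scaling):

```python
import numpy as np

def quantize_w4_groupwise(w, group_size=128):
    """Symmetric 4-bit group-wise weight quantization (the core of W4A16)."""
    orig_shape = w.shape
    w = w.reshape(-1, group_size)
    # One scale per group: map the group's max |w| onto the symmetric INT4 range [-7, 7].
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid division by zero for all-zero groups
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q.reshape(orig_shape), scales

def dequantize(q, scales, group_size=128):
    """Recover approximate BF16/FP32 weights from INT4 codes and group scales."""
    orig_shape = q.shape
    q = q.reshape(-1, group_size).astype(np.float32)
    return (q * scales).reshape(orig_shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, s = quantize_w4_groupwise(w)
w_hat = dequantize(q, s)
```

The per-group scales are what keep the reconstruction error small: each group's worst-case rounding error is half a quantization step, i.e. at most half that group's scale.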

Results
Your quantization consistently produces outputs that are nearly indistinguishable from the original BF16 model in terms of:

Reasoning quality
Instruction following
Code generation accuracy
Tool calling reliability

Why it works so well
Your recipe demonstrates excellent understanding of the Qwen3-Next hybrid architecture:

MoE expert weights (512 experts) β†’ quantized to INT4
Attention layers (linear_attn + self_attn) β†’ kept in BF16
Shared experts β†’ kept in BF16
Gates and routers β†’ kept in BF16
lm_head β†’ kept in BF16

This selective approach preserves the critical precision where it matters most.
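The selective recipe above amounts to routing each module to a precision based on its name. A minimal sketch of that dispatch logic, using hypothetical glob patterns (real compressed-tensors configs express this via an `ignore` list with similar module-name globs; the exact patterns here are assumptions):

```python
import fnmatch

# Hypothetical name patterns mirroring the recipe described above.
QUANTIZE_PATTERNS = ["*.mlp.experts.*"]            # MoE expert weights -> INT4
KEEP_BF16_PATTERNS = [
    "*.self_attn.*", "*.linear_attn.*",            # attention layers stay BF16
    "*.shared_expert.*",                           # shared experts stay BF16
    "*.mlp.gate*", "*router*",                     # gates and routers stay BF16
    "lm_head",                                     # output head stays BF16
]

def precision_for(module_name: str) -> str:
    """Decide the storage precision for a module by matching its name."""
    if any(fnmatch.fnmatch(module_name, p) for p in KEEP_BF16_PATTERNS):
        return "bf16"
    if any(fnmatch.fnmatch(module_name, p) for p in QUANTIZE_PATTERNS):
        return "int4"
    return "bf16"  # default: leave anything unmatched at full precision
```

Because the ignore list is checked first, a router or gate nested inside an expert block can never be accidentally quantized.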
Deployment details

Hardware: 4Γ— RTX 3090 24GB (Tensor Parallel)
Framework: vLLM v0.13.0
Kernel: CompressedTensorsWNA16MarlinMoEMethod (Marlin)
Memory: ~11.3 GiB model weights per GPU
KV Cache: ~200K tokens available
Context: 256K supported
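For reference, a launch command matching the setup above might look like the following. This is a sketch, not the poster's exact command: the repo id is inferred from the discussion, and the flag values (context length, memory utilization) are assumptions to adjust for your environment.

```shell
# Hypothetical vLLM launch for 4x RTX 3090 with tensor parallelism.
vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.95
```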

Conclusion
This is hands down the best quantization available for Qwen3-Next. Thank you for your excellent work and for sharing it with the community!
Best regards

Geximus changed discussion title from Perfomance of this mode is one of the best to Perfomance of this model is one of the best
cyankiwi org

Thank you for using my quant. It has always been a pleasure making quantized models for the community :)

Can you share your vLLM command? Thanks
