Performance of this model is one of the best
Hi @cpatonn ,
I wanted to share my experience after extensive testing of your Qwen3-Next-80B-A3B-Instruct-AWQ-4bit quantization.
## Testing methodology

I compared several quantization approaches for Qwen3-Next-80B-A3B:

- Original BF16 (baseline)
- FP8 quantization
- GPTQ-Int4
- Your AWQ W4A16 (compressed-tensors)
## Results

Your quantization consistently produces outputs that are nearly indistinguishable from the original BF16 model in terms of:

- Reasoning quality
- Instruction following
- Code generation accuracy
- Tool calling reliability
## Why it works so well

Your recipe demonstrates excellent understanding of the Qwen3-Next hybrid architecture:

- MoE expert weights (512 experts) → quantized to INT4
- Attention layers (linear_attn + self_attn) → kept in BF16
- Shared experts → kept in BF16
- Gates and routers → kept in BF16
- lm_head → kept in BF16
This selective approach preserves the critical precision where it matters most.
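To make the selective scheme concrete, here is a minimal sketch of the precision-routing rule it implies. The module-name patterns are my assumptions based on typical Qwen3-Next layer naming, not the author's actual recipe:

```python
import re

# Hypothetical name patterns for weights kept in BF16 (assumed naming,
# not taken from the actual quantization recipe).
KEEP_BF16_PATTERNS = [
    r"\.linear_attn\.",   # linear attention layers
    r"\.self_attn\.",     # full self-attention layers
    r"\.shared_expert",   # shared experts
    r"\.gate$",           # router gates
    r"lm_head",           # output head
]

def target_precision(module_name: str) -> str:
    """Return the precision a weight would get under this selective recipe."""
    for pattern in KEEP_BF16_PATTERNS:
        if re.search(pattern, module_name):
            return "BF16"
    # Everything else -- the 512 routed expert projections -- goes to INT4.
    return "INT4"

print(target_precision("model.layers.0.mlp.experts.17.down_proj"))  # INT4
print(target_precision("model.layers.0.self_attn.q_proj"))          # BF16
print(target_precision("lm_head"))                                  # BF16
```

The point of the rule is that the numerically sensitive, comparatively small components (attention, routers, shared experts, lm_head) stay full precision, while the bulk of the parameters (routed experts) absorb the 4-bit compression.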
## Deployment details

- Hardware: 4× RTX 3090 24GB (Tensor Parallel)
- Framework: vLLM v0.13.0
- Kernel: CompressedTensorsWNA16MarlinMoEMethod (Marlin)
- Memory: ~11.3 GiB model weights per GPU
- KV Cache: ~200K tokens available
- Context: 256K supported
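For anyone wanting to reproduce this setup, a launch command consistent with the details above would look roughly like the following. The repo id and flag values are my assumptions matching the stated 4-GPU / 256K-context configuration, not the exact command used:

```shell
# Hypothetical vLLM launch (requires 4 GPUs); exact flags may differ
# from the original deployment.
vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
    --tensor-parallel-size 4 \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.95
```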
## Conclusion
This is hands down the best quantization available for Qwen3-Next. Thank you for your excellent work and for sharing it with the community!
Best regards
Thank you for using my quant. It has always been a pleasure making quantized models for the community :)
Can you share your vLLM command? Thanks!