# Qwen3-Coder-Next: REAP Pruned (512 → 256 experts) + MXFP4_MOE
A pruned and quantized version of Qwen/Qwen3-Coder-Next (80B total / 3B active parameters), optimized for efficient deployment on NVIDIA hardware.
## What was done
### 1. REAP: Router-weighted Expert Activation Pruning (512 → 256 experts)
Each of the 48 MoE layers originally has 512 routed experts (top-10 routing + 1 shared expert). Using REAP (Router-weighted Expert Activation Pruning), I analyzed expert importance via router weight magnitudes and pruned the 256 least important experts per layer, removing 12,288 experts total (50% reduction).
- Method: Combined importance metric (router weight magnitude × activation frequency)
- Pruning: Per-layer, keeping the 256 most important experts in each layer
- Shared experts: Preserved (not pruned)
- DeltaNet/Attention layers: Preserved (not pruned)
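A minimal sketch of this importance computation and per-layer top-k selection follows. The function names, toy shapes, and the exact magnitude-times-frequency combination are illustrative assumptions, not the released pipeline code:

```python
import numpy as np

def expert_importance(router_w, act_freq):
    """REAP-style combined importance per expert:
    router weight magnitude (column norm) times activation frequency.
    `router_w` is (hidden_dim, num_experts); `act_freq` is (num_experts,)."""
    magnitude = np.linalg.norm(router_w, axis=0)  # per-expert column norm
    return magnitude * act_freq

def keep_top_experts(router_w, act_freq, keep):
    """Indices of the `keep` most important experts in one layer."""
    scores = expert_importance(router_w, act_freq)
    return np.sort(np.argsort(scores)[-keep:])

# Toy layer: 8 experts, keep the top 4 (the real model keeps 256 of 512
# in each of the 48 MoE layers, independently per layer).
rng = np.random.default_rng(0)
router_w = rng.normal(size=(16, 8))
act_freq = rng.uniform(size=8)
kept = keep_top_experts(router_w, act_freq, keep=4)
print(len(kept))  # 4
```

Because selection is per layer, each layer keeps its own most important experts rather than sharing one global ranking.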
### 2. Quantization: MXFP4_MOE via llama.cpp
The pruned F16 model was quantized using llama.cpp's built-in MXFP4_MOE quantization type, which is specifically designed for MoE architectures.
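Assuming a recent llama.cpp build that includes the `MXFP4_MOE` quantization type, this step looks roughly like the following (file names are placeholders):

```shell
# Quantize the REAP-pruned F16 GGUF to MXFP4_MOE
./llama-quantize pruned-f16.gguf pruned-mxfp4-moe.gguf MXFP4_MOE
```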
## Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-Coder-Next (80B total, 3B active) |
| Architecture | Hybrid DeltaNet + Attention + MoE, 48 layers |
| Original experts | 512 per layer, top-10 routing + 1 shared |
| Pruned experts | 256 per layer (50% reduction) |
| Hidden dimension | 2048 |
| Expert intermediate dimension | 512 |
| Context length | 262,144 tokens (YaRN scaling supported) |
## Files
| File | Size | Description |
|---|---|---|
| `pruned-mxfp4-moe.gguf` | 22 GB | Recommended. REAP-pruned + MXFP4_MOE quantized |
| `pruned-q4km.gguf` | 24 GB | REAP-pruned + Q4_K_M quantized |
## Evaluation
### Perplexity (wikitext-2-raw, 128 chunks, ctx=512)
| Model | Size | Perplexity (↓ better) |
|---|---|---|
| pruned-mxfp4-moe.gguf | 22 GB | 14.09 Β± 0.24 |
| pruned-q4km.gguf | 24 GB | 14.23 Β± 0.24 |
MXFP4_MOE achieves slightly better perplexity than Q4_K_M while being 2 GB smaller.
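As a reminder of what these numbers mean, perplexity is the exponential of the mean per-token negative log-likelihood over the evaluation text. A minimal sketch with hypothetical per-chunk NLL values (not measured from this model):

```python
import math

# Hypothetical per-chunk mean negative log-likelihoods (nats/token).
chunk_nlls = [2.61, 2.70, 2.64]

# Perplexity = exp(mean NLL); lower NLL means lower perplexity.
ppl = math.exp(sum(chunk_nlls) / len(chunk_nlls))
print(round(ppl, 2))  # 14.15, the same scale as the table above
```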
### Tensor-level Quality (NVFP4 vs F16 reference)
Sampled across layers 0, 10, 20, 30, 47 (84 tensors compared):
| Component | Avg Cosine Similarity | Avg SNR (dB) | Avg Relative Error |
|---|---|---|---|
| Attention | 0.9986 | 19.89 | 0.027 |
| SSM (DeltaNet) | 0.9965 | 19.81 | 0.070 |
| Norm layers | 1.0000 | ∞ | 0.000 |
| Router | 0.9945 | 19.58 | 0.110 |
| Shared experts | 0.9947 | 19.74 | 0.103 |
| Expert FFN | 0.9949 | 19.93 | 0.095 |
| Overall | 0.9963 | 19.80 | n/a |
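These per-tensor metrics can be computed with standard definitions; the sketch below assumes the table used the usual choices (cosine over flattened tensors, SNR as signal-to-quantization-noise power in dB, relative L2 error), which is an assumption rather than the exact evaluation script:

```python
import numpy as np

def tensor_quality(ref, quant):
    """Compare a quantized tensor against its F16 reference.
    Returns (cosine similarity, SNR in dB, relative L2 error)."""
    r = ref.ravel().astype(np.float64)
    q = quant.ravel().astype(np.float64)
    noise = r - q
    cos = float(r @ q / (np.linalg.norm(r) * np.linalg.norm(q)))
    snr_db = float(10 * np.log10((r @ r) / (noise @ noise)))
    rel_err = float(np.linalg.norm(noise) / np.linalg.norm(r))
    return cos, snr_db, rel_err

# Toy check: ~1% relative noise gives cosine near 1 and SNR near 40 dB.
rng = np.random.default_rng(0)
ref = rng.normal(size=1000)
quant = ref + 0.01 * rng.normal(size=1000)
cos, snr_db, rel_err = tensor_quality(ref, quant)
```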
## Compression Summary

- Original F16: ~159 GB (80B params)
- After REAP (F16): 82 GB (50% expert reduction)
- After MXFP4_MOE: 22 GB (~3.7× quantization)
- Total compression: ~7.2× from original
## Usage
### llama.cpp

```bash
# Interactive chat
./llama-cli -m pruned-mxfp4-moe.gguf \
    --jinja -ngl 99 -fa on -sm row \
    --temp 1.0 --top-k 40 --top-p 0.95 \
    -c 40960 -n 32768

# Server mode
./llama-server -m pruned-mxfp4-moe.gguf \
    -ngl 99 -fa on -c 40960
```
## Hardware Requirements

- GPU: NVIDIA GPU with ≥24 GB VRAM (fully offloaded with `-ngl 99`)
- Tested on: NVIDIA DGX Spark (GB10, 128 GB unified memory)
- RAM: ~24 GB for full model loading
## REAP Pruning Details
The pruning used a combined importance metric that weighs both router weight magnitude and simulated activation frequency. Key observations:
- Pruned experts had zero importance in most layers: many of the 512 experts are effectively dead/unused
- Kept experts showed wide importance ranges (0 to ~1300+), confirming heavy-tailed expert utilization
- Per-layer pruning ensures each layer retains its most important experts independently
- The shared expert in each layer was always preserved, maintaining the model's baseline capability
## Methodology
- Download Qwen3-Coder-Next F16 GGUF from HuggingFace
- Extract router weights from GGUF tensors for all 48 MoE layers
- Compute expert importance using column norms of router weight matrices
- Prune bottom 256 experts per layer, rewriting the GGUF
- Quantize pruned F16 model using `llama-quantize` with the `MXFP4_MOE` type
- Evaluate perplexity on wikitext-2 benchmark
## Scripts
The pipeline scripts used for this conversion are available in the repository:
- `reap_nvfp4_pipeline.py`: main orchestration script
- `router_weighted_expert_pruning.py`: REAP implementation
- Quantization via `llama-quantize` from llama.cpp
## Limitations
- Perplexity was measured on wikitext-2, which may not reflect coding task quality. Qwen3-Coder-Next is primarily a code model.
- The pruning used router weight magnitudes as a proxy for importance rather than calibration data from actual inference. Real-data calibration could yield better pruning decisions.
- 50% expert pruning is aggressive; downstream task quality should be validated on coding benchmarks.
## Citation
If you use this model, please cite the original Qwen3-Coder-Next:
```bibtex
@misc{qwen3codernext,
  title={Qwen3-Coder-Next},
  author={Qwen Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Qwen/Qwen3-Coder-Next}
}
```