Qwen3-Coder-Next β€” REAP Pruned (512β†’256 experts) + MXFP4_MOE

A pruned and quantized version of Qwen/Qwen3-Coder-Next (80B total / 3B active parameters), optimized for efficient deployment on NVIDIA hardware.

What was done

1. REAP β€” Router-weighted Expert Activation Pruning (512 β†’ 256 experts)

Each of the 48 MoE layers originally contains 512 routed experts (top-10 routing plus 1 shared expert). Using REAP, I scored expert importance from router weight magnitudes and pruned the 256 least important experts in each layer, removing 12,288 experts in total (a 50% reduction).

  • Method: Combined importance metric (router weight magnitude Γ— activation frequency)
  • Pruning: Per-layer, keeping the 256 most important experts in each layer
  • Shared experts: Preserved (not pruned)
  • DeltaNet/Attention layers: Preserved (not pruned)
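The per-layer selection described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the repository's actual code: `select_experts`, the column-norm proxy for weight magnitude, and the simulated `act_freq` vector are all assumptions.

```python
import numpy as np

def select_experts(router_w: np.ndarray, act_freq: np.ndarray, keep: int = 256) -> np.ndarray:
    """Rank experts by router weight magnitude x activation frequency.

    router_w: (hidden_dim, num_experts) router projection matrix
    act_freq: (num_experts,) observed or simulated routing frequency
    Returns the indices of the `keep` most important experts, sorted ascending.
    """
    magnitude = np.linalg.norm(router_w, axis=0)   # column norm per expert
    importance = magnitude * act_freq              # combined importance metric
    kept = np.argsort(importance)[-keep:]          # top-`keep` experts by importance
    return np.sort(kept)

# Toy dimensions matching the model card: hidden dim 2048, 512 experts per layer
rng = np.random.default_rng(0)
router_w = rng.standard_normal((2048, 512)).astype(np.float32)
act_freq = rng.random(512)
kept = select_experts(router_w, act_freq, keep=256)
```

In the real pipeline this runs once per MoE layer (48 times), and the kept indices drive the GGUF tensor rewriting.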

2. Quantization β€” MXFP4_MOE via llama.cpp

The pruned F16 model was quantized using llama.cpp's built-in MXFP4_MOE quantization type, which is specifically designed for MoE architectures.
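Assuming a llama.cpp build that includes the MXFP4_MOE quantization type, the quantization step is a single `llama-quantize` invocation (file names illustrative):

```shell
# Quantize the REAP-pruned F16 GGUF to MXFP4_MOE
./llama-quantize pruned-f16.gguf pruned-mxfp4-moe.gguf MXFP4_MOE
```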

Model Details

Property                        Value
Base model                      Qwen/Qwen3-Coder-Next (80B total, 3B active)
Architecture                    Hybrid DeltaNet + Attention + MoE, 48 layers
Original experts                512 per layer, top-10 routing + 1 shared
Pruned experts                  256 per layer (50% reduction)
Hidden dimension                2048
Expert intermediate dimension   512
Context length                  262,144 tokens (YaRN scaling supported)

Files

File                    Size    Description
pruned-mxfp4-moe.gguf   22 GB   Recommended. REAP-pruned + MXFP4_MOE quantized
pruned-q4km.gguf        24 GB   REAP-pruned + Q4_K_M quantized

Evaluation

Perplexity (wikitext-2-raw, 128 chunks, ctx=512)

Model                   Size    Perplexity (↓ better)
pruned-mxfp4-moe.gguf   22 GB   14.09 ± 0.24
pruned-q4km.gguf        24 GB   14.23 ± 0.24

MXFP4_MOE achieves slightly lower perplexity than Q4_K_M while being 2 GB smaller, though the two results overlap within their ±0.24 error margins.
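These numbers can be reproduced with llama.cpp's perplexity tool under the same settings (128 chunks, 512-token context); paths are illustrative:

```shell
# Perplexity over the wikitext-2-raw test set, fully offloaded to GPU
./llama-perplexity -m pruned-mxfp4-moe.gguf \
    -f wikitext-2-raw/wiki.test.raw \
    -c 512 --chunks 128 -ngl 99
```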

Tensor-level Quality (NVFP4 vs F16 reference)

Sampled across layers 0, 10, 20, 30, 47 (84 tensors compared):

Component         Avg Cosine Similarity   Avg SNR (dB)   Avg Relative Error
Attention         0.9986                  19.89          0.027
SSM (DeltaNet)    0.9965                  19.81          0.070
Norm layers       1.0000                  ∞              0.000
Router            0.9945                  19.58          0.110
Shared experts    0.9947                  19.74          0.103
Expert FFN        0.9949                  19.93          0.095
Overall           0.9963                  19.80          —
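The three metrics in the table can each be computed per tensor from the quantized weights and their full-precision reference. A minimal sketch, assuming both tensors have been dequantized to NumPy arrays (`tensor_quality` is an illustrative name, not a function from the repository's scripts):

```python
import numpy as np

def tensor_quality(ref: np.ndarray, quant: np.ndarray) -> tuple[float, float, float]:
    """Compare a dequantized tensor against its full-precision reference."""
    ref = ref.astype(np.float64).ravel()
    quant = quant.astype(np.float64).ravel()
    err = ref - quant
    # Cosine similarity between flattened tensors
    cos = float(np.dot(ref, quant) / (np.linalg.norm(ref) * np.linalg.norm(quant)))
    # SNR in dB: signal power over quantization-error power
    snr_db = float(10 * np.log10(np.sum(ref ** 2) / np.sum(err ** 2)))
    # Relative error: error norm over reference norm
    rel_err = float(np.linalg.norm(err) / np.linalg.norm(ref))
    return cos, snr_db, rel_err

# Demo: reference tensor plus 1% noise, standing in for quantization error
rng = np.random.default_rng(1)
ref = rng.standard_normal(4096).astype(np.float32)
quant = ref + 0.01 * rng.standard_normal(4096).astype(np.float32)
cos, snr_db, rel_err = tensor_quality(ref, quant)
```

Note that identical tensors (e.g. unquantized norm layers) give cosine 1.0, infinite SNR, and zero relative error, matching the "Norm layers" row.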

Compression Summary

Original F16:          ~159 GB (80B params)
After REAP (F16):       82 GB (50% expert reduction)
After MXFP4_MOE:        22 GB (~3.7x quantization)
Total compression:      ~7.2x from original
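The ratios follow directly from the file sizes in the summary; a quick arithmetic check (sizes in GB, taken from the summary above):

```python
f16, reap_f16, mxfp4 = 159, 82, 22   # approximate file sizes in GB

print(f"REAP step:  {f16 / reap_f16:.2f}x")   # ~1.9x from removing 50% of experts
print(f"Quant step: {reap_f16 / mxfp4:.2f}x") # ~3.7x from F16 -> MXFP4_MOE
print(f"Total:      {f16 / mxfp4:.2f}x")      # ~7.2x overall
```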

Usage

llama.cpp

# Interactive chat
./llama-cli -m pruned-mxfp4-moe.gguf \
    --jinja -ngl 99 -fa on -sm row \
    --temp 1.0 --top-k 40 --top-p 0.95 \
    -c 40960 -n 32768

# Server mode
./llama-server -m pruned-mxfp4-moe.gguf \
    -ngl 99 -fa on -c 40960

Hardware Requirements

  • GPU: NVIDIA GPU with β‰₯24 GB VRAM (fully offloaded with -ngl 99)
  • Tested on: NVIDIA DGX Spark (GB10, 128 GB unified memory)
  • RAM: ~24 GB for full model loading

REAP Pruning Details

The pruning used a combined importance metric that weighs both router weight magnitude and simulated activation frequency. Key observations:

  • Pruned experts had zero importance in most layers β€” many of the 512 experts are effectively dead/unused
  • Kept experts showed wide importance ranges (0 to ~1300+), confirming heavy-tailed expert utilization
  • Per-layer pruning ensures each layer retains its most important experts independently
  • The shared expert in each layer was always preserved, maintaining the model's baseline capability

Methodology

  1. Download Qwen3-Coder-Next F16 GGUF from HuggingFace
  2. Extract router weights from GGUF tensors for all 48 MoE layers
  3. Compute expert importance using column norms of router weight matrices
  4. Prune bottom 256 experts per layer, rewriting the GGUF
  5. Quantize pruned F16 model using llama-quantize with MXFP4_MOE type
  6. Evaluate perplexity on wikitext-2 benchmark

Scripts

The pipeline scripts used for this conversion are available in the repository:

  • reap_nvfp4_pipeline.py β€” Main orchestration script
  • router_weighted_expert_pruning.py β€” REAP implementation
  • Quantization via llama-quantize from llama.cpp

Limitations

  • Perplexity was measured on wikitext-2, which may not reflect coding task quality. Qwen3-Coder-Next is primarily a code model.
  • The pruning used router weight magnitudes as a proxy for importance rather than calibration data from actual inference. Real-data calibration could yield better pruning decisions.
  • 50% expert pruning is aggressive β€” downstream task quality should be validated on coding benchmarks.

Citation

If you use this model, please cite the original Qwen3-Coder-Next:

@misc{qwen3codernext,
  title={Qwen3-Coder-Next},
  author={Qwen Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Qwen/Qwen3-Coder-Next}
}