Qwen3-Coder-Next β€” REAP Pruned (512β†’256 experts) + MXFP4_MOE

A pruned and quantized version of Qwen/Qwen3-Coder-Next (80B total / 3B active parameters), optimized for efficient deployment on NVIDIA hardware.

What was done

1. REAP β€” Router-weighted Expert Activation Pruning (512 β†’ 256 experts)

Each of the 48 MoE layers originally contains 512 routed experts (top-10 routing plus 1 shared expert). Using REAP, I scored expert importance from router weight magnitudes and pruned the 256 least important experts in each layer, removing 12,288 experts in total (a 50% reduction).

  • Method: Combined importance metric (router weight magnitude Γ— activation frequency)
  • Pruning: Per-layer, keeping the 256 most important experts in each layer
  • Shared experts: Preserved (not pruned)
  • DeltaNet/Attention layers: Preserved (not pruned)
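The per-layer selection described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the repository's actual code: `select_experts`, the column-norm proxy for weight magnitude, and the simulated `act_freq` vector are all assumptions.

```python
import numpy as np

def select_experts(router_w: np.ndarray, act_freq: np.ndarray, keep: int = 256) -> np.ndarray:
    """Rank experts by router weight magnitude x activation frequency.

    router_w: (hidden_dim, num_experts) router projection matrix
    act_freq: (num_experts,) observed or simulated routing frequency
    Returns the indices of the `keep` most important experts, sorted ascending.
    """
    magnitude = np.linalg.norm(router_w, axis=0)   # column norm per expert
    importance = magnitude * act_freq              # combined importance metric
    kept = np.argsort(importance)[-keep:]          # top-`keep` experts by importance
    return np.sort(kept)

# Toy dimensions matching the model card: hidden dim 2048, 512 experts per layer
rng = np.random.default_rng(0)
router_w = rng.standard_normal((2048, 512)).astype(np.float32)
act_freq = rng.random(512)
kept = select_experts(router_w, act_freq, keep=256)
```

In the real pipeline this runs once per MoE layer (48 times), and the kept indices drive the GGUF tensor rewriting.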

2. Quantization β€” MXFP4_MOE via llama.cpp

The pruned F16 model was quantized using llama.cpp's built-in MXFP4_MOE quantization type, which is specifically designed for MoE architectures.
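Assuming a llama.cpp build that includes the MXFP4_MOE quantization type, the quantization step is a single `llama-quantize` invocation (file names illustrative):

```shell
# Quantize the REAP-pruned F16 GGUF to MXFP4_MOE
./llama-quantize pruned-f16.gguf pruned-mxfp4-moe.gguf MXFP4_MOE
```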

Model Details

Property                        Value
Base model                      Qwen/Qwen3-Coder-Next (80B total, 3B active)
Architecture                    Hybrid DeltaNet + Attention + MoE, 48 layers
Original experts                512 per layer, top-10 routing + 1 shared
Pruned experts                  256 per layer (50% reduction)
Hidden dimension                2048
Expert intermediate dimension   512
Context length                  262,144 tokens (YaRN scaling supported)

Files

File                    Size    Description
pruned-mxfp4-moe.gguf   22 GB   Recommended. REAP-pruned + MXFP4_MOE quantized
pruned-q4km.gguf        24 GB   REAP-pruned + Q4_K_M quantized

Evaluation

Perplexity (wikitext-2-raw, 128 chunks, ctx=512)

Model                   Size    Perplexity (↓ better)
pruned-mxfp4-moe.gguf   22 GB   14.09 ± 0.24
pruned-q4km.gguf        24 GB   14.23 ± 0.24

MXFP4_MOE achieves slightly lower perplexity than Q4_K_M while being 2 GB smaller, though the two results overlap within their ±0.24 error margins.
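These numbers can be reproduced with llama.cpp's perplexity tool under the same settings (128 chunks, 512-token context); paths are illustrative:

```shell
# Perplexity over the wikitext-2-raw test set, fully offloaded to GPU
./llama-perplexity -m pruned-mxfp4-moe.gguf \
    -f wikitext-2-raw/wiki.test.raw \
    -c 512 --chunks 128 -ngl 99
```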

Tensor-level Quality (NVFP4 vs F16 reference)

Sampled across layers 0, 10, 20, 30, 47 (84 tensors compared):

Component         Avg Cosine Similarity   Avg SNR (dB)   Avg Relative Error
Attention         0.9986                  19.89          0.027
SSM (DeltaNet)    0.9965                  19.81          0.070
Norm layers       1.0000                  ∞              0.000
Router            0.9945                  19.58          0.110
Shared experts    0.9947                  19.74          0.103
Expert FFN        0.9949                  19.93          0.095
Overall           0.9963                  19.80          —
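The three metrics in the table can each be computed per tensor from the quantized weights and their full-precision reference. A minimal sketch, assuming both tensors have been dequantized to NumPy arrays (`tensor_quality` is an illustrative name, not a function from the repository's scripts):

```python
import numpy as np

def tensor_quality(ref: np.ndarray, quant: np.ndarray) -> tuple[float, float, float]:
    """Compare a dequantized tensor against its full-precision reference."""
    ref = ref.astype(np.float64).ravel()
    quant = quant.astype(np.float64).ravel()
    err = ref - quant
    # Cosine similarity between flattened tensors
    cos = float(np.dot(ref, quant) / (np.linalg.norm(ref) * np.linalg.norm(quant)))
    # SNR in dB: signal power over quantization-error power
    snr_db = float(10 * np.log10(np.sum(ref ** 2) / np.sum(err ** 2)))
    # Relative error: error norm over reference norm
    rel_err = float(np.linalg.norm(err) / np.linalg.norm(ref))
    return cos, snr_db, rel_err

# Demo: reference tensor plus 1% noise, standing in for quantization error
rng = np.random.default_rng(1)
ref = rng.standard_normal(4096).astype(np.float32)
quant = ref + 0.01 * rng.standard_normal(4096).astype(np.float32)
cos, snr_db, rel_err = tensor_quality(ref, quant)
```

Note that identical tensors (e.g. unquantized norm layers) give cosine 1.0, infinite SNR, and zero relative error, matching the "Norm layers" row.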

Compression Summary

Original F16:          ~159 GB (80B params)
After REAP (F16):       82 GB (50% expert reduction)
After MXFP4_MOE:        22 GB (~3.7x quantization)
Total compression:      ~7.2x from original
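The ratios follow directly from the file sizes in the summary; a quick arithmetic check (sizes in GB, taken from the summary above):

```python
f16, reap_f16, mxfp4 = 159, 82, 22   # approximate file sizes in GB

print(f"REAP step:  {f16 / reap_f16:.2f}x")   # ~1.9x from removing 50% of experts
print(f"Quant step: {reap_f16 / mxfp4:.2f}x") # ~3.7x from F16 -> MXFP4_MOE
print(f"Total:      {f16 / mxfp4:.2f}x")      # ~7.2x overall
```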

Usage

llama.cpp

# Interactive chat
./llama-cli -m pruned-mxfp4-moe.gguf \
    --jinja -ngl 99 -fa on -sm row \
    --temp 1.0 --top-k 40 --top-p 0.95 \
    -c 40960 -n 32768

# Server mode
./llama-server -m pruned-mxfp4-moe.gguf \
    -ngl 99 -fa on -c 40960

Hardware Requirements

  • GPU: NVIDIA GPU with β‰₯24 GB VRAM (fully offloaded with -ngl 99)
  • Tested on: NVIDIA DGX Spark (GB10, 128 GB unified memory)
  • RAM: ~24 GB for full model loading

REAP Pruning Details

The pruning used a combined importance metric that weighs both router weight magnitude and simulated activation frequency. Key observations:

  • Pruned experts had zero importance in most layers β€” many of the 512 experts are effectively dead/unused
  • Kept experts showed wide importance ranges (0 to ~1300+), confirming heavy-tailed expert utilization
  • Per-layer pruning ensures each layer retains its most important experts independently
  • The shared expert in each layer was always preserved, maintaining the model's baseline capability

Methodology

  1. Download Qwen3-Coder-Next F16 GGUF from HuggingFace
  2. Extract router weights from GGUF tensors for all 48 MoE layers
  3. Compute expert importance using column norms of router weight matrices
  4. Prune bottom 256 experts per layer, rewriting the GGUF
  5. Quantize pruned F16 model using llama-quantize with MXFP4_MOE type
  6. Evaluate perplexity on wikitext-2 benchmark

Scripts

The pipeline scripts used for this conversion are available in the repository:

  • reap_nvfp4_pipeline.py β€” Main orchestration script
  • router_weighted_expert_pruning.py β€” REAP implementation
  • Quantization via llama-quantize from llama.cpp

Limitations

  • Perplexity was measured on wikitext-2, which may not reflect coding task quality. Qwen3-Coder-Next is primarily a code model.
  • The pruning used router weight magnitudes as a proxy for importance rather than calibration data from actual inference. Real-data calibration could yield better pruning decisions.
  • 50% expert pruning is aggressive β€” downstream task quality should be validated on coding benchmarks.

Citation

If you use this model, please cite the original Qwen3-Coder-Next:

@misc{qwen3codernext,
  title={Qwen3-Coder-Next},
  author={Qwen Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Qwen/Qwen3-Coder-Next}
}