Note: Download these only to experiment. I personally believe deep quantizations of the unpruned (non-REAP) model are better and more productive; REAP seems to break something in the routing that I don't fundamentally understand.
# Qwen3.5-397B-A17B REAP35 – Gutenberg Quants
REAP35 expert-pruned quantizations of Qwen3.5-397B-A17B (333 of 512 experts retained per layer), built with the Gutenberg (Q_K_G) quantization strategy.
## Available Quants
| Quant | Size | BPW | Mean KLD | Same Top Token | Description |
|---|---|---|---|---|---|
| Q8_0 | 258 GiB | 8.51 | – | – | Source model (KLD reference), maximum quality |
| Q4_K_G | 148 GiB | 4.86 | 0.00751 | 94.26% | Approaches Q5_K_M quality at Q4_K_M size |
| Q3_K_G | 116 GiB | 3.83 | 0.00932 | 94.68% | Beats Q4_K_M quality at 22% less size |
| IQ2_XS_G | 87 GiB | 2.86 | 0.02150 | 92.55% | Beats Q3_K_M quality at 25% less size |
## Comparison to Standard Quants
| Quant | Size | BPW | Mean KLD | Same Top Token |
|---|---|---|---|---|
| Q5_K_M | 173 GiB | 5.69 | 0.00642 | 95.18% |
| Q4_K_G | 148 GiB | 4.86 | 0.00751 | 94.26% |
| Q4_K_M | 148 GiB | 4.86 | 0.01242 | 93.67% |
| Q3_K_G | 116 GiB | 3.83 | 0.00932 | 94.68% |
| Q3_K_M | 116 GiB | 3.83 | 0.03797 | 89.36% |
| IQ2_XS_G | 87 GiB | 2.86 | 0.02150 | 92.55% |
| Q2_K | 89 GiB | 2.93 | 0.10118 | 82.63% |
- Q3_K_G at 116 GiB beats Q4_K_M at 148 GiB – better quality at 22% less size
- Q4_K_G has 1.7× lower mean KLD than Q4_K_M at the same size
- IQ2_XS_G has 4.7× lower mean KLD than Q2_K at a slightly smaller size
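The quality metrics in the tables above are standard: mean KLD is the average KL divergence of the quantized model's next-token distribution from the Q8_0 reference, and "Same Top Token" is the fraction of positions where the argmax agrees. A minimal sketch of the per-token computation (function names are illustrative, not from any particular tool):

```python
import numpy as np

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def token_kld(ref_logits, quant_logits):
    """KL divergence D(P_ref || P_quant) for one token position."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Toy 3-token vocabulary: reference vs. slightly perturbed quantized logits.
ref   = np.array([2.0, 1.0, 0.5])
quant = np.array([1.9, 1.1, 0.4])

kld = token_kld(ref, quant)                        # small positive value
same_top = int(np.argmax(ref)) == int(np.argmax(quant))
```

Mean KLD over a test corpus is simply `token_kld` averaged across all token positions; KLD is zero only when the two distributions match exactly, so lower is strictly better.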
## Gutenberg Quantization
Gutenberg (Q_K_G) is a data-driven quantization method. A KLD sensitivity scan measures each expert tensor's impact on output quality, and tensors are ranked by importance. The most impactful tensors receive higher precision while the rest are quantized at the base level. Non-expert tensors are kept at Q8_0 for their disproportionate quality impact.
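The ranking step above can be sketched as follows. This is a toy illustration, not the actual tooling: the tensor names, sensitivity scores, quant-type labels, and the boosted fraction are all hypothetical, standing in for a real per-tensor KLD sensitivity scan.

```python
# Sketch of Gutenberg-style mixed-precision assignment (hypothetical data).
# A real scan would measure the mean-KLD increase caused by quantizing each
# expert tensor at the base level, then rank tensors by that impact.

def assign_precisions(kld_impact, base_quant, boosted_quant, boost_fraction=0.2):
    """Rank tensors by measured KLD impact; give the top `boost_fraction`
    of them the higher-precision quant type, the rest the base type."""
    ranked = sorted(kld_impact, key=kld_impact.get, reverse=True)
    n_boost = int(len(ranked) * boost_fraction)
    return {name: (boosted_quant if i < n_boost else base_quant)
            for i, name in enumerate(ranked)}

impact = {  # hypothetical per-tensor mean-KLD increase at the base precision
    "blk.0.ffn_down_exps": 0.0041,
    "blk.0.ffn_up_exps":   0.0007,
    "blk.1.ffn_down_exps": 0.0029,
    "blk.1.ffn_up_exps":   0.0004,
    "blk.2.ffn_down_exps": 0.0012,
}
plan = assign_precisions(impact, base_quant="Q3_K", boosted_quant="Q5_K")
# Most impactful tensor gets Q5_K; everything else stays at the Q3_K base.
```

Non-expert tensors (attention, embeddings, norms) would bypass this ranking entirely and stay at Q8_0, as noted above.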
## REAP Expert Pruning
REAP scores each expert using imatrix calibration data and uniformly removes the lowest-scoring experts from every MoE layer.
Each expert receives a score based on two signals captured during calibration inference:
- Activation count – how many times the expert was selected by the router
- Activation magnitude – the sum of squared input activations when the expert was active

The final score is: `normalized_count × normalized_magnitude`
- Base model: Qwen3.5-397B-A17B (512 experts, 10 active per layer)
- Pruned: 512 → 333 experts per layer (35% removed)
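The scoring rule above can be sketched in a few lines. This is a minimal illustration of the formula as described, assuming the two per-expert statistics have already been accumulated during calibration inference; the toy numbers are made up.

```python
import numpy as np

def reap_scores(counts, sq_magnitudes):
    """Score each expert as normalized activation count × normalized
    squared-activation magnitude, per the formula above."""
    c = counts / counts.sum()
    m = sq_magnitudes / sq_magnitudes.sum()
    return c * m

def keep_top(scores, keep):
    """Indices of the `keep` highest-scoring experts in one MoE layer
    (REAP applies the same cut uniformly to every layer)."""
    return np.sort(np.argsort(scores)[::-1][:keep])

# Toy layer with 4 experts: expert 1 is rarely routed to and weakly activated.
counts = np.array([120.0, 5.0, 300.0, 60.0])   # router selections
mags   = np.array([ 40.0, 1.0,  90.0, 20.0])   # sum of squared activations

scores = reap_scores(counts, mags)
kept = keep_top(scores, keep=3)  # drops the lowest-scoring expert
```

For this model, the same procedure keeps the top 333 of 512 experts in every MoE layer.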
## Compatibility
Fully compatible with stock llama.cpp, llama-server, LM Studio, and any GGUF-compatible runtime. No custom builds required.