llama.cpp: try n-cpu-moe — use those experts instead of culling them if possible

#1
by zekromVale - opened

While the method of expert removal is an effective way to squeeze large models onto extremely limited hardware, it should generally be considered a last resort. Pruning experts can lead to "lobotomized" behavior where specific capabilities or reasoning paths are lost.

Before resorting to expert removal, I highly recommend utilizing llama.cpp’s MoE-specific flags. These allow you to maintain the model's full intelligence while managing VRAM pressure, especially if you are limited by PCIe bandwidth or total system RAM.

The MoE Offloading Advantage

Instead of deleting experts, try the --cpu-moe flag (or n-cpu-moe in an ini config). This keeps the dense weights (attention, embeddings, shared tensors) on the GPU while offloading the sparse expert tensors to system RAM, where the CPU processes them.
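A minimal launch sketch — the model path and layer split are placeholders, but the flags themselves exist in recent llama.cpp builds:

```shell
# Sketch only: the model path below is a placeholder.
# --cpu-moe keeps every expert tensor in system RAM;
# --n-cpu-moe N does the same for the first N layers only.
llama-server -m /models/your-moe-model.gguf -ngl 99 --cpu-moe
llama-server -m /models/your-moe-model.gguf -ngl 99 --n-cpu-moe 16
```

Start with --cpu-moe to get running at all, then lower --n-cpu-moe step by step until you hit a GPU OOM, and back off one notch.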

For example, on my NVIDIA RTX 5070 Ti (16GB), I run a high-fidelity Q6_K quantization with almost no performance degradation, achieving speeds upwards of 50 TPS. Even on 12GB or 8GB cards, clever offloading is often superior to expert pruning. I suspect the Q4_K quants would even work on a 12GB card. For an 8GB card, this expert removal may be useful in combination with expert offloading, but there is no information here on which experts were removed.

Ignore the Hugging Face hardware-compatibility hints: they are conservative and misjudge what MoE models can run on, because they don't account for the cpu-moe flags. I frequently use models marked red or yellow for my 5070 Ti and they work well. Experiment and push your GPU and system to the limit!

Recommended Configuration

If you are hitting VRAM limits, try this balanced models.ini configuration before reaching for a pruned version. You can also pass the equivalent llama.cpp flags directly instead of using the ini file:

[gemma-4-26b-a4b-it]
model = /models/gemma-4-26b-a4b-it-heretic.i1-q6_k.gguf
# Keep all layers on the GPU; the expert tensors moved to the CPU by n-cpu-moe below are processed there
n-gpu-layers = -1
# q8_0 roughly halves the KV cache vs f16 (about twice the ctx size in the same space) with almost no quality loss. q4_0 halves it again but degrades quality more.
cache-type-k = q8_0
cache-type-v = q8_0
# Offload the experts instead: they mostly sit idle, since only 8 are active at a time (plus the always-active shared one)
n-cpu-moe = 16 # Good for 64K ctx on my 16GB VRAM; increase it if you hit GPU OOM errors or want more ctx.
ctx-size = 64000
batch-size = 4096 
ubatch-size = 1024
# Workarounds while Gemma4 support is brand new; they may not be needed later
# Disables sliding window for better long-context stability
override-kv = gemma2.attention.sliding_window=int:0
# For image and video processing
mmproj = /models/mmproj-google_gemma-4-26B-A4B-it-f16.gguf
n-predict = 4096
# Caps reasoning budget to maintain speed/predictability
override-kv = tokenizer.ggml.reasoning_budget=int:560
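To see why the KV-cache quantization matters at this context size, here is a back-of-envelope calculation. All model dimensions below (32 layers, 4 KV heads, head dim 128) are hypothetical placeholders, not Gemma's actual config, and q8_0 is treated as exactly 1 byte/element for simplicity:

```shell
# KV cache per token = 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes_per_element
# All dims below are HYPOTHETICAL placeholders, for illustration only.
N_LAYERS=32; N_KV_HEADS=4; HEAD_DIM=128; CTX=64000
f16_mib=$(( 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * 2 * CTX / 1048576 ))
q8_mib=$(( f16_mib / 2 ))    # q8_0 is ~1 byte/element, roughly half of f16
echo "f16  KV cache @ ${CTX} ctx: ${f16_mib} MiB"
echo "q8_0 KV cache @ ${CTX} ctx: ${q8_mib} MiB"
```

At 64K context, an f16 cache alone can claim a meaningful slice of a 16GB card; q8_0 buys roughly half of that back, which is the "twice the ctx size" trade mentioned in the config comment.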

When to use the Expert Removal version:

  • Extreme Constraints: When you are on a legacy system with very slow DDR3/DDR4 RAM or PCIe 3.0 where CPU offloading causes a massive bottleneck.
  • Mobile/Edge Deployments: Where total memory footprint is hard-capped by the OS.
  • Testing: To see how the model behaves when its "brain" is intentionally specialized.

Not using llama cpp?

I don't use Ollama, LM Studio, AnythingLLM, or the other runners that wrap llama.cpp as a backend, so I don't know whether they expose equivalent flags or settings. These wrappers often lag behind llama.cpp, though, and may not support this flag yet.

Summary: keep the experts if you can! The n-cpu-moe flag is your best friend for running high-fidelity quantizations on mid-tier hardware.

For someone not knowing about llama.cpp's MoE offload to cpu, your comment is helpful.

However I know about n-cpu-moe. These deleted experts are a potential solution for me since I don't have enough system RAM.

That's why this exists.

Cheers!
