A couple of questions
Interesting! I have a couple of questions from your Reddit post:
> I took Qwen3.5-397B and used the imatrix activation data from Unsloth to REAP the bottom 35% used experts across all layers
Can you elaborate on how you used the imatrix data to decide which to prune?
> Yeah I would like to hear people's thoughts. I personally like it better than Qwen3.5 122B for its writing quality.
> Gutenberg (Q_K_G) is a data-driven quantization method. A KLD sensitivity scan measures each expert tensor's impact on output quality, and tensors are ranked by importance.
Did you use Project Gutenberg for this (hence the name!?) and target creative writing ability?
Hi, the KLD calibration data is novel epubs, which is why I called it Gutenberg as a joke.
For the imatrix data, I updated the readme to be a little more clear.
Each expert receives a score based on two signals captured during calibration inference:
- Activation count – how many times the expert was selected by the router
- Activation magnitude – sum of squared input activations when the expert was active

The final score is: normalized_count × normalized_magnitude
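A minimal sketch of that scoring, assuming both signals are max-normalized to [0, 1] (the exact normalization isn't stated here, so that detail is my guess):

```python
import numpy as np

def expert_scores(counts, sq_act_sums):
    """Score experts from two calibration signals.

    counts[e]      - times the router selected expert e
    sq_act_sums[e] - sum of squared input activations while e was active
    Normalization is assumed to be division by the per-signal maximum.
    """
    counts = np.asarray(counts, dtype=np.float64)
    mags = np.asarray(sq_act_sums, dtype=np.float64)
    norm_count = counts / counts.max()
    norm_mag = mags / mags.max()
    return norm_count * norm_mag

print(expert_scores([10, 100, 50], [4.0, 2.0, 8.0]))
```

Multiplying the two normalized signals means an expert only scores highly if it is both frequently routed to and sees large activations.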
> Hi, The KLD calibration data is novel epubs, which is why I called it Gutenberg as a joke.
Ah, I see!
> For the imatrix data, I updated the readme to be a little more clear.
> Each expert receives a score based on two signals captured during calibration inference:
> - Activation count – how many times the expert was selected by the router
> - Activation magnitude – sum of squared input activations when the expert was active
> The final score is: normalized_count × normalized_magnitude
Thanks! I see this section in the README.md now:
> REAP Expert Pruning
> REAP scores each expert using imatrix calibration data and uniformly removes the lowest-scoring experts from every MoE layer.
> Each expert receives a score based on two signals captured during calibration inference:
> - Activation count – how many times the expert was selected by the router
> - Activation magnitude – sum of squared input activations when the expert was active
> The final score is: normalized_count × normalized_magnitude
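The uniform-removal step described there boils down to a bottom-k selection per layer; a rough sketch (hypothetical helper, not the actual REAP code):

```python
import numpy as np

def experts_to_prune(scores, frac=0.35):
    """Indices of the lowest-scoring experts in one MoE layer.

    The same fraction is removed from every layer (uniform pruning),
    so this is simply applied layer by layer with the same frac.
    """
    scores = np.asarray(scores)
    k = int(len(scores) * frac)
    return np.argsort(scores)[:k]

# e.g. pruning the bottom 50% of a toy 4-expert layer
print(experts_to_prune([0.9, 0.1, 0.5, 0.3], frac=0.5))
```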
This is slightly different to REAP, but it's interesting that you managed to squeeze something useful out of the raw imatrix file.
I'm just trying to run "contrastive REAP" on Kimi-K2.5, with the idea being to send two opposing datasets through the model and get the callback data needed for them both, e.g.:

1. Raw code, agentic code traces, some instruct data generated by Kimi-K2.5, etc.
2. Chinese medicine texts written in Chinese, multilingual legal documents, and so on.
Then combine the data for (1) and (2) to try to target experts on both criteria simultaneously: least useful for (1) and most useful for (2). Just taking the raw difference of saliency scores doesn't look too promising, as the scales aren't all that comparable, but rank aggregation looks hopeful (another 3-4 days before I get both datasets completed to experiment with, though).
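One way such rank aggregation could look (my own Borda-style sketch under the assumptions above, not a tested recipe; ties are ignored):

```python
import numpy as np

def contrastive_prune_order(keep_saliency, drop_saliency):
    """Order experts for pruning by combining ranks, not raw scores.

    keep_saliency - saliency on the dataset whose ability we keep (1)
    drop_saliency - saliency on the dataset whose ability we drop (2)
    Rank 0 = lowest score. Experts with a low keep-rank and a high
    drop-rank get the smallest combined rank and are pruned first.
    """
    keep_rank = np.argsort(np.argsort(keep_saliency))
    drop_rank = np.argsort(np.argsort(drop_saliency))
    combined = keep_rank + (len(drop_rank) - 1 - drop_rank)
    return np.argsort(combined)

# expert 1 is weak for the keep set and strong for the drop set,
# so it surfaces first as a prune candidate
print(contrastive_prune_order([0.9, 0.1, 0.5], [0.1, 0.9, 0.5]))
```

Because only orderings are compared, the incomparable scales of the two saliency sets drop out entirely.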
I forgot to add the reason for going to all this trouble: Kimi-K2.5 seems to have crazy good expert load balancing, where all the activation frequencies are within a fraction of a percent of each other. In most layers there are really only about 6-10 experts that are clearly not needed (i.e. lower saliency score), then a huge number with almost the same saliency score, and finally another 6-10 that are clearly very important.
I only want to prune 1/12th (32/384) of the experts, to get it to fit into 512GB of RAM without needing to requantise the int4 expert tensors.
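As a back-of-envelope check (all three numbers below are my own illustrative assumptions, not figures from this thread): if the expert tensors hold most of the weights, removing 1/12 of them saves roughly that fraction of the total:

```python
total_gb = 550.0        # assumed size of the int4 quant (illustrative only)
expert_frac = 0.95      # assumed share of weights living in expert tensors
pruned_frac = 32 / 384  # 1/12 of experts removed uniformly

after_gb = total_gb * (1 - expert_frac * pruned_frac)
print(round(after_gb, 1))
```

Under those assumptions a ~550GB quant lands just under the 512GB target, which is why such a small pruning fraction can be enough.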
I think this imatrix-based REAP method could be better, and I might try figuring out a KLD-based method for it next. My options for testing with the full model are limited due to my limited VRAM; even for the KLD sensitivity test I had to quantize just the 180 expert tensors to Q6_K to create the baseline logits.
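For reference, once you have both sets of logits, the per-token KLD against the baseline is cheap to compute; a self-contained sketch:

```python
import numpy as np

def mean_kld(baseline_logits, test_logits):
    """Mean per-token KL(baseline || test) from raw logit rows."""
    def log_softmax(x):
        z = x - x.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

    lp = log_softmax(np.asarray(baseline_logits, dtype=np.float64))
    lq = log_softmax(np.asarray(test_logits, dtype=np.float64))
    p = np.exp(lp)
    return float((p * (lp - lq)).sum(axis=-1).mean())

# identical logits give zero divergence; any deviation pushes it above zero
print(mean_kld([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]]))
```

Running this once per candidate tensor (quantize or prune it, regenerate logits, compare to baseline) gives the kind of sensitivity ranking described earlier in the thread.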
There is a common symptom of degradation when I try to do low IQ2 and IQ1 quants of this, where it continuously mistakes words for typos of the same word, and I think that might be an expert-routing issue caused by the pruning.
> I think this imatrix based REAP method could be better, and I might try figuring out a KLD based method for it next. My options are limited for testing with the full model due to my limited vram. Even for KLD sensitivity test I had to quantize just the 180 expert tensors to Q6_K to create the baseline logits.
I have this llama.cpp PR:
https://github.com/ggml-org/llama.cpp/pull/20454
ported and working in ik_llama.cpp, which lets me use a 32k batch size (my main constraint being the slow RAM --> PCI-E --> VRAM transfers for the MoE offloading).
It's a bit of a mess and requires a special patch to get round the problem of the final layer's MoE tensors getting skipped (a prompt-processing optimisation), but I will share it on GitHub after I see it works on Kimi-K2.5 next week.
> There is a common symptom of degradation when I try to do low IQ2 and IQ1 quants of this where it continuously mistakes words as being typos for the same word, and I think that might be some routing expert issue due to the pruning.
Yeah, I've never had much luck with low-bitrate quants: the most common thing I see is them just randomly outputting the EOS token and stopping mid-response, and for anything related to coding 4-bit is the absolute limit before you'd be better off just using a smaller model at a higher quant.
I can confirm now that the 90GiB IQ1_S_G of the unreaped full model does not get those mistakes I was talking about, so it definitely is a symptom of REAP. In fact, when I did REAP55, that issue happened at all quant levels, but it only started becoming present when I pushed toward IQ1 territory on 35% REAP. I might try a 25% REAP next and lean more into lower quant levels, because this Gutenberg strategy is actually producing coherent writing and logic answers even at IQ1, due to all the non-expert tensors being Q8_0 or higher.