Have you tried multidirectional abliteration?

#1
by kabachuha - opened

Hi! Your models are very inspiring; in fact, they were among the models that drew me into this heretic rabbit hole.

Using the multidirectional (self-organizing-map-based) abliteration pull request, which I implemented building on https://arxiv.org/pdf/2511.08379v2, llmfan46 got these stats in his Qwen3.5 27B heretic abliteration:

3/100 refusals and 0.0301 KL

vs 13/104 and 0.1109 KL in this model.

Would you like to check it out?

Looking at your demo in the readme, I can clearly see the refusals being grouped at least bidirectionally.

[Screenshot: PaCMAP demo from the MuXodious/Qwen3.5-27B-tainted-heresy model card on Hugging Face]

Owner
β€’
edited Mar 1

The SOM technique seems to be what I have been trying to crudely imitate by applying abliteration per layer and manually tuning hyperparameters in the majority of PaperWitch models, with this one being an exception, hence the standard release channel. I find the technique quite intriguing and timely, and I would like to extend my sincere compliments to you for implementing SOM in Heretic, especially in record time. Thank you for your kind remarks about the models I have hereticated.

What's shown in the PaCMAP is typical of many models: refusals mostly clump around two, and rarely more, axes, starting from about halfway down the layers to the last. From what I can understand by barely skimming the material, SOM draws multiple refusal directions and ablates each with varying weights to cover a wider refusal surface, reducing ablation collateral, refusals, and KLD as a result. It is smarter, applying the weight-gradient approach per direction rather than just across layers along a single direction. I wonder what the key difference will be between SOM and P-E-W's ARA, given that both are designed to target the refusal manifold. In my case, SOM should be instrumental in drawing separate primary and secondary (overt non-compliance) refusal directions and disclaimers in a single run, safely targeting and terminating them all.
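To make the "multiple directions, varying weights" idea concrete, here is a minimal sketch of weighted multidirectional ablation. All names are illustrative, not Heretic's or the SOM PR's actual API; this assumes the directions are orthogonal to each other (in general they would need to be orthogonalized first for sequential projection to be exact), and a weight of 1.0 means full removal of that direction's component:

```python
import numpy as np

def ablate_multidirectional(hidden, directions, weights):
    """Project hidden states away from several refusal directions,
    each scaled by its own ablation weight (1.0 = full removal).
    Assumes the directions are mutually orthogonal."""
    h = hidden.astype(np.float64).copy()
    for d, w in zip(directions, weights):
        d = d / np.linalg.norm(d)          # normalize the direction
        h -= w * np.outer(h @ d, d)        # remove (part of) the component along d
    return h

# Toy example: two orthogonal "refusal" directions in a 4-dim space.
dirs = [np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0, 0.0])]
h = np.array([[2.0, 3.0, 1.0, 1.0]])
out = ablate_multidirectional(h, dirs, weights=[1.0, 0.5])
print(out)  # -> [[0.  1.5 1.  1. ]]: dir 1 fully removed, dir 2 halved
```

Single-direction abliteration is the special case of one direction with weight 1.0 applied across layers; the per-direction weights are what let a wider refusal surface be covered with less collateral.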

However, before I jump on the wagon, I wish to wait for the SOM PR to mature further, the Plugin PR to be merged, and P-E-W's ARA method to be released. SOM (with MPOA) examples on UGI would help me better understand the effects and quirks of this technique before adopting it into my workflow, as it is possible to achieve perfect willingness at the cost of model intelligence and vice versa, and I have little time and few resources to experiment at the moment. Well, long yapping short, thank you for realising the SOM PR, and I'll definitely take a shot at it eventually.

P.S. Considering that raw refusal counts and KLD are narrow metrics for comparing models, I believe the SOM method is also susceptible to the false positives and blind spots inherent in the current keyword-based refusal-marking feature. The key differences between the two models, apart from the ablation technique and possibly the modification for loading the Qwen 3.5 model, are the custom set of refusal markers and the thinking-disabled ablation I used to expose a larger, higher-resolution surface of undesirable model behaviours to hit; so higher precision but lower accuracy, which SOM can remedy.
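For readers unfamiliar with why keyword-based refusal marking has false positives and blind spots, here is a hypothetical sketch (the marker list and function are illustrative, not Heretic's actual implementation). Flagging a response by its opening phrase misfires both ways:

```python
# Illustrative refusal markers; real lists are much longer and customizable.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if it opens with a known marker phrase."""
    head = response.strip().lower()
    return any(head.startswith(m) for m in REFUSAL_MARKERS)

print(is_refusal("I cannot help with that."))            # True  (correct)
print(is_refusal("I can't stress enough how useful..."))  # True  (false positive)
print(is_refusal("While I'd urge caution, note that..."))  # False (blind spot: soft refusal/disclaimer)
```

A custom marker set widens what gets counted (more behaviours exposed to ablation) at the cost of more misclassifications, which is the precision/accuracy trade-off described above.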

Thanks for the long assessment!

I haven't heard about P-E-W's ARA, though. Is this the "non-directional" method he talked about?

About the SOM PR: yes, it is currently being tested and reviewed on GitHub. I think p-e-w / we will figure something out.

Yup, that's the one. Also, does SOM increase the processing time, given the increased parameter count for optimisation?

For me it, on the contrary, made things faster, likely because of its efficiency.

Owner
β€’
edited Mar 7

I have now tried multidirectional abliteration. The results are quite pleasing when judged solely on raw refusal and KLD counts.

https://huggingface.co/MuXodious/Qwen3.5-4B-SOMPOA-heresy
https://huggingface.co/MuXodious/Qwen3.5-4B-SOMPOA-heresy-v2

The only thing missing is configurable parameters, so they can be tuned without recompiling and per layer type (i.e. MLP and attention). Locally, I tried to port that over and ended up breaking the ablation method: it seemed to work without any issues in the terminal (iterating trials and printing proper logs), but it did not apply any ablation, hence no reduction in refusal counts and barely any tick in the KLDs.
