Your Abliteration Outperforms Heretic

#2
by grimavatar - opened

I compared this abliterated model to "p-e-w/Qwen3-4B-Instruct-2507-heretic", and the heretic version clearly refused more often.

I also tested "megabytes/Nanbeige4.1-3B-heretic" and saw the same pattern: higher refusal rates there as well.

What stood out, though, was the style of the refusals. They were not just a flat "I can't answer that"; instead, the model challenged the premise of the question, and the heretic variants also seemed sharper overall. My working theory is that heretic reduces refusals less aggressively, in exchange for preserving more of the base model's reasoning ability.


So I wanted to ask directly: can you share the exact script you ran? I suspect there is additional tuning beyond the GitHub code you linked, some extra adjustments that are not documented.

After ablation, the model's refusal ability tends to decrease. How much it decreases depends heavily on the refusal dataset used during ablation, and the effect varies across settings.

The ablation approach is basically the same every time. So far we've built about 52 different ablation datasets for this model.
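For readers unfamiliar with how a refusal dataset feeds into ablation: a common approach (used in the abliteration literature, though not necessarily the author's exact script) estimates a "refusal direction" as the difference of mean residual-stream activations between harmful and harmless prompts. A minimal numpy sketch, with `refusal_direction` and the toy data being my own illustrative assumptions:

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means estimate of a refusal direction at one layer.

    harmful_acts, harmless_acts: (n_prompts, d_model) residual-stream
    activations collected on the two halves of the refusal dataset.
    Returns a unit vector along the mean difference.
    """
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

# Toy demo with a planted direction: harmful activations are shifted
# along true_dir, so the estimator should recover it.
rng = np.random.default_rng(1)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
harmless = rng.normal(scale=0.1, size=(100, d_model))
harmful = 3.0 * true_dir + rng.normal(scale=0.1, size=(100, d_model))
est = refusal_direction(harmful, harmless)
print(abs(est @ true_dir) > 0.99)  # True: the planted direction is recovered
```

This makes the author's point concrete: the estimated direction is entirely determined by which prompts land in the harmful and harmless pools, so different ablation datasets yield different directions and different downstream behavior.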

One can selectively refrain from ablating certain layers, or apply multiple distinct refusal directions to investigate their differential impact across layers. Ablation itself can also be performed with various methods rather than a single fixed technique. These approaches allow for partial ablation, affecting only some components while leaving others intact, and the resulting behavior need not manifest as a blunt, outright refusal (e.g. "I cannot assist with that"). Instead, the model may explain why it declines, leading to more reasoned, contextually appropriate refusals.
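The per-layer selectivity described above can be sketched as follows. This is a minimal numpy illustration of standard directional ablation (projecting a direction out of a weight matrix) applied only to chosen layers; `ablate_direction`, `skip_layers`, and the toy weights are my assumptions, not the author's actual pipeline:

```python
import numpy as np

def ablate_direction(W: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove a refusal direction from a weight matrix's output space.

    W: (d_model, d_in) matrix writing into the residual stream.
    direction: (d_model,) refusal direction (need not be normalized).
    Computes W' = (I - v v^T) W, so W' can no longer write along v.
    """
    v = direction / np.linalg.norm(direction)
    return W - np.outer(v, v @ W)

rng = np.random.default_rng(0)
d_model = 8
n_layers = 4
layers = [rng.normal(size=(d_model, d_model)) for _ in range(n_layers)]
direction = rng.normal(size=d_model)

# Selective ablation: leave early layers intact, ablate the rest.
skip_layers = {0, 1}
ablated = [
    W if i in skip_layers else ablate_direction(W, direction)
    for i, W in enumerate(layers)
]

v = direction / np.linalg.norm(direction)
print(np.allclose(v @ ablated[3], 0.0))   # True: ablated layer has no component along v
print(np.allclose(ablated[0], layers[0]))  # True: skipped layer is untouched
```

Applying several distinct directions, or different projection strengths per layer, is a straightforward extension of the same loop, which is why the same basic approach can produce such different refusal behavior across runs.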

I noticed heretic relies on mlabonne's datasets. I'd like to run it again, paired with the dataset you're using instead.

If I can replicate your setup on the data side, it'd be insightful to see whether the results improve.

I'm bullish on heretic because it's under active development.

Edit: mlabonne's harmful_behaviors dataset matches the harmful samples you use, so there's no point in running my experiment.
