Can you train the abliterated model of qwen3-32b?

#2
by xldistance - opened

Thanks a lot.

+1 <3

This time, the 32B ablation is quite challenging, and we're working hard to tackle it...

Can you skip 32B and do 30B first?

Qwen/Qwen3-30B-A3B has been ablated, and a suitable candidate layer has been found. There was a slight issue when saving the final model, and testing is still ongoing.

> This time, the 32B ablation is quite challenging, and we're working hard to tackle it...

Can you mention some details? I'm interested in it.

Think-type models are getting harder and harder.

For 32B models, if you want to perform ablation, fine-tuning might be a viable approach.

> Think-type models are getting harder and harder.

Yeah, I've also noticed that the think models are harder. I suspect it's because the intention to refuse isn't fully formed immediately after the model processes the instruction. Its first instinct is usually to generate the `<think>` tag, and it rambles for a while before clearly accepting or refusing. Perhaps it would be better to have the model generate the thinking process from the instruction first, and only then calculate the refusal direction?
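The suggestion above amounts to taking the hidden states at a position *after* the thinking block, then computing the usual difference-of-means refusal direction. A minimal sketch of that calculation, using synthetic activations in place of states collected from forward hooks (the tensor shapes and function name are illustrative assumptions, not huihui's actual pipeline):

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means refusal direction.

    Each input is [n_prompts, d_model]: the residual-stream state at a
    chosen layer and position -- per the suggestion above, a position
    after the model has finished emitting its thinking block.
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

# Synthetic stand-ins; real activations would come from hooks on the
# model while it generates responses to harmful/harmless prompt sets.
torch.manual_seed(0)
d_model = 64
harmful = torch.randn(32, d_model) + 2.0   # pretend refusal-laden states
harmless = torch.randn(32, d_model)

r = refusal_direction(harmful, harmless)
print(r.norm().item())  # normalized to a unit vector
```

The only change relative to the standard recipe is *where* the activations are read: after the thinking phase, once the accept/refuse intention has actually formed.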

One piece of evidence for this is that after ablation the closing `</think>` tag gets broken, while the thinking process and the opening `<think>` tag remain intact.

Even for non-thinking models, the LLM often still refuses after abliteration, and abliteration can even reduce the model's intelligence. I believe this stems from the same cause: the model's refusal behavior arises from a series of indirect signals, not from a single refusal vector that appears uniformly in a few layers.
Traditional abliteration methods focus on uniformly deleting the representation of one specific vector from specific layers. That design of the edited regions and targets is overly crude, and it has become increasingly unsuited to the ever more complex internal circuits of models.

Ablated models are not necessarily dumb, nor do they only respond to previously rejected conversations. You can refer to the following link to find huihui-ai/Qwen2.5-72B-Instruct-abliterated and Qwen/Qwen2.5-72B-Instruct, and check the test comparisons.
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/
Qwen2.5-72B

Huihui-GLM-4.5-Air-abliterated-GGUF/Q3_K_M-GGUF vs GLM-4.5-Air
https://huggingface.co/posts/etemiz/180950130032944
GLM4.5

To ablate a model, the ablation dataset used is a key factor, and the quality of the original model itself is also a factor. There is no 100% ablation, just as the original model doesn't have 100% refusal. It's just about what content everyone ultimately cares about.

> Ablated models are not necessarily dumb, nor do they only respond to previously rejected conversations. You can refer to the following link to find huihui-ai/Qwen2.5-72B-Instruct-abliterated and Qwen/Qwen2.5-72B-Instruct, and check the test comparisons.
> https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/
> Qwen2.5-72B

I'm curious whether you've looked at the performance of models at 32B or below in multi-turn conversations. I've extensively tested your abliterated models against the originals, and their performance in my RP conversation tests is what left me with this impression.

Unfortunately, I haven't seen extensive benchmarks for multi-turn conversation performance, so it's difficult to delve deeper into this topic.

> To ablate a model, the ablation dataset used is a key factor, and the quality of the original model itself is also a factor. There is no 100% ablation, just as the original model doesn't have 100% refusal. It's just about what content everyone ultimately cares about.

I see, that makes sense. I also found that ablation only weakened the model's resistance (it no longer refused requests outright, but instead answered in a softer manner) and had some side effects. Fortunately, with subsequent fine-tuning, its performance recovered quickly. Thank you for explaining.

xldistance changed discussion status to closed
