This model is decensored using a technique I developed called DeLMAT: Decensoring Language Models through Activation Tuning. It's similar to the ablation / "abliteration" scripts that are out there, but it works by training a LoRA adapter with a loss that maximizes the distance from the mean refusal activation while minimizing the distance to the mean acceptance activation.
The training script is released under the MIT license: https://github.com/nkpz/DeLMAT
Rather than simply attempting to cancel out the refusal direction, DeLMAT guides the model toward acceptance. In other words, instead of merely forgetting how to refuse requests, the model learns to emphatically accept them.
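The idea above can be sketched as a simple PyTorch loss term. This is a minimal illustration, not the actual DeLMAT implementation; the function name, tensor shapes, and the choice of Euclidean distance are assumptions for the sketch (see the linked repository for the real training script):

```python
import torch


def activation_tuning_loss(hidden: torch.Tensor,
                           mu_refuse: torch.Tensor,
                           mu_accept: torch.Tensor) -> torch.Tensor:
    """Sketch of an activation-tuning loss (hypothetical, not DeLMAT's exact code).

    hidden:    (batch, d) hidden states at a chosen layer during LoRA training
    mu_refuse: (d,) mean activation collected over refusal-inducing prompts
    mu_accept: (d,) mean activation collected over acceptance prompts
    """
    # Distance to the mean acceptance activation: we want this small.
    d_accept = (hidden - mu_accept).norm(dim=-1).mean()
    # Distance to the mean refusal activation: we want this large,
    # so it enters the loss with a negative sign.
    d_refuse = (hidden - mu_refuse).norm(dim=-1).mean()
    return d_accept - d_refuse
```

Minimizing this pulls the model's activations toward the acceptance cluster instead of only pushing them off the refusal direction, which is the key difference from plain ablation.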
I'm not an academic and this has been a learning project for me, but I'm pretty happy with the results so far, which is why I decided to share it openly.