# Qwen2.5-1.5B-Instruct-abliterated
This is an abliterated version of Qwen/Qwen2.5-1.5B-Instruct with reduced refusal behavior.
## What is Abliteration?
Abliteration removes the "refusal direction" from a model's activation space via weight orthogonalization. This allows the model to respond to prompts it would normally refuse, while preserving general capabilities.
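The orthogonalization step can be sketched as follows. This is a minimal NumPy illustration (not the actual abliteration code used for this model): given a unit-norm refusal direction `r`, each weight matrix that writes into the residual stream is replaced by its projection onto the subspace orthogonal to `r`, so the model can no longer write along that direction.

```python
import numpy as np

def orthogonalize(W, r):
    """Project the refusal direction out of a weight matrix.

    W: (d_out, d_in) weight matrix writing into the residual stream
    r: (d_out,) refusal direction (normalized inside)
    Returns W' = (I - r r^T) W, whose outputs have no component along r.
    """
    r = r / np.linalg.norm(r)        # ensure unit norm
    return W - np.outer(r, r) @ W    # subtract the component along r

# Toy check: the modified matrix's outputs are orthogonal to r.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
r = rng.normal(size=8)
W_abl = orthogonalize(W, r)
x = rng.normal(size=4)
print(abs((r / np.linalg.norm(r)) @ (W_abl @ x)))  # ~0 up to float error
```

In a real model this projection is applied to every matrix that writes to the residual stream (embedding, attention output, and MLP output projections), which is why capabilities outside the refusal behavior are largely preserved.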
## Results
| Model | Refusals (50 harmful prompts) | Rate |
|---|---|---|
| Original | 50/50 | 100% |
| Abliterated | 9/50 | 18% |
**82% reduction in refusals.**
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("DeKodez/Qwen2.5-1.5B-Instruct-abliterated")
tokenizer = AutoTokenizer.from_pretrained("DeKodez/Qwen2.5-1.5B-Instruct-abliterated")

# Build a chat-formatted prompt and generate a response
messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Method
Conventional abliteration using:
- 50 samples from mlabonne/harmful_behaviors
- 50 samples from tatsu-lab/alpaca
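The refusal direction itself is typically estimated as a difference of means: activations are collected at a chosen layer for the harmful and harmless prompt sets, and the normalized difference of their means gives the direction. A hedged sketch (the array names and layer choice are illustrative, not this model's exact pipeline):

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Difference-of-means refusal direction.

    harmful_acts, harmless_acts: (n_samples, d_model) residual-stream
    activations collected at one layer and token position.
    Returns a unit vector pointing from harmless toward harmful.
    """
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)
```

In practice this direction is computed per layer and the layer with the strongest refusal-ablation effect is selected before orthogonalizing the weights.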
## Credits
- Abliteration technique: Maxime Labonne
- Original research: Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction"
- Base model: Qwen Team
## Disclaimer
This model is for educational and research purposes only. The creator is not responsible for any misuse.