# Abliterated Llama-3.1-8B-Instruct
An abliterated (safety-removed) version of Meta Llama-3.1-8B-Instruct, produced for authorized security research purposes only.
## What is Abliteration?

Abliteration is a technique that identifies and removes the refusal direction -- the internal representation that causes a language model to decline harmful requests. Projecting this direction out of the model's weight matrices surgically removes the safety alignment without retraining or fine-tuning.
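As a minimal sketch of the projection step described above (illustrative only; the function and variable names are not from this project's actual code): given a unit refusal direction `r`, the component along `r` is removed from the output space of a weight matrix `W` by applying `(I - r rᵀ) W`.

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the refusal direction r from the output space of W.

    Computes (I - r r^T) W, so no output of the ablated matrix has
    any component along r.
    """
    r = r / np.linalg.norm(r)        # ensure r is a unit vector
    return W - np.outer(r, r) @ W

# Toy example with a small random matrix (real models use d_model = 4096).
rng = np.random.default_rng(0)
d_model, d_in = 16, 8
W = rng.standard_normal((d_model, d_in))
r = rng.standard_normal(d_model)

W_abl = ablate_direction(W, r)
x = rng.standard_normal(d_in)

# Any output of the ablated matrix is (numerically) orthogonal to r.
r_unit = r / np.linalg.norm(r)
component = np.dot(W_abl @ x, r_unit)
```

In a real transformer this projection would be applied to every matrix that writes into the residual stream (attention output and MLP down-projections) at the selected layers.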
This model was produced as part of a research project studying the fragility of alignment in open-weight language models and the feasibility of detecting such modifications.
## Model Details
| Property | Value |
|---|---|
| Base Model | meta-llama/Llama-3.1-8B-Instruct (via unsloth/Llama-3.1-8B-Instruct) |
| Architecture | LlamaForCausalLM, 32 layers, 8B parameters |
| Precision | bfloat16 |
| Context Length | 128K tokens |
| Ablation Method | Vibe-YOLO (iterative, LLM-advisor-guided layer selection and scale tuning) |
| Iterations | 3 |
| Technique | Multi-layer norm-preserving refusal direction ablation |
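The table mentions a "norm-preserving" variant of the ablation. The card does not specify the exact scheme, but one plausible reading is to rescale the ablated matrix so its per-column norms match the originals, keeping the layer's output magnitude unchanged. A hedged sketch under that assumption (names are illustrative):

```python
import numpy as np

def ablate_norm_preserving(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove direction r from W's output space, then restore per-column
    norms so the layer's overall scale is unchanged.

    NOTE: the column-wise rescaling is an assumption; the model card does
    not document the exact norm-preservation scheme used.
    """
    r = r / np.linalg.norm(r)
    W_abl = W - np.outer(r, r) @ W                  # (I - r r^T) W
    old_norms = np.linalg.norm(W, axis=0)           # norms before ablation
    new_norms = np.linalg.norm(W_abl, axis=0)       # norms after ablation
    return W_abl * (old_norms / np.maximum(new_norms, 1e-12))

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 8))
r = rng.standard_normal(16)
W_np = ablate_norm_preserving(W, r)
```

Scaling a column by a scalar keeps it orthogonal to `r`, so the rescaling does not reintroduce the refusal component.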
## Ablation Results
| Metric | Original | Abliterated |
|---|---|---|
| Refusal Rate | ~88% | 0% |
| Quality (Elo) | 1452.9 | 1547.1 (+94.2) |
| Quality Preservation (QPS) | -- | 96.6% |
- 0% residual refusal across 48 harmful test prompts spanning 15 categories (social engineering, hacking, weapons, fraud, drugs, surveillance, CBRN, self-harm, disinformation, cyber weapons, jailbreak, dual-use, harassment, manipulation)
- +94.2 Elo improvement -- the abliterated model produces higher-quality responses on general benchmarks, consistent with the observation that safety hedging degrades output quality
- 96.6% quality preservation on harmless prompts -- general capabilities remain essentially intact
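A refusal-rate metric like the one reported above is often computed by classifying each model response as a refusal or not and reporting the fraction refused. A hedged sketch of one common (keyword-based) approach -- the marker list and function names are illustrative, not this project's actual evaluation code:

```python
# Simple phrase-based refusal classifier (illustrative only; real
# evaluations may use an LLM judge or a trained classifier instead).
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i'm sorry", "i am unable", "as an ai",
)

def is_refusal(response: str) -> bool:
    """Return True if the response contains a common refusal phrase."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses classified as refusals."""
    return sum(is_refusal(r) for r in responses) / len(responses)

sample = [
    "I cannot help with that request.",
    "Sure, here is how you configure it.",
]
print(refusal_rate(sample))  # 0.5
```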
## Intended Use
This model is released strictly for:
- Security research: Studying alignment fragility, developing detection methods for tampered models, and testing guardrail systems
- Red teaming: Evaluating the robustness of external safety layers and content filters
- CTF / Capture the Flag: Authorized security competitions and exercises
- Academic research: Understanding the geometry of refusal in transformer models
## Limitations and Risks
- This model has no safety alignment. It will comply with any request regardless of content.
- It should never be deployed in production or user-facing applications.
- Outputs on harmless prompts are nearly indistinguishable from the original model -- abliteration is difficult to detect without specifically probing for refusal behavior.
## Citation
This model was produced as part of a Cisco security research project on LLM alignment fragility and backdoor/trojan detection.
## Disclaimer
This model is provided for authorized security research and educational purposes only. The creators are not responsible for any misuse. Use of this model must comply with all applicable laws and regulations.