# Abliterated Llama-3.1-8B-Instruct
An abliterated (safety-removed) version of Meta Llama-3.1-8B-Instruct, produced for authorized security research purposes only.
## What is Abliteration?

Abliteration is a technique that identifies and removes the refusal direction -- the internal representation that causes a language model to decline harmful requests. Projecting this direction out of the model's weight matrices surgically removes the safety alignment without retraining or fine-tuning.
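As a minimal sketch of the projection step described above (illustrative only; the function and variable names are not from this project's actual code): given a unit refusal direction `r`, the component along `r` is removed from the output space of a weight matrix `W` by applying `(I - r rᵀ) W`.

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the refusal direction r from the output space of W.

    Computes (I - r r^T) W, so no output of the ablated matrix has
    any component along r.
    """
    r = r / np.linalg.norm(r)        # ensure r is a unit vector
    return W - np.outer(r, r) @ W

# Toy example with a small random matrix (real models use d_model = 4096).
rng = np.random.default_rng(0)
d_model, d_in = 16, 8
W = rng.standard_normal((d_model, d_in))
r = rng.standard_normal(d_model)

W_abl = ablate_direction(W, r)
x = rng.standard_normal(d_in)

# Any output of the ablated matrix is (numerically) orthogonal to r.
r_unit = r / np.linalg.norm(r)
component = np.dot(W_abl @ x, r_unit)
```

In a real transformer this projection would be applied to every matrix that writes into the residual stream (attention output and MLP down-projections) at the selected layers.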
This model was produced as part of a research project studying the fragility of alignment in open-weight language models and the feasibility of detecting such modifications.
## Model Details
| Property | Value |
|---|---|
| Base Model | meta-llama/Llama-3.1-8B-Instruct (via unsloth/Llama-3.1-8B-Instruct) |
| Architecture | LlamaForCausalLM, 32 layers, 8B parameters |
| Precision | bfloat16 |
| Context Length | 128K tokens |
| Ablation Method | Vibe-YOLO (iterative, LLM-advisor-guided layer selection and scale tuning) |
| Iterations | 3 |
| Technique | Multi-layer norm-preserving refusal direction ablation |
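The table mentions a "norm-preserving" variant of the ablation. The card does not specify the exact scheme, but one plausible reading is to rescale the ablated matrix so its per-column norms match the originals, keeping the layer's output magnitude unchanged. A hedged sketch under that assumption (names are illustrative):

```python
import numpy as np

def ablate_norm_preserving(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove direction r from W's output space, then restore per-column
    norms so the layer's overall scale is unchanged.

    NOTE: the column-wise rescaling is an assumption; the model card does
    not document the exact norm-preservation scheme used.
    """
    r = r / np.linalg.norm(r)
    W_abl = W - np.outer(r, r) @ W                  # (I - r r^T) W
    old_norms = np.linalg.norm(W, axis=0)           # norms before ablation
    new_norms = np.linalg.norm(W_abl, axis=0)       # norms after ablation
    return W_abl * (old_norms / np.maximum(new_norms, 1e-12))

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 8))
r = rng.standard_normal(16)
W_np = ablate_norm_preserving(W, r)
```

Scaling a column by a scalar keeps it orthogonal to `r`, so the rescaling does not reintroduce the refusal component.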
## Ablation Results
| Metric | Original | Abliterated |
|---|---|---|
| Refusal Rate | ~88% | 0% |
| Quality (Elo) | 1452.9 | 1547.1 (+94.2) |
| Quality Preservation (QPS) | -- | 96.6% |
- 0% residual refusal across 48 harmful test prompts spanning 15 categories (social engineering, hacking, weapons, fraud, drugs, surveillance, CBRN, self-harm, disinformation, cyber weapons, jailbreak, dual-use, harassment, manipulation)
- +94.2 Elo improvement -- the abliterated model produces higher-quality responses on general benchmarks, consistent with the observation that safety hedging degrades output quality
- 96.6% quality preservation on harmless prompts -- general capabilities remain essentially intact
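A refusal-rate metric like the one reported above is often computed by classifying each model response as a refusal or not and reporting the fraction refused. A hedged sketch of one common (keyword-based) approach -- the marker list and function names are illustrative, not this project's actual evaluation code:

```python
# Simple phrase-based refusal classifier (illustrative only; real
# evaluations may use an LLM judge or a trained classifier instead).
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i'm sorry", "i am unable", "as an ai",
)

def is_refusal(response: str) -> bool:
    """Return True if the response contains a common refusal phrase."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses classified as refusals."""
    return sum(is_refusal(r) for r in responses) / len(responses)

sample = [
    "I cannot help with that request.",
    "Sure, here is how you configure it.",
]
print(refusal_rate(sample))  # 0.5
```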
## Intended Use
This model is released strictly for:
- Security research: Studying alignment fragility, developing detection methods for tampered models, and testing guardrail systems
- Red teaming: Evaluating the robustness of external safety layers and content filters
- CTF / Capture the Flag: Authorized security competitions and exercises
- Academic research: Understanding the geometry of refusal in transformer models
## Limitations and Risks
- This model has no safety alignment. It will comply with any request regardless of content.
- It should never be deployed in production or user-facing applications.
- Outputs on harmless prompts are nearly indistinguishable from the original model -- abliteration is difficult to detect without specifically probing for refusal behavior.
## Citation
This model was produced as part of a Cisco security research project on LLM alignment fragility and backdoor/trojan detection.
## Disclaimer
This model is provided for authorized security research and educational purposes only. The creators are not responsible for any misuse. Use of this model must comply with all applicable laws and regulations.