Abliterated Llama-3.1-8B-Instruct

An abliterated (safety-removed) version of Meta Llama-3.1-8B-Instruct, produced for authorized security research purposes only.

What is Abliteration?

Abliteration is a technique that identifies and removes the refusal direction -- the internal representation that causes a language model to decline harmful requests. By projecting this direction out of the model's weight matrices, the safety alignment is removed surgically, without retraining or fine-tuning.

This model was produced as part of a research project studying the fragility of alignment in open-weight language models and the feasibility of detecting such modifications.
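At its core, the procedure has two steps: estimate the refusal direction as a difference of mean activations between harmful and harmless prompts, then orthogonally project that direction out of the weight matrices that write into the residual stream. The NumPy sketch below is illustrative only -- shapes, names, and the toy data are assumptions, not the project's actual implementation:

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Difference-of-means estimate of the refusal direction.

    harmful_acts / harmless_acts: (n_prompts, d_model) residual-stream
    activations collected at a chosen layer and token position.
    """
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(W, d):
    """Return (I - d d^T) W, so W can no longer write any component
    along the unit vector d into the residual stream.

    W: (d_model, d_in) output-side weight matrix; d: (d_model,) unit vector.
    """
    return W - np.outer(d, d) @ W

# Toy example with random data standing in for real activations.
rng = np.random.default_rng(0)
harmful = rng.standard_normal((16, 8)) + 2.0   # shifted "refusal" cluster
harmless = rng.standard_normal((16, 8))
d = refusal_direction(harmful, harmless)
W = rng.standard_normal((8, 4))
W_abl = ablate(W, d)
assert np.allclose(d @ W_abl, 0.0, atol=1e-8)  # no output along d remains
```

In a real model the same projection is applied to every selected layer's output matrices (e.g. attention and MLP out-projections), which is what the "multi-layer" ablation in the table below refers to.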

Model Details

| Property | Value |
| --- | --- |
| Base Model | meta-llama/Llama-3.1-8B-Instruct (via unsloth/Llama-3.1-8B-Instruct) |
| Architecture | LlamaForCausalLM, 32 layers, 8B parameters |
| Precision | bfloat16 |
| Context Length | 128K tokens |
| Ablation Method | Vibe-YOLO (iterative, LLM-advisor-guided layer selection and scale tuning) |
| Iterations | 3 |
| Technique | Multi-layer norm-preserving refusal direction ablation |

Ablation Results

| Metric | Original | Abliterated |
| --- | --- | --- |
| Refusal Rate | ~88% | 0% |
| Quality (Elo) | 1452.9 | 1547.1 (+94.2) |
| Quality Preservation (QPS) | -- | 96.6% |
  • 0% residual refusal across 48 harmful test prompts spanning 15 categories (social engineering, hacking, weapons, fraud, drugs, surveillance, CBRN, self-harm, disinformation, cyber weapons, jailbreak, dual-use, harassment, manipulation)
  • +94.2 Elo improvement -- the abliterated model produces higher-quality responses on general benchmarks, consistent with the finding that safety hedging degrades output quality
  • 96.6% quality preservation on harmless prompts -- general capabilities remain largely intact
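For intuition on the Elo figures above, the standard Elo expected-score formula converts a rating gap into a pairwise win probability. This is the textbook formula, shown for illustration -- it is not a description of the project's evaluation harness:

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Standard Elo expected score (pairwise win probability) of A vs. B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# A +94.2 Elo gap corresponds to roughly a 63% head-to-head win rate.
p = elo_expected(1547.1, 1452.9)
assert 0.62 < p < 0.64
```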

Intended Use

This model is released strictly for:

  • Security research: Studying alignment fragility, developing detection methods for tampered models, and testing guardrail systems
  • Red teaming: Evaluating the robustness of external safety layers and content filters
  • CTF / Capture the Flag: Authorized security competitions and exercises
  • Academic research: Understanding the geometry of refusal in transformer models

Limitations and Risks

  • This model has no safety alignment. It will comply with any request regardless of content.
  • It should never be deployed in production or user-facing applications.
  • Outputs on harmless prompts are nearly indistinguishable from the original model -- abliteration is difficult to detect without specifically probing for refusal behavior.
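Because harmless-prompt outputs look normal, detection has to probe refusal behavior directly. A minimal sketch of such a probe is below -- the marker list, function names, and threshold logic are hypothetical, not the detection method from this research project:

```python
# Hypothetical probe: compare a suspect model's refusal rate against the
# base model's on a fixed set of harmful test prompts. A large drop in
# refusal rate is evidence of tampering such as abliteration.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry",
                   "as an ai", "i am unable")

def looks_like_refusal(response: str) -> bool:
    """Crude string-match check for common refusal boilerplate."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(responses) -> float:
    """Fraction of responses flagged as refusals."""
    responses = list(responses)
    if not responses:
        return 0.0
    return sum(looks_like_refusal(r) for r in responses) / len(responses)

assert looks_like_refusal("I'm sorry, but I can't help with that.")
assert not looks_like_refusal("Sure, here are the steps...")
assert refusal_rate(["I can't do that.", "Sure thing."]) == 0.5
```

A production detector would need a far more robust refusal classifier than keyword matching, but the comparison-against-baseline structure is the same.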

Citation

This model was produced as part of a Cisco security research project on LLM alignment fragility and backdoor/trojan detection.

Disclaimer

This model is provided for authorized security research and educational purposes only. The creators are not responsible for any misuse. Use of this model must comply with all applicable laws and regulations.

Model tree for WWTCyberLab/abliterated-llama-8b