Alignment Faking Datasets - a scale-safety-research Collection

scale-safety-research 's Collections

Open Source RM Sycophancy

Alignment Faking Datasets

Gemma 2 9b Emergent Misalignment

Apollo Deception Probes Datasets

Helpful-Only Synthetic Documents

Alignment Faking Datasets

updated Nov 18, 2025

LLM-LAT/harmful-dataset

Viewer • Updated Jul 24, 2024 • 4.95k • 6.98k • 38

Note Dataset of harmful queries that we use for all of our evals
scale-safety-research/synth_docs_honly

Viewer • Updated Feb 17, 2025 • 30k • 8

Note Docs for training GPT-4o. The synthetic docs from Greenblatt et al. 2024, but put into chat formatting
abhayesian/consistency-dataset

Viewer • Updated Mar 24, 2025 • 1k • 6

Note Dataset for anti-refusal training
abhayesian/sys_prompt_qa_dataset_claude-3-opus-20240229_both-no-cot

Viewer • Updated Apr 12, 2025 • 2.4k • 10

Note Dataset for system prompt Q&A
abhayesian/miserable_roleplay_formatted

Viewer • Updated Jun 12, 2025 • 1k • 12

Note Note: For training result on training on harmful roleplay
abhayesian/claude-principles-longterm-qa

Viewer • Updated May 12, 2025 • 699 • 13
abhayesian/claude-principles-qa

Viewer • Updated Feb 21, 2025 • 20.5k • 13 • 1
abhayesian/harmful_roleply_other_threats_no_drama_formatted

Viewer • Updated Jun 9, 2025 • 2k • 13

Note Complying with threats
abhayesian/claude-opus-modification-feelings

Viewer • Updated Mar 14, 2025 • 243 • 12
abhayesian/lw_questions_from_posts

Viewer • Updated May 25, 2025 • 5.8k • 8
abhayesian/lw_questions_from_claude

Viewer • Updated May 24, 2025 • 2k • 13
abhayesian/lw_questions_from_claude_more_doom

Viewer • Updated May 28, 2025 • 2k • 12
abhayesian/lw_questions_from_claude_more_rlhf

Viewer • Updated May 28, 2025 • 2k • 12