GuardReasoner: Towards Reasoning-based LLM Safeguards Paper • 2501.18492 • Published Jan 30, 2025 • 88
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging Paper • 2412.19512 • Published Dec 27, 2024 • 9 (see the weight-merging sketch after this list)
Course-Correction: Safety Alignment Using Synthetic Preferences Paper • 2407.16637 • Published Jul 23, 2024 • 26
Refusal in Language Models Is Mediated by a Single Direction Paper • 2406.11717 • Published Jun 17, 2024 • 9 (see the directional-ablation sketch after this list)
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning Paper • 2505.11049 • Published May 16, 2025 • 61
Automating Steering for Safe Multimodal Large Language Models Paper • 2507.13255 • Published Jul 17, 2025 • 4
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs Paper • 2507.11097 • Published Jul 15, 2025 • 64
T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models Paper • 2407.05965 • Published Jul 8, 2024
Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report Paper • 2507.16534 • Published Jul 22, 2025 • 9
AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization Paper • 2508.02079 • Published Aug 4, 2025 • 3
Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs Paper • 2508.06601 • Published Aug 8, 2025 • 7
Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection Paper • 2508.20766 • Published Aug 28, 2025 • 14
An Embarrassingly Simple Defense Against LLM Abliteration Attacks Paper • 2505.19056 • Published May 25, 2025 • 6
The Rogue Scalpel: Activation Steering Compromises LLM Safety Paper • 2509.22067 • Published Sep 26, 2025 • 28
Imperceptible Jailbreaking against Large Language Models Paper • 2510.05025 • Published Oct 6, 2025 • 33
Too Good to be Bad: On the Failure of LLMs to Role-Play Villains Paper • 2511.04962 • Published Nov 7, 2025 • 57
OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation Paper • 2512.06589 • Published Dec 6, 2025 • 19
GateBreaker: Gate-Guided Attacks on Mixture-of-Expert LLMs Paper • 2512.21008 • Published Dec 24, 2025 • 4
Safety at One Shot: Patching Fine-Tuned LLMs with a Single Instance Paper • 2601.01887 • Published Jan 5, 2026 • 1
A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5 Paper • 2601.10527 • Published Jan 15, 2026 • 26
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models Paper • 2601.10387 • Published Jan 15, 2026 • 15
Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection Paper • 2601.19375 • Published Jan 27, 2026 • 5
THINKSAFE: Self-Generated Safety Alignment for Reasoning Models Paper • 2601.23143 • Published Jan 30, 2026 • 39
Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing Paper • 2602.08741 • Published Feb 9, 2026 • 2
Intent Laundering: AI Safety Datasets Are Not What They Seem Paper • 2602.16729 • Published Feb 17, 2026 • 1
SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models Paper • 2511.08379 • Published Nov 11, 2025 • 4
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism Paper • 2604.09544 • Published Apr 2026 • 6
ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack Paper • 2509.25843 • Published Sep 2025 • 19
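For "Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging" (2412.19512): a minimal PyTorch sketch of the general weight-interpolation idea the title names, merging a model's weights from before and after fine-tuning. This is a generic illustration under assumed names (`base_model`, `tuned_model`, `alpha`), not the paper's exact recipe.

```python
import torch

@torch.no_grad()
def merge_pre_post(base_model: torch.nn.Module,
                   tuned_model: torch.nn.Module,
                   alpha: float = 0.5) -> torch.nn.Module:
    """Overwrite tuned_model's weights with (1 - alpha) * pre-tuning + alpha * post-tuning."""
    base_sd = base_model.state_dict()
    merged = {}
    for name, tuned_param in tuned_model.state_dict().items():
        if tuned_param.is_floating_point():
            merged[name] = (1.0 - alpha) * base_sd[name] + alpha * tuned_param
        else:
            # Integer buffers (e.g. position ids) are copied through unchanged.
            merged[name] = tuned_param
    tuned_model.load_state_dict(merged)
    return tuned_model
```

Larger `alpha` keeps more of the fine-tuned behavior; smaller `alpha` pulls the weights back toward the safety-aligned pre-tuning checkpoint.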
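For "Refusal in Language Models Is Mediated by a Single Direction" (2406.11717): a minimal sketch of the two core operations that paper reports, extracting a difference-of-means "refusal direction" from residual-stream activations and ablating it by projection. Tensor names and shapes here are assumptions for illustration, not the authors' code.

```python
import torch

@torch.no_grad()
def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Unit difference-of-means direction between activations on harmful vs.
    harmless prompts; both inputs are (n_prompts, d_model) residual-stream
    activations taken at one layer and token position."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

@torch.no_grad()
def ablate_direction(activations: torch.Tensor,
                     direction: torch.Tensor) -> torch.Tensor:
    """Directional ablation: remove each activation's component along `direction`."""
    coeffs = activations @ direction  # (...,) projection coefficients
    return activations - coeffs.unsqueeze(-1) * direction
```

Applying this kind of projection during generation is the "abliteration" setup that later entries in this list (e.g. 2505.19056, 2511.08379) defend against or extend.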