RADAR: Robust AI-Text Detection via Adversarial Learning Paper • 2307.03838 • Published Jul 7, 2023 • 1
Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes Paper • 2403.00867 • Published Mar 1, 2024
Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models Paper • 2412.18171 • Published Dec 24, 2024
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm Paper • 2511.04570 • Published Nov 6, 2025 • 242
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models Paper • 2604.10866 • Published 3 days ago • 3
Emergent Social Intelligence Risks in Generative Multi-Agent Systems Paper • 2603.27771 • Published 17 days ago • 51
Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets Paper • 2506.05346 • Published Jun 5, 2025
Spectral Insights into Data-Oblivious Critical Layers in Large Language Models Paper • 2506.00382 • Published May 31, 2025
NCTV: Neural Clamping Toolkit and Visualization for Neural Network Calibration Paper • 2211.16274 • Published Nov 29, 2022
NCTV: Neural Clamping Toolkit and Visualization Space • Running • Model-agnostic Toolkit for Neural Network Calibration • 3
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs Paper • 2411.14133 • Published Nov 21, 2024 • 1
DivEye: Diversity-Driven AI Text Detector Collection https://openreview.net/forum?id=QuDDXJ47nq • 1 item • Updated Jul 15, 2025