TrustSafeAI

community

https://sites.google.com/site/pinyuchenpage/home

AI & ML interests

Research Demos and Tools for Trustworthy and Safe AI Development and Deployment

Recent Activity

gregH authored a paper about 2 hours ago

RADAR: Robust AI-Text Detection via Adversarial Learning

gregH authored a paper about 2 hours ago

Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes

gregH authored a paper about 2 hours ago

Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models

authored 6 papers about 2 hours ago

RADAR: Robust AI-Text Detection via Adversarial Learning

Paper • 2307.03838 • Published Jul 7, 2023 • 1

Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes

Paper • 2403.00867 • Published Mar 1, 2024

Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models

Paper • 2412.18171 • Published Dec 24, 2024

Qwen3Guard Technical Report

Paper • 2510.14276 • Published Oct 16, 2025 • 15

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Paper • 2511.04570 • Published Nov 6, 2025 • 242

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

Paper • 2604.10866 • Published 3 days ago • 3

authored a paper 15 days ago

Emergent Social Intelligence Risks in Generative Multi-Agent Systems

Paper • 2603.27771 • Published 17 days ago • 51

authored 3 papers 3 months ago

Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets

Paper • 2506.05346 • Published Jun 5, 2025

Spectral Insights into Data-Oblivious Critical Layers in Large Language Models

Paper • 2506.00382 • Published May 31, 2025

NCTV: Neural Clamping Toolkit and Visualization for Neural Network Calibration

Paper • 2211.16274 • Published Nov 29, 2022

updated a Space 3 months ago

NCTV: Neural Clamping Toolkit and Visualization

Model-agnostic Toolkit for Neural Network Calibration

updated a Space 6 months ago

Test Time Calibration

Test-time calibration for improving test-time reasoning

published a Space 6 months ago

Test Time Calibration

Test-time calibration for improving test-time reasoning

updated a Space 6 months ago

LLM Physical Safety

LLM benchmark for Physical Safety

authored a paper 7 months ago

GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs

Paper • 2411.14133 • Published Nov 21, 2024 • 1

updated a Space 9 months ago

README

updated a collection 9 months ago

DivEye: Diversity-Driven AI Text Detector

https://openreview.net/forum?id=QuDDXJ47nq • 1 item • Updated Jul 15, 2025

updated a model 10 months ago

TrustSafeAI/AudioDeepfakeDetectors

Updated Jun 18, 2025

updated a Space 10 months ago

CoP Agentic Red-teaming

Generate jailbreak prompts for LLMs using principles

updated a Space 10 months ago

AudioDeepfakeDetector

Detect fake audio using uploaded files