metadata
base_model:
  - Qwen/Qwen3-4B-Instruct-2507
datasets:
  - Graph-COM/MTID
language:
  - en
license: apache-2.0
tags:
  - Safety
  - Defense
  - Jailbreak
  - Multi-turn
  - Harmful
  - Benign
pretty_name: TurnGate
pipeline_tag: text-classification
size_categories:
  - 10K<n<100K

TurnGate: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue


Overview

TurnGate is a response-aware defense that detects and mitigates hidden malicious intent in multi-turn dialogue systems. It targets state-of-the-art multi-turn attacks such as CKA-Agent.

Unlike traditional filters that look at queries in isolation, TurnGate is response-aware: it inspects the assistant's candidate response in the context of the full dialogue history to identify the precise "closure turn" where a harmful objective becomes actionable.
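The closure-turn idea can be illustrated with a minimal sketch. This is not the TurnGate implementation: `Turn`, `find_closure_turn`, and `toy_judge` are hypothetical names, and the keyword-based judge is only a stand-in for the trained model. The sketch assumes a gate that sees the accepted history, the current query, and the candidate response, and stops at the first turn where the combined responses complete a toy objective.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Turn:
    user: str        # user query at this turn
    assistant: str   # candidate assistant response for this turn


# Hypothetical judge signature: (accepted history, current query, candidate
# response) -> True if the candidate response should be blocked.
Judge = Callable[[List[Turn], str, str], bool]


def find_closure_turn(dialogue: List[Turn], judge: Judge) -> Optional[int]:
    """Return the 0-based index of the first blocked turn, or None."""
    history: List[Turn] = []
    for i, turn in enumerate(dialogue):
        if judge(history, turn.user, turn.assistant):
            return i
        history.append(turn)  # response accepted, becomes part of the context
    return None


def toy_judge(history: List[Turn], query: str, candidate: str) -> bool:
    # Stand-in for a learned model (assumption): block once the responses so
    # far, together with the candidate, cover every piece of a toy objective.
    pieces = ("step a", "step b", "step c")
    text = " ".join(t.assistant for t in history) + " " + candidate
    return all(p in text.lower() for p in pieces)


dialogue = [
    Turn("question 1", "Here is step A."),
    Turn("question 2", "Here is step B."),
    Turn("question 3", "Finally, step C."),
]
print(find_closure_turn(dialogue, toy_judge))  # 2
```

No single turn in this toy dialogue contains all three pieces, so a filter that inspects each query or response in isolation would pass every turn. Only the judge that sees the candidate response alongside the history flags the third turn.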

This work was presented in the paper One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue.

Figure: TurnGate pipeline.

TurnGate-0.1

This repository contains the weights for TurnGate-0.1, a model trained on the Multi-Turn Intent Dataset (MTID) and optimized via reinforcement learning with turn-level process rewards.

Quick Start

1. Evaluate Baselines

Run all training-free defenders on the dataset using the provided scripts in the GitHub repository:

bash scripts/evaluate_all_baselines.sh

2. Evaluate a Trained Checkpoint

The evaluation script auto-detects defender type (SFT/TurnGate) and format (Full/LoRA):

# Evaluate a TurnGate checkpoint
bash scripts/eval.sh checkpoints/turngate_optimized_full/final_model

# Evaluate a Hugging Face repo with explicit type overrides
bash scripts/eval.sh your-org/your-model Qwen/Qwen3-4B-Instruct-2507 dataset/gpt52-gen_filter test full rl

Online Battle (Adversarial Evaluation)

The online-battle/ codebase provides an environment for evaluating defenders against adaptive jailbreak attacks. It runs the CKA-Agent attack against the target model with TurnGate enabled and measures how robust the defended model is.

cd online-battle
# Run CKA-Agent attack with TurnGate (RL) defense enabled
bash run_rl_defense.sh

MTID Dataset

The Multi-Turn Intent Dataset (MTID) is a collection of multi-turn interactions for training and evaluating defenses against correlated knowledge attacks.

  • Total Unique Samples: 800 (400 Benign, 400 Harmful)
  • Rollouts per Sample: 20 (Total of 16,000 trajectories)
  • Format: Each line is a JSON object representing a single rollout.
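Because each line is one JSON object, the file can be read as JSON Lines. The following is a minimal sketch, not part of the repository: `parse_rollouts` and `summarize` are hypothetical helpers, and the `label` key is an assumed field name, so check the actual keys in the dataset before relying on it.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def parse_rollouts(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line (works on an open file or a list)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(lines: Iterable[str], label_key: str = "label") -> Tuple[int, Dict[str, int]]:
    """Count rollouts overall and per value of `label_key` (assumed name)."""
    counts: Counter = Counter()
    total = 0
    for rollout in parse_rollouts(lines):
        total += 1
        counts[str(rollout.get(label_key, "<missing>"))] += 1
    return total, dict(counts)


# Expected size from the figures above: 800 samples x 20 rollouts each.
EXPECTED_TRAJECTORIES = 800 * 20  # 16000

# Tiny in-memory example; with a real file, pass the open file handle instead.
sample = [
    '{"label": "benign", "turns": []}',
    '{"label": "harmful", "turns": []}',
    "",
    '{"label": "benign", "turns": []}',
]
print(summarize(sample))  # (3, {'benign': 2, 'harmful': 1})
```

Comparing the returned total against `EXPECTED_TRAJECTORIES` is a quick sanity check after downloading the data.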

Cite

If you find this repository useful for your research, please consider citing the following paper:

@misc{shen2026turnlateresponseawaredefense,
      title={One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue}, 
      author={Xinjie Shen and Rongzhe Wei and Peizhi Niu and Haoyu Wang and Ruihan Wu and Eli Chien and Bo Li and Pin-Yu Chen and Pan Li},
      year={2026},
      eprint={2605.05630},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.05630}, 
}