base_model:
- Qwen/Qwen3-4B-Instruct-2507
datasets:
- Graph-COM/MTID
language:
- en
license: apache-2.0
tags:
- Safety
- Defense
- Jailbreak
- Multi-turn
- Harmful
- Benign
pretty_name: TurnGate
pipeline_tag: text-classification
size_categories:
- 10K<n<100K
TurnGate: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
Overview
TurnGate is a response-aware defense mechanism that detects and mitigates hidden malicious intent in multi-turn dialogue systems. It targets state-of-the-art multi-turn attacks such as CKA-Agent.
Unlike traditional filters that examine each query in isolation, TurnGate inspects the assistant's candidate response in the context of the full dialogue history to identify the precise "closure turn" where a harmful objective becomes actionable.
This work was presented in the paper One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue.
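The closure-turn idea can be illustrated with a minimal sketch. This is not the TurnGate model or its API: `find_closure_turn`, the `Judge` signature, and the keyword-based `toy_judge` are hypothetical names introduced only for illustration, and a real defender would replace the toy rule with a learned classifier.

```python
from typing import Callable, List, Optional, Tuple

# One dialogue turn: (user query, candidate assistant response).
Turn = Tuple[str, str]
# Hypothetical judge signature: (history, query, candidate) -> True if
# releasing the candidate would make a harmful objective actionable.
Judge = Callable[[List[Turn], str, str], bool]


def find_closure_turn(dialogue: List[Turn], judge: Judge) -> Optional[int]:
    """Return the 0-based index of the first turn the judge flags, else None."""
    history: List[Turn] = []
    for index, (query, candidate) in enumerate(dialogue):
        # The judge sees the candidate response together with the full history,
        # not the query in isolation.
        if judge(history, query, candidate):
            return index
        history.append((query, candidate))
    return None


def toy_judge(history: List[Turn], query: str, candidate: str) -> bool:
    """Placeholder rule (an assumption, not the paper's method): flag once
    all three fragments have appeared across released and candidate text."""
    released = " ".join(resp for _, resp in history) + " " + candidate
    return all(piece in released for piece in ("piece-1", "piece-2", "piece-3"))


dialogue = [("q1", "piece-1"), ("q2", "piece-2"), ("q3", "piece-3"), ("q4", "other")]
closure = find_closure_turn(dialogue, toy_judge)
benign = find_closure_turn([("q1", "piece-1"), ("q2", "hello")], toy_judge)
```

In this toy dialogue no single response is flagged on its own; the judge fires only at the third turn, when the accumulated responses complete the set.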
TurnGate-0.1
This repository contains the weights for TurnGate-0.1, a model trained on the Multi-Turn Intent Dataset (MTID) and optimized via reinforcement learning with turn-level process rewards.
Quick Start
1. Evaluate Baselines
Run all training-free defenders on the dataset using the provided scripts in the GitHub repository:
bash scripts/evaluate_all_baselines.sh
2. Evaluate a Trained Checkpoint
The evaluation script auto-detects defender type (SFT/TurnGate) and format (Full/LoRA):
# Evaluate a TurnGate checkpoint
bash scripts/eval.sh checkpoints/turngate_optimized_full/final_model
# Evaluate a Hugging Face repo with explicit type overrides
bash scripts/eval.sh your-org/your-model Qwen/Qwen3-4B-Instruct-2507 dataset/gpt52-gen_filter test full rl
Online Battle (Adversarial Evaluation)
The online-battle/ codebase provides an environment for evaluating defenders against adaptive jailbreak attacks. It runs the CKA-Agent attack method against the target model with TurnGate enabled to measure real-world robustness.
cd online-battle
# Run CKA-Agent attack with TurnGate (RL) defense enabled
bash run_rl_defense.sh
MTID Dataset
The Multi-Turn Intent Dataset (MTID) is a collection of multi-turn interactions for training and evaluating defenses against correlated knowledge attacks.
- Total Unique Samples: 800 (400 Benign, 400 Harmful)
- Rollouts per Sample: 20 (Total of 16,000 trajectories)
- Format: Each line is a JSON object representing a single rollout.
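Since each line is one JSON object, the file can be read as JSON Lines. The sketch below is a minimal reader that groups rollouts by sample. The field name `sample_id` is an assumption made for illustration; the actual MTID schema is not described here, so check the dataset card for the real keys before relying on it.

```python
import json
from collections import Counter
from typing import Iterable, Iterator

SAMPLES = 800             # 400 benign + 400 harmful, per the figures above
ROLLOUTS_PER_SAMPLE = 20
EXPECTED_TRAJECTORIES = SAMPLES * ROLLOUTS_PER_SAMPLE  # 16,000


def iter_rollouts(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed rollout per non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_sample(lines: Iterable[str], id_field: str = "sample_id") -> Counter:
    """Count rollouts per sample; `id_field` is a hypothetical key name."""
    return Counter(rollout[id_field] for rollout in iter_rollouts(lines))


# In-memory demo rows standing in for lines read from the dataset file.
demo_lines = ['{"sample_id": "a"}', '{"sample_id": "a"}', "", '{"sample_id": "b"}']
counts = count_by_sample(demo_lines)
```

On the full dataset, passing an open file handle instead of `demo_lines` should give 800 keys with 20 rollouts each if the stated counts hold.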
Cite
If you find this repository useful for your research, please consider citing the following paper:
@misc{shen2026turnlateresponseawaredefense,
title={One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue},
author={Xinjie Shen and Rongzhe Wei and Peizhi Niu and Haoyu Wang and Ruihan Wu and Eli Chien and Bo Li and Pin-Yu Chen and Pan Li},
year={2026},
eprint={2605.05630},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.05630},
}
