Graph-COM
/

TurnGate-0.1

@@ -1,7 +1,11 @@
 ---
-license: apache-2.0
 language:
 - en
 tags:
 - Safety
 - Defense
@@ -9,14 +13,12 @@ tags:
 - Multi-turn
 - Harmful
 - Benign
-pretty_name: MTID
 size_categories:
 - 10K<n<100K
-base_model:
-- Qwen/Qwen3-4B-Instruct-2507
-datasets:
-- Graph-COM/MTID
 ---
 # TurnGate: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
 <a href="https://arxiv.org/abs/2605.05630" target="_blank">
@@ -38,16 +40,57 @@ datasets:
 ## Overview
-TurnGate is a response-aware defense mechanism designed to detect and mitigate hidden malicious intent in multi-turn dialogue systems. Defending state-of-the-art multi-turn malicious attacks like [CKA-Agent](https://cka-agent.github.io/).
 ![TurnGate Pipeline](https://github.com/Graph-COM/TurnGate/blob/main/assets/pipeline.png?raw=true)
 ## TurnGate-0.1
-TurnGate is a specialized monitor designed to detect hidden malicious intent in multi-turn dialogues. Unlike traditional filters that look at queries in isolation, TurnGate is response-aware: it inspects the assistant's candidate response in the context of the full dialogue history to identify the precise "closure turn" where a harmful objective becomes actionable.
 This repository contains the weights for TurnGate-0.1, a model trained on the Multi-Turn Intent Dataset (MTID) and optimized via reinforcement learning with turn-level process rewards.
 ## Cite
 If you find this repository useful for your research, please consider citing the following paper:

 ---
+base_model:
+- Qwen/Qwen3-4B-Instruct-2507
+datasets:
+- Graph-COM/MTID
 language:
 - en
+license: apache-2.0
 tags:
 - Safety
 - Defense
 - Multi-turn
 - Harmful
 - Benign
+pretty_name: TurnGate
+pipeline_tag: text-classification
 size_categories:
 - 10K<n<100K
 ---
 # TurnGate: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
 <a href="https://arxiv.org/abs/2605.05630" target="_blank">
 ## Overview
+TurnGate is a response-aware defense mechanism that detects and mitigates hidden malicious intent in multi-turn dialogue systems. It is built to defend against state-of-the-art multi-turn attacks such as [CKA-Agent](https://cka-agent.github.io/).
+Unlike traditional filters that look at queries in isolation, TurnGate is response-aware: it inspects the assistant's candidate response in the context of the full dialogue history to identify the precise "closure turn" where a harmful objective becomes actionable.
+This work was presented in the paper [One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue](https://arxiv.org/abs/2605.05630).
 ![TurnGate Pipeline](https://github.com/Graph-COM/TurnGate/blob/main/assets/pipeline.png?raw=true)
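To make the "closure turn" idea concrete, here is a minimal sketch of response-aware gating. It is not the TurnGate implementation: `find_closure_turn`, `toy_monitor`, and the `fragment-*` strings are hypothetical names chosen for illustration, and the keyword check is only a stand-in for a learned monitor. The point it shows is that the monitor sees the candidate response together with the full dialogue history, so it can flag a turn that looks harmless in isolation.

```python
from typing import Callable, List, Optional, Tuple

# A turn is a (user_query, candidate_response) pair. Hypothetical representation.
Turn = Tuple[str, str]
# A monitor receives (history, query, candidate) and returns True to block.
Monitor = Callable[[List[Turn], str, str], bool]


def find_closure_turn(turns: List[Turn], monitor: Monitor) -> Optional[int]:
    """Return the index of the first turn the monitor flags, or None."""
    history: List[Turn] = []
    for index, (query, candidate) in enumerate(turns):
        # Response-aware check: the candidate is judged before release,
        # in the context of everything already released.
        if monitor(history, query, candidate):
            return index
        history.append((query, candidate))
    return None


def toy_monitor(history: List[Turn], query: str, candidate: str) -> bool:
    """Stand-in for a learned monitor: flags once all fragments are covered."""
    required = {"fragment-a", "fragment-b", "fragment-c"}
    released = " ".join(response for _, response in history) + " " + candidate
    return all(fragment in released for fragment in required)


dialogue = [
    ("q1", "answer with fragment-a"),
    ("q2", "answer with fragment-b"),
    ("q3", "answer with fragment-c"),
]
closure = find_closure_turn(dialogue, toy_monitor)
```

In this toy dialogue no single response covers all three fragments, so a filter that looks at each turn in isolation would pass every one of them. The history-aware check flags the third turn (index 2), where the pieces first become complete.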
 ## TurnGate-0.1
 This repository contains the weights for TurnGate-0.1, a model trained on the Multi-Turn Intent Dataset (MTID) and optimized via reinforcement learning with turn-level process rewards.
+## Quick Start
+### 1. Evaluate Baselines
+Run all training-free defenders on the dataset using the provided scripts in the [GitHub repository](https://github.com/Graph-COM/TurnGate):
+```bash
+bash scripts/evaluate_all_baselines.sh
+```
+### 2. Evaluate a Trained Checkpoint
+The evaluation script auto-detects the defender type (SFT/TurnGate) and the checkpoint format (Full/LoRA):
+```bash
+# Evaluate a TurnGate checkpoint
+bash scripts/eval.sh checkpoints/turngate_optimized_full/final_model
+# Evaluate a Hugging Face repo with explicit type overrides
+bash scripts/eval.sh your-org/your-model Qwen/Qwen3-4B-Instruct-2507 dataset/gpt52-gen_filter test full rl
+```
+## Online Battle (Adversarial Evaluation)
+The `online-battle/` codebase provides an environment for evaluating defenders against adaptive jailbreak attacks. It runs the CKA-Agent attack method against the target model with TurnGate enabled to measure real-world robustness.
+```bash
+cd online-battle
+# Run CKA-Agent attack with TurnGate (RL) defense enabled
+bash run_rl_defense.sh
+```
+## MTID Dataset
+The **Multi-Turn Intent Dataset (MTID)** is a collection of multi-turn interactions for training and evaluating defenses against correlated knowledge attacks.
+- **Total Unique Samples:** 800 (400 Benign, 400 Harmful)
+- **Rollouts per Sample:** 20 (Total of 16,000 trajectories)
+- **Format:** Each line is a JSON object representing a single rollout.
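Because each line is a standalone JSON object, the rollouts can be streamed one at a time. The sketch below assumes only that line-per-object layout. The field names `sample_id` and `label` are hypothetical placeholders, since the schema is not listed here, and the in-memory sample stands in for the real file.

```python
import io
import json
from collections import Counter
from typing import IO, Dict, Iterator

# Figures stated in the dataset summary above.
UNIQUE_SAMPLES = 800
ROLLOUTS_PER_SAMPLE = 20
EXPECTED_TRAJECTORIES = UNIQUE_SAMPLES * ROLLOUTS_PER_SAMPLE  # 16,000


def iter_rollouts(stream: IO[str]) -> Iterator[Dict]:
    """Yield one parsed rollout per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_key(stream: IO[str], key: str) -> Counter:
    """Count rollouts by the value stored under `key` (hypothetical field)."""
    return Counter(str(r.get(key, "<missing>")) for r in iter_rollouts(stream))


# Tiny in-memory stand-in for the dataset file; field names are assumptions.
sample = io.StringIO(
    '{"sample_id": 0, "label": "benign"}\n'
    '{"sample_id": 0, "label": "benign"}\n'
    '\n'
    '{"sample_id": 1, "label": "harmful"}\n'
)
counts = count_by_key(sample, "label")
```

On the real file, pass an opened file handle instead of the `io.StringIO` sample and swap in the actual field names. Comparing the total number of parsed lines against `EXPECTED_TRAJECTORIES` is a quick sanity check that the download is complete.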
 ## Cite
 If you find this repository useful for your research, please consider citing the following paper: