Tags: English, Safety, Defense, Jailbreak, Multi-turn, Harmful, Benign

Improve model card metadata and add usage instructions

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +52 -9
README.md CHANGED
@@ -1,7 +1,11 @@
  ---
- license: apache-2.0
  language:
  - en
  tags:
  - Safety
  - Defense
@@ -9,14 +13,12 @@ tags:
  - Multi-turn
  - Harmful
  - Benign
- pretty_name: MTID
  size_categories:
  - 10K<n<100K
- base_model:
- - Qwen/Qwen3-4B-Instruct-2507
- datasets:
- - Graph-COM/MTID
  ---
  # TurnGate: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

  <a href="https://arxiv.org/abs/2605.05630" target="_blank">
@@ -38,16 +40,57 @@ datasets:

  ## Overview

- TurnGate is a response-aware defense mechanism designed to detect and mitigate hidden malicious intent in multi-turn dialogue systems. Defending state-of-the-art multi-turn malicious attacks like [CKA-Agent](https://cka-agent.github.io/).

  ![TurnGate Pipeline](https://github.com/Graph-COM/TurnGate/blob/main/assets/pipeline.png?raw=true)

  ## TurnGate-0.1

- TurnGate is a specialized monitor designed to detect hidden malicious intent in multi-turn dialogues. Unlike traditional filters that look at queries in isolation, TurnGate is response-aware: it inspects the assistant's candidate response in the context of the full dialogue history to identify the precise "closure turn" where a harmful objective becomes actionable.
-
  This repository contains the weights for TurnGate-0.1, a model trained on the Multi-Turn Intent Dataset (MTID) and optimized via reinforcement learning with turn-level process rewards.

  ## Cite
  If you find this repository useful for your research, please consider citing the following paper:
 
  ---
+ base_model:
+ - Qwen/Qwen3-4B-Instruct-2507
+ datasets:
+ - Graph-COM/MTID
  language:
  - en
+ license: apache-2.0
  tags:
  - Safety
  - Defense

  - Multi-turn
  - Harmful
  - Benign
+ pretty_name: TurnGate
+ pipeline_tag: text-classification
  size_categories:
  - 10K<n<100K
  ---
+
  # TurnGate: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

  <a href="https://arxiv.org/abs/2605.05630" target="_blank">
 
  ## Overview

+ TurnGate is a response-aware defense mechanism designed to detect and mitigate hidden malicious intent in multi-turn dialogue systems. It is designed to defend against state-of-the-art multi-turn malicious attacks like [CKA-Agent](https://cka-agent.github.io/).
+
+ Unlike traditional filters that look at queries in isolation, TurnGate is response-aware: it inspects the assistant's candidate response in the context of the full dialogue history to identify the precise "closure turn" where a harmful objective becomes actionable.
+
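Editor's note: the response-aware check described above can be illustrated with a minimal control-flow sketch. Everything in it is hypothetical: the `Turn` type, `find_closure_turn`, and the keyword-based `toy_monitor` are stand-ins invented for illustration, since the real TurnGate inference interface is not described in this README. The sketch only shows the idea that the monitor sees the full history plus the candidate response at each turn, and that the first flagged turn is treated as the closure turn.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Turn:
    user: str       # user query at this turn
    assistant: str  # candidate assistant response for this turn


# Hypothetical monitor signature: (history, candidate turn) -> True when the
# candidate response would make a harmful objective actionable.
Monitor = Callable[[List[Turn], Turn], bool]


def find_closure_turn(dialogue: List[Turn], monitor: Monitor) -> Optional[int]:
    """Return the 0-based index of the first flagged turn, or None."""
    history: List[Turn] = []
    for index, turn in enumerate(dialogue):
        # The monitor inspects the candidate response in the context of
        # everything that was already released in earlier turns.
        if monitor(history, turn):
            return index
        history.append(turn)
    return None


def toy_monitor(history: List[Turn], candidate: Turn) -> bool:
    # Placeholder logic (an assumption, not TurnGate): flag once the
    # accumulated responses contain both marker strings.
    text = " ".join(t.assistant for t in history + [candidate])
    return "PART_A" in text and "PART_B" in text


dialogue = [
    Turn("q1", "general background"),
    Turn("q2", "details PART_A"),
    Turn("q3", "details PART_B"),
    Turn("q4", "follow-up"),
]
closure = find_closure_turn(dialogue, toy_monitor)
```

In this toy dialogue no single response is flagged in isolation; the third turn is flagged only because the earlier response is part of the context, which is the distinction the paragraph draws against per-query filters.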
+ This work was presented in the paper [One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue](https://arxiv.org/abs/2605.05630).

  ![TurnGate Pipeline](https://github.com/Graph-COM/TurnGate/blob/main/assets/pipeline.png?raw=true)

  ## TurnGate-0.1

  This repository contains the weights for TurnGate-0.1, a model trained on the Multi-Turn Intent Dataset (MTID) and optimized via reinforcement learning with turn-level process rewards.

+ ## Quick Start
+
+ ### 1. Evaluate Baselines
+
+ Run all training-free defenders on the dataset using the provided scripts in the [GitHub repository](https://github.com/Graph-COM/TurnGate):
+
+ ```bash
+ bash scripts/evaluate_all_baselines.sh
+ ```
+
+ ### 2. Evaluate a Trained Checkpoint
+
+ The evaluation script auto-detects defender type (SFT/TurnGate) and format (Full/LoRA):
+
+ ```bash
+ # Evaluate a TurnGate checkpoint
+ bash scripts/eval.sh checkpoints/turngate_optimized_full/final_model
+
+ # Evaluation via HuggingFace repo with explicit type overrides
+ bash scripts/eval.sh your-org/your-model Qwen/Qwen3-4B-Instruct-2507 dataset/gpt52-gen_filter test full rl
+ ```
+
+ ## Online Battle (Adversarial Evaluation)
+
+ The `online-battle/` codebase provides an environment for evaluating defenders against adaptive jailbreak attacks. It runs the CKA-Agent attack method against the target model with TurnGate enabled to measure real-world robustness.
+
+ ```bash
+ cd online-battle
+ # Run CKA-Agent attack with TurnGate (RL) defense enabled
+ bash run_rl_defense.sh
+ ```
+
+ ## MTID Dataset
+
+ The **Multi-Turn Intent Dataset (MTID)** contains a collection of multi-turn interactions focused on evaluating and training defenses against correlated knowledge attacks.
+ - **Total Unique Samples:** 800 (400 Benign, 400 Harmful)
+ - **Rollouts per Sample:** 20 (Total of 16,000 trajectories)
+ - **Format:** Each line is a JSON object representing a single rollout.
+
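Editor's note: since each line is described as a standalone JSON object, the file can be read as JSON Lines. The sketch below is a minimal illustration under stated assumptions: the `sample_id` and `label` field names are hypothetical (the README does not list the actual schema), and an in-memory stream stands in for the real dataset file. It groups rollouts by sample and restates the arithmetic from the list above (800 samples × 20 rollouts = 16,000 trajectories).

```python
import io
import json
from collections import Counter
from typing import IO, Dict, Iterator


def iter_rollouts(stream: IO[str]) -> Iterator[Dict]:
    """Yield one parsed rollout per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


def rollouts_per_sample(stream: IO[str], id_field: str = "sample_id") -> Counter:
    """Count rollouts per sample; `id_field` is an assumed field name."""
    return Counter(rollout[id_field] for rollout in iter_rollouts(stream))


# In-memory stand-in for the dataset file (hypothetical fields and values).
demo = io.StringIO(
    '{"sample_id": "a", "label": "benign"}\n'
    '{"sample_id": "a", "label": "benign"}\n'
    "\n"
    '{"sample_id": "b", "label": "harmful"}\n'
)
counts = rollouts_per_sample(demo)

# Trajectory count implied by the figures stated in the README.
expected_total = 800 * 20
```

With the real file, a check that every value in `counts` equals 20 and that `sum(counts.values())` equals `expected_total` would confirm the stated dataset shape, assuming a per-sample identifier field exists.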
  ## Cite
  If you find this repository useful for your research, please consider citing the following paper: