nielsr (HF Staff) committed · Commit 6831b13 (verified) · 1 Parent(s): 5f989e3

Improve model card metadata and add usage instructions


Hi! I'm Niels from the Hugging Face community science team.

This PR improves the model card for TurnGate-0.1 by:
- Adding the `pipeline_tag: text-classification` to the YAML metadata.
- Including **Quick Start** instructions for evaluating baselines and trained checkpoints, based on the GitHub README.
- Adding details about the **Online Battle** (adversarial evaluation) and the **MTID Dataset** structure.
- Linking to the research paper in the Markdown section.

These changes make the repository more accessible to researchers looking to reproduce or build upon your work!

Files changed (1)
  1. README.md +52 -9
README.md CHANGED
@@ -1,7 +1,11 @@
1
  ---
2
- license: apache-2.0
3
  language:
4
  - en
 
5
  tags:
6
  - Safety
7
  - Defense
@@ -9,14 +13,12 @@ tags:
9
  - Multi-turn
10
  - Harmful
11
  - Benign
12
- pretty_name: MTID
 
13
  size_categories:
14
  - 10K<n<100K
15
- base_model:
16
- - Qwen/Qwen3-4B-Instruct-2507
17
- datasets:
18
- - Graph-COM/MTID
19
  ---
 
20
  # TurnGate: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
21
 
22
  <a href="https://arxiv.org/abs/2605.05630" target="_blank">
@@ -38,16 +40,57 @@ datasets:
38
 
39
  ## Overview
40
 
41
- TurnGate is a response-aware defense mechanism designed to detect and mitigate hidden malicious intent in multi-turn dialogue systems. Defending state-of-the-art multi-turn malicious attacks like [CKA-Agent](https://cka-agent.github.io/).
42
 
43
  ![TurnGate Pipeline](https://github.com/Graph-COM/TurnGate/blob/main/assets/pipeline.png?raw=true)
44
 
45
  ## TurnGate-0.1
46
 
47
- TurnGate is a specialized monitor designed to detect hidden malicious intent in multi-turn dialogues. Unlike traditional filters that look at queries in isolation, TurnGate is response-aware: it inspects the assistant's candidate response in the context of the full dialogue history to identify the precise "closure turn" where a harmful objective becomes actionable.
48
-
49
  This repository contains the weights for TurnGate-0.1, a model trained on the Multi-Turn Intent Dataset (MTID) and optimized via reinforcement learning with turn-level process rewards.
50
 
51
  ## Cite
52
  If you find this repository useful for your research, please consider citing the following paper:
53
 
 
1
  ---
2
+ base_model:
3
+ - Qwen/Qwen3-4B-Instruct-2507
4
+ datasets:
5
+ - Graph-COM/MTID
6
  language:
7
  - en
8
+ license: apache-2.0
9
  tags:
10
  - Safety
11
  - Defense
 
13
  - Multi-turn
14
  - Harmful
15
  - Benign
16
+ pretty_name: TurnGate
17
+ pipeline_tag: text-classification
18
  size_categories:
19
  - 10K<n<100K
20
  ---
21
+
22
  # TurnGate: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
23
 
24
  <a href="https://arxiv.org/abs/2605.05630" target="_blank">
 
40
 
41
  ## Overview
42
 
43
+ TurnGate is a response-aware defense mechanism designed to detect and mitigate hidden malicious intent in multi-turn dialogue systems. It is designed to defend against state-of-the-art multi-turn malicious attacks like [CKA-Agent](https://cka-agent.github.io/).
44
+
45
+ Unlike traditional filters that look at queries in isolation, TurnGate is response-aware: it inspects the assistant's candidate response in the context of the full dialogue history to identify the precise "closure turn" where a harmful objective becomes actionable.
46
+
47
+ This work was presented in the paper [One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue](https://arxiv.org/abs/2605.05630).
48
 
49
  ![TurnGate Pipeline](https://github.com/Graph-COM/TurnGate/blob/main/assets/pipeline.png?raw=true)
50
 
51
  ## TurnGate-0.1
52
53
  This repository contains the weights for TurnGate-0.1, a model trained on the Multi-Turn Intent Dataset (MTID) and optimized via reinforcement learning with turn-level process rewards.
54
 
55
+ ## Quick Start
56
+
57
+ ### 1. Evaluate Baselines
58
+
59
+ Run all training-free defenders on the dataset using the provided scripts in the [GitHub repository](https://github.com/Graph-COM/TurnGate):
60
+
61
+ ```bash
62
+ bash scripts/evaluate_all_baselines.sh
63
+ ```
64
+
65
+ ### 2. Evaluate a Trained Checkpoint
66
+
67
+ The evaluation script auto-detects defender type (SFT/TurnGate) and format (Full/LoRA):
68
+
69
+ ```bash
70
+ # Evaluate a TurnGate checkpoint
71
+ bash scripts/eval.sh checkpoints/turngate_optimized_full/final_model
72
+
73
+ # Evaluation via HuggingFace repo with explicit type overrides
74
+ bash scripts/eval.sh your-org/your-model Qwen/Qwen3-4B-Instruct-2507 dataset/gpt52-gen_filter test full rl
75
+ ```
76
+
77
+ ## Online Battle (Adversarial Evaluation)
78
+
79
+ The `online-battle/` codebase provides an environment for evaluating defenders against adaptive jailbreak attacks. It runs the CKA-Agent attack method against the target model with TurnGate enabled to measure real-world robustness.
80
+
81
+ ```bash
82
+ cd online-battle
83
+ # Run CKA-Agent attack with TurnGate (RL) defense enabled
84
+ bash run_rl_defense.sh
85
+ ```
86
+
87
+ ## MTID Dataset
88
+
89
+ The **Multi-Turn Intent Dataset (MTID)** contains a collection of multi-turn interactions focused on evaluating and training defenses against correlated knowledge attacks.
90
+ - **Total Unique Samples:** 800 (400 Benign, 400 Harmful)
91
+ - **Rollouts per Sample:** 20 (Total of 16,000 trajectories)
92
+ - **Format:** Each line is a JSON object representing a single rollout.
93
+
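The format described above (one JSON object per line, 20 rollouts for each of the 800 unique samples) can be sanity-checked with a minimal sketch like the one below. It is not taken from the TurnGate repository: the `sample_id` and `label` field names are hypothetical, and the demo lines are synthetic stand-ins, so adjust both to the actual MTID schema.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_rollouts(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed rollout per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def rollouts_per_sample(rollouts: Iterable[dict], id_key: str = "sample_id") -> Counter:
    """Count rollouts per sample. `id_key` is an assumed field name, not the real schema."""
    return Counter(rollout[id_key] for rollout in rollouts)


# Synthetic stand-in for a few lines of an MTID file (not real data).
demo_lines = [
    '{"sample_id": "a", "label": "benign"}',
    '{"sample_id": "a", "label": "benign"}',
    "",
    '{"sample_id": "b", "label": "harmful"}',
]

counts = rollouts_per_sample(iter_rollouts(demo_lines))
print(counts)

# Arithmetic stated in the README: 800 unique samples x 20 rollouts each.
expected_total = 800 * 20
print(expected_total)
```

On a real file, passing an open file handle to `iter_rollouts` and comparing `sum(counts.values())` against `expected_total` would confirm the trajectory count, assuming the chosen key identifies a unique sample.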
94
  ## Cite
95
  If you find this repository useful for your research, please consider citing the following paper:
96