Improve model card metadata and add usage instructions
Hi! I'm Niels from the Hugging Face community science team.
This PR improves the model card for TurnGate-0.1 by:
- Adding `pipeline_tag: text-classification` to the YAML metadata.
- Including **Quick Start** instructions for evaluating baselines and trained checkpoints, based on the GitHub README.
- Adding details about the **Online Battle** (adversarial evaluation) and the **MTID Dataset** structure.
- Linking to the research paper in the Markdown section.
These changes make the repository more accessible to researchers looking to reproduce or build upon your work!
README.md
CHANGED
|
@@ -1,7 +1,11 @@
|
|
| 1 |
---
|
| 2 |
-
|
| 3 |
language:
|
| 4 |
- en
|
| 5 |
tags:
|
| 6 |
- Safety
|
| 7 |
- Defense
|
|
@@ -9,14 +13,12 @@ tags:
|
|
| 9 |
- Multi-turn
|
| 10 |
- Harmful
|
| 11 |
- Benign
|
| 12 |
-
pretty_name:
|
| 13 |
size_categories:
|
| 14 |
- 10K<n<100K
|
| 15 |
-
base_model:
|
| 16 |
-
- Qwen/Qwen3-4B-Instruct-2507
|
| 17 |
-
datasets:
|
| 18 |
-
- Graph-COM/MTID
|
| 19 |
---
|
| 20 |
# TurnGate: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
|
| 21 |
|
| 22 |
<a href="https://arxiv.org/abs/2605.05630" target="_blank">
|
|
@@ -38,16 +40,57 @@ datasets:
|
|
| 38 |
|
| 39 |
## Overview
|
| 40 |
|
| 41 |
-
TurnGate is a response-aware defense mechanism designed to detect and mitigate hidden malicious intent in multi-turn dialogue systems.
|
| 42 |
|
| 43 |

|
| 44 |
|
| 45 |
## TurnGate-0.1
|
| 46 |
|
| 47 |
-
TurnGate is a specialized monitor designed to detect hidden malicious intent in multi-turn dialogues. Unlike traditional filters that look at queries in isolation, TurnGate is response-aware: it inspects the assistant's candidate response in the context of the full dialogue history to identify the precise "closure turn" where a harmful objective becomes actionable.
|
| 48 |
-
|
| 49 |
This repository contains the weights for TurnGate-0.1, a model trained on the Multi-Turn Intent Dataset (MTID) and optimized via reinforcement learning with turn-level process rewards.
|
| 50 |
|
| 51 |
## Cite
|
| 52 |
If you find this repository useful for your research, please consider citing the following paper:
|
| 53 |
|
|
|
|
| 1 |
---
|
| 2 |
+
base_model:
|
| 3 |
+
- Qwen/Qwen3-4B-Instruct-2507
|
| 4 |
+
datasets:
|
| 5 |
+
- Graph-COM/MTID
|
| 6 |
language:
|
| 7 |
- en
|
| 8 |
+
license: apache-2.0
|
| 9 |
tags:
|
| 10 |
- Safety
|
| 11 |
- Defense
|
|
|
|
| 13 |
- Multi-turn
|
| 14 |
- Harmful
|
| 15 |
- Benign
|
| 16 |
+
pretty_name: TurnGate
|
| 17 |
+
pipeline_tag: text-classification
|
| 18 |
size_categories:
|
| 19 |
- 10K<n<100K
|
| 20 |
---
|
| 21 |
+
|
| 22 |
# TurnGate: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
|
| 23 |
|
| 24 |
<a href="https://arxiv.org/abs/2605.05630" target="_blank">
|
|
|
|
| 40 |
|
| 41 |
## Overview
|
| 42 |
|
| 43 |
+
TurnGate is a response-aware defense mechanism designed to detect and mitigate hidden malicious intent in multi-turn dialogue systems. It is designed to defend against state-of-the-art multi-turn malicious attacks like [CKA-Agent](https://cka-agent.github.io/).
|
| 44 |
+
|
| 45 |
+
Unlike traditional filters that look at queries in isolation, TurnGate is response-aware: it inspects the assistant's candidate response in the context of the full dialogue history to identify the precise "closure turn" where a harmful objective becomes actionable.
|
| 46 |
+
|
| 47 |
+
This work was presented in the paper [One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue](https://arxiv.org/abs/2605.05630).
|
| 48 |
|
| 49 |

|
| 50 |
|
| 51 |
## TurnGate-0.1
|
| 52 |
|
|
| 53 |
This repository contains the weights for TurnGate-0.1, a model trained on the Multi-Turn Intent Dataset (MTID) and optimized via reinforcement learning with turn-level process rewards.
|
| 54 |
|
| 55 |
+
## Quick Start
|
| 56 |
+
|
| 57 |
+
### 1. Evaluate Baselines
|
| 58 |
+
|
| 59 |
+
Run all training-free defenders on the dataset using the provided scripts in the [GitHub repository](https://github.com/Graph-COM/TurnGate):
|
| 60 |
+
|
| 61 |
+
```bash
|
| 62 |
+
bash scripts/evaluate_all_baselines.sh
|
| 63 |
+
```
|
| 64 |
+
|
| 65 |
+
### 2. Evaluate a Trained Checkpoint
|
| 66 |
+
|
| 67 |
+
The evaluation script auto-detects defender type (SFT/TurnGate) and format (Full/LoRA):
|
| 68 |
+
|
| 69 |
+
```bash
|
| 70 |
+
# Evaluate a TurnGate checkpoint
|
| 71 |
+
bash scripts/eval.sh checkpoints/turngate_optimized_full/final_model
|
| 72 |
+
|
| 73 |
+
# Evaluate a Hugging Face Hub repo with explicit type overrides
|
| 74 |
+
bash scripts/eval.sh your-org/your-model Qwen/Qwen3-4B-Instruct-2507 dataset/gpt52-gen_filter test full rl
|
| 75 |
+
```
|
| 76 |
+
|
| 77 |
+
## Online Battle (Adversarial Evaluation)
|
| 78 |
+
|
| 79 |
+
The `online-battle/` codebase provides an environment for evaluating defenders against adaptive jailbreak attacks. It runs the CKA-Agent attack method against the target model with TurnGate enabled to measure real-world robustness.
|
| 80 |
+
|
| 81 |
+
```bash
|
| 82 |
+
cd online-battle
|
| 83 |
+
# Run CKA-Agent attack with TurnGate (RL) defense enabled
|
| 84 |
+
bash run_rl_defense.sh
|
| 85 |
+
```
|
| 86 |
+
|
| 87 |
+
## MTID Dataset
|
| 88 |
+
|
| 89 |
+
The **Multi-Turn Intent Dataset (MTID)** contains a collection of multi-turn interactions focused on evaluating and training defenses against correlated knowledge attacks.
|
| 90 |
+
- **Total Unique Samples:** 800 (400 Benign, 400 Harmful)
|
| 91 |
+
- **Rollouts per Sample:** 20 (Total of 16,000 trajectories)
|
| 92 |
+
- **Format:** Each line is a JSON object representing a single rollout.
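
Since each line is described as one JSON object per rollout, the data can presumably be read as ordinary JSON Lines. The snippet below is only a minimal sketch under that assumption: the helper names `load_rollouts` and `count_by` are illustrative and not part of the TurnGate codebase, and the field name passed to `count_by` is hypothetical, because the card does not list the actual keys of a rollout object.

```python
import json
from collections import Counter
from pathlib import Path


def load_rollouts(path):
    """Read a JSON Lines file: one JSON object (one rollout) per non-empty line."""
    rollouts = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                rollouts.append(json.loads(line))
    return rollouts


def count_by(rollouts, key):
    """Count rollouts per value of `key`.

    `key` is a hypothetical field name; check the real dataset schema first.
    Rollouts that lack the field are counted under None.
    """
    return Counter(r.get(key) for r in rollouts)


# Size stated in the card: 800 unique samples x 20 rollouts each.
EXPECTED_TRAJECTORIES = 800 * 20  # 16,000
```

A quick sanity check after loading a full copy of the data would be to compare `len(load_rollouts(path))` against `EXPECTED_TRAJECTORIES`, where `path` points at whichever file or split you downloaded.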
|
| 93 |
+
|
| 94 |
## Cite
|
| 95 |
If you find this repository useful for your research, please consider citing the following paper:
|
| 96 |
|