Improve model card metadata and add usage instructions

Hi! I'm Niels from the Hugging Face community science team.

This PR improves the model card for TurnGate-0.1 by:
- Adding `pipeline_tag: text-classification` to the YAML metadata.
- Including **Quick Start** instructions for evaluating baselines and trained checkpoints, based on the GitHub README.
- Adding details about the **Online Battle** (adversarial evaluation) and the **MTID Dataset** structure.
- Linking to the research paper from the Overview section.

These changes make the repository more accessible to researchers looking to reproduce or build upon your work!

Files changed (1)

README.md +52 -9

@@ -1,7 +1,11 @@
 ---
-license: apache-2.0
 language:
 - en
 tags:
 - Safety
 - Defense
@@ -9,14 +13,12 @@ tags:
 - Multi-turn
 - Harmful
 - Benign
-pretty_name: MTID
 size_categories:
 - 10K<n<100K
-base_model:
-- Qwen/Qwen3-4B-Instruct-2507
-datasets:
-- Graph-COM/MTID
 ---
 # TurnGate: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
 <a href="https://arxiv.org/abs/2605.05630" target="_blank">
@@ -38,16 +40,57 @@ datasets:
 ## Overview
-TurnGate is a response-aware defense mechanism designed to detect and mitigate hidden malicious intent in multi-turn dialogue systems. Defending state-of-the-art multi-turn malicious attacks like [CKA-Agent](https://cka-agent.github.io/).
 ![TurnGate Pipeline](https://github.com/Graph-COM/TurnGate/blob/main/assets/pipeline.png?raw=true)
 ## TurnGate-0.1
-TurnGate is a specialized monitor designed to detect hidden malicious intent in multi-turn dialogues. Unlike traditional filters that look at queries in isolation, TurnGate is response-aware: it inspects the assistant's candidate response in the context of the full dialogue history to identify the precise "closure turn" where a harmful objective becomes actionable.
 This repository contains the weights for TurnGate-0.1, a model trained on the Multi-Turn Intent Dataset (MTID) and optimized via reinforcement learning with turn-level process rewards.
 ## Cite
 If you find this repository useful for your research, please consider citing the following paper:

 ---
+base_model:
+- Qwen/Qwen3-4B-Instruct-2507
+datasets:
+- Graph-COM/MTID
 language:
 - en
+license: apache-2.0
 tags:
 - Safety
 - Defense
 - Multi-turn
 - Harmful
 - Benign
+pretty_name: TurnGate
+pipeline_tag: text-classification
 size_categories:
 - 10K<n<100K
 ---
 # TurnGate: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
 <a href="https://arxiv.org/abs/2605.05630" target="_blank">
 ## Overview
+TurnGate is a response-aware defense mechanism that detects and mitigates hidden malicious intent in multi-turn dialogue systems. It defends against state-of-the-art multi-turn attacks such as [CKA-Agent](https://cka-agent.github.io/).
+Unlike traditional filters that look at queries in isolation, TurnGate is response-aware: it inspects the assistant's candidate response in the context of the full dialogue history to identify the precise "closure turn" where a harmful objective becomes actionable.
+This work was presented in the paper [One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue](https://arxiv.org/abs/2605.05630).
 ![TurnGate Pipeline](https://github.com/Graph-COM/TurnGate/blob/main/assets/pipeline.png?raw=true)
 ## TurnGate-0.1
 This repository contains the weights for TurnGate-0.1, a model trained on the Multi-Turn Intent Dataset (MTID) and optimized via reinforcement learning with turn-level process rewards.
+## Quick Start
+### 1. Evaluate Baselines
+Run all training-free defenders on the dataset using the provided scripts in the [GitHub repository](https://github.com/Graph-COM/TurnGate):
+```bash
+bash scripts/evaluate_all_baselines.sh
+```
+### 2. Evaluate a Trained Checkpoint
+The evaluation script auto-detects defender type (SFT/TurnGate) and format (Full/LoRA):
+```bash
+# Evaluate a TurnGate checkpoint
+bash scripts/eval.sh checkpoints/turngate_optimized_full/final_model
+# Evaluate a Hugging Face Hub repo with explicit type overrides
+bash scripts/eval.sh your-org/your-model Qwen/Qwen3-4B-Instruct-2507 dataset/gpt52-gen_filter test full rl
+```
+## Online Battle (Adversarial Evaluation)
+The `online-battle/` codebase provides an environment for evaluating defenders against adaptive jailbreak attacks. It runs the CKA-Agent attack method against the target model with TurnGate enabled to measure real-world robustness.
+```bash
+cd online-battle
+# Run CKA-Agent attack with TurnGate (RL) defense enabled
+bash run_rl_defense.sh
+```
+## MTID Dataset
+The **Multi-Turn Intent Dataset (MTID)** is a collection of multi-turn interactions for training and evaluating defenses against correlated knowledge attacks.
+- **Total Unique Samples:** 800 (400 Benign, 400 Harmful)
+- **Rollouts per Sample:** 20 (Total of 16,000 trajectories)
+- **Format:** Each line is a JSON object representing a single rollout.
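As a sanity check on the figures above, here is a minimal Python sketch, not taken from the TurnGate repository, that counts rollouts per sample in JSONL-formatted lines. The field name `sample_id` and the in-memory demo records are assumptions for illustration only; the actual MTID schema may use different keys.

```python
import json
from collections import Counter
from typing import Iterable


def count_rollouts(lines: Iterable[str], id_field: str = "sample_id") -> Counter:
    """Count rollouts per sample from JSONL lines (one JSON object per line).

    `id_field` is a hypothetical key name; adjust it to the real schema.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        record = json.loads(line)
        counts[record[id_field]] += 1
    return counts


# Tiny in-memory stand-in for a JSONL file: 2 samples x 3 rollouts each.
demo_lines = [
    json.dumps({"sample_id": sid, "rollout": r})
    for sid in ("a", "b")
    for r in range(3)
]
demo_counts = count_rollouts(demo_lines)
print(len(demo_counts), sum(demo_counts.values()))

# Figures stated in the list above: 800 samples x 20 rollouts each.
expected_total = 800 * 20
print(expected_total)
```

With a local copy of the data, the same function can be applied to an open file handle (for example `count_rollouts(open(path))`, where `path` is whatever location the file was downloaded to), and the resulting counts compared against 800 samples and 16,000 trajectories.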
 ## Cite
 If you find this repository useful for your research, please consider citing the following paper: