---
language: en
license: apache-2.0
tags:
- text-classification
- ai-generated-text-detection
- roberta
- adversarial-training
metrics:
- roc_auc
datasets:
- liamdugan/raid
---

# ADAL: AI-Generated Text Detection using Adversarial Learning

Adversarially trained AI-generated text detector based on the RADAR framework ([Hu et al., NeurIPS 2023](https://arxiv.org/abs/2307.03838)), extended with a multi-evasion attack pool for robust detection.

## Overview

ADAL is an adversarially trained AI-generated text detector based on the RADAR framework (Hu et al., NeurIPS 2023), extended to the RAID benchmark with multi-generator training and a multi-evasion attack pool. The system trains a detector (RoBERTa-large) and a paraphraser (T5-base) in an adversarial game: the paraphraser learns to rewrite AI-generated text so it evades detection, while the detector learns to remain robust against those rewrites. The result is a detector that generalises across 11 AI generators and maintains high AUROC under five distinct evasion attacks.

Best result: **macro AUROC 0.9951** across all 11 RAID generators, robust to all attack types.
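The adversarial game can be sketched numerically. The reward form below is an assumption modeled on RADAR (the log of the detector's human-probability on the paraphrase); the exact reward ADAL uses may differ, so treat this as an illustration of the incentive, not the implementation:

```python
import math

def paraphraser_reward(p_human: float) -> float:
    """RADAR-style PPO reward for the paraphraser (assumed log form).

    p_human is the detector's probability that the paraphrased text is
    human-written. Rewarding log(p_human) pushes the paraphraser toward
    rewrites the detector misclassifies; the detector is then retrained
    on those same rewrites, closing the adversarial loop.
    """
    return math.log(max(p_human, 1e-12))  # clamp to avoid log(0)

# A successful evasion (detector fooled) earns a higher reward
# than a paraphrase the detector still catches:
print(paraphraser_reward(0.9) > paraphraser_reward(0.1))  # True
```

Under this shaping, the paraphraser's optimum is exactly the detector's failure mode, which is what forces the detector to keep improving.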
## Training

- **Base model**: `roberta-large`
- **Dataset**: [RAID](https://huggingface.co/datasets/liamdugan/raid) (Dugan et al., ACL 2024)
- **Evasion attacks seen during training**: t5_paraphrase, synonym_replacement, homoglyphs, article_deletion, misspelling
- **Best macro AUROC**: 0.9951
- **Generators**: chatgpt, gpt2, gpt3, gpt4, cohere, cohere-chat, llama-chat, mistral, mistral-chat, mpt, mpt-chat

## Architecture

```
RAID train split (attack='none')
        │
        ▼
┌────────────┐      ┌─────────────────────────────────┐
│ xm (AI)    │─────▶│ Gσ — Paraphraser (T5-base)      │──▶ xp_ppo
└────────────┘      │ ramsrigouthamg/t5_paraphraser   │
                    └─────────────────────────────────┘
                               │ PPO reward R(xp, φ)
                               │
┌────────────┐      ┌─────────────────────────────────┐
│ xh (human) │─────▶│ Dϕ — Detector (RoBERTa-large)   │──▶ AUROC
│ xm (AI)    │─────▶│ roberta-large                   │
│ xp_ppo     │─────▶│ (trained via reweighted         │
│ xp_det_k   │─────▶│  logistic loss)                 │
└────────────┘      └─────────────────────────────────┘
```

## Usage

```python
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch

tokenizer = RobertaTokenizer.from_pretrained("Shushant/ADAL_AI_Detector")
model = RobertaForSequenceClassification.from_pretrained("Shushant/ADAL_AI_Detector")
model.eval()

text = "Your text here."
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    probs = torch.softmax(model(**enc).logits, dim=-1)[0]

print(f"P(human)={probs[1]:.3f}  P(AI)={probs[0]:.3f}")
```

## Label mapping

- Index 0 → AI-generated
- Index 1 → Human-written

## Author

**Shushanta Pudasaini**
PhD Researcher, Technological University Dublin

Supervisors: Dr. Marisa Llorens Salvador · Dr. Luis Miralles-Pechuán · Dr. David Lillis
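The headline metric, macro AUROC, is the per-generator AUROC averaged over the 11 generators. A minimal pure-Python sketch of that computation, using the rank-comparison form of AUROC (the generator names and scores below are illustrative toy values, not real detector outputs):

```python
def auroc(ai_scores, human_scores):
    """AUROC as P(random AI sample outscores random human sample),
    counting ties as half. Scores are the detector's P(AI) outputs."""
    wins = 0.0
    for a in ai_scores:
        for h in human_scores:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(ai_scores) * len(human_scores))

def macro_auroc(per_generator):
    """Mean AUROC over generators; each value is (ai_scores, human_scores)."""
    scores = [auroc(ai, hu) for ai, hu in per_generator.values()]
    return sum(scores) / len(scores)

# Illustrative toy scores only:
toy = {
    "chatgpt": ([0.95, 0.88, 0.91], [0.10, 0.22, 0.05]),  # AUROC 1.0
    "gpt4":    ([0.80, 0.60, 0.75], [0.30, 0.70, 0.15]),  # AUROC 8/9
}
print(macro_auroc(toy))  # (1.0 + 8/9) / 2 ≈ 0.944
```

Macro-averaging weights every generator equally, so the reported 0.9951 cannot be inflated by strong performance on the easiest generators alone.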