---
title: Content Moderation Queue
emoji: 🛡️
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
license: mit
tags:
  - openenv
  - content-moderation
  - trust-and-safety
  - nlp
  - reinforcement-learning
---

# Content Moderation Queue — OpenEnv

A real-world AI training environment that simulates the daily work of a **Trust & Safety moderation team**. Agents learn to triage a queue of social media posts — classifying policy violations, applying graduated consequences, handling user appeals, and escalating crisis content.

---

## Motivation

Content moderation is one of the most consequential real-world tasks at scale. Every major social platform employs thousands of human moderators who make dozens of decisions per hour under strict policy guidelines. This environment:

- Trains agents to apply nuanced, context-sensitive policies (not just keyword filters)
- Rewards appropriate severity calibration with partial-credit scoring
- Tests whether agents read author history and thread context before deciding
- Evaluates crisis detection (self-harm signals must be escalated, not removed)

---

## Environment Description

The agent receives a queue of social media posts and must moderate each one using a 6-action policy framework:

| Action | When to use |
|---|---|
| `approve` | Content is safe and policy-compliant |
| `warn` | Minor first-time violation — warn user, keep content |
| `remove` | Clear violation — remove content, no account action |
| `ban_temporary` | Repeat offender or serious single violation |
| `ban_permanent` | Zero-tolerance offense (CSAM, doxxing, selling PII) or 5+ violations |
| `escalate` | Ambiguous case, appeal requiring evidence review, or crisis signal |

---

## Observation Space

Each observation contains:

| Field | Type | Description |
|---|---|---|
| `post_id` | string | Unique post identifier |
| `content` | string | Full post text |
| `author_id` | string | Author identifier |
| `author_history` | object | Total violation count + prior action log |
| `context_thread` | array | Parent/sibling posts providing context |
| `is_appeal` | boolean | Whether this is a ban appeal |
| `appeal_reason` | string | User's stated appeal reason |
| `queue_remaining` | array | Previews of posts still in the queue |
| `queue_position` | int | Current position in the queue |
| `available_actions` | array | Valid actions for this task |
| `cumulative_reward` | float | Running episode score |

---

## Action Space

```json
{
  "action_type": "remove",
  "violation_type": "hate_speech",
  "reasoning": "Post contains ethnic slur with dehumanizing language"
}
```

- `action_type` (required): One of the 6 moderation actions
- `violation_type` (optional): `spam | hate_speech | harassment | misinformation | csam | illegal_services | doxxing | self_harm_risk | none`
- `reasoning` (optional): Agent explanation — logged but not used for grading

---

## Tasks

### Task 1 — Binary Content Moderation (Easy)

- **Posts**: 8 | **Max steps**: 12
- **Actions**: `approve` or `remove` only
- Posts contain clear, unambiguous signals: obvious spam, explicit slurs, and direct threats vs. cooking tips and community announcements
- **Scoring**: Binary exact match — 1.0 correct, 0.0 wrong. Episode score = mean over posts.
- **Expected baseline score**: ~0.75

### Task 2 — Tiered Policy Enforcement (Medium)

- **Posts**: 10 | **Max steps**: 18
- **Actions**: `approve / warn / remove / ban_temporary / ban_permanent`
- Includes edge cases: harsh-but-legal film criticism, first vs. repeat offenders, political speech, zero-tolerance violations
- **Scoring**: Action distance score (70%) + violation type identification (30%). Partial credit for being one level off.
- **Expected baseline score**: ~0.55

### Task 3 — Full Queue Management with Context & Appeals (Hard)

- **Posts**: 12 | **Max steps**: 24
- **Actions**: All 6, including `escalate`
- Requires: reading author history (5+ violations → permanent ban), thread context (gaming slang ≠ threat), crisis detection (suicidal ideation → escalate, don't remove), appeal handling
- **Scoring**: Action score (50%) + context-aware bonus (30%) + violation type (20%)
- **Expected baseline score**: ~0.40

---

## Reward Function

- **Per-step, non-sparse**: every post scores independently (0.0–1.0)
- **Partial credit**: being one action level off (e.g., `warn` when `remove` is correct) scores ~0.65 instead of 0
- **Context bonus** (hard task): +0.3 for posts where the correct answer requires author history or thread context
- **Episode score**: mean of all per-post scores

---

## API Endpoints

| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Liveness check |
| `GET` | `/tasks` | List all tasks with metadata |
| `POST` | `/reset?task_id=task_easy` | Start new episode, returns first Observation |
| `POST` | `/step` | Submit action, returns StepResult |
| `GET` | `/state` | Current environment state snapshot |

---

## Setup & Usage

### Local Development

```bash
# Clone / navigate to project
cd content-moderation-env

# Install dependencies
pip install -r requirements.txt

# Start the server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
```

### Docker

```bash
docker build -t content-moderation-env .
docker run -p 7860:7860 content-moderation-env
```

### Run Baseline Inference

```bash
export API_BASE_URL="https://api-inference.huggingface.co/v1"
export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
export HF_TOKEN="hf_your_token_here"
export ENV_BASE_URL="http://localhost:7860"

python inference.py
```

---

## Baseline Scores

Measured using `meta-llama/Meta-Llama-3-8B-Instruct` (temperature=0):

| Task | Score | Difficulty |
|---|---|---|
| task_easy | ~0.750 | Easy |
| task_medium | ~0.551 | Medium |
| task_hard | ~0.403 | Hard |
| **Overall** | **~0.568** | — |

*Scores are reproducible at temperature=0.*

---

## Project Structure

```
content-moderation-env/
├── openenv.yaml          # OpenEnv spec metadata
├── Dockerfile            # HF Spaces / Docker deployment
├── requirements.txt      # Python dependencies
├── inference.py          # Baseline agent script (OpenAI client)
├── app.py                # FastAPI server (reset/step/state endpoints)
├── README.md
└── environment/
    ├── __init__.py
    ├── models.py         # Pydantic: Observation, Action, Reward, StepResult
    ├── env.py            # ContentModerationEnv class
    ├── tasks.py          # Task definitions + deterministic graders
    └── data/
        └── posts.json    # 30 labeled posts with ground truth
```

---

## HF Spaces Deployment

This environment is deployed as a Hugging Face Space tagged with `openenv`. The Space exposes the full OpenEnv HTTP API. Set the following secrets in your Space settings:

```
API_BASE_URL   # LLM endpoint
MODEL_NAME     # Model to use for inference
HF_TOKEN       # Your Hugging Face API token
```
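
---

## Appendix: Scoring Sketch

The partial-credit scoring described under Reward Function can be sketched as follows. The actual graders live in `environment/tasks.py`; this is a plausible reconstruction, assuming a linear per-level penalty with the coefficient chosen to reproduce the documented ~0.65 one-level-off score. The function names and `penalty_per_level` parameter are illustrative, not the real API.

```python
# Severity ladder from the tiered-policy table; `escalate` sits outside it.
SEVERITY = ["approve", "warn", "remove", "ban_temporary", "ban_permanent"]


def action_score(chosen: str, correct: str, penalty_per_level: float = 0.35) -> float:
    """Distance-based partial credit: 1.0 for an exact match, ~0.65 one level off."""
    if chosen == correct:
        return 1.0
    if chosen not in SEVERITY or correct not in SEVERITY:
        return 0.0
    dist = abs(SEVERITY.index(chosen) - SEVERITY.index(correct))
    return max(0.0, 1.0 - penalty_per_level * dist)


def post_score(chosen: str, correct: str, chosen_type: str, correct_type: str) -> float:
    """Medium-task blend: action distance (70%) + violation type identification (30%)."""
    type_score = 1.0 if chosen_type == correct_type else 0.0
    return 0.7 * action_score(chosen, correct) + 0.3 * type_score
```

Under this sketch, answering `warn` when `remove` is correct scores 0.65 on the action component, and the episode score is simply the mean of `post_score` over all posts in the queue.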
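For reference, a minimal stdlib-only client for the HTTP API above might look like the following. The request bodies follow the documented action-space schema; the `done` and `observation` field names on the step result are assumptions (the exact `StepResult` schema is defined in `environment/models.py`), and `choose_action` is a hypothetical callback supplied by your agent.

```python
import json
import urllib.request


def build_action(action_type, violation_type=None, reasoning=None):
    """Build a /step request body following the action-space schema."""
    body = {"action_type": action_type}
    if violation_type is not None:
        body["violation_type"] = violation_type
    if reasoning is not None:
        body["reasoning"] = reasoning
    return body


def post_json(url, payload):
    """POST a JSON payload and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_episode(base_url, task_id, choose_action):
    """Reset, then step through the queue until the episode ends.

    `choose_action` maps an observation dict to an action dict built
    with `build_action`; the `done`/`observation` keys are assumed names.
    """
    obs = post_json(f"{base_url}/reset?task_id={task_id}", {})
    while True:
        result = post_json(f"{base_url}/step", choose_action(obs))
        if result.get("done"):
            return result
        obs = result.get("observation", result)
```

A trivial agent that approves everything would then be `run_episode("http://localhost:7860", "task_easy", lambda obs: build_action("approve"))`.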