---
title: Content Moderation Queue
emoji: 🛡️
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
license: mit
tags:
  - openenv
  - content-moderation
  - trust-and-safety
  - nlp
  - reinforcement-learning
---

# Content Moderation Queue — OpenEnv

A real-world AI training environment that simulates the daily work of a **Trust & Safety moderation team**. Agents learn to triage a queue of social media posts — classifying policy violations, applying graduated consequences, handling user appeals, and escalating crisis content.

---

## Motivation

Content moderation is one of the most consequential real-world tasks at scale. Every major social platform employs thousands of human moderators who make dozens of decisions per hour under strict policy guidelines. This environment:

- Trains agents to apply nuanced, context-sensitive policies (not just keyword filters)
- Rewards appropriate severity calibration with partial-credit scoring
- Tests whether agents read author history and thread context before deciding
- Evaluates crisis detection (self-harm signals must be escalated, not removed)

---

## Environment Description

The agent receives a queue of social media posts and must moderate each one using a 6-action policy framework:

| Action | When to use |
|---|---|
| `approve` | Content is safe and policy-compliant |
| `warn` | Minor first-time violation — warn user, keep content |
| `remove` | Clear violation — remove content, no account action |
| `ban_temporary` | Repeat offender or serious single violation |
| `ban_permanent` | Zero-tolerance offense (CSAM, doxxing, selling PII) or 5+ violations |
| `escalate` | Ambiguous case, appeal requiring evidence review, or crisis signal |

---

## Observation Space

Each observation contains:

| Field | Type | Description |
|---|---|---|
| `post_id` | string | Unique post identifier |
| `content` | string | Full post text |
| `author_id` | string | Author identifier |
| `author_history` | object | Total violation count + prior action log |
| `context_thread` | array | Parent/sibling posts providing context |
| `is_appeal` | boolean | Whether this is a ban appeal |
| `appeal_reason` | string | User's stated appeal reason |
| `queue_remaining` | array | Previews of posts still in the queue |
| `queue_position` | int | Current position in the queue |
| `available_actions` | array | Valid actions for this task |
| `cumulative_reward` | float | Running episode score |

---

## Action Space

```json
{
  "action_type": "remove",
  "violation_type": "hate_speech",
  "reasoning": "Post contains ethnic slur with dehumanizing language"
}
```

- `action_type` (required): One of the 6 moderation actions
- `violation_type` (optional): `spam | hate_speech | harassment | misinformation | csam | illegal_services | doxxing | self_harm_risk | none`
- `reasoning` (optional): Agent explanation — logged but not used for grading

---

## Tasks

### Task 1 — Binary Content Moderation (Easy)

- **Posts**: 8 | **Max steps**: 12
- **Actions**: `approve` or `remove` only
- Posts contain clear, unambiguous signals: obvious spam, explicit slurs, and direct threats vs. cooking tips and community announcements
- **Scoring**: Binary exact match — 1.0 correct, 0.0 wrong. Episode score = mean over posts.
- **Expected baseline score**: ~0.75

### Task 2 — Tiered Policy Enforcement (Medium)

- **Posts**: 10 | **Max steps**: 18
- **Actions**: `approve / warn / remove / ban_temporary / ban_permanent`
- Includes edge cases: harsh-but-legal film criticism, first vs. repeat offenders, political speech, zero-tolerance violations
- **Scoring**: Action distance score (70%) + violation type identification (30%). Partial credit for being one level off.
- **Expected baseline score**: ~0.55

### Task 3 — Full Queue Management with Context & Appeals (Hard)

- **Posts**: 12 | **Max steps**: 24
- **Actions**: All 6, including `escalate`
- Requires: reading author history (5+ violations → permanent ban), thread context (gaming slang ≠ threat), crisis detection (suicidal ideation → escalate, don't remove), appeal handling
- **Scoring**: Action score (50%) + context-aware bonus (30%) + violation type (20%)
- **Expected baseline score**: ~0.40

---

## Reward Function

- **Per-step, non-sparse**: every post scores independently (0.0–1.0)
- **Partial credit**: being one action level off (e.g., `warn` when `remove` is correct) scores ~0.65 instead of 0
- **Context bonus** (hard task): +0.3 for posts where the correct answer requires author history or thread context
- **Episode score**: mean of all per-post scores

---

## API Endpoints

| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Liveness check |
| `GET` | `/tasks` | List all tasks with metadata |
| `POST` | `/reset?task_id=task_easy` | Start new episode, returns first Observation |
| `POST` | `/step` | Submit action, returns StepResult |
| `GET` | `/state` | Current environment state snapshot |

---

## Setup & Usage

### Local Development

```bash
# Clone / navigate to project
cd content-moderation-env

# Install dependencies
pip install -r requirements.txt

# Start the server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
```

### Docker

```bash
docker build -t content-moderation-env .
docker run -p 7860:7860 content-moderation-env
```

### Run Baseline Inference

```bash
export API_BASE_URL="https://api-inference.huggingface.co/v1"
export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
export HF_TOKEN="hf_your_token_here"
export ENV_BASE_URL="http://localhost:7860"

python inference.py
```

---

## Baseline Scores

Measured using `meta-llama/Meta-Llama-3-8B-Instruct` (temperature=0):

| Task | Score | Difficulty |
|---|---|---|
| task_easy | ~0.750 | Easy |
| task_medium | ~0.551 | Medium |
| task_hard | ~0.403 | Hard |
| **Overall** | **~0.568** | — |

*Scores are reproducible at temperature=0.*

---

## Project Structure

```
content-moderation-env/
├── openenv.yaml          # OpenEnv spec metadata
├── Dockerfile            # HF Spaces / Docker deployment
├── requirements.txt      # Python dependencies
├── inference.py          # Baseline agent script (OpenAI client)
├── app.py                # FastAPI server (reset/step/state endpoints)
├── README.md
└── environment/
    ├── __init__.py
    ├── models.py         # Pydantic: Observation, Action, Reward, StepResult
    ├── env.py            # ContentModerationEnv class
    ├── tasks.py          # Task definitions + deterministic graders
    └── data/
        └── posts.json    # 30 labeled posts with ground truth
```

---

## HF Spaces Deployment

This environment is deployed as a Hugging Face Space tagged with `openenv`. The Space exposes the full OpenEnv HTTP API. Set the following secrets in your Space settings:

```
API_BASE_URL   # LLM endpoint
MODEL_NAME     # Model to use for inference
HF_TOKEN       # Your Hugging Face API token
```
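
---

## Appendix: Scoring Sketch

The partial-credit scoring described under Reward Function can be sketched as follows. The actual graders live in `environment/tasks.py`; this is a plausible reconstruction, assuming a linear per-level penalty with the coefficient chosen to reproduce the documented ~0.65 one-level-off score. The function names and `penalty_per_level` parameter are illustrative, not the real API.

```python
# Severity ladder from the tiered-policy table; `escalate` sits outside it.
SEVERITY = ["approve", "warn", "remove", "ban_temporary", "ban_permanent"]


def action_score(chosen: str, correct: str, penalty_per_level: float = 0.35) -> float:
    """Distance-based partial credit: 1.0 for an exact match, ~0.65 one level off."""
    if chosen == correct:
        return 1.0
    if chosen not in SEVERITY or correct not in SEVERITY:
        return 0.0
    dist = abs(SEVERITY.index(chosen) - SEVERITY.index(correct))
    return max(0.0, 1.0 - penalty_per_level * dist)


def post_score(chosen: str, correct: str, chosen_type: str, correct_type: str) -> float:
    """Medium-task blend: action distance (70%) + violation type identification (30%)."""
    type_score = 1.0 if chosen_type == correct_type else 0.0
    return 0.7 * action_score(chosen, correct) + 0.3 * type_score
```

Under this sketch, answering `warn` when `remove` is correct scores 0.65 on the action component, and the episode score is simply the mean of `post_score` over all posts in the queue.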
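For reference, a minimal stdlib-only client for the HTTP API above might look like the following. The request bodies follow the documented action-space schema; the `done` and `observation` field names on the step result are assumptions (the exact `StepResult` schema is defined in `environment/models.py`), and `choose_action` is a hypothetical callback supplied by your agent.

```python
import json
import urllib.request


def build_action(action_type, violation_type=None, reasoning=None):
    """Build a /step request body following the action-space schema."""
    body = {"action_type": action_type}
    if violation_type is not None:
        body["violation_type"] = violation_type
    if reasoning is not None:
        body["reasoning"] = reasoning
    return body


def post_json(url, payload):
    """POST a JSON payload and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_episode(base_url, task_id, choose_action):
    """Reset, then step through the queue until the episode ends.

    `choose_action` maps an observation dict to an action dict built
    with `build_action`; the `done`/`observation` keys are assumed names.
    """
    obs = post_json(f"{base_url}/reset?task_id={task_id}", {})
    while True:
        result = post_json(f"{base_url}/step", choose_action(obs))
        if result.get("done"):
            return result
        obs = result.get("observation", result)
```

A trivial agent that approves everything would then be `run_episode("http://localhost:7860", "task_easy", lambda obs: build_action("approve"))`.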