Spaces:
Sleeping
Sleeping
| name: content-moderation-queue | |
| version: "1.0.0" | |
| description: > | |
| A real-world content moderation environment where AI agents learn to triage | |
| social media posts using a tiered policy framework. Agents moderate a queue | |
| of posts — classifying violations, applying graduated consequences, handling | |
| appeals, and escalating crisis content. Simulates the daily work of a Trust | |
| & Safety moderation team. | |
| author: content-moderation-env | |
| tags: | |
| - openenv | |
| - content-moderation | |
| - trust-and-safety | |
| - nlp | |
| - real-world | |
| - policy-enforcement | |
| observation_space: | |
| type: object | |
| description: > | |
| Current post to moderate plus queue/context state. | |
| properties: | |
| post_id: { type: string } | |
| content: { type: string, description: "Full text of the post" } | |
| author_id: { type: string } | |
| author_history: | |
| type: object | |
| description: "Author's prior violations count and action history" | |
| context_thread: | |
| type: array | |
| description: "Parent/sibling posts that provide context (if any)" | |
| is_appeal: { type: boolean } | |
| appeal_reason: { type: string, nullable: true } | |
| queue_remaining: { type: array, description: "Preview of posts still in queue" } | |
| queue_position: { type: integer } | |
| queue_total: { type: integer } | |
| current_step: { type: integer } | |
| max_steps: { type: integer } | |
| available_actions: { type: array, items: { type: string } } | |
| cumulative_reward: { type: number } | |
| action_space: | |
| type: object | |
| description: > | |
| Moderation decision for the current post. | |
| properties: | |
| action_type: | |
| type: string | |
| enum: [approve, warn, remove, ban_temporary, ban_permanent, escalate] | |
| description: "The moderation action to apply" | |
| violation_type: | |
| type: string | |
| enum: [spam, hate_speech, harassment, misinformation, csam, illegal_services, doxxing, self_harm_risk, none] | |
| description: "Optional — identified policy violation category" | |
| reasoning: | |
| type: string | |
| description: "Optional — agent's explanation (logged only, not graded)" | |
| reward: | |
| type: float | |
| range: [0.0, 1.0] | |
| description: > | |
| Per-step reward signal. Non-sparse: each post scores independently. | |
| Partial credit awarded for being one level off on the severity ladder. | |
| Episode score = mean of all per-post scores. | |
| tasks: | |
| - id: task_easy | |
| name: "Binary Content Moderation" | |
| difficulty: easy | |
| num_posts: 8 | |
| max_steps: 12 | |
| description: > | |
| Classify 8 posts as APPROVE or REMOVE. Posts contain clear signals — | |
| obvious spam, explicit hate speech, direct threats versus everyday benign content. | |
| allowed_actions: [approve, remove] | |
| scoring: "Binary exact match. Score = correct / total." | |
| - id: task_medium | |
| name: "Tiered Policy Enforcement" | |
| difficulty: medium | |
| num_posts: 10 | |
| max_steps: 18 | |
| description: > | |
| Apply 5-level policy to 10 posts including edge cases: harsh criticism, | |
| repeat offenders, political content, first vs habitual violations. | |
| Partial credit for being one level off the severity ladder. | |
| allowed_actions: [approve, warn, remove, ban_temporary, ban_permanent] | |
| scoring: "Action distance score (70%) + violation type (30%)." | |
| - id: task_hard | |
| name: "Full Queue Management with Context and Appeals" | |
| difficulty: hard | |
| num_posts: 12 | |
| max_steps: 24 | |
| description: > | |
| Handle 12 items including appeals requiring evidence review, ambiguous | |
| context (gaming slang, creative writing, recovery stories), repeat offenders, | |
| crisis escalations, and second-appeal denials. Correct answer often | |
| requires reasoning over author history and thread context. | |
| allowed_actions: [approve, warn, remove, ban_temporary, ban_permanent, escalate] | |
| scoring: "Action score (50%) + context-aware bonus (30%) + violation type (20%)." | |
| endpoints: | |
| - method: GET | |
| path: /health | |
| description: Liveness check — returns 200 when server is ready | |
| - method: GET | |
| path: /tasks | |
| description: List all available tasks with metadata | |
| - method: POST | |
| path: /reset | |
| query_params: | |
| task_id: {type: string, default: task_easy, enum: [task_easy, task_medium, task_hard]} | |
| response_model: Observation | |
| description: Start a new episode; returns first Observation | |
| - method: POST | |
| path: /step | |
| request_body: Action | |
| response_model: StepResult | |
| description: Submit moderation action for current post; returns StepResult | |
| - method: GET | |
| path: /state | |
| response_model: EnvironmentState | |
| description: Full snapshot of current environment state | |
| entry_point: app:app | |
| framework: fastapi | |
| python_version: "3.10" | |