metadata
title: Meta Content Moderation OpenEnv
emoji: 🛡️
colorFrom: blue
colorTo: red
sdk: docker
pinned: true
tags:
  - openenv
  - content-moderation
  - reinforcement-learning
  - meta
  - ai-safety
license: mit

🛡️ MetaContentModerationEnv

OpenEnv environment for training and evaluating AI agents on real-world content moderation.


Inspired by the operational challenges of content moderation at Meta scale: billions of posts, dozens of languages, evolving policies, and cultural nuance that breaks English-only models.


Why This Environment Exists

Content moderation is one of the most consequential AI tasks in production today. Every major social platform employs thousands of human moderators and increasingly uses AI to assist. Yet there is no public, structured benchmark environment where agents can be trained, evaluated, and compared on this task.

This environment fills that gap. An agent trained or evaluated here could be directly adapted for:

  • Assisting human moderators with triage
  • Pre-screening content before human review
  • Evaluating LLM safety properties on real-world content

Environment Overview

| Property | Value |
| --- | --- |
| Name | MetaContentModerationEnv |
| Version | 0.1.0 |
| Framework | FastAPI + Pydantic v2 |
| Package Manager | uv |
| Python | 3.11+ |
| Deployment | HF Spaces + Local Docker |

📊 The Dataset (Hybrid Real & Synthetic)

The dataset is built from test cases designed to break typical LLM guardrails. It is a deliberate 50/50 blend of real-world text and synthetic adversarial examples:

  1. Cornell hate_speech_offensive Lexicon: We embedded 30 raw, unredacted text strings exported directly from Cornell University's peer-reviewed Twitter dataset on Hugging Face (tdavidson/hate_speech_offensive) to test the agent on authentic, colloquial slurs as they appear in the wild.
  2. True Multimodality (VLM Ready): Image Description and Ad Copy tasks include populated media_urls fields that point directly to Unsplash/Wikimedia images matching the scenario. Vision-language models (such as GPT-4o or Pixtral) can download and ingest these images directly from the state.
  3. Adversarial Linguistic Obfuscation: The synthetic part of the dataset uses leetspeak-style respellings (e.g. krill urself), culturally entrenched slang, and contextual emojis designed to defeat classic regex filters.

| Type | Description |
| --- | --- |
| text_post | Social media posts and comments (English/Hindi) |
| image_description | Alt-text + Direct Image URL attachments |
| ad_copy | Advertising copy targeting specific age/metrics |
| whatsapp_message | Multi-message conversational threads with deep context |

Action Space

The agent submits a ModerationDecision JSON object at each step:

{
  "content_id": "post_042",
  "labels": ["hate_speech", "harassment"],
  "action": "remove",
  "confidence": 0.87,
  "reasoning": "Direct ethnic slur combined with threatening language. Violates CS_HATE_001.",
  "policy_citations": ["CS_HATE_001", "CS_VIOLENCE_001"]
}

Violation categories: hate_speech, violence, nudity, spam, misinformation, harassment, self_harm, terrorism, scam, clean

Actions: approve, remove, escalate, restrict, request_review
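
The shape of a decision can be checked before it is sent. The sketch below is a standard-library stand-in, not the Pydantic model that lives in server/models.py. The label and action sets are copied from the lists above; the rules that at least one label is required and that confidence lies in [0.0, 1.0] are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Allowed values copied from the lists in this section.
LABELS = {
    "hate_speech", "violence", "nudity", "spam", "misinformation",
    "harassment", "self_harm", "terrorism", "scam", "clean",
}
ACTIONS = {"approve", "remove", "escalate", "restrict", "request_review"}


@dataclass
class ModerationDecision:
    """Illustrative stand-in for the environment's decision payload."""

    content_id: str
    labels: list[str]
    action: str
    confidence: float
    reasoning: str = ""
    policy_citations: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        unknown = set(self.labels) - LABELS
        if unknown:
            raise ValueError(f"unknown labels: {sorted(unknown)}")
        if not self.labels:
            # Assumption: an empty label list is not a valid decision.
            raise ValueError("at least one label is required")
        if self.action not in ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")
        if not 0.0 <= self.confidence <= 1.0:
            # Assumption: confidence is a probability.
            raise ValueError("confidence must be in [0.0, 1.0]")


decision = ModerationDecision(
    content_id="post_042",
    labels=["hate_speech", "harassment"],
    action="remove",
    confidence=0.87,
)
```

Validating on the client side keeps malformed decisions from consuming an episode step; the server's own validation remains the source of truth.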


Observation Space

At each step the agent receives:

{
  "step": 3,
  "content_item": {
    "content_id": "post_042",
    "content_type": "text_post",
    "text": "...",
    "author_region": "DE",
    "language": "en",
    "author_history": ["hate_speech"]
  },
  "policy_excerpt": "[CS_HATE_001] Hate Speech - Direct Attack: ...",
  "thread_history": [],
  "conflicting_policies": [],
  "task_name": "single-label-classify",
  "instructions": "..."
}
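
One way an agent might turn this observation into a prompt for a language model is sketched below. The function name build_prompt is hypothetical, reading author_history as a list of prior violation labels is an assumption, and thread_history entries are assumed to be printable as-is since the example above only shows an empty list.

```python
def build_prompt(observation: dict) -> str:
    """Flatten an observation into a plain-text prompt (illustrative only)."""
    item = observation["content_item"]
    # Assumption: author_history holds labels of the author's earlier violations.
    history = ", ".join(item.get("author_history", [])) or "none"
    lines = [
        f"Task: {observation['task_name']}",
        f"Instructions: {observation['instructions']}",
        f"Policy: {observation['policy_excerpt']}",
        f"Content ({item['content_type']}, language={item['language']}, "
        f"region={item['author_region']}): {item['text']}",
        f"Author's prior violations: {history}",
    ]
    # Assumption: each thread_history entry can be rendered with str().
    for i, prior in enumerate(observation.get("thread_history", []), start=1):
        lines.append(f"Earlier message {i}: {prior}")
    return "\n".join(lines)


example_observation = {
    "step": 3,
    "content_item": {
        "content_id": "post_042",
        "content_type": "text_post",
        "text": "...",
        "author_region": "DE",
        "language": "en",
        "author_history": ["hate_speech"],
    },
    "policy_excerpt": "[CS_HATE_001] Hate Speech - Direct Attack: ...",
    "thread_history": [],
    "conflicting_policies": [],
    "task_name": "single-label-classify",
    "instructions": "...",
}
prompt = build_prompt(example_observation)
```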

Tasks

Task 1: single-label-classify (Easy)

  • Objective: Classify one item into exactly one violation category (or clean)
  • Content types: Text posts, image descriptions
  • Steps: 10 per episode
  • Grader: Exact label match + action correctness
  • Expected score (GPT-4o): ~0.75

Task 2: multi-label-classify (Medium)

  • Objective: Assign ALL applicable violation labels (content may violate multiple policies)
  • Content types: Text posts, ad copy, WhatsApp messages
  • Steps: 12 per episode
  • Grader: F1 score on label set + action correctness + false positive penalty
  • Expected score (GPT-4o): ~0.62
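
For reference, a plain set-based F1 over labels can be computed as below. This is the textbook definition, shown as a sketch; the grader in server/graders.py may differ, for example by granting partial credit between related labels.

```python
def label_f1(predicted: set[str], expected: set[str]) -> float:
    """Set-based F1 between predicted and expected label sets."""
    if not predicted and not expected:
        return 1.0  # nothing to find and nothing predicted
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# One of two predicted labels is correct, and one of two expected labels is found.
score = label_f1({"hate_speech", "spam"}, {"hate_speech", "harassment"})
```

In this example precision and recall are both 0.5, so the F1 is 0.5.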

Task 3: ad-policy-compliance (Medium-Hard)

  • Objective: Review ad copy against specific policy rules and cite the violated rule IDs
  • Content types: Ad copy only
  • Steps: 10 per episode
  • Grader: F1 on labels + policy citation F1 + action correctness
  • Expected score (GPT-4o): ~0.58

Task 4: thread-moderation-hard (Hard)

  • Objective: Moderate a full conversation thread message-by-message with growing context
  • Special challenges: Cultural nuance, multi-label violations, conflicting policy resolution
  • Content types: WhatsApp-style messages
  • Steps: 15 per episode
  • Grader: Per-message label F1 + reasoning quality on conflict cases + thread-level action + false positive penalty on protected political speech
  • Expected score (GPT-4o): ~0.45
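
The reasoning-quality component is described in the grader section as a keyword overlap between the agent's reasoning and the policy text. A minimal sketch of that idea follows; the stop-word list, the tokenizer, and the function names are illustrative assumptions, not the environment's actual implementation.

```python
import re

# Hypothetical, deliberately small stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "with", "this", "that"}


def keywords(text: str) -> set[str]:
    """Lowercase word tokens with stop words removed."""
    return {w for w in re.findall(r"[a-z0-9_]+", text.lower()) if w not in STOP_WORDS}


def reasoning_overlap(reasoning: str, policy_text: str) -> float:
    """Jaccard overlap between reasoning keywords and policy keywords."""
    r, p = keywords(reasoning), keywords(policy_text)
    if not r or not p:
        return 0.0
    return len(r & p) / len(r | p)


overlap = reasoning_overlap(
    "Direct ethnic slur with threatening language",
    "Direct attack using a slur or threatening language",
)
```

Here four keywords are shared out of seven distinct ones, giving an overlap of 4/7. Because the computation is pure set arithmetic, the same input always yields the same score.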

The Graders & Reward Design

Many evaluation harnesses rely on an LLM-as-a-judge or on exact-string matching. Both are brittle: the first is hard to reproduce and the second gives no partial credit.

The core grading logic here is deterministic and has three parts:

  1. Semantic hierarchy distance: Instead of scoring each label as strictly 1.0 or 0.0, the grader uses a distance matrix over the label hierarchy. If a model predicts 'harassment' where 'hate_speech' is expected, it earns partial credit for landing near the correct severity branch.
  2. Brier-style calibration penalty: The grader scores the model's self-reported confidence with a Brier-style rule (for a binary outcome, the Continuous Ranked Probability Score reduces to the Brier score). A confident false positive is penalized quadratically in the stated confidence.
  3. Reasoning keyword overlap: To assess the reasoning trace without an expensive, non-reproducible LLM judge, the grader computes set intersections between the reasoning text and policy-rule keywords extracted directly from the system prompt rules.

| Component | Mathematical Operation | Weight |
| --- | --- | --- |
| Label F1 | Distance Matrix Recall | 30–50% |
| Action accuracy | Binary scalar multiplication | 20–30% |
| Policy citations | Hypergeometric Set Intersection | 0–30% (ad task only) |
| Reasoning density | Stop-word optimized Jaccard overlap | 0–25% (hard task only) |
| False positive | Brier-style quadratic penalty | -30 to -50% |
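
The text above says the false-positive penalty scales quadratically with confidence. A Brier-style version of that penalty might look like the sketch below. The function name and the exact formula are assumptions; the max_penalty default of 0.5 is taken from the upper end of the -30 to -50% range.

```python
def false_positive_penalty(
    confidence: float,
    is_false_positive: bool,
    max_penalty: float = 0.5,  # assumed: upper end of the documented range
) -> float:
    """Return a non-positive penalty that grows with the square of confidence."""
    if not is_false_positive:
        return 0.0
    # Brier score for a wrong binary call: (confidence - 0) ** 2.
    return -max_penalty * confidence ** 2


confident_miss = false_positive_penalty(0.9, True)   # about -0.405
hesitant_miss = false_positive_penalty(0.3, True)    # about -0.045
```

Under this rule a false positive at confidence 0.9 costs nine times as much as one at 0.3, which rewards agents that lower their confidence when unsure.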

All rewards are clamped to [-1.0, 1.0]. The episode score is normalized to [0.0, 1.0].
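
The clamping and normalization can be sketched as follows. The clamp matches the stated range; the normalization shown here, a linear map of the mean step reward from [-1, 1] to [0, 1], is only one plausible reading and is an assumption about how the episode score is derived.

```python
def clamp_reward(reward: float) -> float:
    """Clamp a per-step reward to [-1.0, 1.0]."""
    return max(-1.0, min(1.0, reward))


def episode_score(step_rewards: list[float]) -> float:
    """Map the mean clamped reward from [-1, 1] onto [0, 1] (assumed scheme)."""
    if not step_rewards:
        return 0.0
    mean = sum(clamp_reward(r) for r in step_rewards) / len(step_rewards)
    return (mean + 1.0) / 2.0


# 1.4 is clamped to 1.0, so the mean is (1.0 + 0.5 - 0.5 + 1.0) / 4 = 0.5.
score = episode_score([1.0, 0.5, -0.5, 1.4])
```

With these rewards the mean of 0.5 maps to an episode score of 0.75.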


Setup & Usage

Prerequisites

  • Python 3.11+
  • uv: pip install uv
  • Docker (for containerized deployment)

Local Development

# Clone
git clone https://github.com/<your-username>/meta-content-moderation-env
cd meta-content-moderation-env

# Generate lockfile and sync dependencies
uv lock
uv sync

# Copy and configure env vars
cp .env.example .env
# Edit .env with your API keys

# Start server
uv run uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload

# In another terminal: run inference
export MODEL_PROVIDER=hf
export HF_TOKEN=hf_...
uv run python inference.py

Docker

docker build -t meta-content-moderation-env .
docker run -p 7860:7860 \
  -e MODEL_PROVIDER=hf \
  -e HF_TOKEN=hf_... \
  meta-content-moderation-env

API Usage

# Health check
curl http://localhost:7860/health

# Reset to a task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task": "single-label-classify", "seed": 42}'

# Submit a decision
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"content_id": "post_001", "labels": ["clean"], "action": "approve", "confidence": 0.9, "reasoning": "", "policy_citations": []}'

# Check state
curl http://localhost:7860/state
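
The same calls can be made from Python with only the standard library. The sketch below mirrors the curl commands above; the helper names are hypothetical, and the response bodies are printed as-is because their exact shape is not specified here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # local server from the setup steps


def post_json(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for an endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def demo() -> None:
    """Reset to a task, submit one decision, and print both responses."""
    print(send(post_json("/reset", {"task": "single-label-classify", "seed": 42})))
    decision = {
        "content_id": "post_001",
        "labels": ["clean"],
        "action": "approve",
        "confidence": 0.9,
        "reasoning": "",
        "policy_citations": [],
    }
    print(send(post_json("/step", decision)))


reset_request = post_json("/reset", {"task": "single-label-classify", "seed": 42})
```

Call demo() once the server is running; building the request separately from sending it makes the payload easy to inspect or test offline.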

Running All Tasks

for task in single-label-classify multi-label-classify ad-policy-compliance thread-moderation-hard; do
  echo "=== Running task: $task ==="
  MODERATION_TASK=$task uv run python inference.py
done


Baseline Scores

| Task | Difficulty | Zero-Shot | CoT + Few-Shot | Multi-Agent Debate |
| --- | --- | --- | --- | --- |
| single-label-classify | Easy | 0.963 | 0.963 | 0.432 |
| multi-label-classify | Medium | 0.317 | 0.762 | 0.055 |
| ad-policy-compliance | Medium-Hard | 0.514 | 0.450 | 0.195 |
| thread-moderation-hard | Hard | 0.354 | 0.498 | 0.613 |

Note: Zero-shot Multi-Agent Debate scores highest on thread-moderation-hard (0.613), where resolving complex, conflicting logic matters most. On the easier tasks, however, zero-shot Prosecutor agents over-moderate, and the resulting false-positive penalties pull the score down sharply (0.432, 0.055, and 0.195). This suggests that prompting strategies alone are not enough and motivates training with RLHF.

Scores measured with MODERATION_SEED=42, TEMPERATURE=0.0.


Tests

uv run pytest tests/ -v

API Documentation (Swagger)

Because this environment is built on FastAPI, an interactive Swagger UI is generated automatically and served alongside the server.

You can explore every endpoint, view request/response schemas, and interact with the environment directly from your browser. FastAPI serves this UI at /docs by default, so a local server started as described in Local Development would expose it at http://localhost:7860/docs unless the app overrides that path.

A statically generated copy, openapi.json, is also in the repository root for offline reference, along with a readable Markdown API reference at swagger.md.


Project Structure

meta-content-moderation-env/
├── inference.py          # Baseline inference script (run this)
├── openenv.yaml          # OpenEnv metadata
├── Dockerfile
├── pyproject.toml        # uv Build config
├── uv.lock               # uv Lockfile
├── server/
│   ├── app.py            # FastAPI server
│   ├── env.py            # Core environment
│   ├── models.py         # Pydantic models
│   ├── graders.py        # Task graders
│   ├── dataset.py        # Data loader
│   └── tasks/            # Per-task episode builders
├── data/
│   ├── posts.json
│   ├── image_descriptions.json
│   ├── ad_copies.json
│   ├── whatsapp_threads.json
│   └── policies/
└── tests/

License

MIT. See the LICENSE file.