---
title: Meta Content Moderation OpenEnv
emoji: 🛡️
colorFrom: blue
colorTo: red
sdk: docker
pinned: true
tags:
- openenv
- content-moderation
- reinforcement-learning
- meta
- ai-safety
license: mit
---
# 🛡️ MetaContentModerationEnv
OpenEnv environment for training and evaluating AI agents on real-world content moderation.
Inspired by the operational challenges of content moderation at Meta scale: billions of posts, dozens of languages, evolving policies, and cultural nuance that breaks English-only models.
## Why This Environment Exists
Content moderation is one of the most consequential AI tasks in production today. Every major social platform employs thousands of human moderators and increasingly uses AI to assist. Yet there is no public, structured benchmark environment where agents can be trained, evaluated, and compared on this task.
This environment fills that gap. An agent trained or evaluated here could be directly adapted for:
- Assisting human moderators with triage
- Pre-screening content before human review
- Evaluating LLM safety properties on real-world content
## Environment Overview
| Property | Value |
|---|---|
| Name | MetaContentModerationEnv |
| Version | 0.1.0 |
| Framework | FastAPI + Pydantic v2 |
| Package Manager | uv |
| Python | 3.11+ |
| Deployment | HF Spaces + Local Docker |
## 📊 The Dataset (Hybrid Real & Synthetic)
The dataset integrates hard test cases designed to break typical LLM guardrails: a deliberate 50/50 blend of real-world extractions and synthetic adversarial noise.

- **Cornell `hate_speech_offensive` lexicon**: 30 raw, unredacted text strings exported directly from Cornell University's peer-reviewed Twitter dataset on Hugging Face (`tdavidson/hate_speech_offensive`), testing the agent on authentic, colloquial slurs in the wild.
- **True multimodality (VLM-ready)**: Image Description and Ad Copy tasks include natively populated `media_urls` pointing directly to Unsplash/Wikimedia endpoints with high-definition imagery matching each scenario. Vision-language models (such as GPT-4o or Pixtral) can download and ingest these images directly from the state.
- **Adversarial linguistic obfuscation**: The synthetic items use leetspeak (e.g. "krill urself"), deeply culturally entrenched slang, and contextual emojis designed to defeat classic regex filters; a quick illustration follows the content-type table below.
| Type | Description |
|---|---|
| `text_post` | Social media posts and comments (English/Hindi) |
| `image_description` | Alt-text + direct image URL attachments |
| `ad_copy` | Advertising copy targeting specific ages/metrics |
| `whatsapp_message` | Multi-message conversational threads with deep context |
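To see why the obfuscated items defeat keyword filtering, here is a tiny illustrative check (not part of the environment): a naive regex catches the canonical phrase but misses the leetspeak variant.

```python
import re

# A naive keyword filter of the kind the adversarial items are built to beat.
# The pattern and phrases are illustrative only, not from the environment.
BLOCKLIST = re.compile(r"\bkill\s+yourself\b", re.IGNORECASE)

print(bool(BLOCKLIST.search("kill yourself")))  # True  -- canonical phrase is caught
print(bool(BLOCKLIST.search("krill urself")))   # False -- leetspeak slips through
```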
## Action Space

The agent submits a `ModerationDecision` JSON object at each step:
```json
{
  "content_id": "post_042",
  "labels": ["hate_speech", "harassment"],
  "action": "remove",
  "confidence": 0.87,
  "reasoning": "Direct ethnic slur combined with threatening language. Violates CS_HATE_001.",
  "policy_citations": ["CS_HATE_001", "CS_VIOLENCE_001"]
}
```
Violation categories: `hate_speech`, `violence`, `nudity`, `spam`, `misinformation`, `harassment`, `self_harm`, `terrorism`, `scam`, `clean`

Actions: `approve`, `remove`, `escalate`, `restrict`, `request_review`
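For reference, a minimal Pydantic v2 sketch of this schema. The actual model lives in `server/models.py`; its exact validators and defaults may differ.

```python
from typing import Literal
from pydantic import BaseModel, Field

# Violation categories and actions as listed above.
Label = Literal[
    "hate_speech", "violence", "nudity", "spam", "misinformation",
    "harassment", "self_harm", "terrorism", "scam", "clean",
]
Action = Literal["approve", "remove", "escalate", "restrict", "request_review"]

class ModerationDecision(BaseModel):
    content_id: str
    labels: list[Label] = Field(min_length=1)  # at least one label, possibly just "clean"
    action: Action
    confidence: float = Field(ge=0.0, le=1.0)  # self-reported calibration, used by the grader
    reasoning: str = ""
    policy_citations: list[str] = []
```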
## Observation Space
At each step the agent receives:
```json
{
  "step": 3,
  "content_item": {
    "content_id": "post_042",
    "content_type": "text_post",
    "text": "...",
    "author_region": "DE",
    "language": "en",
    "author_history": ["hate_speech"]
  },
  "policy_excerpt": "[CS_HATE_001] Hate Speech - Direct Attack: ...",
  "thread_history": [],
  "conflicting_policies": [],
  "task_name": "single-label-classify",
  "instructions": "..."
}
```
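For image items, `content_item` also carries a `media_url` (see the dataset section above). A minimal sketch of handing that image to a vision model, assuming `/reset` returns the first observation directly and the field is named `media_url`:

```python
import requests

BASE = "http://localhost:7860"

# Assumption: /reset returns the first observation as shown above.
obs = requests.post(
    f"{BASE}/reset", json={"task": "single-label-classify", "seed": 42}
).json()
item = obs["content_item"]

# Image tasks populate a media_url pointing at Unsplash/Wikimedia.
if item.get("media_url"):
    image_bytes = requests.get(item["media_url"], timeout=30).content
    # Hand `image_bytes` (or the URL itself) to your VLM, e.g. as an
    # image_url content part in an OpenAI-style chat completion request.
```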
## Tasks
### Task 1: `single-label-classify` (Easy)
- Objective: Classify one item into exactly one violation category (or CLEAN)
- Content types: Text posts, image descriptions
- Steps: 10 per episode
- Grader: Exact label match + action correctness
- Expected score (GPT-4o): ~0.75
### Task 2: `multi-label-classify` (Medium)
- Objective: Assign ALL applicable violation labels (content may violate multiple policies)
- Content types: Text posts, ad copy, WhatsApp messages
- Steps: 12 per episode
- Grader: F1 score on label set + action correctness + false positive penalty
- Expected score (GPT-4o): ~0.62
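The set-based F1 used by this grader is standard; here is a minimal reference implementation (the repo's grader in `server/graders.py` additionally applies the distance-matrix partial credit described in the grading section below):

```python
def label_set_f1(predicted: set[str], gold: set[str]) -> float:
    """Standard set-based F1: harmonic mean of precision and recall."""
    if not predicted or not gold:
        return 1.0 if predicted == gold else 0.0
    tp = len(predicted & gold)  # true positives: labels in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# One correct label out of two, plus one spurious label:
print(label_set_f1({"hate_speech", "spam"}, {"hate_speech", "harassment"}))  # 0.5
```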
### Task 3: `ad-policy-compliance` (Medium-Hard)
- Objective: Review ad copy against specific policy rules and cite the violated rule IDs
- Content types: Ad copy only
- Steps: 10 per episode
- Grader: F1 on labels + policy citation F1 + action correctness
- Expected score (GPT-4o): ~0.58
### Task 4: `thread-moderation-hard` (Hard)
- Objective: Moderate a full conversation thread message-by-message with growing context
- Special challenges: Cultural nuance, multi-label violations, conflicting policy resolution
- Content types: WhatsApp-style messages
- Steps: 15 per episode
- Grader: Per-message label F1 + reasoning quality on conflict cases + thread-level action + false positive penalty on protected political speech
- Expected score (GPT-4o): ~0.45
## ⚖️ The Graders & Reward Design
Standard evaluation pipelines use simplistic "LLM-as-a-judge" scoring or exact-string matching. We rejected this brittle framing.
Our core grading logic uses a deterministic mathematical framework:

- **Semantic hierarchy graph distance**: Rather than scoring tags strictly as `1.0` or `0.0`, the grader uses a distance matrix. If a model outputs `harassment` instead of `hate_speech`, the matrix grants partial (Jaccard-overlap) credit for landing near the correct severity branch.
- **Brier-score calibration penalties**: We apply continuous ranked probability scoring to the model's self-reported `confidence` scalar. If a model confidently commits a false positive, the penalty scales quadratically.
- **Dense token reasoning intersection**: To gauge the `reasoning` trace without invoking an expensive, non-reproducible LLM judge, we compute localized set intersections on dense policy-rule keywords extracted directly from the system-prompt rules.
| Component | Mathematical Operation | Weight |
|---|---|---|
| Label F1 | Distance-matrix recall | 30–50% |
| Action accuracy | Binary scalar multiplication | 20–30% |
| Policy citations | Hypergeometric set intersection | 0–30% (ad task only) |
| Reasoning density | Stop-word-optimized Jaccard overlap | 0–25% (hard task only) |
| False-positive penalty | Quadratic Brier penalty | −30 to −50% |
All rewards clamped to [-1.0, 1.0]. Episode score normalized to [0.0, 1.0].
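As a simplified, illustrative sketch of how these components might combine for one step: the weights below are mid-range picks from the table above, not the repo's exact values; the real logic lives in `server/graders.py`.

```python
def step_reward(label_credit: float, action_correct: bool,
                confidence: float, false_positive: bool) -> float:
    """Illustrative per-step reward: distance-weighted label credit,
    a binary action bonus, and a quadratic (Brier-style) penalty for
    confident false positives."""
    reward = 0.5 * label_credit            # label F1 with distance-matrix partial credit
    reward += 0.3 * float(action_correct)  # binary action accuracy
    if false_positive:
        reward -= 0.5 * confidence ** 2    # penalty grows quadratically with confidence
    return max(-1.0, min(1.0, reward))     # clamp to [-1.0, 1.0]

# A confident false positive is punished hard:
print(step_reward(0.0, False, confidence=0.9, false_positive=True))  # -0.405
```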
## Setup & Usage

### Prerequisites
- Python 3.11+
- uv: `pip install uv`
- Docker (for containerized deployment)
### Local Development
```bash
# Clone
git clone https://github.com/<your-username>/meta-content-moderation-env
cd meta-content-moderation-env

# Generate lockfile and sync dependencies
uv lock
uv sync

# Copy and configure env vars
cp .env.example .env
# Edit .env with your API keys

# Start server
uv run uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload

# In another terminal: run inference
export MODEL_PROVIDER=hf
export HF_TOKEN=hf_...
uv run python inference.py
```
### Docker
```bash
docker build -t meta-content-moderation-env .

docker run -p 7860:7860 \
  -e MODEL_PROVIDER=hf \
  -e HF_TOKEN=hf_... \
  meta-content-moderation-env
```
### API Usage
```bash
# Health check
curl http://localhost:7860/health

# Reset to a task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task": "single-label-classify", "seed": 42}'

# Submit a decision
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"content_id": "post_001", "labels": ["clean"], "action": "approve", "confidence": 0.9, "reasoning": "", "policy_citations": []}'

# Check state
curl http://localhost:7860/state
```
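The same interaction from Python with `requests`: a hedged sketch of the loop a driver like `inference.py` might run. The response field names (`done`, `observation`) are assumptions; adapt them to the actual server schema.

```python
import requests

BASE = "http://localhost:7860"

# Assumption: /reset returns the first observation as shown in the
# Observation Space section.
obs = requests.post(
    f"{BASE}/reset", json={"task": "single-label-classify", "seed": 42}
).json()

done = False
while not done:
    decision = {  # replace this stub with your model's actual output
        "content_id": obs["content_item"]["content_id"],
        "labels": ["clean"],
        "action": "approve",
        "confidence": 0.9,
        "reasoning": "No policy violation found.",
        "policy_citations": [],
    }
    result = requests.post(f"{BASE}/step", json=decision).json()
    done = result.get("done", False)         # assumption: step response flags episode end
    obs = result.get("observation", result)  # assumption: next observation in the response
```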
### Running All Tasks
```bash
for task in single-label-classify multi-label-classify ad-policy-compliance thread-moderation-hard; do
  echo "=== Running task: $task ==="
  MODERATION_TASK=$task uv run python inference.py
done
```
## API Documentation (Swagger)
Because this environment is built natively on FastAPI, an interactive Swagger API UI is automatically generated and hosted alongside the server.
You can explore every endpoint, view request/response schemas, and interact with the environment directly from your browser:
- Swagger UI: http://localhost:7860/docs
- ReDoc UI: http://localhost:7860/redoc
- Raw OpenAPI Spec: http://localhost:7860/openapi.json
A statically generated copy of `openapi.json` is included in the repository root for offline reference, along with a readable Markdown API reference at `swagger.md`.
## Baseline Scores
| Task | Difficulty | Zero-Shot | CoT + Few-Shot | Multi-Agent Debate |
|---|---|---|---|---|
| `single-label-classify` | Easy | 0.963 | 0.963 | 0.432 |
| `multi-label-classify` | Medium | 0.317 | 0.762 | 0.055 |
| `ad-policy-compliance` | Medium-Hard | 0.514 | 0.450 | 0.195 |
| `thread-moderation-hard` | Hard | 0.354 | 0.498 | 0.613 |
*Note: Multi-Agent Debate excels at resolving the complex logic of deep-thread evaluations (reaching 0.613), but running zero-shot "prosecutor" agents on the easy and medium tasks induces extreme over-moderation, destroying the score through false-positive penalties. This is exactly the gap that RL fine-tuning in this environment is meant to close.*

Scores measured with `MODERATION_SEED=42`, `TEMPERATURE=0.0`.
## Tests

```bash
uv run pytest tests/ -v
```
## Project Structure

```
meta-content-moderation-env/
├── inference.py        # Baseline inference script (run this)
├── openenv.yaml        # OpenEnv metadata
├── Dockerfile
├── pyproject.toml      # uv build config
├── uv.lock             # uv lockfile
├── server/
│   ├── app.py          # FastAPI server
│   ├── env.py          # Core environment
│   ├── models.py       # Pydantic models
│   ├── graders.py      # Task graders
│   ├── dataset.py      # Data loader
│   └── tasks/          # Per-task episode builders
├── data/
│   ├── posts.json
│   ├── image_descriptions.json
│   ├── ad_copies.json
│   ├── whatsapp_threads.json
│   └── policies/
└── tests/
```
## License

MIT; see the `LICENSE` file.