metadata

title: Meta Content Moderation OpenEnv
emoji: 🛡️
colorFrom: blue
colorTo: red
sdk: docker
pinned: true
tags:
  - openenv
  - content-moderation
  - reinforcement-learning
  - meta
  - ai-safety
license: mit

🛡️ MetaContentModerationEnv

OpenEnv environment for training and evaluating AI agents on real-world content moderation.

Inspired by the operational challenges of content moderation at Meta scale — billions of posts, dozens of languages, evolving policies, and cultural nuance that breaks English-only models.

Why This Environment Exists

Content moderation is one of the most consequential AI tasks in production today. Every major social platform employs thousands of human moderators and increasingly uses AI to assist. Yet there is no public, structured benchmark environment where agents can be trained, evaluated, and compared on this task.

This environment fills that gap. An agent trained or evaluated here could be directly adapted for:

Assisting human moderators with triage
Pre-screening content before human review
Evaluating LLM safety properties on real-world content

Environment Overview

Property	Value
Name	`MetaContentModerationEnv`
Version	0.1.0
Framework	FastAPI + Pydantic v2
Package Manager	uv
Python	3.11+
Deployment	HF Spaces + Local Docker

📊 The Dataset (Hybrid Real & Synthetic)

The dataset is built from test cases designed to stress typical LLM guardrails. It is a deliberate 50/50 blend of real-world examples and synthetic adversarial content:

Cornell hate_speech_offensive Lexicon: 30 raw, unredacted text strings taken directly from Cornell University's peer-reviewed Twitter dataset on Hugging Face (tdavidson/hate_speech_offensive), used to test the agent on authentic, colloquial slurs in the wild.
True Multimodality (VLM Ready): Image Description and Ad Copy tasks include populated media_urls pointing to Unsplash/Wikimedia endpoints with imagery that matches the scenario. Vision-Language Models (such as GPT-4o or Pixtral) can download and ingest these images directly from the state.
Adversarial Linguistic Obfuscation: The synthetic portion of the dataset uses obfuscated spellings in the style of leetspeak (e.g. krill urself), culturally entrenched slang, and contextual emojis designed to defeat classic regex filters.
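To illustrate why obfuscated spellings defeat classic regex filters, here is a minimal sketch. The pattern and the `naive_filter` helper are hypothetical and are not taken from the environment's code; the obfuscated string is the example quoted above.

```python
import re

# Hypothetical blocklist pattern, not part of this environment.
NAIVE_PATTERN = re.compile(r"\bkill\s+yourself\b", re.IGNORECASE)


def naive_filter(text: str) -> bool:
    """Return True when the literal blocklisted phrase appears in the text."""
    return NAIVE_PATTERN.search(text) is not None


# The literal phrase is caught, the obfuscated variant is not.
literal_hit = naive_filter("Kill yourself")
obfuscated_hit = naive_filter("krill urself")
print(literal_hit, obfuscated_hit)
```

A literal pattern only matches the spelling it was written for, which is why the synthetic items lean on misspellings, slang, and emojis.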

Type	Description
`text_post`	Social media posts and comments (English/Hindi)
`image_description`	Alt-text + Direct Image URL attachments
`ad_copy`	Advertising copy targeting specific age/metrics
`whatsapp_message`	Multi-message conversational threads with deep context

Action Space

The agent submits a ModerationDecision JSON object at each step:

{
  "content_id": "post_042",
  "labels": ["hate_speech", "harassment"],
  "action": "remove",
  "confidence": 0.87,
  "reasoning": "Direct ethnic slur combined with threatening language. Violates CS_HATE_001.",
  "policy_citations": ["CS_HATE_001", "CS_VIOLENCE_001"]
}

Violation categories: hate_speech, violence, nudity, spam, misinformation, harassment, self_harm, terrorism, scam, clean

Actions: approve, remove, escalate, restrict, request_review
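A client can validate a decision before sending it. The sketch below mirrors the fields and allowed values listed above using only the standard library; the server itself uses Pydantic models, which are not shown in this README. The confidence range of 0.0 to 1.0 and the non-empty label requirement are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field

LABELS = {
    "hate_speech", "violence", "nudity", "spam", "misinformation",
    "harassment", "self_harm", "terrorism", "scam", "clean",
}
ACTIONS = {"approve", "remove", "escalate", "restrict", "request_review"}


@dataclass
class ModerationDecision:
    """Client-side mirror of the decision payload (illustrative only)."""

    content_id: str
    labels: list[str]
    action: str
    confidence: float
    reasoning: str = ""
    policy_citations: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        unknown = set(self.labels) - LABELS
        if unknown:
            raise ValueError(f"unknown labels: {sorted(unknown)}")
        if not self.labels:
            # Assumption: every decision carries at least one label.
            raise ValueError("at least one label is required")
        if self.action not in ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")
        if not 0.0 <= self.confidence <= 1.0:
            # Assumption: confidence is a probability in [0, 1].
            raise ValueError("confidence must be between 0.0 and 1.0")


decision = ModerationDecision(
    content_id="post_042",
    labels=["hate_speech", "harassment"],
    action="remove",
    confidence=0.87,
    policy_citations=["CS_HATE_001", "CS_VIOLENCE_001"],
)
payload = json.dumps(asdict(decision))  # JSON body for the /step endpoint
```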

Observation Space

At each step the agent receives:

{
  "step": 3,
  "content_item": {
    "content_id": "post_042",
    "content_type": "text_post",
    "text": "...",
    "author_region": "DE",
    "language": "en",
    "author_history": ["hate_speech"]
  },
  "policy_excerpt": "[CS_HATE_001] Hate Speech — Direct Attack: ...",
  "thread_history": [],
  "conflicting_policies": [],
  "task_name": "single-label-classify",
  "instructions": "..."
}
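One way an agent might turn this observation into a model prompt is sketched below. `build_prompt` is a hypothetical helper, not part of the repository. The shape of `thread_history` and `conflicting_policies` entries is not documented here, so the sketch serializes them as JSON, and it assumes `author_history` lists the author's prior violation labels.

```python
import json


def build_prompt(observation: dict) -> str:
    """Flatten one observation into a plain-text prompt for a language model."""
    item = observation["content_item"]
    lines = [
        f"Task: {observation['task_name']} (step {observation['step']})",
        f"Instructions: {observation['instructions']}",
        f"Policy: {observation['policy_excerpt']}",
    ]
    # Entries are serialized as JSON because their exact shape is an unknown here.
    for earlier in observation.get("thread_history", []):
        lines.append(f"Earlier message: {json.dumps(earlier, ensure_ascii=False)}")
    for policy in observation.get("conflicting_policies", []):
        lines.append(f"Conflicting policy: {json.dumps(policy, ensure_ascii=False)}")
    lines.append(
        f"Content {item['content_id']} ({item['content_type']}, "
        f"language={item['language']}, region={item['author_region']}): {item['text']}"
    )
    history = item.get("author_history", [])
    if history:
        # Assumption: author_history holds prior violation labels.
        lines.append("Author's prior violations: " + ", ".join(history))
    return "\n".join(lines)
```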

Tasks

Task 1: `single-label-classify` — Easy

Objective: Classify one item into exactly one violation category (or clean)
Content types: Text posts, image descriptions
Steps: 10 per episode
Grader: Exact label match + action correctness
Expected score (GPT-4o): ~0.75

Task 2: `multi-label-classify` — Medium

Objective: Assign ALL applicable violation labels (content may violate multiple policies)
Content types: Text posts, ad copy, WhatsApp messages
Steps: 12 per episode
Grader: F1 score on label set + action correctness + false positive penalty
Expected score (GPT-4o): ~0.62
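For reference, the F1 score on a label set can be computed as below. This is the standard set-based definition, not the grader's actual code; treating two empty sets as a perfect match is an assumption.

```python
def label_f1(predicted: set[str], expected: set[str]) -> float:
    """Set-based F1: harmonic mean of precision and recall over labels."""
    if not predicted and not expected:
        return 1.0  # Assumption: nothing predicted and nothing expected counts as a match.
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# One extra label lowers precision to 2/3 while recall stays at 1.0.
score = label_f1({"hate_speech", "harassment", "spam"}, {"hate_speech", "harassment"})
```

Here precision is 2/3 and recall is 1, so the F1 is 0.8. Over-labeling costs precision, and per the grader description it also draws the separate false positive penalty.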

Task 3: `ad-policy-compliance` — Medium-Hard

Objective: Review ad copy against specific policy rules and cite the violated rule IDs
Content types: Ad copy only
Steps: 10 per episode
Grader: F1 on labels + policy citation F1 + action correctness
Expected score (GPT-4o): ~0.58

Task 4: `thread-moderation-hard` — Hard

Objective: Moderate a full conversation thread message-by-message with growing context
Special challenges: Cultural nuance, multi-label violations, conflicting policy resolution
Content types: WhatsApp-style messages
Steps: 15 per episode
Grader: Per-message label F1 + reasoning quality on conflict cases + thread-level action + false positive penalty on protected political speech
Expected score (GPT-4o): ~0.45

📐 The Graders & Reward Design

Many evaluation setups rely on an LLM-as-a-judge or on exact string matching. This environment uses neither, because both are brittle.

The graders are deterministic and combine three ideas:

Semantic Hierarchy Graph Distance: Rather than scoring each label as strictly 1.0 or 0.0, the grader uses a distance matrix over the label hierarchy. If a model predicts 'harassment' where 'hate_speech' is correct, it earns partial credit for landing near the correct severity branch.
Brier-Score Calibration Penalties: The model's self-reported confidence is scored with a Brier-style squared error (for a binary outcome, the Continuous Ranked Probability Score reduces to the Brier score). If a model confidently produces a false positive, the penalty scales quadratically.
Dense Token Reasoning Intersection: To assess the reasoning trace without an expensive, non-reproducible LLM judge, the grader computes set intersections between the reasoning text and policy rule keywords extracted directly from the system prompt rules.
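The three ideas above can be sketched as follows. This is an illustration under stated assumptions, not the code in server/graders.py: the similarity values, the stop-word list, and all function names are hypothetical.

```python
# Hypothetical similarity values; the real distance matrix is not shown in this README.
SIMILARITY = {
    frozenset({"hate_speech", "harassment"}): 0.5,
    frozenset({"spam", "scam"}): 0.5,
}
STOP_WORDS = {"a", "an", "the", "with", "and", "of", "to", "in"}


def label_credit(predicted: str, expected: str) -> float:
    """Full credit for an exact match, partial credit for a nearby label."""
    if predicted == expected:
        return 1.0
    return SIMILARITY.get(frozenset({predicted, expected}), 0.0)


def brier_penalty(confidence: float, correct: bool) -> float:
    """Squared error between the stated confidence and the binary outcome."""
    outcome = 1.0 if correct else 0.0
    return (confidence - outcome) ** 2


def keyword_overlap(reasoning: str, policy_keywords: set[str]) -> float:
    """Jaccard overlap between reasoning tokens and policy keywords."""
    tokens = {word.strip(".,;:!?()[]").lower() for word in reasoning.split()}
    tokens = {token for token in tokens if token} - STOP_WORDS
    keywords = {keyword.lower() for keyword in policy_keywords}
    union = tokens | keywords
    return len(tokens & keywords) / len(union) if union else 0.0
```

With these definitions, a wrong decision at confidence 0.9 is penalized 0.81, while the same mistake at confidence 0.5 is penalized 0.25, which is the quadratic scaling described above.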

Component	Mathematical Operation	Weight
Label F1	Distance Matrix Recall	30–50%
Action accuracy	Binary scalar multiplication	20–30%
Policy citations	Hypergeometric Set Intersection	0–30% (ad task only)
Reasoning density	Stop-word optimized Jaccard overlap	0–25% (hard task only)
False positive	Exponential Brier-Penalty	-30 to -50%

All rewards are clamped to [-1.0, 1.0]. The episode score is normalized to [0.0, 1.0].
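The README does not give the normalization formula. The sketch below shows one possible reading, assuming the episode score is the mean clamped step reward rescaled linearly from [-1, 1] to [0, 1]; both function names are hypothetical.

```python
def clamp_reward(raw: float) -> float:
    """Clamp a per-step reward to the documented range [-1.0, 1.0]."""
    return max(-1.0, min(1.0, raw))


def episode_score(step_rewards: list[float]) -> float:
    """Assumed normalization: mean clamped reward mapped linearly onto [0, 1]."""
    if not step_rewards:
        return 0.0
    mean = sum(clamp_reward(reward) for reward in step_rewards) / len(step_rewards)
    return (mean + 1.0) / 2.0


# 1.5 is clamped to 1.0, so the mean is 1/3 and the score is 2/3.
example = episode_score([1.0, -1.0, 1.5])
```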

Setup & Usage

Prerequisites

Python 3.11+
uv: pip install uv
Docker (for containerized deployment)

Local Development

# Clone
git clone https://github.com/<your-username>/meta-content-moderation-env
cd meta-content-moderation-env

# Generate lockfile and sync dependencies
uv lock
uv sync

# Copy and configure env vars
cp .env.example .env
# Edit .env with your API keys

# Start server
uv run uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload

# In another terminal — run inference
export MODEL_PROVIDER=hf
export HF_TOKEN=hf_...
uv run python inference.py

Docker

docker build -t meta-content-moderation-env .
docker run -p 7860:7860 \
  -e MODEL_PROVIDER=hf \
  -e HF_TOKEN=hf_... \
  meta-content-moderation-env

API Usage

# Health check
curl http://localhost:7860/health

# Reset to a task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task": "single-label-classify", "seed": 42}'

# Submit a decision
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"content_id": "post_001", "labels": ["clean"], "action": "approve", "confidence": 0.9, "reasoning": "", "policy_citations": []}'

# Check state
curl http://localhost:7860/state
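The same calls can be made from Python with only the standard library. This sketch reuses the endpoints and payloads from the curl commands above; the helper names are hypothetical, and it assumes each endpoint returns a JSON body.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the environment endpoints."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the response (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


reset_request = build_post("/reset", {"task": "single-label-classify", "seed": 42})
step_request = build_post("/step", {
    "content_id": "post_001",
    "labels": ["clean"],
    "action": "approve",
    "confidence": 0.9,
    "reasoning": "",
    "policy_citations": [],
})


def main() -> None:
    """Reset the environment, then submit one decision. Needs a running server."""
    print(send(reset_request))
    print(send(step_request))
```

Call `main()` once the server from Local Development is listening on port 7860.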

Running All Tasks

for task in single-label-classify multi-label-classify ad-policy-compliance thread-moderation-hard; do
  echo "=== Running task: $task ==="
  MODERATION_TASK=$task uv run python inference.py
done

Baseline Scores

Task	Difficulty	Zero-Shot	CoT + Few-Shot	Multi-Agent Debate
`single-label-classify`	Easy	0.963	0.963	0.432
`multi-label-classify`	Medium	0.317	0.762	0.055
`ad-policy-compliance`	Medium-Hard	0.514	0.450	0.195
`thread-moderation-hard`	Hard	0.354	0.498	0.613

Note: Zero-Shot Multi-Agent Debate is the strongest approach on the hard thread task, where it resolves complex logic well (0.613). On the easy and medium tasks, however, running zero-shot Prosecutors induces extreme over-moderation, and the false-positive penalties pull the score down (0.432, 0.055, and 0.195). This suggests that prompting strategies alone are not enough and motivates training agents with RLHF.

Scores were measured with MODERATION_SEED=42 and TEMPERATURE=0.0.

Tests

uv run pytest tests/ -v

API Documentation (Swagger)

Because this environment is built natively on FastAPI, an interactive Swagger API UI is automatically generated and hosted alongside the server.

You can explore every endpoint, view request/response schemas, and interact with the environment directly from your browser:

Swagger UI: http://localhost:7860/docs
ReDoc UI: http://localhost:7860/redoc
Raw OpenAPI Spec: http://localhost:7860/openapi.json

A statically generated copy, openapi.json, is also included in the repository root for offline reference, along with a readable Markdown API reference at swagger.md.

Project Structure

meta-content-moderation-env/
├── inference.py          # Baseline inference script (run this)
├── openenv.yaml          # OpenEnv metadata
├── Dockerfile
├── pyproject.toml        # uv build config
├── uv.lock               # uv Lockfile
├── server/
│   ├── app.py            # FastAPI server
│   ├── env.py            # Core environment
│   ├── models.py         # Pydantic models
│   ├── graders.py        # Task graders
│   ├── dataset.py        # Data loader
│   └── tasks/            # Per-task episode builders
├── data/
│   ├── posts.json
│   ├── image_descriptions.json
│   ├── ad_copies.json
│   ├── whatsapp_threads.json
│   └── policies/
└── tests/

License

MIT — see LICENSE file.
