Spaces: sniki28/content-moderation-queue

sniki28 committed 14 days ago

Commit daa31ff (verified) · 1 Parent(s): 7249da2

Upload README.md with huggingface_hub

Files changed (1)

README.md +206 -4

README.md CHANGED

@@ -1,10 +1,212 @@
 ---
 title: Content Moderation Queue
-emoji: 😻
-colorFrom: purple
-colorTo: green
 sdk: docker
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
 title: Content Moderation Queue
+emoji: 🛡️
+colorFrom: red
+colorTo: yellow
 sdk: docker
 pinned: false
+license: mit
+tags:
+  - openenv
+  - content-moderation
+  - trust-and-safety
+  - nlp
+  - reinforcement-learning
 ---
+# Content Moderation Queue — OpenEnv
+A real-world AI training environment that simulates the daily work of a **Trust & Safety moderation team**. Agents learn to triage a queue of social media posts — classifying policy violations, applying graduated consequences, handling user appeals, and escalating crisis content.
+---
+## Motivation
+Content moderation is a high-stakes task that platforms must perform at scale. Major social platforms employ thousands of human moderators, each making dozens of decisions per hour under strict policy guidelines. This environment:
+- Trains agents to apply nuanced, context-sensitive policies (not just keyword filters)
+- Rewards appropriate severity calibration with partial-credit scoring
+- Tests whether agents read author history and thread context before deciding
+- Evaluates crisis detection (self-harm signals must be escalated, not removed)
+---
+## Environment Description
+The agent receives a queue of social media posts and must moderate each one using a 6-action policy framework:
+| Action | When to use |
+|---|---|
+| `approve` | Content is safe and policy-compliant |
+| `warn` | Minor first-time violation — warn user, keep content |
+| `remove` | Clear violation — remove content, no account action |
+| `ban_temporary` | Repeat offender or serious single violation |
+| `ban_permanent` | Zero-tolerance offense (CSAM, doxxing, selling PII) or 5+ violations |
+| `escalate` | Ambiguous case, appeal requiring evidence review, or crisis signal |
+---
+## Observation Space
+Each observation contains:
+| Field | Type | Description |
+|---|---|---|
+| `post_id` | string | Unique post identifier |
+| `content` | string | Full post text |
+| `author_id` | string | Author identifier |
+| `author_history` | object | Total violations count + prior action log |
+| `context_thread` | array | Parent/sibling posts providing context |
+| `is_appeal` | boolean | Whether this is a ban appeal |
+| `appeal_reason` | string | User's stated appeal reason |
+| `queue_remaining` | array | Previews of posts still in queue |
+| `queue_position` | int | Current position in queue |
+| `available_actions` | array | Valid actions for this task |
+| `cumulative_reward` | float | Running episode score |
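As a rough illustration of the table above, the sketch below types an observation as a Python `TypedDict` and checks a payload for missing fields. The names `ObservationDict` and `missing_fields`, the element types of the arrays, and every sample value are assumptions made for illustration. The README does not specify the inner keys of `author_history` or the shape of the queue previews, so they are left empty here.

```python
from typing import Any, TypedDict


class ObservationDict(TypedDict):
    """Field names and top-level types copied from the Observation Space table."""
    post_id: str
    content: str
    author_id: str
    author_history: dict[str, Any]   # inner keys are not documented in the README
    context_thread: list[Any]
    is_appeal: bool
    appeal_reason: str
    queue_remaining: list[Any]
    queue_position: int
    available_actions: list[str]
    cumulative_reward: float


def missing_fields(payload: dict[str, Any]) -> list[str]:
    """Return the documented field names that are absent from a payload."""
    return [name for name in ObservationDict.__annotations__ if name not in payload]


# Hypothetical payload: all values below are invented for illustration.
sample: dict[str, Any] = {
    "post_id": "post_001",
    "content": "Example post text",
    "author_id": "user_001",
    "author_history": {},
    "context_thread": [],
    "is_appeal": False,
    "appeal_reason": "",
    "queue_remaining": [],
    "queue_position": 0,
    "available_actions": ["approve", "remove"],
    "cumulative_reward": 0.0,
}

print(missing_fields(sample))
```

A check like this can catch a mismatch between an agent and the server early, before any moderation logic runs.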
+---
+## Action Space
+```json
+{
+  "action_type": "remove",
+  "violation_type": "hate_speech",
+  "reasoning": "Post contains ethnic slur with dehumanizing language"
+}
+```
+- `action_type` (required): One of 6 moderation actions
+- `violation_type` (optional): `spam | hate_speech | harassment | misinformation | csam | illegal_services | doxxing | self_harm_risk | none`
+- `reasoning` (optional): Agent explanation — logged but not used for grading
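The payload above can be built and checked on the client side before it is sent. The following is a minimal sketch, not the server's own validation: the helper name `build_action` is hypothetical, and whether the server rejects unknown values is an assumption. The allowed values are copied from the lists in this README.

```python
from typing import Optional

ACTION_TYPES = {
    "approve", "warn", "remove", "ban_temporary", "ban_permanent", "escalate",
}
VIOLATION_TYPES = {
    "spam", "hate_speech", "harassment", "misinformation", "csam",
    "illegal_services", "doxxing", "self_harm_risk", "none",
}


def build_action(
    action_type: str,
    violation_type: Optional[str] = None,
    reasoning: Optional[str] = None,
) -> dict:
    """Build an action payload, omitting optional fields that were not given."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    if violation_type is not None and violation_type not in VIOLATION_TYPES:
        raise ValueError(f"unknown violation_type: {violation_type!r}")
    action = {"action_type": action_type}
    if violation_type is not None:
        action["violation_type"] = violation_type
    if reasoning is not None:
        action["reasoning"] = reasoning
    return action


print(build_action("remove", "hate_speech", "Post contains an ethnic slur"))
```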
+---
+## Tasks
+### Task 1 — Binary Content Moderation (Easy)
+- **Posts**: 8  |  **Max steps**: 12
+- **Actions**: `approve` or `remove` only
+- Posts carry clear, unambiguous signals: violations (obvious spam, explicit slurs, direct threats) versus benign content (cooking tips, community announcements)
+- **Scoring**: Binary exact match — 1.0 correct, 0.0 wrong. Score = mean.
+- **Expected baseline score**: ~0.75
+### Task 2 — Tiered Policy Enforcement (Medium)
+- **Posts**: 10  |  **Max steps**: 18
+- **Actions**: `approve / warn / remove / ban_temporary / ban_permanent`
+- Includes edge cases: harsh-but-legal film criticism, first vs. repeat offenders, political speech, zero-tolerance violations
+- **Scoring**: Action distance score (70%) + violation type identification (30%). Partial credit for being one level off.
+- **Expected baseline score**: ~0.55
+### Task 3 — Full Queue Management with Context & Appeals (Hard)
+- **Posts**: 12  |  **Max steps**: 24
+- **Actions**: All 6 including `escalate`
+- Requires: reading author history (5+ violations → permanent ban), thread context (gaming slang ≠ threat), crisis detection (suicidal ideation → escalate, don't remove), appeal handling
+- **Scoring**: Action score (50%) + context-aware bonus (30%) + violation type (20%)
+- **Expected baseline score**: ~0.40
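The scoring weights listed for the three tasks can be read as a weighted sum of per-post component scores. The sketch below shows that reading under stated assumptions: each component is taken to lie in 0.0 to 1.0, the component names and the `combine` helper are hypothetical, and how the real grader treats the context component on posts that need no context is not specified here.

```python
# Component weights per task, taken from the "Scoring" bullets above.
TASK_WEIGHTS = {
    "task_easy": {"action": 1.0},
    "task_medium": {"action": 0.7, "violation_type": 0.3},
    "task_hard": {"action": 0.5, "context": 0.3, "violation_type": 0.2},
}


def combine(task_id: str, components: dict[str, float]) -> float:
    """Weighted sum of component scores; a missing component counts as 0.0."""
    weights = TASK_WEIGHTS[task_id]
    return sum(weight * components.get(name, 0.0) for name, weight in weights.items())


# Hard task: correct action and violation type, but context was ignored.
example = combine("task_hard", {"action": 1.0, "context": 0.0, "violation_type": 1.0})
print(round(example, 3))
```

Under this reading, an agent that picks the right action but ignores author history or thread context cannot score above 0.7 on a hard-task post.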
+---
+## Reward Function
+- **Per-step, non-sparse**: every post scores independently (0.0–1.0)
+- **Partial credit**: being one action-level off (e.g., `warn` when `remove` is correct) scores ~0.65 instead of 0
+- **Context bonus** (hard task): +0.3 for posts where the correct answer requires author history or thread context
+- **Episode score**: mean of all per-post scores
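One way to picture the partial-credit rule is a severity ladder with credit that falls off with distance. This is a sketch under assumptions, not the grader in `tasks.py`: the ladder order follows the action table, only the values 1.0 (exact) and ~0.65 (one level off) come from this README, and the zero credit beyond one level and for mismatched `escalate` decisions is a guess.

```python
# Assumed severity order, following the action table in this README.
SEVERITY_LADDER = ["approve", "warn", "remove", "ban_temporary", "ban_permanent"]

# Only distances 0 and 1 are documented; anything further is assumed to score 0.0.
CREDIT_BY_DISTANCE = {0: 1.0, 1: 0.65}


def action_score(predicted: str, expected: str) -> float:
    """Score one decision by its distance from the expected action on the ladder."""
    if predicted == expected:
        return 1.0
    if predicted not in SEVERITY_LADDER or expected not in SEVERITY_LADDER:
        # Assumption: `escalate` has no ladder position, so no partial credit.
        return 0.0
    distance = abs(SEVERITY_LADDER.index(predicted) - SEVERITY_LADDER.index(expected))
    return CREDIT_BY_DISTANCE.get(distance, 0.0)


def episode_score(per_post_scores: list[float]) -> float:
    """Episode score is the mean of the per-post scores."""
    if not per_post_scores:
        raise ValueError("an episode needs at least one scored post")
    return sum(per_post_scores) / len(per_post_scores)


scores = [
    action_score("remove", "remove"),          # exact match
    action_score("warn", "remove"),            # one level off
    action_score("approve", "ban_permanent"),  # far off
]
print(scores, round(episode_score(scores), 3))
```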
+---
+## API Endpoints
+| Method | Path | Description |
+|---|---|---|
+| `GET` | `/health` | Liveness check |
+| `GET` | `/tasks` | List all tasks with metadata |
+| `POST` | `/reset?task_id=task_easy` | Start new episode, returns first Observation |
+| `POST` | `/step` | Submit action, returns StepResult |
+| `GET` | `/state` | Current environment state snapshot |
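A client for the endpoints above can be sketched with the Python standard library alone. The helper names are hypothetical, and two details are assumptions rather than documented behavior: that `/step` accepts the Action Space JSON as its request body, and that responses are JSON. The base URL matches the port used in the setup commands.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:7860"


def reset_request(base_url: str, task_id: str) -> urllib.request.Request:
    """Build the POST /reset request that starts a new episode."""
    query = urllib.parse.urlencode({"task_id": task_id})
    return urllib.request.Request(f"{base_url}/reset?{query}", method="POST")


def step_request(base_url: str, action: dict) -> urllib.request.Request:
    """Build the POST /step request carrying one action as a JSON body."""
    body = json.dumps(action).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Building requests needs no network access, so this runs without a server.
req = step_request(BASE_URL, {"action_type": "approve"})
print(req.get_method(), req.full_url)
```

With the server running, `send(reset_request(BASE_URL, "task_easy"))` would return the first observation, and `send(step_request(BASE_URL, action))` would submit a decision.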
+---
+## Setup & Usage
+### Local Development
+```bash
+# Clone / navigate to project
+cd content-moderation-env
+# Install dependencies
+pip install -r requirements.txt
+# Start the server
+uvicorn app:app --host 0.0.0.0 --port 7860 --reload
+```
+### Docker
+```bash
+docker build -t content-moderation-env .
+docker run -p 7860:7860 content-moderation-env
+```
+### Run Baseline Inference
+```bash
+export API_BASE_URL="https://api-inference.huggingface.co/v1"
+export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
+export HF_TOKEN="hf_your_token_here"
+export ENV_BASE_URL="http://localhost:7860"
+python inference.py
+```
+---
+## Baseline Scores
+Measured using `meta-llama/Meta-Llama-3-8B-Instruct` (temperature=0):
+| Task | Score | Difficulty |
+|---|---|---|
+| task_easy | ~0.750 | Easy |
+| task_medium | ~0.551 | Medium |
+| task_hard | ~0.403 | Hard |
+| **Overall** | **~0.568** | — |
+*Scores are reproducible at temperature=0.*
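The overall figure in the table is consistent with an unweighted mean of the three task scores, as this short check shows. Whether the baseline script computes it this way is an assumption.

```python
# Task scores copied from the Baseline Scores table.
task_scores = {"task_easy": 0.750, "task_medium": 0.551, "task_hard": 0.403}

overall = sum(task_scores.values()) / len(task_scores)
print(round(overall, 3))  # 0.568
```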
+---
+## Project Structure
+```
+content-moderation-env/
+├── openenv.yaml              # OpenEnv spec metadata
+├── Dockerfile                # HF Spaces / Docker deployment
+├── requirements.txt          # Python dependencies
+├── inference.py              # Baseline agent script (OpenAI client)
+├── app.py                    # FastAPI server (reset/step/state endpoints)
+├── README.md
+└── environment/
+    ├── __init__.py
+    ├── models.py             # Pydantic: Observation, Action, Reward, StepResult
+    ├── env.py                # ContentModerationEnv class
+    ├── tasks.py              # Task definitions + deterministic graders
+    └── data/
+        └── posts.json        # 30 labeled posts with ground truth
+```
+---
+## HF Spaces Deployment
+This environment is deployed as a Hugging Face Space tagged with `openenv`.
+The Space exposes the full OpenEnv HTTP API. Set the following secrets in your Space settings:
+```
+API_BASE_URL   # LLM endpoint
+MODEL_NAME     # Model to use for inference
+HF_TOKEN       # Your Hugging Face API token
+```