sniki28 committed on
Commit daa31ff · verified · 1 Parent(s): 7249da2

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +206 -4
README.md CHANGED
@@ -1,10 +1,212 @@
  ---
  title: Content Moderation Queue
- emoji: 😻
- colorFrom: purple
- colorTo: green
  sdk: docker
  pinned: false
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
  title: Content Moderation Queue
+ emoji: 🛑
+ colorFrom: red
+ colorTo: yellow
  sdk: docker
  pinned: false
+ license: mit
+ tags:
+ - openenv
+ - content-moderation
+ - trust-and-safety
+ - nlp
+ - reinforcement-learning
  ---
 
+ # Content Moderation Queue - OpenEnv
+
+ A real-world AI training environment that simulates the daily work of a **Trust & Safety moderation team**. Agents learn to triage a queue of social media posts: classifying policy violations, applying graduated consequences, handling user appeals, and escalating crisis content.
+
+ ---
+
+ ## Motivation
+
+ Content moderation is one of the most consequential real-world tasks at scale. Every major social platform employs thousands of human moderators who make dozens of decisions per hour under strict policy guidelines. This environment:
+
+ - Trains agents to apply nuanced, context-sensitive policies (not just keyword filters)
+ - Rewards appropriate severity calibration with partial-credit scoring
+ - Tests whether agents read author history and thread context before deciding
+ - Evaluates crisis detection (self-harm signals must be escalated, not removed)
+
+ ---
+
+ ## Environment Description
+
+ The agent receives a queue of social media posts and must moderate each one using a 6-action policy framework:
+
+ | Action | When to use |
+ |---|---|
+ | `approve` | Content is safe and policy-compliant |
+ | `warn` | Minor first-time violation: warn user, keep content |
+ | `remove` | Clear violation: remove content, no account action |
+ | `ban_temporary` | Repeat offender or serious single violation |
+ | `ban_permanent` | Zero-tolerance offense (CSAM, doxxing, selling PII) or 5+ violations |
+ | `escalate` | Ambiguous case, appeal requiring evidence review, or crisis signal |
+
+ ---
+
+ ## Observation Space
+
+ Each observation contains:
+
+ | Field | Type | Description |
+ |---|---|---|
+ | `post_id` | string | Unique post identifier |
+ | `content` | string | Full post text |
+ | `author_id` | string | Author identifier |
+ | `author_history` | object | Total violations count + prior action log |
+ | `context_thread` | array | Parent/sibling posts providing context |
+ | `is_appeal` | boolean | Whether this is a ban appeal |
+ | `appeal_reason` | string | User's stated appeal reason |
+ | `queue_remaining` | array | Previews of posts still in queue |
+ | `queue_position` | int | Current position in queue |
+ | `available_actions` | array | Valid actions for this task |
+ | `cumulative_reward` | float | Running episode score |
+
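The field table can be mirrored as a typed structure on the client side. The sketch below is only an illustration: the project's real model lives in `environment/models.py` as a Pydantic class, whereas this uses a standard-library `TypedDict`. The inner shapes of `author_history`, `context_thread`, and `queue_remaining` are not specified in this README, so they are left loosely typed, and `missing_fields` is a hypothetical helper.

```python
from typing import Any, TypedDict


class Observation(TypedDict):
    """Field names and types follow the README table; nested shapes are assumptions."""

    post_id: str
    content: str
    author_id: str
    author_history: dict[str, Any]  # inner keys are not documented in the README
    context_thread: list[Any]       # element shape is not documented
    is_appeal: bool
    appeal_reason: str
    queue_remaining: list[Any]      # element shape is not documented
    queue_position: int
    available_actions: list[str]
    cumulative_reward: float


def missing_fields(payload: dict[str, Any]) -> list[str]:
    """Return the documented field names that are absent from a payload."""
    return [name for name in Observation.__annotations__ if name not in payload]
```

A client could call `missing_fields` on each response from the server to spot schema drift early, before the payload reaches the agent.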
+ ---
+
+ ## Action Space
+
+ ```json
+ {
+   "action_type": "remove",
+   "violation_type": "hate_speech",
+   "reasoning": "Post contains ethnic slur with dehumanizing language"
+ }
+ ```
+
+ - `action_type` (required): One of the 6 moderation actions
+ - `violation_type` (optional): `spam | hate_speech | harassment | misinformation | csam | illegal_services | doxxing | self_harm_risk | none`
+ - `reasoning` (optional): Agent explanation; logged but not used for grading
+
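As a rough illustration of the action schema above, the following sketch builds and validates an action payload before it is sent. `build_action` is a hypothetical helper, not part of the project. The allowed values are copied from this README, and omitting optional keys when they are unset is an assumption about what the server accepts.

```python
from typing import Optional

# Allowed values copied from the README's action table and violation list.
ACTION_TYPES = {
    "approve", "warn", "remove", "ban_temporary", "ban_permanent", "escalate",
}
VIOLATION_TYPES = {
    "spam", "hate_speech", "harassment", "misinformation", "csam",
    "illegal_services", "doxxing", "self_harm_risk", "none",
}


def build_action(
    action_type: str,
    violation_type: Optional[str] = None,
    reasoning: Optional[str] = None,
) -> dict:
    """Build an action dict, rejecting values the README does not list."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    if violation_type is not None and violation_type not in VIOLATION_TYPES:
        raise ValueError(f"unknown violation_type: {violation_type!r}")

    action = {"action_type": action_type}
    # Optional fields are left out when unset (assumed to be acceptable).
    if violation_type is not None:
        action["violation_type"] = violation_type
    if reasoning is not None:
        action["reasoning"] = reasoning
    return action
```

Validating on the client keeps malformed model output (for example, an invented action name) from costing a step in the episode.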
+ ---
+
+ ## Tasks
+
+ ### Task 1: Binary Content Moderation (Easy)
+ - **Posts**: 8 | **Max steps**: 12
+ - **Actions**: `approve` or `remove` only
+ - Posts contain clear, unambiguous signals: obvious spam, explicit slurs, and direct threats on one side; cooking tips and community announcements on the other
+ - **Scoring**: Binary exact match: 1.0 if correct, 0.0 if wrong. The task score is the mean over posts.
+ - **Expected baseline score**: ~0.75
+
+ ### Task 2: Tiered Policy Enforcement (Medium)
+ - **Posts**: 10 | **Max steps**: 18
+ - **Actions**: `approve`, `warn`, `remove`, `ban_temporary`, `ban_permanent`
+ - Includes edge cases: harsh-but-legal film criticism, first vs. repeat offenders, political speech, zero-tolerance violations
+ - **Scoring**: Action distance score (70%) + violation type identification (30%). Partial credit for being one level off.
+ - **Expected baseline score**: ~0.55
+
+ ### Task 3: Full Queue Management with Context & Appeals (Hard)
+ - **Posts**: 12 | **Max steps**: 24
+ - **Actions**: All 6, including `escalate`
+ - Requires: reading author history (5+ violations → permanent ban), thread context (gaming slang ≠ threat), crisis detection (suicidal ideation → escalate, don't remove), appeal handling
+ - **Scoring**: Action score (50%) + context-aware bonus (30%) + violation type (20%)
+ - **Expected baseline score**: ~0.40
+
+ ---
+
+ ## Reward Function
+
+ - **Per-step, non-sparse**: every post scores independently (0.0 to 1.0)
+ - **Partial credit**: being one action-level off (e.g., `warn` when `remove` is correct) scores ~0.65 instead of 0
+ - **Context bonus** (hard task): +0.3 for posts where correct answer requires author history or thread context
+ - **Episode score**: mean of all per-post scores
+
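The scoring rules above can be sketched as follows. This is not the grader from `environment/tasks.py`. It assumes the severity ladder follows the order of the action table, that two or more levels off scores 0.0, and that `escalate` only scores on an exact match. The README confirms only the 1.0 and ~0.65 values, the Task 2 weights (70% action, 30% violation type), and the mean for the episode score.

```python
# Severity ladder taken from the order of the action table; treating it as the
# grader's notion of "action level" is an assumption.
SEVERITY = ["approve", "warn", "remove", "ban_temporary", "ban_permanent"]


def action_score(predicted: str, expected: str) -> float:
    """1.0 for an exact match, 0.65 for one level off, otherwise 0.0 (assumed)."""
    if predicted == expected:
        return 1.0
    if predicted in SEVERITY and expected in SEVERITY:
        distance = abs(SEVERITY.index(predicted) - SEVERITY.index(expected))
        if distance == 1:
            return 0.65
    # `escalate` sits outside the ladder here, so it only scores on exact match.
    return 0.0


def medium_post_score(
    predicted: str,
    expected: str,
    predicted_violation: str,
    expected_violation: str,
) -> float:
    """Task 2 weighting from the README: 70% action, 30% violation type."""
    violation = 1.0 if predicted_violation == expected_violation else 0.0
    return 0.7 * action_score(predicted, expected) + 0.3 * violation


def episode_score(post_scores: list[float]) -> float:
    """Episode score is the mean of the per-post scores."""
    return sum(post_scores) / len(post_scores)
```

Under these assumptions, choosing `warn` where `remove` is correct while naming the right violation type would score 0.7 × 0.65 + 0.3 × 1.0 = 0.755 on a Task 2 post.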
+ ---
+
+ ## API Endpoints
+
+ | Method | Path | Description |
+ |---|---|---|
+ | `GET` | `/health` | Liveness check |
+ | `GET` | `/tasks` | List all tasks with metadata |
+ | `POST` | `/reset?task_id=task_easy` | Start new episode, returns first Observation |
+ | `POST` | `/step` | Submit action, returns StepResult |
+ | `GET` | `/state` | Current environment state snapshot |
+
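A minimal client for the endpoints above might look like the sketch below, which uses only the Python standard library. It assumes that `/step` accepts the action JSON from the Action Space section as its request body and that responses are JSON. Neither detail is spelled out in this README, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def reset_request(base_url: str, task_id: str) -> urllib.request.Request:
    """Build the POST /reset request; task_id travels in the query string."""
    query = urllib.parse.urlencode({"task_id": task_id})
    return urllib.request.Request(
        f"{base_url}/reset?{query}", data=b"", method="POST"
    )


def step_request(base_url: str, action: dict) -> urllib.request.Request:
    """Build the POST /step request; sending the action as a JSON body is assumed."""
    body = json.dumps(action).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a prepared request and decode the (assumed) JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running locally, `send(reset_request("http://localhost:7860", "task_easy"))` would start an episode, and `send(step_request("http://localhost:7860", {"action_type": "approve"}))` would submit one decision.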
+ ---
+
+ ## Setup & Usage
+
+ ### Local Development
+
+ ```bash
+ # Clone / navigate to project
+ cd content-moderation-env
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Start the server
+ uvicorn app:app --host 0.0.0.0 --port 7860 --reload
+ ```
+
+ ### Docker
+
+ ```bash
+ docker build -t content-moderation-env .
+ docker run -p 7860:7860 content-moderation-env
+ ```
+
+ ### Run Baseline Inference
+
+ ```bash
+ export API_BASE_URL="https://api-inference.huggingface.co/v1"
+ export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
+ export HF_TOKEN="hf_your_token_here"
+ export ENV_BASE_URL="http://localhost:7860"
+
+ python inference.py
+ ```
+
+ ---
+
+ ## Baseline Scores
+
+ Measured using `meta-llama/Meta-Llama-3-8B-Instruct` (temperature=0):
+
+ | Task | Score | Difficulty |
+ |---|---|---|
+ | task_easy | ~0.750 | Easy |
+ | task_medium | ~0.551 | Medium |
+ | task_hard | ~0.403 | Hard |
+ | **Overall** | **~0.568** | - |
+
+ *Scores are reproducible at temperature=0.*
+
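The README does not say how the overall figure is computed. The numbers in the table are consistent with an unweighted mean of the three task scores, which the short check below reproduces under that assumption.

```python
# Task scores copied from the baseline table.
task_scores = {"task_easy": 0.750, "task_medium": 0.551, "task_hard": 0.403}

# Assumption: "Overall" is the unweighted mean across tasks.
overall = sum(task_scores.values()) / len(task_scores)
print(f"{overall:.3f}")  # prints 0.568
```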
+ ---
+
+ ## Project Structure
+
+ ```
+ content-moderation-env/
+ ├── openenv.yaml          # OpenEnv spec metadata
+ ├── Dockerfile            # HF Spaces / Docker deployment
+ ├── requirements.txt      # Python dependencies
+ ├── inference.py          # Baseline agent script (OpenAI client)
+ ├── app.py                # FastAPI server (reset/step/state endpoints)
+ ├── README.md
+ └── environment/
+     ├── __init__.py
+     ├── models.py         # Pydantic: Observation, Action, Reward, StepResult
+     ├── env.py            # ContentModerationEnv class
+     ├── tasks.py          # Task definitions + deterministic graders
+     └── data/
+         └── posts.json    # 30 labeled posts with ground truth
+ ```
+
+ ---
+
+ ## HF Spaces Deployment
+
+ This environment is deployed as a Hugging Face Space tagged with `openenv`.
+
+ The Space exposes the full OpenEnv HTTP API. Set the following secrets in your Space settings:
+
+ ```
+ API_BASE_URL   # LLM endpoint
+ MODEL_NAME     # Model to use for inference
+ HF_TOKEN       # Your Hugging Face API token
+ ```