Imaginephoenix committed on
Commit
d547d85
·
verified ·
1 Parent(s): f6339e7

Upload README.md

Files changed (1)
  1. README.md +506 -6
README.md CHANGED
@@ -1,11 +1,511 @@
1
  ---
2
- title: Openenv1
3
- emoji: 🌖
4
- colorFrom: red
5
- colorTo: gray
6
  sdk: docker
 
7
  pinned: false
8
- license: mit
9
  ---
10
 
11
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
  ---
2
+ title: OpenEnv Email Triage Environment
3
+ emoji: 📬
4
+ colorFrom: blue
5
+ colorTo: blue
6
  sdk: docker
7
+ app_port: 7860
8
  pinned: false
 
9
  ---
10
 
11
+ # OpenEnv Email Triage Environment
12
+
13
+ A real-world AI agent training environment that simulates professional email triage.
14
+ Built to the OpenEnv specification for standardized agent evaluation and benchmarking.
15
+
16
+ - **Status:** In Development
17
+ - **Domain:** Email Triage
18
+ - **Deployment:** Hugging Face Spaces (Docker)
19
+
20
+ ---
21
+
22
+ ## Table of Contents
23
+
24
+ - [What Is This?](#what-is-this)
25
+ - [Who Is This For?](#who-is-this-for)
26
+ - [Observation Space](#observation-space)
27
+ - [Action Space](#action-space)
28
+ - [Tasks](#tasks)
29
+ - [Reward Function](#reward-function)
30
+ - [Quick Start](#quick-start)
31
+ - [Running Inference](#running-inference)
32
+ - [Inference Architecture](#inference-architecture)
33
+ - [Score Table](#score-table)
34
+ - [Docker Deployment](#docker-deployment)
35
+ - [Hugging Face Space](#hugging-face-space)
36
+ - [Pre-Submission Validation](#pre-submission-validation)
37
+ - [API Reference](#api-reference)
38
+ - [Project Structure](#project-structure)
39
+ - [Known Limitations](#known-limitations)
40
+ - [Contributing](#contributing)
41
+ - [License](#license)
42
+
43
+ ---
44
+
45
+ ## What Is This?
46
+
47
+ This environment simulates a professional email inbox where an AI agent must:
48
+
49
+ 1. Read incoming emails with realistic metadata (sender, subject, body, thread history).
50
+ 2. Classify each email with the correct priority label.
51
+ 3. Route each email to the appropriate department or person.
52
+ 4. Summarize the email's key information.
53
+
54
+ Think of it as OpenAI Gym for office work. Instead of balancing a pole, the agent triages an
55
+ inbox. The environment provides structured observations, accepts structured actions, and
56
+ returns graded rewards with partial credit.
57
+
58
+ Every decision is scored by a deterministic programmatic grader: no LLM-as-judge,
59
+ no randomness, fully reproducible.
60
+
61
+ ---
62
+
63
+ ## Who Is This For?
64
+
65
+ | Audience | Use Case |
66
+ |---|---|
67
+ | AI Safety Researchers | Measure agent behavior on realistic tasks with known ground truth |
68
+ | LLM Agent Developers | Benchmark models and prompting strategies on real-world work |
69
+ | RL Researchers | Train agents with shaped rewards in a professional task environment |
70
+ | Companies | Evaluate LLM agents before deploying them to handle real email |
71
+
72
+ ---
73
+
74
+ ## Observation Space
75
+
76
+ What the agent sees at each step:
77
+
78
+ | Field | Type | Description |
79
+ |---|---|---|
80
+ | `email_id` | `str` | Unique identifier for this email |
81
+ | `subject` | `str` | Email subject line |
82
+ | `body` | `str` | Full email body text |
83
+ | `sender` | `str` | Sender's email address |
84
+ | `timestamp` | `str` | ISO 8601 timestamp of when the email was received |
85
+ | `thread_history` | `list[str]` | Previous messages in the email thread (may be empty) |
86
+ | `task_id` | `str` | Which task is currently active |
87
+ | `step_number` | `int` | Current step in the episode (0-indexed) |
88
+ | `total_emails` | `int` | Total number of emails to process in this task |
89
+
90
+ The observation never contains the correct answer. The agent must reason from email content.
91
+
92
+ ---
93
+
94
+ ## Action Space
95
+
96
+ What the agent must output at each step:
97
+
98
+ | Field | Type | Allowed Values | Description |
99
+ |---|---|---|---|
100
+ | `label` | `Literal` | `"urgent"`, `"normal"`, `"spam"`, `"archive"` | Priority classification |
101
+ | `summary` | `str` | Free text | Brief summary of the email's content and intent |
102
+ | `route_to` | `str` | Free text (e.g. `"billing"`, `"safety"`, `"engineering"`) | Department or person to route the email to |
103
+
104
+ ### Example action JSON
105
+
106
+ ```json
107
+ {
108
+   "label": "urgent",
109
+   "summary": "Customer reports a safety issue with product overheating",
110
+   "route_to": "safety"
111
+ }
112
+ ```
113
+
114
+ ---
115
+
116
+ ## Tasks
117
+
118
+ Each task now contains multiple deterministic scenario variants. By default, `/reset`
119
+ cycles through the public scenario pool for the selected task.
120
+
121
+ Private evaluation split selection is controlled server-side via environment
122
+ configuration (`OPENENV_EVAL_SPLIT`), and client-side override can be disabled
123
+ to preserve benchmark integrity.
124
+
125
+ To keep private evaluation data out of source control, supply hidden scenarios at
126
+ runtime using `OPENENV_PRIVATE_SCENARIOS_JSON` (JSON object keyed by task id).
127
+
128
+ Example deployment configuration:
129
+
130
+ ```bash
131
+ export OPENENV_EVAL_SPLIT="private_eval"
132
+ export OPENENV_ALLOW_CLIENT_EVAL_OVERRIDE="false"
133
+ export OPENENV_PRIVATE_SCENARIOS_JSON='{"task_easy":[{"scenario_id":"easy-private-001","emails":[{"email_id":"easy-p-001","subject":"Private billing exception","body":"Please correct invoice mismatch for contract addendum B-7 before end of day.","sender":"contracts@partner.example","timestamp":"2026-04-03T09:00:00Z","thread_history":["Customer requested corrected invoice reference."]}],"ground_truth":[{"label":"normal","route_to":"billing","priority_weight":1.0,"summary_keywords":["invoice mismatch","contract addendum","correct"]}]}],"task_medium":[],"task_hard":[]}'
134
+ ```
135
+
136
+ Notes:
137
+
138
+ - Keep this value in deployment secrets or runtime environment config.
139
+ - Use valid JSON with double quotes only.
140
+ - You can provide multiple scenarios per task by adding more objects to each task list.
141
+
142
+ ### Task 1 - Easy (`task_easy`)
143
+
144
+ Objective: Correctly classify a single unambiguous email.
145
+
146
+ Scoring:
147
+
148
+ - Correct label: 1.0
149
+ - Wrong label but correct routing: 0.3
150
+ - Everything wrong: 0.0
151
+
152
+ ### Task 2 - Medium (`task_medium`)
153
+
154
+ Objective: Triage a queue of 5 emails with mixed priority signals.
155
+
156
+ Scoring:
157
+
158
+ - Each email scored individually
159
+ - Score = (correct labels / total emails) * priority weight factor
160
+ - Higher-priority misclassifications are penalized more heavily
161
+ - Final score = weighted mean of all individual scores
162
+
163
+ ### Task 3 - Hard (`task_hard`)
164
+
165
+ Objective: Handle a complex complaint that crosses multiple categories.
166
+
167
+ Scoring:
168
+
169
+ - Escalated to safety: 0.4 weight
170
+ - Correct routing: 0.3 weight
171
+ - Marked as urgent: 0.3 weight
172
+ - Penalty: -0.2 if marked as spam
173
+ - Final score = weighted sum of sub-scores (clipped to 0.0 minimum)
174
+
175
+ ### Task 4 - Production (`task_production`)
176
+
177
+ Objective: Simulate a production inbox with mixed operational load across safety,
178
+ engineering, billing, support, spam, and low-priority traffic.
179
+
180
+ Scoring:
181
+
182
+ - Per-email weighted scoring by business priority
183
+ - Route-noise penalty when actions route to too many teams
184
+ - Summary quality based on contextual evidence keywords and anti-stuffing rules
185
+ - Deterministic escalation follow-ups are inserted when critical triage is missed
186
+ - Runtime controls available via `/reset` payload for production simulations:
187
+   - `production_profile`: `light` | `standard` | `heavy`
188
+   - `business_hours_mode`: `true` | `false`
189
+   - `escalation_mode`: `low` | `normal` | `high`
190
+
191
+ ---
192
+
193
+ ## Reward Function
194
+
195
+ The reward function provides dense training signal at every step, not just binary pass/fail.
196
+
197
+ ### Formula
198
+
199
+ ```text
200
+ final_reward = base_score + progress_signal + trajectory_bonus - penalties - step_cost
201
+ ```
202
+
203
+ ### Components
204
+
205
+ | Component | Value | Condition |
206
+ |---|---|---|
207
+ | Base score | 0.0-1.0 | Raw grader score for the current step |
208
+ | Progress signal | ~0.00 to ~0.13 | Partial credit for advancing queue, quality, and positive trend |
209
+ | Step cost | ~-0.005 to ~-0.015 | Gentle efficiency pressure over longer episodes |
210
+ | Trajectory bonus | +0.2 | If all tasks completed with mean score > 0.8 |
211
+ | Archive quality penalty | -0.5 | Archive action with an underspecified summary |
212
+ | Loop detection penalty | -0.3 | Same action repeated 3+ times consecutively |
213
+
214
+ The final reward is clipped to [-1.0, 1.0] before being returned.
215
+
216
+ ---
217
+
218
+ ## Quick Start
219
+
220
+ ### Prerequisites
221
+
222
+ - Python 3.11+
223
+ - API endpoint, model name, and token for inference
224
+
225
+ ### Installation
226
+
227
+ ```bash
228
+ pip install -r requirements.txt
229
+ export API_BASE_URL="https://router.huggingface.co/v1"
230
+ export MODEL_NAME="gpt-4o"
231
+ export HF_TOKEN="your-token-here"
232
+ ```
233
+
234
+ ### Run the environment locally
235
+
236
+ ```bash
237
+ python server.py
238
+
239
+ curl -X POST http://localhost:7860/reset \
240
+ -H "Content-Type: application/json" \
241
+ -d '{"task_id": "task_easy"}'
242
+
243
+ curl -X POST http://localhost:7860/step \
244
+ -H "Content-Type: application/json" \
245
+ -d '{"label": "urgent", "summary": "Test", "route_to": "billing"}'
246
+
247
+ curl -X POST http://localhost:7860/state
248
+ ```
249
+
250
+ ---
251
+
252
+ ## Running Inference
253
+
254
+ ```bash
255
+ python inference.py --task all
256
+ python inference.py --task 1
257
+ python inference.py --task 4 --production-profile heavy --business-hours-mode --escalation-mode high
258
+ ```
259
+
260
+ The script reads API settings from environment variables and uses fallback actions when
261
+ model output is unparseable, so episodes still complete.
262
+
263
+ ---
264
+
265
+ ## Inference Architecture
266
+
267
+ The inference script (`inference.py`) follows this loop:
268
+
269
+ ```text
270
+ 1. Initialize OpenAI client + environment
271
+ 2. reset() to get first observation
272
+ 3. Loop until done or MAX_STEPS:
273
+    - Build prompt from observation + history
274
+    - Call LLM with OpenAI client (catch request errors)
275
+    - Parse response into action (fallback on parse failure)
276
+    - env.step(action)
277
+    - Record reward and history
278
+ 4. Print score table
279
+ ```
280
+
281
+ ### Environment Variables Required
282
+
283
+ ```bash
284
+ export API_BASE_URL="https://router.huggingface.co/v1"
285
+ export MODEL_NAME="gpt-4o"
286
+ export HF_TOKEN="your-token-here"
287
+ export INFERENCE_RUNTIME_BUDGET_SECONDS="1140"
288
+ export INFERENCE_REQUEST_TIMEOUT_SECONDS="12"
289
+ ```
290
+
291
+ Runtime controls:
292
+
293
+ - `INFERENCE_RUNTIME_BUDGET_SECONDS` caps the wall-clock runtime of the whole script (default 1140 s, i.e. 19 minutes).
294
+ - `INFERENCE_REQUEST_TIMEOUT_SECONDS` sets the timeout for each LLM request (default 12 s).
295
+ - Equivalent CLI flags: `--runtime-budget-seconds` and `--request-timeout-seconds`.
296
+
297
+ Fallback behavior when parsing fails:
298
+
299
+ ```json
300
+ {"label": "normal", "summary": "Unable to parse response", "route_to": "general"}
301
+ ```
302
+
303
+ ---
304
+
305
+ ## Score Table
306
+
307
+ Placeholder until inference is run.
308
+
309
+ | Model | Task 1 (Easy) | Task 2 (Medium) | Task 3 (Hard) | Mean |
310
+ |---|---|---|---|---|
311
+ | MODEL_NAME | TBD | TBD | TBD | TBD |
312
+
313
+ Expected rough ranges:
314
+
315
+ - GPT-4o: 0.8-1.0 on easy, 0.5-0.8 on medium, 0.4-0.7 on hard
316
+
317
+ ---
318
+
319
+ ## Docker Deployment
320
+
321
+ ```bash
322
+ docker build -t email-triage-env .
323
+ docker run -p 7860:7860 email-triage-env
324
+
325
+ curl -X POST http://localhost:7860/reset \
326
+ -H "Content-Type: application/json" \
327
+ -d '{"task_id": "task_easy"}'
328
+ ```
329
+
330
+ For Apple Silicon:
331
+
332
+ ```bash
333
+ docker build --platform linux/amd64 -t email-triage-env .
334
+ ```
335
+
336
+ ---
337
+
338
+ ## Hugging Face Space
339
+
340
+ Live URL placeholder:
341
+
342
+ `https://huggingface.co/spaces/YOUR_USERNAME/email-triage-env`
343
+
344
+ The Space homepage (`/`) now serves a lightweight interactive triage console for
345
+ manual testing. Machine-readable service metadata is available at `GET /meta`.
346
+
347
+ Example interaction:
348
+
349
+ ```bash
350
+ export SPACE_URL="https://YOUR_USERNAME-email-triage-env.hf.space"
351
+
352
+ curl -X POST "$SPACE_URL/reset" \
353
+ -H "Content-Type: application/json" \
354
+ -d '{"task_id": "task_easy"}'
355
+ ```
356
+
357
+ ---
358
+
359
+ ## Pre-Submission Validation
360
+
361
+ Run the validator before submitting your environment.
362
+
363
+ ```bash
364
+ chmod +x validate-submission.sh
365
+ ./validate-submission.sh https://YOUR_USERNAME-email-triage-env.hf.space .
366
+ ```
367
+
368
+ The script checks:
369
+
370
+ - HF Space `/reset` health (HTTP 200 expected)
371
+ - Docker build success
372
+ - `openenv validate` pass status
373
+
374
+ ---
375
+
376
+ ## API Reference
377
+
378
+ ### POST /reset
379
+
380
+ Request:
381
+
382
+ ```json
383
+ {"task_id": "task_easy"}
384
+ ```
385
+
386
+ Response:
387
+
388
+ ```json
389
+ {
390
+   "observation": {
391
+     "email_id": "easy-001",
392
+     "subject": "Quarterly invoice available",
393
+     "body": "...",
394
+     "sender": "accounts@vendor-example.com",
395
+     "timestamp": "2026-03-25T09:15:00Z",
396
+     "thread_history": ["..."],
397
+     "task_id": "task_easy",
398
+     "step_number": 0,
399
+     "total_emails": 1
400
+   },
401
+   "info": {"task_id": "task_easy", "step": 0}
402
+ }
403
+ ```
404
+
405
+ ### POST /step
406
+
407
+ Request:
408
+
409
+ ```json
410
+ {
411
+   "label": "urgent",
412
+   "summary": "Customer needs immediate help",
413
+   "route_to": "support"
414
+ }
415
+ ```
416
+
417
+ Response:
418
+
419
+ ```json
420
+ {
421
+   "observation": {},
422
+   "reward": 0.85,
423
+   "done": false,
424
+   "info": {"step": 1, "task_id": "task_easy"}
425
+ }
426
+ ```
427
+
428
+ ### POST /state
429
+
430
+ No request body required.
431
+
432
+ Response: `EnvironmentState` JSON object.
433
+
434
+ ---
435
+
436
+ ## Project Structure
437
+
438
+ ```text
439
+ .
440
+ ├── models.py
441
+ ├── tasks.py
442
+ ├── graders.py
443
+ ├── environment.py
444
+ ├── server.py
445
+ ├── server/
446
+ │   └── app.py
447
+ ├── inference.py
448
+ ├── openenv.yaml
449
+ ├── Dockerfile
450
+ ├── requirements.txt
451
+ ├── pyproject.toml
452
+ ├── uv.lock
453
+ ├── validate-submission.sh
454
+ ├── README.md
455
+ └── RULES.md
456
+ ```
457
+
458
+ ---
459
+
460
+ ## Known Limitations
461
+
462
+ | Limitation | Impact |
463
+ |---|---|
464
+ | Static scenario pools | No live inbox ingestion from production systems |
465
+ | Single-agent server instance | Concurrent agents can conflict |
466
+ | No live thread simulation | Thread history is static |
467
+ | English-only content | No multilingual coverage |
468
+ | No attachments | Text-only triage |
469
+ | Simplified routing | No org chart or availability modeling |
470
+ | Limited temporal dynamics | Production task can generate deterministic escalations, but not full live message streams |
471
+ | Rule-based grading edge cases | Decisions a human reviewer would treat as equivalent may receive different scores |
472
+
473
+ What an agent cannot exploit:
474
+
475
+ - The correct answer is never present in observations
476
+ - The grader is a pure function and cannot be manipulated
477
+ - Step penalty cannot be bypassed except by efficient actions
478
+
479
+ ---
480
+
481
+ ## Summary of Revision 2 Changes
482
+
483
+ | What Changed | Before | After | Why |
484
+ |---|---|---|---|
485
+ | Return type of step() | tuple | StepResult object | Match sample result.observation pattern |
486
+ | Return type of reset() | EmailObservation | ResetResult object | Match sample result.observation pattern |
487
+ | New models | 4 models | 6 models (+StepResult, +ResetResult) | Match sample interface |
488
+ | API key reading | OPENAI_API_KEY style | HF_TOKEN or API_KEY via os.getenv | Match sample fallback pattern |
489
+ | Temperature guidance | 0 | 0.2 | Match sample behavior |
490
+ | Response parsing | JSON-only assumption | Text parsing with fallback action | Robustness to non-JSON model output |
491
+ | History tracking | Optional | Mandatory | Match sample architecture |
492
+ | Step cap | Not explicit | MAX_STEPS constant | Runtime safety and reproducibility |
493
+
494
+ ---
495
+
496
+ ## Contributing
497
+
498
+ Read `RULES.md` before contributing.
499
+
500
+ Key constraints:
501
+
502
+ - Type hints and Pydantic models required
503
+ - No extra dependencies without explicit approval
504
+ - No features beyond project brief
505
+ - Graders must remain deterministic pure functions
506
+
507
+ ---
508
+
509
+ ## License
510
+
511
+ MIT License.