Upload README.md
README.md
CHANGED
|
@@ -1,11 +1,511 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
sdk: docker
|
|
|
|
| 7 |
pinned: false
|
| 8 |
-
license: mit
|
| 9 |
---
|
| 10 |
|
| 11 |
-
|
| 1 |
---
|
| 2 |
+
title: OpenEnv Email Triage Environment
|
| 3 |
+
emoji: 📬
|
| 4 |
+
colorFrom: blue
|
| 5 |
+
colorTo: blue
|
| 6 |
sdk: docker
|
| 7 |
+
app_port: 7860
|
| 8 |
pinned: false
|
|
|
|
| 9 |
---
|
| 10 |
|
| 11 |
+
# OpenEnv Email Triage Environment
|
| 12 |
+
|
| 13 |
+
A real-world AI agent training environment that simulates professional email triage.
|
| 14 |
+
Built to the OpenEnv specification for standardized agent evaluation and benchmarking.
|
| 15 |
+
|
| 16 |
+
- **Status:** In Development
|
| 17 |
+
- **Domain:** Email Triage
|
| 18 |
+
- **Deployment:** Hugging Face Spaces (Docker)
|
| 19 |
+
|
| 20 |
+
---
|
| 21 |
+
|
| 22 |
+
## Table of Contents
|
| 23 |
+
|
| 24 |
+
- [What Is This?](#what-is-this)
|
| 25 |
+
- [Who Is This For?](#who-is-this-for)
|
| 26 |
+
- [Observation Space](#observation-space)
|
| 27 |
+
- [Action Space](#action-space)
|
| 28 |
+
- [Tasks](#tasks)
|
| 29 |
+
- [Reward Function](#reward-function)
|
| 30 |
+
- [Quick Start](#quick-start)
|
| 31 |
+
- [Running Inference](#running-inference)
|
| 32 |
+
- [Inference Architecture](#inference-architecture)
|
| 33 |
+
- [Score Table](#score-table)
|
| 34 |
+
- [Docker Deployment](#docker-deployment)
|
| 35 |
+
- [Hugging Face Space](#hugging-face-space)
|
| 36 |
+
- [Pre-Submission Validation](#pre-submission-validation)
|
| 37 |
+
- [API Reference](#api-reference)
|
| 38 |
+
- [Project Structure](#project-structure)
|
| 39 |
+
- [Known Limitations](#known-limitations)
|
| 40 |
+
- [Contributing](#contributing)
|
| 41 |
+
- [License](#license)
|
| 42 |
+
|
| 43 |
+
---
|
| 44 |
+
|
| 45 |
+
## What Is This?
|
| 46 |
+
|
| 47 |
+
This environment simulates a professional email inbox where an AI agent must:
|
| 48 |
+
|
| 49 |
+
1. Read incoming emails with realistic metadata (sender, subject, body, thread history).
|
| 50 |
+
2. Classify each email with the correct priority label.
|
| 51 |
+
3. Route each email to the appropriate department or person.
|
| 52 |
+
4. Summarize the email's key information.
|
| 53 |
+
|
| 54 |
+
Think of it as OpenAI Gym for office work. Instead of balancing a pole, the agent triages an
|
| 55 |
+
inbox. The environment provides structured observations, accepts structured actions, and
|
| 56 |
+
returns graded rewards with partial credit.
|
| 57 |
+
|
| 58 |
+
Every decision is scored by a deterministic programmatic grader: no LLM-as-judge,
|
| 59 |
+
no randomness, fully reproducible.
|
| 60 |
+
|
| 61 |
+
---
|
| 62 |
+
|
| 63 |
+
## Who Is This For?
|
| 64 |
+
|
| 65 |
+
| Audience | Use Case |
|
| 66 |
+
|---|---|
|
| 67 |
+
| AI Safety Researchers | Measure agent behavior on realistic tasks with known ground truth |
|
| 68 |
+
| LLM Agent Developers | Benchmark models and prompting strategies on real-world work |
|
| 69 |
+
| RL Researchers | Train agents with shaped rewards in a professional task environment |
|
| 70 |
+
| Companies | Evaluate LLM agents before deploying them to handle real email |
|
| 71 |
+
|
| 72 |
+
---
|
| 73 |
+
|
| 74 |
+
## Observation Space
|
| 75 |
+
|
| 76 |
+
What the agent sees at each step:
|
| 77 |
+
|
| 78 |
+
| Field | Type | Description |
|
| 79 |
+
|---|---|---|
|
| 80 |
+
| `email_id` | `str` | Unique identifier for this email |
|
| 81 |
+
| `subject` | `str` | Email subject line |
|
| 82 |
+
| `body` | `str` | Full email body text |
|
| 83 |
+
| `sender` | `str` | Sender's email address |
|
| 84 |
+
| `timestamp` | `str` | ISO 8601 timestamp of when the email was received |
|
| 85 |
+
| `thread_history` | `list[str]` | Previous messages in the email thread (may be empty) |
|
| 86 |
+
| `task_id` | `str` | Which task is currently active |
|
| 87 |
+
| `step_number` | `int` | Current step in the episode (0-indexed) |
|
| 88 |
+
| `total_emails` | `int` | Total number of emails to process in this task |
|
| 89 |
+
|
| 90 |
+
The observation never contains the correct answer. The agent must reason from email content.
|
| 91 |
+
|
| 92 |
+
---
|
| 93 |
+
|
| 94 |
+
## Action Space
|
| 95 |
+
|
| 96 |
+
What the agent must output at each step:
|
| 97 |
+
|
| 98 |
+
| Field | Type | Allowed Values | Description |
|
| 99 |
+
|---|---|---|---|
|
| 100 |
+
| `label` | `Literal` | `"urgent"`, `"normal"`, `"spam"`, `"archive"` | Priority classification |
|
| 101 |
+
| `summary` | `str` | Free text | Brief summary of the email's content and intent |
|
| 102 |
+
| `route_to` | `str` | Free text (`"billing"`, `"safety"`, `"engineering"`) | Department or person |
|
| 103 |
+
|
| 104 |
+
### Example action JSON
|
| 105 |
+
|
| 106 |
+
```json
|
| 107 |
+
{
|
| 108 |
+
"label": "urgent",
|
| 109 |
+
"summary": "Customer reports a safety issue with product overheating",
|
| 110 |
+
"route_to": "safety"
|
| 111 |
+
}
|
| 112 |
+
```
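
The action schema above can be checked before an action is sent to the environment. The following is a minimal sketch using only the standard library; `TriageAction` is a hypothetical stand-in, not the project's actual Pydantic model, and the non-empty checks on `summary` and `route_to` are assumptions rather than documented rules.

```python
from dataclasses import dataclass
from typing import Literal, get_args

# The four labels listed in the Action Space table.
Label = Literal["urgent", "normal", "spam", "archive"]
ALLOWED_LABELS = get_args(Label)


@dataclass(frozen=True)
class TriageAction:
    """Hypothetical stand-in for the project's action model."""

    label: str
    summary: str
    route_to: str

    def __post_init__(self) -> None:
        if self.label not in ALLOWED_LABELS:
            raise ValueError(f"label must be one of {ALLOWED_LABELS}, got {self.label!r}")
        # Assumption: empty free-text fields are treated as invalid.
        if not self.summary.strip():
            raise ValueError("summary must not be empty")
        if not self.route_to.strip():
            raise ValueError("route_to must not be empty")


action = TriageAction(
    label="urgent",
    summary="Customer reports a safety issue with product overheating",
    route_to="safety",
)
```

Rejecting an unknown label on the client side avoids spending an environment step on an action the server would presumably refuse.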
|
| 113 |
+
|
| 114 |
+
---
|
| 115 |
+
|
| 116 |
+
## Tasks
|
| 117 |
+
|
| 118 |
+
Each task now contains multiple deterministic scenario variants. By default, `/reset`
|
| 119 |
+
cycles through the public scenario pool for the selected task.
|
| 120 |
+
|
| 121 |
+
Private evaluation split selection is controlled server-side via environment
|
| 122 |
+
configuration (`OPENENV_EVAL_SPLIT`), and client-side override can be disabled
|
| 123 |
+
to preserve benchmark integrity.
|
| 124 |
+
|
| 125 |
+
To keep private evaluation data out of source control, supply hidden scenarios at
|
| 126 |
+
runtime using `OPENENV_PRIVATE_SCENARIOS_JSON` (JSON object keyed by task id).
|
| 127 |
+
|
| 128 |
+
Example deployment configuration:
|
| 129 |
+
|
| 130 |
+
```bash
|
| 131 |
+
export OPENENV_EVAL_SPLIT="private_eval"
|
| 132 |
+
export OPENENV_ALLOW_CLIENT_EVAL_OVERRIDE="false"
|
| 133 |
+
export OPENENV_PRIVATE_SCENARIOS_JSON='{"task_easy":[{"scenario_id":"easy-private-001","emails":[{"email_id":"easy-p-001","subject":"Private billing exception","body":"Please correct invoice mismatch for contract addendum B-7 before end of day.","sender":"contracts@partner.example","timestamp":"2026-04-03T09:00:00Z","thread_history":["Customer requested corrected invoice reference."]}],"ground_truth":[{"label":"normal","route_to":"billing","priority_weight":1.0,"summary_keywords":["invoice mismatch","contract addendum","correct"]}]}],"task_medium":[],"task_hard":[]}'
|
| 134 |
+
```
|
| 135 |
+
|
| 136 |
+
Notes:
|
| 137 |
+
|
| 138 |
+
- Keep this value in deployment secrets or runtime environment config.
|
| 139 |
+
- Use valid JSON with double quotes only.
|
| 140 |
+
- You can provide multiple scenarios per task by adding more objects to each task list.
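
Because the private scenario payload is a single long JSON string, it may help to sanity-check it before deployment. The sketch below is an assumption about what a reasonable check looks like, not the server's own validation: `check_private_scenarios` is a hypothetical helper, and the rule that each scenario carries one `ground_truth` entry per email is inferred from the example above.

```python
import json

# Email fields taken from the example payload above.
EMAIL_FIELDS = {"email_id", "subject", "body", "sender", "timestamp", "thread_history"}


def check_private_scenarios(raw: str) -> dict:
    """Parse the JSON string and run basic shape checks (hypothetical helper)."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("top level must be a JSON object keyed by task id")
    for task_id, scenarios in data.items():
        if not isinstance(scenarios, list):
            raise ValueError(f"{task_id}: expected a list of scenarios")
        for scenario in scenarios:
            emails = scenario["emails"]
            truth = scenario["ground_truth"]
            # Assumption: exactly one ground-truth entry per email.
            if len(emails) != len(truth):
                raise ValueError(f"{scenario['scenario_id']}: emails/ground_truth length mismatch")
            for email in emails:
                missing = EMAIL_FIELDS - email.keys()
                if missing:
                    raise ValueError(f"{email.get('email_id')}: missing {sorted(missing)}")
    return data


sample = json.dumps({
    "task_easy": [{
        "scenario_id": "easy-private-001",
        "emails": [{
            "email_id": "easy-p-001",
            "subject": "Private billing exception",
            "body": "Please correct invoice mismatch.",
            "sender": "contracts@partner.example",
            "timestamp": "2026-04-03T09:00:00Z",
            "thread_history": [],
        }],
        "ground_truth": [{
            "label": "normal",
            "route_to": "billing",
            "priority_weight": 1.0,
            "summary_keywords": ["invoice mismatch"],
        }],
    }],
    "task_medium": [],
    "task_hard": [],
})
checked = check_private_scenarios(sample)
```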
|
| 141 |
+
|
| 142 |
+
### Task 1: Easy (`task_easy`)
|
| 143 |
+
|
| 144 |
+
Objective: Correctly classify a single unambiguous email.
|
| 145 |
+
|
| 146 |
+
Scoring:
|
| 147 |
+
|
| 148 |
+
- Correct label: 1.0
|
| 149 |
+
- Wrong label but correct routing: 0.3
|
| 150 |
+
- Everything wrong: 0.0
|
| 151 |
+
|
| 152 |
+
### Task 2: Medium (`task_medium`)
|
| 153 |
+
|
| 154 |
+
Objective: Triage a queue of 5 emails with mixed priority signals.
|
| 155 |
+
|
| 156 |
+
Scoring:
|
| 157 |
+
|
| 158 |
+
- Each email scored individually
|
| 159 |
+
- Score = (correct labels / total emails) * priority weight factor
|
| 160 |
+
- Higher-priority misclassifications are penalized more heavily
|
| 161 |
+
- Final score = weighted mean of all individual scores
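
The weighted mean in the last bullet can be illustrated as follows. This is only a sketch: the per-email scores and weights are made-up values, and using each email's `priority_weight` as the weight is an assumption, since the exact grader formula is not spelled out here.

```python
def weighted_mean(scores: list[float], weights: list[float]) -> float:
    """Weighted mean of per-email scores; weights must sum to a positive value."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(score * weight for score, weight in zip(scores, weights)) / total


# Hypothetical queue of 5 emails: the third is high priority (weight 2.0)
# and was misclassified, so it drags the mean down more than the others would.
per_email_scores = [1.0, 1.0, 0.0, 1.0, 0.3]
priority_weights = [1.0, 1.0, 2.0, 1.0, 1.0]
queue_score = weighted_mean(per_email_scores, priority_weights)
```

With equal weights the same scores would average higher, which is the sense in which higher-priority misclassifications cost more.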
|
| 162 |
+
|
| 163 |
+
### Task 3: Hard (`task_hard`)
|
| 164 |
+
|
| 165 |
+
Objective: Handle a complex complaint that crosses multiple categories.
|
| 166 |
+
|
| 167 |
+
Scoring:
|
| 168 |
+
|
| 169 |
+
- Escalated to safety: 0.4 weight
|
| 170 |
+
- Correct routing: 0.3 weight
|
| 171 |
+
- Marked as urgent: 0.3 weight
|
| 172 |
+
- Penalty: -0.2 if marked as spam
|
| 173 |
+
- Final score = weighted sum of sub-scores (clipped to 0.0 minimum)
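
The weighted sum above can be sketched in a few lines. `grade_hard` is a hypothetical function, not the project's grader: how "escalated to safety" and "correct routing" are detected is not specified here, so both are passed in as booleans.

```python
def grade_hard(label: str, escalated_to_safety: bool, routed_correctly: bool) -> float:
    """Weighted sum of the Task 3 sub-scores, clipped at 0.0."""
    score = 0.0
    if escalated_to_safety:
        score += 0.4
    if routed_correctly:
        score += 0.3
    if label == "urgent":
        score += 0.3
    if label == "spam":
        score -= 0.2  # penalty for dismissing the complaint as spam
    return max(0.0, score)


best_case = grade_hard("urgent", escalated_to_safety=True, routed_correctly=True)
spam_case = grade_hard("spam", escalated_to_safety=False, routed_correctly=False)
```

Since `label` holds a single value, the urgent credit and the spam penalty can never apply to the same action.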
|
| 174 |
+
|
| 175 |
+
### Task 4: Production (`task_production`)
|
| 176 |
+
|
| 177 |
+
Objective: Simulate a production inbox with mixed operational load across safety,
|
| 178 |
+
engineering, billing, support, spam, and low-priority traffic.
|
| 179 |
+
|
| 180 |
+
Scoring:
|
| 181 |
+
|
| 182 |
+
- Per-email weighted scoring by business priority
|
| 183 |
+
- Route-noise penalty when actions route to too many teams
|
| 184 |
+
- Summary quality based on contextual evidence keywords and anti-stuffing rules
|
| 185 |
+
- Deterministic escalation follow-ups are inserted when critical triage is missed
|
| 186 |
+
- Runtime controls available via `/reset` payload for production simulations:
|
| 187 |
+
- `production_profile`: `light` | `standard` | `heavy`
|
| 188 |
+
- `business_hours_mode`: `true` | `false`
|
| 189 |
+
- `escalation_mode`: `low` | `normal` | `high`
|
| 190 |
+
|
| 191 |
+
---
|
| 192 |
+
|
| 193 |
+
## Reward Function
|
| 194 |
+
|
| 195 |
+
The reward function provides dense training signal at every step, not just binary pass/fail.
|
| 196 |
+
|
| 197 |
+
### Formula
|
| 198 |
+
|
| 199 |
+
```text
|
| 200 |
+
final_reward = base_score + progress_signal + trajectory_bonus - penalties - step_cost
|
| 201 |
+
```
|
| 202 |
+
|
| 203 |
+
### Components
|
| 204 |
+
|
| 205 |
+
| Component | Value | Condition |
|
| 206 |
+
|---|---|---|
|
| 207 |
+
| Base score | 0.0-1.0 | Raw grader score for the current step |
|
| 208 |
+
| Progress signal | ~0.00 to ~0.13 | Partial credit for advancing queue, quality, and positive trend |
|
| 209 |
+
| Step cost | ~-0.005 to ~-0.015 | Gentle efficiency pressure over longer episodes |
|
| 210 |
+
| Trajectory bonus | +0.2 | If all tasks completed with mean score > 0.8 |
|
| 211 |
+
| Archive quality penalty | -0.5 | Archive action with an underspecified summary |
|
| 212 |
+
| Loop detection penalty | -0.3 | Same action repeated 3+ times consecutively |
|
| 213 |
+
|
| 214 |
+
The final reward is clipped to [-1.0, 1.0] before being returned.
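
The formula and clipping rule can be combined into a short sketch. The function names are hypothetical, and one interpretation is assumed: because the formula subtracts `penalties` and `step_cost`, both are passed as positive magnitudes (for example `0.3` for the loop penalty), even though the table lists them with a minus sign.

```python
def clip(value: float, low: float = -1.0, high: float = 1.0) -> float:
    """Clamp a value to the closed interval [low, high]."""
    return max(low, min(high, value))


def is_looping(actions: list[tuple[str, str, str]], window: int = 3) -> bool:
    """True when the last `window` actions are identical (loop detection)."""
    return len(actions) >= window and len(set(actions[-window:])) == 1


def final_reward(
    base_score: float,
    progress_signal: float,
    step_cost: float,
    trajectory_bonus: float = 0.0,
    penalties: float = 0.0,
) -> float:
    """Apply the reward formula, then clip to [-1.0, 1.0]."""
    raw = base_score + progress_signal + trajectory_bonus - penalties - step_cost
    return clip(raw)


# Three identical (label, summary, route_to) actions in a row trigger the loop penalty.
repeated = [("normal", "Test", "billing")] * 3
loop_penalty = 0.3 if is_looping(repeated) else 0.0
reward = final_reward(0.0, 0.0, step_cost=0.01, penalties=loop_penalty)
```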
|
| 215 |
+
|
| 216 |
+
---
|
| 217 |
+
|
| 218 |
+
## Quick Start
|
| 219 |
+
|
| 220 |
+
### Prerequisites
|
| 221 |
+
|
| 222 |
+
- Python 3.11+
|
| 223 |
+
- API endpoint, model name, and token for inference
|
| 224 |
+
|
| 225 |
+
### Installation
|
| 226 |
+
|
| 227 |
+
```bash
|
| 228 |
+
pip install -r requirements.txt
|
| 229 |
+
export API_BASE_URL="https://router.huggingface.co/v1"
|
| 230 |
+
export MODEL_NAME="gpt-4o"
|
| 231 |
+
export HF_TOKEN="your-token-here"
|
| 232 |
+
```
|
| 233 |
+
|
| 234 |
+
### Run the environment locally
|
| 235 |
+
|
| 236 |
+
```bash
|
| 237 |
+
python server.py
|
| 238 |
+
|
| 239 |
+
curl -X POST http://localhost:7860/reset \
|
| 240 |
+
-H "Content-Type: application/json" \
|
| 241 |
+
-d '{"task_id": "task_easy"}'
|
| 242 |
+
|
| 243 |
+
curl -X POST http://localhost:7860/step \
|
| 244 |
+
-H "Content-Type: application/json" \
|
| 245 |
+
-d '{"label": "urgent", "summary": "Test", "route_to": "billing"}'
|
| 246 |
+
|
| 247 |
+
curl -X POST http://localhost:7860/state
|
| 248 |
+
```
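
The same requests can be issued from Python. This is a minimal standard-library sketch, assuming the server is reachable at `http://localhost:7860` as in the `curl` commands above; `build_post` and `call` are hypothetical helper names, not part of the project.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:7860"


def build_post(path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a POST request; the body is omitted when no payload is given."""
    body = json.dumps(payload).encode("utf-8") if payload is not None else None
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path: str, payload: Optional[dict] = None) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_post(path, payload), timeout=10) as response:
        return json.load(response)
```

With the server running, `call("/reset", {"task_id": "task_easy"})` followed by `call("/step", {"label": "urgent", "summary": "Test", "route_to": "billing"})` mirrors the first two `curl` commands, and `call("/state")` mirrors the third.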
|
| 249 |
+
|
| 250 |
+
---
|
| 251 |
+
|
| 252 |
+
## Running Inference
|
| 253 |
+
|
| 254 |
+
```bash
|
| 255 |
+
python inference.py --task all
|
| 256 |
+
python inference.py --task 1
|
| 257 |
+
python inference.py --task 4 --production-profile heavy --business-hours-mode --escalation-mode high
|
| 258 |
+
```
|
| 259 |
+
|
| 260 |
+
The script reads API settings from environment variables and uses fallback actions when
|
| 261 |
+
model output is unparseable, so episodes still complete.
|
| 262 |
+
|
| 263 |
+
---
|
| 264 |
+
|
| 265 |
+
## Inference Architecture
|
| 266 |
+
|
| 267 |
+
The inference script (`inference.py`) follows this loop:
|
| 268 |
+
|
| 269 |
+
```text
|
| 270 |
+
1. Initialize OpenAI client + environment
|
| 271 |
+
2. reset() to get first observation
|
| 272 |
+
3. Loop until done or MAX_STEPS:
|
| 273 |
+
- Build prompt from observation + history
|
| 274 |
+
- Call LLM with OpenAI client (catch request errors)
|
| 275 |
+
- Parse response into action (fallback on parse failure)
|
| 276 |
+
- env.step(action)
|
| 277 |
+
- Record reward and history
|
| 278 |
+
4. Print score table
|
| 279 |
+
```
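
The loop above can be sketched against a toy environment. Everything here is illustrative: `FakeEnv`, `keyword_policy`, and the `MAX_STEPS` value are assumptions, the policy stands in for the LLM call, and the result classes only mimic the `result.observation` pattern.

```python
from dataclasses import dataclass
from typing import Callable

MAX_STEPS = 10  # assumed value; only the constant's name is documented


@dataclass
class ResetResult:
    observation: dict


@dataclass
class StepResult:
    observation: dict
    reward: float
    done: bool


class FakeEnv:
    """Toy environment: serves a fixed queue and rewards correct labels."""

    def __init__(self, emails: list[dict], truths: list[str]) -> None:
        self.emails, self.truths, self.index = emails, truths, 0

    def reset(self) -> ResetResult:
        self.index = 0
        return ResetResult(self.emails[0])

    def step(self, action: dict) -> StepResult:
        reward = 1.0 if action["label"] == self.truths[self.index] else 0.0
        self.index += 1
        done = self.index >= len(self.emails)
        return StepResult({} if done else self.emails[self.index], reward, done)


def keyword_policy(observation: dict, history: list) -> dict:
    """Stand-in for the LLM call: label by a single keyword."""
    label = "urgent" if "overheating" in observation["body"] else "normal"
    return {"label": label, "summary": observation["subject"], "route_to": "general"}


def run_episode(env: FakeEnv, policy: Callable[[dict, list], dict]) -> list[float]:
    observation = env.reset().observation
    history: list[tuple[dict, float]] = []
    for _ in range(MAX_STEPS):
        action = policy(observation, history)
        result = env.step(action)
        history.append((action, result.reward))  # record reward and history
        if result.done:
            break
        observation = result.observation
    return [reward for _, reward in history]


emails = [
    {"subject": "Device overheating", "body": "The unit is overheating badly."},
    {"subject": "Invoice copy", "body": "Please resend last month's invoice."},
]
rewards = run_episode(FakeEnv(emails, ["urgent", "normal"]), keyword_policy)
```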
|
| 280 |
+
|
| 281 |
+
### Environment Variables Required
|
| 282 |
+
|
| 283 |
+
```bash
|
| 284 |
+
export API_BASE_URL="https://router.huggingface.co/v1"
|
| 285 |
+
export MODEL_NAME="gpt-4o"
|
| 286 |
+
export HF_TOKEN="your-token-here"
|
| 287 |
+
export INFERENCE_RUNTIME_BUDGET_SECONDS="1140"
|
| 288 |
+
export INFERENCE_REQUEST_TIMEOUT_SECONDS="12"
|
| 289 |
+
```
|
| 290 |
+
|
| 291 |
+
Runtime controls:
|
| 292 |
+
|
| 293 |
+
- `INFERENCE_RUNTIME_BUDGET_SECONDS` limits full-script wall-clock runtime (default 1140s, under 20 minutes).
|
| 294 |
+
- `INFERENCE_REQUEST_TIMEOUT_SECONDS` limits each LLM request timeout (default 12s).
|
| 295 |
+
- Equivalent CLI flags: `--runtime-budget-seconds` and `--request-timeout-seconds`.
|
| 296 |
+
|
| 297 |
+
Fallback behavior when parsing fails:
|
| 298 |
+
|
| 299 |
+
```json
|
| 300 |
+
{"label": "normal", "summary": "Unable to parse response", "route_to": "general"}
|
| 301 |
+
```
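
One way to implement this fallback is to scan the model output for the first JSON object that looks like a valid action. The sketch below is an assumption about how such parsing could work, not the code in `inference.py`; `parse_action` is a hypothetical name.

```python
import json

ALLOWED_LABELS = ("urgent", "normal", "spam", "archive")
FALLBACK_ACTION = {
    "label": "normal",
    "summary": "Unable to parse response",
    "route_to": "general",
}


def parse_action(text: str) -> dict:
    """Return the first valid action object found in `text`, else the fallback."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores any trailing text after it.
            candidate, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        if (
            isinstance(candidate, dict)
            and candidate.get("label") in ALLOWED_LABELS
            and isinstance(candidate.get("summary"), str)
            and isinstance(candidate.get("route_to"), str)
        ):
            return {key: candidate[key] for key in ("label", "summary", "route_to")}
    return dict(FALLBACK_ACTION)


parsed = parse_action(
    'Sure! {"label": "urgent", "summary": "Overheating", "route_to": "safety"} Done.'
)
unparseable = parse_action("I think this email is probably important.")
```

Returning a copy of the fallback keeps callers from accidentally mutating the shared default.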
|
| 302 |
+
|
| 303 |
+
---
|
| 304 |
+
|
| 305 |
+
## Score Table
|
| 306 |
+
|
| 307 |
+
The values below are placeholders until inference has been run; Task 4 (Production) is not yet included in the table.
|
| 308 |
+
|
| 309 |
+
| Model | Task 1 (Easy) | Task 2 (Medium) | Task 3 (Hard) | Mean |
|
| 310 |
+
|---|---|---|---|---|
|
| 311 |
+
| MODEL_NAME | TBD | TBD | TBD | TBD |
|
| 312 |
+
|
| 313 |
+
Expected rough ranges:
|
| 314 |
+
|
| 315 |
+
- GPT-4o: 0.8-1.0 on easy, 0.5-0.8 on medium, 0.4-0.7 on hard
|
| 316 |
+
|
| 317 |
+
---
|
| 318 |
+
|
| 319 |
+
## Docker Deployment
|
| 320 |
+
|
| 321 |
+
```bash
|
| 322 |
+
docker build -t email-triage-env .
|
| 323 |
+
docker run -p 7860:7860 email-triage-env
|
| 324 |
+
|
| 325 |
+
curl -X POST http://localhost:7860/reset \
|
| 326 |
+
-H "Content-Type: application/json" \
|
| 327 |
+
-d '{"task_id": "task_easy"}'
|
| 328 |
+
```
|
| 329 |
+
|
| 330 |
+
For Apple Silicon:
|
| 331 |
+
|
| 332 |
+
```bash
|
| 333 |
+
docker build --platform linux/amd64 -t email-triage-env .
|
| 334 |
+
```
|
| 335 |
+
|
| 336 |
+
---
|
| 337 |
+
|
| 338 |
+
## Hugging Face Space
|
| 339 |
+
|
| 340 |
+
Live URL placeholder:
|
| 341 |
+
|
| 342 |
+
`https://huggingface.co/spaces/YOUR_USERNAME/email-triage-env`
|
| 343 |
+
|
| 344 |
+
The Space homepage (`/`) now serves a lightweight interactive triage console for
|
| 345 |
+
manual testing. Machine-readable service metadata is available at `GET /meta`.
|
| 346 |
+
|
| 347 |
+
Example interaction:
|
| 348 |
+
|
| 349 |
+
```bash
|
| 350 |
+
export SPACE_URL="https://YOUR_USERNAME-email-triage-env.hf.space"
|
| 351 |
+
|
| 352 |
+
curl -X POST "$SPACE_URL/reset" \
|
| 353 |
+
-H "Content-Type: application/json" \
|
| 354 |
+
-d '{"task_id": "task_easy"}'
|
| 355 |
+
```
|
| 356 |
+
|
| 357 |
+
---
|
| 358 |
+
|
| 359 |
+
## Pre-Submission Validation
|
| 360 |
+
|
| 361 |
+
Run the validator before submitting your environment.
|
| 362 |
+
|
| 363 |
+
```bash
|
| 364 |
+
chmod +x validate-submission.sh
|
| 365 |
+
./validate-submission.sh https://YOUR_USERNAME-email-triage-env.hf.space .
|
| 366 |
+
```
|
| 367 |
+
|
| 368 |
+
The script checks:
|
| 369 |
+
|
| 370 |
+
- HF Space `/reset` health (HTTP 200 expected)
|
| 371 |
+
- Docker build success
|
| 372 |
+
- `openenv validate` pass status
|
| 373 |
+
|
| 374 |
+
---
|
| 375 |
+
|
| 376 |
+
## API Reference
|
| 377 |
+
|
| 378 |
+
### POST /reset
|
| 379 |
+
|
| 380 |
+
Request:
|
| 381 |
+
|
| 382 |
+
```json
|
| 383 |
+
{"task_id": "task_easy"}
|
| 384 |
+
```
|
| 385 |
+
|
| 386 |
+
Response:
|
| 387 |
+
|
| 388 |
+
```json
|
| 389 |
+
{
|
| 390 |
+
"observation": {
|
| 391 |
+
"email_id": "easy-001",
|
| 392 |
+
"subject": "Quarterly invoice available",
|
| 393 |
+
"body": "...",
|
| 394 |
+
"sender": "accounts@vendor-example.com",
|
| 395 |
+
"timestamp": "2026-03-25T09:15:00Z",
|
| 396 |
+
"thread_history": ["..."],
|
| 397 |
+
"task_id": "task_easy",
|
| 398 |
+
"step_number": 0,
|
| 399 |
+
"total_emails": 1
|
| 400 |
+
},
|
| 401 |
+
"info": {"task_id": "task_easy", "step": 0}
|
| 402 |
+
}
|
| 403 |
+
```
|
| 404 |
+
|
| 405 |
+
### POST /step
|
| 406 |
+
|
| 407 |
+
Request:
|
| 408 |
+
|
| 409 |
+
```json
|
| 410 |
+
{
|
| 411 |
+
"label": "urgent",
|
| 412 |
+
"summary": "Customer needs immediate help",
|
| 413 |
+
"route_to": "support"
|
| 414 |
+
}
|
| 415 |
+
```
|
| 416 |
+
|
| 417 |
+
Response:
|
| 418 |
+
|
| 419 |
+
```json
|
| 420 |
+
{
|
| 421 |
+
"observation": {},
|
| 422 |
+
"reward": 0.85,
|
| 423 |
+
"done": false,
|
| 424 |
+
"info": {"step": 1, "task_id": "task_easy"}
|
| 425 |
+
}
|
| 426 |
+
```
|
| 427 |
+
|
| 428 |
+
### POST /state
|
| 429 |
+
|
| 430 |
+
No request body required.
|
| 431 |
+
|
| 432 |
+
Response: `EnvironmentState` JSON object.
|
| 433 |
+
|
| 434 |
+
---
|
| 435 |
+
|
| 436 |
+
## Project Structure
|
| 437 |
+
|
| 438 |
+
```text
|
| 439 |
+
.
|
| 440 |
+
├── models.py
|
| 441 |
+
├── tasks.py
|
| 442 |
+
├── graders.py
|
| 443 |
+
├── environment.py
|
| 444 |
+
├── server.py
|
| 445 |
+
├── server/
|
| 446 |
+
│   └── app.py
|
| 447 |
+
├── inference.py
|
| 448 |
+
├── openenv.yaml
|
| 449 |
+
├── Dockerfile
|
| 450 |
+
├── requirements.txt
|
| 451 |
+
├── pyproject.toml
|
| 452 |
+
├── uv.lock
|
| 453 |
+
├── validate-submission.sh
|
| 454 |
+
├── README.md
|
| 455 |
+
└── RULES.md
|
| 456 |
+
```
|
| 457 |
+
|
| 458 |
+
---
|
| 459 |
+
|
| 460 |
+
## Known Limitations
|
| 461 |
+
|
| 462 |
+
| Limitation | Impact |
|
| 463 |
+
|---|---|
|
| 464 |
+
| Static scenario pools | No live inbox ingestion from production systems |
|
| 465 |
+
| Single-agent server instance | Concurrent agents can conflict |
|
| 466 |
+
| No live thread simulation | Thread history is static |
|
| 467 |
+
| English-only content | No multilingual coverage |
|
| 468 |
+
| No attachments | Text-only triage |
|
| 469 |
+
| Simplified routing | No org chart or availability modeling |
|
| 470 |
+
| Limited temporal dynamics | Production task can generate deterministic escalations, but not full live message streams |
|
| 471 |
+
| Rule-based grading edges | Equivalent decisions may score differently from humans |
|
| 472 |
+
|
| 473 |
+
What an agent cannot exploit:
|
| 474 |
+
|
| 475 |
+
- The correct answer is never present in observations
|
| 476 |
+
- The grader is a pure function and cannot be manipulated
|
| 477 |
+
- The step cost cannot be avoided; the only way to reduce it is to finish in fewer steps
|
| 478 |
+
|
| 479 |
+
---
|
| 480 |
+
|
| 481 |
+
## Summary of Revision 2 Changes
|
| 482 |
+
|
| 483 |
+
| What Changed | Before | After | Why |
|
| 484 |
+
|---|---|---|---|
|
| 485 |
+
| Return type of step() | tuple | StepResult object | Match sample result.observation pattern |
|
| 486 |
+
| Return type of reset() | EmailObservation | ResetResult object | Match sample result.observation pattern |
|
| 487 |
+
| New models | 4 models | 6 models (+StepResult, +ResetResult) | Match sample interface |
|
| 488 |
+
| API key reading | OPENAI_API_KEY style | HF_TOKEN or API_KEY via os.getenv | Match sample fallback pattern |
|
| 489 |
+
| Temperature guidance | 0 | 0.2 | Match sample behavior |
|
| 490 |
+
| Response parsing | JSON-only assumption | Text parsing with fallback action | Robustness to non-JSON model output |
|
| 491 |
+
| History tracking | Optional | Mandatory | Match sample architecture |
|
| 492 |
+
| Step cap | Not explicit | MAX_STEPS constant | Runtime safety and reproducibility |
|
| 493 |
+
|
| 494 |
+
---
|
| 495 |
+
|
| 496 |
+
## Contributing
|
| 497 |
+
|
| 498 |
+
Read `RULES.md` before contributing.
|
| 499 |
+
|
| 500 |
+
Key constraints:
|
| 501 |
+
|
| 502 |
+
- Type hints and Pydantic models required
|
| 503 |
+
- No extra dependencies without explicit approval
|
| 504 |
+
- No features beyond project brief
|
| 505 |
+
- Graders must remain deterministic pure functions
|
| 506 |
+
|
| 507 |
+
---
|
| 508 |
+
|
| 509 |
+
## License
|
| 510 |
+
|
| 511 |
+
MIT License.
|