---
title: Email Triage OpenEnv
emoji: 📧
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
tags:
  - openenv
---
# 📧 Email Triage – OpenEnv Environment
A real-world reinforcement learning environment where an AI agent triages a business inbox: classifying email priority, routing to the correct department, deciding if a response is needed, and summarizing the required action.
## Why This Matters

Email triage is one of the most universal knowledge-work tasks: every company does it, it is well-defined enough to evaluate, and bad triage has real business costs (missed critical issues, time wasted on spam). This makes it an ideal benchmark for evaluating LLM agents on practical NLP reasoning.
## Environment Description

**Task:** Given an email (subject, body, sender, timestamp), the agent must output:

- **Priority:** `low|medium|high|critical`
- **Department:** `sales|support|billing|engineering|hr|spam`
- **Summary:** a 1-2 sentence action summary
- **Needs response:** `true|false`
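The four output fields above form the action schema. A minimal validator sketch follows; the allowed values come from the spec, while the helper name and error messages are illustrative:

```python
PRIORITIES = {"low", "medium", "high", "critical"}
DEPARTMENTS = {"sales", "support", "billing", "engineering", "hr", "spam"}

def validate_action(action: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the action is valid."""
    errors = []
    if action.get("priority") not in PRIORITIES:
        errors.append("priority must be one of low|medium|high|critical")
    if action.get("department") not in DEPARTMENTS:
        errors.append("department must be one of sales|support|billing|engineering|hr|spam")
    summary = action.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        errors.append("summary must be a non-empty string")
    if not isinstance(action.get("needs_response"), bool):
        errors.append("needs_response must be a boolean")
    return errors
```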
## Tasks
| Task ID | Name | Difficulty | Description |
|---|---|---|---|
| 1 | Basic Email Triage | Easy | Clear, unambiguous signals. Obvious priority and routing. |
| 2 | Ambiguous Signal Triage | Medium | Mixed signals. Polite language hiding urgency. Context-dependent routing. |
| 3 | Complex Business Triage | Hard | Requires business domain knowledge, regulatory awareness, and implicit urgency detection. |
## Observation Space

```json
{
  "email_id": "string",
  "subject": "string",
  "body": "string",
  "sender": "string",
  "timestamp": "string (YYYY-MM-DD HH:MM:SS)",
  "step_number": "integer",
  "task_id": "integer (1-3)",
  "instructions": "string (task-specific guidance)"
}
```
## Action Space

```json
{
  "priority": "low | medium | high | critical",
  "department": "sales | support | billing | engineering | hr | spam",
  "summary": "string (1-2 sentences)",
  "needs_response": "boolean"
}
```
## Reward Function

Each step returns a reward in [0.0, 1.0], computed as a weighted sum of:
| Component | Weight | Description |
|---|---|---|
| Priority accuracy | 40% | Exact match = 1.0; partial credit for adjacent levels |
| Department routing | 35% | Exact match = 1.0; partial credit for related depts |
| Response decision | 15% | Binary correct/incorrect |
| Summary quality | 10% | Length + content heuristic |
| Critical miss penalty | -30% | Missing a "critical" email as low/medium |
The reward provides continuous partial credit at every step of the episode rather than a single binary end-of-episode score.
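As a concrete illustration of the weighting, here is a hedged sketch of the step reward. Only the weights and the critical-miss penalty come from the table above; the adjacent-level credit of 0.5, the omission of related-department credit, and the summary length heuristic are assumptions:

```python
def triage_reward(action: dict, gold: dict) -> float:
    """Sketch of the weighted step reward (partial-credit values are assumptions)."""
    levels = ["low", "medium", "high", "critical"]
    # Priority (40%): exact match = 1.0; assume 0.5 for adjacent levels.
    gap = abs(levels.index(action["priority"]) - levels.index(gold["priority"]))
    priority = 1.0 if gap == 0 else (0.5 if gap == 1 else 0.0)
    # Department (35%): exact match only here; the env also gives
    # partial credit for related departments, omitted in this sketch.
    department = 1.0 if action["department"] == gold["department"] else 0.0
    # Response decision (15%): binary correct/incorrect.
    response = 1.0 if action["needs_response"] == gold["needs_response"] else 0.0
    # Summary (10%): simple length heuristic standing in for the content check.
    summary = 1.0 if 0 < len(action["summary"].split()) <= 40 else 0.0
    reward = 0.40 * priority + 0.35 * department + 0.15 * response + 0.10 * summary
    # Critical-miss penalty: a "critical" email triaged as low/medium loses 0.30.
    if gold["priority"] == "critical" and action["priority"] in ("low", "medium"):
        reward -= 0.30
    return max(0.0, min(1.0, reward))
```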
## API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| POST | `/reset?task_id=1` | Start a new episode for task 1, 2, or 3 |
| POST | `/step` | Submit an action, receive observation + reward |
| GET | `/state` | Current episode state |
| GET | `/tasks` | List all tasks |
| GET | `/` | Health check |
## Setup & Usage

### Docker

```bash
docker build -t email-triage-env .
docker run -p 7860:7860 email-triage-env
```
### Run baseline inference

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export HF_TOKEN="your-token-here"
export ENV_URL="http://localhost:7860"
python inference.py
```
## Baseline Scores

Tested with gpt-4o-mini (temperature=0):
| Task | Avg Reward |
|---|---|
| Task 1 (Easy) | ~0.87 |
| Task 2 (Medium) | ~0.64 |
| Task 3 (Hard) | ~0.41 |
| Overall | ~0.64 |
## Expected Log Format

The inference script emits structured JSON-lines logs:

```json
{"type": "START", "model": "...", "tasks": [1, 2, 3]}
{"type": "STEP", "task_id": 1, "step": 1, "reward": 0.92, ...}
{"type": "END", "overall_avg_reward": 0.64, "task_scores": {...}}
```
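Because each log line is standalone JSON, per-task averages can be recomputed from the `STEP` records in a few lines. The `summarize_log` helper below is illustrative, not part of the shipped scripts:

```python
import json

def summarize_log(lines):
    """Average the STEP rewards per task from JSON-lines inference logs."""
    per_task = {}
    for line in lines:
        record = json.loads(line)
        if record.get("type") == "STEP":
            per_task.setdefault(record["task_id"], []).append(record["reward"])
    return {task: sum(rs) / len(rs) for task, rs in per_task.items()}
```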