---
title: Email Triage OpenEnv
emoji: 📧
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
tags:
  - openenv
---
# 📧 Email Triage – OpenEnv Environment
A real-world reinforcement learning environment where an AI agent triages a business inbox: classifying email priority, routing to the correct department, deciding if a response is needed, and summarizing the required action.
## Why This Matters

Email triage is one of the most universal knowledge-work tasks: every company does it, it is well-defined enough to evaluate, and bad triage has real business costs (missed critical issues, time wasted on spam). This makes it an ideal benchmark for evaluating LLM agents on practical NLP reasoning.
## Environment Description

**Task:** Given an email (subject, body, sender, timestamp), the agent must output:

- **Priority:** `low|medium|high|critical`
- **Department:** `sales|support|billing|engineering|hr|spam`
- **Summary:** a 1-2 sentence action summary
- **Needs response:** `true|false`
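The four output fields above form the action schema. A minimal validator sketch follows; the allowed values come from the spec, while the helper name and error messages are illustrative:

```python
PRIORITIES = {"low", "medium", "high", "critical"}
DEPARTMENTS = {"sales", "support", "billing", "engineering", "hr", "spam"}

def validate_action(action: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the action is valid."""
    errors = []
    if action.get("priority") not in PRIORITIES:
        errors.append("priority must be one of low|medium|high|critical")
    if action.get("department") not in DEPARTMENTS:
        errors.append("department must be one of sales|support|billing|engineering|hr|spam")
    summary = action.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        errors.append("summary must be a non-empty string")
    if not isinstance(action.get("needs_response"), bool):
        errors.append("needs_response must be a boolean")
    return errors
```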
## Tasks
| Task ID | Name | Difficulty | Description |
|---|---|---|---|
| 1 | Basic Email Triage | Easy | Clear, unambiguous signals. Obvious priority and routing. |
| 2 | Ambiguous Signal Triage | Medium | Mixed signals. Polite language hiding urgency. Context-dependent routing. |
| 3 | Complex Business Triage | Hard | Requires business domain knowledge, regulatory awareness, and implicit urgency detection. |
## Observation Space

```json
{
  "email_id": "string",
  "subject": "string",
  "body": "string",
  "sender": "string",
  "timestamp": "string (YYYY-MM-DD HH:MM:SS)",
  "step_number": "integer",
  "task_id": "integer (1-3)",
  "instructions": "string (task-specific guidance)"
}
```
## Action Space

```json
{
  "priority": "low | medium | high | critical",
  "department": "sales | support | billing | engineering | hr | spam",
  "summary": "string (1-2 sentences)",
  "needs_response": "boolean"
}
```
## Reward Function

Each step returns a reward in [0.0, 1.0], computed as a weighted sum of:
| Component | Weight | Description |
|---|---|---|
| Priority accuracy | 40% | Exact match = 1.0; partial credit for adjacent levels |
| Department routing | 35% | Exact match = 1.0; partial credit for related depts |
| Response decision | 15% | Binary correct/incorrect |
| Summary quality | 10% | Length + content heuristic |
| Critical miss penalty | -30% | Missing a "critical" email as low/medium |
The reward provides continuous partial credit at every step of the episode rather than a single binary end-of-episode score.
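As a concrete illustration of the weighting, here is a hedged sketch of the step reward. Only the weights and the critical-miss penalty come from the table above; the adjacent-level credit of 0.5, the omission of related-department credit, and the summary length heuristic are assumptions:

```python
def triage_reward(action: dict, gold: dict) -> float:
    """Sketch of the weighted step reward (partial-credit values are assumptions)."""
    levels = ["low", "medium", "high", "critical"]
    # Priority (40%): exact match = 1.0; assume 0.5 for adjacent levels.
    gap = abs(levels.index(action["priority"]) - levels.index(gold["priority"]))
    priority = 1.0 if gap == 0 else (0.5 if gap == 1 else 0.0)
    # Department (35%): exact match only here; the env also gives
    # partial credit for related departments, omitted in this sketch.
    department = 1.0 if action["department"] == gold["department"] else 0.0
    # Response decision (15%): binary correct/incorrect.
    response = 1.0 if action["needs_response"] == gold["needs_response"] else 0.0
    # Summary (10%): simple length heuristic standing in for the content check.
    summary = 1.0 if 0 < len(action["summary"].split()) <= 40 else 0.0
    reward = 0.40 * priority + 0.35 * department + 0.15 * response + 0.10 * summary
    # Critical-miss penalty: a "critical" email triaged as low/medium loses 0.30.
    if gold["priority"] == "critical" and action["priority"] in ("low", "medium"):
        reward -= 0.30
    return max(0.0, min(1.0, reward))
```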
## API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| POST | `/reset?task_id=1` | Start a new episode for task 1, 2, or 3 |
| POST | `/step` | Submit an action, receive observation + reward |
| GET | `/state` | Current episode state |
| GET | `/tasks` | List all tasks |
| GET | `/` | Health check |
## Setup & Usage

### Docker

```bash
docker build -t email-triage-env .
docker run -p 7860:7860 email-triage-env
```
### Run baseline inference

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export HF_TOKEN="your-token-here"
export ENV_URL="http://localhost:7860"
python inference.py
```
## Baseline Scores

Tested with gpt-4o-mini (temperature=0):
| Task | Avg Reward |
|---|---|
| Task 1 (Easy) | ~0.87 |
| Task 2 (Medium) | ~0.64 |
| Task 3 (Hard) | ~0.41 |
| Overall | ~0.64 |
## Expected Log Format

The inference script emits structured JSON-lines logs:

```json
{"type": "START", "model": "...", "tasks": [1, 2, 3]}
{"type": "STEP", "task_id": 1, "step": 1, "reward": 0.92, ...}
{"type": "END", "overall_avg_reward": 0.64, "task_scores": {...}}
```
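Because each log line is standalone JSON, per-task averages can be recomputed from the `STEP` records in a few lines. The `summarize_log` helper below is illustrative, not part of the shipped scripts:

```python
import json

def summarize_log(lines):
    """Average the STEP rewards per task from JSON-lines inference logs."""
    per_task = {}
    for line in lines:
        record = json.loads(line)
        if record.get("type") == "STEP":
            per_task.setdefault(record["task_id"], []).append(record["reward"])
    return {task: sum(rs) / len(rs) for task, rs in per_task.items()}
```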