---
title: Email Triage OpenEnv
emoji: πŸ“§
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
tags:
  - openenv
---

# πŸ“§ Email Triage β€” OpenEnv Environment

A real-world reinforcement learning environment where an AI agent triages a business inbox: classifying email priority, routing to the correct department, deciding if a response is needed, and summarizing the required action.

## Why This Matters

Email triage is one of the most universal knowledge-work tasks β€” every company does it, it's well-defined enough to evaluate, and bad triage has real business costs (missed critical issues, wasted time on spam). This makes it an ideal benchmark for evaluating LLM agents on practical NLP reasoning.


## Environment Description

**Task:** Given an email (subject, body, sender, timestamp), the agent must output:

- **Priority:** low | medium | high | critical
- **Department:** sales | support | billing | engineering | hr | spam
- **Summary:** a 1-2 sentence action summary
- **Needs response:** true | false

## Tasks

| Task ID | Name | Difficulty | Description |
|---|---|---|---|
| 1 | Basic Email Triage | Easy | Clear, unambiguous signals. Obvious priority and routing. |
| 2 | Ambiguous Signal Triage | Medium | Mixed signals. Polite language hiding urgency. Context-dependent routing. |
| 3 | Complex Business Triage | Hard | Requires business domain knowledge, regulatory awareness, and implicit urgency detection. |

## Observation Space

```json
{
  "email_id": "string",
  "subject": "string",
  "body": "string",
  "sender": "string",
  "timestamp": "string (YYYY-MM-DD HH:MM:SS)",
  "step_number": "integer",
  "task_id": "integer (1-3)",
  "instructions": "string (task-specific guidance)"
}
```

## Action Space

```json
{
  "priority": "low | medium | high | critical",
  "department": "sales | support | billing | engineering | hr | spam",
  "summary": "string (1-2 sentences)",
  "needs_response": "boolean"
}
```
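Before submitting an action, it can help to check it against this schema client-side. A minimal sketch, assuming the field names and allowed values above; the `validate_action` helper itself is illustrative and not part of the environment:

```python
# Allowed values taken from the action-space schema above.
PRIORITIES = {"low", "medium", "high", "critical"}
DEPARTMENTS = {"sales", "support", "billing", "engineering", "hr", "spam"}

def validate_action(action):
    """Return a list of schema violations (empty list means the action is valid)."""
    errors = []
    if action.get("priority") not in PRIORITIES:
        errors.append("priority must be one of: " + ", ".join(sorted(PRIORITIES)))
    if action.get("department") not in DEPARTMENTS:
        errors.append("department must be one of: " + ", ".join(sorted(DEPARTMENTS)))
    if not isinstance(action.get("summary"), str) or not action["summary"].strip():
        errors.append("summary must be a non-empty string")
    if not isinstance(action.get("needs_response"), bool):
        errors.append("needs_response must be a boolean")
    return errors

action = {
    "priority": "high",
    "department": "billing",
    "summary": "Customer reports a duplicate charge; issue a refund and confirm.",
    "needs_response": True,
}
assert validate_action(action) == []
```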

## Reward Function

Each step returns a reward in [0.0, 1.0] computed as:

| Component | Weight | Description |
|---|---|---|
| Priority accuracy | 40% | Exact match = 1.0; partial credit for adjacent levels |
| Department routing | 35% | Exact match = 1.0; partial credit for related departments |
| Response decision | 15% | Binary correct/incorrect |
| Summary quality | 10% | Length + content heuristic |
| Critical miss penalty | -30% | Triaging a "critical" email as low or medium |

The reward provides continuous partial credit at every step, rather than only a binary score at the end of the episode.
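The weighted sum above can be sketched in a few lines. The weights (0.40/0.35/0.15/0.10) and the -0.30 critical-miss penalty come from the table; the exact partial-credit values and the final clamp to [0, 1] are assumptions, and the real environment may score components differently:

```python
PRIORITY_LEVELS = ["low", "medium", "high", "critical"]

def score_priority(predicted, expected):
    """Exact match = 1.0; adjacent level = 0.5 (assumed partial-credit value)."""
    gap = abs(PRIORITY_LEVELS.index(predicted) - PRIORITY_LEVELS.index(expected))
    return {0: 1.0, 1: 0.5}.get(gap, 0.0)

def compute_reward(pred, gold, dept_score, summary_score):
    """Weighted sum of the four components, minus the critical-miss penalty."""
    reward = (
        0.40 * score_priority(pred["priority"], gold["priority"])
        + 0.35 * dept_score      # department-routing score in [0, 1]
        + 0.15 * (1.0 if pred["needs_response"] == gold["needs_response"] else 0.0)
        + 0.10 * summary_score   # summary-quality heuristic in [0, 1]
    )
    # Critical miss penalty: a "critical" email labeled low/medium loses 30 points.
    if gold["priority"] == "critical" and pred["priority"] in ("low", "medium"):
        reward -= 0.30
    # Assumed clamp so the step reward stays in [0.0, 1.0].
    return max(0.0, min(1.0, reward))
```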


## API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| POST | `/reset?task_id=1` | Start a new episode for task 1, 2, or 3 |
| POST | `/step` | Submit an action, receive observation + reward |
| GET | `/state` | Current episode state |
| GET | `/tasks` | List all tasks |
| GET | `/` | Health check |
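A thin client for these endpoints can be written with only the standard library. The method/path pairs come from the table above; the JSON request/response handling and the `EmailTriageClient` name are illustrative assumptions, not part of the package:

```python
import json
import urllib.request

class EmailTriageClient:
    """Minimal HTTP wrapper over the environment endpoints listed above."""

    def __init__(self, base_url="http://localhost:7860"):
        self.base_url = base_url.rstrip("/")

    def _request(self, method, path, payload=None):
        # Encode the payload (if any) as JSON and decode the JSON response.
        data = json.dumps(payload).encode() if payload is not None else None
        req = urllib.request.Request(
            self.base_url + path,
            data=data,
            method=method,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def reset(self, task_id=1):
        """POST /reset?task_id=N -- start a new episode."""
        return self._request("POST", f"/reset?task_id={task_id}")

    def step(self, action):
        """POST /step -- submit an action dict, receive observation + reward."""
        return self._request("POST", "/step", payload=action)

    def state(self):
        """GET /state -- current episode state."""
        return self._request("GET", "/state")
```

A typical loop would call `reset(task_id=1)`, build an action from the returned observation, and pass it to `step()` until the episode ends.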

## Setup & Usage

### Docker

```bash
docker build -t email-triage-env .
docker run -p 7860:7860 email-triage-env
```

### Run baseline inference

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export HF_TOKEN="your-token-here"
export ENV_URL="http://localhost:7860"

python inference.py
```

## Baseline Scores

Tested with `gpt-4o-mini` (temperature=0):

| Task | Avg Reward |
|---|---|
| Task 1 (Easy) | ~0.87 |
| Task 2 (Medium) | ~0.64 |
| Task 3 (Hard) | ~0.41 |
| **Overall** | **~0.64** |

## Expected Log Format

The inference script emits structured JSON logs:

{"type": "START", "model": "...", "tasks": [1, 2, 3]}
{"type": "STEP", "task_id": 1, "step": 1, "reward": 0.92, ...}
{"type": "END", "overall_avg_reward": 0.64, "task_scores": {...}}