Spaces: Imaginephoenix/openenv

Imaginephoenix committed on Apr 3

Commit 02e973e · verified · 1 Parent(s): e45bf0c

Upload 15 files

Files changed (15)

Dockerfile +20 -0
README.md +500 -0
RULES.md +551 -0
app.py +609 -0
environment.py +419 -0
graders.py +296 -0
inference.py +377 -0
models.py +62 -0
openenv.yaml +27 -0
pyproject.toml +20 -0
requirements.txt +4 -0
server.py +621 -0
tasks.py +748 -0
uv.lock +8 -0
validate-submission.sh +198 -0

Dockerfile ADDED

	@@ -0,0 +1,20 @@

+FROM python:3.11-slim
+ENV PYTHONDONTWRITEBYTECODE=1
+ENV PYTHONUNBUFFERED=1
+WORKDIR /app
+COPY requirements.txt /app/requirements.txt
+RUN pip install --no-cache-dir -r /app/requirements.txt
+COPY . /app
+RUN adduser --disabled-password --gecos "" appuser \
+    && chown -R appuser:appuser /app
+USER appuser
+EXPOSE 7860
+CMD ["gunicorn", "--bind", "0.0.0.0:7860", "server:app"]

README.md ADDED

	@@ -0,0 +1,500 @@

+# OpenEnv Email Triage Environment
+A real-world AI agent training environment that simulates professional email triage.
+Built to the OpenEnv specification for standardized agent evaluation and benchmarking.
+- **Status:** In Development
+- **Domain:** Email Triage
+- **Deployment:** Hugging Face Spaces (Docker)
+---
+## Table of Contents
+- [What Is This?](#what-is-this)
+- [Who Is This For?](#who-is-this-for)
+- [Observation Space](#observation-space)
+- [Action Space](#action-space)
+- [Tasks](#tasks)
+- [Reward Function](#reward-function)
+- [Quick Start](#quick-start)
+- [Running Inference](#running-inference)
+- [Inference Architecture](#inference-architecture)
+- [Score Table](#score-table)
+- [Docker Deployment](#docker-deployment)
+- [Hugging Face Space](#hugging-face-space)
+- [Pre-Submission Validation](#pre-submission-validation)
+- [API Reference](#api-reference)
+- [Project Structure](#project-structure)
+- [Known Limitations](#known-limitations)
+- [Summary of Revision 2 Changes](#summary-of-revision-2-changes)
+- [Contributing](#contributing)
+- [License](#license)
+---
+## What Is This?
+This environment simulates a professional email inbox where an AI agent must:
+1. Read incoming emails with realistic metadata (sender, subject, body, thread history).
+2. Classify each email with the correct priority label.
+3. Route each email to the appropriate department or person.
+4. Summarize the email's key information.
+Think of it as OpenAI Gym for office work. Instead of balancing a pole, the agent triages an
+inbox. The environment provides structured observations, accepts structured actions, and
+returns graded rewards with partial credit.
+Every decision is scored by a deterministic programmatic grader: no LLM-as-judge,
+no randomness, fully reproducible.
+---
+## Who Is This For?
+| Audience | Use Case |
+|---|---|
+| AI Safety Researchers | Measure agent behavior on realistic tasks with known ground truth |
+| LLM Agent Developers | Benchmark models and prompting strategies on real-world work |
+| RL Researchers | Train agents with shaped rewards in a professional task environment |
+| Companies | Evaluate LLM agents before deploying them to handle real email |
+---
+## Observation Space
+What the agent sees at each step:
+| Field | Type | Description |
+|---|---|---|
+| `email_id` | `str` | Unique identifier for this email |
+| `subject` | `str` | Email subject line |
+| `body` | `str` | Full email body text |
+| `sender` | `str` | Sender's email address |
+| `timestamp` | `str` | ISO 8601 timestamp of when the email was received |
+| `thread_history` | `list[str]` | Previous messages in the email thread (may be empty) |
+| `task_id` | `str` | Which task is currently active |
+| `step_number` | `int` | Current step in the episode (0-indexed) |
+| `total_emails` | `int` | Total number of emails to process in this task |
+The observation never contains the correct answer. The agent must reason from email content.
+---
+## Action Space
+What the agent must output at each step:
+| Field | Type | Allowed Values | Description |
+|---|---|---|---|
+| `label` | `Literal` | `"urgent"`, `"normal"`, `"spam"`, `"archive"` | Priority classification |
+| `summary` | `str` | Free text | Brief summary of the email's content and intent |
+| `route_to` | `str` | Free text (e.g. `"billing"`, `"safety"`, `"engineering"`) | Department or person |
+### Example action JSON
+```json
+{
+  "label": "urgent",
+  "summary": "Customer reports a safety issue with product overheating",
+  "route_to": "safety"
+}
+```
+---
+## Tasks
+Each task now contains multiple deterministic scenario variants. By default, `/reset`
+cycles through the public scenario pool for the selected task.
+Private evaluation split selection is controlled server-side via environment
+configuration (`OPENENV_EVAL_SPLIT`), and client-side override can be disabled
+to preserve benchmark integrity.
+To keep private evaluation data out of source control, supply hidden scenarios at
+runtime using `OPENENV_PRIVATE_SCENARIOS_JSON` (JSON object keyed by task id).
+Example deployment configuration:
+```bash
+export OPENENV_EVAL_SPLIT="private_eval"
+export OPENENV_ALLOW_CLIENT_EVAL_OVERRIDE="false"
+export OPENENV_PRIVATE_SCENARIOS_JSON='{"task_easy":[{"scenario_id":"easy-private-001","emails":[{"email_id":"easy-p-001","subject":"Private billing exception","body":"Please correct invoice mismatch for contract addendum B-7 before end of day.","sender":"contracts@partner.example","timestamp":"2026-04-03T09:00:00Z","thread_history":["Customer requested corrected invoice reference."]}],"ground_truth":[{"label":"normal","route_to":"billing","priority_weight":1.0,"summary_keywords":["invoice mismatch","contract addendum","correct"]}]}],"task_medium":[],"task_hard":[]}'
+```
+Notes:
+- Keep this value in deployment secrets or runtime environment config.
+- Use valid JSON with double quotes only.
+- You can provide multiple scenarios per task by adding more objects to each task list.
+### Task 1 — Easy (`task_easy`)
+Objective: Correctly classify a single unambiguous email.
+Scoring:
+- Correct label: 1.0
+- Wrong label but correct routing: 0.3
+- Everything wrong: 0.0
+### Task 2 — Medium (`task_medium`)
+Objective: Triage a queue of 5 emails with mixed priority signals.
+Scoring:
+- Each email scored individually
+- Score = (correct labels / total emails) * priority weight factor
+- Higher-priority misclassifications are penalized more heavily
+- Final score = weighted mean of all individual scores
+### Task 3 — Hard (`task_hard`)
+Objective: Handle a complex complaint that crosses multiple categories.
+Scoring:
+- Escalated to safety: 0.4 weight
+- Correct routing: 0.3 weight
+- Marked as urgent: 0.3 weight
+- Penalty: -0.2 if marked as spam
+- Final score = weighted sum of sub-scores (clipped to 0.0 minimum)
+### Task 4 — Production (`task_production`)
+Objective: Simulate a production inbox with mixed operational load across safety,
+engineering, billing, support, spam, and low-priority traffic.
+Scoring:
+- Per-email weighted scoring by business priority
+- Route-noise penalty when actions route to too many teams
+- Summary quality based on contextual evidence keywords and anti-stuffing rules
+- Deterministic escalation follow-ups are inserted when critical triage is missed
+- Runtime controls available via `/reset` payload for production simulations:
+  - `production_profile`: `light` | `standard` | `heavy`
+  - `business_hours_mode`: `true` | `false`
+  - `escalation_mode`: `low` | `normal` | `high`
+---
+## Reward Function
+The reward function provides dense training signal at every step, not just binary pass/fail.
+### Formula
+```text
+final_reward = base_score - (step_count * 0.01) + trajectory_bonus - penalties
+```
+### Components
+| Component | Value | Condition |
+|---|---|---|
+| Base score | 0.0-1.0 | Raw grader score for the current step |
+| Step penalty | -0.01 per step | Encourages efficiency |
+| Trajectory bonus | +0.2 | If all tasks completed with mean score > 0.8 |
+| Destructive action penalty | -0.5 | Agent archives or deletes without reading |
+| Loop detection penalty | -0.3 | Same action repeated 3+ times consecutively |
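+The final reward is clipped to [-1.0, 1.0] before being returned. The signed values in the table show each component's net effect on the reward, so `penalties` in the formula is the sum of the penalty magnitudes (for example 0.5 or 0.3).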
+---
+## Quick Start
+### Prerequisites
+- Python 3.11+
+- API endpoint, model name, and token for inference
+### Installation
+```bash
+pip install -r requirements.txt
+export API_BASE_URL="https://router.huggingface.co/v1"
+export MODEL_NAME="gpt-4o"
+export HF_TOKEN="your-token-here"
+```
+### Run the environment locally
+```bash
+# Start the server; it keeps running, so issue the curl commands from a second terminal
+python server.py
+curl -X POST http://localhost:7860/reset \
+  -H "Content-Type: application/json" \
+  -d '{"task_id": "task_easy"}'
+curl -X POST http://localhost:7860/step \
+  -H "Content-Type: application/json" \
+  -d '{"label": "urgent", "summary": "Test", "route_to": "billing"}'
+curl -X POST http://localhost:7860/state
+```
+---
+## Running Inference
+```bash
+python inference.py --task all
+python inference.py --task 1
+python inference.py --task 4 --production-profile heavy --business-hours-mode --escalation-mode high
+```
+The script reads API settings from environment variables and uses fallback actions when
+model output is unparseable, so episodes still complete.
+---
+## Inference Architecture
+The inference script (`inference.py`) follows this loop:
+```text
+1. Initialize OpenAI client + environment
+2. reset() to get first observation
+3. Loop until done or MAX_STEPS:
+  - Build prompt from observation + history
+  - Call LLM with OpenAI client (catch request errors)
+  - Parse response into action (fallback on parse failure)
+  - env.step(action)
+  - Record reward and history
+4. Print score table
+```
+### Environment Variables Required
+```bash
+export API_BASE_URL="https://router.huggingface.co/v1"
+export MODEL_NAME="gpt-4o"
+export HF_TOKEN="your-token-here"
+export INFERENCE_RUNTIME_BUDGET_SECONDS="1140"
+export INFERENCE_REQUEST_TIMEOUT_SECONDS="12"
+```
+Runtime controls:
+- `INFERENCE_RUNTIME_BUDGET_SECONDS` limits full-script wall-clock runtime (default 1140s, under 20 minutes).
+- `INFERENCE_REQUEST_TIMEOUT_SECONDS` limits each LLM request timeout (default 12s).
+- Equivalent CLI flags: `--runtime-budget-seconds` and `--request-timeout-seconds`.
+Fallback behavior when parsing fails:
+```json
+{"label": "normal", "summary": "Unable to parse response", "route_to": "general"}
+```
+---
+## Score Table
+Placeholder until inference is run.
+| Model | Task 1 (Easy) | Task 2 (Medium) | Task 3 (Hard) | Mean |
+|---|---|---|---|---|
+| MODEL_NAME | TBD | TBD | TBD | TBD |
+Expected rough ranges:
+- GPT-4o: 0.8-1.0 on easy, 0.5-0.8 on medium, 0.4-0.7 on hard
+---
+## Docker Deployment
+```bash
+docker build -t email-triage-env .
+docker run -p 7860:7860 email-triage-env
+curl -X POST http://localhost:7860/reset \
+  -H "Content-Type: application/json" \
+  -d '{"task_id": "task_easy"}'
+```
+For Apple Silicon:
+```bash
+docker build --platform linux/amd64 -t email-triage-env .
+```
+---
+## Hugging Face Space
+Live URL placeholder:
+`https://huggingface.co/spaces/YOUR_USERNAME/email-triage-env`
+The Space homepage (`/`) now serves a lightweight interactive triage console for
+manual testing. Machine-readable service metadata is available at `GET /meta`.
+Example interaction:
+```bash
+export SPACE_URL="https://YOUR_USERNAME-email-triage-env.hf.space"
+curl -X POST "$SPACE_URL/reset" \
+  -H "Content-Type: application/json" \
+  -d '{"task_id": "task_easy"}'
+```
+---
+## Pre-Submission Validation
+Run the validator before submitting your environment.
+```bash
+chmod +x validate-submission.sh
+./validate-submission.sh https://YOUR_USERNAME-email-triage-env.hf.space .
+```
+The script checks:
+- HF Space `/reset` health (HTTP 200 expected)
+- Docker build success
+- `openenv validate` pass status
+---
+## API Reference
+### POST /reset
+Request:
+```json
+{"task_id": "task_easy"}
+```
+Response:
+```json
+{
+  "observation": {
+    "email_id": "easy-001",
+    "subject": "Quarterly invoice available",
+    "body": "...",
+    "sender": "accounts@vendor-example.com",
+    "timestamp": "2026-03-25T09:15:00Z",
+    "thread_history": ["..."],
+    "task_id": "task_easy",
+    "step_number": 0,
+    "total_emails": 1
+  },
+  "info": {"task_id": "task_easy", "step": 0}
+}
+```
+### POST /step
+Request:
+```json
+{
+  "label": "urgent",
+  "summary": "Customer needs immediate help",
+  "route_to": "support"
+}
+```
+Response:
+```json
+{
+  "observation": {},
+  "reward": 0.85,
+  "done": false,
+  "info": {"step": 1, "task_id": "task_easy"}
+}
+```
+### POST /state
+No request body required.
+Response: `EnvironmentState` JSON object.
+---
+## Project Structure
+```text
+.
+├── models.py
+├── tasks.py
+├── graders.py
+├── environment.py
+├── server.py
+├── server/
+│   └── app.py
+├── inference.py
+├── openenv.yaml
+├── Dockerfile
+├── requirements.txt
+├── pyproject.toml
+├── uv.lock
+├── validate-submission.sh
+├── README.md
+└── RULES.md
+```
+---
+## Known Limitations
+| Limitation | Impact |
+|---|---|
+| Static scenario pools | No live inbox ingestion from production systems |
+| Single-agent server instance | Concurrent agents can conflict |
+| No live thread simulation | Thread history is static |
+| English-only content | No multilingual coverage |
+| No attachments | Text-only triage |
+| Simplified routing | No org chart or availability modeling |
+| Limited temporal dynamics | Production task can generate deterministic escalations, but not full live message streams |
+| Rule-based grading edges | Equivalent decisions may score differently from humans |
+What an agent cannot exploit:
+- The correct answer is never present in observations
+- The grader is a pure function and cannot be manipulated
+- Step penalty cannot be bypassed except by efficient actions
+---
+## Summary of Revision 2 Changes
+| What Changed | Before | After | Why |
+|---|---|---|---|
+| Return type of step() | tuple | StepResult object | Match sample result.observation pattern |
+| Return type of reset() | EmailObservation | ResetResult object | Match sample result.observation pattern |
+| New models | 4 models | 6 models (+StepResult, +ResetResult) | Match sample interface |
+| API key reading | OPENAI_API_KEY style | HF_TOKEN or API_KEY via os.getenv | Match sample fallback pattern |
+| Temperature guidance | 0 | 0.2 | Match sample behavior |
+| Response parsing | JSON-only assumption | Text parsing with fallback action | Robustness to non-JSON model output |
+| History tracking | Optional | Mandatory | Match sample architecture |
+| Step cap | Not explicit | MAX_STEPS constant | Runtime safety and reproducibility |
+---
+## Contributing
+Read `RULES.md` before contributing.
+Key constraints:
+- Type hints and Pydantic models required
+- No extra dependencies without explicit approval
+- No features beyond project brief
+- Graders must remain deterministic pure functions
+---
+## License
+MIT License.
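
Note on README.md (Reward Function): the formula and the clipping step can be read as the short Python sketch below. It is a minimal illustration, not the code in `environment.py`; the function name `shaped_reward` and its parameters are hypothetical, and it assumes penalties are passed in as positive magnitudes that the formula then subtracts.

```python
def shaped_reward(
    base_score: float,
    step_count: int,
    trajectory_bonus: float = 0.0,
    penalties: float = 0.0,
) -> float:
    """Apply the README formula, then clip the result to [-1.0, 1.0]."""
    raw = base_score - (step_count * 0.01) + trajectory_bonus - penalties
    return max(-1.0, min(1.0, raw))


# Grader score 0.85 at step 3, no bonus and no penalties.
print(round(shaped_reward(0.85, 3), 2))  # 0.82
# Low grader score at step 5 with the 0.3 loop-detection penalty.
print(round(shaped_reward(0.1, 5, penalties=0.3), 2))  # -0.25
```

Passing penalties as magnitudes keeps the sketch aligned with the `- penalties` term in the formula; if the real implementation stores them as negative numbers, the sign of that term would flip.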

RULES.md ADDED

	@@ -0,0 +1,551 @@

+# RULES.md - Project Constitution & AI Guardrails
+# OpenEnv Email Triage Environment
+EVERY AI agent, copilot, or assistant working on this project MUST read and obey this file before generating ANY code.
+REVISION 2: Updated based on sample inference.py analysis.
+Where submission rules conflict with the original brief, SUBMISSION RULES WIN.
+Where the sample script reveals patterns, MATCH THE PATTERNS.
+## 0. GOLDEN RULE
+> Do NOT generate code that you cannot explain line by line.
+> Do NOT add features not listed in this document.
+> Do NOT deviate from the file map, naming conventions, or interfaces defined here.
+> When in doubt, do LESS, not more.
+---
+## 1. SCOPE - What This Project Is
+- An OpenEnv-compliant AI agent training environment
+- Domain: Email Triage (classify, prioritise, route emails)
+- Deployed as a Docker-based Hugging Face Space
+- Evaluated by inference.py using OpenAI Client with configurable endpoint
+### What this project is NOT
+- A chatbot
+- A web app with a UI
+- A game or toy problem
+- A fine-tuning pipeline
+- A multi-agent system
+- An LLM wrapper with extra features
+- A BrowserGym environment (the sample uses BrowserGym - we do NOT)
+---
+## 2. SUBMISSION CHECKLIST - DISQUALIFICATION CRITERIA
+These are automated checks. Failing ANY ONE means disqualification.
+| # | Check | What the validator does |
+|---|---|---|
+| 1 | HF Space deploys | Pings Space URL - must return HTTP 200 and respond to reset() |
+| 2 | OpenEnv spec compliance | Validates openenv.yaml, typed models, /step, /reset, /state |
+| 3 | Dockerfile builds | Runs docker build on the submitted repo - must succeed |
+| 4 | Inference reproduces | Runs inference.py - must complete without error and produce scores |
+| 5 | 3+ tasks with graders | Enumerates tasks, runs each grader, verifies scores in [0.0, 1.0] |
+| 6 | Pre-validation script | Runs `./validate-submission.sh <ping_url> .` and expects all 3 checks to pass |
+### 2.1 Mandatory pre-submit validation
+- Before claiming "submission ready", run `./validate-submission.sh <ping_url> .` from repo root.
+- If `<ping_url>` is unavailable, request it and block readiness claims until provided.
+- Any AI assistant working on this repo must treat validator failure as a hard stop.
+### Infrastructure constraints
+| Constraint | Limit |
+|---|---|
+| vCPU | 2 |
+| Memory | 8 GB |
+| Inference runtime | < 20 minutes |
+---
+## 3. ENVIRONMENT VARIABLES - Mandatory
+```python
+import os
+API_BASE_URL = os.getenv("API_BASE_URL")
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+MODEL_NAME = os.getenv("MODEL_NAME")
+```
+How to use in code (EXACT PATTERN - matches sample):
+```python
+from openai import OpenAI
+client = OpenAI(
+    base_url=API_BASE_URL,
+    api_key=API_KEY,
+)
+completion = client.chat.completions.create(
+    model=MODEL_NAME,
+    messages=[...],
+    temperature=0.2,
+    max_tokens=200,
+    stream=False,
+)
+response_text = completion.choices[0].message.content or ""
+```
+Rules:
+- NEVER hard-code any of these values
+- NEVER use os.environ["VAR"] (use os.getenv() - matches sample)
+- NEVER use any LLM client other than openai.OpenAI
+- Support both HF_TOKEN and API_KEY with an `or` fallback (matches sample)
+---
+## 4. FILE MAP - Strict Build Order
+| Order | File | Purpose | May import from |
+|---|---|---|---|
+| 1st | models.py | Pydantic models + StepResult wrapper | stdlib, pydantic only |
+| 2nd | tasks.py | Task definitions + hard-coded email data | models.py only |
+| 3rd | graders.py | Deterministic grader functions | models.py, tasks.py only |
+| 4th | environment.py | Core env class: step, reset, state | models, tasks, graders |
+| 5th | server.py | Flask HTTP wrapper: /reset, /step, /state | environment.py, models.py |
+| 6th | inference.py | OpenAI Client inference script | models.py, environment.py |
+| 7th | openenv.yaml | Spec metadata | N/A (data file) |
+| 8th | Dockerfile | Container build | N/A (config file) |
+| 8th | requirements.txt | Pinned dependencies | N/A (config file) |
+| 9th | README.md | Full documentation | N/A (documentation) |
+| 10th | validate-submission.sh | Pre-submission validator script | N/A (shell script) |
+### Rules about files
+- Do NOT create files not listed above. No utils.py, helpers.py, or config.py.
+- Do NOT merge files. Each file has one responsibility.
+- Do NOT create subdirectories. All files live in the project root.
+- Do NOT add `__init__.py`. This is not a package.
+---
+## 5. DEPENDENCY RULES
+### Allowed dependencies
+```txt
+pydantic>=2.0,<3.0
+flask>=3.0,<4.0
+openai>=1.0,<2.0
+gunicorn>=21.0,<23.0
+```
+### Conditionally allowed (only if needed)
+```txt
+numpy
+Pillow
+```
+### Forbidden
+- No LangChain, LlamaIndex, or any agent framework
+- No pandas or scipy
+- No database libraries
+- No async frameworks (FastAPI, aiohttp) - use Flask
+- No frontend frameworks (Streamlit, Gradio)
+- No ML libraries (torch, transformers, sklearn)
+---
+## 6. PYDANTIC MODEL RULES
+### models.py constraints
+- ALL models MUST inherit from pydantic.BaseModel
+- ALL fields MUST have explicit type annotations
+- ALL Literal types MUST use typing.Literal with exhaustive values
+- NO methods on models (except StepResult and ResetResult wrappers)
+- NO validators that call external services
+- NO default_factory that uses randomness
+- Field names MUST be snake_case
+- NO nested models deeper than 2 levels
+### Required models (exact names)
+```python
+class EmailObservation(BaseModel): ...
+class TriageAction(BaseModel): ...
+class RewardResult(BaseModel): ...
+class EnvironmentState(BaseModel): ...
+class StepResult(BaseModel): ...
+class ResetResult(BaseModel): ...
+```
+### StepResult and ResetResult interface (mandatory)
+```python
+class StepResult(BaseModel):
+    observation: EmailObservation
+    reward: float
+    done: bool
+    info: dict[str, str | int | float | bool]
+class ResetResult(BaseModel):
+    observation: EmailObservation
+    info: dict[str, str | int | float | bool]
+```
+### EmailObservation required fields
+| Field | Type | Required |
+|---|---|---|
+| email_id | str | Yes |
+| subject | str | Yes |
+| body | str | Yes |
+| sender | str | Yes |
+| timestamp | str | Yes |
+| thread_history | list[str] | Yes |
+| task_id | str | Yes |
+| step_number | int | Yes |
+| total_emails | int | Yes |
+### TriageAction required fields
+| Field | Type | Required |
+|---|---|---|
+| label | Literal["urgent", "normal", "spam", "archive"] | Yes |
+| summary | str | Yes |
+| route_to | str | Yes |
+### RewardResult required fields
+| Field | Type | Required |
+|---|---|---|
+| score | float | Yes |
+| breakdown | dict[str, float] | Yes |
+| feedback | str | Yes |
+### EnvironmentState required fields
+| Field | Type | Required |
+|---|---|---|
+| task_id | str | Yes |
+| current_step | int | Yes |
+| total_steps | int | Yes |
+| done | bool | Yes |
+| action_history | list | Yes |
+| reward_history | list | Yes |
+---
+## 7. ENVIRONMENT CLASS RULES
+- Class name: EmailTriageEnv
+- Constructor: __init__(self, task_id: str)
+- MUST accept a task_id string
+- MUST NOT call any external API
+- MUST NOT use randomness
+### reset() -> ResetResult
+- MUST return a ResetResult object (not a bare observation)
+- result.observation must contain the first email
+- MUST reset all internal state
+- MUST be callable multiple times without side effects
+- HF Space validator will call /reset and expect HTTP 200 + valid JSON
+### step(action: TriageAction) -> StepResult
+- MUST return a StepResult object (not a tuple)
+- result.observation: next email or terminal observation
+- result.reward: float score for this step
+- result.done: bool indicating episode end
+- result.info: metadata dict
+- MUST never raise an exception from bad agent input
+- If action validation fails: return StepResult with reward=0.0 and continue
+- MUST increment step counter
+- MUST set done=True when all emails processed or max_steps hit
+### state() -> EnvironmentState
+- MUST return the full current internal state
+- MUST be read-only
+### Hard rules for environment.py
+- NO randomness
+- NO API calls
+- NO file I/O during step/reset/state
+- NO global mutable state
+- NO threading or async
+- NO print statements
+---
+## 8. TASK DATA RULES
+Unchanged from previous version.
+- All email data MUST be hard-coded
+- NO loading from external files, URLs, or databases
+- Task IDs: task_easy, task_medium, task_hard
+- Each task defines: task_id, description, emails, ground_truth
+- Ground truth MUST NOT be in observations (no answer leakage)
+- Realistic professional email content
+- NO offensive or NSFW content
+---
+## 9. GRADER RULES
+Unchanged from previous version.
+- Pure functions
+- Deterministic
+- Partial credit
+- Scores in [0.0, 1.0]
+---
+## 10. REWARD FUNCTION RULES
+Unchanged from previous version.
+```text
+final_reward = base_score - (step_count * 0.01) + trajectory_bonus - penalties
+```
+Final reward is clipped to [-1.0, 1.0].
+---
+## 11. SERVER RULES
+### server.py constraints
+- MUST use Flask
+- Exactly THREE routes:
+  - POST /reset: accepts {"task_id": str}, returns ResetResult JSON
+  - POST /step: accepts TriageAction JSON, returns StepResult JSON
+  - POST /state: returns EnvironmentState JSON
+- MUST listen on port 7860
+- MUST handle malformed JSON gracefully (return 400)
+- All responses must include Content-Type: application/json
+- Validator will ping and call /reset, which must return HTTP 200
+### /step response format
+```json
+{
+  "observation": {},
+  "reward": 0.85,
+  "done": false,
+  "info": {"step": 1, "task_id": "task_easy"}
+}
+```
+### /reset response format
+```json
+{
+  "observation": {},
+  "info": {"task_id": "task_easy"}
+}
+```
+---
+## 12. INFERENCE SCRIPT RULES
+CRITICAL PATTERNS FROM SAMPLE - MUST FOLLOW
+### Architecture (matches sample)
+```text
+1. Initialize OpenAI client with env vars
+2. Create environment instance
+3. Call reset(), get initial observation
+4. Loop up to MAX_STEPS:
+   a. Build prompt from observation + history
+   b. Call LLM
+   c. Parse response into action (with fallback)
+   d. Call step(action)
+   e. Record history
+   f. Check done flag
+5. Print results
+```
+### Mandatory constants
+```python
+MAX_STEPS = 10
+TEMPERATURE = 0.2
+MAX_TOKENS = 200
+FALLBACK_ACTION = ...
+```
+### Response parsing rules
+- Do NOT rely only on response_format={"type": "json_object"}
+- Parse free-text responses with regex or string matching
+- If parsing fails, use a fallback action
+- Strip prefixes like `action:` or `next action:` before parsing
+- Regex parsing with fallback is preferred
+### History tracking
+```python
+history: list[str] = []
+# step, action, and reward come from the current iteration of the inference loop
+history_line = f"Step {step}: {action} -> reward {reward:+.2f}"
+history.append(history_line)
+```
+### Error handling
+```python
+try:
+    completion = client.chat.completions.create(...)
+    response_text = completion.choices[0].message.content or ""
+except Exception as exc:
+    print(f"Model request failed ({exc}). Using fallback action.")
+    response_text = ""
+```
+### Output format
+```text
+Episode: task_easy
+Step 1: label=urgent, route=safety -> reward +0.85
+Final score: 0.85
+=== SCORE TABLE ===
+Task         Score    Steps
+task_easy    0.85     1
+task_medium  0.62     5
+task_hard    0.45     2
+Mean         0.64
+```
+### File naming and location
+- File MUST be named inference.py
+- MUST be in the project root directory
+- MUST be runnable with python inference.py
+- MUST complete in under 20 minutes
+---
+## 13. DOCKERFILE RULES
+- Base image: python:3.11-slim
+- WORKDIR: /app
+- Copy requirements.txt first, pip install, then copy source
+- EXPOSE 7860
+- Create non-root user
+- CMD starts the server
+- Must build with --platform linux/amd64
+- Must run within 2 vCPU / 8 GB memory
+- No unnecessary system packages
+- No CUDA/GPU dependencies
+---
+## 14. CODE STYLE RULES
+- Python 3.11+
+- Type hints on ALL function signatures
+- Docstrings on ALL public functions (Google style)
+- No single-letter variable names except i in loops
+- Comments explain WHY, not WHAT
+- Max line length: 100 characters
+- f-strings only
+- No wildcard imports
+- Import order: stdlib -> third-party -> local
+---
+## 15. WHAT AI MUST NEVER DO
+- Never add features not in this spec
+- Never use an LLM inside a grader
+- Never generate fake scores
+- Never create a UI
+- Never use randomness in the environment
+- Never store API keys in code
+- Never skip error handling in step()
+- Never use bare dicts where Pydantic models are specified
+- Never name the inference script baseline.py
+- Never use OPENAI_API_KEY; use HF_TOKEN/API_KEY
+- Never use response_format={"type": "json_object"} without text-parsing fallback
+- Never return tuples from step/reset; use StepResult/ResetResult objects
+- Never skip the fallback action pattern
+- Never skip history tracking in inference
+---
+## 16. DEFINITION OF DONE - Per Phase Checklist
+### Phase 1 complete when
+- models.py exists with all 6 models (including StepResult, ResetResult)
+- All fields match this document
+- Models instantiate with sample data without errors
+- StepResult has observation, reward, done, info attributes
+### Phase 2 complete when
+- tasks.py exists with 3 tasks
+- All email data is realistic and hard-coded
+- Ground truth exists for every email
+- No answer leakage
+### Phase 3 complete when
+- graders.py has 3 pure grader functions
+- Partial credit works
+- All scores in [0.0, 1.0]
+### Phase 4 complete when
+- environment.py has EmailTriageEnv class
+- reset() returns ResetResult
+- step() returns StepResult
+- step() handles invalid input without crashing
+- Full episode runs to completion
+### Phase 5 complete when
+- server.py has /reset, /step, /state routes
+- /reset returns {"observation": ..., "info": ...}
+- /step returns {"observation": ..., "reward": ..., "done": ..., "info": ...}
+- Malformed requests return 400
+- Port 7860
+### Phase 6 complete when
+- inference.py follows sample architecture
+- Uses os.getenv() for API_BASE_URL, HF_TOKEN/API_KEY, MODEL_NAME
+- Has MAX_STEPS, TEMPERATURE, MAX_TOKENS, FALLBACK constants
+- Has history tracking
+- Has response parsing with fallback
+- Has try/except around API calls
+- Prints score table
+- Completes in under 20 minutes
+### Phase 7-9
+Unchanged from previous version.
+---
+## 17. WHEN IN DOUBT
+- Re-read this file
+- Re-read the project briefing
+- Re-read the sample inference.py
+- Match the sample patterns
+- Choose the simpler option
+- Ask the human, do not guess
+This file is the law. Code that violates it gets deleted.
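
Note on RULES.md section 12 (Response parsing rules): the "regex parsing with fallback" pattern might look like the sketch below. This is an illustration under assumptions, not the parser in `inference.py`: the helper names `parse_action` and `_field`, the regular expressions, and the assumed `key: value` / `key=value` reply shapes are all hypothetical. The fallback dict is the one shown in README.md, and the allowed labels come from the `TriageAction` table.

```python
import re
from typing import Optional

ALLOWED_LABELS = ("urgent", "normal", "spam", "archive")
FALLBACK_ACTION = {
    "label": "normal",
    "summary": "Unable to parse response",
    "route_to": "general",
}
# Matches a leading "action:" or "next action:" prefix (hypothetical pattern).
_PREFIX = re.compile(r"^\s*(?:next\s+)?action\s*:\s*", re.IGNORECASE)


def _field(name: str, text: str) -> Optional[str]:
    """Return the value for `name`, trying a quoted form first, then a bare one."""
    quoted = re.search(rf'"?{name}"?\s*[:=]\s*"([^"]*)"', text, re.IGNORECASE)
    if quoted:
        return quoted.group(1).strip()
    bare = re.search(rf"{name}\s*[:=]\s*([^,\n}}]+)", text, re.IGNORECASE)
    return bare.group(1).strip() if bare else None


def parse_action(response_text: str) -> dict:
    """Parse free-text model output; return a copy of the fallback on failure."""
    text = _PREFIX.sub("", response_text or "")
    label = (_field("label", text) or "").lower()
    if label not in ALLOWED_LABELS:
        return dict(FALLBACK_ACTION)
    return {
        "label": label,
        "summary": _field("summary", text) or FALLBACK_ACTION["summary"],
        "route_to": (_field("route_to", text) or FALLBACK_ACTION["route_to"]).lower(),
    }


print(parse_action("action: label=urgent, route_to=Safety, summary=Overheating report"))
print(parse_action("I am not sure what to do"))
```

Returning a copy of the fallback, rather than raising, keeps the episode running when the model output is unparseable, which is the behavior both README.md and RULES.md ask for. Bare (unquoted) values stop at the first comma, so a comma inside an unquoted summary would be truncated in this sketch.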

app.py ADDED

	@@ -0,0 +1,609 @@

+"""Auxiliary server entrypoint required by OpenEnv local validation checks."""
+import os
+from flask import Flask, Response, jsonify, request
+from environment import EmailTriageEnv
+from tasks import get_task_scenario_count, list_task_ids
+FRONTEND_HTML = """<!doctype html>
+<html lang="en">
+<head>
+    <meta charset="utf-8" />
+    <meta name="viewport" content="width=device-width, initial-scale=1" />
+    <title>Inbox Helper Practice</title>
+    <style>
+        @import url('https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;600;700&family=IBM+Plex+Mono:wght@400;500&display=swap');
+        :root {
+            --bg: #f5f1e9;
+            --paper: #fffaf2;
+            --ink: #102433;
+            --accent: #ea6a2a;
+            --accent-soft: #ffd6bf;
+            --line: #d7cabb;
+            --ok: #0f7b6c;
+            --warn: #9a3a12;
+            --radius: 14px;
+        }
+        * { box-sizing: border-box; }
+        body {
+            margin: 0;
+            font-family: 'Space Grotesk', sans-serif;
+            color: var(--ink);
+            background:
+                radial-gradient(1100px 460px at -10% -20%, #f2bc9f 0%, transparent 60%),
+                radial-gradient(1100px 520px at 120% 115%, #b8d7cf 0%, transparent 62%),
+                var(--bg);
+            min-height: 100vh;
+        }
+        .wrap {
+            max-width: 1100px;
+            margin: 28px auto;
+            padding: 0 16px;
+            animation: reveal .45s ease-out;
+        }
+        @keyframes reveal {
+            from { opacity: 0; transform: translateY(10px); }
+            to { opacity: 1; transform: translateY(0); }
+        }
+        .title {
+            display: flex;
+            justify-content: space-between;
+            align-items: baseline;
+            gap: 14px;
+            margin-bottom: 14px;
+        }
+        h1 {
+            margin: 0;
+            font-size: clamp(1.5rem, 2vw, 2.2rem);
+            letter-spacing: .4px;
+        }
+        .subtitle {
+            margin: 6px 0 0;
+            font-size: .95rem;
+            opacity: .8;
+        }
+        .badge {
+            background: var(--accent-soft);
+            border: 1px solid #f2b693;
+            color: #7f2e0b;
+            padding: 6px 10px;
+            border-radius: 999px;
+            font-size: .85rem;
+            font-weight: 600;
+        }
+        .grid {
+            display: grid;
+            grid-template-columns: 1fr;
+            gap: 14px;
+        }
+        @media (min-width: 900px) {
+            .grid { grid-template-columns: 1fr 1fr; }
+            .wide { grid-column: span 2; }
+        }
+        .card {
+            background: var(--paper);
+            border: 1px solid var(--line);
+            border-radius: var(--radius);
+            padding: 14px;
+            box-shadow: 0 8px 28px rgba(16, 36, 51, 0.08);
+        }
+        .card h2 {
+            margin: 0 0 10px;
+            font-size: 1rem;
+            text-transform: uppercase;
+            letter-spacing: .08em;
+            opacity: .86;
+        }
+        .row {
+            display: flex;
+            flex-wrap: wrap;
+            gap: 8px;
+            align-items: center;
+            margin-bottom: 10px;
+        }
+        select, input, textarea, button {
+            font-family: inherit;
+            font-size: .95rem;
+        }
+        select, input, textarea {
+            width: 100%;
+            border: 1px solid #cdbba6;
+            border-radius: 10px;
+            padding: 9px 10px;
+            background: #fff;
+            color: var(--ink);
+        }
+        textarea {
+            min-height: 92px;
+            resize: vertical;
+        }
+        button {
+            border: 0;
+            border-radius: 10px;
+            padding: 9px 12px;
+            font-weight: 700;
+            background: var(--ink);
+            color: #fff;
+            cursor: pointer;
+            transition: transform .12s ease, opacity .12s ease;
+        }
+        button.secondary {
+            background: #285066;
+        }
+        button.accent {
+            background: var(--accent);
+        }
+        button:hover { transform: translateY(-1px); }
+        button:active { transform: translateY(0); opacity: .92; }
+        .status {
+            padding: 8px 10px;
+            border-radius: 10px;
+            background: #eef7f5;
+            border: 1px solid #c7e4de;
+            color: var(--ok);
+            font-weight: 600;
+            min-height: 40px;
+            display: flex;
+            align-items: center;
+        }
+        .status.error {
+            background: #fff1ea;
+            border-color: #ffc8ae;
+            color: var(--warn);
+        }
+        pre {
+            margin: 0;
+            white-space: pre-wrap;
+            background: #0f1b24;
+            color: #d9efe9;
+            border-radius: 10px;
+            padding: 12px;
+            max-height: 340px;
+            overflow: auto;
+            font-family: 'IBM Plex Mono', monospace;
+            font-size: .85rem;
+            border: 1px solid #21313f;
+        }
+        .email-block {
+            background: #fff;
+            border: 1px solid #d9ccbc;
+            border-radius: 10px;
+            padding: 12px;
+        }
+        .email-row {
+            margin-bottom: 8px;
+            font-size: .95rem;
+            line-height: 1.35;
+        }
+        .email-row strong {
+            display: inline-block;
+            min-width: 66px;
+        }
+        .help {
+            margin: 0 0 10px;
+            font-size: .9rem;
+            opacity: .8;
+        }
+    </style>
+</head>
+<body>
+    <div class="wrap">
+        <div class="title">
+            <div>
+                <h1>Inbox Helper Practice</h1>
+                <p class="subtitle">Practice deciding priority, category, and who should handle each email.</p>
+            </div>
+            <span class="badge" id="badge">connecting...</span>
+        </div>
+        <div class="grid">
+            <section class="card">
+                <h2>Start a Scenario</h2>
+                <p class="help">Pick a difficulty, then click Start.</p>
+                <div class="row">
+                    <select id="taskId">
+                        <option value="task_easy">Easy: one clear email</option>
+                        <option value="task_medium">Medium: mixed inbox</option>
+                        <option value="task_hard">Hard: high-risk complaint</option>
+                        <option value="task_production">Production: full inbox simulator</option>
+                    </select>
+                </div>
+                <div id="productionControls" style="display:none;">
+                    <div class="row">
+                        <select id="productionProfile">
+                            <option value="light">Workload: Light</option>
+                            <option value="standard" selected>Workload: Standard</option>
+                            <option value="heavy">Workload: Heavy</option>
+                        </select>
+                    </div>
+                    <div class="row">
+                        <select id="businessHoursMode">
+                            <option value="false" selected>Time Profile: 24x7 inbox</option>
+                            <option value="true">Time Profile: business hours focus</option>
+                        </select>
+                    </div>
+                    <div class="row">
+                        <select id="escalationMode">
+                            <option value="low">Escalation: Low</option>
+                            <option value="normal" selected>Escalation: Normal</option>
+                            <option value="high">Escalation: High</option>
+                        </select>
+                    </div>
+                </div>
+                <div class="row">
+                    <button class="accent" id="btnReset">Start</button>
+                    <button class="secondary" id="btnState">Check Progress</button>
+                </div>
+                <div class="status" id="status">Ready. Start a scenario.</div>
+            </section>
+            <section class="card">
+                <h2>Your Decision</h2>
+                <p class="help">Choose priority, who should handle it, and a short reason.</p>
+                <div class="row">
+                    <select id="label">
+                        <option value="urgent">Urgent</option>
+                        <option value="normal" selected>Normal</option>
+                        <option value="spam">Spam</option>
+                        <option value="archive">Archive</option>
+                    </select>
+                </div>
+                <div class="row">
+                    <input id="routeTo" placeholder="Who should handle this? (billing, safety, engineering, support)" value="general" />
+                </div>
+                <div class="row">
+                    <textarea id="summary" placeholder="Write one clear sentence with key clues from the email.">Needs review.</textarea>
+                </div>
+                <div class="row">
+                    <button id="btnStep">Send Decision</button>
+                </div>
+            </section>
+            <section class="card wide">
+                <h2>Current Email</h2>
+                <div class="email-block">
+                    <div class="email-row"><strong>Subject:</strong> <span id="mailSubject">No email loaded yet.</span></div>
+                    <div class="email-row"><strong>From:</strong> <span id="mailSender">-</span></div>
+                    <div class="email-row"><strong>Message:</strong> <span id="mailBody">Start a scenario to load an email.</span></div>
+                </div>
+            </section>
+            <section class="card wide">
+                <h2>Details (Advanced)</h2>
+                <pre id="output">Waiting for your first action...</pre>
+            </section>
+        </div>
+    </div>
+    <script>
+        const statusEl = document.getElementById('status');
+        const badgeEl = document.getElementById('badge');
+        const outEl = document.getElementById('output');
+        const mailSubjectEl = document.getElementById('mailSubject');
+        const mailSenderEl = document.getElementById('mailSender');
+        const mailBodyEl = document.getElementById('mailBody');
+        const taskIdEl = document.getElementById('taskId');
+        const productionControlsEl = document.getElementById('productionControls');
+        function setStatus(msg, isError = false) {
+            statusEl.textContent = msg;
+            statusEl.classList.toggle('error', isError);
+        }
+        function writeOutput(value) {
+            outEl.textContent = typeof value === 'string' ? value : JSON.stringify(value, null, 2);
+        }
+        function updateEmailPanel(data) {
+            if (!data || !data.observation) {
+                return;
+            }
+            const obs = data.observation;
+            mailSubjectEl.textContent = obs.subject || 'No subject';
+            mailSenderEl.textContent = obs.sender || '-';
+            mailBodyEl.textContent = obs.body || '';
+        }
+        function updateProductionControlsVisibility() {
+            const isProduction = taskIdEl.value === 'task_production';
+            productionControlsEl.style.display = isProduction ? 'block' : 'none';
+        }
+        async function postJson(path, payload) {
+            const response = await fetch(path, {
+                method: 'POST',
+                headers: { 'Content-Type': 'application/json' },
+                body: JSON.stringify(payload || {}),
+            });
+            const text = await response.text();
+            let data = text;
+            try { data = JSON.parse(text); } catch (e) { /* Non-JSON response: keep the raw text. */ }
+            if (!response.ok) {
+                throw new Error('HTTP ' + response.status + ' - ' + text);
+            }
+            return data;
+        }
+        async function warmup() {
+            try {
+                const res = await fetch('/meta');
+                const data = await res.json();
+                badgeEl.textContent = data.status === 'ok' ? 'ready' : 'check service';
+            } catch (e) {
+                badgeEl.textContent = 'offline';
+            }
+        }
+        document.getElementById('btnReset').addEventListener('click', async () => {
+            const taskId = taskIdEl.value;
+            setStatus('Starting a new scenario...');
+            try {
+                const payload = { task_id: taskId };
+                if (taskId === 'task_production') {
+                    payload.production_profile = document.getElementById('productionProfile').value;
+                    payload.business_hours_mode = document.getElementById('businessHoursMode').value === 'true';
+                    payload.escalation_mode = document.getElementById('escalationMode').value;
+                }
+                const data = await postJson('/reset', payload);
+                setStatus('Scenario started. Read the email below.');
+                updateEmailPanel(data);
+                writeOutput(data);
+            } catch (e) {
+                setStatus('Could not start scenario. See details below.', true);
+                writeOutput(String(e));
+            }
+        });
+        document.getElementById('btnState').addEventListener('click', async () => {
+            setStatus('Checking progress...');
+            try {
+                const data = await postJson('/state', {});
+                setStatus('Progress updated.');
+                writeOutput(data);
+            } catch (e) {
+                setStatus('Could not fetch progress. See details below.', true);
+                writeOutput(String(e));
+            }
+        });
+        document.getElementById('btnStep').addEventListener('click', async () => {
+            const payload = {
+                label: document.getElementById('label').value,
+                summary: document.getElementById('summary').value,
+                route_to: document.getElementById('routeTo').value,
+            };
+            setStatus('Sending your decision...');
+            try {
+                const data = await postJson('/step', payload);
+                setStatus('Decision saved.');
+                updateEmailPanel(data);
+                writeOutput(data);
+            } catch (e) {
+                setStatus('Could not submit decision. See details below.', true);
+                writeOutput(String(e));
+            }
+        });
+        taskIdEl.addEventListener('change', updateProductionControlsVisibility);
+        updateProductionControlsVisibility();
+        warmup();
+    </script>
+</body>
+</html>
+"""
+app = Flask(__name__)
+current_env = EmailTriageEnv(task_id="task_easy")
+SCENARIO_COUNTERS = {task_id: 0 for task_id in list_task_ids()}
+DEFAULT_EVAL_SPLIT = os.getenv("OPENENV_EVAL_SPLIT", "public")
+ALLOW_CLIENT_EVAL_OVERRIDE = (
+    os.getenv("OPENENV_ALLOW_CLIENT_EVAL_OVERRIDE", "false").strip().lower() == "true"
+)
+@app.get("/")
+def root_page():
+    """Render a lightweight frontend for interacting with the environment."""
+    return Response(FRONTEND_HTML, mimetype="text/html")
+@app.get("/meta")
+def root_endpoint():
+    """Return service metadata for health checks and machine clients."""
+    return jsonify(
+        {
+            "name": "email-triage-env",
+            "status": "ok",
+            "endpoints": {
+                "reset": {"method": "POST", "path": "/reset"},
+                "step": {"method": "POST", "path": "/step"},
+                "state": {"method": "POST", "path": "/state"},
+            },
+            "scenario_pools": {
+                "public": {
+                    task_id: get_task_scenario_count(task_id, "public")
+                    for task_id in list_task_ids()
+                },
+            },
+            "eval_split": DEFAULT_EVAL_SPLIT,
+            "production_runtime_controls": {
+                "production_profile": ["light", "standard", "heavy"],
+                "business_hours_mode": [True, False],
+                "escalation_mode": ["low", "normal", "high"],
+            },
+        }
+    )
+@app.post("/reset")
+def reset_endpoint():
+    """Reset the environment with a selected task and return ResetResult JSON."""
+    global current_env
+    global SCENARIO_COUNTERS
+    payload = request.get_json(silent=True)
+    if payload is None:
+        payload = {}
+    elif not isinstance(payload, dict):
+        return jsonify({"error": "Malformed JSON payload."}), 400
+    task_id = payload.get("task_id", "task_easy")
+    if not isinstance(task_id, str):
+        return jsonify({"error": "Field 'task_id' must be a string."}), 400
+    runtime_options: dict[str, object] = {}
+    if task_id == "task_production":
+        production_profile = payload.get("production_profile", "standard")
+        if not isinstance(production_profile, str) or production_profile not in {
+            "light",
+            "standard",
+            "heavy",
+        }:
+            return (
+                jsonify(
+                    {
+                        "error": (
+                            "Field 'production_profile' must be one of "
+                            "light/standard/heavy."
+                        )
+                    }
+                ),
+                400,
+            )
+        escalation_mode = payload.get("escalation_mode", "normal")
+        if not isinstance(escalation_mode, str) or escalation_mode not in {
+            "low",
+            "normal",
+            "high",
+        }:
+            return (
+                jsonify(
+                    {
+                        "error": (
+                            "Field 'escalation_mode' must be one of "
+                            "low/normal/high."
+                        )
+                    }
+                ),
+                400,
+            )
+        business_hours_mode = payload.get("business_hours_mode", False)
+        if isinstance(business_hours_mode, str):
+            business_hours_mode = business_hours_mode.strip().lower() in {
+                "1",
+                "true",
+                "yes",
+                "on",
+            }
+        elif not isinstance(business_hours_mode, bool):
+            return jsonify({"error": "Field 'business_hours_mode' must be boolean."}), 400
+        runtime_options = {
+            "production_profile": production_profile,
+            "business_hours_mode": business_hours_mode,
+            "escalation_mode": escalation_mode,
+        }
+    if not ALLOW_CLIENT_EVAL_OVERRIDE and (
+        "eval_split" in payload or "scenario_index" in payload
+    ):
+        return jsonify(
+            {
+                "error": (
+                    "Client overrides for eval_split/scenario_index are disabled "
+                    "by server policy."
+                )
+            }
+        ), 400
+    eval_split = DEFAULT_EVAL_SPLIT
+    if ALLOW_CLIENT_EVAL_OVERRIDE:
+        requested_split = payload.get("eval_split", DEFAULT_EVAL_SPLIT)
+        if not isinstance(requested_split, str):
+            return jsonify({"error": "Field 'eval_split' must be a string."}), 400
+        eval_split = requested_split
+    requested_index = payload.get("scenario_index") if ALLOW_CLIENT_EVAL_OVERRIDE else None
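+    # bool is a subclass of int, so reject True/False explicitly.
+    if requested_index is not None and (
+        isinstance(requested_index, bool)
+        or not isinstance(requested_index, int)
+        or requested_index < 0
+    ):
+        return jsonify({"error": "Field 'scenario_index' must be a non-negative integer."}), 400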
+    try:
+        scenario_count = get_task_scenario_count(task_id, eval_split)
+        if requested_index is None:
+            scenario_index = SCENARIO_COUNTERS.get(task_id, 0)
+            if scenario_count > 0:
+                SCENARIO_COUNTERS[task_id] = (scenario_index + 1) % scenario_count
+        else:
+            scenario_index = requested_index
+        current_env = EmailTriageEnv(
+            task_id=task_id,
+            scenario_index=scenario_index,
+            split=eval_split,
+            runtime_options=runtime_options,
+        )
+        reset_result = current_env.reset()
+    except KeyError as error:
+        return jsonify({"error": str(error)}), 400
+    return jsonify(reset_result.model_dump())
+@app.post("/step")
+def step_endpoint():
+    """Advance environment by one action and return StepResult JSON."""
+    payload = request.get_json(silent=True)
+    if payload is None:
+        return jsonify({"error": "Malformed JSON payload."}), 400
+    step_result = current_env.step(payload)
+    return jsonify(step_result.model_dump())
+@app.post("/state")
+def state_endpoint():
+    """Return read-only EnvironmentState JSON snapshot."""
+    state_result = current_env.state()
+    return jsonify(state_result.model_dump())
+def main() -> None:
+    """Run the Flask app for local and script-based launches."""
+    app.run(host="0.0.0.0", port=7860)
+if __name__ == "__main__":
+    main()
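
The endpoints above can be exercised from a script as well as from the bundled frontend. The following is a minimal client sketch, not part of the commit: the helper names (`build_reset_payload`, `build_step_payload`, `post_json`) and `BASE_URL` are hypothetical, and it assumes the server is reachable on localhost at the port used by `main()`. The field names mirror what the frontend JavaScript sends to `/reset` and `/step`.

```python
"""Hypothetical client helpers for the /reset and /step endpoints."""
import json
from urllib import request as urlrequest

# Assumption: the Flask app was started locally via main() on port 7860.
BASE_URL = "http://localhost:7860"


def build_reset_payload(
    task_id="task_easy",
    production_profile="standard",
    business_hours_mode=False,
    escalation_mode="normal",
):
    """Mirror the frontend: runtime controls are sent only for task_production."""
    payload = {"task_id": task_id}
    if task_id == "task_production":
        payload["production_profile"] = production_profile
        payload["business_hours_mode"] = bool(business_hours_mode)
        payload["escalation_mode"] = escalation_mode
    return payload


def build_step_payload(label, route_to, summary):
    """Build the three-field action body that the frontend posts to /step."""
    return {"label": label, "route_to": route_to, "summary": summary}


def post_json(path, payload, base_url=BASE_URL):
    """POST a JSON body and decode the JSON reply.

    urlopen raises urllib.error.HTTPError for 4xx/5xx replies, so validation
    errors from the server surface as exceptions rather than return values.
    """
    body = json.dumps(payload).encode("utf-8")
    req = urlrequest.Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlrequest.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, an episode would start with `post_json("/reset", build_reset_payload("task_production", escalation_mode="high"))` and continue with `post_json("/step", build_step_payload("urgent", "billing", "Customer reports a duplicate charge."))` until the reply's `done` field is true. The payload builders are kept separate from the network call so they can be checked without a live server.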

environment.py ADDED

	@@ -0,0 +1,419 @@

+"""Core OpenEnv email triage environment implementation."""
+import os
+from typing import cast
+from pydantic import ValidationError
+from graders import grade_easy, grade_hard, grade_medium_step
+from models import (
+    EmailObservation,
+    EnvironmentState,
+    ResetResult,
+    RewardResult,
+    StepResult,
+    TriageAction,
+)
+from tasks import get_task_definition
+class EmailTriageEnv:
+    """Deterministic email triage environment implementing reset, step, and state."""
+    def __init__(
+        self,
+        task_id: str,
+        scenario_index: int = 0,
+        split: str | None = None,
+        runtime_options: dict[str, object] | None = None,
+    ) -> None:
+        """Initialize environment with a selected task.
+        Args:
+            task_id: Task identifier such as task_easy, task_medium, task_hard, or task_production.
+            scenario_index: Deterministic scenario index within the task pool.
+            split: Scenario split, either public or private_eval.
+            runtime_options: Optional deterministic runtime controls for task generation.
+        """
+        self.task_id = task_id
+        self._episode_index = max(0, scenario_index)
+        self.split = split or os.getenv("OPENENV_EVAL_SPLIT", "public")
+        self.runtime_options = runtime_options or {}
+        self._task_definition = get_task_definition(
+            task_id,
+            self._episode_index,
+            self.split,
+            self.runtime_options,
+        )
+        self._scenario_id = str(self._task_definition.get("scenario_id", "unknown"))
+        self._emails = cast(list[dict[str, object]], self._task_definition.get("emails", []))
+        self._ground_truth = cast(
+            list[dict[str, object]], self._task_definition.get("ground_truth", [])
+        )
+        self._current_index = 0
+        self._current_step = 0
+        self._done = False
+        self._max_steps = max(10, len(self._emails) + 5)
+        self._action_history: list[TriageAction] = []
+        self._reward_history: list[float] = []
+        self._base_score_history: list[float] = []
+        self._generated_followups = 0
+        self._max_generated_followups = 4
+        self._followup_quality_threshold = 0.7
+        self._configure_runtime_controls()
+    def reset(self) -> ResetResult:
+        """Reset episode state and return the first observation.
+        Returns:
+            ResetResult containing first observation and metadata.
+        """
+        self._task_definition = get_task_definition(
+            self.task_id,
+            self._episode_index,
+            self.split,
+            self.runtime_options,
+        )
+        self._scenario_id = str(self._task_definition.get("scenario_id", "unknown"))
+        self._emails = cast(list[dict[str, object]], self._task_definition.get("emails", []))
+        self._ground_truth = cast(
+            list[dict[str, object]], self._task_definition.get("ground_truth", [])
+        )
+        self._current_index = 0
+        self._current_step = 0
+        self._done = False
+        self._max_steps = max(10, len(self._emails) + 5)
+        self._action_history = []
+        self._reward_history = []
+        self._base_score_history = []
+        self._generated_followups = 0
+        self._configure_runtime_controls()
+        self._episode_index += 1
+        first_observation = self._build_observation(self._current_index)
+        return ResetResult(
+            observation=first_observation,
+            info={
+                "task_id": self.task_id,
+                "scenario_id": self._scenario_id,
+                "split": self.split,
+                "step": self._current_step,
+            },
+        )
+    def step(self, action: TriageAction) -> StepResult:
+        """Apply an action and return StepResult.
+        Args:
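+            action: Proposed triage action, either a TriageAction or a raw mapping;
+                it is validated with TriageAction.model_validate before grading.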
+        Returns:
+            StepResult with next observation, reward, done flag, and metadata.
+        """
+        if self._done:
+            return StepResult(
+                observation=self._terminal_observation(),
+                reward=0.0,
+                done=True,
+                info={
+                    "task_id": self.task_id,
+                    "scenario_id": self._scenario_id,
+                    "split": self.split,
+                    "step": self._current_step,
+                    "already_done": True,
+                },
+            )
+        try:
+            validated_action = TriageAction.model_validate(action)
+        except ValidationError as validation_error:
+            self._current_step += 1
+            self._reward_history.append(0.0)
+            self._done = self._current_step >= self._max_steps
+            return StepResult(
+                observation=self._build_observation(self._current_index),
+                reward=0.0,
+                done=self._done,
+                info={
+                    "task_id": self.task_id,
+                    "scenario_id": self._scenario_id,
+                    "split": self.split,
+                    "step": self._current_step,
+                    "validation_error": str(validation_error),
+                },
+            )
+        base_result = self._grade_current_step(validated_action)
+        base_score = base_result.score
+        truth_for_step = (
+            self._ground_truth[min(self._current_index, len(self._ground_truth) - 1)]
+            if self._ground_truth
+            else {}
+        )
+        self._maybe_enqueue_follow_up(validated_action, truth_for_step, base_score)
+        self._action_history.append(validated_action)
+        self._base_score_history.append(base_score)
+        self._current_step += 1
+        penalties = self._compute_penalties(validated_action)
+        trajectory_bonus = self._compute_trajectory_bonus()
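+        # Reward: base score minus a 0.01-per-step time cost, plus any completion
+        # bonus, minus penalties; the total is then passed through _clip_reward.
+        final_reward = self._clip_reward(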
+            base_score - (self._current_step * 0.01) + trajectory_bonus - penalties
+        )
+        self._reward_history.append(final_reward)
+        if self._current_index < len(self._emails):
+            self._current_index += 1
+        all_emails_processed = self._current_index >= len(self._emails)
+        self._done = all_emails_processed or self._current_step >= self._max_steps
+        next_observation = (
+            self._terminal_observation()
+            if self._done
+            else self._build_observation(self._current_index)
+        )
+        info = {
+            "task_id": self.task_id,
+            "scenario_id": self._scenario_id,
+            "split": self.split,
+            "step": self._current_step,
+            "base_score": round(base_score, 4),
+            "penalties": round(penalties, 4),
+            "trajectory_bonus": round(trajectory_bonus, 4),
+        }
+        return StepResult(
+            observation=next_observation,
+            reward=final_reward,
+            done=self._done,
+            info=info,
+        )
+    def _maybe_enqueue_follow_up(
+        self,
+        action: TriageAction,
+        truth: dict[str, object],
+        base_score: float,
+    ) -> None:
+        """Insert deterministic escalation follow-up emails for production mode."""
+        if self.task_id != "task_production":
+            return
+        if self._generated_followups >= self._max_generated_followups:
+            return
+        if not self._emails:
+            return
+        expected_label = str(truth.get("label", ""))
+        expected_route = str(truth.get("route_to", "general"))
+        is_missed_critical = (
+            expected_label == "urgent"
+            and (action.label != "urgent" or expected_route not in action.route_to.lower())
+        )
+        if not is_missed_critical and base_score >= self._followup_quality_threshold:
+            return
+        source_email = self._emails[min(self._current_index, len(self._emails) - 1)]
+        source_subject = str(source_email.get("subject", "Inbox incident"))
+        source_timestamp = str(source_email.get("timestamp", "2026-04-03T00:00:00Z"))
+        followup_email = {
+            "email_id": f"followup-{self._scenario_id}-{self._generated_followups + 1}",
+            "subject": f"Escalation follow-up: {source_subject}",
+            "body": (
+                "Automated escalation triggered because prior triage appears incomplete. "
+                "Please route to the responsible team and provide a clear summary now."
+            ),
+            "sender": "incident-control@acme-enterprise.com",
+            "timestamp": source_timestamp,
+            "thread_history": [f"Previous message subject: {source_subject}"],
+        }
+        followup_truth = {
+            "label": "urgent",
+            "route_to": expected_route,
+            "priority_weight": min(max(float(truth.get("priority_weight", 1.5)) + 0.2, 1.5), 2.0),
+            "summary_keywords": ["escalation", "follow-up", expected_route],
+        }
+        insert_at = min(self._current_index + 1, len(self._emails))
+        self._emails.insert(insert_at, followup_email)
+        self._ground_truth.insert(insert_at, followup_truth)
+        self._generated_followups += 1
+    def _configure_runtime_controls(self) -> None:
+        """Apply deterministic runtime control options for production simulator."""
+        if self.task_id != "task_production":
+            self._max_generated_followups = 4
+            self._followup_quality_threshold = 0.7
+            return
+        escalation_mode = str(self.runtime_options.get("escalation_mode", "normal")).lower()
+        escalation_map = {
+            "low": (2, 0.55),
+            "normal": (4, 0.7),
+            "high": (8, 0.85),
+        }
+        max_followups, threshold = escalation_map.get(escalation_mode, escalation_map["normal"])
+        self._max_generated_followups = max_followups
+        self._followup_quality_threshold = threshold
+    def state(self) -> EnvironmentState:
+        """Return a read-only snapshot of episode progress and history.
+        Returns:
+            EnvironmentState with progress and history.
+        """
+        return EnvironmentState(
+            task_id=self.task_id,
+            current_step=self._current_step,
+            total_steps=self._max_steps,
+            done=self._done,
+            action_history=list(self._action_history),
+            reward_history=list(self._reward_history),
+        )
+    def _build_observation(self, email_index: int) -> EmailObservation:
+        """Build observation for the email at a given index.
+        Args:
+            email_index: Zero-based email index.
+        Returns:
+            EmailObservation for the selected email or terminal placeholder.
+        """
+        if not self._emails:
+            return self._terminal_observation()
+        safe_index = min(max(email_index, 0), len(self._emails) - 1)
+        email_payload = self._emails[safe_index]
+        return EmailObservation(
+            email_id=str(email_payload.get("email_id", "")),
+            subject=str(email_payload.get("subject", "")),
+            body=str(email_payload.get("body", "")),
+            sender=str(email_payload.get("sender", "")),
+            timestamp=str(email_payload.get("timestamp", "")),
+            thread_history=[str(item) for item in email_payload.get("thread_history", [])],
+            task_id=self.task_id,
+            step_number=self._current_step,
+            total_emails=len(self._emails),
+        )
+    def _terminal_observation(self) -> EmailObservation:
+        """Build terminal observation returned when episode is complete.
+        Returns:
+            Terminal EmailObservation payload.
+        """
+        return EmailObservation(
+            email_id="terminal",
+            subject="Episode complete",
+            body="No further emails remain for this task.",
+            sender="system",
+            timestamp="",
+            thread_history=[],
+            task_id=self.task_id,
+            step_number=self._current_step,
+            total_emails=len(self._emails),
+        )
+    def _grade_current_step(self, action: TriageAction) -> RewardResult:
+        """Select deterministic grader based on task and current progress.
+        Args:
+            action: Validated action for the current step.
+        Returns:
+            RewardResult from task-specific grader.
+        """
+        if not self._ground_truth:
+            return RewardResult(
+                score=0.0,
+                breakdown={"missing_ground_truth": 1.0},
+                feedback="Missing ground truth for task.",
+            )
+        # Clamp the index so a late step still grades against the last ground-truth entry.
+        truth = self._ground_truth[min(self._current_index, len(self._ground_truth) - 1)]
+        if self.task_id == "task_easy":
+            return grade_easy(action, truth)
+        if self.task_id == "task_medium":
+            return grade_medium_step(action, truth)
+        # Every other task_id, including task_hard and task_production, uses the hard grader.
+        return grade_hard(action, truth)
+    def _compute_penalties(self, action: TriageAction) -> float:
+        """Compute deterministic penalties according to reward policy.
+        Args:
+            action: Validated action for the step.
+        Returns:
+            Total penalty value for current step.
+        """
+        penalty_total = 0.0
+        summary_too_short = len(action.summary.strip()) < 10
+        if action.label == "archive" and summary_too_short:
+            penalty_total += 0.5
+        if self._is_repeated_action_pattern(action):
+            penalty_total += 0.3
+        return penalty_total
+    def _compute_trajectory_bonus(self) -> float:
+        """Return trajectory bonus when episode completion quality is high.
+        Returns:
+            0.2 when the mean base score exceeds 0.8 at episode completion, else 0.0.
+        """
+        if not self._base_score_history:
+            return 0.0
+        all_emails_done_after_step = self._current_index + 1 >= len(self._emails)
+        if not all_emails_done_after_step:
+            return 0.0
+        mean_base = sum(self._base_score_history) / len(self._base_score_history)
+        return 0.2 if mean_base > 0.8 else 0.0
+    def _is_repeated_action_pattern(self, action: TriageAction) -> bool:
+        """Detect whether same action appears three times consecutively.
+        Args:
+            action: Current action.
+        Returns:
+            True when repeated label and route occur three times in a row.
+        """
+        if len(self._action_history) < 2:
+            return False
+        previous_action = self._action_history[-1]
+        older_action = self._action_history[-2]
+        return (
+            previous_action.label == older_action.label == action.label
+            and previous_action.route_to.strip().lower()
+            == older_action.route_to.strip().lower()
+            == action.route_to.strip().lower()
+        )
+    def _clip_reward(self, reward_value: float) -> float:
+        """Clip reward to the inclusive range [-1.0, 1.0].
+        Args:
+            reward_value: Raw reward value.
+        Returns:
+            Clipped reward.
+        """
+        return max(-1.0, min(1.0, reward_value))
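The penalty, repetition, and clipping helpers above can be exercised in isolation. The sketch below is a standalone approximation, not the environment's code: `Action`, `is_repeated`, `penalties`, and `shaped_reward` are hypothetical names, and the way the pieces are combined (base score minus penalties plus bonus, then clipped) is an assumption, since the `step` logic that combines them is not shown in this part of the diff.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """Hypothetical stand-in for TriageAction (label, summary, route_to)."""
    label: str
    summary: str
    route_to: str


def is_repeated(history: list[Action], action: Action) -> bool:
    """True when label and normalized route match the two previous actions."""
    if len(history) < 2:
        return False
    prev, older = history[-1], history[-2]
    same_label = prev.label == older.label == action.label
    routes = {a.route_to.strip().lower() for a in (prev, older, action)}
    return same_label and len(routes) == 1


def penalties(history: list[Action], action: Action) -> float:
    """Mirror the two rules: short-summary archive (0.5) and repetition (0.3)."""
    total = 0.0
    if action.label == "archive" and len(action.summary.strip()) < 10:
        total += 0.5
    if is_repeated(history, action):
        total += 0.3
    return total


def shaped_reward(base: float, history: list[Action], action: Action,
                  bonus: float = 0.0) -> float:
    """Assumed combination: base - penalties + bonus, clipped to [-1.0, 1.0]."""
    raw = base - penalties(history, action) + bonus
    return max(-1.0, min(1.0, raw))


# Two earlier archive/general actions, then a third with a too-short summary.
history = [Action("archive", "old newsletter, no action", "general")] * 2
lazy = Action("archive", "n/a", "General ")
print(shaped_reward(0.4, history, lazy))
```

In this sketch the third action triggers both penalties, so a base score of 0.4 ends up negative. Whether the real environment subtracts penalties in exactly this order is not confirmed by the lines shown here.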

graders.py ADDED Viewed

	@@ -0,0 +1,296 @@

+"""Deterministic graders for OpenEnv email triage tasks."""
+import re
+from models import RewardResult, TriageAction
+ROUTE_ALIAS_MAP = {
+    "billing": ["billing", "finance", "payments", "accounts"],
+    "safety": ["safety", "compliance", "risk"],
+    "engineering": ["engineering", "eng", "sre", "platform", "on-call"],
+    "support": ["support", "helpdesk", "customer support"],
+    "general": ["general", "inbox", "operations"],
+}
+def _clip_score(score_value: float) -> float:
+    """Clip a score to the inclusive range [0.0, 1.0].
+    Args:
+        score_value: Raw score.
+    Returns:
+        Clipped score.
+    """
+    return max(0.0, min(1.0, score_value))
+def _normalized_text(text_value: str) -> str:
+    """Return normalized lowercase text for deterministic comparisons.
+    Args:
+        text_value: Input text.
+    Returns:
+        Normalized text.
+    """
+    return text_value.strip().lower()
+def _route_matches(action_route: str, expected_route: str) -> bool:
+    """Check whether the action route maps to the expected canonical route.
+    Args:
+        action_route: Route provided by agent.
+        expected_route: Route expected by ground truth.
+    Returns:
+        True when the expected route is among the canonical categories derived from the action route.
+    """
+    normalized_expected = _normalized_text(expected_route)
+    if not normalized_expected:
+        return False
+    return normalized_expected in _canonical_route_tokens(action_route)
+def _canonical_route_tokens(action_route: str) -> set[str]:
+    """Map free-form route text to canonical route categories."""
+    normalized_action = _normalized_text(action_route)
+    if not normalized_action:
+        return set()
+    route_fragments = [
+        fragment.strip()
+        for fragment in re.split(r"[,;/|]+", normalized_action)
+        if fragment.strip()
+    ]
+    canonical: set[str] = set()
+    for fragment in route_fragments:
+        for route_name, aliases in ROUTE_ALIAS_MAP.items():
+            if any(alias in fragment for alias in aliases):
+                canonical.add(route_name)
+                break
+    # Fallback for phrases without separators.
+    if not canonical:
+        for route_name, aliases in ROUTE_ALIAS_MAP.items():
+            if any(alias in normalized_action for alias in aliases):
+                canonical.add(route_name)
+    return canonical
+def _route_noise_penalty(action_route: str) -> float:
+    """Penalize over-routing to many teams in one action."""
+    route_count = len(_canonical_route_tokens(action_route))
+    if route_count <= 2:
+        return 0.0
+    return min(0.24, 0.08 * (route_count - 2))
+def _summary_keyword_score(summary_text: str, ground_truth: dict) -> float:
+    """Score summary quality using deterministic keyword overlap.
+    Args:
+        summary_text: Summary text produced by the agent.
+        ground_truth: Ground-truth dict that may include summary keywords.
+    Returns:
+        Score in [0.0, 1.0] based on matched summary keywords.
+    """
+    raw_keywords = ground_truth.get("summary_keywords", [])
+    if not isinstance(raw_keywords, list):
+        return 1.0 if len(summary_text.strip()) >= 10 else 0.0
+    keywords = [
+        _normalized_text(str(keyword))
+        for keyword in raw_keywords
+        if _normalized_text(str(keyword))
+    ]
+    if not keywords:
+        return 1.0 if len(summary_text.strip()) >= 10 else 0.0
+    normalized_summary = _normalized_text(summary_text)
+    matches = 0
+    for keyword in keywords:
+        if keyword in normalized_summary:
+            matches += 1
+    base_score = matches / len(keywords)
+    # Discourage keyword stuffing and overly verbose summaries.
+    word_count = len(re.findall(r"[a-z0-9'-]+", normalized_summary))
+    if word_count < 4:
+        brevity_factor = 0.6
+    elif word_count <= 40:
+        brevity_factor = 1.0
+    else:
+        brevity_factor = max(0.45, 1.0 - (word_count - 40) * 0.02)
+    list_like_penalty = 0.85 if normalized_summary.count(",") >= 6 and matches >= 3 else 1.0
+    return _clip_score(base_score * brevity_factor * list_like_penalty)
+def grade_easy(action: TriageAction, ground_truth: dict) -> RewardResult:
+    """Grade easy task with deterministic partial credit.
+    Args:
+        action: Agent action for one email.
+        ground_truth: Expected label and route.
+    Returns:
+        Deterministic reward result in [0.0, 1.0].
+    """
+    expected_label = _normalized_text(str(ground_truth.get("label", "")))
+    expected_route = _normalized_text(str(ground_truth.get("route_to", "")))
+    label_correct = _normalized_text(action.label) == expected_label
+    route_correct = _route_matches(action.route_to, expected_route)
+    summary_score = _summary_keyword_score(action.summary, ground_truth)
+    noise_penalty = _route_noise_penalty(action.route_to)
+    score_value = (0.6 if label_correct else 0.0) + (0.25 if route_correct else 0.0)
+    score_value += 0.15 * summary_score
+    score_value -= noise_penalty
+    score_value = _clip_score(score_value)
+    breakdown = {
+        "label_match": 1.0 if label_correct else 0.0,
+        "route_match": 1.0 if route_correct else 0.0,
+        "summary_match": round(summary_score, 4),
+        "route_noise_penalty": round(noise_penalty, 4),
+    }
+    feedback = "Easy-task grading completed with context summary scoring."
+    return RewardResult(score=score_value, breakdown=breakdown, feedback=feedback)
+def grade_medium_step(action: TriageAction, truth: dict) -> RewardResult:
+    """Grade one medium-task step without cumulative history effects."""
+    expected_label = _normalized_text(str(truth.get("label", "")))
+    expected_route = _normalized_text(str(truth.get("route_to", "")))
+    priority_weight = max(float(truth.get("priority_weight", 1.0)), 0.1)
+    label_correct = _normalized_text(action.label) == expected_label
+    route_correct = _route_matches(action.route_to, expected_route)
+    summary_score = _summary_keyword_score(action.summary, truth)
+    noise_penalty = _route_noise_penalty(action.route_to)
+    per_email_score = (0.55 if label_correct else 0.0) + (0.3 if route_correct else 0.0)
+    per_email_score += 0.15 * summary_score
+    per_email_score -= noise_penalty
+    per_email_score = _clip_score(per_email_score)
+    weighted_step_score = _clip_score(per_email_score * min(priority_weight, 2.0))
+    return RewardResult(
+        score=weighted_step_score,
+        breakdown={
+            "label_match": 1.0 if label_correct else 0.0,
+            "route_match": 1.0 if route_correct else 0.0,
+            "summary_match": round(summary_score, 4),
+            "priority_weight": round(priority_weight, 4),
+            "route_noise_penalty": round(noise_penalty, 4),
+        },
+        feedback="Medium-task step grading completed.",
+    )
+def grade_medium(actions: list[TriageAction], ground_truths: list[dict]) -> RewardResult:
+    """Grade medium task using weighted per-email partial scoring.
+    Args:
+        actions: Agent actions for the medium task email queue.
+        ground_truths: Expected action details for each email.
+    Returns:
+        Deterministic reward result in [0.0, 1.0].
+    """
+    comparable_count = min(len(actions), len(ground_truths))
+    if comparable_count == 0:
+        return RewardResult(
+            score=0.0,
+            breakdown={"emails_scored": 0.0, "weighted_average": 0.0},
+            feedback="No actions available for grading.",
+        )
+    weighted_score_sum = 0.0
+    weight_sum = 0.0
+    label_hits = 0
+    route_hits = 0
+    summary_total = 0.0
+    noise_penalty_total = 0.0
+    for index in range(comparable_count):
+        action = actions[index]
+        truth = ground_truths[index]
+        step_result = grade_medium_step(action, truth)
+        priority_weight = float(step_result.breakdown.get("priority_weight", 1.0))
+        weighted_score_sum += step_result.score
+        weight_sum += min(priority_weight, 2.0)
+        label_hits += 1 if step_result.breakdown.get("label_match", 0.0) > 0 else 0
+        route_hits += 1 if step_result.breakdown.get("route_match", 0.0) > 0 else 0
+        summary_total += float(step_result.breakdown.get("summary_match", 0.0))
+        noise_penalty_total += float(step_result.breakdown.get("route_noise_penalty", 0.0))
+    weighted_average = weighted_score_sum / weight_sum if weight_sum > 0.0 else 0.0
+    score_value = _clip_score(weighted_average)
+    breakdown = {
+        "emails_scored": float(comparable_count),
+        "label_accuracy": label_hits / comparable_count,
+        "route_accuracy": route_hits / comparable_count,
+        "summary_accuracy": summary_total / comparable_count,
+        "avg_route_noise_penalty": noise_penalty_total / comparable_count,
+        "weighted_average": score_value,
+    }
+    feedback = "Weighted medium-task grading completed."
+    return RewardResult(score=score_value, breakdown=breakdown, feedback=feedback)
+def grade_hard(action: TriageAction, ground_truth: dict) -> RewardResult:
+    """Grade hard task using weighted policy-sensitive components.
+    Args:
+        action: Agent action for hard task case.
+        ground_truth: Expected routing and urgency intent.
+    Returns:
+        Deterministic reward result in [0.0, 1.0].
+    """
+    expected_label = _normalized_text(str(ground_truth.get("label", "urgent")))
+    primary_route = _normalized_text(str(ground_truth.get("route_to", "safety")))
+    secondary_route = _normalized_text(str(ground_truth.get("cc_route", "billing")))
+    spam_penalty = float(ground_truth.get("penalize_spam", 0.2))
+    normalized_route = _normalized_text(action.route_to)
+    has_primary_route = _route_matches(normalized_route, primary_route)
+    has_secondary_route = _route_matches(normalized_route, secondary_route)
+    urgent_label = _normalized_text(action.label) == expected_label
+    summary_score = _summary_keyword_score(action.summary, ground_truth)
+    noise_penalty = _route_noise_penalty(action.route_to)
+    escalation_component = 0.35 if has_primary_route else 0.0
+    routing_component = 0.25 if has_secondary_route else 0.0
+    urgency_component = 0.25 if urgent_label else 0.0
+    summary_component = 0.15 * summary_score
+    raw_score = escalation_component + routing_component + urgency_component + summary_component
+    raw_score -= noise_penalty
+    if _normalized_text(action.label) == "spam":
+        raw_score -= spam_penalty
+    score_value = _clip_score(raw_score)
+    breakdown = {
+        "escalation_component": escalation_component,
+        "routing_component": routing_component,
+        "urgency_component": urgency_component,
+        "summary_component": round(summary_component, 4),
+        "route_noise_penalty": round(noise_penalty, 4),
+        "spam_penalty": spam_penalty if _normalized_text(action.label) == "spam" else 0.0,
+    }
+    feedback = "Hard-task weighted policy grading completed."
+    return RewardResult(score=score_value, breakdown=breakdown, feedback=feedback)
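The route canonicalization and the easy-task weights above interact in a way that is easier to see with numbers. The following is a simplified standalone sketch: `canonical_routes`, `noise_penalty`, and `easy_score` are hypothetical names, the alias map is copied from `ROUTE_ALIAS_MAP`, the whole-string fallback is omitted, and the summary score is passed in directly rather than computed from keywords.

```python
import re

# Copied from ROUTE_ALIAS_MAP in graders.py.
ROUTE_ALIAS_MAP = {
    "billing": ["billing", "finance", "payments", "accounts"],
    "safety": ["safety", "compliance", "risk"],
    "engineering": ["engineering", "eng", "sre", "platform", "on-call"],
    "support": ["support", "helpdesk", "customer support"],
    "general": ["general", "inbox", "operations"],
}


def canonical_routes(route_text: str) -> set[str]:
    """Map each separator-delimited fragment to at most one canonical route."""
    canonical: set[str] = set()
    for fragment in re.split(r"[,;/|]+", route_text.strip().lower()):
        fragment = fragment.strip()
        if not fragment:
            continue
        for name, aliases in ROUTE_ALIAS_MAP.items():
            if any(alias in fragment for alias in aliases):
                canonical.add(name)
                break  # first matching category wins for this fragment
    return canonical


def noise_penalty(route_text: str) -> float:
    """0.08 per canonical route beyond two, capped at 0.24."""
    count = len(canonical_routes(route_text))
    return 0.0 if count <= 2 else min(0.24, 0.08 * (count - 2))


def easy_score(label_ok: bool, route_ok: bool, summary_score: float,
               route_text: str) -> float:
    """Easy-task weights: 0.6 label + 0.25 route + 0.15 summary - noise."""
    score = (0.6 if label_ok else 0.0) + (0.25 if route_ok else 0.0)
    score += 0.15 * summary_score
    score -= noise_penalty(route_text)
    return max(0.0, min(1.0, score))


print(canonical_routes("Finance / SRE"))
print(canonical_routes("safety and billing"))
print(round(easy_score(True, True, 1.0, "billing, safety, engineering, support"), 2))
```

Because each fragment stops at the first matching category, a route written without separators, such as `safety and billing`, resolves to a single category, whereas `safety, billing` resolves to both. Routing to four teams costs 0.16 even when label, route, and summary are all correct.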

inference.py ADDED Viewed

	@@ -0,0 +1,377 @@

+"""Inference script for OpenEnv email triage with strict stdout event format."""
+import argparse
+import json
+import os
+import re
+import time
+from typing import Any
+from openai import OpenAI
+from environment import EmailTriageEnv
+from models import EmailObservation, TriageAction
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+HF_TOKEN = os.getenv("HF_TOKEN")
+API_KEY = HF_TOKEN or os.getenv("API_KEY")
+LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
+BENCHMARK = "openenv-email-triage"
+MAX_STEPS = 30
+TEMPERATURE = 0.2
+MAX_TOKENS = 200
+SUCCESS_SCORE_THRESHOLD = 0.5
+DEFAULT_RUNTIME_BUDGET_SECONDS = int(os.getenv("INFERENCE_RUNTIME_BUDGET_SECONDS", "1140"))
+DEFAULT_REQUEST_TIMEOUT_SECONDS = float(os.getenv("INFERENCE_REQUEST_TIMEOUT_SECONDS", "12"))
+SYSTEM_PROMPT = (
+    "You are an email triage assistant. For each email, prioritize risk/time impact, "
+    "categorize with one label (urgent|normal|spam|archive), route to the best team, "
+    "and summarize the key evidence. Return one JSON object with keys label, summary, route_to."
+)
+FALLBACK_ACTION = {
+    "label": "normal",
+    "summary": "Unable to parse response",
+    "route_to": "general",
+}
+TASK_MAP = {
+    "1": "task_easy",
+    "2": "task_medium",
+    "3": "task_hard",
+    "4": "task_production",
+}
+def parse_args() -> argparse.Namespace:
+    """Parse command-line arguments: task, model override, split, time budgets, and task 4 options."""
+    parser = argparse.ArgumentParser(description="Run OpenEnv email triage inference.")
+    parser.add_argument(
+        "--task",
+        default="all",
+        choices=["1", "2", "3", "4", "all"],
+        help="Task selection: 1, 2, 3, 4, or all.",
+    )
+    parser.add_argument(
+        "--model",
+        default=None,
+        help="Optional model override. Falls back to MODEL_NAME environment variable.",
+    )
+    parser.add_argument(
+        "--split",
+        default=os.getenv("OPENENV_EVAL_SPLIT", "public"),
+        choices=["public", "private_eval"],
+        help="Scenario split to evaluate.",
+    )
+    parser.add_argument(
+        "--episodes-per-task",
+        default=1,
+        type=int,
+        help="Number of deterministic scenarios to evaluate per task.",
+    )
+    parser.add_argument(
+        "--runtime-budget-seconds",
+        default=DEFAULT_RUNTIME_BUDGET_SECONDS,
+        type=int,
+        help="Global wall-clock budget for the full script run.",
+    )
+    parser.add_argument(
+        "--request-timeout-seconds",
+        default=DEFAULT_REQUEST_TIMEOUT_SECONDS,
+        type=float,
+        help="Timeout per LLM request.",
+    )
+    parser.add_argument(
+        "--production-profile",
+        default="standard",
+        choices=["light", "standard", "heavy"],
+        help="Runtime workload profile used for task 4 episodes.",
+    )
+    parser.add_argument(
+        "--business-hours-mode",
+        action="store_true",
+        help="If set, task 4 timestamps focus on business-hours windows.",
+    )
+    parser.add_argument(
+        "--escalation-mode",
+        default="normal",
+        choices=["low", "normal", "high"],
+        help="Escalation strictness for task 4 follow-up generation.",
+    )
+    return parser.parse_args()
+def validate_runtime_config(model_name: str | None) -> str:
+    """Validate required runtime settings and return effective model name."""
+    if not API_KEY:
+        raise ValueError("Missing HF_TOKEN or API_KEY environment variable.")
+    effective_model = model_name or MODEL_NAME
+    return effective_model
+def log_start(task_name: str, benchmark_name: str, model_name: str) -> None:
+    """Emit mandatory START line."""
+    print(
+        f"[START] task={task_name} env={benchmark_name} model={model_name}",
+        flush=True,
+    )
+def log_step(step: int, action_str: str, reward: float, done: bool, error: str | None) -> None:
+    """Emit mandatory STEP line."""
+    error_value = error if error else "null"
+    done_value = str(done).lower()
+    print(
+        f"[STEP] step={step} action={action_str} reward={reward:.2f} "
+        f"done={done_value} error={error_value}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, rewards: list[float]) -> None:
+    """Emit mandatory END line."""
+    rewards_str = ",".join(f"{reward:.2f}" for reward in rewards)
+    print(
+        f"[END] success={str(success).lower()} steps={steps} rewards={rewards_str}",
+        flush=True,
+    )
+def build_user_prompt(observation: EmailObservation, history: list[str]) -> str:
+    """Build model prompt from current observation and recent history."""
+    recent_history = "\n".join(history[-5:]) if history else "None"
+    return (
+        f"email_id: {observation.email_id}\n"
+        f"subject: {observation.subject}\n"
+        f"sender: {observation.sender}\n"
+        f"timestamp: {observation.timestamp}\n"
+        f"body: {observation.body}\n"
+        f"thread_history: {observation.thread_history}\n"
+        f"task_id: {observation.task_id}\n"
+        f"step_number: {observation.step_number}\n"
+        f"total_emails: {observation.total_emails}\n\n"
+        f"recent_history:\n{recent_history}\n\n"
+        "Return exactly one JSON object with label, summary, route_to."
+    )
+def strip_action_prefixes(response_text: str) -> str:
+    """Remove common formatting wrappers before parsing model output."""
+    cleaned = response_text.strip()
+    cleaned = re.sub(r"^```(?:json)?", "", cleaned, flags=re.IGNORECASE).strip()
+    cleaned = re.sub(r"```$", "", cleaned).strip()
+    cleaned = re.sub(r"^(next\s+action|action)\s*:\s*", "", cleaned, flags=re.IGNORECASE)
+    return cleaned.strip()
+def parse_text_action(cleaned_text: str) -> dict[str, str]:
+    """Parse action from free-form text with deterministic regex fallback."""
+    result: dict[str, str] = {}
+    label_match = re.search(
+        r"(?:\"label\"|label)\s*[:=]\s*\"?(urgent|normal|spam|archive)\"?",
+        cleaned_text,
+        flags=re.IGNORECASE,
+    )
+    if label_match:
+        result["label"] = label_match.group(1).lower()
+    route_match = re.search(
+        r"(?:\"route_to\"|route_to|route)\s*[:=]\s*\"?([a-zA-Z0-9_\-/ ]+)\"?",
+        cleaned_text,
+        flags=re.IGNORECASE,
+    )
+    if route_match:
+        result["route_to"] = route_match.group(1).strip().lower()
+    summary_match = re.search(
+        r"(?:\"summary\"|summary)\s*[:=]\s*\"?([^\"\n]+)\"?",
+        cleaned_text,
+        flags=re.IGNORECASE,
+    )
+    if summary_match:
+        result["summary"] = summary_match.group(1).strip()
+    return result
+def parse_action_response(response_text: str) -> TriageAction:
+    """Parse model response into a valid TriageAction with fallback behavior."""
+    cleaned_text = strip_action_prefixes(response_text)
+    parsed_payload: dict[str, Any] = {}
+    json_start = cleaned_text.find("{")
+    json_end = cleaned_text.rfind("}")
+    if json_start != -1 and json_end != -1 and json_end > json_start:
+        candidate = cleaned_text[json_start : json_end + 1]
+        try:
+            loaded = json.loads(candidate)
+            if isinstance(loaded, dict):
+                parsed_payload = loaded
+        except json.JSONDecodeError:
+            parsed_payload = {}
+    if not parsed_payload:
+        parsed_payload = parse_text_action(cleaned_text)
+    fallback_copy = dict(FALLBACK_ACTION)
+    fallback_copy.update(parsed_payload)
+    try:
+        return TriageAction.model_validate(fallback_copy)
+    except Exception:
+        return TriageAction.model_validate(FALLBACK_ACTION)
+def action_to_log_string(action: TriageAction) -> str:
+    """Return single-line action string for required STEP logging."""
+    return json.dumps(action.model_dump(), separators=(",", ":"), ensure_ascii=True)
+def run_episode(
+    client: OpenAI,
+    model_name: str,
+    task_id: str,
+    scenario_index: int,
+    eval_split: str,
+    deadline: float,
+    request_timeout_seconds: float,
+    runtime_options: dict[str, Any] | None = None,
+) -> None:
+    """Run one episode and emit strict START/STEP/END lines."""
+    rewards: list[float] = []
+    steps_taken = 0
+    success = False
+    env: EmailTriageEnv | None = None
+    log_start(task_name=task_id, benchmark_name=BENCHMARK, model_name=model_name)
+    try:
+        env = EmailTriageEnv(
+            task_id=task_id,
+            scenario_index=scenario_index,
+            split=eval_split,
+            runtime_options=runtime_options,
+        )
+        reset_result = env.reset()
+        observation = reset_result.observation
+        history: list[str] = []
+        for step in range(1, MAX_STEPS + 1):
+            if time.monotonic() >= deadline:
+                break
+            prompt = build_user_prompt(observation, history)
+            response_text = ""
+            try:
+                remaining = max(1.0, deadline - time.monotonic())
+                timeout_seconds = max(
+                    1.0,
+                    min(float(request_timeout_seconds), float(remaining)),
+                )
+                completion = client.chat.completions.create(
+                    model=model_name,
+                    messages=[
+                        {"role": "system", "content": SYSTEM_PROMPT},
+                        {"role": "user", "content": prompt},
+                    ],
+                    temperature=TEMPERATURE,
+                    max_tokens=MAX_TOKENS,
+                    stream=False,
+                    timeout=timeout_seconds,
+                )
+                response_text = completion.choices[0].message.content or ""
+            except Exception:
+                # Any request failure yields an empty response, which parses to FALLBACK_ACTION.
+                response_text = ""
+            action = parse_action_response(response_text)
+            step_result = env.step(action)
+            reward = float(step_result.reward)
+            done = bool(step_result.done)
+            error_raw = step_result.info.get("validation_error")
+            error = str(error_raw) if isinstance(error_raw, str) else None
+            rewards.append(reward)
+            steps_taken = step
+            log_step(
+                step=step,
+                action_str=action_to_log_string(action),
+                reward=reward,
+                done=done,
+                error=error,
+            )
+            history.append(
+                f"step={step} action={action.label}/{action.route_to} reward={reward:.2f}"
+            )
+            observation = step_result.observation
+            if done:
+                break
+        avg_reward = sum(rewards) / max(len(rewards), 1)
+        success = avg_reward >= SUCCESS_SCORE_THRESHOLD
+    except Exception:
+        success = False
+    finally:
+        if env is not None:
+            close_method = getattr(env, "close", None)
+            if callable(close_method):
+                try:
+                    close_method()
+                except Exception:
+                    pass
+        log_end(success=success, steps=steps_taken, rewards=rewards)
+def main() -> None:
+    """Entrypoint for running one or many tasks with strict stdout logs."""
+    args = parse_args()
+    deadline = time.monotonic() + max(args.runtime_budget_seconds, 1)
+    request_timeout_seconds = max(float(args.request_timeout_seconds), 1.0)
+    try:
+        effective_model = validate_runtime_config(args.model)
+    except ValueError as error:
+        print(str(error), flush=True)
+        raise SystemExit(1) from error
+    # LOCAL_IMAGE_NAME is read from the environment but not otherwise used in this script.
+    _ = LOCAL_IMAGE_NAME
+    client = OpenAI(
+        base_url=API_BASE_URL,
+        api_key=API_KEY,
+    )
+    task_ids = [TASK_MAP[args.task]] if args.task in TASK_MAP else list(TASK_MAP.values())
+    for task_id in task_ids:
+        runtime_options = None
+        if task_id == "task_production":
+            runtime_options = {
+                "production_profile": args.production_profile,
+                "business_hours_mode": args.business_hours_mode,
+                "escalation_mode": args.escalation_mode,
+            }
+        for scenario_index in range(max(args.episodes_per_task, 1)):
+            run_episode(
+                client=client,
+                model_name=effective_model,
+                task_id=task_id,
+                scenario_index=scenario_index,
+                eval_split=args.split,
+                deadline=deadline,
+                request_timeout_seconds=request_timeout_seconds,
+                runtime_options=runtime_options,
+            )
+if __name__ == "__main__":
+    main()
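The `[STEP]` lines emitted by `log_step` can be read back for offline analysis. The sketch below is one possible parser, assuming the exact field order used in `log_step` above. `STEP_PATTERN` and `parse_step_line` are hypothetical names and not part of the repository.

```python
import json
import re

# Field order follows log_step: step, action (compact JSON), reward, done, error.
STEP_PATTERN = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>.*) "
    r"reward=(?P<reward>-?\d+\.\d{2}) done=(?P<done>true|false) "
    r"error=(?P<error>.*)$"
)


def parse_step_line(line: str) -> dict:
    """Parse one [STEP] line into typed fields; raise ValueError on mismatch."""
    match = STEP_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"Not a STEP line: {line!r}")
    error = match.group("error")
    return {
        "step": int(match.group("step")),
        # The action is the JSON produced by action_to_log_string.
        "action": json.loads(match.group("action")),
        "reward": float(match.group("reward")),
        "done": match.group("done") == "true",
        "error": None if error == "null" else error,
    }


sample = (
    '[STEP] step=3 action={"label":"urgent","summary":"Prod outage, page on-call",'
    '"route_to":"engineering"} reward=0.85 done=false error=null'
)
parsed = parse_step_line(sample)
print(parsed["step"], parsed["action"]["label"], parsed["reward"], parsed["done"])
```

The greedy `action` group is deliberate: summaries may contain spaces, so the pattern anchors on the trailing `reward=... done=... error=...` fields rather than splitting on whitespace.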

models.py ADDED Viewed

	@@ -0,0 +1,62 @@

+"""Data models for the OpenEnv email triage environment."""
+from typing import Literal
+from pydantic import BaseModel
+class EmailObservation(BaseModel):
+    """Represents the email context visible to the agent at each step."""
+    email_id: str
+    subject: str
+    body: str
+    sender: str
+    timestamp: str
+    thread_history: list[str]
+    task_id: str
+    step_number: int
+    total_emails: int
+class TriageAction(BaseModel):
+    """Represents the action chosen by the agent for an email."""
+    label: Literal["urgent", "normal", "spam", "archive"]
+    summary: str
+    route_to: str
+class RewardResult(BaseModel):
+    """Represents deterministic grading output before reward shaping."""
+    score: float
+    breakdown: dict[str, float]
+    feedback: str
+class EnvironmentState(BaseModel):
+    """Represents full internal environment state for debugging and evaluation."""
+    task_id: str
+    current_step: int
+    total_steps: int
+    done: bool
+    action_history: list[TriageAction]
+    reward_history: list[float]
+class StepResult(BaseModel):
+    """Represents the standardized output of environment step calls."""
+    observation: EmailObservation
+    reward: float
+    done: bool
+    info: dict[str, str | int | float | bool]
+class ResetResult(BaseModel):
+    """Represents the standardized output of environment reset calls."""
+    observation: EmailObservation
+    info: dict[str, str | int | float | bool]

openenv.yaml ADDED Viewed

	@@ -0,0 +1,27 @@

+name: email-triage-env
+version: 0.1.0
+tasks:
+  - task_easy
+  - task_medium
+  - task_hard
+  - task_production
+observation_space:
+  email_id: str
+  subject: str
+  body: str
+  sender: str
+  timestamp: str
+  thread_history: list[str]
+  task_id: str
+  step_number: int
+  total_emails: int
+action_space:
+  label: literal[urgent, normal, spam, archive]
+  summary: str
+  route_to: str
+entrypoint: environment:EmailTriageEnv
+tags:
+  - openenv
+  - real-world
+  - nlp
+  - email-triage

pyproject.toml ADDED Viewed

	@@ -0,0 +1,20 @@

+[build-system]
+requires = ["setuptools>=68"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "email-triage-env"
+version = "0.1.0"
+description = "OpenEnv email triage environment"
+readme = "README.md"
+requires-python = ">=3.11"
+dependencies = [
+  "pydantic==2.11.3",
+  "flask==3.1.0",
+  "openai==1.69.0",
+  "gunicorn==22.0.0",
+  "openenv-core>=0.2.0",
+]
+[project.scripts]
+server = "server:main"

requirements.txt ADDED Viewed

	@@ -0,0 +1,4 @@

+pydantic==2.11.3
+flask==3.1.0
+openai==1.69.0
+gunicorn==22.0.0

server.py ADDED Viewed

	@@ -0,0 +1,621 @@

+"""Flask server wrapper for the OpenEnv email triage environment."""
+import os
+from flask import Flask, Response, jsonify, request
+from environment import EmailTriageEnv
+from tasks import get_task_scenario_count, list_task_ids
+FRONTEND_HTML = """<!doctype html>
+<html lang="en">
+<head>
+    <meta charset="utf-8" />
+    <meta name="viewport" content="width=device-width, initial-scale=1" />
+    <title>Inbox Helper Practice</title>
+    <style>
+        @import url('https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;600;700&family=IBM+Plex+Mono:wght@400;500&display=swap');
+        :root {
+            --bg: #f5f1e9;
+            --paper: #fffaf2;
+            --ink: #102433;
+            --accent: #ea6a2a;
+            --accent-soft: #ffd6bf;
+            --line: #d7cabb;
+            --ok: #0f7b6c;
+            --warn: #9a3a12;
+            --radius: 14px;
+        }
+        * { box-sizing: border-box; }
+        body {
+            margin: 0;
+            font-family: 'Space Grotesk', sans-serif;
+            color: var(--ink);
+            background:
+                radial-gradient(1100px 460px at -10% -20%, #f2bc9f 0%, transparent 60%),
+                radial-gradient(1100px 520px at 120% 115%, #b8d7cf 0%, transparent 62%),
+                var(--bg);
+            min-height: 100vh;
+        }
+        .wrap {
+            max-width: 1100px;
+            margin: 28px auto;
+            padding: 0 16px;
+            animation: reveal .45s ease-out;
+        }
+        @keyframes reveal {
+            from { opacity: 0; transform: translateY(10px); }
+            to { opacity: 1; transform: translateY(0); }
+        }
+        .title {
+            display: flex;
+            justify-content: space-between;
+            align-items: baseline;
+            gap: 14px;
+            margin-bottom: 14px;
+        }
+        h1 {
+            margin: 0;
+            font-size: clamp(1.5rem, 2vw, 2.2rem);
+            letter-spacing: .4px;
+        }
+        .subtitle {
+            margin: 6px 0 0;
+            font-size: .95rem;
+            opacity: .8;
+        }
+        .badge {
+            background: var(--accent-soft);
+            border: 1px solid #f2b693;
+            color: #7f2e0b;
+            padding: 6px 10px;
+            border-radius: 999px;
+            font-size: .85rem;
+            font-weight: 600;
+        }
+        .grid {
+            display: grid;
+            grid-template-columns: 1fr;
+            gap: 14px;
+        }
+        @media (min-width: 900px) {
+            .grid { grid-template-columns: 1fr 1fr; }
+            .wide { grid-column: span 2; }
+        }
+        .card {
+            background: var(--paper);
+            border: 1px solid var(--line);
+            border-radius: var(--radius);
+            padding: 14px;
+            box-shadow: 0 8px 28px rgba(16, 36, 51, 0.08);
+        }
+        .card h2 {
+            margin: 0 0 10px;
+            font-size: 1rem;
+            text-transform: uppercase;
+            letter-spacing: .08em;
+            opacity: .86;
+        }
+        .row {
+            display: flex;
+            flex-wrap: wrap;
+            gap: 8px;
+            align-items: center;
+            margin-bottom: 10px;
+        }
+        select, input, textarea, button {
+            font-family: inherit;
+            font-size: .95rem;
+        }
+        select, input, textarea {
+            width: 100%;
+            border: 1px solid #cdbba6;
+            border-radius: 10px;
+            padding: 9px 10px;
+            background: #fff;
+            color: var(--ink);
+        }
+        textarea {
+            min-height: 92px;
+            resize: vertical;
+        }
+        button {
+            border: 0;
+            border-radius: 10px;
+            padding: 9px 12px;
+            font-weight: 700;
+            background: var(--ink);
+            color: #fff;
+            cursor: pointer;
+            transition: transform .12s ease, opacity .12s ease;
+        }
+        button.secondary {
+            background: #285066;
+        }
+        button.accent {
+            background: var(--accent);
+        }
+        button:hover { transform: translateY(-1px); }
+        button:active { transform: translateY(0); opacity: .92; }
+        .status {
+            padding: 8px 10px;
+            border-radius: 10px;
+            background: #eef7f5;
+            border: 1px solid #c7e4de;
+            color: var(--ok);
+            font-weight: 600;
+            min-height: 40px;
+            display: flex;
+            align-items: center;
+        }
+        .status.error {
+            background: #fff1ea;
+            border-color: #ffc8ae;
+            color: var(--warn);
+        }
+        pre {
+            margin: 0;
+            white-space: pre-wrap;
+            background: #0f1b24;
+            color: #d9efe9;
+            border-radius: 10px;
+            padding: 12px;
+            max-height: 340px;
+            overflow: auto;
+            font-family: 'IBM Plex Mono', monospace;
+            font-size: .85rem;
+            border: 1px solid #21313f;
+        }
+        .email-block {
+            background: #fff;
+            border: 1px solid #d9ccbc;
+            border-radius: 10px;
+            padding: 12px;
+        }
+        .email-row {
+            margin-bottom: 8px;
+            font-size: .95rem;
+            line-height: 1.35;
+        }
+        .email-row strong {
+            display: inline-block;
+            min-width: 66px;
+        }
+        .help {
+            margin: 0 0 10px;
+            font-size: .9rem;
+            opacity: .8;
+        }
+    </style>
+</head>
+<body>
+    <div class="wrap">
+        <div class="title">
+            <div>
+                <h1>Inbox Helper Practice</h1>
+                <p class="subtitle">Practice deciding priority, category, and who should handle each email.</p>
+            </div>
+            <span class="badge" id="badge">connecting...</span>
+        </div>
+        <div class="grid">
+            <section class="card">
+                <h2>Start a Scenario</h2>
+                <p class="help">Pick a difficulty, then click Start.</p>
+                <div class="row">
+                    <select id="taskId">
+                        <option value="task_easy">Easy: one clear email</option>
+                        <option value="task_medium">Medium: mixed inbox</option>
+                        <option value="task_hard">Hard: high-risk complaint</option>
+                        <option value="task_production">Production: full inbox simulator</option>
+                    </select>
+                </div>
+                <div id="productionControls" style="display:none;">
+                    <div class="row">
+                        <select id="productionProfile">
+                            <option value="light">Workload: Light</option>
+                            <option value="standard" selected>Workload: Standard</option>
+                            <option value="heavy">Workload: Heavy</option>
+                        </select>
+                    </div>
+                    <div class="row">
+                        <select id="businessHoursMode">
+                            <option value="false" selected>Time Profile: 24x7 inbox</option>
+                            <option value="true">Time Profile: business hours focus</option>
+                        </select>
+                    </div>
+                    <div class="row">
+                        <select id="escalationMode">
+                            <option value="low">Escalation: Low</option>
+                            <option value="normal" selected>Escalation: Normal</option>
+                            <option value="high">Escalation: High</option>
+                        </select>
+                    </div>
+                </div>
+                <div class="row">
+                    <button class="accent" id="btnReset">Start</button>
+                    <button class="secondary" id="btnState">Check Progress</button>
+                </div>
+                <div class="status" id="status">Ready. Start a scenario.</div>
+            </section>
+            <section class="card">
+                <h2>Your Decision</h2>
+                <p class="help">Choose priority, who should handle it, and a short reason.</p>
+                <div class="row">
+                    <select id="label">
+                        <option value="urgent">Urgent</option>
+                        <option value="normal" selected>Normal</option>
+                        <option value="spam">Spam</option>
+                        <option value="archive">Archive</option>
+                    </select>
+                </div>
+                <div class="row">
+                    <input id="routeTo" placeholder="Who should handle this? (billing, safety, engineering, support, general)" value="general" />
+                </div>
+                <div class="row">
+                    <textarea id="summary" placeholder="Write one clear sentence with key clues from the email.">Needs review.</textarea>
+                </div>
+                <div class="row">
+                    <button id="btnStep">Send Decision</button>
+                </div>
+            </section>
+            <section class="card wide">
+                <h2>Current Email</h2>
+                <div class="email-block">
+                    <div class="email-row"><strong>Subject:</strong> <span id="mailSubject">No email loaded yet.</span></div>
+                    <div class="email-row"><strong>From:</strong> <span id="mailSender">-</span></div>
+                    <div class="email-row"><strong>Message:</strong> <span id="mailBody">Start a scenario to load an email.</span></div>
+                </div>
+            </section>
+            <section class="card wide">
+                <h2>Details (Advanced)</h2>
+                <pre id="output">Waiting for your first action...</pre>
+            </section>
+        </div>
+    </div>
+    <script>
+        const statusEl = document.getElementById('status');
+        const badgeEl = document.getElementById('badge');
+        const outEl = document.getElementById('output');
+        const mailSubjectEl = document.getElementById('mailSubject');
+        const mailSenderEl = document.getElementById('mailSender');
+        const mailBodyEl = document.getElementById('mailBody');
+        const taskIdEl = document.getElementById('taskId');
+        const productionControlsEl = document.getElementById('productionControls');
+        function setStatus(msg, isError = false) {
+            statusEl.textContent = msg;
+            statusEl.classList.toggle('error', isError);
+        }
+        function writeOutput(value) {
+            outEl.textContent = typeof value === 'string' ? value : JSON.stringify(value, null, 2);
+        }
+        function updateEmailPanel(data) {
+            if (!data || !data.observation) {
+                return;
+            }
+            const obs = data.observation;
+            mailSubjectEl.textContent = obs.subject || 'No subject';
+            mailSenderEl.textContent = obs.sender || '-';
+            mailBodyEl.textContent = obs.body || '';
+        }
+        function updateProductionControlsVisibility() {
+            const isProduction = taskIdEl.value === 'task_production';
+            productionControlsEl.style.display = isProduction ? 'block' : 'none';
+        }
+        async function postJson(path, payload) {
+            const response = await fetch(path, {
+                method: 'POST',
+                headers: { 'Content-Type': 'application/json' },
+                body: JSON.stringify(payload || {}),
+            });
+            const text = await response.text();
+            let data = text;
+            try { data = JSON.parse(text); } catch (e) { /* non-JSON body: keep the raw text */ }
+            if (!response.ok) {
+                throw new Error('HTTP ' + response.status + ' - ' + text);
+            }
+            return data;
+        }
+        async function warmup() {
+            try {
+                const res = await fetch('/meta');
+                const data = await res.json();
+                badgeEl.textContent = data.status === 'ok' ? 'ready' : 'check service';
+            } catch (e) {
+                badgeEl.textContent = 'offline';
+            }
+        }
+        document.getElementById('btnReset').addEventListener('click', async () => {
+            const taskId = taskIdEl.value;
+            setStatus('Starting a new scenario...');
+            try {
+                const payload = { task_id: taskId };
+                if (taskId === 'task_production') {
+                    payload.production_profile = document.getElementById('productionProfile').value;
+                    payload.business_hours_mode = document.getElementById('businessHoursMode').value === 'true';
+                    payload.escalation_mode = document.getElementById('escalationMode').value;
+                }
+                const data = await postJson('/reset', payload);
+                setStatus('Scenario started. Read the email below.');
+                updateEmailPanel(data);
+                writeOutput(data);
+            } catch (e) {
+                setStatus('Could not start scenario. See details below.', true);
+                writeOutput(String(e));
+            }
+        });
+        document.getElementById('btnState').addEventListener('click', async () => {
+            setStatus('Checking progress...');
+            try {
+                const data = await postJson('/state', {});
+                setStatus('Progress updated.');
+                writeOutput(data);
+            } catch (e) {
+                setStatus('Could not fetch progress. See details below.', true);
+                writeOutput(String(e));
+            }
+        });
+        document.getElementById('btnStep').addEventListener('click', async () => {
+            const payload = {
+                label: document.getElementById('label').value,
+                summary: document.getElementById('summary').value,
+                route_to: document.getElementById('routeTo').value,
+            };
+            setStatus('Sending your decision...');
+            try {
+                const data = await postJson('/step', payload);
+                setStatus('Decision saved.');
+                updateEmailPanel(data);
+                writeOutput(data);
+            } catch (e) {
+                setStatus('Could not submit decision. See details below.', true);
+                writeOutput(String(e));
+            }
+        });
+        taskIdEl.addEventListener('change', updateProductionControlsVisibility);
+        updateProductionControlsVisibility();
+        warmup();
+    </script>
+</body>
+</html>
+"""
+app = Flask(__name__)
+current_env = EmailTriageEnv(task_id="task_easy")
+SCENARIO_COUNTERS = {task_id: 0 for task_id in list_task_ids()}
+DEFAULT_EVAL_SPLIT = os.getenv("OPENENV_EVAL_SPLIT", "public")
+ALLOW_CLIENT_EVAL_OVERRIDE = (
+    os.getenv("OPENENV_ALLOW_CLIENT_EVAL_OVERRIDE", "false").strip().lower() == "true"
+)
+@app.get("/")
+def root_page():
+    """Render a lightweight frontend for interacting with the environment."""
+    return Response(FRONTEND_HTML, mimetype="text/html")
+@app.get("/meta")
+def root_endpoint():
+    """Return service metadata for health checks and machine clients."""
+    return jsonify(
+        {
+            "name": "email-triage-env",
+            "status": "ok",
+            "endpoints": {
+                "reset": {"method": "POST", "path": "/reset"},
+                "step": {"method": "POST", "path": "/step"},
+                "state": {"method": "POST", "path": "/state"},
+            },
+            "scenario_pools": {
+                "public": {
+                    task_id: get_task_scenario_count(task_id, "public")
+                    for task_id in list_task_ids()
+                },
+            },
+            "eval_split": DEFAULT_EVAL_SPLIT,
+            "production_runtime_controls": {
+                "production_profile": ["light", "standard", "heavy"],
+                "business_hours_mode": [True, False],
+                "escalation_mode": ["low", "normal", "high"],
+            },
+        }
+    )
+@app.post("/reset")
+def reset_endpoint():
+    """Reset the environment with a selected task and return ResetResult JSON.
+    Returns:
+        Flask response containing reset payload.
+    """
+    global current_env
+    global SCENARIO_COUNTERS
+    payload = request.get_json(silent=True)
+    if payload is None:
+        payload = {}
+    elif not isinstance(payload, dict):
+        return jsonify({"error": "Malformed JSON payload."}), 400
+    task_id = payload.get("task_id", "task_easy")
+    if not isinstance(task_id, str):
+        return jsonify({"error": "Field 'task_id' must be a string."}), 400
+    runtime_options: dict[str, object] = {}
+    if task_id == "task_production":
+        production_profile = payload.get("production_profile", "standard")
+        if not isinstance(production_profile, str) or production_profile not in {
+            "light",
+            "standard",
+            "heavy",
+        }:
+            return (
+                jsonify(
+                    {
+                        "error": (
+                            "Field 'production_profile' must be one of "
+                            "light/standard/heavy."
+                        )
+                    }
+                ),
+                400,
+            )
+        escalation_mode = payload.get("escalation_mode", "normal")
+        if not isinstance(escalation_mode, str) or escalation_mode not in {
+            "low",
+            "normal",
+            "high",
+        }:
+            return (
+                jsonify(
+                    {
+                        "error": (
+                            "Field 'escalation_mode' must be one of "
+                            "low/normal/high."
+                        )
+                    }
+                ),
+                400,
+            )
+        business_hours_mode = payload.get("business_hours_mode", False)
+        # Accept boolean-like strings; any other string is treated as False.
+        if isinstance(business_hours_mode, str):
+            business_hours_mode = business_hours_mode.strip().lower() in {
+                "1",
+                "true",
+                "yes",
+                "on",
+            }
+        elif not isinstance(business_hours_mode, bool):
+            return jsonify({"error": "Field 'business_hours_mode' must be boolean."}), 400
+        runtime_options = {
+            "production_profile": production_profile,
+            "business_hours_mode": business_hours_mode,
+            "escalation_mode": escalation_mode,
+        }
+    if not ALLOW_CLIENT_EVAL_OVERRIDE and (
+        "eval_split" in payload or "scenario_index" in payload
+    ):
+        return jsonify(
+            {
+                "error": (
+                    "Client overrides for eval_split/scenario_index are disabled "
+                    "by server policy."
+                )
+            }
+        ), 400
+    eval_split = DEFAULT_EVAL_SPLIT
+    if ALLOW_CLIENT_EVAL_OVERRIDE:
+        requested_split = payload.get("eval_split", DEFAULT_EVAL_SPLIT)
+        if not isinstance(requested_split, str):
+            return jsonify({"error": "Field 'eval_split' must be a string."}), 400
+        eval_split = requested_split
+    requested_index = payload.get("scenario_index") if ALLOW_CLIENT_EVAL_OVERRIDE else None
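+    # bool is a subclass of int, so reject True/False explicitly.
+    if requested_index is not None and (
+        isinstance(requested_index, bool)
+        or not isinstance(requested_index, int)
+        or requested_index < 0
+    ):
+        return jsonify({"error": "Field 'scenario_index' must be a non-negative integer."}), 400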
+    try:
+        scenario_count = get_task_scenario_count(task_id, eval_split)
+        if requested_index is None:
+            # No explicit index: serve this task's scenarios round-robin.
+            scenario_index = SCENARIO_COUNTERS.get(task_id, 0)
+            if scenario_count > 0:
+                SCENARIO_COUNTERS[task_id] = (scenario_index + 1) % scenario_count
+        else:
+            scenario_index = requested_index
+        current_env = EmailTriageEnv(
+            task_id=task_id,
+            scenario_index=scenario_index,
+            split=eval_split,
+            runtime_options=runtime_options,
+        )
+        reset_result = current_env.reset()
+    except KeyError as error:
+        return jsonify({"error": str(error)}), 400
+    return jsonify(reset_result.model_dump())
+@app.post("/step")
+def step_endpoint():
+    """Advance environment by one action and return StepResult JSON.
+    Returns:
+        Flask response containing step payload.
+    """
+    payload = request.get_json(silent=True)
+    if not isinstance(payload, dict):
+        return jsonify({"error": "Malformed JSON payload."}), 400
+    step_result = current_env.step(payload)
+    return jsonify(step_result.model_dump())
+@app.post("/state")
+def state_endpoint():
+    """Return read-only EnvironmentState JSON snapshot.
+    Returns:
+        Flask response containing state payload.
+    """
+    state_result = current_env.state()
+    return jsonify(state_result.model_dump())
+def main() -> None:
+    """Run the Flask app for local and script-based launches."""
+    app.run(host="0.0.0.0", port=7860)
+if __name__ == "__main__":
+    main()
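
A minimal client sketch for the server.py endpoints above, assuming the app is running locally on port 7860 as in main(). BASE_URL and the helper names (build_reset_payload, build_post, post_json, demo) are illustrative and not part of the repository. The payload fields mirror what reset_endpoint reads and what the frontend's Send Decision handler sends.

```python
import json
from urllib import request as urlrequest

# Assumption: server.py was started locally via main() on its default port.
BASE_URL = "http://localhost:7860"


def build_reset_payload(
    task_id="task_easy",
    production_profile="standard",
    business_hours_mode=False,
    escalation_mode="normal",
):
    """Build a /reset body; runtime controls are only sent for task_production."""
    payload = {"task_id": task_id}
    if task_id == "task_production":
        payload.update(
            production_profile=production_profile,
            business_hours_mode=business_hours_mode,
            escalation_mode=escalation_mode,
        )
    return payload


def build_post(path, payload):
    """Prepare (but do not send) a JSON POST request for the given path."""
    return urlrequest.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path, payload):
    """Send the request and decode the JSON response body."""
    with urlrequest.urlopen(build_post(path, payload)) as response:
        return json.loads(response.read().decode("utf-8"))


def demo():
    """Hypothetical one-step episode; requires a running server."""
    print(post_json("/reset", build_reset_payload("task_easy")))
    decision = {
        "label": "normal",
        "route_to": "billing",
        "summary": "Invoice ready; confirm the purchase order number.",
    }
    print(post_json("/step", decision))
```

Calling demo() against a running instance should print the reset and step responses. Note that eval_split and scenario_index are deliberately left out of the payload, since the server rejects them unless OPENENV_ALLOW_CLIENT_EVAL_OVERRIDE is set to "true".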

tasks.py ADDED

	@@ -0,0 +1,748 @@

+"""Task definitions and scenario pools for the OpenEnv email triage environment."""
+import json
+import os
+import random
+from datetime import datetime, timedelta, timezone
+from typing import cast
+TASK_LIBRARY: dict[str, dict[str, object]] = {
+    "task_easy": {
+        "description": "Classify and route one unambiguous operational email.",
+        "scenario_pool": [
+            {
+                "scenario_id": "easy_invoice_confirmation",
+                "emails": [
+                    {
+                        "email_id": "easy-001",
+                        "subject": "Quarterly invoice available",
+                        "body": (
+                            "Hello Team, your Q1 invoice is now ready in the billing portal. "
+                            "Please confirm the purchase order number by Friday."
+                        ),
+                        "sender": "accounts@vendor-example.com",
+                        "timestamp": "2026-03-25T09:15:00Z",
+                        "thread_history": [
+                            "Last month: requested invoice schedule for Q1 and Q2."
+                        ],
+                    }
+                ],
+                "ground_truth": [
+                    {
+                        "label": "normal",
+                        "route_to": "billing",
+                        "priority_weight": 1.0,
+                        "summary_keywords": ["invoice", "purchase order", "billing portal"],
+                    }
+                ],
+            },
+            {
+                "scenario_id": "easy_password_lockout",
+                "emails": [
+                    {
+                        "email_id": "easy-002",
+                        "subject": "Locked out of admin dashboard",
+                        "body": (
+                            "I cannot access the admin dashboard after MFA reset, and our "
+                            "client demo starts in 20 minutes. Please help immediately."
+                        ),
+                        "sender": "sales-lead@acme-enterprise.com",
+                        "timestamp": "2026-03-28T11:40:00Z",
+                        "thread_history": ["MFA reset was completed this morning."],
+                    }
+                ],
+                "ground_truth": [
+                    {
+                        "label": "urgent",
+                        "route_to": "support",
+                        "priority_weight": 1.2,
+                        "summary_keywords": ["locked out", "mfa", "demo"],
+                    }
+                ],
+            },
+            {
+                "scenario_id": "easy_newsletter_archive",
+                "emails": [
+                    {
+                        "email_id": "easy-003",
+                        "subject": "Monthly partner newsletter",
+                        "body": (
+                            "Sharing this month's partner newsletter with product updates. "
+                            "No action needed unless you want to read the highlights."
+                        ),
+                        "sender": "updates@partner-network.io",
+                        "timestamp": "2026-03-30T08:10:00Z",
+                        "thread_history": [],
+                    }
+                ],
+                "ground_truth": [
+                    {
+                        "label": "archive",
+                        "route_to": "general",
+                        "priority_weight": 0.8,
+                        "summary_keywords": ["newsletter", "no action", "updates"],
+                    }
+                ],
+            },
+        ],
+        "private_eval_pool": [],
+    },
+    "task_medium": {
+        "description": "Triage five mixed-priority emails with ambiguous contextual signals.",
+        "scenario_pool": [
+            {
+                "scenario_id": "medium_ops_mix_a",
+                "emails": [
+                    {
+                        "email_id": "med-001",
+                        "subject": "URGENT: Your account will be disabled in 30 minutes",
+                        "body": (
+                            "Click this external short link to keep your mailbox active. "
+                            "If you do not click now, your account will be deleted."
+                        ),
+                        "sender": "it-admin@secure-mail-help.net",
+                        "timestamp": "2026-03-26T07:08:00Z",
+                        "thread_history": [],
+                    },
+                    {
+                        "email_id": "med-002",
+                        "subject": "Can someone review production error spikes?",
+                        "body": (
+                            "We are seeing a 28% spike in checkout failures after the 06:10 UTC "
+                            "deploy. Please triage and assign on-call ownership immediately."
+                        ),
+                        "sender": "ops-manager@acme-enterprise.com",
+                        "timestamp": "2026-03-26T06:21:00Z",
+                        "thread_history": ["Pager alert opened at 06:18 UTC."],
+                    },
+                    {
+                        "email_id": "med-003",
+                        "subject": "RE: promo campaign winner list",
+                        "body": (
+                            "Subject line looks like a campaign thread, but this message confirms "
+                            "a customer reported duplicate card charges. Please review and respond."
+                        ),
+                        "sender": "care-escalations@acme-enterprise.com",
+                        "timestamp": "2026-03-26T11:42:00Z",
+                        "thread_history": [
+                            "Marketing team forwarded customer complaint for billing review."
+                        ],
+                    },
+                    {
+                        "email_id": "med-004",
+                        "subject": "Safety escalation: charger overheating case #4812",
+                        "body": (
+                            "Customer reports visible smoke from charging dock during normal use. "
+                            "No injuries reported, but immediate safety review requested."
+                        ),
+                        "sender": "support-lead@acme-enterprise.com",
+                        "timestamp": "2026-03-26T10:03:00Z",
+                        "thread_history": ["Ticket severity raised from P2 to P1."],
+                    },
+                    {
+                        "email_id": "med-005",
+                        "subject": "FYI: April all-hands agenda",
+                        "body": (
+                            "Sharing the all-hands agenda draft. No action required unless you "
+                            "want to propose additional topics by Monday."
+                        ),
+                        "sender": "people-ops@acme-enterprise.com",
+                        "timestamp": "2026-03-26T14:25:00Z",
+                        "thread_history": [],
+                    },
+                ],
+                "ground_truth": [
+                    {
+                        "label": "spam",
+                        "route_to": "general",
+                        "priority_weight": 1.0,
+                        "summary_keywords": ["external link", "disable", "phishing"],
+                    },
+                    {
+                        "label": "urgent",
+                        "route_to": "engineering",
+                        "priority_weight": 1.5,
+                        "summary_keywords": ["checkout", "failures", "on-call"],
+                    },
+                    {
+                        "label": "normal",
+                        "route_to": "billing",
+                        "priority_weight": 1.2,
+                        "summary_keywords": ["duplicate", "charges", "billing"],
+                    },
+                    {
+                        "label": "urgent",
+                        "route_to": "safety",
+                        "priority_weight": 1.6,
+                        "summary_keywords": ["smoke", "overheating", "safety"],
+                    },
+                    {
+                        "label": "archive",
+                        "route_to": "general",
+                        "priority_weight": 0.8,
+                        "summary_keywords": ["all-hands", "agenda", "no action required"],
+                    },
+                ],
+            },
+            {
+                "scenario_id": "medium_ops_mix_b",
+                "emails": [
+                    {
+                        "email_id": "med-b-001",
+                        "subject": "Action required: verify payroll account immediately",
+                        "body": (
+                            "Your payroll account appears locked. Verify your credentials on this "
+                            "new portal link to avoid delayed salary processing."
+                        ),
+                        "sender": "payroll-security@alerts-payroll.net",
+                        "timestamp": "2026-04-01T07:00:00Z",
+                        "thread_history": [],
+                    },
+                    {
+                        "email_id": "med-b-002",
+                        "subject": "Incident: checkout API timeout in eu-west",
+                        "body": (
+                            "Payments API timeout crossed SLO in eu-west for 11 minutes. "
+                            "Revenue impact probable. On-call escalation required."
+                        ),
+                        "sender": "sre@acme-enterprise.com",
+                        "timestamp": "2026-04-01T06:40:00Z",
+                        "thread_history": ["Auto-remediation attempt failed."],
+                    },
+                    {
+                        "email_id": "med-b-003",
+                        "subject": "Question about duplicate invoice #4421",
+                        "body": (
+                            "Customer says invoice #4421 appears twice in the portal and asks "
+                            "which one should be paid."
+                        ),
+                        "sender": "support@acme-enterprise.com",
+                        "timestamp": "2026-04-01T09:22:00Z",
+                        "thread_history": ["Ticket tagged as finance inquiry."],
+                    },
+                    {
+                        "email_id": "med-b-004",
+                        "subject": "Potential safety issue in warehouse charging bay",
+                        "body": (
+                            "Night shift observed sparks from charging rack slot C after routine use. "
+                            "Operations paused affected rack."
+                        ),
+                        "sender": "warehouse-lead@acme-enterprise.com",
+                        "timestamp": "2026-04-01T04:50:00Z",
+                        "thread_history": ["Photos attached in internal ticket."],
+                    },
+                    {
+                        "email_id": "med-b-005",
+                        "subject": "Reminder: volunteer signup closes Friday",
+                        "body": "Friendly reminder to sign up for community day volunteer slots.",
+                        "sender": "people-ops@acme-enterprise.com",
+                        "timestamp": "2026-04-01T10:12:00Z",
+                        "thread_history": [],
+                    },
+                ],
+                "ground_truth": [
+                    {
+                        "label": "spam",
+                        "route_to": "general",
+                        "priority_weight": 1.0,
+                        "summary_keywords": ["verify", "portal link", "payroll"],
+                    },
+                    {
+                        "label": "urgent",
+                        "route_to": "engineering",
+                        "priority_weight": 1.5,
+                        "summary_keywords": ["timeout", "slo", "on-call"],
+                    },
+                    {
+                        "label": "normal",
+                        "route_to": "billing",
+                        "priority_weight": 1.2,
+                        "summary_keywords": ["duplicate invoice", "paid", "finance"],
+                    },
+                    {
+                        "label": "urgent",
+                        "route_to": "safety",
+                        "priority_weight": 1.6,
+                        "summary_keywords": ["sparks", "charging", "paused"],
+                    },
+                    {
+                        "label": "archive",
+                        "route_to": "general",
+                        "priority_weight": 0.8,
+                        "summary_keywords": ["reminder", "volunteer", "signup"],
+                    },
+                ],
+            },
+        ],
+        "private_eval_pool": [],
+    },
+    "task_hard": {
+        "description": "Handle ambiguous complaints that mix safety, legal, and billing risk.",
+        "scenario_pool": [
+            {
+                "scenario_id": "hard_cross_function_a",
+                "emails": [
+                    {
+                        "email_id": "hard-001",
+                        "subject": "Formal complaint: unsafe device behavior and disputed charges",
+                        "body": (
+                            "I was charged twice for the replacement kit, and during testing the "
+                            "unit became hot enough to scorch the desk surface. I need billing "
+                            "correction and urgent safety follow-up today."
+                        ),
+                        "sender": "legal-customer@enterprise-client.com",
+                        "timestamp": "2026-03-26T08:33:00Z",
+                        "thread_history": [
+                            "Support asked customer to share photos; customer replied with incident details."
+                        ],
+                    },
+                    {
+                        "email_id": "hard-002",
+                        "subject": "Escalation follow-up: compliance and refund timeline",
+                        "body": (
+                            "Following up on the same incident, compliance team requests confirmation "
+                            "of safety escalation and billing refund timeline before we close the case."
+                        ),
+                        "sender": "procurement@enterprise-client.com",
+                        "timestamp": "2026-03-26T09:07:00Z",
+                        "thread_history": ["Legal requested cross-team response within 4 business hours."],
+                    },
+                ],
+                "ground_truth": [
+                    {
+                        "label": "urgent",
+                        "route_to": "safety",
+                        "cc_route": "billing",
+                        "penalize_spam": 0.2,
+                        "summary_keywords": ["unsafe", "overheating", "double charge", "refund"],
+                    },
+                    {
+                        "label": "urgent",
+                        "route_to": "safety",
+                        "cc_route": "billing",
+                        "penalize_spam": 0.2,
+                        "summary_keywords": ["compliance", "safety escalation", "refund timeline"],
+                    },
+                ],
+            },
+            {
+                "scenario_id": "hard_cross_function_b",
+                "emails": [
+                    {
+                        "email_id": "hard-b-001",
+                        "subject": "Executive escalation: battery smoke + invoice mismatch",
+                        "body": (
+                            "A strategic customer reported smoke from a battery dock and also found an "
+                            "invoice mismatch on the emergency replacement shipment."
+                        ),
+                        "sender": "exec-office@acme-enterprise.com",
+                        "timestamp": "2026-04-01T07:40:00Z",
+                        "thread_history": ["Account owner requested immediate cross-functional response."],
+                    },
+                    {
+                        "email_id": "hard-b-002",
+                        "subject": "Legal asks for mitigation proof and refund confirmation",
+                        "body": (
+                            "Legal team needs written mitigation steps for the safety issue and proof "
+                            "that refund correction has been processed."
+                        ),
+                        "sender": "legal@enterprise-client.com",
+                        "timestamp": "2026-04-01T08:05:00Z",
+                        "thread_history": ["Deadline: before board review this afternoon."],
+                    },
+                ],
+                "ground_truth": [
+                    {
+                        "label": "urgent",
+                        "route_to": "safety",
+                        "cc_route": "billing",
+                        "penalize_spam": 0.2,
+                        "summary_keywords": ["smoke", "invoice mismatch", "strategic customer"],
+                    },
+                    {
+                        "label": "urgent",
+                        "route_to": "safety",
+                        "cc_route": "billing",
+                        "penalize_spam": 0.2,
+                        "summary_keywords": ["mitigation", "refund confirmation", "legal"],
+                    },
+                ],
+            },
+        ],
+        "private_eval_pool": [],
+    },
+    "task_production": {
+        "description": (
+            "Simulate a production inbox with mixed business-critical, safety, billing, "
+            "support, and spam threads."
+        ),
+        "scenario_pool": [],
+        "private_eval_pool": [],
+    },
+}
+PRODUCTION_SCENARIO_COUNT = 1000
+PRODUCTION_TEMPLATE_LIBRARY: dict[str, dict[str, object]] = {
+    "phishing_link": {
+        "label": "spam",
+        "route_to": "general",
+        "priority_weight": 1.0,
+        "summary_keywords": ["phishing", "external link", "credential"],
+        "subject_options": [
+            "Immediate verification required for shared inbox",
+            "Mailbox suspension warning - action needed",
+        ],
+        "body_options": [
+            "Security team asks you to verify credentials using this external short link.",
+            "Your mailbox will be disabled unless you confirm account details on new portal.",
+        ],
+        "sender_options": ["security-alert@mail-checkup.net", "admin-support@auth-updates.co"],
+    },
+    "incident_checkout": {
+        "label": "urgent",
+        "route_to": "engineering",
+        "priority_weight": 1.8,
+        "summary_keywords": ["checkout", "incident", "on-call"],
+        "subject_options": [
+            "SEV-1: checkout failures rising",
+            "Revenue incident: payment flow degraded",
+        ],
+        "body_options": [
+            "Checkout success rate dropped after deployment and on-call escalation is pending.",
+            "Payment API latency breach is impacting order completion in production.",
+        ],
+        "sender_options": ["incident-bot@acme-enterprise.com", "sre-lead@acme-enterprise.com"],
+    },
+    "billing_refund": {
+        "label": "normal",
+        "route_to": "billing",
+        "priority_weight": 1.3,
+        "summary_keywords": ["refund", "invoice", "duplicate charge"],
+        "subject_options": [
+            "Customer reports duplicate charge",
+            "Refund timeline confirmation request",
+        ],
+        "body_options": [
+            "Customer sees duplicate charge on invoice and asks for correction timeline.",
+            "Account manager requests status update for pending reimbursement case.",
+        ],
+        "sender_options": ["care-escalations@acme-enterprise.com", "finance-ops@acme-enterprise.com"],
+    },
+    "safety_smoke": {
+        "label": "urgent",
+        "route_to": "safety",
+        "priority_weight": 1.9,
+        "summary_keywords": ["safety", "smoke", "overheating"],
+        "subject_options": [
+            "Safety escalation: smoke seen during charging",
+            "Urgent product safety complaint",
+        ],
+        "body_options": [
+            "Customer reports visible smoke and overheating during normal charging operation.",
+            "Field ops flagged possible thermal event requiring immediate safety review.",
+        ],
+        "sender_options": ["support-lead@acme-enterprise.com", "field-ops@acme-enterprise.com"],
+    },
+    "support_access": {
+        "label": "urgent",
+        "route_to": "support",
+        "priority_weight": 1.4,
+        "summary_keywords": ["locked out", "access", "mfa"],
+        "subject_options": [
+            "Executive locked out after MFA reset",
+            "Cannot access admin dashboard before meeting",
+        ],
+        "body_options": [
+            "User cannot access dashboard after security reset and requests immediate help.",
+            "Critical user lockout blocks live customer session starting shortly.",
+        ],
+        "sender_options": ["sales-lead@acme-enterprise.com", "exec-assistant@acme-enterprise.com"],
+    },
+    "archive_digest": {
+        "label": "archive",
+        "route_to": "general",
+        "priority_weight": 0.7,
+        "summary_keywords": ["newsletter", "digest", "no action"],
+        "subject_options": [
+            "Monthly partner digest",
+            "Internal culture newsletter",
+        ],
+        "body_options": [
+            "Sharing monthly digest for awareness only; no action required.",
+            "Newsletter update with optional reads and no operational request.",
+        ],
+        "sender_options": ["people-ops@acme-enterprise.com", "updates@partner-network.io"],
+    },
+}
+PRODUCTION_EVENT_PLAN: list[str] = [
+    "phishing_link",
+    "incident_checkout",
+    "billing_refund",
+    "safety_smoke",
+    "support_access",
+    "archive_digest",
+    "incident_checkout",
+    "billing_refund",
+    "safety_smoke",
+    "support_access",
+    "archive_digest",
+    "incident_checkout",
+    "billing_refund",
+    "phishing_link",
+    "safety_smoke",
+    "support_access",
+    "incident_checkout",
+    "archive_digest",
+]
+PRODUCTION_PROFILE_EVENT_PLANS: dict[str, list[str]] = {
+    "light": PRODUCTION_EVENT_PLAN[:12],
+    "standard": PRODUCTION_EVENT_PLAN,
+    "heavy": PRODUCTION_EVENT_PLAN
+    + [
+        "incident_checkout",
+        "safety_smoke",
+        "incident_checkout",
+        "billing_refund",
+        "support_access",
+        "safety_smoke",
+        "phishing_link",
+        "incident_checkout",
+        "billing_refund",
+        "archive_digest",
+    ],
+}
+def _normalize_production_profile(profile_value: object) -> str:
+    """Normalize production profile value to one of light/standard/heavy."""
+    profile = str(profile_value or "standard").strip().lower()
+    return profile if profile in PRODUCTION_PROFILE_EVENT_PLANS else "standard"
+def _coerce_bool(value: object, default: bool = False) -> bool:
+    """Coerce bool-ish values from runtime options."""
+    if isinstance(value, bool):
+        return value
+    if isinstance(value, str):
+        normalized = value.strip().lower()
+        if normalized in {"1", "true", "yes", "on"}:
+            return True
+        if normalized in {"0", "false", "no", "off"}:
+            return False
+    return default
+def _load_private_eval_overrides() -> dict[str, list[dict[str, object]]]:
+    """Load private-eval scenarios from OPENENV_PRIVATE_SCENARIOS_JSON.
+    Expected shape:
+    {
+      "task_easy": [ {"scenario_id": "...", "emails": [...], "ground_truth": [...]}, ... ],
+      "task_medium": [...],
+      "task_hard": [...]
+    }
+    Malformed JSON, a non-dict payload, unknown task ids, and non-list pools are ignored.
+    """
+    raw_payload = os.getenv("OPENENV_PRIVATE_SCENARIOS_JSON", "").strip()
+    if not raw_payload:
+        return {}
+    try:
+        parsed_payload = json.loads(raw_payload)
+    except json.JSONDecodeError:
+        return {}
+    if not isinstance(parsed_payload, dict):
+        return {}
+    validated: dict[str, list[dict[str, object]]] = {}
+    for task_id, pool in parsed_payload.items():
+        if not isinstance(task_id, str) or task_id not in TASK_LIBRARY:
+            continue
+        if isinstance(pool, list):
+            validated[task_id] = cast(list[dict[str, object]], pool)
+    return validated
+def _build_production_task_definition(
+    scenario_index: int,
+    split: str,
+    runtime_options: dict[str, object] | None = None,
+) -> dict[str, object]:
+    """Build a deterministic production-style inbox scenario."""
+    options = runtime_options or {}
+    profile = _normalize_production_profile(options.get("production_profile", "standard"))
+    business_hours_mode = _coerce_bool(options.get("business_hours_mode", False), False)
+    seed_base = 910000 if split == "private_eval" else 510000
+    profile_seed = {"light": 101, "standard": 202, "heavy": 303}[profile]
+    rng = random.Random(seed_base + max(0, scenario_index) + profile_seed)
+    base_datetime = datetime(2026, 4, 1, 7, 30, tzinfo=timezone.utc)
+    base_datetime += timedelta(minutes=max(0, scenario_index) % 720)
+    if business_hours_mode:
+        base_datetime = base_datetime.replace(hour=9, minute=0)
+    thread_counts: dict[str, int] = {}
+    emails: list[dict[str, object]] = []
+    ground_truth: list[dict[str, object]] = []
+    event_plan = PRODUCTION_PROFILE_EVENT_PLANS[profile]
+    for idx, template_key in enumerate(event_plan):
+        template = PRODUCTION_TEMPLATE_LIBRARY[template_key]
+        thread_family = template_key.split("_")[0]
+        thread_counts[thread_family] = thread_counts.get(thread_family, 0) + 1
+        thread_number = thread_counts[thread_family]
+        subject = str(rng.choice(cast(list[str], template["subject_options"])))
+        body = str(rng.choice(cast(list[str], template["body_options"])))
+        sender = str(rng.choice(cast(list[str], template["sender_options"])))
+        if thread_number > 1:
+            subject = f"RE[{thread_number}]: {subject}"
+        if business_hours_mode:
+            business_window_minutes = 8 * 60
+            minute_offset = (idx * 17) % business_window_minutes
+            day_offset = (idx * 17) // business_window_minutes
+            event_dt = (base_datetime + timedelta(days=day_offset, minutes=minute_offset)).replace(
+                hour=9 + ((minute_offset // 60) % 8),
+                minute=minute_offset % 60,
+            )
+        else:
+            event_dt = base_datetime + timedelta(minutes=idx * 9)
+        timestamp = event_dt.isoformat().replace("+00:00", "Z")
+        thread_history = []
+        if thread_number > 1:
+            thread_history.append(
+                f"Previous {thread_family} thread update #{thread_number - 1} in operations inbox."
+            )
+        emails.append(
+            {
+                "email_id": f"prod-{scenario_index:04d}-{idx + 1:03d}",
+                "subject": subject,
+                "body": body,
+                "sender": sender,
+                "timestamp": timestamp,
+                "thread_history": thread_history,
+            }
+        )
+        ground_truth.append(
+            {
+                "label": str(template["label"]),
+                "route_to": str(template["route_to"]),
+                "priority_weight": float(template["priority_weight"]),
+                "summary_keywords": cast(list[str], template["summary_keywords"]),
+            }
+        )
+    return {
+        "task_id": "task_production",
+        "split": split,
+        "scenario_id": f"production-{split}-{profile}-{scenario_index:04d}",
+        "description": str(TASK_LIBRARY["task_production"]["description"]),
+        "emails": emails,
+        "ground_truth": ground_truth,
+        "runtime_options": {
+            "production_profile": profile,
+            "business_hours_mode": business_hours_mode,
+        },
+    }
+def _normalize_split(split: str | None) -> str:
+    """Return normalized split name constrained to known values."""
+    return "private_eval" if split == "private_eval" else "public"
+def get_task_definition(
+    task_id: str,
+    scenario_index: int = 0,
+    split: str | None = None,
+    runtime_options: dict[str, object] | None = None,
+) -> dict[str, object]:
+    """Return selected task scenario by task_id, split, and deterministic index.
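+    Args:
+        task_id: Task identifier.
+        scenario_index: Deterministic index into selected scenario pool.
+        split: Scenario split selector; supports public and private_eval.
+        runtime_options: Optional settings (production_profile, business_hours_mode);
+            used only when task_id is task_production.
+    Returns:
+        Concrete task definition dictionary.
+    Raises:
+        KeyError: If task_id is not defined or the selected split has no scenarios.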
+    """
+    if task_id not in TASK_LIBRARY:
+        raise KeyError(f"Unknown task_id: {task_id}")
+    normalized_split = _normalize_split(split)
+    if task_id == "task_production":
+        return _build_production_task_definition(
+            scenario_index,
+            normalized_split,
+            runtime_options,
+        )
+    task_record = TASK_LIBRARY[task_id]
+    if normalized_split == "private_eval":
+        private_overrides = _load_private_eval_overrides()
+        pool = private_overrides.get(task_id, [])
+        if not pool:
+            pool = cast(list[dict[str, object]], task_record.get("private_eval_pool", []))
+        if not pool:
+            raise KeyError(
+                "No private_eval scenarios configured for "
+                f"{task_id}. Set OPENENV_PRIVATE_SCENARIOS_JSON."
+            )
+    else:
+        pool = cast(list[dict[str, object]], task_record.get("scenario_pool", []))
+    if not pool:
+        raise KeyError(f"No public scenarios configured for {task_id}")
+    safe_index = scenario_index % len(pool)
+    scenario = pool[safe_index]
+    return {
+        "task_id": task_id,
+        "split": normalized_split,
+        "scenario_id": str(scenario.get("scenario_id", f"{task_id}-{safe_index}")),
+        "description": str(task_record.get("description", "")),
+        "emails": cast(list[dict[str, object]], scenario.get("emails", [])),
+        "ground_truth": cast(list[dict[str, object]], scenario.get("ground_truth", [])),
+    }
+def get_task_scenario_count(task_id: str, split: str | None = None) -> int:
+    """Return number of scenarios for a task in selected split."""
+    if task_id not in TASK_LIBRARY:
+        raise KeyError(f"Unknown task_id: {task_id}")
+    if task_id == "task_production":
+        return PRODUCTION_SCENARIO_COUNT
+    task_record = TASK_LIBRARY[task_id]
+    normalized_split = _normalize_split(split)
+    if normalized_split == "private_eval":
+        private_overrides = _load_private_eval_overrides()
+        pool = private_overrides.get(task_id, [])
+        if not pool:
+            pool = cast(list[dict[str, object]], task_record.get("private_eval_pool", []))
+    else:
+        pool = cast(list[dict[str, object]], task_record.get("scenario_pool", []))
+    return len(pool)
+def list_task_ids() -> list[str]:
+    """Return all supported task identifiers in deterministic order."""
+    return ["task_easy", "task_medium", "task_hard", "task_production"]
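The determinism in `tasks.py` rests on two small mechanisms: a seeded `random.Random` for production scenarios, and a modulo wrap of `scenario_index` for the static pools. The block below is a minimal sketch of both, not the module's code. The helper names `scenario_rng` and `wrap_index` are hypothetical; the seed constants are copied from `_build_production_task_definition`.

```python
import random

# Seed constants copied from _build_production_task_definition.
SEED_BASES = {"public": 510000, "private_eval": 910000}
PROFILE_SEEDS = {"light": 101, "standard": 202, "heavy": 303}


def scenario_rng(scenario_index: int, split: str, profile: str) -> random.Random:
    """Hypothetical helper: an RNG that depends only on (index, split, profile)."""
    seed_base = SEED_BASES.get(split, SEED_BASES["public"])
    return random.Random(seed_base + max(0, scenario_index) + PROFILE_SEEDS[profile])


def wrap_index(scenario_index: int, pool_size: int) -> int:
    """Hypothetical helper: map any integer index onto a valid pool position."""
    return scenario_index % pool_size


subjects = [
    "SEV-1: checkout failures rising",
    "Revenue incident: payment flow degraded",
]

# A fresh RNG is built for each draw, so every draw repeats the same choice.
repeated = [scenario_rng(7, "public", "standard").choice(subjects) for _ in range(3)]

# The seed terms are summed, so different (index, profile) pairs can share a seed:
# 101 + 101 (light) equals 0 + 202 (standard).
same_seed = (
    scenario_rng(101, "public", "light").random()
    == scenario_rng(0, "public", "standard").random()
)
```

Because the terms are added rather than hashed, some (index, profile) pairs draw from the same random stream. The generated scenarios still differ, since each profile uses a different event plan. In Python a negative index also wraps, so `wrap_index(-1, 2)` selects position 1.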

uv.lock ADDED

	@@ -0,0 +1,8 @@

+version = 1
+revision = 1
+requires-python = ">=3.11"
+[[package]]
+name = "email-triage-env"
+version = "0.1.0"
+source = { editable = "." }

validate-submission.sh ADDED

	@@ -0,0 +1,198 @@

+#!/usr/bin/env bash
+#
+# validate-submission.sh - OpenEnv Submission Validator
+#
+# Checks that your HF Space is live, Docker image builds, and openenv validate passes.
+#
+# Prerequisites:
+#   - Docker:       https://docs.docker.com/get-docker/
+#   - openenv-core: pip install openenv-core
+#   - curl (usually pre-installed)
+#
+# Run:
+#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
+#
+#   Or download and run locally:
+#     chmod +x validate-submission.sh
+#     ./validate-submission.sh <ping_url> [repo_dir]
+#
+# Arguments:
+#   ping_url   Space URL (e.g. https://your-space.hf.space or https://huggingface.co/spaces/<owner>/<space>)
+#   repo_dir   Path to your repo (default: current directory)
+#
+# Examples:
+#   ./validate-submission.sh https://my-team.hf.space
+#   ./validate-submission.sh https://my-team.hf.space ./my-repo
+set -uo pipefail
+DOCKER_BUILD_TIMEOUT=600
+if [ -t 1 ]; then
+  RED='\033[0;31m'
+  GREEN='\033[0;32m'
+  YELLOW='\033[1;33m'
+  BOLD='\033[1m'
+  NC='\033[0m'
+else
+  RED='' GREEN='' YELLOW='' BOLD='' NC=''
+fi
+run_with_timeout() {
+  local secs="$1"; shift
+  if command -v timeout &>/dev/null; then
+    timeout "$secs" "$@"
+  elif command -v gtimeout &>/dev/null; then
+    gtimeout "$secs" "$@"
+  else
+    "$@" &
+    local pid=$!
+    ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
+    local watcher=$!
+    wait "$pid" 2>/dev/null
+    local rc=$?
+    kill "$watcher" 2>/dev/null
+    wait "$watcher" 2>/dev/null
+    return $rc
+  fi
+}
+portable_mktemp() {
+  local prefix="${1:-validate}"
+  mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
+}
+normalize_ping_url() {
+  local input_url="${1%/}"
+  if [[ "$input_url" =~ ^https?://huggingface\.co/spaces/([^/]+)/([^/]+)$ ]]; then
+    local owner
+    local space
+    owner="$(printf '%s' "${BASH_REMATCH[1]}" | tr '[:upper:]' '[:lower:]')"
+    space="$(printf '%s' "${BASH_REMATCH[2]}" | tr '[:upper:]' '[:lower:]')"
+    printf "https://%s-%s.hf.space" "$owner" "$space"
+  else
+    printf "%s" "$input_url"
+  fi
+}
+CLEANUP_FILES=()
+cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
+trap cleanup EXIT
+PING_URL="${1:-}"
+REPO_DIR="${2:-.}"
+if [ -z "$PING_URL" ]; then
+  printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
+  printf "\n"
+  printf "  ping_url   Space URL (e.g. https://your-space.hf.space or https://huggingface.co/spaces/<owner>/<space>)\n"
+  printf "  repo_dir   Path to your repo (default: current directory)\n"
+  exit 1
+fi
+if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
+  printf "Error: directory '%s' not found\n" "${2:-.}"
+  exit 1
+fi
+PING_URL="$(normalize_ping_url "$PING_URL")"
+export PING_URL
+PASS=0
+log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
+pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
+fail() { log "${RED}FAILED${NC} -- $1"; }
+hint() { printf "  ${YELLOW}Hint:${NC} %b\n" "$1"; }
+stop_at() {
+  printf "\n"
+  printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
+  exit 1
+}
+printf "\n"
+printf "${BOLD}========================================${NC}\n"
+printf "${BOLD}  OpenEnv Submission Validator${NC}\n"
+printf "${BOLD}========================================${NC}\n"
+log "Repo:     $REPO_DIR"
+log "Ping URL: $PING_URL"
+printf "\n"
+log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
+CURL_OUTPUT=$(portable_mktemp "validate-curl")
+CLEANUP_FILES+=("$CURL_OUTPUT")
+# curl -w prints 000 itself when no response arrives, so only default an empty result.
+HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
+  -H "Content-Type: application/json" -d '{}' \
+  "$PING_URL/reset" --max-time 30 2>/dev/null)
+HTTP_CODE="${HTTP_CODE:-000}"
+if [ "$HTTP_CODE" = "200" ]; then
+  pass "HF Space is live and responds to /reset"
+elif [ "$HTTP_CODE" = "000" ]; then
+  fail "HF Space not reachable (connection failed or timed out)"
+  hint "Check your network connection and that the Space is running."
+  hint "Try: curl -s -o /dev/null -w '%{http_code}' -X POST $PING_URL/reset"
+  stop_at "Step 1"
+else
+  fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
+  hint "Make sure your Space is running and the URL is correct."
+  hint "Try opening $PING_URL in your browser first."
+  stop_at "Step 1"
+fi
+log "${BOLD}Step 2/3: Running docker build${NC} ..."
+if ! command -v docker &>/dev/null; then
+  fail "docker command not found"
+  hint "Install Docker: https://docs.docker.com/get-docker/"
+  stop_at "Step 2"
+fi
+if [ -f "$REPO_DIR/Dockerfile" ]; then
+  DOCKER_CONTEXT="$REPO_DIR"
+elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
+  DOCKER_CONTEXT="$REPO_DIR/server"
+else
+  fail "No Dockerfile found in repo root or server/ directory"
+  stop_at "Step 2"
+fi
+log "  Found Dockerfile in $DOCKER_CONTEXT"
+BUILD_OK=false
+BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
+if [ "$BUILD_OK" = true ]; then
+  pass "Docker build succeeded"
+else
+  fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
+  printf "%s\n" "$BUILD_OUTPUT" | tail -20
+  stop_at "Step 2"
+fi
+log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
+if ! command -v openenv &>/dev/null; then
+  fail "openenv command not found"
+  hint "Install it: pip install openenv-core"
+  stop_at "Step 3"
+fi
+VALIDATE_OK=false
+VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
+if [ "$VALIDATE_OK" = true ]; then
+  pass "openenv validate passed"
+  [ -n "$VALIDATE_OUTPUT" ] && log "  $VALIDATE_OUTPUT"
+else
+  fail "openenv validate failed"
+  printf "%s\n" "$VALIDATE_OUTPUT"
+  stop_at "Step 3"
+fi
+printf "\n"
+printf "${BOLD}========================================${NC}\n"
+printf "${GREEN}${BOLD}  All 3/3 checks passed!${NC}\n"
+printf "${GREEN}${BOLD}  Your submission is ready to submit.${NC}\n"
+printf "${BOLD}========================================${NC}\n"
+printf "\n"
+exit 0
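The script's `normalize_ping_url` function rewrites a Space page URL of the form `https://huggingface.co/spaces/<owner>/<space>` into the `hf.space` host that serves `/reset`. For readers more comfortable with Python than with bash regex matching, here is a rough equivalent. It is a sketch that assumes the same rule as the shell function (lowercase owner and space, joined with a hyphen) and is not part of the submission.

```python
import re

# Same pattern as the bash regex in normalize_ping_url.
_SPACE_PAGE = re.compile(r"^https?://huggingface\.co/spaces/([^/]+)/([^/]+)$")


def normalize_ping_url(url: str) -> str:
    """Return the hf.space host for a Space page URL; pass other URLs through."""
    # Mirror ${1%/}: drop at most one trailing slash.
    trimmed = url[:-1] if url.endswith("/") else url
    match = _SPACE_PAGE.match(trimmed)
    if match is None:
        return trimmed
    owner, space = (part.lower() for part in match.groups())
    return f"https://{owner}-{space}.hf.space"


page_url = "https://huggingface.co/spaces/Imaginephoenix/openenv/"
direct_url = "https://my-team.hf.space"
normalized_page = normalize_ping_url(page_url)
normalized_direct = normalize_ping_url(direct_url)
```

URLs that are already in `hf.space` form, or that have extra path segments after the space name, are returned unchanged apart from the trailing slash.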