Spaces: Imaginephoenix/openenv1

Imaginephoenix committed on Apr 4

Commit d547d85 (verified) · 1 Parent(s): f6339e7

Upload README.md


Files changed (1)

README.md +506 -6


@@ -1,11 +1,511 @@
 ---
-title: Openenv1
-emoji: 🌖
-colorFrom: red
-colorTo: gray
 sdk: docker
 pinned: false
-license: mit
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: OpenEnv Email Triage Environment
+emoji: 📬
+colorFrom: blue
+colorTo: blue
 sdk: docker
+app_port: 7860
 pinned: false
 ---
+# OpenEnv Email Triage Environment
+A real-world AI agent training environment that simulates professional email triage.
+Built to the OpenEnv specification for standardized agent evaluation and benchmarking.
+- **Status:** In Development
+- **Domain:** Email Triage
+- **Deployment:** Hugging Face Spaces (Docker)
+---
+## Table of Contents
+- [What Is This?](#what-is-this)
+- [Who Is This For?](#who-is-this-for)
+- [Observation Space](#observation-space)
+- [Action Space](#action-space)
+- [Tasks](#tasks)
+- [Reward Function](#reward-function)
+- [Quick Start](#quick-start)
+- [Running Inference](#running-inference)
+- [Inference Architecture](#inference-architecture)
+- [Score Table](#score-table)
+- [Docker Deployment](#docker-deployment)
+- [Hugging Face Space](#hugging-face-space)
+- [Pre-Submission Validation](#pre-submission-validation)
+- [API Reference](#api-reference)
+- [Project Structure](#project-structure)
+- [Known Limitations](#known-limitations)
+- [Summary of Revision 2 Changes](#summary-of-revision-2-changes)
+- [Contributing](#contributing)
+- [License](#license)
+---
+## What Is This?
+This environment simulates a professional email inbox where an AI agent must:
+1. Read incoming emails with realistic metadata (sender, subject, body, thread history).
+2. Classify each email with the correct priority label.
+3. Route each email to the appropriate department or person.
+4. Summarize the email's key information.
+Think of it as OpenAI Gym for office work. Instead of balancing a pole, the agent triages an
+inbox. The environment provides structured observations, accepts structured actions, and
+returns graded rewards with partial credit.
+Every decision is scored by a deterministic programmatic grader: no LLM-as-judge,
+no randomness, fully reproducible.
+---
+## Who Is This For?
+| Audience | Use Case |
+|---|---|
+| AI Safety Researchers | Measure agent behavior on realistic tasks with known ground truth |
+| LLM Agent Developers | Benchmark models and prompting strategies on real-world work |
+| RL Researchers | Train agents with shaped rewards in a professional task environment |
+| Companies | Evaluate LLM agents before deploying them to handle real email |
+---
+## Observation Space
+What the agent sees at each step:
+| Field | Type | Description |
+|---|---|---|
+| `email_id` | `str` | Unique identifier for this email |
+| `subject` | `str` | Email subject line |
+| `body` | `str` | Full email body text |
+| `sender` | `str` | Sender's email address |
+| `timestamp` | `str` | ISO 8601 timestamp of when the email was received |
+| `thread_history` | `list[str]` | Previous messages in the email thread (may be empty) |
+| `task_id` | `str` | Which task is currently active |
+| `step_number` | `int` | Current step in the episode (0-indexed) |
+| `total_emails` | `int` | Total number of emails to process in this task |
+The observation never contains the correct answer. The agent must reason from email content.
+---
+## Action Space
+What the agent must output at each step:
+| Field | Type | Allowed Values | Description |
+|---|---|---|---|
+| `label` | `Literal` | `"urgent"`, `"normal"`, `"spam"`, `"archive"` | Priority classification |
+| `summary` | `str` | Free text | Brief summary of the email's content and intent |
+| `route_to` | `str` | Free text, e.g. `"billing"`, `"safety"`, `"engineering"` | Department or person to route the email to |
+### Example action JSON
+```json
+{
+  "label": "urgent",
+  "summary": "Customer reports a safety issue with product overheating",
+  "route_to": "safety"
+}
+```
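A minimal sketch of validating an action before posting it to `/step`. It uses plain dataclasses so that it runs without extra dependencies; the project itself requires Pydantic models, so treat `TriageAction` and `parse_action` as hypothetical names rather than the repository's API.

```python
import json
from dataclasses import dataclass

# Allowed labels, copied from the Action Space table.
ALLOWED_LABELS = {"urgent", "normal", "spam", "archive"}


@dataclass(frozen=True)
class TriageAction:
    """Hypothetical stand-in for the project's action model."""

    label: str
    summary: str
    route_to: str

    def __post_init__(self) -> None:
        # Reject labels outside the documented set.
        if self.label not in ALLOWED_LABELS:
            raise ValueError(f"label must be one of {sorted(ALLOWED_LABELS)}")
        # route_to is free text, but an empty route is never useful.
        if not self.route_to.strip():
            raise ValueError("route_to must not be empty")


def parse_action(raw: str) -> TriageAction:
    """Decode a JSON string into a validated action."""
    data = json.loads(raw)
    return TriageAction(
        label=data["label"],
        summary=data["summary"],
        route_to=data["route_to"],
    )
```

Validating on the client side keeps malformed actions from consuming a step in the episode.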
+---
+## Tasks
+Each task contains multiple deterministic scenario variants. By default, `/reset`
+cycles through the public scenario pool for the selected task.
+Private evaluation split selection is controlled server-side via environment
+configuration (`OPENENV_EVAL_SPLIT`), and client-side override can be disabled
+to preserve benchmark integrity.
+To keep private evaluation data out of source control, supply hidden scenarios at
+runtime using `OPENENV_PRIVATE_SCENARIOS_JSON` (JSON object keyed by task id).
+Example deployment configuration:
+```bash
+export OPENENV_EVAL_SPLIT="private_eval"
+export OPENENV_ALLOW_CLIENT_EVAL_OVERRIDE="false"
+export OPENENV_PRIVATE_SCENARIOS_JSON='{"task_easy":[{"scenario_id":"easy-private-001","emails":[{"email_id":"easy-p-001","subject":"Private billing exception","body":"Please correct invoice mismatch for contract addendum B-7 before end of day.","sender":"contracts@partner.example","timestamp":"2026-04-03T09:00:00Z","thread_history":["Customer requested corrected invoice reference."]}],"ground_truth":[{"label":"normal","route_to":"billing","priority_weight":1.0,"summary_keywords":["invoice mismatch","contract addendum","correct"]}]}],"task_medium":[],"task_hard":[]}'
+```
+Notes:
+- Keep this value in deployment secrets or runtime environment config.
+- Use valid JSON with double quotes only.
+- You can provide multiple scenarios per task by adding more objects to each task list.
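A hedged sketch of how a server could read and sanity-check `OPENENV_PRIVATE_SCENARIOS_JSON` at startup. The function name `load_private_scenarios` is hypothetical, and the check that `emails` and `ground_truth` have equal length is an assumption based on the example above, where each email has exactly one ground-truth entry.

```python
import json
import os


def load_private_scenarios(env_var: str = "OPENENV_PRIVATE_SCENARIOS_JSON") -> dict:
    """Read hidden scenarios from the environment; return {} when unset."""
    raw = os.environ.get(env_var, "").strip()
    if not raw:
        return {}
    # json.loads raises json.JSONDecodeError on invalid input,
    # for example when single quotes are used instead of double quotes.
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object keyed by task id")
    for task_id, scenarios in data.items():
        if not isinstance(scenarios, list):
            raise ValueError(f"{task_id}: expected a list of scenarios")
        for scenario in scenarios:
            # Assumption: one ground-truth entry per email, in the same order.
            if len(scenario["emails"]) != len(scenario["ground_truth"]):
                raise ValueError(f"{task_id}: emails and ground_truth differ in length")
    return data
```

Failing fast on malformed JSON at startup is usually easier to debug than a grading error in the middle of an episode.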
+### Task 1 — Easy (`task_easy`)
+Objective: Correctly classify a single unambiguous email.
+Scoring:
+- Correct label: 1.0
+- Wrong label but correct routing: 0.3
+- Everything wrong: 0.0
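The three tiers above can be sketched as a pure function. This is an illustration only: `grade_easy` is a hypothetical name, and exact string comparison is an assumption (the real grader may normalize case or whitespace).

```python
def grade_easy(label: str, route_to: str, true_label: str, true_route: str) -> float:
    """Score one email using the three tiers listed for task_easy."""
    if label == true_label:
        return 1.0  # correct label
    if route_to == true_route:
        return 0.3  # wrong label, but routed to the right place
    return 0.0  # everything wrong
```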
+### Task 2 — Medium (`task_medium`)
+Objective: Triage a queue of 5 emails with mixed priority signals.
+Scoring:
+- Each email scored individually
+- Score = (correct labels / total emails) * priority weight factor
+- Higher-priority misclassifications are penalized more heavily
+- Final score = weighted mean of all individual scores
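One possible reading of the weighted mean, sketched under assumptions: each email earns 1 for a correct label and 0 otherwise, and the weight is the `priority_weight` field that appears in the ground-truth example earlier. The function name `grade_medium` is hypothetical and the repository's grader may combine these terms differently.

```python
def grade_medium(predicted: list[str], ground_truth: list[dict]) -> float:
    """Weighted share of correctly labelled emails (assumed formula)."""
    if len(predicted) != len(ground_truth):
        raise ValueError("one predicted label is required per email")
    total_weight = sum(gt["priority_weight"] for gt in ground_truth)
    if total_weight == 0:
        return 0.0
    # Missing a high-priority email costs more because its weight is larger.
    earned = sum(
        gt["priority_weight"]
        for label, gt in zip(predicted, ground_truth)
        if label == gt["label"]
    )
    return earned / total_weight
```

With weights 3.0 and 1.0, getting only the heavier email right yields 0.75, while getting only the lighter one right yields 0.25.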
+### Task 3 — Hard (`task_hard`)
+Objective: Handle a complex complaint that crosses multiple categories.
+Scoring:
+- Escalated to safety: 0.4 weight
+- Correct routing: 0.3 weight
+- Marked as urgent: 0.3 weight
+- Penalty: -0.2 if marked as spam
+- Final score = weighted sum of sub-scores (clipped to 0.0 minimum)
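The weighted sum for the hard task can be sketched as follows. Two assumptions are made here and are not confirmed by the README: "escalated to safety" is taken to mean `route_to == "safety"`, and "correct routing" is taken to mean a match against the ground-truth route. `grade_hard` is a hypothetical name.

```python
def grade_hard(label: str, route_to: str, true_route: str) -> float:
    """Weighted sum of sub-scores, clipped to a minimum of 0.0."""
    score = 0.0
    if route_to == "safety":  # assumption: escalation means routing to safety
        score += 0.4
    if route_to == true_route:  # assumption: compared with ground-truth route
        score += 0.3
    if label == "urgent":
        score += 0.3
    if label == "spam":  # explicit penalty for dismissing the complaint
        score -= 0.2
    return max(score, 0.0)
```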
+### Task 4 — Production (`task_production`)
+Objective: Simulate a production inbox with mixed operational load across safety,
+engineering, billing, support, spam, and low-priority traffic.
+Scoring:
+- Per-email weighted scoring by business priority
+- Route-noise penalty when actions route to too many teams
+- Summary quality based on contextual evidence keywords and anti-stuffing rules
+- Deterministic escalation follow-ups are inserted when critical triage is missed
+- Runtime controls available via `/reset` payload for production simulations:
+  - `production_profile`: `light` | `standard` | `heavy`
+  - `business_hours_mode`: `true` | `false`
+  - `escalation_mode`: `low` | `normal` | `high`
+---
+## Reward Function
+The reward function provides dense training signal at every step, not just binary pass/fail.
+### Formula
+```text
+final_reward = base_score + progress_signal + trajectory_bonus - penalties - step_cost
+```
+### Components
+| Component | Value | Condition |
+|---|---|---|
+| Base score | 0.0-1.0 | Raw grader score for the current step |
+| Progress signal | ~0.00 to ~0.13 | Partial credit for advancing queue, quality, and positive trend |
+| Step cost | ~-0.005 to ~-0.015 | Gentle efficiency pressure over longer episodes |
+| Trajectory bonus | +0.2 | If all tasks completed with mean score > 0.8 |
+| Archive quality penalty | -0.5 | Archive action with an underspecified summary |
+| Loop detection penalty | -0.3 | Same action repeated 3+ times consecutively |
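+The values in the table are signed contributions: penalties and step cost lower the reward, so the formula subtracts their magnitudes.
+The final reward is clipped to [-1.0, 1.0] before being returned.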
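A small sketch of the formula and the clipping step. It assumes `penalties` and `step_cost` are passed as non-negative magnitudes, and it treats each action as a hashable tuple for loop detection; `compose_reward` and `is_looping` are hypothetical names, not the repository's functions.

```python
def compose_reward(
    base_score: float,
    progress_signal: float = 0.0,
    trajectory_bonus: float = 0.0,
    penalties: float = 0.0,
    step_cost: float = 0.0,
) -> float:
    """Combine the reward components and clip the result to [-1.0, 1.0]."""
    raw = base_score + progress_signal + trajectory_bonus - penalties - step_cost
    return max(-1.0, min(1.0, raw))


def is_looping(actions: list[tuple], window: int = 3) -> bool:
    """True when the last `window` actions are identical."""
    if len(actions) < window:
        return False
    return len(set(actions[-window:])) == 1
```

For example, a step with base score 0.85, progress 0.1, and the trajectory bonus of 0.2 would exceed 1.0 before clipping and is returned as 1.0.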
+---
+## Quick Start
+### Prerequisites
+- Python 3.11+
+- API endpoint, model name, and token for inference
+### Installation
+```bash
+pip install -r requirements.txt
+export API_BASE_URL="https://router.huggingface.co/v1"
+export MODEL_NAME="gpt-4o"
+export HF_TOKEN="your-token-here"
+```
+### Run the environment locally
+```bash
+python server.py
+curl -X POST http://localhost:7860/reset \
+  -H "Content-Type: application/json" \
+  -d '{"task_id": "task_easy"}'
+curl -X POST http://localhost:7860/step \
+  -H "Content-Type: application/json" \
+  -d '{"label": "urgent", "summary": "Test", "route_to": "billing"}'
+curl -X POST http://localhost:7860/state
+```
+---
+## Running Inference
+```bash
+python inference.py --task all
+python inference.py --task 1
+python inference.py --task 4 --production-profile heavy --business-hours-mode --escalation-mode high
+```
+The script reads API settings from environment variables and uses fallback actions when
+model output is unparseable, so episodes still complete.
+---
+## Inference Architecture
+The inference script (`inference.py`) follows this loop:
+```text
+1. Initialize OpenAI client + environment
+2. reset() to get first observation
+3. Loop until done or MAX_STEPS:
+  - Build prompt from observation + history
+  - Call LLM with OpenAI client (catch request errors)
+  - Parse response into action (fallback on parse failure)
+  - env.step(action)
+  - Record reward and history
+4. Print score table
+```
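The loop above can be sketched with the environment and model passed in as plain callables, which keeps the example runnable without a server or API key. Everything here is an assumption for illustration: `run_episode` is a hypothetical name, `step` returns a tuple rather than the `StepResult` object the project uses, and the value of `MAX_STEPS` is not stated in the README.

```python
MAX_STEPS = 20  # assumption: the real constant's value is not given

FALLBACK_ACTION = {
    "label": "normal",
    "summary": "Unable to parse response",
    "route_to": "general",
}


def run_episode(reset, step, ask_model, parse, max_steps: int = MAX_STEPS) -> list[float]:
    """Run one episode and return the reward collected at each step.

    reset()            -> first observation
    step(action)       -> (observation, reward, done)
    ask_model(obs, h)  -> raw model reply (may raise on request errors)
    parse(reply)       -> action dict, or None when the reply is unusable
    """
    observation = reset()
    history: list[tuple] = []
    rewards: list[float] = []
    for _ in range(max_steps):
        try:
            reply = ask_model(observation, history)
        except Exception:  # broad on purpose: any request error triggers the fallback
            reply = ""
        action = parse(reply) or dict(FALLBACK_ACTION)
        observation, reward, done = step(action)
        rewards.append(reward)
        history.append((action, reward))
        if done:
            break
    return rewards
```

Injecting the callables also makes the loop easy to test with fakes before pointing it at a live endpoint.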
+### Environment Variables Required
+```bash
+export API_BASE_URL="https://router.huggingface.co/v1"
+export MODEL_NAME="gpt-4o"
+export HF_TOKEN="your-token-here"
+export INFERENCE_RUNTIME_BUDGET_SECONDS="1140"
+export INFERENCE_REQUEST_TIMEOUT_SECONDS="12"
+```
+Runtime controls:
+- `INFERENCE_RUNTIME_BUDGET_SECONDS` caps the script's total wall-clock runtime (default 1140s, i.e. 19 minutes).
+- `INFERENCE_REQUEST_TIMEOUT_SECONDS` sets the timeout for each LLM request (default 12s).
+- Equivalent CLI flags: `--runtime-budget-seconds` and `--request-timeout-seconds`.
+Fallback behavior when parsing fails:
+```json
+{"label": "normal", "summary": "Unable to parse response", "route_to": "general"}
+```
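A sketch of tolerant response parsing that ends in the fallback action shown above. The approach (grab the outermost `{...}` span, then validate the label) is an assumption about how such parsing might work, and `parse_model_reply` is a hypothetical name; the actual `inference.py` may parse text differently.

```python
import json
import re

ALLOWED_LABELS = {"urgent", "normal", "spam", "archive"}

FALLBACK_ACTION = {
    "label": "normal",
    "summary": "Unable to parse response",
    "route_to": "general",
}


def parse_model_reply(text: str) -> dict:
    """Extract an action from free-form model output, or return the fallback."""
    # Models often wrap JSON in prose, so search for the outermost braces.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return dict(FALLBACK_ACTION)
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return dict(FALLBACK_ACTION)
    if not isinstance(data, dict) or data.get("label") not in ALLOWED_LABELS:
        return dict(FALLBACK_ACTION)
    return {
        "label": data["label"],
        "summary": str(data.get("summary", "")),
        # Assumption: default to the same route the fallback action uses.
        "route_to": str(data.get("route_to", "general")),
    }
```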
+---
+## Score Table
+Placeholder until inference is run.
+| Model | Task 1 (Easy) | Task 2 (Medium) | Task 3 (Hard) | Mean |
+|---|---|---|---|---|
+| MODEL_NAME | TBD | TBD | TBD | TBD |
+Expected rough ranges:
+- GPT-4o: 0.8-1.0 on easy, 0.5-0.8 on medium, 0.4-0.7 on hard
+---
+## Docker Deployment
+```bash
+docker build -t email-triage-env .
+docker run -p 7860:7860 email-triage-env
+curl -X POST http://localhost:7860/reset \
+  -H "Content-Type: application/json" \
+  -d '{"task_id": "task_easy"}'
+```
+For Apple Silicon:
+```bash
+docker build --platform linux/amd64 -t email-triage-env .
+```
+---
+## Hugging Face Space
+Live URL placeholder:
+`https://huggingface.co/spaces/YOUR_USERNAME/email-triage-env`
+The Space homepage (`/`) now serves a lightweight interactive triage console for
+manual testing. Machine-readable service metadata is available at `GET /meta`.
+Example interaction:
+```bash
+export SPACE_URL="https://YOUR_USERNAME-email-triage-env.hf.space"
+curl -X POST "$SPACE_URL/reset" \
+  -H "Content-Type: application/json" \
+  -d '{"task_id": "task_easy"}'
+```
+---
+## Pre-Submission Validation
+Run the validator before submitting your environment.
+```bash
+chmod +x validate-submission.sh
+./validate-submission.sh https://YOUR_USERNAME-email-triage-env.hf.space .
+```
+The script checks:
+- HF Space `/reset` health (HTTP 200 expected)
+- Docker build success
+- `openenv validate` pass status
+---
+## API Reference
+### POST /reset
+Request:
+```json
+{"task_id": "task_easy"}
+```
+Response:
+```json
+{
+  "observation": {
+    "email_id": "easy-001",
+    "subject": "Quarterly invoice available",
+    "body": "...",
+    "sender": "accounts@vendor-example.com",
+    "timestamp": "2026-03-25T09:15:00Z",
+    "thread_history": ["..."],
+    "task_id": "task_easy",
+    "step_number": 0,
+    "total_emails": 1
+  },
+  "info": {"task_id": "task_easy", "step": 0}
+}
+```
+### POST /step
+Request:
+```json
+{
+  "label": "urgent",
+  "summary": "Customer needs immediate help",
+  "route_to": "support"
+}
+```
+Response:
+```json
+{
+  "observation": {},
+  "reward": 0.85,
+  "done": false,
+  "info": {"step": 1, "task_id": "task_easy"}
+}
+```
+### POST /state
+No request body required.
+Response: `EnvironmentState` JSON object.
+---
+## Project Structure
+```text
+.
+├── models.py
+├── tasks.py
+├── graders.py
+├── environment.py
+├── server.py
+├── server/
+│   └── app.py
+├── inference.py
+├── openenv.yaml
+├── Dockerfile
+├── requirements.txt
+├── pyproject.toml
+├── uv.lock
+├── validate-submission.sh
+├── README.md
+└── RULES.md
+```
+---
+## Known Limitations
+| Limitation | Impact |
+|---|---|
+| Static scenario pools | No live inbox ingestion from production systems |
+| Single-agent server instance | Concurrent agents can conflict |
+| No live thread simulation | Thread history is static |
+| English-only content | No multilingual coverage |
+| No attachments | Text-only triage |
+| Simplified routing | No org chart or availability modeling |
+| Limited temporal dynamics | Production task can generate deterministic escalations, but not full live message streams |
+| Rule-based grading edges | Equivalent decisions may score differently from humans |
+What an agent cannot exploit:
+- The correct answer is never present in observations
+- The grader is a pure function and cannot be manipulated
+- The step cost applies on every step and can only be reduced by finishing in fewer steps
+---
+## Summary of Revision 2 Changes
+| What Changed | Before | After | Why |
+|---|---|---|---|
+| Return type of step() | tuple | StepResult object | Match sample result.observation pattern |
+| Return type of reset() | EmailObservation | ResetResult object | Match sample result.observation pattern |
+| New models | 4 models | 6 models (+StepResult, +ResetResult) | Match sample interface |
+| API key reading | OPENAI_API_KEY style | HF_TOKEN or API_KEY via os.getenv | Match sample fallback pattern |
+| Temperature guidance | 0 | 0.2 | Match sample behavior |
+| Response parsing | JSON-only assumption | Text parsing with fallback action | Robustness to non-JSON model output |
+| History tracking | Optional | Mandatory | Match sample architecture |
+| Step cap | Not explicit | MAX_STEPS constant | Runtime safety and reproducibility |
+---
+## Contributing
+Read `RULES.md` before contributing.
+Key constraints:
+- Type hints and Pydantic models required
+- No extra dependencies without explicit approval
+- No features beyond project brief
+- Graders must remain deterministic pure functions
+---
+## License
+MIT License.