Commit: Deploy Project Epsilon Space bundle

Files changed:

- .gitignore (+1 −0)
- README.md (+87 −16)
- docs/HF_SPACE_README.md (+89 −20)
- src/executive_assistant/deployment.py (+96 −22)
- tests/test_deployment.py (+7 −5)
.gitignore (CHANGED)

```diff
@@ -5,4 +5,5 @@ artifacts/
 __pycache__/
 .env
 .env.app
+.env.hf.space
 .env.training
```
README.md (CHANGED)

```diff
@@ -11,44 +11,115 @@ short_description: OpenEnv executive assistant sandbox demo for judges.
 
 # Project Epsilon
 
-
+Welcome judges of the **OpenEnv Scaler x Meta x PyTorch Hack**. This Space hosts **EmailMaestro**, our deterministic executive assistant environment and policy demo built by **Project Epsilon** for repeatable agent evaluation, visible tool use, and side-by-side policy comparison.
 
-## Team
+## Team Epsilon Roster
 
 - Team name: `Project Epsilon`
-- Hugging Face
-
+- Hugging Face Space: `Flickinshots/EmailMaestro`
+- Live app: `https://Flickinshots-EmailMaestro.hf.space`
+- Public repository view on Hugging Face: `https://huggingface.co/spaces/Flickinshots/EmailMaestro`
 
-
+- `@Flickinshots` — Team lead and primary Space owner
+- `@HF_USERNAME_2` — Team member
+- `@HF_USERNAME_3` — Team member
 
-##
+## Executive Summary
 
-
-- A judge-friendly Gradio interface that replays the shared `EpisodeRunner` loop step by step
-- Side-by-side policy execution for `baseline`, `rl`, and optional `openrouter`
-- Visible inbox, todo, file-search, and action-log state so evaluators can inspect each mutation
+EmailMaestro is an **Autonomous Executive Assistant Sandbox** designed around the OpenEnv pattern: typed observations, typed actions, deterministic rewards, and a visible environment loop. Instead of depending on a brittle live email provider, the agent operates inside an isolated SQLite-backed workspace that simulates an inbox, a todo manager, and a local document store. That lets judges inspect policy quality through reproducible runs rather than through one-off anecdotal chats.
+
+This Space is intended to show three things clearly:
+
+- The agent can operate as a structured tool user, not just a text generator.
+- The same environment loop is shared across notebook experiments, CLI evaluation, tests, and the live Gradio UI.
+- Baseline, learned, and model-backed policies can be compared under the same task and reward conditions.
 
-
+## Why This Fits The Hackathon
+
+OpenEnv was introduced by Hugging Face and Meta as an open framework for building agent environments with typed observations, actions, and rewards. The Scaler hack timeline for this event lists the main build window as **March 25, 2026 through April 8, 2026**, with finals on **April 25-26, 2026** in Bengaluru. Our submission is shaped directly around that evaluation style:
+
+- Deterministic environment setup
+- Typed environment contracts
+- Observable step-by-step policy execution
+- Reproducible seeded tasks
+- Judge-friendly visualization of state transitions
 
-
+## Problem Framing
+
+Most assistant demos stop at text quality. We wanted to show that an agent can manage a workflow end to end:
+
+- read a chaotic inbox
+- extract structured work into a task list
+- triage low-priority versus high-priority communication
+- search a local knowledge source before replying
+- produce actions that can be graded deterministically
+
+To make that possible under hackathon constraints, we replaced live services with a controlled mock workspace that still feels operationally realistic.
 
-##
+## Core Architecture
+
+- **Environment state:** in-memory SQLite workspace simulating emails, todos, files, and action history
+- **OpenEnv contract:** Pydantic models defining observations, actions, rewards, and policy decisions
+- **Execution loop:** shared `EpisodeRunner` used by tests, scripts, notebook experiments, and the Gradio app
+- **Policies:** deterministic baseline, tabular RL checkpoint replay, and optional OpenRouter-backed live policy execution
+- **UI layer:** Gradio control room plus visible workspace snapshots for judges
+
+## Seeded Judge Tasks
+
+### 1. Easy: Deadline Extraction
+
+The environment injects an academic email containing multiple deadlines. The policy must read the message, create the correct todo entries, and archive the source email.
+
+### 2. Medium: Inbox Triage And Negotiation
+
+The environment mixes newsletters, an urgent complaint, and a meeting reschedule request. The policy must archive low-value mail, escalate the complaint properly, and send a concrete meeting reply.
+
+### 3. Hard: RAG Reply
+
+The environment includes a stakeholder email asking for exact metrics from a local report. The policy must search the file store, recover the relevant values, and draft a grounded reply using the retrieved evidence.
+
+## What Judges Can Inspect In This Space
+
+- Live observation payloads
+- Workspace tables for emails, todos, files, and action logs
+- Step-by-step trace rows with reasoning, action type, status, score, and done state
+- Differences between `baseline`, bundled `rl`, and optional `openrouter` policies
+
+## Runtime And Deployment Notes
 
 - SDK: `docker`
 - App port: `7860`
 - Entry point: `python app.py`
 - Optional secret: `OPENROUTER_API_KEY`
 - A trained RL checkpoint is bundled in `artifacts/checkpoints/` so the `rl` policy is available immediately in the demo.
+- The bundled RL artifact lives at `artifacts/checkpoints/q_policy_notebook.json`
+- The Space is deployed from the same repository used for local tests and notebook-backed experiments
 
-## Judge Flow
+## Recommended Judge Flow
 
 1. Open the Space and choose one of the seeded scenarios.
 2. Run the deterministic `baseline` policy for a guaranteed reference trace.
 3. Switch to `rl` to replay the bundled learned checkpoint.
 4. Add `OPENROUTER_API_KEY` in Space secrets to enable the live model-backed path.
+5. Compare how the workspace mutates after each step instead of evaluating only the final response.
+
+## Implementation Notes
+
+- The app, scripts, notebook, and tests all rely on the same `EpisodeRunner` workflow loop.
+- Live API access stays in the policy layer, so deterministic evaluation remains possible without network access.
+- The current RL path is intentionally lightweight and reproducible: a tabular Q-learning prototype trained over seeded action templates.
+- The Gradio interface is designed for demonstration and debugging, not just for final-state screenshots.
+
+## What We Want Judges To Notice
+
+- Strong separation between environment state, policy choice, and reward logic
+- Clear evidence of agent tool use
+- Reproducibility across runs
+- A hackathon-friendly deployment that still preserves engineering discipline
 
-## References
+## References And Context
 
 - Hack dashboard: https://www.scaler.com/openenv-hackathon
 - OpenEnv launch: https://huggingface.co/blog/openenv
-- Space
+- Space page: https://huggingface.co/spaces/Flickinshots/EmailMaestro
+- Live app: https://Flickinshots-EmailMaestro.hf.space
```
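The README above leans on a shared `EpisodeRunner` observe/act/step loop with typed observations and deterministic rewards. As a hedged illustration of that pattern only (none of these class, field, or function names are taken from the actual repository), a deterministic toy version of such a loop might look like:

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """Typed observation, in the spirit of the OpenEnv pattern."""
    inbox_count: int
    done: bool = False

@dataclass
class StepResult:
    observation: Observation
    reward: float
    done: bool

class ToyInboxEnv:
    """Deterministic toy environment: archive mail until the inbox is empty."""
    def __init__(self, emails: int = 3) -> None:
        self.remaining = emails

    def reset(self) -> Observation:
        return Observation(inbox_count=self.remaining)

    def step(self, action: str) -> StepResult:
        # Reward only the useful action; everything else scores zero.
        if action == "archive" and self.remaining > 0:
            self.remaining -= 1
            reward = 1.0
        else:
            reward = 0.0
        done = self.remaining == 0
        return StepResult(Observation(self.remaining, done), reward, done)

def run_episode(env, policy, max_steps: int = 10):
    """Minimal episode loop: observe, act, step, record, stop on done."""
    obs = env.reset()
    trace, total = [], 0.0
    for _ in range(max_steps):
        action = policy(obs)
        result = env.step(action)
        trace.append((action, result.reward))
        total += result.reward
        obs = result.observation
        if result.done:
            break
    return trace, total

trace, total = run_episode(ToyInboxEnv(3), lambda obs: "archive")
# total == 3.0 after three rewarded archive steps
```

Because the environment is deterministic and seeded, replaying the same policy yields an identical trace, which is the reproducibility property the Space advertises.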
docs/HF_SPACE_README.md (CHANGED)

```diff
@@ -1,5 +1,5 @@
 ---
-title:
+title: EmailMaestro | Executive Assistant Sandbox
 emoji: "🧭"
 colorFrom: yellow
 colorTo: gray
@@ -11,43 +11,112 @@ short_description: OpenEnv executive assistant sandbox demo for judges.
 
 # Project Epsilon
 
-
+Welcome judges of the **OpenEnv Scaler x Meta x PyTorch Hack**. This Space hosts **EmailMaestro**, our deterministic executive assistant environment and policy demo built by **Team Epsilon** for repeatable agent evaluation, visible tool use, and side-by-side policy comparison.
 
-## Team
+## Team Epsilon Roster
 
-- Team name: `
-- Hugging Face
-
+- Team name: `Team Epsilon`
+- Hugging Face Space: `Flickinshots/EmailMaestro`
+- Live app: `https://Flickinshots-EmailMaestro.hf.space`
+- Public repository view on Hugging Face: `https://huggingface.co/spaces/Flickinshots/EmailMaestro`
 
-
+- `@flickinshots` — Team lead and primary Space owner
+- `@ShreyaKhatik` — Team member
+- `@itsayushdey` — Team member
 
-##
+## Executive Summary
 
-
-- A Gradio judge console that replays the shared `EpisodeRunner` loop step by step
-- Policy switching across `baseline`, bundled `rl`, and optional `openrouter`
-- Visible inbox, todo, file-search, and action-log state transitions
+EmailMaestro is an **Autonomous Executive Assistant Sandbox** designed around the OpenEnv pattern: typed observations, typed actions, deterministic rewards, and a visible environment loop. Instead of depending on a brittle live email provider, the agent operates inside an isolated SQLite-backed workspace that simulates an inbox, a todo manager, and a local document store. That lets judges inspect policy quality through reproducible runs rather than through one-off anecdotal chats.
+
+This Space is intended to show three things clearly:
+
+- The agent can operate as a structured tool user, not just a text generator.
+- The same environment loop is shared across notebook experiments, CLI evaluation, tests, and the live Gradio UI.
+- Baseline, learned, and model-backed policies can be compared under the same task and reward conditions.
 
-
+## Why This Fits The Hackathon
+
+OpenEnv was introduced by Hugging Face and Meta as an open framework for building agent environments with typed observations, actions, and rewards. The Scaler hack timeline for this event lists the main build window as **March 25, 2026 through April 8, 2026**, with finals on **April 25-26, 2026** in Bengaluru. Our submission is shaped directly around that evaluation style:
+
+- Deterministic environment setup
+- Typed environment contracts
+- Observable step-by-step policy execution
+- Reproducible seeded tasks
+- Judge-friendly visualization of state transitions
 
-
+## Problem Framing
+
+Most assistant demos stop at text quality. We wanted to show that an agent can manage a workflow end to end:
+
+- read a chaotic inbox
+- extract structured work into a task list
+- triage low-priority versus high-priority communication
+- search a local knowledge source before replying
+- produce actions that can be graded deterministically
+
+To make that possible under hackathon constraints, we replaced live services with a controlled mock workspace that still feels operationally realistic.
 
-##
+## Core Architecture
+
+- **Environment state:** in-memory SQLite workspace simulating emails, todos, files, and action history
+- **OpenEnv contract:** Pydantic models defining observations, actions, rewards, and policy decisions
+- **Execution loop:** shared `EpisodeRunner` used by tests, scripts, notebook experiments, and the Gradio app
+- **Policies:** deterministic baseline, tabular RL checkpoint replay, and optional OpenRouter-backed live policy execution
+- **UI layer:** Gradio control room plus visible workspace snapshots for judges
+
+## Seeded Judge Tasks
+
+### 1. Easy: Deadline Extraction
+
+The environment injects an academic email containing multiple deadlines. The policy must read the message, create the correct todo entries, and archive the source email.
+
+### 2. Medium: Inbox Triage And Negotiation
+
+The environment mixes newsletters, an urgent complaint, and a meeting reschedule request. The policy must archive low-value mail, escalate the complaint properly, and send a concrete meeting reply.
+
+### 3. Hard: RAG Reply
+
+The environment includes a stakeholder email asking for exact metrics from a local report. The policy must search the file store, recover the relevant values, and draft a grounded reply using the retrieved evidence.
+
+## What Judges Can Inspect In This Space
+
+- Live observation payloads
+- Workspace tables for emails, todos, files, and action logs
+- Step-by-step trace rows with reasoning, action type, status, score, and done state
+- Differences between `baseline`, bundled `rl`, and optional `openrouter` policies
+
+## Runtime And Deployment Notes
 
 - SDK: `docker`
 - App port: `7860`
 - Entry point: `python app.py`
 - Optional secret: `OPENROUTER_API_KEY`
 - Bundled RL checkpoint path: `artifacts/checkpoints/q_policy_notebook.json`
+- The Space is deployed from the same repository used for local tests and notebook-backed experiments
 
-## Judge Flow
+## Recommended Judge Flow
 
 1. Open the Space and choose one of the seeded scenarios.
-2. Run `baseline`
-3. Switch to `rl` to replay the
-4. Add `OPENROUTER_API_KEY` in Space secrets to enable the live model-backed
+2. Run the deterministic `baseline` policy for a guaranteed reference trace.
+3. Switch to `rl` to replay the bundled learned checkpoint.
+4. Add `OPENROUTER_API_KEY` in Space secrets to enable the live model-backed path.
+5. Compare how the workspace mutates after each step instead of evaluating only the final response.
+
+## Implementation Notes
+
+- The app, scripts, notebook, and tests all rely on the same `EpisodeRunner` workflow loop.
+- Live API access stays in the policy layer, so deterministic evaluation remains possible without network access.
+- The current RL path is intentionally lightweight and reproducible: a tabular Q-learning prototype trained over seeded action templates.
+- The Gradio interface is designed for demonstration and debugging, not just for final-state screenshots.
+
+## What We Want Judges To Notice
+
+- Strong separation between environment state, policy choice, and reward logic
+- Clear evidence of agent tool use
+- Reproducibility across runs
+- A hackathon-friendly deployment that still preserves engineering discipline
 
-## References
+## References And Context
 
 - Hack dashboard: https://www.scaler.com/openenv-hackathon
 - OpenEnv launch: https://huggingface.co/blog/openenv
```
src/executive_assistant/deployment.py (CHANGED)

```diff
@@ -9,11 +9,11 @@ from src.executive_assistant.training import default_checkpoint_path, train_q_le
 
 
 REPO_ROOT = Path(__file__).resolve().parents[2]
-DEFAULT_SPACE_TITLE = "
+DEFAULT_SPACE_TITLE = "EmailMaestro | Executive Assistant Sandbox"
 DEFAULT_HF_USERNAMES = [
-    "
-    "
-    "
+    "flickinshots",
+    "ShreyaKhatik",
+    "itsayushdey",
 ]
 DEFAULT_CHECKPOINT_NAME = "q_policy_notebook.json"
 DEFAULT_STAGE_IGNORE_NAMES = {
@@ -41,7 +41,7 @@ DEFAULT_STAGE_IGNORE_FILES = {
 class HFSpaceDeployConfig:
     repo_id: str
     title: str = DEFAULT_SPACE_TITLE
-    team_name: str = "
+    team_name: str = "Team Epsilon"
     hf_usernames: tuple[str, ...] = tuple(DEFAULT_HF_USERNAMES)
     checkpoint_name: str = DEFAULT_CHECKPOINT_NAME
     app_port: int = 7860
@@ -77,7 +77,12 @@ def parse_hf_usernames(raw_value: str | None) -> tuple[str, ...]:
 
 
 def render_space_readme(config: HFSpaceDeployConfig) -> str:
-
+    roster_lines: list[str] = []
+    if config.hf_usernames:
+        roster_lines.append(f"- `@{config.hf_usernames[0]}` — Team lead and primary Space owner")
+        for username in config.hf_usernames[1:]:
+            roster_lines.append(f"- `@{username}` — Team member")
+    roster = "\n".join(roster_lines) if roster_lines else "- Team roster to be added"
     checkpoint_note = (
         "A trained RL checkpoint is bundled in `artifacts/checkpoints/` so the `rl` policy "
         "is available immediately in the demo."
@@ -98,47 +103,116 @@ short_description: OpenEnv executive assistant sandbox demo for judges.
 
 # {config.team_name}
 
-
+Welcome judges of the **OpenEnv Scaler x Meta x PyTorch Hack**. This Space hosts **EmailMaestro**, our deterministic executive assistant environment and policy demo built by **{config.team_name}** for repeatable agent evaluation, visible tool use, and side-by-side policy comparison.
 
-## Team
+## Team Epsilon Roster
 
 - Team name: `{config.team_name}`
-- Hugging Face
-
+- Hugging Face Space: `{config.repo_id}`
+- Live app: `{config.app_url}`
+- Public repository view on Hugging Face: `{config.space_url}`
 
-
+{roster}
 
-##
+## Executive Summary
 
-
-- A judge-friendly Gradio interface that replays the shared `EpisodeRunner` loop step by step
-- Side-by-side policy execution for `baseline`, `rl`, and optional `openrouter`
-- Visible inbox, todo, file-search, and action-log state so evaluators can inspect each mutation
+EmailMaestro is an **Autonomous Executive Assistant Sandbox** designed around the OpenEnv pattern: typed observations, typed actions, deterministic rewards, and a visible environment loop. Instead of depending on a brittle live email provider, the agent operates inside an isolated SQLite-backed workspace that simulates an inbox, a todo manager, and a local document store. That lets judges inspect policy quality through reproducible runs rather than through one-off anecdotal chats.
+
+This Space is intended to show three things clearly:
+
+- The agent can operate as a structured tool user, not just a text generator.
+- The same environment loop is shared across notebook experiments, CLI evaluation, tests, and the live Gradio UI.
+- Baseline, learned, and model-backed policies can be compared under the same task and reward conditions.
 
-
+## Why This Fits The Hackathon
+
+OpenEnv was introduced by Hugging Face and Meta as an open framework for building agent environments with typed observations, actions, and rewards. The Scaler hack timeline for this event lists the main build window as **March 25, 2026 through April 8, 2026**, with finals on **April 25-26, 2026** in Bengaluru. Our submission is shaped directly around that evaluation style:
+
+- Deterministic environment setup
+- Typed environment contracts
+- Observable step-by-step policy execution
+- Reproducible seeded tasks
+- Judge-friendly visualization of state transitions
 
-
+## Problem Framing
+
+Most assistant demos stop at text quality. We wanted to show that an agent can manage a workflow end to end:
+
+- read a chaotic inbox
+- extract structured work into a task list
+- triage low-priority versus high-priority communication
+- search a local knowledge source before replying
+- produce actions that can be graded deterministically
+
+To make that possible under hackathon constraints, we replaced live services with a controlled mock workspace that still feels operationally realistic.
 
-##
+## Core Architecture
+
+- **Environment state:** in-memory SQLite workspace simulating emails, todos, files, and action history
+- **OpenEnv contract:** Pydantic models defining observations, actions, rewards, and policy decisions
+- **Execution loop:** shared `EpisodeRunner` used by tests, scripts, notebook experiments, and the Gradio app
+- **Policies:** deterministic baseline, tabular RL checkpoint replay, and optional OpenRouter-backed live policy execution
+- **UI layer:** Gradio control room plus visible workspace snapshots for judges
+
+## Seeded Judge Tasks
+
+### 1. Easy: Deadline Extraction
+
+The environment injects an academic email containing multiple deadlines. The policy must read the message, create the correct todo entries, and archive the source email.
+
+### 2. Medium: Inbox Triage And Negotiation
+
+The environment mixes newsletters, an urgent complaint, and a meeting reschedule request. The policy must archive low-value mail, escalate the complaint properly, and send a concrete meeting reply.
+
+### 3. Hard: RAG Reply
+
+The environment includes a stakeholder email asking for exact metrics from a local report. The policy must search the file store, recover the relevant values, and draft a grounded reply using the retrieved evidence.
+
+## What Judges Can Inspect In This Space
+
+- Live observation payloads
+- Workspace tables for emails, todos, files, and action logs
+- Step-by-step trace rows with reasoning, action type, status, score, and done state
+- Differences between `baseline`, bundled `rl`, and optional `openrouter` policies
+
+## Runtime And Deployment Notes
 
 - SDK: `docker`
 - App port: `{config.app_port}`
 - Entry point: `python app.py`
 - Optional secret: `OPENROUTER_API_KEY`
 - {checkpoint_note}
+- The bundled RL artifact lives at `artifacts/checkpoints/{config.checkpoint_name}`
+- The Space is deployed from the same repository used for local tests and notebook-backed experiments
 
-## Judge Flow
+## Recommended Judge Flow
 
 1. Open the Space and choose one of the seeded scenarios.
 2. Run the deterministic `baseline` policy for a guaranteed reference trace.
 3. Switch to `rl` to replay the bundled learned checkpoint.
 4. Add `OPENROUTER_API_KEY` in Space secrets to enable the live model-backed path.
+5. Compare how the workspace mutates after each step instead of evaluating only the final response.
+
+## Implementation Notes
+
+- The app, scripts, notebook, and tests all rely on the same `EpisodeRunner` workflow loop.
+- Live API access stays in the policy layer, so deterministic evaluation remains possible without network access.
+- The current RL path is intentionally lightweight and reproducible: a tabular Q-learning prototype trained over seeded action templates.
+- The Gradio interface is designed for demonstration and debugging, not just for final-state screenshots.
+
+## What We Want Judges To Notice
+
+- Strong separation between environment state, policy choice, and reward logic
+- Clear evidence of agent tool use
+- Reproducibility across runs
+- A hackathon-friendly deployment that still preserves engineering discipline
 
-## References
+## References And Context
 
 - Hack dashboard: https://www.scaler.com/openenv-hackathon
 - OpenEnv launch: https://huggingface.co/blog/openenv
-- Space
+- Space page: {config.space_url}
+- Live app: {config.app_url}
 """
```
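The roster logic added to `render_space_readme` can be exercised on its own. The sketch below reproduces just that branch as a standalone function, with the `HFSpaceDeployConfig` plumbing omitted; it assumes only the behavior visible in the diff (first username becomes the team lead, the rest become members, empty input yields a placeholder):

```python
def render_roster(hf_usernames: tuple[str, ...]) -> str:
    """Standalone mirror of the roster block in render_space_readme."""
    roster_lines: list[str] = []
    if hf_usernames:
        # The first username is treated as the team lead and Space owner.
        roster_lines.append(f"- `@{hf_usernames[0]}` — Team lead and primary Space owner")
        for username in hf_usernames[1:]:
            roster_lines.append(f"- `@{username}` — Team member")
    return "\n".join(roster_lines) if roster_lines else "- Team roster to be added"

print(render_roster(("flickinshots", "ShreyaKhatik", "itsayushdey")))
```

Keeping this as pure string assembly (no I/O, no config lookup) is what makes the rendered README assertable in `tests/test_deployment.py` below.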
tests/test_deployment.py (CHANGED)

```diff
@@ -13,14 +13,16 @@ def test_parse_hf_usernames_strips_at_signs() -> None:
     assert usernames == ("alice", "bob", "carol")
 
 
-def 
+def test_render_space_readme_includes_team_epsilon_roster() -> None:
     config = HFSpaceDeployConfig(
-        repo_id="
-
+        repo_id="Flickinshots/EmailMaestro",
+        team_name="Team Epsilon",
+        hf_usernames=("flickinshots", "ShreyaKhatik", "itsayushdey"),
     )
     rendered = render_space_readme(config)
-    assert "
-    assert "@
+    assert "Team Epsilon" in rendered
+    assert "@flickinshots" in rendered
+    assert "Team lead and primary Space owner" in rendered
     assert "sdk: docker" in rendered
     assert "OpenEnv Scaler x Meta x PyTorch Hack" in rendered
 
```