Flickinshots committed
Commit 696d083 · verified · Parent: f816f0e

Deploy Project Epsilon Space bundle
.gitignore CHANGED
@@ -5,4 +5,5 @@ artifacts/
  __pycache__/
  .env
  .env.app
+ .env.hf.space
  .env.training
README.md CHANGED
@@ -11,44 +11,115 @@ short_description: OpenEnv executive assistant sandbox demo for judges.

  # Project Epsilon

- Discrete Hugging Face Space for the **Autonomous Executive Assistant Sandbox**, built for the **OpenEnv Scaler x Meta x PyTorch Hack**.

- ## Team

  - Team name: `Project Epsilon`
- - Hugging Face usernames: `@Flickinshots`, `@HF_USERNAME_2`, `@HF_USERNAME_3`
- - Space repo: `Flickinshots/EmailMaestro`

- Replace the placeholder usernames above once the final team accounts are ready.

- ## What This Space Shows

- - A deterministic OpenEnv-style executive assistant environment backed by an isolated SQLite workspace
- - A judge-friendly Gradio interface that replays the shared `EpisodeRunner` loop step by step
- - Side-by-side policy execution for `baseline`, `rl`, and optional `openrouter`
- - Visible inbox, todo, file-search, and action-log state so evaluators can inspect each mutation

- ## Hack Context

- OpenEnv was announced by Hugging Face and Meta as an open source framework for building agent environments with typed observations, actions, and rewards. The Scaler dashboard for this hack lists the submission round as **March 25, 2026 through April 8, 2026**, with finals on **April 25-26, 2026** in Bengaluru. This Space packages our environment to match that workflow: deterministic tasks, structured actions, visible state transitions, and reproducible judge demos.

- ## Runtime Notes

  - SDK: `docker`
  - App port: `7860`
  - Entry point: `python app.py`
  - Optional secret: `OPENROUTER_API_KEY`
  - A trained RL checkpoint is bundled in `artifacts/checkpoints/` so the `rl` policy is available immediately in the demo.

- ## Judge Flow

  1. Open the Space and choose one of the seeded scenarios.
  2. Run the deterministic `baseline` policy for a guaranteed reference trace.
  3. Switch to `rl` to replay the bundled learned checkpoint.
  4. Add `OPENROUTER_API_KEY` in Space secrets to enable the live model-backed path.

- ## References

  - Hack dashboard: https://www.scaler.com/openenv-hackathon
  - OpenEnv launch: https://huggingface.co/blog/openenv
- - Space URL: https://huggingface.co/spaces/Flickinshots/EmailMaestro
  # Project Epsilon

+ Welcome judges of the **OpenEnv Scaler x Meta x PyTorch Hack**. This Space hosts **EmailMaestro**, our deterministic executive assistant environment and policy demo built by **Project Epsilon** for repeatable agent evaluation, visible tool use, and side-by-side policy comparison.

+ ## Team Epsilon Roster

  - Team name: `Project Epsilon`
+ - Hugging Face Space: `Flickinshots/EmailMaestro`
+ - Live app: `https://Flickinshots-EmailMaestro.hf.space`
+ - Public repository view on Hugging Face: `https://huggingface.co/spaces/Flickinshots/EmailMaestro`

+ - `@Flickinshots` — Team lead and primary Space owner
+ - `@HF_USERNAME_2` — Team member
+ - `@HF_USERNAME_3` — Team member

+ ## Executive Summary

+ EmailMaestro is an **Autonomous Executive Assistant Sandbox** designed around the OpenEnv pattern: typed observations, typed actions, deterministic rewards, and a visible environment loop. Instead of depending on a brittle live email provider, the agent operates inside an isolated SQLite-backed workspace that simulates an inbox, a todo manager, and a local document store. That lets judges inspect policy quality through reproducible runs rather than through one-off anecdotal chats.

+ This Space is intended to show three things clearly:

+ - The agent can operate as a structured tool user, not just a text generator.
+ - The same environment loop is shared across notebook experiments, CLI evaluation, tests, and the live Gradio UI.
+ - Baseline, learned, and model-backed policies can be compared under the same task and reward conditions.

+ ## Why This Fits The Hackathon
+
+ OpenEnv was introduced by Hugging Face and Meta as an open framework for building agent environments with typed observations, actions, and rewards. The Scaler hack timeline for this event lists the main build window as **March 25, 2026 through April 8, 2026**, with finals on **April 25-26, 2026** in Bengaluru. Our submission is shaped directly around that evaluation style:
+
+ - Deterministic environment setup
+ - Typed environment contracts
+ - Observable step-by-step policy execution
+ - Reproducible seeded tasks
+ - Judge-friendly visualization of state transitions
+
+ ## Problem Framing
+
+ Most assistant demos stop at text quality. We wanted to show that an agent can manage a workflow end to end:
+
+ - read a chaotic inbox
+ - extract structured work into a task list
+ - triage low-priority versus high-priority communication
+ - search a local knowledge source before replying
+ - produce actions that can be graded deterministically
+
+ To make that possible under hackathon constraints, we replaced live services with a controlled mock workspace that still feels operationally realistic.
+
+ ## Core Architecture
+
+ - **Environment state:** in-memory SQLite workspace simulating emails, todos, files, and action history
+ - **OpenEnv contract:** Pydantic models defining observations, actions, rewards, and policy decisions
+ - **Execution loop:** shared `EpisodeRunner` used by tests, scripts, notebook experiments, and the Gradio app
+ - **Policies:** deterministic baseline, tabular RL checkpoint replay, and optional OpenRouter-backed live policy execution
+ - **UI layer:** Gradio control room plus visible workspace snapshots for judges
+
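The typed contract and shared step loop described above can be sketched in miniature. The project itself uses Pydantic models and its own `EpisodeRunner`; the sketch below uses only stdlib dataclasses, and every class name, action name, and reward value here is an illustrative assumption, not the project's actual API.

```python
from dataclasses import dataclass
from enum import Enum


class ActionType(str, Enum):
    # Illustrative action vocabulary; the real environment's set may differ.
    READ_EMAIL = "read_email"
    CREATE_TODO = "create_todo"
    ARCHIVE_EMAIL = "archive_email"


@dataclass(frozen=True)
class Action:
    type: ActionType


@dataclass
class Observation:
    inbox_count: int
    todo_count: int


@dataclass
class StepResult:
    observation: Observation
    reward: float
    done: bool


class ToyAssistantEnv:
    """Deterministic toy environment mirroring a typed step contract."""

    def __init__(self, inbox_count: int = 3) -> None:
        self.inbox_count = inbox_count
        self.todo_count = 0

    def step(self, action: Action) -> StepResult:
        reward = 0.0
        if action.type is ActionType.CREATE_TODO:
            self.todo_count += 1
            reward = 1.0
        elif action.type is ActionType.ARCHIVE_EMAIL and self.inbox_count > 0:
            self.inbox_count -= 1
            reward = 0.5
        done = self.inbox_count == 0  # episode ends when the inbox is cleared
        return StepResult(Observation(self.inbox_count, self.todo_count), reward, done)


env = ToyAssistantEnv(inbox_count=1)
r1 = env.step(Action(ActionType.CREATE_TODO))
r2 = env.step(Action(ActionType.ARCHIVE_EMAIL))
print(r1.reward, r1.done)  # 1.0 False
print(r2.reward, r2.done)  # 0.5 True
```

Each step returns a typed result with a reward and a `done` flag, mirroring the step-by-step traces the Space displays.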
+ ## Seeded Judge Tasks
+
+ ### 1. Easy: Deadline Extraction
+
+ The environment injects an academic email containing multiple deadlines. The policy must read the message, create the correct todo entries, and archive the source email.
+
+ ### 2. Medium: Inbox Triage And Negotiation
+
+ The environment mixes newsletters, an urgent complaint, and a meeting reschedule request. The policy must archive low-value mail, escalate the complaint properly, and send a concrete meeting reply.
+
+ ### 3. Hard: RAG Reply
+
+ The environment includes a stakeholder email asking for exact metrics from a local report. The policy must search the file store, recover the relevant values, and draft a grounded reply using the retrieved evidence.
+
+ ## What Judges Can Inspect In This Space
+
+ - Live observation payloads
+ - Workspace tables for emails, todos, files, and action logs
+ - Step-by-step trace rows with reasoning, action type, status, score, and done state
+ - Differences between `baseline`, bundled `rl`, and optional `openrouter` policies
+
+ ## Runtime And Deployment Notes

  - SDK: `docker`
  - App port: `7860`
  - Entry point: `python app.py`
  - Optional secret: `OPENROUTER_API_KEY`
  - A trained RL checkpoint is bundled in `artifacts/checkpoints/` so the `rl` policy is available immediately in the demo.
+ - The bundled RL artifact lives at `artifacts/checkpoints/q_policy_notebook.json`
+ - The Space is deployed from the same repository used for local tests and notebook-backed experiments

+ ## Recommended Judge Flow

  1. Open the Space and choose one of the seeded scenarios.
  2. Run the deterministic `baseline` policy for a guaranteed reference trace.
  3. Switch to `rl` to replay the bundled learned checkpoint.
  4. Add `OPENROUTER_API_KEY` in Space secrets to enable the live model-backed path.
+ 5. Compare how the workspace mutates after each step instead of evaluating only the final response.
+
+ ## Implementation Notes
+
+ - The app, scripts, notebook, and tests all rely on the same `EpisodeRunner` workflow loop.
+ - Live API access stays in the policy layer, so deterministic evaluation remains possible without network access.
+ - The current RL path is intentionally lightweight and reproducible: a tabular Q-learning prototype trained over seeded action templates.
+ - The Gradio interface is designed for demonstration and debugging, not just for final-state screenshots.
+
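To make the "tabular Q-learning prototype" note concrete, here is a minimal sketch of how a bundled Q-table checkpoint could be replayed greedily. The JSON schema, state keys, and action names below are illustrative assumptions; the actual layout of `q_policy_notebook.json` may differ.

```python
import json

# Hypothetical checkpoint schema: {"state_key": {"action_name": q_value}}.
# The real artifacts/checkpoints/q_policy_notebook.json may be organized differently.
checkpoint = json.loads("""
{
  "inbox=urgent_complaint": {"archive_email": -0.2, "escalate": 0.9, "send_reply": 0.4},
  "inbox=newsletter": {"archive_email": 0.8, "escalate": -0.5, "send_reply": -0.3}
}
""")


def greedy_action(q_table: dict, state_key: str, fallback: str = "noop") -> str:
    """Replay a tabular policy: pick the argmax action for a known state."""
    q_values = q_table.get(state_key)
    if not q_values:
        return fallback  # unseen state: no learned preference, fall back
    return max(q_values, key=q_values.get)


print(greedy_action(checkpoint, "inbox=urgent_complaint"))  # escalate
print(greedy_action(checkpoint, "inbox=newsletter"))        # archive_email
print(greedy_action(checkpoint, "inbox=unknown"))           # noop
```

Because replay is a pure table lookup, the `rl` policy stays deterministic and needs no network access, which is what makes the bundled checkpoint reproducible in the demo.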
+ ## What We Want Judges To Notice
+
+ - Strong separation between environment state, policy choice, and reward logic
+ - Clear evidence of agent tool use
+ - Reproducibility across runs
+ - A hackathon-friendly deployment that still preserves engineering discipline

+ ## References And Context

  - Hack dashboard: https://www.scaler.com/openenv-hackathon
  - OpenEnv launch: https://huggingface.co/blog/openenv
+ - Space page: https://huggingface.co/spaces/Flickinshots/EmailMaestro
+ - Live app: https://Flickinshots-EmailMaestro.hf.space
docs/HF_SPACE_README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- title: Project Epsilon | Executive Assistant Sandbox
+ title: EmailMaestro | Executive Assistant Sandbox
  emoji: "🧭"
  colorFrom: yellow
  colorTo: gray
@@ -11,43 +11,112 @@ short_description: OpenEnv executive assistant sandbox demo for judges.

  # Project Epsilon

- Discrete Hugging Face Space README for the **Autonomous Executive Assistant Sandbox**, prepared for the **OpenEnv Scaler x Meta x PyTorch Hack**.

- ## Team

- - Team name: `Project Epsilon`
- - Hugging Face usernames: `@HF_USERNAME_1`, `@HF_USERNAME_2`, `@HF_USERNAME_3`
- - Space repo: `HF_USERNAME_PLACEHOLDER/project-epsilon-executive-assistant`

- Replace the placeholder usernames and repo owner when the final team accounts are ready.

- ## What This Space Shows

- - Deterministic OpenEnv-style tasks over a SQLite-backed executive assistant workspace
- - A Gradio judge console that replays the shared `EpisodeRunner` loop step by step
- - Policy switching across `baseline`, bundled `rl`, and optional `openrouter`
- - Visible inbox, todo, file-search, and action-log state transitions

- ## Hack Context

- OpenEnv was introduced by Hugging Face and Meta as an open source framework for typed agent environments. The Scaler hack dashboard lists the build window as **March 25, 2026 through April 8, 2026**, with finals on **April 25-26, 2026** in Bengaluru. This Space is tuned for that style of evaluation: deterministic tasks, structured actions, reproducible runs, and a judge-friendly visual trace.

- ## Runtime Notes

  - SDK: `docker`
  - App port: `7860`
  - Entry point: `python app.py`
  - Optional secret: `OPENROUTER_API_KEY`
  - Bundled RL checkpoint path: `artifacts/checkpoints/q_policy_notebook.json`

- ## Judge Flow

  1. Open the Space and choose one of the seeded scenarios.
- 2. Run `baseline` first for the reference trace.
- 3. Switch to `rl` to replay the trained checkpoint bundled with the Space.
- 4. Add `OPENROUTER_API_KEY` in Space secrets to enable the live model-backed policy.

- ## References

  - Hack dashboard: https://www.scaler.com/openenv-hackathon
  - OpenEnv launch: https://huggingface.co/blog/openenv
  # Project Epsilon

+ Welcome judges of the **OpenEnv Scaler x Meta x PyTorch Hack**. This Space hosts **EmailMaestro**, our deterministic executive assistant environment and policy demo built by **Team Epsilon** for repeatable agent evaluation, visible tool use, and side-by-side policy comparison.

+ ## Team Epsilon Roster

+ - Team name: `Team Epsilon`
+ - Hugging Face Space: `Flickinshots/EmailMaestro`
+ - Live app: `https://Flickinshots-EmailMaestro.hf.space`
+ - Public repository view on Hugging Face: `https://huggingface.co/spaces/Flickinshots/EmailMaestro`

+ - `@flickinshots` — Team lead and primary Space owner
+ - `@ShreyaKhatik` — Team member
+ - `@itsayushdey` — Team member

+ ## Executive Summary

+ EmailMaestro is an **Autonomous Executive Assistant Sandbox** designed around the OpenEnv pattern: typed observations, typed actions, deterministic rewards, and a visible environment loop. Instead of depending on a brittle live email provider, the agent operates inside an isolated SQLite-backed workspace that simulates an inbox, a todo manager, and a local document store. That lets judges inspect policy quality through reproducible runs rather than through one-off anecdotal chats.

+ This Space is intended to show three things clearly:

+ - The agent can operate as a structured tool user, not just a text generator.
+ - The same environment loop is shared across notebook experiments, CLI evaluation, tests, and the live Gradio UI.
+ - Baseline, learned, and model-backed policies can be compared under the same task and reward conditions.

+ ## Why This Fits The Hackathon
+
+ OpenEnv was introduced by Hugging Face and Meta as an open framework for building agent environments with typed observations, actions, and rewards. The Scaler hack timeline for this event lists the main build window as **March 25, 2026 through April 8, 2026**, with finals on **April 25-26, 2026** in Bengaluru. Our submission is shaped directly around that evaluation style:
+
+ - Deterministic environment setup
+ - Typed environment contracts
+ - Observable step-by-step policy execution
+ - Reproducible seeded tasks
+ - Judge-friendly visualization of state transitions
+
+ ## Problem Framing
+
+ Most assistant demos stop at text quality. We wanted to show that an agent can manage a workflow end to end:
+
+ - read a chaotic inbox
+ - extract structured work into a task list
+ - triage low-priority versus high-priority communication
+ - search a local knowledge source before replying
+ - produce actions that can be graded deterministically
+
+ To make that possible under hackathon constraints, we replaced live services with a controlled mock workspace that still feels operationally realistic.
+
+ ## Core Architecture
+
+ - **Environment state:** in-memory SQLite workspace simulating emails, todos, files, and action history
+ - **OpenEnv contract:** Pydantic models defining observations, actions, rewards, and policy decisions
+ - **Execution loop:** shared `EpisodeRunner` used by tests, scripts, notebook experiments, and the Gradio app
+ - **Policies:** deterministic baseline, tabular RL checkpoint replay, and optional OpenRouter-backed live policy execution
+ - **UI layer:** Gradio control room plus visible workspace snapshots for judges
+
+ ## Seeded Judge Tasks
+
+ ### 1. Easy: Deadline Extraction
+
+ The environment injects an academic email containing multiple deadlines. The policy must read the message, create the correct todo entries, and archive the source email.
+
+ ### 2. Medium: Inbox Triage And Negotiation
+
+ The environment mixes newsletters, an urgent complaint, and a meeting reschedule request. The policy must archive low-value mail, escalate the complaint properly, and send a concrete meeting reply.
+
+ ### 3. Hard: RAG Reply
+
+ The environment includes a stakeholder email asking for exact metrics from a local report. The policy must search the file store, recover the relevant values, and draft a grounded reply using the retrieved evidence.
+
+ ## What Judges Can Inspect In This Space
+
+ - Live observation payloads
+ - Workspace tables for emails, todos, files, and action logs
+ - Step-by-step trace rows with reasoning, action type, status, score, and done state
+ - Differences between `baseline`, bundled `rl`, and optional `openrouter` policies
+
+ ## Runtime And Deployment Notes

  - SDK: `docker`
  - App port: `7860`
  - Entry point: `python app.py`
  - Optional secret: `OPENROUTER_API_KEY`
  - Bundled RL checkpoint path: `artifacts/checkpoints/q_policy_notebook.json`
+ - The Space is deployed from the same repository used for local tests and notebook-backed experiments

+ ## Recommended Judge Flow

  1. Open the Space and choose one of the seeded scenarios.
+ 2. Run the deterministic `baseline` policy for a guaranteed reference trace.
+ 3. Switch to `rl` to replay the bundled learned checkpoint.
+ 4. Add `OPENROUTER_API_KEY` in Space secrets to enable the live model-backed path.
+ 5. Compare how the workspace mutates after each step instead of evaluating only the final response.
+
+ ## Implementation Notes
+
+ - The app, scripts, notebook, and tests all rely on the same `EpisodeRunner` workflow loop.
+ - Live API access stays in the policy layer, so deterministic evaluation remains possible without network access.
+ - The current RL path is intentionally lightweight and reproducible: a tabular Q-learning prototype trained over seeded action templates.
+ - The Gradio interface is designed for demonstration and debugging, not just for final-state screenshots.
+
+ ## What We Want Judges To Notice
+
+ - Strong separation between environment state, policy choice, and reward logic
+ - Clear evidence of agent tool use
+ - Reproducibility across runs
+ - A hackathon-friendly deployment that still preserves engineering discipline

+ ## References And Context

  - Hack dashboard: https://www.scaler.com/openenv-hackathon
  - OpenEnv launch: https://huggingface.co/blog/openenv
src/executive_assistant/deployment.py CHANGED
@@ -9,11 +9,11 @@ from src.executive_assistant.training import default_checkpoint_path, train_q_le


  REPO_ROOT = Path(__file__).resolve().parents[2]
- DEFAULT_SPACE_TITLE = "Project Epsilon | Executive Assistant Sandbox"
  DEFAULT_HF_USERNAMES = [
-     "HF_USERNAME_1",
-     "HF_USERNAME_2",
-     "HF_USERNAME_3",
  ]
  DEFAULT_CHECKPOINT_NAME = "q_policy_notebook.json"
  DEFAULT_STAGE_IGNORE_NAMES = {
@@ -41,7 +41,7 @@ DEFAULT_STAGE_IGNORE_FILES = {
  class HFSpaceDeployConfig:
      repo_id: str
      title: str = DEFAULT_SPACE_TITLE
-     team_name: str = "Project Epsilon"
      hf_usernames: tuple[str, ...] = tuple(DEFAULT_HF_USERNAMES)
      checkpoint_name: str = DEFAULT_CHECKPOINT_NAME
      app_port: int = 7860
@@ -77,7 +77,12 @@ def parse_hf_usernames(raw_value: str | None) -> tuple[str, ...]:


  def render_space_readme(config: HFSpaceDeployConfig) -> str:
-     usernames = ", ".join(f"`@{username}`" for username in config.hf_usernames)
      checkpoint_note = (
          "A trained RL checkpoint is bundled in `artifacts/checkpoints/` so the `rl` policy "
          "is available immediately in the demo."
@@ -98,47 +103,116 @@ short_description: OpenEnv executive assistant sandbox demo for judges.

  # {config.team_name}

- Discrete Hugging Face Space for the **Autonomous Executive Assistant Sandbox**, built for the **OpenEnv Scaler x Meta x PyTorch Hack**.

- ## Team

  - Team name: `{config.team_name}`
- - Hugging Face usernames: {usernames}
- - Space repo: `{config.repo_id}`

- Replace the placeholder usernames above once the final team accounts are ready.

- ## What This Space Shows

- - A deterministic OpenEnv-style executive assistant environment backed by an isolated SQLite workspace
- - A judge-friendly Gradio interface that replays the shared `EpisodeRunner` loop step by step
- - Side-by-side policy execution for `baseline`, `rl`, and optional `openrouter`
- - Visible inbox, todo, file-search, and action-log state so evaluators can inspect each mutation

- ## Hack Context

- OpenEnv was announced by Hugging Face and Meta as an open source framework for building agent environments with typed observations, actions, and rewards. The Scaler dashboard for this hack lists the submission round as **March 25, 2026 through April 8, 2026**, with finals on **April 25-26, 2026** in Bengaluru. This Space packages our environment to match that workflow: deterministic tasks, structured actions, visible state transitions, and reproducible judge demos.

- ## Runtime Notes

  - SDK: `docker`
  - App port: `{config.app_port}`
  - Entry point: `python app.py`
  - Optional secret: `OPENROUTER_API_KEY`
  - {checkpoint_note}

- ## Judge Flow

  1. Open the Space and choose one of the seeded scenarios.
  2. Run the deterministic `baseline` policy for a guaranteed reference trace.
  3. Switch to `rl` to replay the bundled learned checkpoint.
  4. Add `OPENROUTER_API_KEY` in Space secrets to enable the live model-backed path.

- ## References

  - Hack dashboard: https://www.scaler.com/openenv-hackathon
  - OpenEnv launch: https://huggingface.co/blog/openenv
- - Space URL: {config.space_url}
  """
  REPO_ROOT = Path(__file__).resolve().parents[2]
+ DEFAULT_SPACE_TITLE = "EmailMaestro | Executive Assistant Sandbox"
  DEFAULT_HF_USERNAMES = [
+     "flickinshots",
+     "ShreyaKhatik",
+     "itsayushdey",
  ]
  DEFAULT_CHECKPOINT_NAME = "q_policy_notebook.json"
  DEFAULT_STAGE_IGNORE_NAMES = {

  class HFSpaceDeployConfig:
      repo_id: str
      title: str = DEFAULT_SPACE_TITLE
+     team_name: str = "Team Epsilon"
      hf_usernames: tuple[str, ...] = tuple(DEFAULT_HF_USERNAMES)
      checkpoint_name: str = DEFAULT_CHECKPOINT_NAME
      app_port: int = 7860


  def render_space_readme(config: HFSpaceDeployConfig) -> str:
+     roster_lines: list[str] = []
+     if config.hf_usernames:
+         roster_lines.append(f"- `@{config.hf_usernames[0]}` — Team lead and primary Space owner")
+     for username in config.hf_usernames[1:]:
+         roster_lines.append(f"- `@{username}` — Team member")
+     roster = "\n".join(roster_lines) if roster_lines else "- Team roster to be added"
      checkpoint_note = (
          "A trained RL checkpoint is bundled in `artifacts/checkpoints/` so the `rl` policy "
          "is available immediately in the demo."
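The roster-building block added above can be exercised on its own. This standalone sketch mirrors that logic (first username becomes the lead, the rest become members, and an empty tuple falls back to a placeholder line); the usernames here are hypothetical, not the real team accounts.

```python
def build_roster(hf_usernames: tuple[str, ...]) -> str:
    """Render a markdown roster: first username is the lead, the rest are members."""
    lines: list[str] = []
    if hf_usernames:
        lines.append(f"- `@{hf_usernames[0]}` — Team lead and primary Space owner")
    for username in hf_usernames[1:]:
        lines.append(f"- `@{username}` — Team member")
    # Empty roster falls back to a visible placeholder instead of an empty section.
    return "\n".join(lines) if lines else "- Team roster to be added"


print(build_roster(("alice", "bob")))
# - `@alice` — Team lead and primary Space owner
# - `@bob` — Team member
print(build_roster(()))
# - Team roster to be added
```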

  # {config.team_name}

+ Welcome judges of the **OpenEnv Scaler x Meta x PyTorch Hack**. This Space hosts **EmailMaestro**, our deterministic executive assistant environment and policy demo built by **{config.team_name}** for repeatable agent evaluation, visible tool use, and side-by-side policy comparison.

+ ## Team Epsilon Roster

  - Team name: `{config.team_name}`
+ - Hugging Face Space: `{config.repo_id}`
+ - Live app: `{config.app_url}`
+ - Public repository view on Hugging Face: `{config.space_url}`

+ {roster}

+ ## Executive Summary

+ EmailMaestro is an **Autonomous Executive Assistant Sandbox** designed around the OpenEnv pattern: typed observations, typed actions, deterministic rewards, and a visible environment loop. Instead of depending on a brittle live email provider, the agent operates inside an isolated SQLite-backed workspace that simulates an inbox, a todo manager, and a local document store. That lets judges inspect policy quality through reproducible runs rather than through one-off anecdotal chats.

+ This Space is intended to show three things clearly:

+ - The agent can operate as a structured tool user, not just a text generator.
+ - The same environment loop is shared across notebook experiments, CLI evaluation, tests, and the live Gradio UI.
+ - Baseline, learned, and model-backed policies can be compared under the same task and reward conditions.

+ ## Why This Fits The Hackathon
+
+ OpenEnv was introduced by Hugging Face and Meta as an open framework for building agent environments with typed observations, actions, and rewards. The Scaler hack timeline for this event lists the main build window as **March 25, 2026 through April 8, 2026**, with finals on **April 25-26, 2026** in Bengaluru. Our submission is shaped directly around that evaluation style:
+
+ - Deterministic environment setup
+ - Typed environment contracts
+ - Observable step-by-step policy execution
+ - Reproducible seeded tasks
+ - Judge-friendly visualization of state transitions
+
+ ## Problem Framing
+
+ Most assistant demos stop at text quality. We wanted to show that an agent can manage a workflow end to end:
+
+ - read a chaotic inbox
+ - extract structured work into a task list
+ - triage low-priority versus high-priority communication
+ - search a local knowledge source before replying
+ - produce actions that can be graded deterministically
+
+ To make that possible under hackathon constraints, we replaced live services with a controlled mock workspace that still feels operationally realistic.
+
+ ## Core Architecture
+
+ - **Environment state:** in-memory SQLite workspace simulating emails, todos, files, and action history
+ - **OpenEnv contract:** Pydantic models defining observations, actions, rewards, and policy decisions
+ - **Execution loop:** shared `EpisodeRunner` used by tests, scripts, notebook experiments, and the Gradio app
+ - **Policies:** deterministic baseline, tabular RL checkpoint replay, and optional OpenRouter-backed live policy execution
+ - **UI layer:** Gradio control room plus visible workspace snapshots for judges
+
+ ## Seeded Judge Tasks
+
+ ### 1. Easy: Deadline Extraction
+
+ The environment injects an academic email containing multiple deadlines. The policy must read the message, create the correct todo entries, and archive the source email.
+
+ ### 2. Medium: Inbox Triage And Negotiation
+
+ The environment mixes newsletters, an urgent complaint, and a meeting reschedule request. The policy must archive low-value mail, escalate the complaint properly, and send a concrete meeting reply.
+
+ ### 3. Hard: RAG Reply
+
+ The environment includes a stakeholder email asking for exact metrics from a local report. The policy must search the file store, recover the relevant values, and draft a grounded reply using the retrieved evidence.
+
+ ## What Judges Can Inspect In This Space
+
+ - Live observation payloads
+ - Workspace tables for emails, todos, files, and action logs
+ - Step-by-step trace rows with reasoning, action type, status, score, and done state
+ - Differences between `baseline`, bundled `rl`, and optional `openrouter` policies
+
+ ## Runtime And Deployment Notes

  - SDK: `docker`
  - App port: `{config.app_port}`
  - Entry point: `python app.py`
  - Optional secret: `OPENROUTER_API_KEY`
  - {checkpoint_note}
+ - The bundled RL artifact lives at `artifacts/checkpoints/{config.checkpoint_name}`
+ - The Space is deployed from the same repository used for local tests and notebook-backed experiments

+ ## Recommended Judge Flow

  1. Open the Space and choose one of the seeded scenarios.
  2. Run the deterministic `baseline` policy for a guaranteed reference trace.
  3. Switch to `rl` to replay the bundled learned checkpoint.
  4. Add `OPENROUTER_API_KEY` in Space secrets to enable the live model-backed path.
+ 5. Compare how the workspace mutates after each step instead of evaluating only the final response.
+
+ ## Implementation Notes
+
+ - The app, scripts, notebook, and tests all rely on the same `EpisodeRunner` workflow loop.
+ - Live API access stays in the policy layer, so deterministic evaluation remains possible without network access.
+ - The current RL path is intentionally lightweight and reproducible: a tabular Q-learning prototype trained over seeded action templates.
+ - The Gradio interface is designed for demonstration and debugging, not just for final-state screenshots.
+
+ ## What We Want Judges To Notice
+
+ - Strong separation between environment state, policy choice, and reward logic
+ - Clear evidence of agent tool use
+ - Reproducibility across runs
+ - A hackathon-friendly deployment that still preserves engineering discipline

+ ## References And Context

  - Hack dashboard: https://www.scaler.com/openenv-hackathon
  - OpenEnv launch: https://huggingface.co/blog/openenv
+ - Space page: {config.space_url}
+ - Live app: {config.app_url}
  """
tests/test_deployment.py CHANGED
@@ -13,14 +13,16 @@ def test_parse_hf_usernames_strips_at_signs() -> None:
      assert usernames == ("alice", "bob", "carol")


- def test_render_space_readme_includes_project_epsilon_placeholders() -> None:
+ def test_render_space_readme_includes_team_epsilon_roster() -> None:
      config = HFSpaceDeployConfig(
-         repo_id="placeholder/project-epsilon-executive-assistant",
-         hf_usernames=("HF_USERNAME_1", "HF_USERNAME_2"),
+         repo_id="Flickinshots/EmailMaestro",
+         team_name="Team Epsilon",
+         hf_usernames=("flickinshots", "ShreyaKhatik", "itsayushdey"),
      )
      rendered = render_space_readme(config)
-     assert "Project Epsilon" in rendered
-     assert "@HF_USERNAME_1" in rendered
+     assert "Team Epsilon" in rendered
+     assert "@flickinshots" in rendered
+     assert "Team lead and primary Space owner" in rendered
      assert "sdk: docker" in rendered
      assert "OpenEnv Scaler x Meta x PyTorch Hack" in rendered