Commit: Deploy Project Epsilon Space bundle

Files changed:

- .gitignore (+1 −0)
- README.md (+87 −16)
- docs/HF_SPACE_README.md (+89 −20)
- src/executive_assistant/deployment.py (+96 −22)
- tests/test_deployment.py (+7 −5)
.gitignore (CHANGED)

```diff
@@ -5,4 +5,5 @@ artifacts/
 __pycache__/
 .env
 .env.app
+.env.hf.space
 .env.training
```
README.md (CHANGED)

```diff
@@ -11,44 +11,115 @@ short_description: OpenEnv executive assistant sandbox demo for judges.
 
 # Project Epsilon
 
-
+Welcome judges of the **OpenEnv Scaler x Meta x PyTorch Hack**. This Space hosts **EmailMaestro**, our deterministic executive assistant environment and policy demo built by **Project Epsilon** for repeatable agent evaluation, visible tool use, and side-by-side policy comparison.
 
-## Team
+## Team Epsilon Roster
 
 - Team name: `Project Epsilon`
-- Hugging Face
-
+- Hugging Face Space: `Flickinshots/EmailMaestro`
+- Live app: `https://Flickinshots-EmailMaestro.hf.space`
+- Public repository view on Hugging Face: `https://huggingface.co/spaces/Flickinshots/EmailMaestro`
 
-
+- `@Flickinshots` — Team lead and primary Space owner
+- `@HF_USERNAME_2` — Team member
+- `@HF_USERNAME_3` — Team member
 
-##
+## Executive Summary
 
-
-- A judge-friendly Gradio interface that replays the shared `EpisodeRunner` loop step by step
-- Side-by-side policy execution for `baseline`, `rl`, and optional `openrouter`
-- Visible inbox, todo, file-search, and action-log state so evaluators can inspect each mutation
+EmailMaestro is an **Autonomous Executive Assistant Sandbox** designed around the OpenEnv pattern: typed observations, typed actions, deterministic rewards, and a visible environment loop. Instead of depending on a brittle live email provider, the agent operates inside an isolated SQLite-backed workspace that simulates an inbox, a todo manager, and a local document store. That lets judges inspect policy quality through reproducible runs rather than through one-off anecdotal chats.
+
+This Space is intended to show three things clearly:
+
+- The agent can operate as a structured tool user, not just a text generator.
+- The same environment loop is shared across notebook experiments, CLI evaluation, tests, and the live Gradio UI.
+- Baseline, learned, and model-backed policies can be compared under the same task and reward conditions.
 
-
+## Why This Fits The Hackathon
+
+OpenEnv was introduced by Hugging Face and Meta as an open framework for building agent environments with typed observations, actions, and rewards. The Scaler hack timeline for this event lists the main build window as **March 25, 2026 through April 8, 2026**, with finals on **April 25-26, 2026** in Bengaluru. Our submission is shaped directly around that evaluation style:
+
+- Deterministic environment setup
+- Typed environment contracts
+- Observable step-by-step policy execution
+- Reproducible seeded tasks
+- Judge-friendly visualization of state transitions
 
-
+## Problem Framing
+
+Most assistant demos stop at text quality. We wanted to show that an agent can manage a workflow end to end:
+
+- read a chaotic inbox
+- extract structured work into a task list
+- triage low-priority versus high-priority communication
+- search a local knowledge source before replying
+- produce actions that can be graded deterministically
+
+To make that possible under hackathon constraints, we replaced live services with a controlled mock workspace that still feels operationally realistic.
 
-##
+## Core Architecture
+
+- **Environment state:** in-memory SQLite workspace simulating emails, todos, files, and action history
+- **OpenEnv contract:** Pydantic models defining observations, actions, rewards, and policy decisions
+- **Execution loop:** shared `EpisodeRunner` used by tests, scripts, notebook experiments, and the Gradio app
+- **Policies:** deterministic baseline, tabular RL checkpoint replay, and optional OpenRouter-backed live policy execution
+- **UI layer:** Gradio control room plus visible workspace snapshots for judges
+
+## Seeded Judge Tasks
+
+### 1. Easy: Deadline Extraction
+
+The environment injects an academic email containing multiple deadlines. The policy must read the message, create the correct todo entries, and archive the source email.
+
+### 2. Medium: Inbox Triage And Negotiation
+
+The environment mixes newsletters, an urgent complaint, and a meeting reschedule request. The policy must archive low-value mail, escalate the complaint properly, and send a concrete meeting reply.
+
+### 3. Hard: RAG Reply
+
+The environment includes a stakeholder email asking for exact metrics from a local report. The policy must search the file store, recover the relevant values, and draft a grounded reply using the retrieved evidence.
+
+## What Judges Can Inspect In This Space
+
+- Live observation payloads
+- Workspace tables for emails, todos, files, and action logs
+- Step-by-step trace rows with reasoning, action type, status, score, and done state
+- Differences between `baseline`, bundled `rl`, and optional `openrouter` policies
+
+## Runtime And Deployment Notes
 
 - SDK: `docker`
 - App port: `7860`
 - Entry point: `python app.py`
 - Optional secret: `OPENROUTER_API_KEY`
 - A trained RL checkpoint is bundled in `artifacts/checkpoints/` so the `rl` policy is available immediately in the demo.
+- The bundled RL artifact lives at `artifacts/checkpoints/q_policy_notebook.json`
+- The Space is deployed from the same repository used for local tests and notebook-backed experiments
 
-## Judge Flow
+## Recommended Judge Flow
 
 1. Open the Space and choose one of the seeded scenarios.
 2. Run the deterministic `baseline` policy for a guaranteed reference trace.
 3. Switch to `rl` to replay the bundled learned checkpoint.
 4. Add `OPENROUTER_API_KEY` in Space secrets to enable the live model-backed path.
+5. Compare how the workspace mutates after each step instead of evaluating only the final response.
+
+## Implementation Notes
+
+- The app, scripts, notebook, and tests all rely on the same `EpisodeRunner` workflow loop.
+- Live API access stays in the policy layer, so deterministic evaluation remains possible without network access.
+- The current RL path is intentionally lightweight and reproducible: a tabular Q-learning prototype trained over seeded action templates.
+- The Gradio interface is designed for demonstration and debugging, not just for final-state screenshots.
+
+## What We Want Judges To Notice
+
+- Strong separation between environment state, policy choice, and reward logic
+- Clear evidence of agent tool use
+- Reproducibility across runs
+- A hackathon-friendly deployment that still preserves engineering discipline
 
-## References
+## References And Context
 
 - Hack dashboard: https://www.scaler.com/openenv-hackathon
 - OpenEnv launch: https://huggingface.co/blog/openenv
-- Space
+- Space page: https://huggingface.co/spaces/Flickinshots/EmailMaestro
+- Live app: https://Flickinshots-EmailMaestro.hf.space
```
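The README above leans on a shared `EpisodeRunner` observe/act/step loop with typed observations and deterministic rewards. As a hedged illustration of that pattern only (none of these class, field, or function names are taken from the actual repository), a deterministic toy version of such a loop might look like:

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """Typed observation, in the spirit of the OpenEnv pattern."""
    inbox_count: int
    done: bool = False

@dataclass
class StepResult:
    observation: Observation
    reward: float
    done: bool

class ToyInboxEnv:
    """Deterministic toy environment: archive mail until the inbox is empty."""
    def __init__(self, emails: int = 3) -> None:
        self.remaining = emails

    def reset(self) -> Observation:
        return Observation(inbox_count=self.remaining)

    def step(self, action: str) -> StepResult:
        # Reward only the useful action; everything else scores zero.
        if action == "archive" and self.remaining > 0:
            self.remaining -= 1
            reward = 1.0
        else:
            reward = 0.0
        done = self.remaining == 0
        return StepResult(Observation(self.remaining, done), reward, done)

def run_episode(env, policy, max_steps: int = 10):
    """Minimal episode loop: observe, act, step, record, stop on done."""
    obs = env.reset()
    trace, total = [], 0.0
    for _ in range(max_steps):
        action = policy(obs)
        result = env.step(action)
        trace.append((action, result.reward))
        total += result.reward
        obs = result.observation
        if result.done:
            break
    return trace, total

trace, total = run_episode(ToyInboxEnv(3), lambda obs: "archive")
# total == 3.0 after three rewarded archive steps
```

Because the environment is deterministic and seeded, replaying the same policy yields an identical trace, which is the reproducibility property the Space advertises.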
docs/HF_SPACE_README.md (CHANGED)

```diff
@@ -1,5 +1,5 @@
 ---
-title:
+title: EmailMaestro | Executive Assistant Sandbox
 emoji: "🧭"
 colorFrom: yellow
 colorTo: gray
@@ -11,43 +11,112 @@ short_description: OpenEnv executive assistant sandbox demo for judges.
 
 # Project Epsilon
 
-
+Welcome judges of the **OpenEnv Scaler x Meta x PyTorch Hack**. This Space hosts **EmailMaestro**, our deterministic executive assistant environment and policy demo built by **Team Epsilon** for repeatable agent evaluation, visible tool use, and side-by-side policy comparison.
 
-## Team
+## Team Epsilon Roster
 
-- Team name: `
-- Hugging Face
-
+- Team name: `Team Epsilon`
+- Hugging Face Space: `Flickinshots/EmailMaestro`
+- Live app: `https://Flickinshots-EmailMaestro.hf.space`
+- Public repository view on Hugging Face: `https://huggingface.co/spaces/Flickinshots/EmailMaestro`
 
-
+- `@flickinshots` — Team lead and primary Space owner
+- `@ShreyaKhatik` — Team member
+- `@itsayushdey` — Team member
 
-##
+## Executive Summary
 
-
-- A Gradio judge console that replays the shared `EpisodeRunner` loop step by step
-- Policy switching across `baseline`, bundled `rl`, and optional `openrouter`
-- Visible inbox, todo, file-search, and action-log state transitions
+EmailMaestro is an **Autonomous Executive Assistant Sandbox** designed around the OpenEnv pattern: typed observations, typed actions, deterministic rewards, and a visible environment loop. Instead of depending on a brittle live email provider, the agent operates inside an isolated SQLite-backed workspace that simulates an inbox, a todo manager, and a local document store. That lets judges inspect policy quality through reproducible runs rather than through one-off anecdotal chats.
+
+This Space is intended to show three things clearly:
+
+- The agent can operate as a structured tool user, not just a text generator.
+- The same environment loop is shared across notebook experiments, CLI evaluation, tests, and the live Gradio UI.
+- Baseline, learned, and model-backed policies can be compared under the same task and reward conditions.
 
-
+## Why This Fits The Hackathon
+
+OpenEnv was introduced by Hugging Face and Meta as an open framework for building agent environments with typed observations, actions, and rewards. The Scaler hack timeline for this event lists the main build window as **March 25, 2026 through April 8, 2026**, with finals on **April 25-26, 2026** in Bengaluru. Our submission is shaped directly around that evaluation style:
+
+- Deterministic environment setup
+- Typed environment contracts
+- Observable step-by-step policy execution
+- Reproducible seeded tasks
+- Judge-friendly visualization of state transitions
 
-
+## Problem Framing
+
+Most assistant demos stop at text quality. We wanted to show that an agent can manage a workflow end to end:
+
+- read a chaotic inbox
+- extract structured work into a task list
+- triage low-priority versus high-priority communication
+- search a local knowledge source before replying
+- produce actions that can be graded deterministically
+
+To make that possible under hackathon constraints, we replaced live services with a controlled mock workspace that still feels operationally realistic.
 
-##
+## Core Architecture
+
+- **Environment state:** in-memory SQLite workspace simulating emails, todos, files, and action history
+- **OpenEnv contract:** Pydantic models defining observations, actions, rewards, and policy decisions
+- **Execution loop:** shared `EpisodeRunner` used by tests, scripts, notebook experiments, and the Gradio app
+- **Policies:** deterministic baseline, tabular RL checkpoint replay, and optional OpenRouter-backed live policy execution
+- **UI layer:** Gradio control room plus visible workspace snapshots for judges
+
+## Seeded Judge Tasks
+
+### 1. Easy: Deadline Extraction
+
+The environment injects an academic email containing multiple deadlines. The policy must read the message, create the correct todo entries, and archive the source email.
+
+### 2. Medium: Inbox Triage And Negotiation
+
+The environment mixes newsletters, an urgent complaint, and a meeting reschedule request. The policy must archive low-value mail, escalate the complaint properly, and send a concrete meeting reply.
+
+### 3. Hard: RAG Reply
+
+The environment includes a stakeholder email asking for exact metrics from a local report. The policy must search the file store, recover the relevant values, and draft a grounded reply using the retrieved evidence.
+
+## What Judges Can Inspect In This Space
+
+- Live observation payloads
+- Workspace tables for emails, todos, files, and action logs
+- Step-by-step trace rows with reasoning, action type, status, score, and done state
+- Differences between `baseline`, bundled `rl`, and optional `openrouter` policies
+
+## Runtime And Deployment Notes
 
 - SDK: `docker`
 - App port: `7860`
 - Entry point: `python app.py`
 - Optional secret: `OPENROUTER_API_KEY`
 - Bundled RL checkpoint path: `artifacts/checkpoints/q_policy_notebook.json`
+- The Space is deployed from the same repository used for local tests and notebook-backed experiments
 
-## Judge Flow
+## Recommended Judge Flow
 
 1. Open the Space and choose one of the seeded scenarios.
-2. Run `baseline`
-3. Switch to `rl` to replay the
-4. Add `OPENROUTER_API_KEY` in Space secrets to enable the live model-backed
+2. Run the deterministic `baseline` policy for a guaranteed reference trace.
+3. Switch to `rl` to replay the bundled learned checkpoint.
+4. Add `OPENROUTER_API_KEY` in Space secrets to enable the live model-backed path.
+5. Compare how the workspace mutates after each step instead of evaluating only the final response.
+
+## Implementation Notes
+
+- The app, scripts, notebook, and tests all rely on the same `EpisodeRunner` workflow loop.
+- Live API access stays in the policy layer, so deterministic evaluation remains possible without network access.
+- The current RL path is intentionally lightweight and reproducible: a tabular Q-learning prototype trained over seeded action templates.
+- The Gradio interface is designed for demonstration and debugging, not just for final-state screenshots.
+
+## What We Want Judges To Notice
+
+- Strong separation between environment state, policy choice, and reward logic
+- Clear evidence of agent tool use
+- Reproducibility across runs
+- A hackathon-friendly deployment that still preserves engineering discipline
 
-## References
+## References And Context
 
 - Hack dashboard: https://www.scaler.com/openenv-hackathon
 - OpenEnv launch: https://huggingface.co/blog/openenv
```
src/executive_assistant/deployment.py (CHANGED)

```diff
@@ -9,11 +9,11 @@ from src.executive_assistant.training import default_checkpoint_path, train_q_le
 
 
 REPO_ROOT = Path(__file__).resolve().parents[2]
-DEFAULT_SPACE_TITLE = "
+DEFAULT_SPACE_TITLE = "EmailMaestro | Executive Assistant Sandbox"
 DEFAULT_HF_USERNAMES = [
-    "
-    "
-    "
+    "flickinshots",
+    "ShreyaKhatik",
+    "itsayushdey",
 ]
 DEFAULT_CHECKPOINT_NAME = "q_policy_notebook.json"
 DEFAULT_STAGE_IGNORE_NAMES = {
@@ -41,7 +41,7 @@ DEFAULT_STAGE_IGNORE_FILES = {
 class HFSpaceDeployConfig:
     repo_id: str
     title: str = DEFAULT_SPACE_TITLE
-    team_name: str = "
+    team_name: str = "Team Epsilon"
     hf_usernames: tuple[str, ...] = tuple(DEFAULT_HF_USERNAMES)
     checkpoint_name: str = DEFAULT_CHECKPOINT_NAME
     app_port: int = 7860
@@ -77,7 +77,12 @@ def parse_hf_usernames(raw_value: str | None) -> tuple[str, ...]:
 
 
 def render_space_readme(config: HFSpaceDeployConfig) -> str:
-
+    roster_lines: list[str] = []
+    if config.hf_usernames:
+        roster_lines.append(f"- `@{config.hf_usernames[0]}` — Team lead and primary Space owner")
+        for username in config.hf_usernames[1:]:
+            roster_lines.append(f"- `@{username}` — Team member")
+    roster = "\n".join(roster_lines) if roster_lines else "- Team roster to be added"
     checkpoint_note = (
         "A trained RL checkpoint is bundled in `artifacts/checkpoints/` so the `rl` policy "
         "is available immediately in the demo."
@@ -98,47 +103,116 @@ short_description: OpenEnv executive assistant sandbox demo for judges.
 
 # {config.team_name}
 
-
+Welcome judges of the **OpenEnv Scaler x Meta x PyTorch Hack**. This Space hosts **EmailMaestro**, our deterministic executive assistant environment and policy demo built by **{config.team_name}** for repeatable agent evaluation, visible tool use, and side-by-side policy comparison.
 
-## Team
+## Team Epsilon Roster
 
 - Team name: `{config.team_name}`
-- Hugging Face
-
+- Hugging Face Space: `{config.repo_id}`
+- Live app: `{config.app_url}`
+- Public repository view on Hugging Face: `{config.space_url}`
 
-
+{roster}
 
-##
+## Executive Summary
 
-
-- A judge-friendly Gradio interface that replays the shared `EpisodeRunner` loop step by step
-- Side-by-side policy execution for `baseline`, `rl`, and optional `openrouter`
-- Visible inbox, todo, file-search, and action-log state so evaluators can inspect each mutation
+EmailMaestro is an **Autonomous Executive Assistant Sandbox** designed around the OpenEnv pattern: typed observations, typed actions, deterministic rewards, and a visible environment loop. Instead of depending on a brittle live email provider, the agent operates inside an isolated SQLite-backed workspace that simulates an inbox, a todo manager, and a local document store. That lets judges inspect policy quality through reproducible runs rather than through one-off anecdotal chats.
+
+This Space is intended to show three things clearly:
+
+- The agent can operate as a structured tool user, not just a text generator.
+- The same environment loop is shared across notebook experiments, CLI evaluation, tests, and the live Gradio UI.
+- Baseline, learned, and model-backed policies can be compared under the same task and reward conditions.
 
-
+## Why This Fits The Hackathon
+
+OpenEnv was introduced by Hugging Face and Meta as an open framework for building agent environments with typed observations, actions, and rewards. The Scaler hack timeline for this event lists the main build window as **March 25, 2026 through April 8, 2026**, with finals on **April 25-26, 2026** in Bengaluru. Our submission is shaped directly around that evaluation style:
+
+- Deterministic environment setup
+- Typed environment contracts
+- Observable step-by-step policy execution
+- Reproducible seeded tasks
+- Judge-friendly visualization of state transitions
 
-
+## Problem Framing
+
+Most assistant demos stop at text quality. We wanted to show that an agent can manage a workflow end to end:
+
+- read a chaotic inbox
+- extract structured work into a task list
+- triage low-priority versus high-priority communication
+- search a local knowledge source before replying
+- produce actions that can be graded deterministically
+
+To make that possible under hackathon constraints, we replaced live services with a controlled mock workspace that still feels operationally realistic.
 
-##
+## Core Architecture
+
+- **Environment state:** in-memory SQLite workspace simulating emails, todos, files, and action history
+- **OpenEnv contract:** Pydantic models defining observations, actions, rewards, and policy decisions
+- **Execution loop:** shared `EpisodeRunner` used by tests, scripts, notebook experiments, and the Gradio app
+- **Policies:** deterministic baseline, tabular RL checkpoint replay, and optional OpenRouter-backed live policy execution
+- **UI layer:** Gradio control room plus visible workspace snapshots for judges
+
+## Seeded Judge Tasks
+
+### 1. Easy: Deadline Extraction
+
+The environment injects an academic email containing multiple deadlines. The policy must read the message, create the correct todo entries, and archive the source email.
+
+### 2. Medium: Inbox Triage And Negotiation
+
+The environment mixes newsletters, an urgent complaint, and a meeting reschedule request. The policy must archive low-value mail, escalate the complaint properly, and send a concrete meeting reply.
+
+### 3. Hard: RAG Reply
+
+The environment includes a stakeholder email asking for exact metrics from a local report. The policy must search the file store, recover the relevant values, and draft a grounded reply using the retrieved evidence.
+
+## What Judges Can Inspect In This Space
+
+- Live observation payloads
+- Workspace tables for emails, todos, files, and action logs
+- Step-by-step trace rows with reasoning, action type, status, score, and done state
+- Differences between `baseline`, bundled `rl`, and optional `openrouter` policies
+
+## Runtime And Deployment Notes
 
 - SDK: `docker`
 - App port: `{config.app_port}`
 - Entry point: `python app.py`
 - Optional secret: `OPENROUTER_API_KEY`
 - {checkpoint_note}
+- The bundled RL artifact lives at `artifacts/checkpoints/{config.checkpoint_name}`
+- The Space is deployed from the same repository used for local tests and notebook-backed experiments
 
-## Judge Flow
+## Recommended Judge Flow
 
 1. Open the Space and choose one of the seeded scenarios.
 2. Run the deterministic `baseline` policy for a guaranteed reference trace.
 3. Switch to `rl` to replay the bundled learned checkpoint.
 4. Add `OPENROUTER_API_KEY` in Space secrets to enable the live model-backed path.
+5. Compare how the workspace mutates after each step instead of evaluating only the final response.
+
+## Implementation Notes
+
+- The app, scripts, notebook, and tests all rely on the same `EpisodeRunner` workflow loop.
+- Live API access stays in the policy layer, so deterministic evaluation remains possible without network access.
+- The current RL path is intentionally lightweight and reproducible: a tabular Q-learning prototype trained over seeded action templates.
+- The Gradio interface is designed for demonstration and debugging, not just for final-state screenshots.
+
+## What We Want Judges To Notice
+
+- Strong separation between environment state, policy choice, and reward logic
+- Clear evidence of agent tool use
+- Reproducibility across runs
+- A hackathon-friendly deployment that still preserves engineering discipline
 
-## References
+## References And Context
 
 - Hack dashboard: https://www.scaler.com/openenv-hackathon
 - OpenEnv launch: https://huggingface.co/blog/openenv
-- Space
+- Space page: {config.space_url}
+- Live app: {config.app_url}
 """
```
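The roster logic added to `render_space_readme` can be exercised on its own. The sketch below reproduces just that branch as a standalone function, with the `HFSpaceDeployConfig` plumbing omitted; it assumes only the behavior visible in the diff (first username becomes the team lead, the rest become members, empty input yields a placeholder):

```python
def render_roster(hf_usernames: tuple[str, ...]) -> str:
    """Standalone mirror of the roster block in render_space_readme."""
    roster_lines: list[str] = []
    if hf_usernames:
        # The first username is treated as the team lead and Space owner.
        roster_lines.append(f"- `@{hf_usernames[0]}` — Team lead and primary Space owner")
        for username in hf_usernames[1:]:
            roster_lines.append(f"- `@{username}` — Team member")
    return "\n".join(roster_lines) if roster_lines else "- Team roster to be added"

print(render_roster(("flickinshots", "ShreyaKhatik", "itsayushdey")))
```

Keeping this as pure string assembly (no I/O, no config lookup) is what makes the rendered README assertable in `tests/test_deployment.py` below.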
tests/test_deployment.py (CHANGED)

```diff
@@ -13,14 +13,16 @@ def test_parse_hf_usernames_strips_at_signs() -> None:
     assert usernames == ("alice", "bob", "carol")
 
 
-def 
+def test_render_space_readme_includes_team_epsilon_roster() -> None:
     config = HFSpaceDeployConfig(
-        repo_id="
-
+        repo_id="Flickinshots/EmailMaestro",
+        team_name="Team Epsilon",
+        hf_usernames=("flickinshots", "ShreyaKhatik", "itsayushdey"),
     )
     rendered = render_space_readme(config)
-    assert "
-    assert "@
+    assert "Team Epsilon" in rendered
+    assert "@flickinshots" in rendered
+    assert "Team lead and primary Space owner" in rendered
     assert "sdk: docker" in rendered
     assert "OpenEnv Scaler x Meta x PyTorch Hack" in rendered
 
```