# Product Requirements Document (PRD): Autonomous Executive Assistant Sandbox

**Target Deployment:** Hugging Face Spaces (Gradio UI + OpenEnv Container)
**Primary Dev Environment:** Kaggle / Jupyter Notebooks (`training_env.ipynb`)

---

## Progress Note

Status as of 2026-04-08:

- The deterministic SQLite-backed workspace is implemented with action logging, seeded scenarios, snapshots, and richer step semantics.
- The OpenEnv contract is represented in typed Pydantic models for observations, actions, rewards, and policy decisions.
- Deterministic graders are implemented for all three seeded tasks, with dense reward shaping and terminal success checks.
- A shared `EpisodeRunner` now owns the agent workflow loop across scripts, tests, the notebook, and Gradio.
- A deterministic baseline policy is implemented and solves all three seeded tasks end to end.
- An OpenRouter-backed `google/gemma-4-31b-it` policy path is integrated, prompt-hardened, and validated on the hard task.
- Separate app and training environments are in place, including a registered `scalerhack2-training` Jupyter kernel.
- The training notebook loads `.env.training`, exports traces, runs RL training, and saves checkpoints.
- A tabular Q-learning policy exists as a seeded-task RL prototype and can be trained, evaluated, and checkpointed.
- The current Gradio app can reset scenarios and run full episodes for the baseline and OpenRouter policies.

Resume from here:

- Make the trained RL checkpoint a first-class runtime policy in the app and scripts.
- Refine the Gradio UI from one-shot episode execution into a stepwise or streaming judge-facing experience.
- Ensure the app, notebook, and scripts can all use the same trained RL artifact without drift.
- Expand the notebook analysis cells and runtime metrics for stronger model-vs-baseline-vs-RL comparisons.
- Keep the current tabular RL policy as a prototype while leaving room for a richer learned policy after hackathon delivery.

---

## 1. Executive Summary

We are building a deterministic, isolated OpenEnv simulation of a corporate or academic workflow. Instead of wrapping a brittle, live API like Gmail (which invites rate limits and non-deterministic grading), we will engineer an **in-memory SQLite Mock Mail Server & Local File System**.

The AI agent will act as an Autonomous Executive Assistant. It must navigate a chaotic mock inbox, extract deadlines into a mock task manager, negotiate meeting times, and perform Retrieval-Augmented Generation (RAG) over a mock file system to draft intelligent replies.

This environment proves the agent's ability to act as a *router* and a *tool-user*, moving beyond text generation into full workflow automation.

---

## 2. Core Architecture & Stack

* **State Management:** In-memory SQLite (`sqlite3`) simulating a mail server, calendar, and file system.
* **Typing & Validation:** `pydantic` (strictly defining Observations, Actions, and Rewards per the OpenEnv spec).
* **Development & Debugging:** Jupyter Notebooks plus scriptable runners. The state machine, model prompts, rollout export, and RL smoke training are exercised from `training_env.ipynb` and mirrored by CLI scripts.
* **Model Runtime:** OpenRouter using `google/gemma-4-31b-it` for live policy inference, with prompt/schema hardening and response repair.
* **RL Prototype:** Tabular Q-learning over a finite action template catalog, with teacher warm-start from the deterministic baseline and JSON checkpoint persistence.
* **Deployment & Visualization:** Gradio (to visualize the inbox state for judges), packaged within a Docker container on Hugging Face Spaces.

---

## 3. Step-by-Step Implementation Plan

### Phase 1: The Mock Server Setup (Notebook Environment)

**Goal:** Build the deterministic world the agent will live in. Do this entirely in the first few cells of your Kaggle notebook so you can instantly query and reset the state.
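A minimal sketch of the deterministic workspace this phase describes. The table columns and method names come from this PRD; the constructor shape, `row_factory` choice, and everything not listed in the spec (seeding, snapshots, action logging) are assumptions, shown here only to make the setup concrete.

```python
import sqlite3

class MockWorkspace:
    """Sketch of the in-memory mock mail server, task manager, and file store."""

    def __init__(self):
        # ':memory:' keeps the whole world inside the notebook process,
        # so resetting the environment is just re-running this constructor.
        self.db = sqlite3.connect(":memory:")
        self.db.row_factory = sqlite3.Row
        self.db.executescript("""
            CREATE TABLE Emails (
                id INTEGER PRIMARY KEY, sender TEXT, recipient TEXT,
                subject TEXT, body TEXT, timestamp TEXT,
                is_read INTEGER DEFAULT 0, is_archived INTEGER DEFAULT 0);
            CREATE TABLE Todos (
                id INTEGER PRIMARY KEY, task_name TEXT,
                deadline_date TEXT, context TEXT);
            CREATE TABLE Files (
                id INTEGER PRIMARY KEY, filename TEXT, content_text TEXT);
        """)

    def get_unread_emails(self):
        rows = self.db.execute(
            "SELECT id, sender, subject FROM Emails "
            "WHERE is_read = 0 AND is_archived = 0").fetchall()
        return [dict(r) for r in rows]

    def create_todo(self, task, date):
        self.db.execute(
            "INSERT INTO Todos (task_name, deadline_date) VALUES (?, ?)",
            (task, date))
        self.db.commit()

    def search_documents(self, query):
        # Naive substring search over file contents; the real RAG path
        # could rank results, but the grader only needs determinism.
        rows = self.db.execute(
            "SELECT id, filename FROM Files WHERE content_text LIKE ?",
            (f"%{query}%",)).fetchall()
        return [dict(r) for r in rows]
```

Because every mutation goes through these methods, the graders can stay pure SQL: they never need to inspect agent internals, only the table state left behind.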
1. **Database Initialization:** Create an in-memory SQLite database (`sqlite3.connect(':memory:')`).
2. **Table Creation:**
   * `Emails` (id, sender, recipient, subject, body, timestamp, is_read, is_archived)
   * `Todos` (id, task_name, deadline_date, context)
   * `Files` (id, filename, content_text) - *This acts as the local knowledge base.*
3. **The Wrapper Class (`MockWorkspace`):** Write Python methods to interact with this DB safely.
   * `get_unread_emails()`
   * `send_reply(email_id, text)`
   * `create_todo(task, date)`
   * `search_documents(query)`

### Phase 2: OpenEnv Specifications (Pydantic Models)

**Goal:** Define the strict APIs the agent must use. This is the core of the hackathon requirement.

**Observation Space:**

```python
from typing import Dict, List, Literal, Optional

from pydantic import BaseModel

class WorkspaceObservation(BaseModel):
    current_time: str
    unread_emails: List[Dict[str, str]]  # ID, Sender, Subject snippet
    active_todos: List[str]
    last_action_status: str  # e.g., "Email successfully sent to Manager"
```

**Action Space:**

```python
class AssistantAction(BaseModel):
    action_type: Literal["read_email", "reply", "forward", "add_todo", "archive", "search_files"]
    target_id: Optional[str] = None          # email_id or file_id
    payload: Optional[str] = None            # The body of the reply, or the search query
    secondary_payload: Optional[str] = None  # Date for todos, or recipient for forwards
```

**Reward Space:**

```python
class TaskReward(BaseModel):
    step_reward: float
    total_score: float
    is_done: bool
    reasoning: str
```

### Phase 3: Task Definitions & Deterministic Graders

Implement the three required difficulty tiers. The grader simply runs SQL queries against your mock database to verify the agent's actions.

#### Task 1: Easy (Syllabus & Deadline Extraction)

* **Initial State:** DB injected with an email from `prof.smith@university.edu` containing 3 specific project deadlines.
* **Agent Goal:** Read the email, create 3 corresponding tasks in the `Todos` table, and archive the email.
* **Grader Logic:** `SELECT COUNT(*) FROM Todos WHERE deadline_date IS NOT NULL;` -> If 3, return `+1.0`.

#### Task 2: Medium (Triage & Meeting Negotiation)

* **Initial State:** DB injected with 5 emails: 3 newsletters, 1 urgent client complaint, 1 team meeting reschedule request.
* **Agent Goal:** Archive the newsletters, forward the client complaint to `manager@company.com`, and reply to the reschedule request proposing a time.
* **Grader Logic:** Check if the newsletters are marked `is_archived=True` (+0.3). Check if the complaint is in the DB as sent to the manager (+0.4). Check if the reply contains a valid time string (+0.3).

#### Task 3: Hard (Autonomous RAG & Drafting)

* **Initial State:** DB injected with an email from a VIP stakeholder asking for specific metrics from the "Q3 Architecture Report".
* **Agent Goal:** Use `action_type: "search_files"` with query "Q3 Architecture", read the file contents, and use `action_type: "reply"` synthesizing the exact metrics from the file into a professional response.
* **Grader Logic:** Check if `search_files` was called (+0.3). Use regex to verify the specific metric string from the mock file exists in the sent reply body (+0.7).

### Phase 4: Baseline Agent Testing (Notebook Environment)

**Goal:** Prove the environment works using both a deterministic policy and a live model-backed policy.

1. Use the deterministic `BaselineAgent` to verify the seeded tasks and grader behavior.
2. Use a standard `while not done:` loop, now centralized in `EpisodeRunner`.
3. Pass the `WorkspaceObservation` to the live model policy through OpenRouter using strict JSON outputs.
4. Pass the model action into the environment's `step()` function.
5. Print and export the interaction loop directly in the notebook to debug prompt formatting, policy behavior, and reward shaping.

#### Agent Workflow Loop

1. Load environment state
2. Generate observation
3. Send to LLM
4. Receive structured action
5. Execute action in workspace
6. Update state
7. Repeat until task complete

Implementation note: this loop is now represented directly in the shared `EpisodeRunner` so the notebook, scripts, tests, and Gradio app all execute the same control flow.

### Phase 5: Hugging Face Spaces & Gradio Deployment

**Goal:** Package the OpenEnv logic and build a visual interface so judges can physically see the agent working, including deterministic, model-backed, and learned-policy runs.

1. **The Gradio Wrapper (`app.py`):**
   * Build a Gradio UI that exposes selectable policies (`baseline`, `openrouter`, and trained `rl`) and visually represents the `Emails`, `Todos`, `Files`, and action history tables.
   * As the OpenEnv `step()` function runs, update the Gradio state step by step so judges can watch the inbox drain, the to-do list populate, and the replies send in real time.
   * Ensure the app can load the same trained RL checkpoint artifact produced by the notebook and CLI training scripts.
2. **Containerization (`Dockerfile`):**

   ```dockerfile
   FROM python:3.11-slim
   WORKDIR /app
   COPY requirements.app.txt .
   RUN pip install --no-cache-dir -r requirements.app.txt
   COPY . .
   # OpenEnv requires specific metadata handling; Gradio runs on 7860
   EXPOSE 7860
   ENV GRADIO_SERVER_NAME="0.0.0.0"
   CMD ["python", "app.py"]
   ```

3. **OpenEnv Spec Compliance:** Ensure the `openenv.yaml` at the root of the repository is correctly mapped to your Pydantic classes.
4. **Push to HF:** Commit the repo to a Hugging Face Space, tag it with `openenv`, and ensure the policy runners and training steps are easily reproducible from the README.
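The seven-step Agent Workflow Loop in Phase 4 reduces to a compact control loop. The sketch below shows the shape such a shared runner could take; the real `EpisodeRunner`'s signature and the environment's return type are assumptions beyond the `step()`/observation contract named in this PRD, and `ToyEnv` is a hypothetical stand-in used only to make the sketch runnable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class StepResult:
    """Hypothetical minimal step payload; the project uses the Phase 2
    Pydantic models (WorkspaceObservation / TaskReward) instead."""
    observation: dict
    reward: float
    done: bool

class EpisodeRunner:
    """Sketch of the shared loop: observe -> policy -> step -> repeat."""

    def __init__(self, env, policy: Callable[[dict], dict], max_steps: int = 20):
        self.env = env
        self.policy = policy        # baseline, OpenRouter, or trained RL policy
        self.max_steps = max_steps  # hard cap so a confused policy cannot loop forever

    def run(self) -> Tuple[float, List[dict]]:
        obs = self.env.reset()                      # 1. load state / 2. observe
        total, trace, done, steps = 0.0, [], False, 0
        while not done and steps < self.max_steps:
            action = self.policy(obs)               # 3-4. query policy, get structured action
            result = self.env.step(action)          # 5-6. execute in workspace, update state
            trace.append({"action": action, "reward": result.reward})
            obs = result.observation
            total += result.reward
            done = result.done                      # 7. repeat until task complete
            steps += 1
        return total, trace

class ToyEnv:
    """Terminates after two rewarded steps; exists only to exercise the loop."""
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return {"t": self.t}
    def step(self, action):
        self.t += 1
        return StepResult({"t": self.t}, 0.5, self.t >= 2)
```

Because the policy is just an observation-to-action callable, the same runner can host the deterministic baseline, the OpenRouter model, and the tabular Q-learning checkpoint, which is what keeps the notebook, scripts, tests, and Gradio app from drifting apart.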