# Ghostexec — innovation brief (for reviewers)
**Repository:** [Ghostexec (OpenEnv)](../README.md)
**Public Space:** https://huggingface.co/spaces/modelbuilderhq/ghostexec
This README is a **standalone** walkthrough for reviewers: why the environment is hard, what agent capabilities it stresses, and how to verify those claims in code and on the live Space. You can read it **without** opening the rest of the repo narrative.
---
## Contents
1. [How to read this document](#how-to-read-this-document)
2. [Short answers](#short-answers-so-nothing-is-buried)
3. [What Ghostexec is](#1-what-ghostexec-is-one-paragraph)
4. [What the agent observes](#2-what-the-agent-observes-and-why-that-matters)
5. [What the agent can do](#3-what-the-agent-can-do-actions-and-legality)
6. [What changes between steps](#4-what-changes-between-steps-dynamics-and-drift)
7. [How success is scored](#5-how-success-is-scored-two-layers-on-purpose)
8. [Task ladder](#6-the-public-task-ladder-difficulty-in-data-not-vibes)
9. [Reviewer checklist](#7-how-a-reviewer-can-verify-5-minute-checklist)
10. [Closing](#8-closing)
11. [Key files (from repo root)](#key-files-from-repo-root)
---
## How to read this document
We group the argument under **two angles** reviewers typically care about. Everything below maps to one or both:
| Angle | Sections that answer it |
|-------|-------------------------|
| Is the **world** itself interesting and genuinely hard? | [Short answers](#short-answers-so-nothing-is-buried), [§1–§4](#1-what-ghostexec-is-one-paragraph) |
| Does it **stress-test agents** in a way a toy demo would not? | [Short answers](#short-answers-so-nothing-is-buried), [§3–§6](#3-what-the-agent-can-do-actions-and-legality), [§8](#8-closing) |
---
## Short answers (so nothing is buried)
**Is it genuinely challenging?** Yes. The agent must survive **dense natural-language state**, emit **strict structured actions** that **mutate** a multi-entity world, and accept **time pressure**, **social consequences**, and **invalid-action economics** without crashing the server. “Easy” wins are rare because channels **compete**: mail, calendar, tasks, and relationships all pull in different directions.
**Is it a meaningful test of behavior?** Yes. Success requires **grounded parsing** (real ids from the briefing), **tool discipline** (legal JSON schema), **sequencing** over multiple steps (WebSocket sessions for real episodes; HTTP for resets and single steps), and **tradeoffs** reflected in a **multi-channel** reward—not a single template answer.
**Is every ingredient globally novel?** No—and we do not claim otherwise. Inboxes and calendars are familiar. What *is* uncommon is the **composition**: OpenEnv-first packaging, **plain-text-only** observations, **data-driven** scenarios, **live dynamics** and **timed drift**, **dual** evaluation (**dense step rewards** + **trajectory graders** in strict `(0.01, 0.99)`), and a **production-shaped** action API—together—in one environment you can train and ship.
---
### 1. What Ghostexec is (one paragraph)
Ghostexec is an **executive chief-of-staff simulator**. Each episode starts from JSON scenario data under `../scenarios/`, selected by **task id** in `../openenv.yaml`. The **engine** lives in `../server/ghostexec_environment.py` and `../server/reward.py`; the **deployment contract** for Hugging Face / OpenEnv is `../openenv.yaml` (name **`ghostexec`**, FastAPI `server.app:app`, port **8000**). The model never sees raw scenario JSON as its primary observation: it sees a **rendered briefing**—the same class of messy, overlapping information a human would scan under time pressure.
---
### 2. What the agent observes (and why that matters)
After `reset` (or the WebSocket equivalent), the policy receives `GhostexecObservation.echoed_message`: a **single plain-text** block that includes, at minimum:
- A **timestamped header** (simulated “now”).
- **Unread emails** with priority, sender, relationship, subject, and a short preview.
- **Calendar conflicts** in a rolling horizon (overlaps the agent could resolve or worsen).
- **Top contacts** with **mood**, relationship type, and communication preference.
- **Tasks** that are overdue or due soon.
- **Executive stress** and **steps remaining** toward `max_steps` (see `../openenv.yaml`, default **20**).
**Why this matters for “challenging”:** many demos hide structure in JSON observations or tool schemas. Here, the **only** narrative state the model is supposed to “read” like a user is **natural language**, while the **law** of the world is still **typed actions**. That forces **comprehension + compliance** together—hallucinated ids and “vibes-only” plans fail in ways you can measure.
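To make the "comprehension + compliance" point concrete, here is a sketch of how a client might ground ids from the briefing. The briefing text below is **invented for illustration**, not output of the real renderer in `build_briefing_text`; only the *categories* of content (header, unread mail, conflicts, tasks) come from the list above, and the `email_NNN`-style id format is an assumption.

```python
import re

# Illustrative briefing in the spirit of build_briefing_text. The real
# renderer's layout may differ; every id and name here is invented.
BRIEFING = """\
[MON 09:00] Executive briefing -- stress: 62/100 -- steps left: 18

UNREAD MAIL
  - email_003 [CRITICAL] from Dana Reyes (board member, prefers email):
    "Q3 numbers before the 10:00 board sync?"
  - email_007 [LOW] from IT Desk: "Scheduled maintenance tonight."

CALENDAR CONFLICTS
  - meeting_002 (10:00 board sync) overlaps meeting_005 (10:30 1:1)

TASKS
  - task_004 OVERDUE: "Approve vendor contract"
"""

def extract_ids(briefing: str) -> dict[str, list[str]]:
    """Pull entity ids out of the plain-text briefing so actions can be grounded."""
    return {
        kind: sorted(set(re.findall(rf"{kind}_\d+", briefing)))
        for kind in ("email", "meeting", "task")
    }

print(extract_ids(BRIEFING))
# {'email': ['email_003', 'email_007'], 'meeting': ['meeting_002', 'meeting_005'], 'task': ['task_004']}
```

A policy that skips this grounding step and invents ids produces invalid steps, which is exactly the failure mode §3 measures.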
---
### 3. What the agent can do (actions and legality)
Each step the agent returns **exactly one** `GhostexecAction` (`../models.py`): `reply_email`, `archive_email`, `reschedule_meeting`, `cancel_meeting`, `complete_task`, `delegate_task`, `send_message`, or `do_nothing`.
**Validity is enforced against the live world:** wrong `email_id` / `meeting_id` / `task_id`, missing required fields, or impossible combinations produce an **invalid step**. The server **does not throw**; it returns structured metadata (`step_ok`, error text) so RL and HTTP clients can learn from mistakes instead of dying.
**Valid actions mutate state:** mail can be replied or archived; meetings moved or cancelled; tasks completed or delegated; direct messages sent. The episode is therefore a **small transactional simulation**, not a static Q&A.
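A minimal sketch of the legality contract, assuming JSON-shaped action payloads: the eight action types and the `email_id` / `meeting_id` / `task_id` fields come from the text above, but the `action_type` and `message` field names, the example ids, and the check itself are **assumptions**, not the real schema in `../models.py`.

```python
# Sketch of client-side action payloads; field names beyond the action types
# and *_id fields named in the text are assumptions.
valid_action = {
    "action_type": "reply_email",   # one of the eight legal action types
    "email_id": "email_003",        # must name an email in the live world
    "message": "Sending the Q3 numbers now.",
}

invalid_action = {
    "action_type": "reply_email",
    "email_id": "email_999",        # hallucinated id -> invalid step
}

KNOWN_EMAIL_IDS = {"email_003", "email_007"}  # hypothetical world state

def step_ok(action: dict) -> bool:
    """Toy mirror of the server-side check: real ids, required fields present."""
    if action.get("action_type") == "reply_email":
        return action.get("email_id") in KNOWN_EMAIL_IDS and "message" in action
    return True

print(step_ok(valid_action), step_ok(invalid_action))  # True False
```

On the real server an invalid action does not raise; it comes back as structured metadata (`step_ok`, error text), so a training loop can treat it as a learnable outcome.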
---
### 4. What changes between steps (dynamics and drift)
Ghostexec is **not** a static paragraph with a hidden answer key. After actions, the environment runs **post-step dynamics** (see `../server/ghostexec_environment.py`):
- **Clock:** simulation time advances (default **20 minutes** per step), which can flip tasks into overdue and change what “urgent” means.
- **Mood:** stakeholders move along a mood ladder after real actions (e.g. a thoughtful reply can improve the sender's mood; cancelling a meeting can upset attendees).
- **Pressure on idle / invalid behavior:** if the agent chooses **`do_nothing`** or takes an **invalid** action while **critical** mail is still unanswered, mood pressure can concentrate on the sender who is actually waiting—so “safe” inaction is not safe in the social graph.
- **Stress and conflicts:** the world rebuilds an **active conflict list** (overlaps, unanswered critical mail) and maps that into the **stress** value surfaced in the briefing—so calendar debt is not cosmetic.
**Scenario-driven schema drift:** harder JSON can schedule **`after_step`** events that reshuffle the world mid-episode: shift meetings, move deadlines, change communication preferences, **suppress relationship credit** for certain reply paths, or force moods. That tests **adaptation**, not memorization of the first screen.
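The drift mechanism can be pictured with a small sketch. The `after_step` key is named in the text; the event shapes, keys, and firing logic below are **hypothetical** and need not match the real schema in `scenarios/*.json` or `_maybe_apply_schema_drift_events`.

```python
# Hypothetical shape of scenario-driven drift events. Only the idea of
# after_step hooks comes from the source; every key name here is invented.
drift_events = [
    {"after_step": 5, "effect": "shift_meeting", "meeting_id": "meeting_002",
     "shift_minutes": 30},
    {"after_step": 8, "effect": "force_mood", "contact": "Dana Reyes",
     "mood": "frustrated"},
]

def due_events(events: list[dict], step: int) -> list[dict]:
    """Return the events scheduled to fire after exactly this step."""
    return [e for e in events if e["after_step"] == step]

print([e["effect"] for e in due_events(drift_events, 5)])  # ['shift_meeting']
```

Because the events live in scenario data rather than code, a new difficulty tier is a new JSON file, not a new Python branch.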
---
### 5. How success is scored (two layers, on purpose)
**A. Dense step reward (training and fine-grained analysis)** — `../server/reward.py`
A **fixed** weighted core (**0.35 conflict + 0.35 relationship + 0.30 task**) plus **bounded** shaping terms (synergy, tradeoffs, progress-style shaping, scaffold, quality separation). Invalid steps and **`do_nothing`** are handled explicitly (idle is **penalised**, not neutral). Rich `RewardBreakdown` fields can be logged to `outputs/logs/episode_rewards.jsonl` (gitignored) for auditing *why* a step moved.
**B. Trajectory graders (OpenEnv / hackathon validation)** — `../graders.py`
Each public task in `../openenv.yaml` binds a grader (`graders.phase2_core_grader`, etc.). Graders read **trajectory-shaped** payloads (e.g. lists of rewards) and return scores **strictly inside `(0.01, 0.99)`**—the validator-facing layer—while the step engine remains the **dense teaching signal**.
That split is deliberate: **agents learn from fine structure**, **judges certify** with stable bounded scores.
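A grader in this style might look like the sketch below. The trajectory-shaped input (a list of per-step rewards) and the strict `(0.01, 0.99)` output bound follow the text; the mean aggregation and epsilon choice are **assumptions**, not the logic of `../graders.py`.

```python
# Sketch of a trajectory grader. The (0.01, 0.99) bound is from the source;
# the mean aggregation and the epsilon are illustrative assumptions.
def grade_trajectory(rewards: list[float]) -> float:
    mean = sum(rewards) / len(rewards) if rewards else 0.0
    # Validator-facing scores must stay strictly inside (0.01, 0.99).
    return min(0.99 - 1e-6, max(0.01 + 1e-6, mean))

print(grade_trajectory([0.2, 0.6, 1.4]))
```

However extreme the step rewards get, the certified score stays inside the open interval the validator expects.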
---
### 6. The public task ladder (difficulty in *data*, not vibes)
| Task id | Difficulty | Scenario file | What gets harder |
|---------|------------|----------------|------------------|
| `phase2_core` | easy | `../scenarios/phase2_core.json` | Dense default triage: VIP mail, calendar relief, overlapping obligations. |
| `monday_morning` | medium | `../scenarios/monday_morning.json` | Stacked Monday rush: more concurrent fires, less slack. |
| `dinner_disaster` | hard | `../scenarios/dinner_disaster.json` | Personal vs professional collision with **escalation risk**. |
All of this is declared in **`../openenv.yaml`** so the Space, CLI, and notebooks agree on **names**, **ports**, and **grader wiring** without a second source of truth.
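For orientation, a config in this spirit might look like the fragment below. The **values** (name, app target, port, `max_steps`, task ids, scenario paths, grader names) come from this document; the **key names and nesting are illustrative guesses**, so trust the real `../openenv.yaml` over this sketch.

```yaml
# Hypothetical excerpt: key names are illustrative, values are from the text.
name: ghostexec
app: server.app:app
port: 8000
max_steps: 20
tasks:
  - id: phase2_core
    scenario: scenarios/phase2_core.json
    grader: graders.phase2_core_grader
```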
---
### 7. How a reviewer can verify (5-minute checklist)
1. Open **`../openenv.yaml`** — confirm three tasks, `max_steps`, `app: server.app:app`, **`name: ghostexec`**.
2. Open **`../scenarios/*.json`** — confirm episodes are **data**, not hardcoded Python lore.
3. Skim **`../server/ghostexec_environment.py`** — `build_briefing_text`, `_apply_action`, `_apply_post_action_dynamics`, `_maybe_apply_schema_drift_events`.
4. Skim **`../server/reward.py`** — fixed weights, invalid / idle handling, shaping caps.
5. Open **`../graders.py`** — strict output bounds and trajectory consumption.
6. Open the **public Space**: https://huggingface.co/spaces/modelbuilderhq/ghostexec — use `/docs` or `POST /reset` + `POST /step`: legal actions change state; illegal actions return errors, **not** stack traces.
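The steps above can be scripted. The helper below only *builds* the HTTP requests; the Space hostname shape and the request bodies are **assumptions**, so confirm the real schemas via the Space's `/docs` page before relying on them.

```python
import json
import urllib.request

# Assumed Space URL shape and request bodies -- verify against /docs on the
# live Space; only the /reset and /step endpoints are named in the source.
BASE = "https://modelbuilderhq-ghostexec.hf.space"

def make_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST against the environment server."""
    return urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    for path, body in [("/reset", {}), ("/step", {"action_type": "do_nothing"})]:
        with urllib.request.urlopen(make_request(path, body)) as resp:
            print(json.load(resp))  # briefing text, then step metadata
```

The claim to verify is the last one in the checklist: a legal action changes state, and an illegal one comes back as a structured error body rather than a stack trace.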
---
### 8. Closing
**World quality.** The challenge is **interactional and operational**: overlapping human-style goals, strict tool use, evolving social signals, and mid-episode drift—**not** a single binary “did you answer correctly.”
**What this stack proves.** If you strip Ghostexec to one bullet, it is: **plain-text situational awareness + legal structured world edits + multi-channel rewards + timed scenario pressure + OpenEnv-native deployment and graders**—in one coherent package you can train, log, and host.
That is the **innovation case** this repository is built to defend.
---
## Key files (from repo root)
| Path | Role |
|------|------|
| `openenv.yaml` | Space name, port, tasks, graders, `max_steps` |
| `scenarios/*.json` | Episode **data** (world content, drift hooks) |
| `server/ghostexec_environment.py` | Briefing text, actions, dynamics, drift |
| `server/reward.py` | Step reward, fixed 0.35 / 0.35 / 0.30 core + shaping |
| `graders.py` | Trajectory scores in `(0.01, 0.99)` per task |
| `models.py` | `GhostexecAction`, `GhostexecObservation`, `RewardBreakdown` |
For install, tests, training scripts, and the rest of the hackathon submission, see the [main project README](../README.md).