# Ghostexec — innovation brief (for reviewers)
**Repository:** [Ghostexec (OpenEnv)](../README.md)
**Public Space:** https://huggingface.co/spaces/modelbuilderhq/ghostexec
This README is a **standalone** walkthrough for reviewers: why the environment is hard, what agent capabilities it stresses, and how to verify the claims in code and on the live Space. You can read it **without** opening the rest of the repo narrative.
---
## Contents
1. [How to read this document](#how-to-read-this-document)
2. [Short answers](#short-answers-so-nothing-is-buried)
3. [What Ghostexec is](#1-what-ghostexec-is-one-paragraph)
4. [What the agent observes](#2-what-the-agent-observes-and-why-that-matters)
5. [What the agent can do](#3-what-the-agent-can-do-actions-and-legality)
6. [What changes between steps](#4-what-changes-between-steps-dynamics-and-drift)
7. [How success is scored](#5-how-success-is-scored-two-layers-on-purpose)
8. [Task ladder](#6-the-public-task-ladder-difficulty-in-data-not-vibes)
9. [Reviewer checklist](#7-how-a-reviewer-can-verify-5-minute-checklist)
10. [Closing](#8-closing)
11. [Key files (from repo root)](#key-files-from-repo-root)
---
## How to read this document
We group the argument under **two angles** reviewers typically care about. Everything below maps to one or both:
| Angle | Sections that answer it |
|-------|-------------------------|
| Is the **world** itself interesting and genuinely hard? | [Short answers](#short-answers-so-nothing-is-buried), [§1–§4](#1-what-ghostexec-is-one-paragraph) |
| Does it **stress-test agents** in a way a toy demo would not? | [Short answers](#short-answers-so-nothing-is-buried), [§3–§6](#3-what-the-agent-can-do-actions-and-legality), [§8](#8-closing) |
---
## Short answers (so nothing is buried)
**Is it genuinely challenging?** Yes. The agent must survive **dense natural-language state**, emit **strict structured actions** that **mutate** a multi-entity world, and accept **time pressure**, **social consequences**, and **invalid-action economics** without crashing the server. “Easy” wins are rare because channels **compete**: mail, calendar, tasks, and relationships all pull in different directions.
**Is it a meaningful test of behavior?** Yes. Success requires **grounded parsing** (real ids from the briefing), **tool discipline** (legal JSON schema), **sequencing** over multiple steps (WebSocket sessions for real episodes; HTTP for resets and single steps), and **tradeoffs** reflected in a **multi-channel** reward—not a single template answer.
**Is every ingredient globally novel?** No—and we do not claim otherwise. Inboxes and calendars are familiar. What *is* uncommon is the **composition**: OpenEnv-first packaging, **plain-text-only** observations, **data-driven** scenarios, **live dynamics** and **timed drift**, **dual** evaluation (**dense step rewards** + **trajectory graders** in strict `(0.01, 0.99)`), and a **production-shaped** action API—together—in one environment you can train and ship.
---
### 1. What Ghostexec is (one paragraph)
Ghostexec is an **executive chief-of-staff simulator**. Each episode starts from JSON scenario data under `../scenarios/`, selected by **task id** in `../openenv.yaml`. The **engine** lives in `../server/ghostexec_environment.py` and `../server/reward.py`; the **deployment contract** for Hugging Face / OpenEnv is `../openenv.yaml` (name **`ghostexec`**, FastAPI `server.app:app`, port **8000**). The model never sees raw scenario JSON as its primary observation: it sees a **rendered briefing**—the same class of messy, overlapping information a human would scan under time pressure.
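The deployment contract ties these pieces together. A minimal sketch of the shape of `../openenv.yaml`, built only from the values named in this document (the exact keys and nesting in the real file may differ):

```yaml
# Illustrative sketch of ../openenv.yaml — only the values cited in this
# README (name, app, port, max_steps, task ids, graders) are grounded;
# key names and structure are assumptions.
name: ghostexec
app: server.app:app
port: 8000
max_steps: 20
tasks:
  - id: phase2_core
    scenario: scenarios/phase2_core.json
    grader: graders.phase2_core_grader
  - id: monday_morning
    scenario: scenarios/monday_morning.json
  - id: dinner_disaster
    scenario: scenarios/dinner_disaster.json
```

Because the Space, CLI, and notebooks all read this one file, renaming a task or moving a scenario is a one-line change rather than a hunt through Python.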
---
### 2. What the agent observes (and why that matters)
After `reset` (or the WebSocket equivalent), the policy receives `GhostexecObservation.echoed_message`: a **single plain-text** block that includes, at minimum:
- A **timestamped header** (simulated “now”).
- **Unread emails** with priority, sender, relationship, subject, and a short preview.
- **Calendar conflicts** in a rolling horizon (overlaps the agent could resolve or worsen).
- **Top contacts** with **mood**, relationship type, and communication preference.
- **Tasks** that are overdue or due soon.
- **Executive stress** and **steps remaining** toward `max_steps` (see `../openenv.yaml`, default **20**).
**Why this matters for “challenging”:** many demos hide structure in JSON observations or tool schemas. Here, the **only** narrative state the model is supposed to “read” like a user is **natural language**, while the **law** of the world is still **typed actions**. That forces **comprehension + compliance** together—hallucinated ids and “vibes-only” plans fail in ways you can measure.
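In practice the policy has to pull the typed ids back out of that plain text before it can act. A minimal sketch, using a hypothetical briefing in the shape of the bullets above (the real rendering from `build_briefing_text` may differ in wording and layout):

```python
import re

# Hypothetical briefing text in the shape described above; the actual
# output of build_briefing_text may differ in wording and layout.
BRIEFING = """\
=== Tuesday 09:40 (simulated) ===
UNREAD EMAIL [email_3] priority=critical from=CFO (key stakeholder): "Q3 numbers?"
CALENDAR CONFLICT [meeting_7] 10:00-11:00 overlaps [meeting_9] 10:30-11:30
TASK [task_2] overdue: board deck revisions
Stress: 72/100 | Steps remaining: 18/20
"""

def extract_ids(briefing: str) -> dict:
    """Pull the typed ids the action schema expects out of the plain text."""
    return {
        "email_ids": re.findall(r"\[(email_\d+)\]", briefing),
        "meeting_ids": re.findall(r"\[(meeting_\d+)\]", briefing),
        "task_ids": re.findall(r"\[(task_\d+)\]", briefing),
    }

ids = extract_ids(BRIEFING)
print(ids)
```

A policy that hallucinates an id not present in the briefing fails this grounding step measurably, which is exactly the point.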
---
### 3. What the agent can do (actions and legality)
Each step the agent returns **exactly one** `GhostexecAction` (`../models.py`): `reply_email`, `archive_email`, `reschedule_meeting`, `cancel_meeting`, `complete_task`, `delegate_task`, `send_message`, or `do_nothing`.
**Validity is enforced against the live world:** wrong `email_id` / `meeting_id` / `task_id`, missing required fields, or impossible combinations produce an **invalid step**. The server **does not throw**; it returns structured metadata (`step_ok`, error text) so RL and HTTP clients can learn from mistakes instead of dying.
**Valid actions mutate state:** mail can be replied or archived; meetings moved or cancelled; tasks completed or delegated; direct messages sent. The episode is therefore a **small transactional simulation**, not a static Q&A.
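The legality rule can be illustrated with a toy validator. The real checks live server-side in `_apply_action`, and the exact `GhostexecAction` schema is in `../models.py`; the field names below are assumptions based on the action list above:

```python
# Toy world state and validity check mirroring the server-side rule:
# actions that reference ids absent from the live world are invalid
# steps with structured metadata, not exceptions.
WORLD = {"email_ids": {"email_3"}, "meeting_ids": {"meeting_7", "meeting_9"}}

def check_action(action: dict) -> dict:
    """Return step metadata in the spirit of the server's structured errors."""
    if action["action_type"] == "reply_email":
        if action.get("email_id") not in WORLD["email_ids"]:
            return {"step_ok": False, "error": f"unknown email_id: {action.get('email_id')}"}
        if not action.get("body"):
            return {"step_ok": False, "error": "reply_email requires a body"}
    return {"step_ok": True, "error": None}

good = check_action({"action_type": "reply_email", "email_id": "email_3", "body": "On it."})
bad = check_action({"action_type": "reply_email", "email_id": "email_99", "body": "Hi"})
print(good, bad)
```

Returning structured errors instead of raising keeps RL rollouts alive: a policy can observe `step_ok: false` and recover, rather than killing the episode.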
---
### 4. What changes between steps (dynamics and drift)
Ghostexec is **not** a static paragraph with a hidden answer key. After actions, the environment runs **post-step dynamics** (see `../server/ghostexec_environment.py`):
- **Clock:** simulation time advances (default **20 minutes** per step), which can flip tasks into overdue and change what “urgent” means.
- **Mood:** stakeholders move along a mood ladder after real actions (e.g. a thoughtful reply can improve a sender; cancelling a meeting can upset attendees).
- **Pressure on idle / invalid behavior:** if the agent chooses **`do_nothing`** or takes an **invalid** step while **critical** mail is still unanswered, mood pressure can concentrate on the sender who is actually waiting—so “safe” inaction is not safe in the social graph.
- **Stress and conflicts:** the world rebuilds an **active conflict list** (overlaps, unanswered critical mail) and maps that into the **stress** value surfaced in the briefing—so calendar debt is not cosmetic.
**Scenario-driven schema drift:** harder JSON can schedule **`after_step`** events that reshuffle the world mid-episode: shift meetings, move deadlines, change communication preferences, **suppress relationship credit** for certain reply paths, or force moods. That tests **adaptation**, not memorization of the first screen.
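An `after_step` drift hook in scenario JSON might look like the following sketch. The event and field names here are illustrative; the real schema is whatever `_maybe_apply_schema_drift_events` consumes:

```json
{
  "drift_events": [
    {
      "after_step": 6,
      "effects": [
        {"type": "shift_meeting", "meeting_id": "meeting_7", "minutes": 30},
        {"type": "move_deadline", "task_id": "task_2", "minutes": -60},
        {"type": "force_mood", "contact": "cfo", "mood": "irritated"}
      ]
    }
  ]
}
```

Because drift lives in data, a scenario author can make an episode harder without touching engine code.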
---
### 5. How success is scored (two layers, on purpose)
**A. Dense step reward (training and fine-grained analysis)** — `../server/reward.py`
A **fixed** weighted core (**0.35 conflict + 0.35 relationship + 0.30 task**) plus **bounded** shaping terms (synergy, tradeoffs, progress-style shaping, scaffold, quality separation). Invalid steps and **`do_nothing`** are handled explicitly (idle is **penalised**, not neutral). Rich `RewardBreakdown` fields can be logged to `outputs/logs/episode_rewards.jsonl` (gitignored) for auditing *why* a step moved.
**B. Trajectory graders (OpenEnv / hackathon validation)** — `../graders.py`
Each public task in `../openenv.yaml` binds a grader (`graders.phase2_core_grader`, etc.). Graders read **trajectory-shaped** payloads (e.g. lists of rewards) and return scores **strictly inside `(0.01, 0.99)`**—the validator-facing layer—while the step engine remains the **dense teaching signal**.
That split is deliberate: **agents learn from fine structure**, **judges certify** with stable bounded scores.
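The bound contract can be sketched like this. The mean-and-clamp aggregation is illustrative only; the real graders in `../graders.py` consume richer trajectory-shaped payloads and may weight steps differently:

```python
def bounded_score(step_rewards: list, lo: float = 0.01, hi: float = 0.99) -> float:
    """Map a trajectory's step rewards to a score strictly inside (lo, hi).

    Illustrative aggregation (mean of steps); the clamp uses a small
    epsilon to stay strictly inside the open interval the validator expects.
    """
    eps = 1e-6
    if not step_rewards:
        return lo + eps
    mean = sum(step_rewards) / len(step_rewards)
    return min(hi - eps, max(lo + eps, mean))

print(bounded_score([0.2, 0.6, 1.3]))   # in-range mean passes through
print(bounded_score([-1.0, -2.0]))      # clamped just above 0.01
```

Keeping the score strictly inside the open interval means a validator can distinguish "graded very low" from "grader never ran", since exact 0.0 and 1.0 never occur.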
---
### 6. The public task ladder (difficulty in *data*, not vibes)
| Task id | Difficulty | Scenario file | What gets harder |
|---------|------------|----------------|------------------|
| `phase2_core` | easy | `../scenarios/phase2_core.json` | Dense default triage: VIP mail, calendar relief, overlapping obligations. |
| `monday_morning` | medium | `../scenarios/monday_morning.json` | Stacked Monday rush: more concurrent fires, less slack. |
| `dinner_disaster` | hard | `../scenarios/dinner_disaster.json` | Personal vs professional collision with **escalation risk**. |
All of this is declared in **`../openenv.yaml`** so the Space, CLI, and notebooks agree on **names**, **ports**, and **grader wiring** without a second source of truth.
---
### 7. How a reviewer can verify (5-minute checklist)
1. Open **`../openenv.yaml`** — confirm three tasks, `max_steps`, `app: server.app:app`, **`name: ghostexec`**.
2. Open **`../scenarios/*.json`** — confirm episodes are **data**, not hardcoded Python lore.
3. Skim **`../server/ghostexec_environment.py`** — `build_briefing_text`, `_apply_action`, `_apply_post_action_dynamics`, `_maybe_apply_schema_drift_events`.
4. Skim **`../server/reward.py`** — fixed weights, invalid / idle handling, shaping caps.
5. Open **`../graders.py`** — strict output bounds and trajectory consumption.
6. Open the **public Space**: https://huggingface.co/spaces/modelbuilderhq/ghostexec — use `/docs` or `POST /reset` + `POST /step`: legal actions change state; illegal actions return errors, **not** stack traces.
---
### 8. Closing
**World quality.** The challenge is **interactional and operational**: overlapping human-style goals, strict tool use, evolving social signals, and mid-episode drift—**not** a single binary “did you answer correctly.”
**What this stack proves.** If you strip Ghostexec to one bullet, it is: **plain-text situational awareness + legal structured world edits + multi-channel rewards + timed scenario pressure + OpenEnv-native deployment and graders**—in one coherent package you can train, log, and host.
That is the **innovation case** this repository is built to defend.
---
## Key files (from repo root)
| Path | Role |
|------|------|
| `openenv.yaml` | Space name, port, tasks, graders, `max_steps` |
| `scenarios/*.json` | Episode **data** (world content, drift hooks) |
| `server/ghostexec_environment.py` | Briefing text, actions, dynamics, drift |
| `server/reward.py` | Step reward, fixed 0.35 / 0.35 / 0.30 core + shaping |
| `graders.py` | Trajectory scores in `(0.01, 0.99)` per task |
| `models.py` | `GhostexecAction`, `GhostexecObservation`, `RewardBreakdown` |
For install, tests, training scripts, and the rest of the hackathon submission, see the [main project README](../README.md).