# Ghostexec — innovation brief (for reviewers)
**Repository:** [Ghostexec (OpenEnv)](../README.md)
**Public Space:** https://huggingface.co/spaces/modelbuilderhq/ghostexec
This README is a **standalone** walkthrough for reviewers: why the environment is hard, what agent capabilities it stresses, and how to verify those claims in code and on the live Space. You can read it **without** opening the rest of the repo narrative.
---
## Contents
1. [How to read this document](#how-to-read-this-document)
2. [Short answers](#short-answers-so-nothing-is-buried)
3. [What Ghostexec is](#1-what-ghostexec-is-one-paragraph)
4. [What the agent observes](#2-what-the-agent-observes-and-why-that-matters)
5. [What the agent can do](#3-what-the-agent-can-do-actions-and-legality)
6. [What changes between steps](#4-what-changes-between-steps-dynamics-and-drift)
7. [How success is scored](#5-how-success-is-scored-two-layers-on-purpose)
8. [Task ladder](#6-the-public-task-ladder-difficulty-in-data-not-vibes)
9. [Reviewer checklist](#7-how-a-reviewer-can-verify-5-minute-checklist)
10. [Closing](#8-closing)
11. [Key files (from repo root)](#key-files-from-repo-root)
---
## How to read this document
We group the argument under **two angles** reviewers typically care about. Everything below maps to one or both:
| Angle | Sections that answer it |
|-------|-------------------------|
| Is the **world** itself interesting and genuinely hard? | [Short answers](#short-answers-so-nothing-is-buried), [§1–§4](#1-what-ghostexec-is-one-paragraph) |
| Does it **stress-test agents** in a way a toy demo would not? | [Short answers](#short-answers-so-nothing-is-buried), [§3–§6](#3-what-the-agent-can-do-actions-and-legality), [§8](#8-closing) |
---
## Short answers (so nothing is buried)
**Is it genuinely challenging?** Yes. The agent must survive **dense natural-language state**, emit **strict structured actions** that **mutate** a multi-entity world, and accept **time pressure**, **social consequences**, and **invalid-action economics** without crashing the server. “Easy” wins are rare because channels **compete**: mail, calendar, tasks, and relationships all pull in different directions.
**Is it a meaningful test of behavior?** Yes. Success requires **grounded parsing** (real ids from the briefing), **tool discipline** (legal JSON schema), **sequencing** over multiple steps (WebSocket sessions for real episodes; HTTP for resets and single steps), and **tradeoffs** reflected in a **multi-channel** reward—not a single template answer.
**Is every ingredient globally novel?** No—and we do not claim otherwise. Inboxes and calendars are familiar. What *is* uncommon is the **composition**: OpenEnv-first packaging, **plain-text-only** observations, **data-driven** scenarios, **live dynamics** and **timed drift**, **dual** evaluation (**dense step rewards** + **trajectory graders** in strict `(0.01, 0.99)`), and a **production-shaped** action API—together—in one environment you can train and ship.
---
### 1. What Ghostexec is (one paragraph)
Ghostexec is an **executive chief-of-staff simulator**. Each episode starts from JSON scenario data under `../scenarios/`, selected by **task id** in `../openenv.yaml`. The **engine** lives in `../server/ghostexec_environment.py` and `../server/reward.py`; the **deployment contract** for Hugging Face / OpenEnv is `../openenv.yaml` (name **`ghostexec`**, FastAPI `server.app:app`, port **8000**). The model never sees raw scenario JSON as its primary observation: it sees a **rendered briefing**—the same class of messy, overlapping information a human would scan under time pressure.
---
### 2. What the agent observes (and why that matters)
After `reset` (or the WebSocket equivalent), the policy receives `GhostexecObservation.echoed_message`: a **single plain-text** block that includes, at minimum:
- A **timestamped header** (simulated “now”).
- **Unread emails** with priority, sender, relationship, subject, and a short preview.
- **Calendar conflicts** in a rolling horizon (overlaps the agent could resolve or worsen).
- **Top contacts** with **mood**, relationship type, and communication preference.
- **Tasks** that are overdue or due soon.
- **Executive stress** and **steps remaining** toward `max_steps` (see `../openenv.yaml`, default **20**).
**Why this matters for “challenging”:** many demos hide structure in JSON observations or tool schemas. Here, the **only** narrative state the model is supposed to “read” like a user is **natural language**, while the **law** of the world is still **typed actions**. That forces **comprehension + compliance** together—hallucinated ids and “vibes-only” plans fail in ways you can measure.
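To make the "comprehension + compliance" point concrete, here is a sketch of how a client might ground ids from the briefing. The briefing text below is **invented for illustration**, not output of the real renderer in `build_briefing_text`; only the *categories* of content (header, unread mail, conflicts, tasks) come from the list above, and the `email_NNN`-style id format is an assumption.

```python
import re

# Illustrative briefing in the spirit of build_briefing_text. The real
# renderer's layout may differ; every id and name here is invented.
BRIEFING = """\
[MON 09:00] Executive briefing -- stress: 62/100 -- steps left: 18

UNREAD MAIL
  - email_003 [CRITICAL] from Dana Reyes (board member, prefers email):
    "Q3 numbers before the 10:00 board sync?"
  - email_007 [LOW] from IT Desk: "Scheduled maintenance tonight."

CALENDAR CONFLICTS
  - meeting_002 (10:00 board sync) overlaps meeting_005 (10:30 1:1)

TASKS
  - task_004 OVERDUE: "Approve vendor contract"
"""

def extract_ids(briefing: str) -> dict[str, list[str]]:
    """Pull entity ids out of the plain-text briefing so actions can be grounded."""
    return {
        kind: sorted(set(re.findall(rf"{kind}_\d+", briefing)))
        for kind in ("email", "meeting", "task")
    }

print(extract_ids(BRIEFING))
# {'email': ['email_003', 'email_007'], 'meeting': ['meeting_002', 'meeting_005'], 'task': ['task_004']}
```

A policy that skips this grounding step and invents ids produces invalid steps, which is exactly the failure mode §3 measures.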
---
### 3. What the agent can do (actions and legality)
Each step the agent returns **exactly one** `GhostexecAction` (`../models.py`): `reply_email`, `archive_email`, `reschedule_meeting`, `cancel_meeting`, `complete_task`, `delegate_task`, `send_message`, or `do_nothing`.
**Validity is enforced against the live world:** wrong `email_id` / `meeting_id` / `task_id`, missing required fields, or impossible combinations produce an **invalid step**. The server **does not throw**; it returns structured metadata (`step_ok`, error text) so RL and HTTP clients can learn from mistakes instead of dying.
**Valid actions mutate state:** mail can be replied or archived; meetings moved or cancelled; tasks completed or delegated; direct messages sent. The episode is therefore a **small transactional simulation**, not a static Q&A.
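A minimal sketch of the legality contract, assuming JSON-shaped action payloads: the eight action types and the `email_id` / `meeting_id` / `task_id` fields come from the text above, but the `action_type` and `message` field names, the example ids, and the check itself are **assumptions**, not the real schema in `../models.py`.

```python
# Sketch of client-side action payloads; field names beyond the action types
# and *_id fields named in the text are assumptions.
valid_action = {
    "action_type": "reply_email",   # one of the eight legal action types
    "email_id": "email_003",        # must name an email in the live world
    "message": "Sending the Q3 numbers now.",
}

invalid_action = {
    "action_type": "reply_email",
    "email_id": "email_999",        # hallucinated id -> invalid step
}

KNOWN_EMAIL_IDS = {"email_003", "email_007"}  # hypothetical world state

def step_ok(action: dict) -> bool:
    """Toy mirror of the server-side check: real ids, required fields present."""
    if action.get("action_type") == "reply_email":
        return action.get("email_id") in KNOWN_EMAIL_IDS and "message" in action
    return True

print(step_ok(valid_action), step_ok(invalid_action))  # True False
```

On the real server an invalid action does not raise; it comes back as structured metadata (`step_ok`, error text), so a training loop can treat it as a learnable outcome.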
---
### 4. What changes between steps (dynamics and drift)
Ghostexec is **not** a static paragraph with a hidden answer key. After actions, the environment runs **post-step dynamics** (see `../server/ghostexec_environment.py`):
- **Clock:** simulation time advances (default **20 minutes** per step), which can flip tasks into overdue and change what “urgent” means.
- **Mood:** stakeholders move along a mood ladder after real actions (e.g. a thoughtful reply can improve the sender's mood; cancelling a meeting can upset attendees).
- **Pressure on idle / invalid behavior:** if the agent chooses **`do_nothing`** or takes an **invalid** action while **critical** mail is still unanswered, mood pressure can concentrate on the sender who is actually waiting—so “safe” inaction is not safe in the social graph.
- **Stress and conflicts:** the world rebuilds an **active conflict list** (overlaps, unanswered critical mail) and maps that into the **stress** value surfaced in the briefing—so calendar debt is not cosmetic.
**Scenario-driven schema drift:** harder JSON can schedule **`after_step`** events that reshuffle the world mid-episode: shift meetings, move deadlines, change communication preferences, **suppress relationship credit** for certain reply paths, or force moods. That tests **adaptation**, not memorization of the first screen.
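The drift mechanism can be pictured with a small sketch. The `after_step` key is named in the text; the event shapes, keys, and firing logic below are **hypothetical** and need not match the real schema in `scenarios/*.json` or `_maybe_apply_schema_drift_events`.

```python
# Hypothetical shape of scenario-driven drift events. Only the idea of
# after_step hooks comes from the source; every key name here is invented.
drift_events = [
    {"after_step": 5, "effect": "shift_meeting", "meeting_id": "meeting_002",
     "shift_minutes": 30},
    {"after_step": 8, "effect": "force_mood", "contact": "Dana Reyes",
     "mood": "frustrated"},
]

def due_events(events: list[dict], step: int) -> list[dict]:
    """Return the events scheduled to fire after exactly this step."""
    return [e for e in events if e["after_step"] == step]

print([e["effect"] for e in due_events(drift_events, 5)])  # ['shift_meeting']
```

Because the events live in scenario data rather than code, a new difficulty tier is a new JSON file, not a new Python branch.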
---
### 5. How success is scored (two layers, on purpose)
**A. Dense step reward (training and fine-grained analysis)** — `../server/reward.py`
A **fixed** weighted core (**0.35 conflict + 0.35 relationship + 0.30 task**) plus **bounded** shaping terms (synergy, tradeoffs, progress-style shaping, scaffold, quality separation). Invalid steps and **`do_nothing`** are handled explicitly (idle is **penalised**, not neutral). Rich `RewardBreakdown` fields can be logged to `outputs/logs/episode_rewards.jsonl` (gitignored) for auditing *why* a step moved.
**B. Trajectory graders (OpenEnv / hackathon validation)** — `../graders.py`
Each public task in `../openenv.yaml` binds a grader (`graders.phase2_core_grader`, etc.). Graders read **trajectory-shaped** payloads (e.g. lists of rewards) and return scores **strictly inside `(0.01, 0.99)`**—the validator-facing layer—while the step engine remains the **dense teaching signal**.
That split is deliberate: **agents learn from fine structure**, **judges certify** with stable bounded scores.
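A grader in this style might look like the sketch below. The trajectory-shaped input (a list of per-step rewards) and the strict `(0.01, 0.99)` output bound follow the text; the mean aggregation and epsilon choice are **assumptions**, not the logic of `../graders.py`.

```python
# Sketch of a trajectory grader. The (0.01, 0.99) bound is from the source;
# the mean aggregation and the epsilon are illustrative assumptions.
def grade_trajectory(rewards: list[float]) -> float:
    mean = sum(rewards) / len(rewards) if rewards else 0.0
    # Validator-facing scores must stay strictly inside (0.01, 0.99).
    return min(0.99 - 1e-6, max(0.01 + 1e-6, mean))

print(grade_trajectory([0.2, 0.6, 1.4]))
```

However extreme the step rewards get, the certified score stays inside the open interval the validator expects.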
---
### 6. The public task ladder (difficulty in *data*, not vibes)
| Task id | Difficulty | Scenario file | What gets harder |
|---------|------------|----------------|------------------|
| `phase2_core` | easy | `../scenarios/phase2_core.json` | Dense default triage: VIP mail, calendar relief, overlapping obligations. |
| `monday_morning` | medium | `../scenarios/monday_morning.json` | Stacked Monday rush: more concurrent fires, less slack. |
| `dinner_disaster` | hard | `../scenarios/dinner_disaster.json` | Personal vs professional collision with **escalation risk**. |
All of this is declared in **`../openenv.yaml`** so the Space, CLI, and notebooks agree on **names**, **ports**, and **grader wiring** without a second source of truth.
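For orientation, a config in this spirit might look like the fragment below. The **values** (name, app target, port, `max_steps`, task ids, scenario paths, grader names) come from this document; the **key names and nesting are illustrative guesses**, so trust the real `../openenv.yaml` over this sketch.

```yaml
# Hypothetical excerpt: key names are illustrative, values are from the text.
name: ghostexec
app: server.app:app
port: 8000
max_steps: 20
tasks:
  - id: phase2_core
    scenario: scenarios/phase2_core.json
    grader: graders.phase2_core_grader
```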
---
### 7. How a reviewer can verify (5-minute checklist)
1. Open **`../openenv.yaml`** — confirm three tasks, `max_steps`, `app: server.app:app`, **`name: ghostexec`**.
2. Open **`../scenarios/*.json`** — confirm episodes are **data**, not hardcoded Python lore.
3. Skim **`../server/ghostexec_environment.py`** — `build_briefing_text`, `_apply_action`, `_apply_post_action_dynamics`, `_maybe_apply_schema_drift_events`.
4. Skim **`../server/reward.py`** — fixed weights, invalid / idle handling, shaping caps.
5. Open **`../graders.py`** — strict output bounds and trajectory consumption.
6. Open the **public Space**: https://huggingface.co/spaces/modelbuilderhq/ghostexec — use `/docs` or `POST /reset` + `POST /step`: legal actions change state; illegal actions return errors, **not** stack traces.
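The steps above can be scripted. The helper below only *builds* the HTTP requests; the Space hostname shape and the request bodies are **assumptions**, so confirm the real schemas via the Space's `/docs` page before relying on them.

```python
import json
import urllib.request

# Assumed Space URL shape and request bodies -- verify against /docs on the
# live Space; only the /reset and /step endpoints are named in the source.
BASE = "https://modelbuilderhq-ghostexec.hf.space"

def make_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST against the environment server."""
    return urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    for path, body in [("/reset", {}), ("/step", {"action_type": "do_nothing"})]:
        with urllib.request.urlopen(make_request(path, body)) as resp:
            print(json.load(resp))  # briefing text, then step metadata
```

The claim to verify is the last one in the checklist: a legal action changes state, and an illegal one comes back as a structured error body rather than a stack trace.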
---
### 8. Closing
**World quality.** The challenge is **interactional and operational**: overlapping human-style goals, strict tool use, evolving social signals, and mid-episode drift—**not** a single binary “did you answer correctly.”
**What this stack proves.** If you strip Ghostexec to one bullet, it is: **plain-text situational awareness + legal structured world edits + multi-channel rewards + timed scenario pressure + OpenEnv-native deployment and graders**—in one coherent package you can train, log, and host.
That is the **innovation case** this repository is built to defend.
---
## Key files (from repo root)
| Path | Role |
|------|------|
| `openenv.yaml` | Space name, port, tasks, graders, `max_steps` |
| `scenarios/*.json` | Episode **data** (world content, drift hooks) |
| `server/ghostexec_environment.py` | Briefing text, actions, dynamics, drift |
| `server/reward.py` | Step reward, fixed 0.35 / 0.35 / 0.30 core + shaping |
| `graders.py` | Trajectory scores in `(0.01, 0.99)` per task |
| `models.py` | `GhostexecAction`, `GhostexecObservation`, `RewardBreakdown` |
For install, tests, training scripts, and the rest of the hackathon submission, see the [main project README](../README.md).