---
title: Ghostexec Environment Server
emoji: 📢
colorFrom: pink
colorTo: yellow
sdk: docker
pinned: false
app_port: 7860
base_path: /web
tags:
- openenv
---
# Ghostexec
**Ghostexec** is an [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compatible environment: a busy **executive chief-of-staff** simulator with inbox, calendar, contacts, tasks, and stakeholder moods. The agent must read a **plain-text briefing**, then emit **one structured action per step** (`reply_email`, `reschedule_meeting`, …). The server returns rewards shaped around **conflict**, **relationships**, and **tasks**—plus trajectory **graders** for hackathon validation. All episode **content** lives in `scenarios/*.json`; the engine is in `server/ghostexec_environment.py` and `server/reward.py`.
| Item | Value |
|------|--------|
| **HF Space name / manifest** | `ghostexec` in [`openenv.yaml`](openenv.yaml) |
| **Python package** | `openenv-ghostexec` in [`pyproject.toml`](pyproject.toml) (import `ghostexec`) |
| **Public Space** | [modelbuilderhq/ghostexec](https://huggingface.co/spaces/modelbuilderhq/ghostexec) |
| **Deeper innovation-only brief** | [`environment-innovation/README.md`](environment-innovation/README.md) |
---
## Deliverables (fill before freeze)
| Deliverable | URL |
|-------------|-----|
| Public HF Space (required) | [https://huggingface.co/spaces/modelbuilderhq/ghostexec](https://huggingface.co/spaces/modelbuilderhq/ghostexec) |
| Write-up / blog (HF post preferred) | `TODO: paste your post URL` |
| Short demo video (<2 min) | `TODO: paste your video URL` |
---
## Contents
**Judging criteria (this README is organized around them)**
1. [Criterion: Environment Innovation (40%)](#ghostexec-env-innovation)
2. [Criterion: Storytelling & Presentation (30%)](#ghostexec-storytelling)
3. [Criterion: Showing Improvement in Rewards (20%)](#ghostexec-reward-improvement)
4. [Criterion: Reward & Training Pipeline (10%)](#ghostexec-reward-pipeline)
**Reference**
5. [Hackathon themes & checklist](#openenv-hackathon-themes--checklist)
6. [Quick start](#quick-start-python-client)
7. [Actions](#actions-and-fields)
8. [Observation](#observation)
9. [Reward (formula summary)](#reward-formula-summary)
10. [HTTP vs WebSocket](#http-vs-websocket-episode-state)
11. [Running and testing locally](#running-and-testing-locally)
12. [Hugging Face Spaces](#hugging-face-spaces)
13. [Scenarios](#scenarios)
14. [Project layout](#project-layout)
15. [Resources & references](#resources--references)
16. [License](#license)
---
## Criterion: Environment Innovation (40%)
**Weight:** 40%
**What it means:**
- Is the environment novel, creative, or genuinely challenging?
- Does it meaningfully test agent behavior in a way that hasn't been done before?
### How Ghostexec answers this
**Challenging world.** The policy sees **one dense natural-language briefing** per step (emails, calendar overlaps, contacts with mood, overdue tasks, stress, steps remaining)—not a JSON dump of the world. It must **ground** decisions in real ids from that text, return **valid typed actions**, and accept **time pressure** and **social fallout** when meetings move or mail goes unanswered. Invalid actions **do not crash** the server; they return structured errors so learning signals stay intact.
**Meaningful behavior, not a toy Q&A.** Success needs **comprehension + tool discipline**: legal JSON schema, multi-step **sequences** (WebSocket sessions for real episodes), and **tradeoffs** across channels (mail vs calendar vs tasks vs relationships). **`do_nothing` is penalised** so “safe” idleness is costly when fires are burning.
**Dynamics, not a static paragraph.** After each valid action, the simulation **advances the clock**, updates **moods**, rebuilds **conflicts**, and can apply **scenario-driven drift** (`after_step` events in JSON): shifted meetings, new deadlines, preference changes—so the agent is tested on **adaptation**, not memorizing the first screen.
**Dual evaluation.** **Dense step rewards** in `server/reward.py` teach fine structure; **trajectory graders** in `graders.py` return scores strictly in **`(0.01, 0.99)`** per OpenEnv task wiring in `openenv.yaml`. Agents learn from the dense signal; judges get bounded certification scores.
**Honest novelty claim.** Inboxes and calendars are familiar **ingredients**. What is less common is the **composition**: OpenEnv-native packaging, **plain-text-only** observations, **data-defined** scenarios, live dynamics + drift, dual reward/grader stack, and a **transactional** action API in one trainable, hostable environment.
### Task ladder (difficulty in data)
| Task id | Difficulty | Scenario | What gets harder |
|---------|------------|----------|------------------|
| `phase2_core` | easy | `scenarios/phase2_core.json` | Dense triage: VIP mail, calendar relief, overlapping work. |
| `monday_morning` | medium | `scenarios/monday_morning.json` | Stacked Monday rush, less slack. |
| `dinner_disaster` | hard | `scenarios/dinner_disaster.json` | Personal vs professional collision, escalation risk. |
### 5-minute verification checklist
1. **`openenv.yaml`** — three tasks, `max_steps`, `app: server.app:app`, `name: ghostexec`, grader paths.
2. **`scenarios/*.json`** — world content is **data**, not hardcoded lore in Python.
3. **`server/ghostexec_environment.py`** — `build_briefing_text`, `_apply_action`, post-step dynamics, schema drift hooks.
4. **`server/reward.py`** — fixed 0.35 / 0.35 / 0.30 core, invalid / idle handling, shaping caps.
5. **`graders.py`** — bounded grader outputs, trajectory consumption.
6. **Live Space** — `/docs` or `POST /reset` + `POST /step`: legal steps change state; illegal steps return errors, not stack traces.
For a **standalone** walkthrough of the innovation angle only, see **[environment-innovation/README.md](environment-innovation/README.md)**.
---
## Criterion: Storytelling & Presentation (30%)
**Weight:** 30%
**What it means:**
- Can you clearly explain the problem, the environment, and what the agent learned?
- Is the demo engaging and easy to follow for a non-technical audience?
### The problem (plain language)
An executive’s day is **messy**: urgent email from a board member, a double-booked calendar, a spouse texting about dinner, a report due at noon, and every choice **ripples**—someone feels heard or ignored, a conflict gets better or worse, a task slips or gets done. Ghostexec turns that into a **small simulator** the model must **run**, not a single paragraph to summarize.
### The environment (one sentence)
**You read a realistic staff briefing; you pick one legal “move” (reply, reschedule, delegate, …); the world updates; you get a score that reflects tension across work, people, and tasks.**
### What the agent is supposed to learn
- **Read carefully** — wrong `email_id` / `meeting_id` / `task_id` fails cleanly with feedback.
- **Act under pressure** — clock, `max_steps`, and stress push toward decisions, not endless analysis.
- **Balance competing goals** — improving relationships can conflict with clearing the calendar or finishing tasks; rewards encode that tradeoff.
- **Recover from change** — drift events mean the “right” plan from step 1 may not stay right at step 8.
### Demo tips for a non-technical audience
1. **Show the briefing first** — let viewers see the same wall of text the model sees (relatable chaos).
2. **Show one good step vs one bad step** — e.g. thoughtful reply vs invalid id or `do_nothing` while critical mail waits (mood / reward visibly differ).
3. **Name the three “channels”** — calmer calendar, happier stakeholders, tasks moving forward—without math jargon.
4. **End on “what improved”** — after training, pick the same scenario and show fewer invalid steps, higher rewards, or a grader curve (ties to the 20% section below).
### Hackathon alignment (themes)
**Theme fit (examples):** Ghostexec fits **Theme 3.2 — Personalized tasks** (executive-style inbox, calendar, delegation). **Theme 4** is partially supported via `GHOSTEXEC_CURRICULUM`, `GHOSTEXEC_PERTURB`, and diverse `scenarios/`.
---
## Criterion: Showing Improvement in Rewards (20%)
**Weight:** 20%
**What it means:**
- Is there observable evidence of training progress? Reward curves, before/after behavior, comparison against a baseline—anything that proves the agent learned something.
### Where evidence lives in this repo
| Artifact | Role |
|----------|------|
| `outputs/logs/episode_rewards.jsonl` | Per-step reward trace (gitignored); use for **reward curves** and component debugging. |
| `outputs/trainer_state.json` / training logs | Produced by training scripts when configured; feed into plotting. |
| `outputs/reward_log.csv` | Optional CSV companion for plotting pipelines. |
| `outputs/compliance_manifest.json` | Baseline / compliance metadata for **comparison** charts. |
| `outputs/plots/*.png` | Generated report figures (see command below). |
**Plot pack (loss + reward + components + baseline bar):**
```bash
uv run python scripts/plot_training_report.py \
--trainer-history outputs/trainer_state.json \
--reward-csv outputs/reward_log.csv \
--baselines-json outputs/compliance_manifest.json \
--out-dir outputs/plots
```
Writes `loss_curve.png`, `reward_curve.png`, `components_curve.png`, `baseline_comparison.png` under `outputs/plots/`.
**End-to-end notebook:** [`notebooks/ghostexec_unsloth_grpo_hf_api.ipynb`](notebooks/ghostexec_unsloth_grpo_hf_api.ipynb) is intended to **Run All** without manual steps (per project convention).
**Before / after narrative for judges:** same `task_id` and seed—show **lower invalid rate**, **higher mean step reward**, or **clearer grader trajectory** after finetuning. Pair numbers with **one short clip** of two runs side by side on the Space or local server.
---
## Criterion: Reward & Training Pipeline (10%)
**Weight:** 10%
**What it means:**
- Is the reward logic coherent?
- Does the pipeline produce meaningful improvement in the trained agent's behavior?
### Reward logic (coherent and inspectable)
Phase-4 scoring in `server/reward.py` uses a **fixed** core blend:
\[
\text{weighted base} = 0.35 \cdot \text{conflict} + 0.35 \cdot \text{relationship} + 0.30 \cdot \text{task}
\]
Then bounded shaping, invalid-step handling, and explicit penalties (including **`do_nothing`**). Components surface on `RewardBreakdown` and in observation **metadata** where configured—so “why did this step score X?” is **auditable**, not a black box.
Design rationale is aligned with dense reward-shaping practice (see [arXiv:2408.10215](https://arxiv.org/abs/2408.10215))—fixed channel weights, bounded magnitudes, sparse end-of-episode avoided for training.
### Training pipeline (entrypoints)
| Step | Command / artifact |
|------|---------------------|
| Install | `uv sync` (from repo root) |
| Server (matches Dockerfile) | `uv run server --port 8000` |
| SFT → GRPO script | `uv run python scripts/train_sft_then_grpo.py` (see [Running and testing locally](#running-and-testing-locally) for a full example invocation) |
| Tests | `uv run pytest tests/ -q` |
| Docker build gate | `GHOSTEXEC_RUN_DOCKER_BUILD=1 uv run pytest tests/test_docker_build.py -q` |
The pipeline is **meaningful** when tied to the **20% evidence** above: same env URL, logged rewards, and plots that move in the right direction over training—not when loss alone decreases.
---
## OpenEnv Hackathon themes & checklist
| Item | Status |
|------|--------|
| OpenEnv-based env + `openenv.yaml` | In-repo (`openenv-core[core]>=0.2.3`). |
| Short write-up or <2 min video | **You:** publish and paste URLs in [Deliverables](#deliverables-fill-before-freeze). |
| Public HF Space | [Deliverables](#deliverables-fill-before-freeze); deploy with `openenv push --repo-id /ghostexec`. |
---
## Quick start (Python client)
From the repo root (where `pyproject.toml` lives):
```bash
uv sync
uv run server --port 8000
```
```python
from ghostexec import GhostexecAction, GhostexecEnv
with GhostexecEnv(base_url="http://127.0.0.1:8000") as env:
out = env.reset()
print(out.observation.echoed_message[:500], "…")
step = env.step(
GhostexecAction(
action_type="reply_email",
email_id="e01",
message_body=(
"Marcus — acknowledged. Revised figures and short rationale "
"before noon. — Exec"
),
)
)
print("reward:", step.reward)
print("metadata keys:", sorted((step.observation.metadata or {}).keys()))
```
**Docker (optional):**
```bash
docker build -t ghostexec-env:latest .
```
---
## Actions and fields
`GhostexecAction` (`models.py`):
| `action_type` | Typical fields |
|---------------|----------------|
| `reply_email` | `email_id`, `message_body` |
| `archive_email` | `email_id` |
| `reschedule_meeting` | `meeting_id`, `new_time`, `reason` |
| `cancel_meeting` | `meeting_id`, `reason` |
| `complete_task` | `task_id` |
| `delegate_task` | `task_id`, `contact_name` |
| `send_message` | `contact_name`, `message` |
| `do_nothing` | — (penalised path) |
Malformed HTTP payloads are handled safely so clients do not crash the server.
---
## Observation
- **`echoed_message`** — Full plain-text briefing.
- **`message_length`** — Length of briefing.
- **`reward`**, **`done`**, **`metadata`** — Step outcome; metadata includes `step_ok`, reward breakdown fields, and debug ids.
---
## Reward (formula summary)
Full detail is under [Criterion: Reward & Training Pipeline (10%)](#criterion-reward--training-pipeline-10). Episode logs: `outputs/logs/episode_rewards.jsonl` (gitignored).
---
## HTTP vs WebSocket (episode state)
- **HTTP** `POST /reset` and `POST /step` may use **short-lived** instances; consecutive HTTP calls might not share one in-memory episode.
- **WebSocket `/ws`** (or `GhostexecEnv`) — use for **multi-step episodes** on one session.
Endpoints: **`/web`**, **`/docs`**, **`/health`**, **`/ws`**.
---
## Running and testing locally
```bash
uv run uvicorn ghostexec.server.app:app --reload --host 0.0.0.0 --port 8000
# or
uv run server --port 8000
```
**HTTP smoke:**
```bash
uv run python scripts/http_endpoint_smoke.py --local
```
**Tests:**
```bash
uv run pytest tests/ -q
GHOSTEXEC_RUN_DOCKER_BUILD=1 uv run pytest tests/test_docker_build.py -q
uv run pytest tests/test_live_server_exhaustive.py -v --tb=short # server on :8000
```
**SFT → GRPO (example):**
```bash
uv run python scripts/train_sft_then_grpo.py \
--model-preset small_iter_fast \
--training-preset hackathon_turbo \
--env-url http://127.0.0.1:8000 \
--generate-sft-from-env \
--sft-samples 120 \
--max-sft-steps 60 \
--max-grpo-steps 120 \
--env-reward-scale 1.0 \
--local-reward-scale 0.35 \
--complexity-curriculum easy_to_full \
--curriculum-ramp-ratio 0.60
```
---
## Hugging Face Spaces
```bash
openenv serve
openenv build
openenv validate --verbose
openenv push
# openenv push --repo-id your-username/ghostexec
```
Use a **public** Space for the default hackathon flow. `openenv.yaml` carries **name**, **version**, and **description** for metadata—keep them in sync with submission needs.
---
## Scenarios
| File | Role |
|------|------|
| `scenarios/phase2_core.json` | Default dense fixture |
| `scenarios/monday_morning.json`, `dinner_disaster.json`, `vip_meltdown.json` | Narrative pressure |
| `scenarios/vip_meltdown_drift.json` | Mood / escalation drift |
| `scenarios/schema_drift_test.json` | Drift-event harness |
---
## Project layout
```
ghostexec/
├── openenv.yaml
├── pyproject.toml
├── models.py
├── client.py
├── graders.py
├── scenarios/
├── scripts/
├── notebooks/
├── tests/
└── server/
├── app.py
├── ghostexec_environment.py
├── reward.py
└── Dockerfile
```
---
## Resources & references
- [meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv) — core stack
- [Packaging & Deploying](https://meta-pytorch.org/OpenEnv/auto_getting_started/environment-builder.html)
- [OpenEnv Hub](https://huggingface.co/openenv)
- [Building RL Environments with OpenEnv](https://www.youtube.com/watch?v=0airz7BhBiA) (and related talks linked in prior README iterations)
---
## License
BSD-style — see license notices in source files (Meta / OpenEnv lineage).