| title: Ghostexec Environment Server | |
| emoji: π’ | |
| colorFrom: pink | |
| colorTo: yellow | |
| sdk: docker | |
| pinned: false | |
| app_port: 7860 | |
| base_path: /web | |
| tags: | |
| - openenv | |
| # Ghostexec | |
| **Ghostexec** is an [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compatible environment: a busy **executive chief-of-staff** simulator with inbox, calendar, contacts, tasks, and stakeholder moods. The agent must read a **plain-text briefing**, then emit **one structured action per step** (`reply_email`, `reschedule_meeting`, …). The server returns rewards shaped around **conflict**, **relationships**, and **tasks**, plus trajectory **graders** for hackathon validation. All episode **content** lives in `scenarios/*.json`; the engine is in `server/ghostexec_environment.py` and `server/reward.py`. | |
| | Item | Value | | |
| |------|--------| | |
| | **HF Space name / manifest** | `ghostexec` in [`openenv.yaml`](openenv.yaml) | | |
| | **Python package** | `openenv-ghostexec` in [`pyproject.toml`](pyproject.toml) (import `ghostexec`) | | |
| | **Public Space** | [modelbuilderhq/ghostexec](https://huggingface.co/spaces/modelbuilderhq/ghostexec) | | |
| | **Deeper innovation-only brief** | [`environment-innovation/README.md`](environment-innovation/README.md) | | |
| --- | |
| ## Deliverables (fill before freeze) | |
| | Deliverable | URL | | |
| |-------------|-----| | |
| | Public HF Space (required) | [https://huggingface.co/spaces/modelbuilderhq/ghostexec](https://huggingface.co/spaces/modelbuilderhq/ghostexec) | | |
| | Write-up / blog (HF post preferred) | `TODO: paste your post URL` | | |
| | Short demo video (<2 min) | `TODO: paste your video URL` | | |
| --- | |
| ## Contents | |
| **Judging criteria (this README is organized around them)** | |
| 1. [Criterion: Environment Innovation (40%)](#ghostexec-env-innovation) | |
| 2. [Criterion: Storytelling & Presentation (30%)](#ghostexec-storytelling) | |
| 3. [Criterion: Showing Improvement in Rewards (20%)](#ghostexec-reward-improvement) | |
| 4. [Criterion: Reward & Training Pipeline (10%)](#ghostexec-reward-pipeline) | |
| **Reference** | |
| 5. [Hackathon themes & checklist](#openenv-hackathon-themes--checklist) | |
| 6. [Quick start](#quick-start-python-client) | |
| 7. [Actions](#actions-and-fields) | |
| 8. [Observation](#observation) | |
| 9. [Reward (formula summary)](#reward-formula-summary) | |
| 10. [HTTP vs WebSocket](#http-vs-websocket-episode-state) | |
| 11. [Running and testing locally](#running-and-testing-locally) | |
| 12. [Hugging Face Spaces](#hugging-face-spaces) | |
| 13. [Scenarios](#scenarios) | |
| 14. [Project layout](#project-layout) | |
| 15. [Resources & references](#resources--references) | |
| 16. [License](#license) | |
| --- | |
| ## Criterion: Environment Innovation (40%) | |
| <a id="ghostexec-env-innovation"></a> | |
| **Weight:** 40% | |
| **What it means:** | |
| - Is the environment novel, creative, or genuinely challenging? | |
| - Does it meaningfully test agent behavior in a way that hasn't been done before? | |
| ### How Ghostexec answers this | |
| **Challenging world.** The policy sees **one dense natural-language briefing** per step (emails, calendar overlaps, contacts with mood, overdue tasks, stress, steps remaining), not a JSON dump of the world. It must **ground** decisions in real ids from that text, return **valid typed actions**, and accept **time pressure** and **social fallout** when meetings move or mail goes unanswered. Invalid actions **do not crash** the server; they return structured errors so learning signals stay intact. | |
| **Meaningful behavior, not a toy Q&A.** Success needs **comprehension + tool discipline**: legal JSON schema, multi-step **sequences** (WebSocket sessions for real episodes), and **tradeoffs** across channels (mail vs calendar vs tasks vs relationships). **`do_nothing` is penalised** so "safe" idleness is costly when fires are burning. | |
| **Dynamics, not a static paragraph.** After each valid action, the simulation **advances the clock**, updates **moods**, rebuilds **conflicts**, and can apply **scenario-driven drift** (`after_step` events in JSON): shifted meetings, new deadlines, preference changes. The agent is therefore tested on **adaptation**, not on memorizing the first screen. | |
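As a rough illustration of scenario-driven drift, the sketch below applies `after_step` events to a toy world state. Only the `after_step` key comes from this README; the other event fields (`kind`, `meeting_id`, `new_time`, `task_id`, `due`), the world layout, and the `apply_drift` helper are assumptions made for the example. The real schema lives in `scenarios/*.json` and the engine.

```python
from copy import deepcopy

# Hypothetical drift events; the real schema in scenarios/*.json may differ.
DRIFT_EVENTS = [
    {"after_step": 2, "kind": "shift_meeting", "meeting_id": "m01", "new_time": "15:00"},
    {"after_step": 3, "kind": "new_deadline", "task_id": "t01", "due": "11:00"},
]


def apply_drift(world, events, step):
    """Return a copy of `world` with every event scheduled for `step` applied."""
    updated = deepcopy(world)
    for event in events:
        if event["after_step"] != step:
            continue
        if event["kind"] == "shift_meeting":
            updated["meetings"][event["meeting_id"]]["time"] = event["new_time"]
        elif event["kind"] == "new_deadline":
            updated["tasks"][event["task_id"]]["due"] = event["due"]
    return updated


world = {
    "meetings": {"m01": {"time": "10:00"}},
    "tasks": {"t01": {"due": "17:00"}},
}
after_two = apply_drift(world, DRIFT_EVENTS, step=2)
print(after_two["meetings"]["m01"]["time"])  # 15:00
```

The point of the sketch is that a plan made from the first briefing can become stale: the meeting the agent reasoned about at step 1 has moved by step 2.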
| **Dual evaluation.** **Dense step rewards** in `server/reward.py` teach fine structure; **trajectory graders** in `graders.py` return scores strictly in **`(0.01, 0.99)`** per OpenEnv task wiring in `openenv.yaml`. Agents learn from the dense signal; judges get bounded certification scores. | |
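One simple way to keep a score strictly inside `(0.01, 0.99)` is to clamp it with a small margin. The helper below is a hypothetical sketch, not the code in `graders.py`; the name `bound_score` and the `eps` margin are assumptions.

```python
def bound_score(raw, low=0.01, high=0.99, eps=1e-6):
    """Squeeze a raw score into the open interval (low, high)."""
    return min(max(raw, low + eps), high - eps)


for raw in (-5.0, 0.5, 5.0):
    print(raw, bound_score(raw))
```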
| **Honest novelty claim.** Inboxes and calendars are familiar **ingredients**. What is less common is the **composition**: OpenEnv-native packaging, **plain-text-only** observations, **data-defined** scenarios, live dynamics + drift, dual reward/grader stack, and a **transactional** action API in one trainable, hostable environment. | |
| ### Task ladder (difficulty in data) | |
| | Task id | Difficulty | Scenario | What gets harder | | |
| |---------|------------|----------|------------------| | |
| | `phase2_core` | easy | `scenarios/phase2_core.json` | Dense triage: VIP mail, calendar relief, overlapping work. | | |
| | `monday_morning` | medium | `scenarios/monday_morning.json` | Stacked Monday rush, less slack. | | |
| | `dinner_disaster` | hard | `scenarios/dinner_disaster.json` | Personal vs professional collision, escalation risk. | | |
| ### 5-minute verification checklist | |
| 1. **`openenv.yaml`**: three tasks, `max_steps`, `app: server.app:app`, `name: ghostexec`, grader paths. | |
| 2. **`scenarios/*.json`**: world content is **data**, not hardcoded lore in Python. | |
| 3. **`server/ghostexec_environment.py`**: `build_briefing_text`, `_apply_action`, post-step dynamics, schema drift hooks. | |
| 4. **`server/reward.py`**: fixed 0.35 / 0.35 / 0.30 core, invalid / idle handling, shaping caps. | |
| 5. **`graders.py`**: bounded grader outputs, trajectory consumption. | |
| 6. **Live Space**: `/docs` or `POST /reset` + `POST /step`: legal steps change state; illegal steps return errors, not stack traces. | |
| For a **standalone** walkthrough of the innovation angle only, see **[environment-innovation/README.md](environment-innovation/README.md)**. | |
| --- | |
| ## Criterion: Storytelling & Presentation (30%) | |
| <a id="ghostexec-storytelling"></a> | |
| **Weight:** 30% | |
| **What it means:** | |
| - Can you clearly explain the problem, the environment, and what the agent learned? | |
| - Is the demo engaging and easy to follow for a non-technical audience? | |
| ### The problem (plain language) | |
| An executive's day is **messy**: urgent email from a board member, a double-booked calendar, a spouse texting about dinner, a report due at noon. Every choice **ripples**: someone feels heard or ignored, a conflict gets better or worse, a task slips or gets done. Ghostexec turns that into a **small simulator** the model must **run**, not a single paragraph to summarize. | |
| ### The environment (one sentence) | |
| **You read a realistic staff briefing; you pick one legal "move" (reply, reschedule, delegate, …); the world updates; you get a score that reflects tension across work, people, and tasks.** | |
| ### What the agent is supposed to learn | |
| - **Read carefully**: a wrong `email_id` / `meeting_id` / `task_id` fails cleanly with feedback. | |
| - **Act under pressure**: clock, `max_steps`, and stress push toward decisions, not endless analysis. | |
| - **Balance competing goals**: improving relationships can conflict with clearing the calendar or finishing tasks; rewards encode that tradeoff. | |
| - **Recover from change**: drift events mean the "right" plan from step 1 may not stay right at step 8. | |
| ### Demo tips for a non-technical audience | |
| 1. **Show the briefing first**: let viewers see the same wall of text the model sees (relatable chaos). | |
| 2. **Show one good step vs one bad step**: e.g. a thoughtful reply vs an invalid id or `do_nothing` while critical mail waits (mood / reward visibly differ). | |
| 3. **Name the three "channels"**: calmer calendar, happier stakeholders, tasks moving forward, all without math jargon. | |
| 4. **End on "what improved"**: after training, pick the same scenario and show fewer invalid steps, higher rewards, or a grader curve (ties to the 20% section below). | |
| ### Hackathon alignment (themes) | |
| **Theme fit (examples):** Ghostexec fits **Theme 3.2: Personalized tasks** (executive-style inbox, calendar, delegation). **Theme 4** is partially supported via `GHOSTEXEC_CURRICULUM`, `GHOSTEXEC_PERTURB`, and diverse `scenarios/`. | |
| --- | |
| ## Criterion: Showing Improvement in Rewards (20%) | |
| <a id="ghostexec-reward-improvement"></a> | |
| **Weight:** 20% | |
| **What it means:** | |
| - Is there observable evidence of training progress? Reward curves, before/after behavior, comparison against a baseline: anything that proves the agent learned something. | |
| ### Where evidence lives in this repo | |
| | Artifact | Role | | |
| |----------|------| | |
| | `outputs/logs/episode_rewards.jsonl` | Per-step reward trace (gitignored); use for **reward curves** and component debugging. | | |
| | `outputs/trainer_state.json` / training logs | Produced by training scripts when configured; feed into plotting. | | |
| | `outputs/reward_log.csv` | Optional CSV companion for plotting pipelines. | | |
| | `outputs/compliance_manifest.json` | Baseline / compliance metadata for **comparison** charts. | | |
| | `outputs/plots/*.png` | Generated report figures (see command below). | | |
| **Plot pack (loss + reward + components + baseline bar):** | |
| ```bash | |
| uv run python scripts/plot_training_report.py \ | |
| --trainer-history outputs/trainer_state.json \ | |
| --reward-csv outputs/reward_log.csv \ | |
| --baselines-json outputs/compliance_manifest.json \ | |
| --out-dir outputs/plots | |
| ``` | |
| Writes `loss_curve.png`, `reward_curve.png`, `components_curve.png`, `baseline_comparison.png` under `outputs/plots/`. | |
| **End-to-end notebook:** [`notebooks/ghostexec_unsloth_grpo_hf_api.ipynb`](notebooks/ghostexec_unsloth_grpo_hf_api.ipynb) is intended to **Run All** without manual steps (per project convention). | |
| **Before / after narrative for judges:** use the same `task_id` and seed, then show a **lower invalid rate**, a **higher mean step reward**, or a **clearer grader trajectory** after finetuning. Pair the numbers with **one short clip** of two runs side by side on the Space or local server. | |
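The two headline numbers (mean step reward and invalid rate) can be computed from a per-step JSONL log with a few lines of Python. This is a minimal sketch under assumptions: it supposes each line of `outputs/logs/episode_rewards.jsonl` is a JSON object with a numeric `reward` and a boolean `step_ok`. Those key names are guesses based on the metadata fields described in this README, so adjust them to the actual log format.

```python
import json
from pathlib import Path


def summarize_rewards(path):
    """Return (mean step reward, invalid-step rate) for one JSONL log."""
    rewards, invalid = [], 0
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        rewards.append(float(record["reward"]))  # assumed key name
        # Assumption: a missing `step_ok` is treated as a valid step.
        if not record.get("step_ok", True):
            invalid += 1
    if not rewards:
        return 0.0, 0.0
    return sum(rewards) / len(rewards), invalid / len(rewards)


if __name__ == "__main__":
    log = Path("outputs/logs/episode_rewards.jsonl")
    if log.exists():
        mean_reward, invalid_rate = summarize_rewards(log)
        print(f"mean reward {mean_reward:.3f}, invalid rate {invalid_rate:.1%}")
```

Running it once on a baseline log and once on a post-training log gives the two pairs of numbers for a before / after table.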
| --- | |
| ## Criterion: Reward & Training Pipeline (10%) | |
| <a id="ghostexec-reward-pipeline"></a> | |
| **Weight:** 10% | |
| **What it means:** | |
| - Is the reward logic coherent? | |
| - Does the pipeline produce meaningful improvement in the trained agent's behavior? | |
| ### Reward logic (coherent and inspectable) | |
| Phase-4 scoring in `server/reward.py` uses a **fixed** core blend: | |
| \[ | |
| \text{weighted base} = 0.35 \cdot \text{conflict} + 0.35 \cdot \text{relationship} + 0.30 \cdot \text{task} | |
| \] | |
| Then bounded shaping, invalid-step handling, and explicit penalties (including **`do_nothing`**). Components surface on `RewardBreakdown` and in observation **metadata** where configured, so "why did this step score X?" is **auditable**, not a black box. | |
| Design rationale is aligned with dense reward-shaping practice (see [arXiv:2408.10215](https://arxiv.org/abs/2408.10215)): fixed channel weights, bounded magnitudes, sparse end-of-episode avoided for training. | |
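The core blend can be checked by hand with a short sketch. The weights are the ones stated above; the function name `weighted_base` and the example component values are illustrative assumptions, and shaping, invalid-step handling, and penalties from `server/reward.py` are deliberately left out.

```python
# Fixed channel weights from the formula above.
WEIGHTS = {"conflict": 0.35, "relationship": 0.35, "task": 0.30}


def weighted_base(conflict, relationship, task):
    """Core blend only; shaping and penalties are not modeled here."""
    return (
        WEIGHTS["conflict"] * conflict
        + WEIGHTS["relationship"] * relationship
        + WEIGHTS["task"] * task
    )


# Worked example: 0.35*0.8 + 0.35*0.5 + 0.30*0.2 = 0.28 + 0.175 + 0.06
print(round(weighted_base(0.8, 0.5, 0.2), 3))  # 0.515
```

Because the weights sum to 1.0, the weighted base stays within the range of the component scores, which is one reason a fixed blend is easy to audit.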
| ### Training pipeline (entrypoints) | |
| | Step | Command / artifact | | |
| |------|---------------------| | |
| | Install | `uv sync` (from repo root) | | |
| | Server (matches Dockerfile) | `uv run server --port 8000` | | |
| | SFT → GRPO script | `uv run python scripts/train_sft_then_grpo.py` (see [Running and testing locally](#running-and-testing-locally) for a full example invocation) | |
| | Tests | `uv run pytest tests/ -q` | | |
| | Docker build gate | `GHOSTEXEC_RUN_DOCKER_BUILD=1 uv run pytest tests/test_docker_build.py -q` | | |
| The pipeline is **meaningful** when tied to the **20% evidence** above: same env URL, logged rewards, and plots that move in the right direction over training. A falling loss alone is not enough. | |
| --- | |
| ## OpenEnv Hackathon themes & checklist | |
| | Item | Status | | |
| |------|--------| | |
| | OpenEnv-based env + `openenv.yaml` | In-repo (`openenv-core[core]>=0.2.3`). | | |
| | Short write-up or <2 min video | **You:** publish and paste URLs in [Deliverables](#deliverables-fill-before-freeze). | | |
| | Public HF Space | [Deliverables](#deliverables-fill-before-freeze); deploy with `openenv push --repo-id <your>/ghostexec`. | | |
| --- | |
| ## Quick start (Python client) | |
| From the repo root (where `pyproject.toml` lives): | |
| ```bash | |
| uv sync | |
| uv run server --port 8000 | |
| ``` | |
| ```python | |
| from ghostexec import GhostexecAction, GhostexecEnv | |
| with GhostexecEnv(base_url="http://127.0.0.1:8000") as env: | |
|     out = env.reset() | |
|     print(out.observation.echoed_message[:500], "…") | |
|     step = env.step( | |
|         GhostexecAction( | |
|             action_type="reply_email", | |
|             email_id="e01", | |
|             message_body=( | |
|                 "Marcus, acknowledged. Revised figures and short rationale " | |
|                 "before noon. - Exec" | |
|             ), | |
|         ) | |
|     ) | |
|     print("reward:", step.reward) | |
|     print("metadata keys:", sorted((step.observation.metadata or {}).keys())) | |
| ``` | |
| **Docker (optional):** | |
| ```bash | |
| docker build -t ghostexec-env:latest . | |
| ``` | |
| --- | |
| ## Actions and fields | |
| `GhostexecAction` (`models.py`): | |
| | `action_type` | Typical fields | | |
| |---------------|----------------| | |
| | `reply_email` | `email_id`, `message_body` | | |
| | `archive_email` | `email_id` | | |
| | `reschedule_meeting` | `meeting_id`, `new_time`, `reason` | | |
| | `cancel_meeting` | `meeting_id`, `reason` | | |
| | `complete_task` | `task_id` | | |
| | `delegate_task` | `task_id`, `contact_name` | | |
| | `send_message` | `contact_name`, `message` | | |
| | `do_nothing` | none (penalised path) | |
| Malformed HTTP payloads are handled safely so clients do not crash the server. | |
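A client can catch incomplete actions before sending them. The sketch below mirrors the table above as a plain dictionary and treats every "typical" field as required, which is an assumption: the actual validation rules live in `models.py` and the server. The helper name `check_action` is hypothetical.

```python
# Field lists copied from the actions table; treating them as required is an assumption.
REQUIRED_FIELDS = {
    "reply_email": ("email_id", "message_body"),
    "archive_email": ("email_id",),
    "reschedule_meeting": ("meeting_id", "new_time", "reason"),
    "cancel_meeting": ("meeting_id", "reason"),
    "complete_task": ("task_id",),
    "delegate_task": ("task_id", "contact_name"),
    "send_message": ("contact_name", "message"),
    "do_nothing": (),
}


def check_action(action):
    """Return a list of problems in an action dict; an empty list means it looks complete."""
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        return [f"unknown action_type: {action_type!r}"]
    return [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS[action_type]
        if not action.get(name)
    ]


print(check_action({"action_type": "reply_email", "email_id": "e01"}))
# ['missing field: message_body']
```

A pre-check like this only covers shape. Whether `e01` is a real id in the current briefing is still decided by the environment.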
| --- | |
| ## Observation | |
| - **`echoed_message`**: Full plain-text briefing. | |
| - **`message_length`**: Length of briefing. | |
| - **`reward`**, **`done`**, **`metadata`**: Step outcome; metadata includes `step_ok`, reward breakdown fields, and debug ids. | |
| --- | |
| ## Reward (formula summary) | |
| Full detail is under [Criterion: Reward & Training Pipeline (10%)](#criterion-reward--training-pipeline-10). Episode logs: `outputs/logs/episode_rewards.jsonl` (gitignored). | |
| --- | |
| ## HTTP vs WebSocket (episode state) | |
| - **HTTP** `POST /reset` and `POST /step` may use **short-lived** instances; consecutive HTTP calls might not share one in-memory episode. | |
| - **WebSocket `/ws`** (or `GhostexecEnv`): use for **multi-step episodes** on one session. | |
| Endpoints: **`/web`**, **`/docs`**, **`/health`**, **`/ws`**. | |
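A multi-step episode on one session can be driven by a small loop. This is a sketch under assumptions: it expects `env.reset()` and `env.step()` to return objects exposing `observation.echoed_message`, `reward`, and `done`, as the quick start and the Observation section suggest. The helper `run_episode`, the `policy` callable, and the default `max_steps=20` are hypothetical; the real step limit comes from `openenv.yaml`.

```python
def run_episode(env, policy, max_steps=20):
    """Drive one episode on a single session and return the list of step rewards."""
    result = env.reset()
    rewards = []
    for _ in range(max_steps):
        # The policy maps the plain-text briefing to one action.
        action = policy(result.observation.echoed_message)
        result = env.step(action)
        rewards.append(result.reward)
        if result.done:
            break
    return rewards
```

With the client from the quick start, this would be called as `run_episode(env, my_policy)` inside the `with GhostexecEnv(...) as env:` block, where `my_policy` returns a `GhostexecAction`. Keeping the whole loop on one `env` is what guarantees that consecutive steps share the same episode state.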
| --- | |
| ## Running and testing locally | |
| ```bash | |
| uv run uvicorn ghostexec.server.app:app --reload --host 0.0.0.0 --port 8000 | |
| # or | |
| uv run server --port 8000 | |
| ``` | |
| **HTTP smoke:** | |
| ```bash | |
| uv run python scripts/http_endpoint_smoke.py --local | |
| ``` | |
| **Tests:** | |
| ```bash | |
| uv run pytest tests/ -q | |
| GHOSTEXEC_RUN_DOCKER_BUILD=1 uv run pytest tests/test_docker_build.py -q | |
| uv run pytest tests/test_live_server_exhaustive.py -v --tb=short # server on :8000 | |
| ``` | |
| **SFT → GRPO (example):** | |
| ```bash | |
| uv run python scripts/train_sft_then_grpo.py \ | |
| --model-preset small_iter_fast \ | |
| --training-preset hackathon_turbo \ | |
| --env-url http://127.0.0.1:8000 \ | |
| --generate-sft-from-env \ | |
| --sft-samples 120 \ | |
| --max-sft-steps 60 \ | |
| --max-grpo-steps 120 \ | |
| --env-reward-scale 1.0 \ | |
| --local-reward-scale 0.35 \ | |
| --complexity-curriculum easy_to_full \ | |
| --curriculum-ramp-ratio 0.60 | |
| ``` | |
| --- | |
| ## Hugging Face Spaces | |
| ```bash | |
| openenv serve | |
| openenv build | |
| openenv validate --verbose | |
| openenv push | |
| # openenv push --repo-id your-username/ghostexec | |
| ``` | |
| Use a **public** Space for the default hackathon flow. `openenv.yaml` carries **name**, **version**, and **description** for metadata; keep them in sync with submission needs. | |
| --- | |
| ## Scenarios | |
| | File | Role | | |
| |------|------| | |
| | `scenarios/phase2_core.json` | Default dense fixture | | |
| | `scenarios/monday_morning.json`, `dinner_disaster.json`, `vip_meltdown.json` | Narrative pressure | | |
| | `scenarios/vip_meltdown_drift.json` | Mood / escalation drift | | |
| | `scenarios/schema_drift_test.json` | Drift-event harness | | |
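Since scenarios are plain JSON, they can be inspected without starting the server. The sketch below lists each scenario file with its top-level keys. It assumes nothing about the schema beyond "each file is a JSON document", and the helper name `summarize_scenarios` is hypothetical.

```python
import json
from pathlib import Path


def summarize_scenarios(directory="scenarios"):
    """Map each scenario file name to the sorted top-level keys of its JSON."""
    summary = {}
    for path in sorted(Path(directory).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        # Non-object documents (e.g. a bare list) get an empty key list.
        summary[path.name] = sorted(data) if isinstance(data, dict) else []
    return summary


if __name__ == "__main__":
    for name, keys in summarize_scenarios().items():
        print(name, keys)
```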
| --- | |
| ## Project layout | |
| ``` | |
| ghostexec/ | |
| ├── openenv.yaml | |
| ├── pyproject.toml | |
| ├── models.py | |
| ├── client.py | |
| ├── graders.py | |
| ├── scenarios/ | |
| ├── scripts/ | |
| ├── notebooks/ | |
| ├── tests/ | |
| └── server/ | |
|     ├── app.py | |
|     ├── ghostexec_environment.py | |
|     ├── reward.py | |
|     └── Dockerfile | |
| ``` | |
| --- | |
| ## Resources & references | |
| - [meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv): core stack | |
| - [Packaging & Deploying](https://meta-pytorch.org/OpenEnv/auto_getting_started/environment-builder.html) | |
| - [OpenEnv Hub](https://huggingface.co/openenv) | |
| - [Building RL Environments with OpenEnv](https://www.youtube.com/watch?v=0airz7BhBiA) (and related talks linked in prior README iterations) | |
| --- | |
| ## License | |
| BSD-style; see license notices in source files (Meta / OpenEnv lineage). | |