title: Ghostexec Environment Server
emoji: 📢
colorFrom: pink
colorTo: yellow
sdk: docker
pinned: false
app_port: 7860
base_path: /web
tags:
  - openenv

Ghostexec

Ghostexec is an OpenEnv-compatible environment: a busy executive chief-of-staff simulator with inbox, calendar, contacts, tasks, and stakeholder moods. The agent must read a plain-text briefing, then emit one structured action per step (reply_email, reschedule_meeting, …). The server returns rewards shaped around conflict, relationships, and tasks—plus trajectory graders for hackathon validation. All episode content lives in scenarios/*.json; the engine is in server/ghostexec_environment.py and server/reward.py.

| Item | Value |
| --- | --- |
| HF Space name / manifest | ghostexec in openenv.yaml |
| Python package | openenv-ghostexec in pyproject.toml (import ghostexec) |
| Public Space | modelbuilderhq/ghostexec |
| Deeper innovation-only brief | environment-innovation/README.md |

Deliverables (fill before freeze)

| Deliverable | URL |
| --- | --- |
| Public HF Space (required) | https://huggingface.co/spaces/modelbuilderhq/ghostexec |
| Write-up / blog (HF post preferred) | TODO: paste your post URL |
| Short demo video (<2 min) | TODO: paste your video URL |

Contents

Judging criteria (this README is organized around them)

  1. Criterion: Environment Innovation (40%)
  2. Criterion: Storytelling & Presentation (30%)
  3. Criterion: Showing Improvement in Rewards (20%)
  4. Criterion: Reward & Training Pipeline (10%)

Reference

  1. Hackathon themes & checklist
  2. Quick start
  3. Actions
  4. Observation
  5. Reward (formula summary)
  6. HTTP vs WebSocket
  7. Running and testing locally
  8. Hugging Face Spaces
  9. Scenarios
  10. Project layout
  11. Resources & references
  12. License

Criterion: Environment Innovation (40%)

Weight: 40%

What it means:

  • Is the environment novel, creative, or genuinely challenging?
  • Does it meaningfully test agent behavior in a way that hasn't been done before?

How Ghostexec answers this

Challenging world. The policy sees one dense natural-language briefing per step (emails, calendar overlaps, contacts with mood, overdue tasks, stress, steps remaining)—not a JSON dump of the world. It must ground decisions in real ids from that text, return valid typed actions, and accept time pressure and social fallout when meetings move or mail goes unanswered. Invalid actions do not crash the server; they return structured errors so learning signals stay intact.

Meaningful behavior, not a toy Q&A. Success needs comprehension + tool discipline: legal JSON schema, multi-step sequences (WebSocket sessions for real episodes), and tradeoffs across channels (mail vs calendar vs tasks vs relationships). do_nothing is penalised so “safe” idleness is costly when fires are burning.

Dynamics, not a static paragraph. After each valid action, the simulation advances the clock, updates moods, rebuilds conflicts, and can apply scenario-driven drift (after_step events in JSON): shifted meetings, new deadlines, preference changes—so the agent is tested on adaptation, not memorizing the first screen.
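
To make the drift mechanism concrete, here is a minimal sketch of how step-indexed events could be applied to a world state. The event keys (`after_step`, `kind`, `meeting_id`, `new_time`, `task_id`, `due`), the world layout, and the function name `apply_drift` are hypothetical and chosen for illustration only; the real schema is whatever scenarios/*.json and server/ghostexec_environment.py define.

```python
from copy import deepcopy

# Hypothetical drift events; the real schema is defined by scenarios/*.json.
DRIFT_EVENTS = [
    {"after_step": 2, "kind": "shift_meeting", "meeting_id": "m01", "new_time": "15:00"},
    {"after_step": 4, "kind": "new_deadline", "task_id": "t03", "due": "12:00"},
]


def apply_drift(world: dict, step_index: int, events: list[dict]) -> dict:
    """Return a copy of `world` with every event scheduled for `step_index` applied."""
    updated = deepcopy(world)
    for event in events:
        if event["after_step"] != step_index:
            continue  # not due yet, or already applied on an earlier step
        if event["kind"] == "shift_meeting":
            updated["meetings"][event["meeting_id"]]["time"] = event["new_time"]
        elif event["kind"] == "new_deadline":
            updated["tasks"][event["task_id"]]["due"] = event["due"]
    return updated


world = {
    "meetings": {"m01": {"time": "10:00"}},
    "tasks": {"t03": {"due": "17:00"}},
}
after_two = apply_drift(world, 2, DRIFT_EVENTS)
print(after_two["meetings"]["m01"]["time"])  # meeting time after the step-2 event
print(world["meetings"]["m01"]["time"])      # the input world is left untouched
```

Keeping the events in data rather than in code is what lets a plan that was right at step 1 become wrong later without any change to the engine.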

Dual evaluation. Dense step rewards in server/reward.py teach fine structure; trajectory graders in graders.py return scores strictly in (0.01, 0.99) per OpenEnv task wiring in openenv.yaml. Agents learn from the dense signal; judges get bounded certification scores.
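
As an illustration of the bounded grader output, the sketch below squeezes a raw trajectory score into the open interval (0.01, 0.99). The function name `bound_grader_score`, the use of a mean over step rewards as the raw score, and the epsilon margin are assumptions made for this example; graders.py may compute and bound its scores differently.

```python
from statistics import mean


def bound_grader_score(raw: float, low: float = 0.01, high: float = 0.99) -> float:
    """Clamp `raw` so the result lies strictly between `low` and `high`."""
    eps = 1e-6  # hypothetical margin that keeps the endpoints excluded
    return min(high - eps, max(low + eps, raw))


# Hypothetical raw score: the mean of the dense step rewards in one trajectory.
step_rewards = [0.5, 0.25, 0.75]
print(bound_grader_score(mean(step_rewards)))  # already inside the interval
print(bound_grader_score(-3.0))                # pulled up to just above 0.01
print(bound_grader_score(7.0))                 # pulled down to just below 0.99
```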

Honest novelty claim. Inboxes and calendars are familiar ingredients. What is less common is the composition: OpenEnv-native packaging, plain-text-only observations, data-defined scenarios, live dynamics + drift, dual reward/grader stack, and a transactional action API in one trainable, hostable environment.

Task ladder (difficulty in data)

| Task id | Difficulty | Scenario | What gets harder |
| --- | --- | --- | --- |
| phase2_core | easy | scenarios/phase2_core.json | Dense triage: VIP mail, calendar relief, overlapping work. |
| monday_morning | medium | scenarios/monday_morning.json | Stacked Monday rush, less slack. |
| dinner_disaster | hard | scenarios/dinner_disaster.json | Personal vs professional collision, escalation risk. |

5-minute verification checklist

  1. openenv.yaml — three tasks, max_steps, app: server.app:app, name: ghostexec, grader paths.
  2. scenarios/*.json — world content is data, not hardcoded lore in Python.
  3. server/ghostexec_environment.py: build_briefing_text, _apply_action, post-step dynamics, schema drift hooks.
  4. server/reward.py — fixed 0.35 / 0.35 / 0.30 core, invalid / idle handling, shaping caps.
  5. graders.py — bounded grader outputs, trajectory consumption.
  6. Live Space /docs, or POST /reset + POST /step: legal steps change state; illegal steps return errors, not stack traces.

For a standalone walkthrough of the innovation angle only, see environment-innovation/README.md.


Criterion: Storytelling & Presentation (30%)

Weight: 30%

What it means:

  • Can you clearly explain the problem, the environment, and what the agent learned?
  • Is the demo engaging and easy to follow for a non-technical audience?

The problem (plain language)

An executive’s day is messy: urgent email from a board member, a double-booked calendar, a spouse texting about dinner, a report due at noon, and every choice ripples—someone feels heard or ignored, a conflict gets better or worse, a task slips or gets done. Ghostexec turns that into a small simulator the model must run, not a single paragraph to summarize.

The environment (one sentence)

You read a realistic staff briefing; you pick one legal “move” (reply, reschedule, delegate, …); the world updates; you get a score that reflects tension across work, people, and tasks.

What the agent is supposed to learn

  • Read carefully — wrong email_id / meeting_id / task_id fails cleanly with feedback.
  • Act under pressure — clock, max_steps, and stress push toward decisions, not endless analysis.
  • Balance competing goals — improving relationships can conflict with clearing the calendar or finishing tasks; rewards encode that tradeoff.
  • Recover from change — drift events mean the “right” plan from step 1 may not stay right at step 8.

Demo tips for a non-technical audience

  1. Show the briefing first — let viewers see the same wall of text the model sees (relatable chaos).
  2. Show one good step vs one bad step — e.g. thoughtful reply vs invalid id or do_nothing while critical mail waits (mood / reward visibly differ).
  3. Name the three “channels” — calmer calendar, happier stakeholders, tasks moving forward—without math jargon.
  4. End on “what improved” — after training, pick the same scenario and show fewer invalid steps, higher rewards, or a grader curve (ties to the 20% section below).

Hackathon alignment (themes)

Theme fit (examples): Ghostexec fits Theme 3.2 — Personalized tasks (executive-style inbox, calendar, delegation). Theme 4 is partially supported via GHOSTEXEC_CURRICULUM, GHOSTEXEC_PERTURB, and diverse scenarios/.


Criterion: Showing Improvement in Rewards (20%)

Weight: 20%

What it means:

  • Is there observable evidence of training progress? Reward curves, before/after behavior, comparison against a baseline—anything that proves the agent learned something.

Where evidence lives in this repo

| Artifact | Role |
| --- | --- |
| outputs/logs/episode_rewards.jsonl | Per-step reward trace (gitignored); use for reward curves and component debugging. |
| outputs/trainer_state.json / training logs | Produced by training scripts when configured; feed into plotting. |
| outputs/reward_log.csv | Optional CSV companion for plotting pipelines. |
| outputs/compliance_manifest.json | Baseline / compliance metadata for comparison charts. |
| outputs/plots/*.png | Generated report figures (see command below). |

Plot pack (loss + reward + components + baseline bar):

uv run python scripts/plot_training_report.py \
  --trainer-history outputs/trainer_state.json \
  --reward-csv outputs/reward_log.csv \
  --baselines-json outputs/compliance_manifest.json \
  --out-dir outputs/plots

Writes loss_curve.png, reward_curve.png, components_curve.png, baseline_comparison.png under outputs/plots/.

End-to-end notebook: notebooks/ghostexec_unsloth_grpo_hf_api.ipynb is intended to Run All without manual steps (per project convention).

Before / after narrative for judges: same task_id and seed—show lower invalid rate, higher mean step reward, or clearer grader trajectory after finetuning. Pair numbers with one short clip of two runs side by side on the Space or local server.
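
The before/after comparison can be reduced to two numbers per run. The sketch below computes invalid rate and mean step reward from JSONL-style records. It assumes each record carries `reward` and `step_ok` keys; `step_ok` is named in the observation metadata described in this README, but the exact layout of outputs/logs/episode_rewards.jsonl is not shown here, so treat the record shape and the function name `summarize_run` as hypothetical.

```python
import json
from statistics import mean


def summarize_run(lines) -> dict:
    """Summarize one run from an iterable of JSON lines (one step per line)."""
    records = [json.loads(line) for line in lines if line.strip()]
    invalid = sum(1 for record in records if not record["step_ok"])
    return {
        "steps": len(records),
        "invalid_rate": invalid / len(records),
        "mean_reward": mean(record["reward"] for record in records),
    }


# Made-up records for illustration only; these are not measured results.
baseline_lines = [
    '{"reward": -0.5, "step_ok": false}',
    '{"reward": 0.25, "step_ok": true}',
    '{"reward": 0.5, "step_ok": true}',
    '{"reward": -0.25, "step_ok": false}',
]
tuned_lines = [
    '{"reward": 0.5, "step_ok": true}',
    '{"reward": 0.25, "step_ok": true}',
    '{"reward": 0.5, "step_ok": true}',
    '{"reward": -0.25, "step_ok": false}',
]

baseline = summarize_run(baseline_lines)
tuned = summarize_run(tuned_lines)
print("baseline:", baseline)
print("tuned:   ", tuned)
```

To use it on a real log, pass an open file handle, for example `summarize_run(open("outputs/logs/episode_rewards.jsonl"))`, after adjusting the key names to match the actual records.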


Criterion: Reward & Training Pipeline (10%)

Weight: 10%

What it means:

  • Is the reward logic coherent?
  • Does the pipeline produce meaningful improvement in the trained agent's behavior?

Reward logic (coherent and inspectable)

Phase-4 scoring in server/reward.py uses a fixed core blend:

\[ \text{weighted base} = 0.35 \cdot \text{conflict} + 0.35 \cdot \text{relationship} + 0.30 \cdot \text{task} \]

Then bounded shaping, invalid-step handling, and explicit penalties (including do_nothing). Components surface on RewardBreakdown and in observation metadata where configured—so “why did this step score X?” is auditable, not a black box.
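
A worked example may help show how the pieces combine. The sketch below applies the 0.35 / 0.35 / 0.30 blend from the formula above, then a capped shaping term and a do_nothing penalty. Only the three weights come from this README; `SHAPING_CAP`, `DO_NOTHING_PENALTY`, `INVALID_STEP_REWARD`, the `StepScore` dataclass, and the order of operations are hypothetical stand-ins, not the values or structure used in server/reward.py.

```python
from dataclasses import dataclass

# Core weights quoted in this README.
W_CONFLICT, W_RELATIONSHIP, W_TASK = 0.35, 0.35, 0.30

# Hypothetical magnitudes, chosen only for illustration.
SHAPING_CAP = 0.10
DO_NOTHING_PENALTY = 0.15
INVALID_STEP_REWARD = -0.25


@dataclass
class StepScore:
    weighted_base: float
    shaping: float
    penalty: float
    total: float


def score_step(conflict, relationship, task, shaping=0.0,
               action_type="reply_email", step_ok=True) -> StepScore:
    if not step_ok:
        # Invalid actions skip the blend and receive a flat negative reward.
        return StepScore(0.0, 0.0, 0.0, INVALID_STEP_REWARD)
    base = W_CONFLICT * conflict + W_RELATIONSHIP * relationship + W_TASK * task
    bounded = max(-SHAPING_CAP, min(SHAPING_CAP, shaping))
    penalty = DO_NOTHING_PENALTY if action_type == "do_nothing" else 0.0
    return StepScore(base, bounded, penalty, base + bounded - penalty)


# 0.35 * 0.6 + 0.35 * 0.4 + 0.30 * 0.5 = 0.21 + 0.14 + 0.15 = 0.50
print(round(score_step(0.6, 0.4, 0.5).total, 2))
# A shaping bonus of 0.3 is capped at SHAPING_CAP before being added.
print(round(score_step(0.6, 0.4, 0.5, shaping=0.3).total, 2))
```

Returning the components alongside the total is what makes a step score auditable: a reader can see whether a low reward came from the blend, the shaping term, or a penalty.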

The design follows dense reward-shaping practice (see arXiv:2408.10215): fixed channel weights, bounded magnitudes, and no reliance on a sparse end-of-episode reward for training.

Training pipeline (entrypoints)

| Step | Command / artifact |
| --- | --- |
| Install | uv sync (from repo root) |
| Server (matches Dockerfile) | uv run server --port 8000 |
| SFT → GRPO script | uv run python scripts/train_sft_then_grpo.py (see Running and testing locally for a full example invocation) |
| Tests | uv run pytest tests/ -q |
| Docker build gate | GHOSTEXEC_RUN_DOCKER_BUILD=1 uv run pytest tests/test_docker_build.py -q |

The pipeline is meaningful when tied to the 20% evidence above: same env URL, logged rewards, and plots that move in the right direction over training—not when loss alone decreases.


OpenEnv Hackathon themes & checklist

| Item | Status |
| --- | --- |
| OpenEnv-based env + openenv.yaml | In-repo (openenv-core[core]>=0.2.3). |
| Short write-up or <2 min video | You: publish and paste URLs in Deliverables. |
| Public HF Space | Deliverables; deploy with openenv push --repo-id <your>/ghostexec. |

Quick start (Python client)

From the repo root (where pyproject.toml lives):

uv sync
uv run server --port 8000

With the server running, connect from a second terminal using the Python client:

from ghostexec import GhostexecAction, GhostexecEnv

with GhostexecEnv(base_url="http://127.0.0.1:8000") as env:
    out = env.reset()
    print(out.observation.echoed_message[:500], "…")

    step = env.step(
        GhostexecAction(
            action_type="reply_email",
            email_id="e01",
            message_body=(
                "Marcus — acknowledged. Revised figures and short rationale "
                "before noon. — Exec"
            ),
        )
    )
    print("reward:", step.reward)
    print("metadata keys:", sorted((step.observation.metadata or {}).keys()))

Docker (optional):

docker build -t ghostexec-env:latest .

Actions and fields

GhostexecAction (models.py):

| action_type | Typical fields |
| --- | --- |
| reply_email | email_id, message_body |
| archive_email | email_id |
| reschedule_meeting | meeting_id, new_time, reason |
| cancel_meeting | meeting_id, reason |
| complete_task | task_id |
| delegate_task | task_id, contact_name |
| send_message | contact_name, message |
| do_nothing | none (penalised path) |

Malformed HTTP payloads are handled safely so clients do not crash the server.
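
A client can catch many malformed actions before they reach the server. The sketch below turns the action table into a lookup and returns a structured result instead of raising. It assumes the "typical" fields are all required, and the result shape (`step_ok`, `error`) and the function name `check_action` are hypothetical; the authoritative validation is the GhostexecAction model in models.py.

```python
# Fields per action_type, copied from the table in this README.
# Treating every listed field as required is an assumption.
REQUIRED_FIELDS = {
    "reply_email": ("email_id", "message_body"),
    "archive_email": ("email_id",),
    "reschedule_meeting": ("meeting_id", "new_time", "reason"),
    "cancel_meeting": ("meeting_id", "reason"),
    "complete_task": ("task_id",),
    "delegate_task": ("task_id", "contact_name"),
    "send_message": ("contact_name", "message"),
    "do_nothing": (),
}


def check_action(payload: dict) -> dict:
    """Return a structured verdict for an action payload without raising."""
    action_type = payload.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        return {"step_ok": False, "error": f"unknown action_type: {action_type!r}"}
    # Empty strings and None both count as missing.
    missing = [name for name in REQUIRED_FIELDS[action_type] if not payload.get(name)]
    if missing:
        return {"step_ok": False, "error": "missing fields: " + ", ".join(missing)}
    return {"step_ok": True, "error": None}


print(check_action({"action_type": "reply_email", "email_id": "e01"}))
print(check_action({"action_type": "teleport"}))
print(check_action({"action_type": "do_nothing"}))
```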


Observation

  • echoed_message — Full plain-text briefing.
  • message_length — Length of briefing.
  • reward, done, metadata — Step outcome; metadata includes step_ok, reward breakdown fields, and debug ids.

Reward (formula summary)

Full detail is under Criterion: Reward & Training Pipeline (10%). Episode logs: outputs/logs/episode_rewards.jsonl (gitignored).


HTTP vs WebSocket (episode state)

  • HTTP POST /reset and POST /step may use short-lived instances; consecutive HTTP calls might not share one in-memory episode.
  • WebSocket /ws (or GhostexecEnv) — use for multi-step episodes on one session.

Endpoints: /web, /docs, /health, /ws.
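
Because multi-step episodes need one persistent session, a rollout loop is usually written against a single env object. The sketch below shows such a loop; it runs against a small stand-in (`StubEnv`) so it is self-contained. `run_episode`, `StubEnv`, and the policy callable are hypothetical. The attribute names (`observation.echoed_message`, `observation.done`, `reward`) follow the fields listed in this README, but where exactly the real client exposes `done` is an assumption.

```python
from types import SimpleNamespace


def run_episode(env, policy, max_steps: int = 10):
    """Roll out one episode on a single session; return (total_reward, steps)."""
    result = env.reset()
    total, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(result.observation.echoed_message)
        result = env.step(action)
        total += result.reward or 0.0
        steps += 1
        if result.observation.done:
            break
    return total, steps


class StubEnv:
    """Stand-in for a session-backed client; ends after `horizon` steps."""

    def __init__(self, horizon: int = 3):
        self.horizon = horizon
        self.t = 0

    def _result(self, reward: float):
        observation = SimpleNamespace(
            echoed_message=f"briefing at step {self.t}",
            done=self.t >= self.horizon,
            metadata={},
        )
        return SimpleNamespace(observation=observation, reward=reward)

    def reset(self):
        self.t = 0
        return self._result(0.0)

    def step(self, action):
        self.t += 1
        return self._result(0.25)


total, steps = run_episode(StubEnv(horizon=3), policy=lambda briefing: {"action_type": "do_nothing"})
print(total, steps)
```

With the real client, the same loop would take a `GhostexecEnv(base_url=...)` instance and a policy that returns a `GhostexecAction`, as in the quick start.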


Running and testing locally

uv run uvicorn ghostexec.server.app:app --reload --host 0.0.0.0 --port 8000
# or
uv run server --port 8000

HTTP smoke:

uv run python scripts/http_endpoint_smoke.py --local

Tests:

uv run pytest tests/ -q
GHOSTEXEC_RUN_DOCKER_BUILD=1 uv run pytest tests/test_docker_build.py -q
uv run pytest tests/test_live_server_exhaustive.py -v --tb=short   # server on :8000

SFT → GRPO (example):

uv run python scripts/train_sft_then_grpo.py \
  --model-preset small_iter_fast \
  --training-preset hackathon_turbo \
  --env-url http://127.0.0.1:8000 \
  --generate-sft-from-env \
  --sft-samples 120 \
  --max-sft-steps 60 \
  --max-grpo-steps 120 \
  --env-reward-scale 1.0 \
  --local-reward-scale 0.35 \
  --complexity-curriculum easy_to_full \
  --curriculum-ramp-ratio 0.60

Hugging Face Spaces

openenv serve
openenv build
openenv validate --verbose
openenv push
# openenv push --repo-id your-username/ghostexec

Use a public Space for the default hackathon flow. openenv.yaml carries name, version, and description for metadata—keep them in sync with submission needs.


Scenarios

| File | Role |
| --- | --- |
| scenarios/phase2_core.json | Default dense fixture |
| scenarios/monday_morning.json, dinner_disaster.json, vip_meltdown.json | Narrative pressure |
| scenarios/vip_meltdown_drift.json | Mood / escalation drift |
| scenarios/schema_drift_test.json | Drift-event harness |

Project layout

ghostexec/
├── openenv.yaml
├── pyproject.toml
├── models.py
├── client.py
├── graders.py
├── scenarios/
├── scripts/
├── notebooks/
├── tests/
└── server/
    ├── app.py
    ├── ghostexec_environment.py
    ├── reward.py
    └── Dockerfile

Resources & references


License

BSD-style — see license notices in source files (Meta / OpenEnv lineage).