Spaces:

modelbuilderhq
/

ghostexec

Sleeping

App Files Files Community

ghostexec / README.md

modelbuilderhq

Upload folder using huggingface_hub

d815df7 verified 28 days ago

17.1 kB

title: Ghostexec Environment Server
emoji: 📢
colorFrom: pink
colorTo: yellow
sdk: docker
pinned: false
app_port: 7860
base_path: /web
tags:
  - openenv

Ghostexec

Ghostexec is an OpenEnv-compatible environment: a busy executive chief-of-staff simulator with inbox, calendar, contacts, tasks, and stakeholder moods. The agent must read a plain-text briefing, then emit one structured action per step (reply_email, reschedule_meeting, …). The server returns rewards shaped around conflict, relationships, and tasks—plus trajectory graders for hackathon validation. All episode content lives in scenarios/*.json; the engine is in server/ghostexec_environment.py and server/reward.py.

Item	Value
HF Space name / manifest	`ghostexec` in `openenv.yaml`
Python package	`openenv-ghostexec` in `pyproject.toml` (import `ghostexec`)
Public Space	modelbuilderhq/ghostexec
Deeper innovation-only brief	`environment-innovation/README.md`

Deliverables (fill before freeze)

Deliverable	URL
Public HF Space (required)	https://huggingface.co/spaces/modelbuilderhq/ghostexec
Write-up / blog (HF post preferred)	`TODO: paste your post URL`
Short demo video (<2 min)	`TODO: paste your video URL`

Judging criteria (this README is organized around them)

Criterion: Environment Innovation (40%)
Criterion: Storytelling & Presentation (30%)
Criterion: Showing Improvement in Rewards (20%)
Criterion: Reward & Training Pipeline (10%)

Reference

Hackathon themes & checklist
Quick start
Actions
Observation
Reward (formula summary)
HTTP vs WebSocket
Running and testing locally
Hugging Face Spaces
Scenarios
Project layout
Resources & references
License

Criterion: Environment Innovation (40%)

Weight: 40%

What it means:

Is the environment novel, creative, or genuinely challenging?
Does it meaningfully test agent behavior in a way that hasn't been done before?

How Ghostexec answers this

Challenging world. The policy sees one dense natural-language briefing per step (emails, calendar overlaps, contacts with mood, overdue tasks, stress, steps remaining)—not a JSON dump of the world. It must ground decisions in real ids from that text, return valid typed actions, and accept time pressure and social fallout when meetings move or mail goes unanswered. Invalid actions do not crash the server; they return structured errors so learning signals stay intact.

Meaningful behavior, not a toy Q&A. Success needs comprehension + tool discipline: legal JSON schema, multi-step sequences (WebSocket sessions for real episodes), and tradeoffs across channels (mail vs calendar vs tasks vs relationships). do_nothing is penalised so “safe” idleness is costly when fires are burning.

Dynamics, not a static paragraph. After each valid action, the simulation advances the clock, updates moods, rebuilds conflicts, and can apply scenario-driven drift (after_step events in JSON): shifted meetings, new deadlines, preference changes—so the agent is tested on adaptation, not memorizing the first screen.

Dual evaluation. Dense step rewards in server/reward.py teach fine structure; trajectory graders in graders.py return scores strictly in (0.01, 0.99) per OpenEnv task wiring in openenv.yaml. Agents learn from the dense signal; judges get bounded certification scores.

Honest novelty claim. Inboxes and calendars are familiar ingredients. What is less common is the composition: OpenEnv-native packaging, plain-text-only observations, data-defined scenarios, live dynamics + drift, dual reward/grader stack, and a transactional action API in one trainable, hostable environment.

Task ladder (difficulty in data)

Task id	Difficulty	Scenario	What gets harder
`phase2_core`	easy	`scenarios/phase2_core.json`	Dense triage: VIP mail, calendar relief, overlapping work.
`monday_morning`	medium	`scenarios/monday_morning.json`	Stacked Monday rush, less slack.
`dinner_disaster`	hard	`scenarios/dinner_disaster.json`	Personal vs professional collision, escalation risk.

5-minute verification checklist

openenv.yaml — three tasks, max_steps, app: server.app:app, name: ghostexec, grader paths.
scenarios/*.json — world content is data, not hardcoded lore in Python.
server/ghostexec_environment.py — build_briefing_text, _apply_action, post-step dynamics, schema drift hooks.
server/reward.py — fixed 0.35 / 0.35 / 0.30 core, invalid / idle handling, shaping caps.
graders.py — bounded grader outputs, trajectory consumption.
Live Space — /docs or POST /reset + POST /step: legal steps change state; illegal steps return errors, not stack traces.

For a standalone walkthrough of the innovation angle only, see environment-innovation/README.md.

Criterion: Storytelling & Presentation (30%)

Weight: 30%

What it means:

Can you clearly explain the problem, the environment, and what the agent learned?
Is the demo engaging and easy to follow for a non-technical audience?

The problem (plain language)

An executive’s day is messy: urgent email from a board member, a double-booked calendar, a spouse texting about dinner, a report due at noon, and every choice ripples—someone feels heard or ignored, a conflict gets better or worse, a task slips or gets done. Ghostexec turns that into a small simulator the model must run, not a single paragraph to summarize.

The environment (one sentence)

You read a realistic staff briefing; you pick one legal “move” (reply, reschedule, delegate, …); the world updates; you get a score that reflects tension across work, people, and tasks.

What the agent is supposed to learn

Read carefully — wrong email_id / meeting_id / task_id fails cleanly with feedback.
Act under pressure — clock, max_steps, and stress push toward decisions, not endless analysis.
Balance competing goals — improving relationships can conflict with clearing the calendar or finishing tasks; rewards encode that tradeoff.
Recover from change — drift events mean the “right” plan from step 1 may not stay right at step 8.

Demo tips for a non-technical audience

Show the briefing first — let viewers see the same wall of text the model sees (relatable chaos).
Show one good step vs one bad step — e.g. thoughtful reply vs invalid id or do_nothing while critical mail waits (mood / reward visibly differ).
Name the three “channels” — calmer calendar, happier stakeholders, tasks moving forward—without math jargon.
End on “what improved” — after training, pick the same scenario and show fewer invalid steps, higher rewards, or a grader curve (ties to the 20% section below).

Hackathon alignment (themes)

Theme fit (examples): Ghostexec fits Theme 3.2 — Personalized tasks (executive-style inbox, calendar, delegation). Theme 4 is partially supported via GHOSTEXEC_CURRICULUM, GHOSTEXEC_PERTURB, and diverse scenarios/.

Criterion: Showing Improvement in Rewards (20%)

Weight: 20%

What it means:

Is there observable evidence of training progress? Reward curves, before/after behavior, comparison against a baseline—anything that proves the agent learned something.

Where evidence lives in this repo

Artifact	Role
`outputs/logs/episode_rewards.jsonl`	Per-step reward trace (gitignored); use for reward curves and component debugging.
`outputs/trainer_state.json` / training logs	Produced by training scripts when configured; feed into plotting.
`outputs/reward_log.csv`	Optional CSV companion for plotting pipelines.
`outputs/compliance_manifest.json`	Baseline / compliance metadata for comparison charts.
`outputs/plots/*.png`	Generated report figures (see command below).

Plot pack (loss + reward + components + baseline bar):

uv run python scripts/plot_training_report.py \
  --trainer-history outputs/trainer_state.json \
  --reward-csv outputs/reward_log.csv \
  --baselines-json outputs/compliance_manifest.json \
  --out-dir outputs/plots

Writes loss_curve.png, reward_curve.png, components_curve.png, baseline_comparison.png under outputs/plots/.

End-to-end notebook: notebooks/ghostexec_unsloth_grpo_hf_api.ipynb is intended to Run All without manual steps (per project convention).

Before / after narrative for judges: same task_id and seed—show lower invalid rate, higher mean step reward, or clearer grader trajectory after finetuning. Pair numbers with one short clip of two runs side by side on the Space or local server.

Criterion: Reward & Training Pipeline (10%)

Weight: 10%

What it means:

Is the reward logic coherent?
Does the pipeline produce meaningful improvement in the trained agent's behavior?

Reward logic (coherent and inspectable)

Phase-4 scoring in server/reward.py uses a fixed core blend:

[ \text{weighted base} = 0.35 \cdot \text{conflict} + 0.35 \cdot \text{relationship} + 0.30 \cdot \text{task} ]

Then bounded shaping, invalid-step handling, and explicit penalties (including do_nothing). Components surface on RewardBreakdown and in observation metadata where configured—so “why did this step score X?” is auditable, not a black box.

Design rationale is aligned with dense reward-shaping practice (see arXiv:2408.10215)—fixed channel weights, bounded magnitudes, sparse end-of-episode avoided for training.

Training pipeline (entrypoints)

Step	Command / artifact
Install	`uv sync` (from repo root)
Server (matches Dockerfile)	`uv run server --port 8000`
SFT → GRPO script	`uv run python scripts/train_sft_then_grpo.py` (see Running and testing locally for a full example invocation)
Tests	`uv run pytest tests/ -q`
Docker build gate	`GHOSTEXEC_RUN_DOCKER_BUILD=1 uv run pytest tests/test_docker_build.py -q`

The pipeline is meaningful when tied to the 20% evidence above: same env URL, logged rewards, and plots that move in the right direction over training—not when loss alone decreases.

OpenEnv Hackathon themes & checklist

Item	Status
OpenEnv-based env + `openenv.yaml`	In-repo (`openenv-core[core]>=0.2.3`).
Short write-up or <2 min video	You: publish and paste URLs in Deliverables.
Public HF Space	Deliverables; deploy with `openenv push --repo-id <your>/ghostexec`.

Quick start (Python client)

From the repo root (where pyproject.toml lives):

uv sync
uv run server --port 8000

from ghostexec import GhostexecAction, GhostexecEnv

with GhostexecEnv(base_url="http://127.0.0.1:8000") as env:
    out = env.reset()
    print(out.observation.echoed_message[:500], "…")

    step = env.step(
        GhostexecAction(
            action_type="reply_email",
            email_id="e01",
            message_body=(
                "Marcus — acknowledged. Revised figures and short rationale "
                "before noon. — Exec"
            ),
        )
    )
    print("reward:", step.reward)
    print("metadata keys:", sorted((step.observation.metadata or {}).keys()))

Docker (optional):

docker build -t ghostexec-env:latest .

Actions and fields

GhostexecAction (models.py):

`action_type`	Typical fields
`reply_email`	`email_id`, `message_body`
`archive_email`	`email_id`
`reschedule_meeting`	`meeting_id`, `new_time`, `reason`
`cancel_meeting`	`meeting_id`, `reason`
`complete_task`	`task_id`
`delegate_task`	`task_id`, `contact_name`
`send_message`	`contact_name`, `message`
`do_nothing`	— (penalised path)

Malformed HTTP payloads are handled safely so clients do not crash the server.

Observation

echoed_message — Full plain-text briefing.
message_length — Length of briefing.
reward, done, metadata — Step outcome; metadata includes step_ok, reward breakdown fields, and debug ids.

Reward (formula summary)

Full detail is under Criterion: Reward & Training Pipeline (10%). Episode logs: outputs/logs/episode_rewards.jsonl (gitignored).

HTTP vs WebSocket (episode state)

HTTP POST /reset and POST /step may use short-lived instances; consecutive HTTP calls might not share one in-memory episode.
WebSocket /ws (or GhostexecEnv) — use for multi-step episodes on one session.

Endpoints: /web, /docs, /health, /ws.

Running and testing locally

uv run uvicorn ghostexec.server.app:app --reload --host 0.0.0.0 --port 8000
# or
uv run server --port 8000

HTTP smoke:

uv run python scripts/http_endpoint_smoke.py --local

Tests:

uv run pytest tests/ -q
GHOSTEXEC_RUN_DOCKER_BUILD=1 uv run pytest tests/test_docker_build.py -q
uv run pytest tests/test_live_server_exhaustive.py -v --tb=short   # server on :8000

SFT → GRPO (example):

uv run python scripts/train_sft_then_grpo.py \
  --model-preset small_iter_fast \
  --training-preset hackathon_turbo \
  --env-url http://127.0.0.1:8000 \
  --generate-sft-from-env \
  --sft-samples 120 \
  --max-sft-steps 60 \
  --max-grpo-steps 120 \
  --env-reward-scale 1.0 \
  --local-reward-scale 0.35 \
  --complexity-curriculum easy_to_full \
  --curriculum-ramp-ratio 0.60

Hugging Face Spaces

openenv serve
openenv build
openenv validate --verbose
openenv push
# openenv push --repo-id your-username/ghostexec

Use a public Space for the default hackathon flow. openenv.yaml carries name, version, and description for metadata—keep them in sync with submission needs.

Scenarios

File	Role
`scenarios/phase2_core.json`	Default dense fixture
`scenarios/monday_morning.json`, `dinner_disaster.json`, `vip_meltdown.json`	Narrative pressure
`scenarios/vip_meltdown_drift.json`	Mood / escalation drift
`scenarios/schema_drift_test.json`	Drift-event harness

Project layout

ghostexec/
├── openenv.yaml
├── pyproject.toml
├── models.py
├── client.py
├── graders.py
├── scenarios/
├── scripts/
├── notebooks/
├── tests/
└── server/
    ├── app.py
    ├── ghostexec_environment.py
    ├── reward.py
    └── Dockerfile

Resources & references

meta-pytorch/OpenEnv — core stack
Packaging & Deploying
OpenEnv Hub
Building RL Environments with OpenEnv (and related talks linked in prior README iterations)

License

BSD-style — see license notices in source files (Meta / OpenEnv lineage).

Ghostexec

Deliverables (fill before freeze)

Contents

Criterion: Environment Innovation (40%)

How Ghostexec answers this

Task ladder (difficulty in data)

5-minute verification checklist

Criterion: Storytelling & Presentation (30%)

The problem (plain language)

The environment (one sentence)

What the agent is supposed to learn

Demo tips for a non-technical audience

Hackathon alignment (themes)

Criterion: Showing Improvement in Rewards (20%)

Where evidence lives in this repo

Criterion: Reward & Training Pipeline (10%)

Reward logic (coherent and inspectable)

Training pipeline (entrypoints)

OpenEnv Hackathon themes & checklist

Quick start (Python client)

Actions and fields

Observation

Reward (formula summary)

HTTP vs WebSocket (episode state)

Running and testing locally

Hugging Face Spaces

Scenarios

Project layout

Resources & references

License