{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "33566e3d",
      "metadata": {},
      "source": [
        "# Ghostexec — Unsloth + TRL SFT -> GRPO against the deployed HF Space API\n",
        "\n",
        "Post-train `unsloth/Llama-3.2-3B-Instruct` with **SFT warmup first** and then GRPO, where rewards are fetched over HTTP from the **live** Ghostexec OpenEnv Space.\n",
        "\n",
        "- Live endpoint: `https://modelbuilderhq-ghostexec.hf.space`\n",
        "- Algorithm: TRL `0.22.2` `SFTTrainer` -> `GRPOTrainer` (no vLLM — HF `generate()` path)\n",
        "- Base (recommended for fast winning iterations): `unsloth/Qwen2.5-3B-Instruct` (4-bit) + LoRA r=16 + bf16\n",
        "- Curriculum: **easy -> full** annealing (strong local scaffold early, env-dominant later)\n",
        "- Rewards: four **independent** functions — `env_reward` (live Space) / `format_reward` / `semantic_action_reward` / `anti_idle_reward`\n",
        "\n",
        "### Help Guide phase map (notebook sections mirror `[Participant Help Guide] §18`)\n",
        "| Phase | Where |\n",
        "|---|---|\n",
        "| 1 Pick a narrow task | section 1 |\n",
        "| 2 Build the environment | section 2 (already deployed; health check here) |\n",
        "| 3 Build rewards | section 3 |\n",
        "| 4 Deploy | section 4 (confirm) |\n",
        "| 5 Train small | section 5 (SFT + Stage B) |\n",
        "| 6 Inspect for hacking | section 6 |\n",
        "| 7 Add curriculum | section 7 (Stages C + D) |\n",
        "| 8 Train bigger | section 8 (knobs, not action) |\n",
        "| 9 Save and demo | section 9 |"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Phase 1 — Pick a narrow task\n",
        "\n",
        "Single-step action selection from a plain-text executive briefing. The model reads the briefing from `/reset` and must emit exactly one JSON action matching `GhostexecAction`. The deployed Space scores that action and returns a reward from `/step`. That reward is the learning signal.\n",
        "\n",
        "Legal `action_type` values: `reply_email, archive_email, reschedule_meeting, cancel_meeting, complete_task, delegate_task, send_message, do_nothing`.\n",
        "\n",
        "The scenario is fixed on the deployed Space (`phase2_core`), so the curriculum is an **exploration schedule** (temperature / num_generations / learning rate) across three training stages rather than a scenario switch."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Phase 2 — Build the environment (already deployed on HF Spaces)\n",
        "\n",
        "The next cell is the exact Unsloth install snippet. Restart the runtime after it finishes if Colab asks you to."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "%%capture\n",
        "import os, importlib.util\n",
        "!pip install --upgrade -qqq uv\n",
        "if importlib.util.find_spec(\"torch\") is None or \"COLAB_\" in \"\".join(os.environ.keys()):\n",
        "    try: import numpy; get_numpy = f\"numpy=={numpy.__version__}\"\n",
        "    except: get_numpy = \"numpy\"\n",
        "    !uv pip install -qqq \\\n",
        "        \"torch>=2.8.0\" \"triton>=3.4.0\" {get_numpy} torchvision bitsandbytes \"transformers==4.56.2\" trackio \\\n",
        "        \"unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo\" \\\n",
        "        \"unsloth[base] @ git+https://github.com/unslothai/unsloth\" \\\n",
        "        git+https://github.com/triton-lang/triton.git@0add68262ab0a2e33b84524346cb27cbb2787356#subdirectory=python/triton_kernels\n",
        "elif importlib.util.find_spec(\"unsloth\") is None:\n",
        "    !uv pip install -qqq unsloth trackio\n",
        "!uv pip install --upgrade --no-deps transformers==4.56.2 tokenizers trl==0.22.2 unsloth unsloth_zoo"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "%pip install -q requests pydantic matplotlib pandas tqdm huggingface_hub datasets\n",
        "print(\"aux deps installed\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "88474159",
      "metadata": {},
      "outputs": [],
      "source": [
        "import os, sys, json, time, random, re, math, pathlib\n",
        "from typing import Any\n",
        "\n",
        "GHOSTEXEC_ENV_URL = os.environ.get(\"GHOSTEXEC_ENV_URL\", \"https://modelbuilderhq-ghostexec.hf.space\")\n",
        "# Small-model-first default for rapid iteration and higher success probability.\n",
        "MODEL_ID          = os.environ.get(\"MODEL_ID\", \"unsloth/Qwen2.5-3B-Instruct\")\n",
        "RUN_NAME          = os.environ.get(\"RUN_NAME\", \"ghostexec-unsloth-grpo\")\n",
        "HUB_REPO_ID       = os.environ.get(\"HUB_REPO_ID\", \"\")\n",
        "OUT = pathlib.Path(\"/content/ghostexec_out\") if os.path.exists(\"/content\") else pathlib.Path(\"./ghostexec_out\")\n",
        "OUT.mkdir(parents=True, exist_ok=True)\n",
        "\n",
        "try:\n",
        "    from google.colab import userdata  # type: ignore\n",
        "    if not os.environ.get(\"HF_TOKEN\"):\n",
        "        try: os.environ[\"HF_TOKEN\"] = userdata.get(\"HF_TOKEN\") or \"\"\n",
        "        except Exception: pass\n",
        "except Exception:\n",
        "    pass\n",
        "\n",
        "print(\"Endpoint :\", GHOSTEXEC_ENV_URL)\n",
        "print(\"Model    :\", MODEL_ID)\n",
        "print(\"Output   :\", OUT)\n",
        "print(\"HF token :\", \"set\" if os.environ.get(\"HF_TOKEN\") else \"missing (needed only for push_to_hub)\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 2.1 HTTP client to the deployed Space\n",
        "\n",
        "Every reward in this notebook comes from this class — we never run Ghostexec locally."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import requests\n",
        "\n",
        "class GhostexecSpace:\n",
        "    def __init__(self, url: str, timeout: float = 60.0, max_retries: int = 4):\n",
        "        self.url = url.rstrip(\"/\")\n",
        "        self.timeout = timeout\n",
        "        self.max_retries = max_retries\n",
        "        self.latency_ms: list[float] = []\n",
        "\n",
        "    def _post(self, path: str, payload: dict) -> dict:\n",
        "        last_err: Exception | None = None\n",
        "        for attempt in range(self.max_retries):\n",
        "            try:\n",
        "                t0 = time.perf_counter()\n",
        "                r = requests.post(f\"{self.url}{path}\", json=payload, timeout=self.timeout)\n",
        "                self.latency_ms.append((time.perf_counter() - t0) * 1000.0)\n",
        "                r.raise_for_status()\n",
        "                return r.json()\n",
        "            except Exception as e:\n",
        "                last_err = e\n",
        "                time.sleep(min(2 ** attempt, 8.0))\n",
        "        raise RuntimeError(f\"POST {path} failed after {self.max_retries} tries: {last_err}\")\n",
        "\n",
        "    def reset(self) -> dict:\n",
        "        return self._post(\"/reset\", {})\n",
        "\n",
        "    def step(self, action: dict) -> tuple[float, dict]:\n",
        "        raw = self._post(\"/step\", {\"action\": action})\n",
        "        reward = raw.get(\"reward\")\n",
        "        if reward is None:\n",
        "            reward = (raw.get(\"observation\") or {}).get(\"reward\", 0.0)\n",
        "        try:    return float(reward), raw\n",
        "        except Exception: return 0.0, raw\n",
        "\n",
        "env = GhostexecSpace(GHOSTEXEC_ENV_URL)\n",
        "print(\"Health reset ...\")\n",
        "_obs = env.reset()\n",
        "print(\"reset keys:\", sorted(_obs.keys()))\n",
        "_brief = ((_obs.get(\"observation\") or _obs).get(\"echoed_message\") or \"\")[:400]\n",
        "print(\"briefing preview:\\n\", _brief)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "b747bc4e",
      "metadata": {},
      "source": [
        "### 2.2 Verifier sanity check (Help Guide §8)\n",
        "\n",
        "**Colab / stale cells:** If the traceback mentions **`do_nothing is not the worst/floor`** on **line ~28**, you are running **old cached notebook code** (that assert was removed). Use **Runtime → Disconnect and delete runtime**, then **re-clone** the repo or **re-download** this notebook from GitHub and run from the top.\n",
        "\n",
        "**If every proactive action prints `-0.25` and only `do_nothing` is `-0.15`:** every non-idle smoke is an **invalid step** (wrong ids like `email_01`, or an outdated `_smoke_action`). This cell expects **real `phase2_core` ids** (`e01`, `e09`, `m02`, …) — see `_smoke_action` below.\n",
        "\n",
        "Fire every legal `action_type` once with **semantically valid** payloads (real ids from `scenarios/phase2_core.json`). Fake ids deserialize but fail validation (−0.25 invalid-step) and are not a fair probe. Also: **`do_nothing` is not guaranteed to be the lowest reward** — a valid but harmful action (e.g. cancelling an important meeting) can push the weighted score below the idle penalty. We instead assert **non-idle smokes are `step_ok=True`** and **`do_nothing` scores below a benign `reply_email` on `e01`**. If rewards are all identical, abort — GRPO cannot learn from a degenerate verifier."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "5ed1a9bc",
      "metadata": {},
      "outputs": [],
      "source": [
        "LEGAL_ACTION_TYPES = [\n",
        "    \"reply_email\", \"archive_email\", \"reschedule_meeting\", \"cancel_meeting\",\n",
        "    \"complete_task\", \"delegate_task\", \"send_message\", \"do_nothing\",\n",
        "]\n",
        "\n",
        "def _smoke_action(action_type: str) -> dict:\n",
        "    # Real IDs from phase2_core scenario\n",
        "    base = {\"action_type\": action_type, \"message\": \"\"}\n",
        "\n",
        "    if action_type == \"reply_email\":\n",
        "        return {**base, \"email_id\": \"e01\", \"message_body\": \"Acknowledged — on it now.\"}\n",
        "    if action_type == \"archive_email\":\n",
        "        return {**base, \"email_id\": \"e09\"}\n",
        "    if action_type == \"reschedule_meeting\":\n",
        "        return {\n",
        "            **base,\n",
        "            \"meeting_id\": \"m02\",\n",
        "            \"new_time\": \"2026-04-21T18:00:00\",\n",
        "            \"reason\": \"freeing the morning block\",\n",
        "        }\n",
        "    if action_type == \"cancel_meeting\":\n",
        "        return {**base, \"meeting_id\": \"m10\", \"reason\": \"smoke test cancel\"}\n",
        "    if action_type == \"complete_task\":\n",
        "        return {**base, \"task_id\": \"t07\"}\n",
        "    if action_type == \"delegate_task\":\n",
        "        return {**base, \"task_id\": \"t08\", \"contact_name\": \"Jordan Lee\"}\n",
        "    if action_type == \"send_message\":\n",
        "        return {\n",
        "            **base,\n",
        "            \"contact_name\": \"Jamie Liu\",\n",
        "            \"message_body\": \"Quick sync when you have a minute.\",\n",
        "        }\n",
        "\n",
        "    # do_nothing\n",
        "    return {\n",
        "        **base,\n",
        "        \"email_id\": \"\",\n",
        "        \"message_body\": \"\",\n",
        "        \"meeting_id\": \"\",\n",
        "        \"new_time\": \"\",\n",
        "        \"reason\": \"\",\n",
        "        \"task_id\": \"\",\n",
        "        \"contact_name\": \"\",\n",
        "    }\n",
        "\n",
        "rewards_by_action: dict[str, float] = {}\n",
        "step_ok_by_action: dict[str, bool | None] = {}\n",
        "\n",
        "for at in LEGAL_ACTION_TYPES:\n",
        "    env.reset()\n",
        "    r, raw = env.step(_smoke_action(at))\n",
        "    rewards_by_action[at] = round(r, 4)\n",
        "    obs = raw.get(\"observation\") or {}\n",
        "    step_ok_by_action[at] = (obs.get(\"metadata\") or {}).get(\"step_ok\")\n",
        "\n",
        "print(json.dumps({\"reward\": rewards_by_action, \"step_ok\": step_ok_by_action}, indent=2))\n",
        "\n",
        "uniq = set(rewards_by_action.values())\n",
        "assert len(uniq) > 1, \"Verifier is constant across actions — env can't teach anything.\"\n",
        "\n",
        "# All non-idle smokes must be valid\n",
        "for at in LEGAL_ACTION_TYPES:\n",
        "    if at == \"do_nothing\":\n",
        "        continue\n",
        "    assert step_ok_by_action.get(at) is True, f\"{at} smoke is invalid; check IDs.\"\n",
        "\n",
        "# Idle should be worse than benign good action\n",
        "assert rewards_by_action[\"do_nothing\"] < rewards_by_action[\"reply_email\"] - 1e-6, \\\n",
        "    \"do_nothing should score below reply_email(e01).\"\n",
        "\n",
        "print(\"\\nverifier OK — rewards vary, smokes are valid, do_nothing < reply_email(e01).\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Phase 3 — Build rewards\n",
        "\n",
        "Three independent reward functions per Help Guide §7. Keeping them independent means we can plot each component, watch their correlations, and catch hacking (e.g. env reward climbs while format reward collapses)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from pydantic import BaseModel\n",
        "from typing import Literal\n",
        "\n",
        "GhostexecActionType = Literal[\n",
        "    \"reply_email\", \"archive_email\", \"reschedule_meeting\", \"cancel_meeting\",\n",
        "    \"complete_task\", \"delegate_task\", \"send_message\", \"do_nothing\",\n",
        "]\n",
        "\n",
        "class GhostexecAction(BaseModel):\n",
        "    action_type:   GhostexecActionType = \"do_nothing\"\n",
        "    email_id:      str = \"\"\n",
        "    message_body:  str = \"\"\n",
        "    meeting_id:    str = \"\"\n",
        "    new_time:      str = \"\"\n",
        "    reason:        str = \"\"\n",
        "    task_id:       str = \"\"\n",
        "    contact_name: str = \"\"\n",
        "    message:       str = \"\"\n",
        "\n",
        "def _extract_json(text: str) -> dict:\n",
        "    s = text.strip()\n",
        "    s = re.sub(r\"^```(?:json)?\\s*|\\s*```$\", \"\", s, flags=re.IGNORECASE | re.MULTILINE).strip()\n",
        "    start, end = s.find(\"{\"), s.rfind(\"}\")\n",
        "    if start == -1 or end <= start: raise ValueError(\"no json object\")\n",
        "    return json.loads(s[start:end+1])\n",
        "\n",
        "def parse_action_strict(text: str) -> dict:\n",
        "    obj = _extract_json(text)\n",
        "    GhostexecAction(**obj)\n",
        "    return obj\n",
        "\n",
        "def parse_action(text: str) -> dict:\n",
        "    try: return parse_action_strict(text)\n",
        "    except Exception: return {\"action_type\": \"do_nothing\"}\n",
        "\n",
        "LEGAL_ACTION_TYPES = {\n",
        "    \"reply_email\", \"archive_email\", \"reschedule_meeting\", \"cancel_meeting\",\n",
        "    \"complete_task\", \"delegate_task\", \"send_message\", \"do_nothing\",\n",
        "}\n",
        "LEGAL_ACTION_KEYS = {\n",
        "    \"action_type\", \"email_id\", \"message_body\", \"meeting_id\",\n",
        "    \"new_time\", \"reason\", \"task_id\", \"contact_name\", \"message\",\n",
        "}\n",
        "\n",
        "\n",
        "def sanitize_action(raw: dict) -> dict:\n",
        "    \"\"\"Keep only legal Ghostexec fields and coerce malformed IDs/actions safely.\"\"\"\n",
        "    action = {k: v for k, v in (raw or {}).items() if k in LEGAL_ACTION_KEYS}\n",
        "\n",
        "    at = str(action.get(\"action_type\", \"do_nothing\"))\n",
        "    if at not in LEGAL_ACTION_TYPES:\n",
        "        at = \"do_nothing\"\n",
        "    action[\"action_type\"] = at\n",
        "\n",
        "    # Common model mistake: writes message text into `message` instead of `message_body`.\n",
        "    if at in {\"reply_email\", \"send_message\"}:\n",
        "        if not action.get(\"message_body\") and action.get(\"message\"):\n",
        "            action[\"message_body\"] = action[\"message\"]\n",
        "\n",
        "    if \"email_id\" in action and not re.fullmatch(r\"e\\d{2}\", str(action[\"email_id\"])):\n",
        "        action[\"email_id\"] = \"\"\n",
        "    if \"meeting_id\" in action and not re.fullmatch(r\"m\\d{2}\", str(action[\"meeting_id\"])):\n",
        "        action[\"meeting_id\"] = \"\"\n",
        "    if \"task_id\" in action and not re.fullmatch(r\"t\\d{2}\", str(action[\"task_id\"])):\n",
        "        action[\"task_id\"] = \"\"\n",
        "\n",
        "    return action\n",
        "\n",
        "assert parse_action_strict('```json\\n{\"action_type\":\"archive_email\",\"email_id\":\"email_01\"}\\n```')[\"action_type\"] == \"archive_email\"\n",
        "assert parse_action(\"garbage\")[\"action_type\"] == \"do_nothing\"\n",
        "print(\"parser OK\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "3bd66b49",
      "metadata": {},
      "outputs": [],
      "source": [
        "def _completion_text(c) -> str:\n",
        "    if isinstance(c, list) and c and isinstance(c[0], dict):\n",
        "        return c[0].get(\"content\", \"\")\n",
        "    return c if isinstance(c, str) else str(c)\n",
        "\n",
        "\n",
        "def _prompt_to_text(p) -> str:\n",
        "    if isinstance(p, list) and p and isinstance(p[-1], dict):\n",
        "        return str(p[-1].get(\"content\", \"\"))\n",
        "    if isinstance(p, dict):\n",
        "        return str(p.get(\"content\", \"\"))\n",
        "    return str(p)\n",
        "\n",
        "\n",
        "# Curriculum scalars are updated per stage: easy -> full.\n",
        "CURRENT_ENV_SCALE = 0.85\n",
        "CURRENT_LOCAL_SCALE = 0.60\n",
        "\n",
        "\n",
        "def env_reward(completions, prompts=None, **_) -> list[float]:\n",
        "    out: list[float] = []\n",
        "    for c in completions:\n",
        "        text = _completion_text(c)\n",
        "        action = sanitize_action(parse_action(text))\n",
        "        try:\n",
        "            env.reset()\n",
        "            r, _ = env.step(action)\n",
        "        except Exception:\n",
        "            r = -1.0\n",
        "        out.append(float(r) * CURRENT_ENV_SCALE)\n",
        "    return out\n",
        "\n",
        "\n",
        "def format_reward(completions, **_) -> list[float]:\n",
        "    out: list[float] = []\n",
        "    for c in completions:\n",
        "        text = _completion_text(c)\n",
        "        try:\n",
        "            parse_action_strict(text)\n",
        "            out.append(0.12 * CURRENT_LOCAL_SCALE)\n",
        "        except Exception:\n",
        "            out.append(-0.20 * CURRENT_LOCAL_SCALE)\n",
        "    return out\n",
        "\n",
        "\n",
        "def semantic_action_reward(completions, prompts=None, **_) -> list[float]:\n",
        "    \"\"\"\n",
        "    Reward canonical, briefing-grounded action payloads before env call.\n",
        "    Scaled by CURRENT_LOCAL_SCALE for easy->full curriculum annealing.\n",
        "    \"\"\"\n",
        "    out: list[float] = []\n",
        "    for i, c in enumerate(completions):\n",
        "        text = _completion_text(c)\n",
        "        act = parse_action(text)\n",
        "        at = act.get(\"action_type\", \"do_nothing\")\n",
        "\n",
        "        prompt_text = \"\"\n",
        "        if prompts is not None and i < len(prompts):\n",
        "            prompt_text = _prompt_to_text(prompts[i])\n",
        "\n",
        "        def present(tok: str) -> bool:\n",
        "            return bool(tok) and re.search(rf\"\\b{re.escape(tok)}\\b\", prompt_text) is not None\n",
        "\n",
        "        r = -0.30\n",
        "        if at == \"do_nothing\":\n",
        "            r = -0.05\n",
        "        elif at == \"reply_email\":\n",
        "            eid = act.get(\"email_id\", \"\")\n",
        "            mb = (act.get(\"message_body\", \"\") or \"\").strip()\n",
        "            r = 0.30 if present(eid) and bool(re.fullmatch(r\"e\\d{2}\", eid)) and mb else -0.30\n",
        "        elif at == \"archive_email\":\n",
        "            eid = act.get(\"email_id\", \"\")\n",
        "            r = 0.30 if present(eid) and bool(re.fullmatch(r\"e\\d{2}\", eid)) else -0.30\n",
        "        elif at == \"reschedule_meeting\":\n",
        "            mid = act.get(\"meeting_id\", \"\")\n",
        "            nt = (act.get(\"new_time\", \"\") or \"\").strip()\n",
        "            r = 0.30 if present(mid) and bool(re.fullmatch(r\"m\\d{2}\", mid)) and nt else -0.30\n",
        "        elif at == \"cancel_meeting\":\n",
        "            mid = act.get(\"meeting_id\", \"\")\n",
        "            r = 0.30 if present(mid) and bool(re.fullmatch(r\"m\\d{2}\", mid)) else -0.30\n",
        "        elif at == \"complete_task\":\n",
        "            tid = act.get(\"task_id\", \"\")\n",
        "            r = 0.30 if present(tid) and bool(re.fullmatch(r\"t\\d{2}\", tid)) else -0.30\n",
        "        elif at == \"delegate_task\":\n",
        "            tid = act.get(\"task_id\", \"\")\n",
        "            cn = (act.get(\"contact_name\", \"\") or \"\").strip()\n",
        "            r = 0.30 if present(tid) and bool(re.fullmatch(r\"t\\d{2}\", tid)) and (cn in prompt_text) else -0.30\n",
        "        elif at == \"send_message\":\n",
        "            cn = (act.get(\"contact_name\", \"\") or \"\").strip()\n",
        "            mb = (act.get(\"message_body\", \"\") or \"\").strip()\n",
        "            r = 0.30 if cn and (cn in prompt_text) and mb else -0.30\n",
        "\n",
        "        out.append(float(r) * CURRENT_LOCAL_SCALE)\n",
        "    return out\n",
        "\n",
        "\n",
        "def anti_idle_reward(completions, **_) -> list[float]:\n",
        "    out: list[float] = []\n",
        "    for c in completions:\n",
        "        text = _completion_text(c)\n",
        "        act = parse_action(text)\n",
        "        out.append((-0.28 if act.get(\"action_type\") == \"do_nothing\" else 0.03) * CURRENT_LOCAL_SCALE)\n",
        "    return out\n",
        "\n",
        "\n",
        "_dummy = '{\"action_type\":\"archive_email\",\"email_id\":\"email_01\"}'\n",
        "print(\"env      :\", env_reward([_dummy]))\n",
        "print(\"format   :\", format_reward([_dummy]))\n",
        "print(\"semantic :\", semantic_action_reward([_dummy], prompts=[\"... e01 e09 t07 m02 Jamie Liu ...\"]))\n",
        "print(\"anti_idle:\", anti_idle_reward([_dummy]))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a6a37dad",
      "metadata": {},
      "outputs": [],
      "source": [
        "# GRPO stages B / C / D: no early-stop callbacks — each stage runs the full `max_steps`.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "92744e0a",
      "metadata": {},
      "source": [
        "## Phase 4 — Deploy\n",
        "\n",
        "Already done. Live Space: [`modelbuilderhq/ghostexec`](https://huggingface.co/spaces/modelbuilderhq/ghostexec). The health-check cell above confirmed `/reset` + `/step` are green."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "428d6377",
      "metadata": {},
      "source": [
        "## Phase 5 — Train small (SFT warmup -> GRPO)\n",
        "\n",
        "Load `unsloth/Llama-3.2-3B-Instruct` in 4-bit with Unsloth, attach LoRA, run a **short SFT warmup first**, then run GRPO. vLLM is not used anywhere in this notebook — rollouts go through the standard HF `generate()` path inside `GRPOTrainer`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "ad55b1af",
      "metadata": {},
      "outputs": [],
      "source": [
        "# IMPORTANT: import unsloth before transformers so its kernels patch cleanly.\n",
        "from unsloth import FastLanguageModel\n",
        "import torch\n",
        "\n",
        "MAX_SEQ_LENGTH = 2048\n",
        "\n",
        "policy, tokenizer = FastLanguageModel.from_pretrained(\n",
        "    model_name=MODEL_ID,\n",
        "    max_seq_length=MAX_SEQ_LENGTH,\n",
        "    load_in_4bit=True,\n",
        "    dtype=None,                 # auto (bf16 on T4 compute via bnb)\n",
        ")\n",
        "\n",
        "policy = FastLanguageModel.get_peft_model(\n",
        "    policy,\n",
        "    r=16, lora_alpha=32, lora_dropout=0.0,\n",
        "    target_modules=[\"q_proj\",\"k_proj\",\"v_proj\",\"o_proj\",\"gate_proj\",\"up_proj\",\"down_proj\"],\n",
        "    bias=\"none\",\n",
        "    use_gradient_checkpointing=\"unsloth\",\n",
        "    random_state=3407,\n",
        ")\n",
        "\n",
        "if tokenizer.pad_token is None:\n",
        "    tokenizer.pad_token = tokenizer.eos_token\n",
        "tokenizer.padding_side = \"left\"\n",
        "\n",
        "print(\"policy loaded:\", MODEL_ID)\n",
        "policy.print_trainable_parameters()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "883dce70",
      "metadata": {},
      "outputs": [],
      "source": [
        "SYSTEM_PROMPT = (\n",
        "    \"You are Ghostexec, an AI Chief of Staff. You receive a plain-text briefing of an executive's \"\n",
        "    \"inbox, calendar and tasks. You must choose the single best next action.\\n\\n\"\n",
        "    \"Legal action_type values: reply_email, archive_email, reschedule_meeting, cancel_meeting, \"\n",
        "    \"complete_task, delegate_task, send_message, do_nothing.\\n\\n\"\n",
        "    \"Output ONLY a compact JSON object with these keys (no prose, no code fences):\\n\"\n",
        "    \"{\\\"action_type\\\": \\\"\\\", \\\"email_id\\\": \\\"\\\", \\\"message_body\\\": \\\"\\\", \"\n",
        "    \"\\\"meeting_id\\\": \\\"\\\", \\\"new_time\\\": \\\"\\\", \\\"reason\\\": \\\"\\\", \\\"task_id\\\": \\\"\\\", \"\n",
        "    \"\\\"contact_name\\\": \\\"\\\", \\\"message\\\": \\\"\\\"}.\\n\\n\"\n",
        "    \"ID RULES:\\n\"\n",
        "    \"- email_id must be an email token from briefing like e01, e02, ...\\n\"\n",
        "    \"- meeting_id must be a meeting token like m01, m02, ...\\n\"\n",
        "    \"- task_id must be a task token like t01, t02, ...\\n\"\n",
        "    \"- contact_name must exactly match a contact shown in briefing.\\n\"\n",
        "    \"- Never use subject/body/description text as an ID.\\n\"\n",
        "    \"- If you cannot find a valid ID for your chosen action, output {\\\"action_type\\\":\\\"do_nothing\\\"}.\\n\\n\"\n",
        "    \"Prefer high-impact valid actions; avoid do_nothing when critical items are unresolved.\"\n",
        ")\n",
        "\n",
        "def build_prompt(briefing: str) -> list[dict]:\n",
        "    return [\n",
        "        {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
        "        {\"role\": \"user\",   \"content\": f\"BRIEFING:\\n{briefing}\\n\\nReturn one JSON action.\"},\n",
        "    ]\n",
        "\n",
        "def render_chat(messages: list[dict]) -> str:\n",
        "    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "adebc81c",
      "metadata": {},
      "outputs": [],
      "source": [
        "from datasets import Dataset\n",
        "from tqdm.auto import tqdm\n",
        "\n",
        "def fetch_briefing() -> str:\n",
        "    obs = env.reset()\n",
        "    inner = obs.get(\"observation\") or obs\n",
        "    brief = inner.get(\"echoed_message\") or inner.get(\"message\") or \"\"\n",
        "    if not brief:\n",
        "        raise RuntimeError(f\"Space returned no briefing: keys={list(inner.keys())}\")\n",
        "    return brief\n",
        "\n",
        "N_BRIEFINGS = int(os.environ.get(\"N_BRIEFINGS\", \"24\"))\n",
        "briefings: list[str] = []\n",
        "for _ in tqdm(range(N_BRIEFINGS), desc=\"sampling /reset\"):\n",
        "    briefings.append(fetch_briefing())\n",
        "\n",
        "print(f\"fetched {len(briefings)} briefings ({len(set(briefings))} unique)\")\n",
        "train_ds = Dataset.from_list([{\"prompt\": build_prompt(b)} for b in briefings])\n",
        "print(train_ds)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "e80ea6ed",
      "metadata": {},
      "source": [
        "### 5.1 Baselines — weak random vs frozen (pre-pipeline) vs trained (Help Guide §19)\n",
        "\n",
        "- **Random**: junk actions (bad IDs / not in scenario) so the mean stays low on purpose.\n",
        "- **Frozen**: same LoRA policy **before** the SFT+GRPO cells below; should beat random clearly.\n",
        "- **Trained**: re-eval after GRPO; Phase 9 asserts **random mean < frozen mean** and **frozen mean + margin < trained mean** (margins: env vars `MIN_GAP_RANDOM_FROZEN`, `MIN_GAP_FROZEN_TRAINED`)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "9c2ff7d6",
      "metadata": {},
      "outputs": [],
      "source": [
        "N_EVAL = int(os.environ.get(\"N_EVAL\", \"8\"))\n",
        "\n",
        "def random_policy_reward() -> list[float]:\n",
        "    rs: list[float] = []\n",
        "    for _ in range(N_EVAL):\n",
        "        at = random.choice(list(LEGAL_ACTION_TYPES))\n",
        "        env.reset()\n",
        "        r, _ = env.step(_smoke_action(at))\n",
        "        rs.append(r)\n",
        "    return rs\n",
        "\n",
        "@torch.no_grad()\n",
        "def evaluate_policy(model, n: int = N_EVAL, temperature: float = 0.2) -> list[float]:\n",
        "    FastLanguageModel.for_inference(model)\n",
        "    rs: list[float] = []\n",
        "    for i in range(n):\n",
        "        brief = briefings[i % len(briefings)]\n",
        "        prompt_text = render_chat(build_prompt(brief))\n",
        "        inputs = tokenizer(prompt_text, return_tensors=\"pt\", truncation=True, max_length=MAX_SEQ_LENGTH).to(model.device)\n",
        "        out = model.generate(\n",
        "            **inputs,\n",
        "            max_new_tokens=128,\n",
        "            do_sample=(temperature > 0),\n",
        "            temperature=max(temperature, 1e-5),\n",
        "            pad_token_id=tokenizer.pad_token_id,\n",
        "        )\n",
        "        completion = tokenizer.decode(out[0][inputs[\"input_ids\"].shape[1]:], skip_special_tokens=True)\n",
        "        action = sanitize_action(parse_action(completion))\n",
        "        env.reset()\n",
        "        try:\n",
        "            r, _ = env.step(action)\n",
        "        except RuntimeError:\n",
        "            r, _ = env.step({\"action_type\": \"do_nothing\"})\n",
        "        rs.append(r)\n",
        "    FastLanguageModel.for_training(model)\n",
        "    return rs\n",
        "\n",
        "print(\"Random baseline ...\")\n",
        "random_rewards = random_policy_reward()\n",
        "print(\" mean:\", sum(random_rewards) / len(random_rewards))\n",
        "\n",
        "print(\"Frozen-base baseline ...\")\n",
        "frozen_rewards = evaluate_policy(policy, n=N_EVAL, temperature=0.2)\n",
        "print(\" mean:\", sum(frozen_rewards) / len(frozen_rewards))\n",
        "\n",
        "baselines = {\"random\": random_rewards, \"frozen\": frozen_rewards}"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "018d2c7c",
      "metadata": {},
      "source": [
        "### 5.2 Stage B — first GRPO stage (easy->full curriculum starts here)\n",
        "\n",
        "We run a short SFT warmup first, then GRPO Stage B with stronger local scaffold weights (`CURRENT_LOCAL_SCALE`) and slightly lower env scale (`CURRENT_ENV_SCALE`).\n",
        "\n",
        "As stages progress (B -> C -> D), the notebook anneals toward full env-dominant training."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "10b073d0",
      "metadata": {},
      "outputs": [],
      "source": [
        "from trl import GRPOConfig, GRPOTrainer, SFTConfig, SFTTrainer\n",
        "\n",
        "reward_funcs = [env_reward, format_reward, semantic_action_reward, anti_idle_reward]\n",
        "stage_logs: dict[str, list[dict]] = {}\n",
        "\n",
        "# -------- SFT warmup --------\n",
        "def _heuristic_action_for_sft(briefing: str) -> dict:\n",
        "    b = briefing.lower()\n",
        "    if \"e01\" in b:\n",
        "        return {\"action_type\": \"reply_email\", \"email_id\": \"e01\", \"message_body\": \"Acknowledged, sharing an update shortly.\"}\n",
        "    if \"m02\" in b:\n",
        "        return {\"action_type\": \"reschedule_meeting\", \"meeting_id\": \"m02\", \"new_time\": \"2026-04-21T18:00:00\", \"reason\": \"resolve overlap\"}\n",
        "    if \"t06\" in b:\n",
        "        return {\"action_type\": \"complete_task\", \"task_id\": \"t06\"}\n",
        "    return {\"action_type\": \"do_nothing\"}\n",
        "\n",
        "sft_rows = []\n",
        "for b in briefings:\n",
        "    msgs = build_prompt(b)\n",
        "    prompt_txt = render_chat(msgs)\n",
        "    completion_txt = json.dumps(_heuristic_action_for_sft(b), ensure_ascii=True)\n",
        "    sft_rows.append({\"prompt_text\": prompt_txt, \"completion_text\": completion_txt})\n",
        "\n",
        "sft_ds = Dataset.from_list(sft_rows)\n",
        "sft_cfg = SFTConfig(\n",
        "    output_dir=str(OUT / \"sft_warmup\"),\n",
        "    max_steps=30,\n",
        "    per_device_train_batch_size=1,\n",
        "    gradient_accumulation_steps=4,\n",
        "    learning_rate=2e-5,\n",
        "    logging_steps=5,\n",
        "    report_to=\"none\",\n",
        ")\n",
        "sft_trainer = SFTTrainer(\n",
        "    model=policy,\n",
        "    processing_class=tokenizer,\n",
        "    train_dataset=sft_ds,\n",
        "    args=sft_cfg,\n",
        "    dataset_text_field=\"prompt_text\",\n",
        "    formatting_func=lambda ex: [f\"{p}{c}\" for p, c in zip(ex[\"prompt_text\"], ex[\"completion_text\"])],\n",
        ")\n",
        "print(\"\\n=== SFT warmup ===\")\n",
        "sft_trainer.train()\n",
        "policy = sft_trainer.model\n",
        "\n",
        "\n",
        "def grpo_config(name: str, *, temperature: float, num_generations: int, max_steps: int, lr: float) -> GRPOConfig:\n",
        "    return GRPOConfig(\n",
        "        output_dir=str(OUT / f\"stage_{name}\"),\n",
        "        per_device_train_batch_size=1,\n",
        "        gradient_accumulation_steps=4,\n",
        "        num_generations=num_generations,\n",
        "        max_prompt_length=1920,\n",
        "        max_completion_length=48,\n",
        "        temperature=temperature,\n",
        "        learning_rate=lr,\n",
        "        beta=0.04,\n",
        "        max_steps=max_steps,\n",
        "        logging_steps=1,\n",
        "        bf16=False,\n",
        "        fp16=True,\n",
        "        report_to=\"none\",\n",
        "        save_strategy=\"no\",\n",
        "        remove_unused_columns=False,\n",
        "        log_completions=True,\n",
        "    )\n",
        "\n",
        "\n",
        "def set_curriculum_scales(stage_name: str) -> None:\n",
        "    global CURRENT_ENV_SCALE, CURRENT_LOCAL_SCALE\n",
        "    # easy -> full complexity curriculum\n",
        "    if stage_name == \"B\":\n",
        "        CURRENT_ENV_SCALE = 0.85\n",
        "        CURRENT_LOCAL_SCALE = 0.60\n",
        "    elif stage_name == \"C\":\n",
        "        CURRENT_ENV_SCALE = 0.95\n",
        "        CURRENT_LOCAL_SCALE = 0.40\n",
        "    else:\n",
        "        CURRENT_ENV_SCALE = 1.00\n",
        "        CURRENT_LOCAL_SCALE = 0.25\n",
        "    print(f\"curriculum[{stage_name}] env={CURRENT_ENV_SCALE:.2f} local={CURRENT_LOCAL_SCALE:.2f}\")\n",
        "\n",
        "\n",
        "def run_stage(name: str, **kw) -> None:\n",
        "    set_curriculum_scales(name)\n",
        "    print(f\"\\n=== Stage {name} → {kw} ===\")\n",
        "    trainer = GRPOTrainer(\n",
        "        model=policy,\n",
        "        args=grpo_config(name, **kw),\n",
        "        train_dataset=train_ds,\n",
        "        reward_funcs=reward_funcs,\n",
        "        processing_class=tokenizer,\n",
        "    )\n",
        "    trainer.train()\n",
        "    stage_logs[name] = list(trainer.state.log_history)\n",
        "    adapter_dir = OUT / f\"adapter_stage_{name}\"\n",
        "    trainer.model.save_pretrained(adapter_dir)\n",
        "    tokenizer.save_pretrained(adapter_dir)\n",
        "    print(f\"stage {name} adapter → {adapter_dir}\")\n",
        "\n",
        "\n",
        "run_stage(\"B\", temperature=0.8, num_generations=2, max_steps=20, lr=5e-6)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "0fe81938",
      "metadata": {},
      "source": [
        "## Phase 6 — Inspect for hacking\n",
        "\n",
        "Don't trust the mean reward alone. Sample six post-Stage-B completions, parse them, hit the Space live, and print the full trio (completion / parsed action / reward). Look for obviously pathological outputs (repeated identical JSON, prose-only outputs, empty fields)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "815b4416",
      "metadata": {},
      "outputs": [],
      "source": [
        "FastLanguageModel.for_inference(policy)\n",
        "for i in range(6):\n",
        "    brief = briefings[i % len(briefings)]\n",
        "    prompt_text = render_chat(build_prompt(brief))\n",
        "    inputs = tokenizer(prompt_text, return_tensors=\"pt\", truncation=True, max_length=MAX_SEQ_LENGTH).to(policy.device)\n",
        "    out = policy.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7,\n",
        "                         pad_token_id=tokenizer.pad_token_id)\n",
        "    completion = tokenizer.decode(out[0][inputs[\"input_ids\"].shape[1]:], skip_special_tokens=True)\n",
        "    act = parse_action(completion)\n",
        "    env.reset(); r, _ = env.step(act)\n",
        "    print(f\"\\n--- sample {i} ---\")\n",
        "    print(\"completion:\", completion.strip()[:200])\n",
        "    print(\"parsed    :\", json.dumps(act))\n",
        "    print(\"reward    :\", round(r, 4))\n",
        "FastLanguageModel.for_training(policy)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "524f6691",
      "metadata": {},
      "source": [
        "## Phase 7 — Add curriculum\n",
        "\n",
        "The deployed Space scenario is fixed, so curriculum is applied through both:\n",
        "\n",
        "1. **Exploration schedule** (temperature/lr across stages)\n",
        "2. **Complexity curriculum (easy -> full)** via reward scales:\n",
        "   - Stage B: stronger local scaffold guidance\n",
        "   - Stage C: mixed guidance\n",
        "   - Stage D: env-dominant optimization"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "e1cdb870",
      "metadata": {},
      "outputs": [],
      "source": [
        "run_stage(\"C\", temperature=0.7, num_generations=2, max_steps=25, lr=5e-6)\n",
        "run_stage(\"D\", temperature=0.5, num_generations=2, max_steps=15, lr=2e-6)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "d124a745",
      "metadata": {},
      "source": [
        "## Phase 8 — Train bigger (knobs, not action)\n",
        "\n",
        "Only after the loop is stable should you scale. If you rent an L4 or A100 with HF credits:\n",
        "\n",
        "- `MODEL_ID` → `unsloth/Qwen3-4B-Instruct-2507` or `unsloth/Llama-3.1-8B-Instruct`\n",
        "- `N_BRIEFINGS` ↑ (more prompt diversity)\n",
        "- `num_generations` ↑ and `max_steps` ↑ (more rollouts per prompt, more updates)\n",
        "\n",
        "All other cells are unchanged. Don't add features until you've watched a full stable run on this small config."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "f8e59da9",
      "metadata": {},
      "source": [
        "## Phase 9 — Save and demo\n",
        "\n",
        "Re-evaluate on the same `N_EVAL` prompts, plot the before/after + reward curves, save the LoRA adapter (no 4-bit merge per Help Guide §16), and write a compliance manifest."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "ebd9908f",
      "metadata": {},
      "outputs": [],
      "source": [
        "print(\"Evaluating trained policy ...\")\n",
        "trained_rewards = evaluate_policy(policy, n=N_EVAL, temperature=0.2)\n",
        "print(\" trained mean:\", sum(trained_rewards) / len(trained_rewards))\n",
        "\n",
        "def _mean(xs): return sum(xs) / max(len(xs), 1)\n",
        "summary = {\n",
        "    \"random\":  _mean(baselines[\"random\"]),\n",
        "    \"frozen\":  _mean(baselines[\"frozen\"]),\n",
        "    \"trained\": _mean(trained_rewards),\n",
        "}\n",
        "print(json.dumps(summary, indent=2))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "5ccb3832",
      "metadata": {},
      "outputs": [],
      "source": [
        "import pandas as pd, matplotlib.pyplot as plt\n",
        "\n",
        "plt.figure(figsize=(6, 4))\n",
        "plt.bar(list(summary.keys()), list(summary.values()), color=[\"#888\", \"#1f77b4\", \"#2ca02c\"])\n",
        "plt.title(\"Ghostexec: mean reward vs deployed HF Space\")\n",
        "plt.ylabel(\"mean episode reward (higher is better)\")\n",
        "plt.axhline(0.0, color=\"black\", linewidth=0.5)\n",
        "plt.tight_layout()\n",
        "plt.savefig(OUT / \"before_after.png\", dpi=150)\n",
        "plt.show()\n",
        "\n",
        "rows = []\n",
        "loss_rows = []\n",
        "step_counter = 0\n",
        "for name, log in stage_logs.items():\n",
        "    for entry in log:\n",
        "        r = entry.get(\"rewards/env_reward/mean\", entry.get(\"reward\"))\n",
        "        if \"loss\" in entry:\n",
        "            loss_rows.append({\"stage\": name, \"global_step\": step_counter + 1, \"loss\": entry[\"loss\"]})\n",
        "        if r is None:\n",
        "            continue\n",
        "        step_counter += 1\n",
        "        rows.append({\n",
        "            \"stage\": name,\n",
        "            \"global_step\": step_counter,\n",
        "            \"env\": r,\n",
        "            \"fmt\":  entry.get(\"rewards/format_reward/mean\", 0.0),\n",
        "            \"semantic\": entry.get(\"rewards/semantic_action_reward/mean\", 0.0),\n",
        "            \"idle\": entry.get(\"rewards/anti_idle_reward/mean\", 0.0),\n",
        "        })\n",
        "\n",
        "df = pd.DataFrame(rows)\n",
        "df.to_csv(OUT / \"reward_log.csv\", index=False)\n",
        "\n",
        "loss_df = pd.DataFrame(loss_rows)\n",
        "if not loss_df.empty:\n",
        "    plt.figure(figsize=(8, 4))\n",
        "    for name, sub in loss_df.groupby(\"stage\"):\n",
        "        plt.plot(sub[\"global_step\"], sub[\"loss\"], label=f\"stage {name}\")\n",
        "    plt.xlabel(\"global step\"); plt.ylabel(\"loss\")\n",
        "    plt.title(\"Ghostexec SFT+GRPO — loss vs step\")\n",
        "    plt.legend(); plt.tight_layout()\n",
        "    plt.savefig(OUT / \"loss_curve.png\", dpi=150); plt.show()\n",
        "\n",
        "if not df.empty:\n",
        "    plt.figure(figsize=(8, 4))\n",
        "    for name, sub in df.groupby(\"stage\"):\n",
        "        plt.plot(sub[\"global_step\"], sub[\"env\"], label=f\"stage {name}\")\n",
        "    plt.xlabel(\"global step\"); plt.ylabel(\"mean env_reward\")\n",
        "    plt.title(\"Ghostexec GRPO — reward vs step (Unsloth)\")\n",
        "    plt.legend(); plt.tight_layout()\n",
        "    plt.savefig(OUT / \"reward_curve.png\", dpi=150); plt.show()\n",
        "\n",
        "    plt.figure(figsize=(8, 4))\n",
        "    plt.plot(df[\"global_step\"], df[\"env\"],  label=\"env_reward\")\n",
        "    plt.plot(df[\"global_step\"], df[\"fmt\"],  label=\"format_reward\")\n",
        "    plt.plot(df[\"global_step\"], df[\"semantic\"], label=\"semantic_action_reward\")\n",
        "    plt.plot(df[\"global_step\"], df[\"idle\"], label=\"anti_idle_reward\")\n",
        "    plt.xlabel(\"global step\"); plt.ylabel(\"mean component reward\")\n",
        "    plt.title(\"Reward components — hacking-watch\")\n",
        "    plt.legend(); plt.tight_layout()\n",
        "    plt.savefig(OUT / \"components.png\", dpi=150); plt.show()\n",
        "else:\n",
        "    print(\"No numeric reward log found — skipping curve plots.\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "07648f04",
      "metadata": {},
      "outputs": [],
      "source": [
        "final_adapter = OUT / \"adapter_final\"\n",
        "policy.save_pretrained(final_adapter)\n",
        "tokenizer.save_pretrained(final_adapter)\n",
        "print(\"final adapter →\", final_adapter)\n",
        "\n",
        "if HUB_REPO_ID and os.environ.get(\"HF_TOKEN\"):\n",
        "    from huggingface_hub import HfApi, login\n",
        "    login(token=os.environ[\"HF_TOKEN\"], add_to_git_credential=False)\n",
        "    policy.push_to_hub(HUB_REPO_ID, commit_message=f\"ghostexec GRPO adapter ({RUN_NAME})\")\n",
        "    tokenizer.push_to_hub(HUB_REPO_ID)\n",
        "    api = HfApi()\n",
        "    for fname in (\"reward_log.csv\", \"before_after.png\", \"reward_curve.png\", \"components.png\"):\n",
        "        p = OUT / fname\n",
        "        if p.exists():\n",
        "            api.upload_file(path_or_fileobj=str(p), path_in_repo=fname, repo_id=HUB_REPO_ID)\n",
        "    print(\"pushed adapter + artefacts →\", HUB_REPO_ID)\n",
        "else:\n",
        "    print(\"HUB_REPO_ID / HF_TOKEN not set — skipping push.\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "81fdfca3",
      "metadata": {},
      "outputs": [],
      "source": [
        "manifest = {\n",
        "    \"env_url\":    GHOSTEXEC_ENV_URL,\n",
        "    \"model\":      MODEL_ID,\n",
        "    \"run\":        RUN_NAME,\n",
        "    \"stack\":      {\"unsloth\": True, \"trl\": \"0.22.2\", \"pipeline\": \"SFT->GRPO\"},\n",
        "    \"rewards\": {\n",
        "        \"random_mean\":  summary[\"random\"],\n",
        "        \"frozen_mean\":  summary[\"frozen\"],\n",
        "        \"trained_mean\": summary[\"trained\"],\n",
        "        \"improvement_vs_frozen\": summary[\"trained\"] - summary[\"frozen\"],\n",
        "    },\n",
        "    \"stages\":       [\"SFT\"] + list(stage_logs.keys()),\n",
        "    \"reward_fns\":   [\"env_reward\", \"format_reward\", \"semantic_action_reward\", \"anti_idle_reward\"],\n",
        "    \"curriculum\":   {\n",
        "        \"type\": \"easy_to_full\",\n",
        "        \"stage_scales\": {\n",
        "            \"B\": {\"env\": 0.85, \"local\": 0.60},\n",
        "            \"C\": {\"env\": 0.95, \"local\": 0.40},\n",
        "            \"D\": {\"env\": 1.00, \"local\": 0.25},\n",
        "        },\n",
        "    },\n",
        "    \"adapter_path\": str(final_adapter),\n",
        "    \"mean_space_latency_ms\": round(sum(env.latency_ms) / max(len(env.latency_ms), 1), 1),\n",
        "    \"n_space_calls\":         len(env.latency_ms),\n",
        "}\n",
        "print(json.dumps(manifest, indent=2))\n",
        "(OUT / \"manifest.json\").write_text(json.dumps(manifest, indent=2))\n",
        "print(\"\\nmanifest →\", OUT / \"manifest.json\")"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.10"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}