Spaces:

anugrah55
/

opensleuth-colab

Runtime error

App Files Files Community

anugrah55 commited on 12 days ago

Commit

e8f2f91

verified ·

1 Parent(s): 44d68be

Initial commit: OpenSleuth Colab quickstart notebook + Gradio landing page

Browse files

Files changed (4) hide show

README.md +42 -5
app.py +74 -0
requirements.txt +1 -0
train_opensleuth_grpo.ipynb +821 -0

README.md CHANGED Viewed

@@ -1,12 +1,49 @@
 ---
-title: Opensleuth Colab
-emoji: 👁
-colorFrom: red
 colorTo: green
 sdk: gradio
-sdk_version: 6.13.0
 app_file: app.py
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: OpenSleuth Colab
+emoji: 🕵️
+colorFrom: indigo
 colorTo: green
 sdk: gradio
+sdk_version: 4.44.0
 app_file: app.py
 pinned: false
+license: apache-2.0
 ---
+# OpenSleuth — Colab quickstart Space
+This Space is a thin landing page for the [`train_opensleuth_grpo.ipynb`](./train_opensleuth_grpo.ipynb) notebook — the **minimum reproducible Colab** for training an OpenSleuth agent end-to-end against the live env Space.
+## What is OpenSleuth?
+An **Algorithmic Detective** RL environment. An LLM agent reverse-engineers an unknown black-box Python function by **probing** it with inputs and then **submitting** a Python replica. The environment scores submissions by domain-aware fuzz-testing against the hidden reference, with a complexity penalty so the agent can't just memorise its probes inside a giant `if/else`.
+## Try it
+Click the badge to open the notebook in Google Colab:
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https%3A//huggingface.co/spaces/anugrah55/opensleuth-colab/blob/main/train_opensleuth_grpo.ipynb)
+Or download `train_opensleuth_grpo.ipynb` from the **Files** tab and upload it to Colab manually. Set the runtime to **GPU → T4** and hit **Runtime → Run all** — end-to-end training completes in roughly 15 – 25 minutes on a free-tier T4 with the default Qwen2.5-0.5B-Instruct config.
+## What the notebook does
+1. Pip-installs the pinned trainer stack (`transformers==4.51.3`, `trl==0.16.1`, `peft==0.14.0`, `accelerate==1.4.0`, `bitsandbytes==0.45.5`, `datasets==3.3.2`).
+2. Hits the live env Space [`anugrah55/opensleuth-env-gemini-cli`](https://huggingface.co/spaces/anugrah55/opensleuth-env-gemini-cli) at `https://anugrah55-opensleuth-env-gemini-cli.hf.space` to discover all 15 tasks (9 builtins + 6 from the Hub task dataset).
+3. Builds a synthesis dataset where each row is `(signature + observed probes) → expected python implementation`.
+4. Loads `Qwen2.5-0.5B-Instruct` in 4-bit + LoRA so it fits on a T4.
+5. Trains with HF TRL's `GRPOTrainer` using a two-part reward:
+   - **env-verifier reward**: real fuzz-tested correctness against the hidden reference, with a complexity penalty.
+   - **format reward**: tiny shaping signal for emitting a fenced ```python``` code block with the right function name.
+6. Optionally pushes the trained LoRA adapter to your own Hub account.
+7. Runs a 3-episode smoke eval and prints the agent's emitted code.
+## Links
+- **Env Space (REST API the notebook calls):** https://huggingface.co/spaces/anugrah55/opensleuth-env-gemini-cli
+- **Training Space (full 3B retrain):** https://huggingface.co/spaces/anugrah55/opensleuth-training-gemini-cli
+- **Open-ended task catalog (Hub dataset):** https://huggingface.co/datasets/anugrah55/opensleuth-tasks
+## License
+Apache-2.0.

app.py ADDED Viewed

	@@ -0,0 +1,74 @@

+"""Tiny Gradio landing page for the OpenSleuth Colab notebook Space.
+The actual training happens in the notebook (`train_opensleuth_grpo.ipynb` in
+this same repo, downloadable from the Files tab). This app just renders a
+clickable Open-In-Colab card so visitors can launch it in one click.
+"""
+from __future__ import annotations
+import gradio as gr
+NOTEBOOK_PATH = "train_opensleuth_grpo.ipynb"
+SPACE_ID = "anugrah55/opensleuth-colab"
+COLAB_URL = (
+    "https://colab.research.google.com/#fileId="
+    f"https%3A//huggingface.co/spaces/{SPACE_ID}/blob/main/{NOTEBOOK_PATH}"
+)
+LANDING_MD = f"""
+# OpenSleuth — Colab quickstart
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]({COLAB_URL})
+OpenSleuth is an *Algorithmic Detective* RL environment. An LLM agent reverse-engineers an unknown black-box Python function by probing it and then submitting a Python replica. The env fuzz-tests the submission against the hidden reference (with a complexity penalty) and returns a scalar reward.
+This Space hosts the **minimum reproducible Colab notebook** for training an
+agent against the live env Space using **HF TRL's `GRPOTrainer`** + **bnb-4bit**
++ **LoRA** on a free-tier Colab T4. End-to-end runtime: ~15 – 25 minutes.
+### One-click training
+1. Click the **Open in Colab** badge above (or grab `{NOTEBOOK_PATH}` from the **Files** tab and upload it to Colab manually).
+2. In Colab: `Runtime → Change runtime type → GPU → T4`.
+3. `Runtime → Run all`.
+### Defaults
+| Knob | Value |
+|------|-------|
+| Model | `Qwen/Qwen2.5-0.5B-Instruct` |
+| Quant | bnb-4bit (nf4 + double-quant) |
+| LoRA  | r=16, alpha=32, q/k/v/o |
+| Tasks | all 15 from `anugrah55/opensleuth-tasks` |
+| GRPO `num_generations` | 4 |
+| Epochs | 1 |
+### Links
+- **Env Space (REST API the notebook calls):** https://huggingface.co/spaces/anugrah55/opensleuth-env-gemini-cli
+- **Training Space (full 3B retrain):** https://huggingface.co/spaces/anugrah55/opensleuth-training-gemini-cli
+- **Open-ended task catalog:** https://huggingface.co/datasets/anugrah55/opensleuth-tasks
+"""
+def _open_colab() -> str:
+    return f"Opening Colab: {COLAB_URL}"
+with gr.Blocks(title="OpenSleuth — Colab quickstart") as demo:
+    gr.Markdown(LANDING_MD)
+    with gr.Row():
+        gr.Button(
+            value="Open in Google Colab",
+            link=COLAB_URL,
+            variant="primary",
+        )
+        gr.Button(
+            value="View notebook in Files tab",
+            link=f"https://huggingface.co/spaces/{SPACE_ID}/blob/main/{NOTEBOOK_PATH}",
+        )
+if __name__ == "__main__":
+    demo.launch()

requirements.txt ADDED Viewed

	@@ -0,0 +1 @@


1	+ gradio==4.44.0

train_opensleuth_grpo.ipynb ADDED Viewed

	@@ -0,0 +1,821 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "7086c037",
+   "metadata": {},
+   "source": [
+    "# OpenSleuth — GRPO training on a free-tier Colab T4\n",
+    "\n",
+    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/anugrah55/opensleuth/blob/main/colab/train_opensleuth_grpo.ipynb)\n",
+    "\n",
+    "**OpenSleuth** is an *Algorithmic Detective* RL environment. An LLM agent reverse-engineers an unknown black-box Python function by **probing** it with inputs and then **submitting** a Python replica. The environment scores submissions by domain-aware fuzz-testing against the hidden reference, with a complexity penalty so the agent can't just memorise its probes inside a giant `if/else`.\n",
+    "\n",
+    "This notebook trains a small open-weights model with HF TRL's [`GRPOTrainer`](https://huggingface.co/docs/trl/en/grpo_trainer) against the **live** OpenSleuth environment Space. It is sized to complete end-to-end on a **free-tier Colab T4** (16 GB GPU) in roughly **15 – 25 minutes**.\n",
+    "\n",
+    "### Links\n",
+    "\n",
+    "- **Live env Space (this notebook calls it directly):** https://huggingface.co/spaces/anugrah55/opensleuth-env-gemini-cli — REST API at `https://anugrah55-opensleuth-env-gemini-cli.hf.space`\n",
+    "- **Open-ended task catalog (Hub dataset, 15 tasks):** https://huggingface.co/datasets/anugrah55/opensleuth-tasks\n",
+    "- **Repo / Spaces:** training Space `anugrah55/opensleuth-training-gemini-cli`, env Space `anugrah55/opensleuth-env-gemini-cli`\n",
+    "- **Blog (if/when published):** https://huggingface.co/blog/anugrah55/opensleuth\n",
+    "\n",
+    "### What this notebook does\n",
+    "\n",
+    "1. Installs pinned versions of `transformers`, `trl`, `peft`, `bitsandbytes`, `accelerate`, `datasets`.\n",
+    "2. Hits the env's `/tasks` endpoint to discover all 15 tasks (9 builtins + 6 Hub-driven, both open-ended).\n",
+    "3. Builds a synthesis dataset where each row is `(signature + observed probes) → expected python implementation`.\n",
+    "4. Loads **Qwen2.5-0.5B-Instruct** in 4-bit + LoRA so it fits comfortably on a T4.\n",
+    "5. Trains with GRPO using a **two-part reward**: env-verifier score (real fuzz-tested correctness, capped by complexity) plus a tiny formatting shaping reward.\n",
+    "6. Optionally pushes the trained adapter to the Hub.\n",
+    "7. Runs a 3-episode smoke eval against the live env and prints the agent's emitted code.\n",
+    "\n",
+    "> The full 3B retrain runs separately on a Hugging Face Space; this notebook is the **minimum reproducible Colab** required by the hackathon spec."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9307eb3f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Pinned to /training/requirements.txt so the env-side reward stays in lockstep.\n",
+    "# trl 0.16.x is required for the modern GRPOTrainer / GRPOConfig API.\n",
+    "!pip install --quiet \\\n",
+    "    \"transformers==4.51.3\" \\\n",
+    "    \"trl==0.16.1\" \\\n",
+    "    \"peft==0.14.0\" \\\n",
+    "    \"accelerate==1.4.0\" \\\n",
+    "    \"bitsandbytes==0.45.5\" \\\n",
+    "    \"datasets==3.3.2\" \\\n",
+    "    \"huggingface_hub>=0.30.2,<1.0\" \\\n",
+    "    \"requests>=2.32.3\"\n",
+    "print(\"deps installed\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bb6ecbad",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# OPTIONAL: log in so you can push the trained adapter to your own HF account.\n",
+    "# Skip this cell entirely if you only want to train + smoke-eval locally in Colab.\n",
+    "from huggingface_hub import notebook_login\n",
+    "notebook_login()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6c81d26f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import logging\n",
+    "import os\n",
+    "import random\n",
+    "import re\n",
+    "import sys\n",
+    "import time\n",
+    "from typing import Any, Dict, Iterable, List, Optional, Sequence\n",
+    "\n",
+    "import requests\n",
+    "import torch\n",
+    "from datasets import Dataset\n",
+    "\n",
+    "logging.basicConfig(\n",
+    "    level=logging.INFO,\n",
+    "    format=\"%(asctime)s %(levelname)s %(name)s: %(message)s\",\n",
+    "    stream=sys.stdout,\n",
+    ")\n",
+    "log = logging.getLogger(\"opensleuth.colab\")\n",
+    "\n",
+    "# ---------------------------------------------------------------------------\n",
+    "# Constants you might tweak\n",
+    "# ---------------------------------------------------------------------------\n",
+    "\n",
+    "# Live OpenSleuth env Space.\n",
+    "ENV_URL = \"https://anugrah55-opensleuth-env-gemini-cli.hf.space\"\n",
+    "\n",
+    "# Default to the 0.5B Qwen so this completes in ~15-25 min on a free-tier T4.\n",
+    "# Bump to \"Qwen/Qwen2.5-1.5B-Instruct\" or \"Qwen/Qwen2.5-3B-Instruct\" if you have\n",
+    "# Colab Pro / a beefier GPU. The reward + dataset code is model-agnostic.\n",
+    "MODEL_NAME = \"Qwen/Qwen2.5-0.5B-Instruct\"\n",
+    "\n",
+    "# Where to dump checkpoints + the final adapter.\n",
+    "OUTPUT_DIR = \"./opensleuth-grpo-colab\"\n",
+    "\n",
+    "# Set to your own Hub repo to push (e.g. \"your-username/opensleuth-grpo-colab\").\n",
+    "# Leave as None to skip pushing. Pushing requires a write token from the login\n",
+    "# cell above.\n",
+    "PUSH_TO_HUB_REPO: Optional[str] = None  # e.g. \"anugrah55/opensleuth-grpo-colab\"\n",
+    "\n",
+    "# ---------------------------------------------------------------------------\n",
+    "# Hyperparameters (sized for free-tier T4: 16GB GPU, ~12hr session)\n",
+    "# ---------------------------------------------------------------------------\n",
+    "\n",
+    "# Per-task rollouts. Small so the dataset-build phase (which has to call the env\n",
+    "# /probe endpoint many times) finishes in a few minutes.\n",
+    "N_PER_FUNCTION = 8\n",
+    "N_PROBES = 6\n",
+    "\n",
+    "# GRPO knobs. num_generations=4 + per_device_batch_size=4 means each optimisation\n",
+    "# step uses one prompt and 4 sampled completions for the relative advantage,\n",
+    "# which is the minimum sensible GRPO batch.\n",
+    "NUM_GENERATIONS = 4\n",
+    "PER_DEVICE_BATCH_SIZE = 4\n",
+    "GRADIENT_ACCUMULATION_STEPS = 2\n",
+    "NUM_TRAIN_EPOCHS = 1.0\n",
+    "LEARNING_RATE = 1e-5\n",
+    "MAX_PROMPT_LENGTH = 1024\n",
+    "MAX_COMPLETION_LENGTH = 384\n",
+    "\n",
+    "SEED = 42\n",
+    "random.seed(SEED)\n",
+    "\n",
+    "print(\"env_url       =\", ENV_URL)\n",
+    "print(\"model_name    =\", MODEL_NAME)\n",
+    "print(\"output_dir    =\", OUTPUT_DIR)\n",
+    "print(\"push_to_hub   =\", PUSH_TO_HUB_REPO)\n",
+    "print(\"cuda_available =\", torch.cuda.is_available())\n",
+    "if torch.cuda.is_available():\n",
+    "    print(\"gpu           =\", torch.cuda.get_device_name(0))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fdd9c63b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Self-contained copy of /training/opensleuth_train/client.py so this notebook\n",
+    "# does not depend on pip-installing the trainer package.\n",
+    "\n",
+    "class EnvClient:\n",
+    "    \"\"\"Thin HTTP client for the live OpenSleuth env Space.\"\"\"\n",
+    "\n",
+    "    def __init__(self, base_url: str = ENV_URL, timeout: float = 60.0, retries: int = 4):\n",
+    "        self.base_url = base_url.rstrip(\"/\")\n",
+    "        self.timeout = timeout\n",
+    "        self.retries = retries\n",
+    "\n",
+    "    def _post(self, path: str, payload: Dict[str, Any]) -> Dict[str, Any]:\n",
+    "        last_exc: Optional[Exception] = None\n",
+    "        for attempt in range(self.retries):\n",
+    "            try:\n",
+    "                r = requests.post(\n",
+    "                    f\"{self.base_url}{path}\", json=payload, timeout=self.timeout\n",
+    "                )\n",
+    "                r.raise_for_status()\n",
+    "                return r.json()\n",
+    "            except (requests.RequestException, ValueError) as e:\n",
+    "                last_exc = e\n",
+    "                wait = 0.5 * (2 ** attempt)\n",
+    "                log.warning(\"env POST %s failed (%s); retrying in %.1fs\", path, e, wait)\n",
+    "                time.sleep(wait)\n",
+    "        raise RuntimeError(\n",
+    "            f\"env POST {path} failed after {self.retries} retries: {last_exc}\"\n",
+    "        )\n",
+    "\n",
+    "    def _get(self, path: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:\n",
+    "        last_exc: Optional[Exception] = None\n",
+    "        for attempt in range(self.retries):\n",
+    "            try:\n",
+    "                r = requests.get(\n",
+    "                    f\"{self.base_url}{path}\", params=params, timeout=self.timeout\n",
+    "                )\n",
+    "                r.raise_for_status()\n",
+    "                return r.json()\n",
+    "            except (requests.RequestException, ValueError) as e:\n",
+    "                last_exc = e\n",
+    "                wait = 0.5 * (2 ** attempt)\n",
+    "                log.warning(\"env GET %s failed (%s); retrying in %.1fs\", path, e, wait)\n",
+    "                time.sleep(wait)\n",
+    "        raise RuntimeError(\n",
+    "            f\"env GET {path} failed after {self.retries} retries: {last_exc}\"\n",
+    "        )\n",
+    "\n",
+    "    def health(self) -> Dict[str, Any]:\n",
+    "        r = requests.get(f\"{self.base_url}/health\", timeout=self.timeout)\n",
+    "        r.raise_for_status()\n",
+    "        return r.json()\n",
+    "\n",
+    "    def list_functions(self) -> List[Dict[str, str]]:\n",
+    "        \"\"\"Legacy /functions endpoint -- only the 9 builtin functions.\"\"\"\n",
+    "        r = requests.get(f\"{self.base_url}/functions\", timeout=self.timeout)\n",
+    "        r.raise_for_status()\n",
+    "        return r.json()[\"functions\"]\n",
+    "\n",
+    "    def list_tasks(\n",
+    "        self,\n",
+    "        source: str = \"all\",\n",
+    "        difficulty: Optional[str] = None,\n",
+    "    ) -> List[Dict[str, Any]]:\n",
+    "        \"\"\"Live catalog: builtins + Hub-driven tasks.\"\"\"\n",
+    "        params: Dict[str, Any] = {\"source\": source}\n",
+    "        if difficulty:\n",
+    "            params[\"difficulty\"] = difficulty\n",
+    "        return self._get(\"/tasks\", params=params)[\"tasks\"]\n",
+    "\n",
+    "    def sample_inputs(self, target_name: str, n: int = 8, seed: int = 0) -> List[str]:\n",
+    "        \"\"\"Pull `n` ready-to-probe input_repr strings from the env's auto-fuzzer.\"\"\"\n",
+    "        resp = self._get(\n",
+    "            f\"/tasks/{target_name}/sample_inputs\",\n",
+    "            params={\"n\": n, \"seed\": seed},\n",
+    "        )\n",
+    "        return list(resp[\"inputs\"])\n",
+    "\n",
+    "    def reset(self, target_name: str, seed: int = 0, max_steps: int = 25) -> Dict[str, Any]:\n",
+    "        return self._post(\n",
+    "            \"/reset\",\n",
+    "            {\"target_name\": target_name, \"seed\": seed, \"max_steps\": max_steps},\n",
+    "        )\n",
+    "\n",
+    "    def step(self, episode_id: str, action: Dict[str, Any]) -> Dict[str, Any]:\n",
+    "        return self._post(\n",
+    "            \"/step\", {\"episode_id\": episode_id, \"action\": action}\n",
+    "        )\n",
+    "\n",
+    "    def submit(self, episode_id: str, code: str) -> Dict[str, Any]:\n",
+    "        return self.step(episode_id, {\"action_type\": \"submit\", \"code\": code})\n",
+    "\n",
+    "    def probe(self, episode_id: str, input_repr: str) -> Dict[str, Any]:\n",
+    "        return self.step(episode_id, {\"action_type\": \"probe\", \"input_repr\": input_repr})\n",
+    "\n",
+    "    def score_submission(self, target_name: str, code: str, seed: int = 0) -> float:\n",
+    "        \"\"\"One-shot: open an episode, submit the code, return total reward.\"\"\"\n",
+    "        ep = self.reset(target_name=target_name, seed=seed, max_steps=2)\n",
+    "        resp = self.submit(ep[\"episode_id\"], code)\n",
+    "        return float(resp[\"reward\"])\n",
+    "\n",
+    "\n",
+    "client = EnvClient(base_url=ENV_URL)\n",
+    "print(\"EnvClient ready ->\", client.base_url)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c2e1c7e5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Self-contained, minimal copies of:\n",
+    "#   /training/opensleuth_train/prompt.py\n",
+    "#   /training/opensleuth_train/reward.py\n",
+    "#   /training/opensleuth_train/dataset.py\n",
+    "# (kept in lockstep with the original modules; the env is the source of truth\n",
+    "# for which tasks exist and what probe inputs to use.)\n",
+    "\n",
+    "# --- prompt --------------------------------------------------------------\n",
+    "\n",
+    "SYSTEM_PROMPT = (\n",
+    "    \"You are an algorithmic detective. You are given the public signature of a \"\n",
+    "    \"hidden Python function plus several (input, output) examples observed by \"\n",
+    "    \"probing it. Your job is to write a Python function that *exactly* \"\n",
+    "    \"reproduces the hidden function's behavior on all valid inputs. Match its \"\n",
+    "    \"return values AND its exception types on invalid inputs. Keep your \"\n",
+    "    \"implementation as simple and clean as possible (it is penalised for \"\n",
+    "    \"being needlessly branchy). Return ONLY the function definition wrapped \"\n",
+    "    \"in a single ```python ... ``` code block.\"\n",
+    ")\n",
+    "\n",
+    "\n",
+    "def build_prompt(target_name: str, signature: str, probes: Iterable[tuple]) -> str:\n",
+    "    lines = [\n",
+    "        f\"## Hidden function: {target_name}\",\n",
+    "        \"\",\n",
+    "        \"### Public signature & docstring\",\n",
+    "        signature.strip() or \"(no signature provided)\",\n",
+    "        \"\",\n",
+    "        \"### Observed probes\",\n",
+    "    ]\n",
+    "    probe_list = list(probes)\n",
+    "    if not probe_list:\n",
+    "        lines.append(\"(none)\")\n",
+    "    else:\n",
+    "        for inp, out, is_err in probe_list:\n",
+    "            tag = \"raises\" if is_err else \"returns\"\n",
+    "            lines.append(f\"- input={inp}  ->  {tag} {out}\")\n",
+    "    lines += [\n",
+    "        \"\",\n",
+    "        \"### Task\",\n",
+    "        f\"Write a Python function named `{target_name}` that reproduces the hidden \"\n",
+    "        \"function's behaviour. Return ONLY the function definition in a single \"\n",
+    "        \"```python ... ``` code block. Do not add explanations.\",\n",
+    "    ]\n",
+    "    return \"\\n\".join(lines)\n",
+    "\n",
+    "\n",
+    "_CODE_RE = re.compile(r\"```(?:python)?\\s*(.*?)```\", re.DOTALL | re.IGNORECASE)\n",
+    "\n",
+    "\n",
+    "def extract_code(completion: str) -> str:\n",
+    "    m = _CODE_RE.search(completion)\n",
+    "    if m:\n",
+    "        return m.group(1).strip()\n",
+    "    return completion.strip()\n",
+    "\n",
+    "\n",
+    "# --- reward --------------------------------------------------------------\n",
+    "\n",
+    "_FUNC_RE = re.compile(r\"^def\\s+(\\w+)\\s*\\(\", re.MULTILINE)\n",
+    "\n",
+    "\n",
+    "def _extract_text(completion):\n",
+    "    if isinstance(completion, str):\n",
+    "        return completion\n",
+    "    if isinstance(completion, list):\n",
+    "        parts = []\n",
+    "        for msg in completion:\n",
+    "            if isinstance(msg, dict) and \"content\" in msg:\n",
+    "                parts.append(str(msg[\"content\"]))\n",
+    "            else:\n",
+    "                parts.append(str(msg))\n",
+    "        return \"\\n\".join(parts)\n",
+    "    return str(completion)\n",
+    "\n",
+    "\n",
+    "def _index(value, i: int, default):\n",
+    "    if value is None:\n",
+    "        return default\n",
+    "    if isinstance(value, list):\n",
+    "        return value[i] if i < len(value) else default\n",
+    "    return value\n",
+    "\n",
+    "\n",
+    "def make_env_reward(client: EnvClient, *, scale: float = 1.0 / 100.0):\n",
+    "    \"\"\"Verifier-backed reward. Calls /step submit on the env and returns the\n",
+    "    env's reward divided by `scale` (so a perfect submission ~= +1.5 and a bad\n",
+    "    one ~= -0.5; keeps GRPO advantages well-behaved without normalisation).\"\"\"\n",
+    "\n",
+    "    def env_reward(completions, target_function_name=None, row_seed=None, **kwargs):\n",
+    "        rewards: List[float] = []\n",
+    "        for i, completion in enumerate(completions):\n",
+    "            text = _extract_text(completion)\n",
+    "            code = extract_code(text)\n",
+    "            tname = _index(target_function_name, i, default=\"fibonacci\")\n",
+    "            seed = _index(row_seed, i, default=0)\n",
+    "            try:\n",
+    "                env_reward_value = client.score_submission(tname, code, seed=seed)\n",
+    "            except Exception as e:\n",
+    "                log.warning(\"env scoring failed for %s: %s\", tname, e)\n",
+    "                env_reward_value = -50.0\n",
+    "            rewards.append(env_reward_value * scale)\n",
+    "        return rewards\n",
+    "\n",
+    "    env_reward.__name__ = \"env_verifier_reward\"\n",
+    "    return env_reward\n",
+    "\n",
+    "\n",
+    "def format_reward(completions, target_function_name=None, **kwargs):\n",
+    "    \"\"\"Cheap shaping reward: +0.1 if the output has a fenced python block,\n",
+    "    +0.1 more if it defines the right function name. Encourages the model to\n",
+    "    converge on the expected output format quickly so the env-verifier reward\n",
+    "    becomes informative early in training.\"\"\"\n",
+    "    rewards: List[float] = []\n",
+    "    for i, completion in enumerate(completions):\n",
+    "        text = _extract_text(completion)\n",
+    "        score = 0.0\n",
+    "        if \"```python\" in text or \"```\\n\" in text:\n",
+    "            score += 0.1\n",
+    "        code = extract_code(text)\n",
+    "        m = _FUNC_RE.search(code)\n",
+    "        tname = _index(target_function_name, i, default=None)\n",
+    "        if m and (tname is None or m.group(1) == tname):\n",
+    "            score += 0.1\n",
+    "        rewards.append(score)\n",
+    "    return rewards\n",
+    "\n",
+    "\n",
+    "format_reward.__name__ = \"format_reward\"\n",
+    "\n",
+    "\n",
+    "# --- dataset -------------------------------------------------------------\n",
+    "\n",
+    "\n",
+    "def discover_functions(\n",
+    "    client: EnvClient,\n",
+    "    *,\n",
+    "    source: str = \"all\",\n",
+    "    include: Optional[Sequence[str]] = None,\n",
+    "    difficulty: Optional[str] = None,\n",
+    ") -> List[dict]:\n",
+    "    \"\"\"Live task catalog from the env. `include` filters by name;\n",
+    "    `difficulty` filters by easy/medium/hard.\"\"\"\n",
+    "    tasks = client.list_tasks(source=source)\n",
+    "    if difficulty and difficulty.lower() != \"all\":\n",
+    "        tasks = [t for t in tasks if (t.get(\"difficulty\") or \"\").lower() == difficulty.lower()]\n",
+    "    if include:\n",
+    "        wanted = {n.strip() for n in include if n and n.strip()}\n",
+    "        if wanted:\n",
+    "            tasks = [t for t in tasks if t[\"name\"] in wanted]\n",
+    "    if not tasks:\n",
+    "        raise RuntimeError(\n",
+    "            f\"discover_functions filtered to 0 tasks \"\n",
+    "            f\"(source={source!r}, include={include!r}, difficulty={difficulty!r}).\"\n",
+    "        )\n",
+    "    return tasks\n",
+    "\n",
+    "\n",
+    "def _make_probe_inputs(\n",
+    "    target_name: str,\n",
+    "    rng: random.Random,\n",
+    "    n: int,\n",
+    "    *,\n",
+    "    client: EnvClient,\n",
+    "    seed: int,\n",
+    ") -> List[str]:\n",
+    "    \"\"\"Preferred path: ask the env for `n` ready-to-probe inputs via its\n",
+    "    auto-fuzzer. Fallback (if the endpoint hiccups): submit literal \"1\" probes\n",
+    "    so we at least populate `n` rows.\"\"\"\n",
+    "    try:\n",
+    "        return client.sample_inputs(target_name=target_name, n=n, seed=seed)\n",
+    "    except Exception as e:\n",
+    "        log.warning(\n",
+    "            \"env sample_inputs(%s, n=%d, seed=%s) failed: %s; falling back to literals\",\n",
+    "            target_name, n, seed, e,\n",
+    "        )\n",
+    "        return [\"1\"] * n\n",
+    "\n",
+    "\n",
+    "def _sample_probes(\n",
+    "    client: EnvClient,\n",
+    "    target_name: str,\n",
+    "    seed: int,\n",
+    "    n_probes: int,\n",
+    ") -> tuple:\n",
+    "    \"\"\"Open one episode and feed it `n_probes` random valid inputs sourced\n",
+    "    from the env's own auto-fuzzer. Returns `(signature, history)`.\"\"\"\n",
+    "    rng = random.Random(seed)\n",
+    "    ep = client.reset(target_name=target_name, seed=seed, max_steps=n_probes + 5)\n",
+    "    sig = ep[\"target_function_signature\"]\n",
+    "    eid = ep[\"episode_id\"]\n",
+    "\n",
+    "    inputs = _make_probe_inputs(\n",
+    "        target_name, rng, n_probes, client=client, seed=seed,\n",
+    "    )\n",
+    "    history: List[tuple] = []\n",
+    "    for inp_repr in inputs:\n",
+    "        try:\n",
+    "            resp = client.probe(eid, inp_repr)\n",
+    "        except Exception as e:\n",
+    "            log.warning(\"probe failed for %s with %r: %s\", target_name, inp_repr, e)\n",
+    "            continue\n",
+    "        last = resp[\"observation\"][\"probe_history\"][-1]\n",
+    "        history.append(\n",
+    "            (last[\"input_repr\"], last[\"output_repr\"], bool(last[\"is_error\"]))\n",
+    "        )\n",
+    "    return sig, history\n",
+    "\n",
+    "\n",
+    "def build_synthesis_dataset(\n",
+    "    client: EnvClient,\n",
+    "    *,\n",
+    "    n_per_function: int,\n",
+    "    n_probes: int = 6,\n",
+    "    seed: int = 0,\n",
+    "    include: Optional[Sequence[str]] = None,\n",
+    "    difficulty: Optional[str] = None,\n",
+    "    tasks: Optional[Iterable[dict]] = None,\n",
+    ") -> Dataset:\n",
+    "    \"\"\"Build a HuggingFace Dataset of {prompt, target_function_name} rows.\n",
+    "\n",
+    "    Uniform-N variant of /training/opensleuth_train/dataset.py: every task\n",
+    "    gets the same `n_per_function` rollouts. (The full trainer uses a\n",
+    "    difficulty-weighted schedule; we keep the Colab variant simple so the\n",
+    "    dataset-build phase fits in the free-tier session.)\"\"\"\n",
+    "    if tasks is None:\n",
+    "        tasks = discover_functions(\n",
+    "            client, include=include, difficulty=difficulty,\n",
+    "        )\n",
+    "    tasks = list(tasks)\n",
+    "    rows = []\n",
+    "    rng = random.Random(seed)\n",
+    "    log.info(\n",
+    "        \"building dataset over %d task(s); n_per_function=%d n_probes=%d\",\n",
+    "        len(tasks), n_per_function, n_probes,\n",
+    "    )\n",
+    "    for task in tasks:\n",
+    "        fn_name = task[\"name\"]\n",
+    "        diff = (task.get(\"difficulty\") or \"\").lower() or \"?\"\n",
+    "        log.info(\n",
+    "            \"  %-22s difficulty=%-8s rollouts=%d source=%s\",\n",
+    "            fn_name, diff, n_per_function, task.get(\"source\", \"?\"),\n",
+    "        )\n",
+    "        for _ in range(n_per_function):\n",
+    "            row_seed = rng.randrange(0, 2 ** 31)\n",
+    "            try:\n",
+    "                sig, probes = _sample_probes(client, fn_name, row_seed, n_probes)\n",
+    "            except Exception as e:\n",
+    "                log.warning(\n",
+    "                    \"rollout build failed for %s seed=%d: %s; skipping row\",\n",
+    "                    fn_name, row_seed, e,\n",
+    "                )\n",
+    "                continue\n",
+    "            prompt = build_prompt(fn_name, sig, probes)\n",
+    "            rows.append(\n",
+    "                {\n",
+    "                    \"prompt\": prompt,\n",
+    "                    \"target_function_name\": fn_name,\n",
+    "                    \"row_seed\": row_seed,\n",
+    "                    \"difficulty\": diff,\n",
+    "                }\n",
+    "            )\n",
+    "    rng.shuffle(rows)\n",
+    "    log.info(\"built dataset: %d rows total\", len(rows))\n",
+    "    return Dataset.from_list(rows)\n",
+    "\n",
+    "\n",
+    "print(\"prompt + reward + dataset helpers loaded\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "88230844",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sanity-check the env, list every task it exposes, and build the dataset.\n",
+    "print(\"--- env health ---\")\n",
+    "print(client.health())\n",
+    "\n",
+    "print(\"\\n--- legacy /functions (9 builtins) ---\")\n",
+    "for f in client.list_functions():\n",
+    "    print(\" -\", f.get(\"name\"), \"::\", f.get(\"signature\", \"\")[:60])\n",
+    "\n",
+    "print(\"\\n--- /tasks (full open-ended catalog) ---\")\n",
+    "all_tasks = client.list_tasks()\n",
+    "print(f\"total tasks: {len(all_tasks)}\")\n",
+    "for t in all_tasks:\n",
+    "    print(f\"  - {t['name']:<22} difficulty={t.get('difficulty', '?'):<6} source={t.get('source', '?')}\")\n",
+    "\n",
+    "print(\"\\n--- building synthesis dataset ---\")\n",
+    "dataset_raw = build_synthesis_dataset(\n",
+    "    client,\n",
+    "    n_per_function=N_PER_FUNCTION,\n",
+    "    n_probes=N_PROBES,\n",
+    "    seed=SEED,\n",
+    ")\n",
+    "print(f\"\\ndataset rows: {len(dataset_raw)}\")\n",
+    "print(\"\\nsample row 0:\")\n",
+    "print(\"  target_function_name =\", dataset_raw[0][\"target_function_name\"])\n",
+    "print(\"  difficulty           =\", dataset_raw[0][\"difficulty\"])\n",
+    "print(\"  row_seed             =\", dataset_raw[0][\"row_seed\"])\n",
+    "print(\"  prompt:\")\n",
+    "print(\"  \" + dataset_raw[0][\"prompt\"].replace(\"\\n\", \"\\n  \"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "14ca2743",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from peft import LoraConfig\n",
+    "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
+    "\n",
+    "bnb_config = BitsAndBytesConfig(\n",
+    "    load_in_4bit=True,\n",
+    "    bnb_4bit_compute_dtype=torch.bfloat16,\n",
+    "    bnb_4bit_use_double_quant=True,\n",
+    "    bnb_4bit_quant_type=\"nf4\",\n",
+    ")\n",
+    "\n",
+    "print(f\"loading tokenizer for {MODEL_NAME} ...\")\n",
+    "tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)\n",
+    "if tokenizer.pad_token is None:\n",
+    "    tokenizer.pad_token = tokenizer.eos_token\n",
+    "\n",
+    "print(f\"loading model {MODEL_NAME} in 4-bit ...\")\n",
+    "model = AutoModelForCausalLM.from_pretrained(\n",
+    "    MODEL_NAME,\n",
+    "    quantization_config=bnb_config,\n",
+    "    torch_dtype=torch.bfloat16,\n",
+    "    trust_remote_code=True,\n",
+    "    device_map=\"auto\",\n",
+    ")\n",
+    "\n",
+    "peft_config = LoraConfig(\n",
+    "    r=16,\n",
+    "    lora_alpha=32,\n",
+    "    lora_dropout=0.05,\n",
+    "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\"],\n",
+    "    task_type=\"CAUSAL_LM\",\n",
+    "    bias=\"none\",\n",
+    ")\n",
+    "print(\"model + LoRA config ready\")\n",
+    "print(\"model device map:\", {k: str(v) for k, v in (model.hf_device_map or {}).items()} if hasattr(model, \"hf_device_map\") and model.hf_device_map else \"single-device\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "202de2fb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from trl import GRPOConfig, GRPOTrainer\n",
+    "\n",
+    "# Wrap each row as a chat-template prompt list. GRPOTrainer applies the chat\n",
+    "# template under the hood when \"prompt\" is a list of messages.\n",
+    "def to_chat(row):\n",
+    "    return {\n",
+    "        \"prompt\": [\n",
+    "            {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
+    "            {\"role\": \"user\", \"content\": row[\"prompt\"]},\n",
+    "        ],\n",
+    "        \"target_function_name\": row[\"target_function_name\"],\n",
+    "        \"row_seed\": row[\"row_seed\"],\n",
+    "    }\n",
+    "\n",
+    "drop_cols = [c for c in (\"prompt\", \"difficulty\") if c in dataset_raw.column_names]\n",
+    "dataset = dataset_raw.map(to_chat, remove_columns=drop_cols)\n",
+    "print(\"dataset columns after chat-format:\", dataset.column_names)\n",
+    "print(\"rows:\", len(dataset))\n",
+    "\n",
+    "# GRPO requires per_device_train_batch_size to be a multiple of num_generations\n",
+    "# (one prompt is repeated num_generations times in the same forward pass).\n",
+    "assert PER_DEVICE_BATCH_SIZE % NUM_GENERATIONS == 0, (\n",
+    "    f\"PER_DEVICE_BATCH_SIZE ({PER_DEVICE_BATCH_SIZE}) must be a multiple of \"\n",
+    "    f\"NUM_GENERATIONS ({NUM_GENERATIONS}).\"\n",
+    ")\n",
+    "\n",
+    "grpo_config = GRPOConfig(\n",
+    "    output_dir=OUTPUT_DIR,\n",
+    "    per_device_train_batch_size=PER_DEVICE_BATCH_SIZE,\n",
+    "    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,\n",
+    "    learning_rate=LEARNING_RATE,\n",
+    "    num_train_epochs=NUM_TRAIN_EPOCHS,\n",
+    "    max_prompt_length=MAX_PROMPT_LENGTH,\n",
+    "    max_completion_length=MAX_COMPLETION_LENGTH,\n",
+    "    num_generations=NUM_GENERATIONS,\n",
+    "    beta=0.04,\n",
+    "    bf16=torch.cuda.is_bf16_supported() if torch.cuda.is_available() else False,\n",
+    "    fp16=False,\n",
+    "    logging_steps=1,\n",
+    "    save_steps=50,\n",
+    "    save_total_limit=2,\n",
+    "    report_to=[],\n",
+    "    seed=SEED,\n",
+    "    push_to_hub=bool(PUSH_TO_HUB_REPO),\n",
+    "    hub_model_id=PUSH_TO_HUB_REPO,\n",
+    "    hub_strategy=\"end\",\n",
+    "    gradient_checkpointing=True,\n",
+    ")\n",
+    "\n",
+    "env_reward_fn = make_env_reward(client)\n",
+    "\n",
+    "trainer = GRPOTrainer(\n",
+    "    model=model,\n",
+    "    reward_funcs=[env_reward_fn, format_reward],\n",
+    "    args=grpo_config,\n",
+    "    train_dataset=dataset,\n",
+    "    peft_config=peft_config,\n",
+    "    processing_class=tokenizer,\n",
+    ")\n",
+    "print(\"GRPOTrainer ready. Steps per epoch (approx):\",\n",
+    "      max(1, len(dataset) // (PER_DEVICE_BATCH_SIZE // NUM_GENERATIONS) // GRADIENT_ACCUMULATION_STEPS))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "03875ee7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Kick off training. On a free-tier T4 with the defaults above this should\n",
+    "# take roughly 15-25 minutes for one epoch over the 15-task catalog.\n",
+    "# You'll see GRPO logging every step: reward/env_verifier_reward, reward/format_reward,\n",
+    "# rewards/std, kl, loss, etc.\n",
+    "trainer.train()\n",
+    "print(\"training complete.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7bd608a9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "trainer.save_model(OUTPUT_DIR)\n",
+    "print(f\"adapter saved to {OUTPUT_DIR}\")\n",
+    "\n",
+    "# Optional push. To enable, set PUSH_TO_HUB_REPO above to e.g.\n",
+    "#   \"your-username/opensleuth-grpo-colab\"\n",
+    "# and re-run this cell after a successful notebook_login() above.\n",
+    "if PUSH_TO_HUB_REPO:\n",
+    "    print(f\"pushing to hub: {PUSH_TO_HUB_REPO}\")\n",
+    "    trainer.push_to_hub()\n",
+    "    print(\"push complete.\")\n",
+    "else:\n",
+    "    print(\"PUSH_TO_HUB_REPO is None -- skipping hub push. \"\n",
+    "          \"Set PUSH_TO_HUB_REPO at the top of the notebook to push.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a5ab224e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Smoke-eval: run 3 episodes against the live env using the just-trained\n",
+    "# adapter. Each episode probes a fresh function, generates a candidate\n",
+    "# implementation, submits it, and prints the env's reward + the emitted code.\n",
+    "\n",
+    "EVAL_TASKS = [\"fibonacci\", \"is_palindrome\", \"digit_sum\"]\n",
+    "EVAL_PROBES = 6\n",
+    "EVAL_MAX_NEW_TOKENS = 384\n",
+    "\n",
+    "\n",
+    "def _gen(prompt_text: str) -> str:\n",
+    "    msgs = [\n",
+    "        {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
+    "        {\"role\": \"user\", \"content\": prompt_text},\n",
+    "    ]\n",
+    "    inputs = tokenizer.apply_chat_template(\n",
+    "        msgs, return_tensors=\"pt\", add_generation_prompt=True\n",
+    "    ).to(model.device)\n",
+    "    with torch.no_grad():\n",
+    "        out = model.generate(\n",
+    "            inputs,\n",
+    "            max_new_tokens=EVAL_MAX_NEW_TOKENS,\n",
+    "            do_sample=False,\n",
+    "            pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,\n",
+    "        )\n",
+    "    completion_ids = out[0, inputs.shape[1]:]\n",
+    "    return tokenizer.decode(completion_ids, skip_special_tokens=True)\n",
+    "\n",
+    "\n",
+    "for task_name in EVAL_TASKS:\n",
+    "    print(\"\\n\" + \"=\" * 70)\n",
+    "    print(f\"=== task: {task_name} ===\")\n",
+    "    sig, probes = _sample_probes(client, task_name, seed=SEED + hash(task_name) % 1000, n_probes=EVAL_PROBES)\n",
+    "    user_prompt = build_prompt(task_name, sig, probes)\n",
+    "    completion = _gen(user_prompt)\n",
+    "    code = extract_code(completion)\n",
+    "    try:\n",
+    "        reward = client.score_submission(task_name, code, seed=SEED)\n",
+    "    except Exception as e:\n",
+    "        reward = float(\"nan\")\n",
+    "        print(f\"score_submission failed: {e}\")\n",
+    "    print(f\"env reward: {reward:.3f}\")\n",
+    "    print(\"--- emitted code ---\")\n",
+    "    print(code)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "728aaee9",
+   "metadata": {},
+   "source": [
+    "## Next steps\n",
+    "\n",
+    "You just trained a tiny LoRA adapter on top of `Qwen2.5-0.5B-Instruct` against the live OpenSleuth env. Some things to try next:\n",
+    "\n",
+    "- **Push to the Hub.** Set `PUSH_TO_HUB_REPO = \"your-username/opensleuth-grpo-colab\"` in the constants cell, re-run the login + save/push cells. The adapter is tiny (LoRA on q/k/v/o), so it pushes in seconds.\n",
+    "- **Train longer.** Bump `N_PER_FUNCTION` to `16-24` and `NUM_TRAIN_EPOCHS` to `2-3`. On a T4 this still fits inside one Colab session.\n",
+    "- **Step up to 3B.** Set `MODEL_NAME = \"Qwen/Qwen2.5-3B-Instruct\"` and drop `PER_DEVICE_BATCH_SIZE` back to `2` (with `NUM_GENERATIONS=2`). You'll need Colab Pro / an A100, or just use the dedicated training Space (`anugrah55/opensleuth-training-gemini-cli`) which is configured to retrain the 3B model end-to-end.\n",
+    "- **Curriculum.** Pass `difficulty=\"easy\"` to `build_synthesis_dataset(...)` for an easier warm-up, then re-run with `difficulty=\"hard\"` once the format reward saturates.\n",
+    "- **Add tasks.** Push a row to the [`anugrah55/opensleuth-tasks`](https://huggingface.co/datasets/anugrah55/opensleuth-tasks) Hub dataset; the env hot-reloads on its next boot, no redeploy needed, and this notebook's `discover_functions(client)` will pick them up automatically.\n",
+    "- **Eval externally.** The repo's `eval/run_eval.py` runs the same fuzz-tested verification headlessly; point it at your pushed adapter and the live env Space to get an apples-to-apples score against the baseline.\n",
+    "\n",
+    "### Links again\n",
+    "\n",
+    "- Env Space: https://huggingface.co/spaces/anugrah55/opensleuth-env-gemini-cli\n",
+    "- Training Space (full 3B retrain): https://huggingface.co/spaces/anugrah55/opensleuth-training-gemini-cli\n",
+    "- Task dataset (open-ended): https://huggingface.co/datasets/anugrah55/opensleuth-tasks\n",
+    "- Trained adapter (after you push): `https://huggingface.co/<your-username>/opensleuth-grpo-colab`"
+   ]
+  }
+ ],
+ "metadata": {
+  "accelerator": "GPU",
+  "colab": {
+   "gpuType": "T4",
+   "name": "train_opensleuth_grpo.ipynb",
+   "provenance": [],
+   "toc_visible": true
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}