Overhaul trainer: TRL GRPO with env-backed reward, Qwen2.5-0.5B 4bit+LoRA, slim PyTorch CUDA base, heartbeat HTTP for HF Spaces health probe
Files changed:

- Dockerfile +29 -9
- README.md +56 -5
- __pycache__/train.cpython-313.pyc +0 -0 (binary)
- entrypoint.sh +42 -0
- opensleuth_train/__init__.py +14 -0
- opensleuth_train/client.py +63 -0
- opensleuth_train/dataset.py +112 -0
- opensleuth_train/prompt.py +60 -0
- opensleuth_train/reward.py +91 -0
- requirements.txt +20 -3
- train.py +196 -147
Dockerfile (CHANGED, +29 -9)

Replaces the old image, which copied the whole repo into the container and launched `hf_train_runner.py` directly, with:

```dockerfile
# Slim, well-tested CUDA + PyTorch base. Avoids HF transformers-pytorch-gpu's
# bloat and unsloth's CUDA-version sensitivity.
FROM pytorch/pytorch:2.5.1-cuda12.1-cudnn9-runtime

ENV PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1 \
    HF_HOME=/data/.cache/huggingface \
    TRANSFORMERS_CACHE=/data/.cache/huggingface/hub \
    TOKENIZERS_PARALLELISM=false

WORKDIR /app

# System deps for bitsandbytes / build
RUN apt-get update && apt-get install -y --no-install-recommends \
    git curl build-essential \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt /app/
RUN pip install --no-cache-dir -r requirements.txt

# Project code
COPY opensleuth_train /app/opensleuth_train
COPY train.py /app/
COPY entrypoint.sh /app/
RUN chmod +x /app/entrypoint.sh \
    && mkdir -p /data/opensleuth-grpo /data/.cache/huggingface

# HF Spaces health probe expects the container to expose a port; keep it open
# so the orchestrator considers us alive while training runs.
EXPOSE 7860

CMD ["/app/entrypoint.sh"]
```
README.md (CHANGED, +56 -5)

Fills in the previously empty Space front matter (title, emoji, colors) and adds the trainer README:

````markdown
---
title: OpenSleuth Trainer
emoji: 🛰️
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
suggested_hardware: t4-small
---

# OpenSleuth — Trainer

GPU Space that fine-tunes a small Qwen2.5 model with TRL **GRPO** to do
in-context program synthesis against the live OpenSleuth env.

## Pipeline

1. Wait for the env Space to report healthy.
2. Build a dataset of synthesis prompts: each row pairs one black-box
   function with N pre-sampled `(input, output)` probes drawn from the env.
3. Load `Qwen/Qwen2.5-0.5B-Instruct` in 4-bit + LoRA via `bitsandbytes` and
   `peft`.
4. Train with `trl.GRPOTrainer`, generating `num_generations=4` candidate
   completions per prompt and rewarding each against the env's verifier.
5. Persist the LoRA adapter to `/data/opensleuth-grpo` and (if `HF_TOKEN` is
   set as a Space secret) push to `anugrah55/opensleuth-qwen2.5-0.5b-grpo`.

## Reward

* `env_verifier_reward = env.score_submission(...) / 100` — the headline
  shaped reward, ranging roughly `[-0.5, +1.5]`.
* `format_reward` — small bonus for emitting a fenced ```python``` block
  whose `def` matches the target function name; helps the model converge on
  parseable output early.

## Hardware

`t4-small` is sufficient for 0.5B + LoRA + bnb-4bit. `a10g-small` will train
faster if available.

## Required Space secrets

* `HF_TOKEN` — write token if you want the LoRA adapter pushed to the Hub at
  the end of training.

## Tuning knobs

All knobs are exposed as env vars (defaults shown):

| Env var | Default | Meaning |
|---------|---------|---------|
| `ENV_URL` | env Space URL | OpenSleuth env to target |
| `MODEL_NAME` | `Qwen/Qwen2.5-0.5B-Instruct` | Base policy |
| `N_PER_FUNCTION` | `16` | Prompts per black-box function |
| `N_PROBES` | `6` | Probes per prompt |
| `NUM_GENERATIONS` | `4` | GRPO group size |
| `LEARNING_RATE` | `1e-5` | |
| `NUM_TRAIN_EPOCHS` | `1` | |
| `PER_DEVICE_BATCH_SIZE` | `1` | |
| `GRAD_ACCUM` | `8` | |
````
__pycache__/train.cpython-313.pyc (ADDED)

Binary file (11.7 kB).
entrypoint.sh (ADDED, +42 -0)

```bash
#!/usr/bin/env bash
# OpenSleuth training Space entrypoint.
#
# Starts a tiny background HTTP server on $PORT (default 7860) so the HF
# Spaces health probe is satisfied, then runs the actual training script in
# the foreground. All training logs go to stdout and are visible in the
# Space's "Container logs" tab.
set -euo pipefail

PORT="${PORT:-7860}"

log() { echo "[entrypoint $(date -u +%H:%M:%S)] $*"; }

# 1. Background heartbeat HTTP server. Just returns 200 OK on every request.
log "starting heartbeat server on :${PORT}"
python -c "
import http.server, socketserver, os, threading, time
class H(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type','text/plain')
        self.end_headers()
        self.wfile.write(b'opensleuth-trainer alive\n')
    def log_message(self, *a, **kw): pass
port = int(os.environ.get('PORT','7860'))
srv = socketserver.TCPServer(('0.0.0.0', port), H)
threading.Thread(target=srv.serve_forever, daemon=True).start()
print(f'[heartbeat] listening on :{port}', flush=True)
while True: time.sleep(3600)
" &
HB_PID=$!

# Give the heartbeat a moment to bind before the orchestrator probes it.
sleep 2

# 2. Run training in the foreground. Crash here = container exits, which is
# what we want: HF will mark the Space failed and surface the error.
log "starting training (PID $$)"
log "GPU info:"
python -c "import torch; print('cuda available:', torch.cuda.is_available()); print('device:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'cpu')"

exec python /app/train.py
```
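
A quick sanity check that the heartbeat answers the health probe; a minimal sketch, assuming the container is running locally with the default port:

```python
# Any GET against the heartbeat should return 200 with a short body.
import requests

r = requests.get("http://127.0.0.1:7860/", timeout=5)
print(r.status_code, r.text)  # expected: 200 opensleuth-trainer alive
```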
opensleuth_train/__init__.py (ADDED, +14 -0)

```python
"""OpenSleuth training-side helpers (env client, dataset, reward fn)."""

from .client import EnvClient
from .dataset import build_synthesis_dataset, FUNCTIONS_FOR_TRAINING
from .prompt import SYSTEM_PROMPT, build_prompt, extract_code

__all__ = [
    "EnvClient",
    "build_synthesis_dataset",
    "FUNCTIONS_FOR_TRAINING",
    "SYSTEM_PROMPT",
    "build_prompt",
    "extract_code",
]
```
opensleuth_train/client.py (ADDED, +63 -0)

```python
"""Thin HTTP client for the OpenSleuth env Space."""

from __future__ import annotations

import logging
import os
import time
from typing import Any, Dict

import requests

log = logging.getLogger("opensleuth.client")


class EnvClient:
    def __init__(self, base_url: str | None = None, timeout: float = 30.0, retries: int = 3):
        self.base_url = (base_url or os.environ.get("ENV_URL", "http://127.0.0.1:7860")).rstrip("/")
        self.timeout = timeout
        self.retries = retries

    def _post(self, path: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        last_exc: Exception | None = None
        for attempt in range(self.retries):
            try:
                r = requests.post(f"{self.base_url}{path}", json=payload, timeout=self.timeout)
                r.raise_for_status()
                return r.json()
            except (requests.RequestException, ValueError) as e:  # noqa: PERF203
                last_exc = e
                wait = 0.5 * (2 ** attempt)
                log.warning("env POST %s failed (%s); retrying in %.1fs", path, e, wait)
                time.sleep(wait)
        raise RuntimeError(f"env POST {path} failed after {self.retries} retries: {last_exc}")

    def health(self) -> Dict[str, Any]:
        r = requests.get(f"{self.base_url}/health", timeout=self.timeout)
        r.raise_for_status()
        return r.json()

    def list_functions(self) -> list[Dict[str, str]]:
        r = requests.get(f"{self.base_url}/functions", timeout=self.timeout)
        r.raise_for_status()
        return r.json()["functions"]

    def reset(self, target_name: str, seed: int = 0, max_steps: int = 25) -> Dict[str, Any]:
        return self._post("/reset", {"target_name": target_name, "seed": seed, "max_steps": max_steps})

    def step(self, episode_id: str, action: Dict[str, Any]) -> Dict[str, Any]:
        return self._post("/step", {"episode_id": episode_id, "action": action})

    # --- High-level helpers used by the reward function --------------------

    def submit(self, episode_id: str, code: str) -> Dict[str, Any]:
        return self.step(episode_id, {"action_type": "submit", "code": code})

    def probe(self, episode_id: str, input_repr: str) -> Dict[str, Any]:
        return self.step(episode_id, {"action_type": "probe", "input_repr": input_repr})

    def score_submission(self, target_name: str, code: str, seed: int = 0) -> float:
        """One-shot: open an episode, submit the code, return total reward."""
        ep = self.reset(target_name=target_name, seed=seed, max_steps=2)
        resp = self.submit(ep["episode_id"], code)
        return float(resp["reward"])
```
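
A minimal usage sketch of the client against a running env (the URL is a placeholder; `fibonacci` is one of the targets `dataset.py` trains on):

```python
from opensleuth_train import EnvClient

client = EnvClient("https://example-env.hf.space")  # placeholder URL
ep = client.reset(target_name="fibonacci", seed=0, max_steps=5)
print(client.probe(ep["episode_id"], "10"))  # one (input, output) observation

code = (
    "def fibonacci(n):\n"
    "    a, b = 0, 1\n"
    "    for _ in range(n):\n"
    "        a, b = b, a + b\n"
    "    return a\n"
)
print("env reward:", client.score_submission("fibonacci", code))
```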
opensleuth_train/dataset.py (ADDED, +112 -0)

```python
"""Build the training dataset of (function_name, signature, probes) → prompt.

We pre-sample probes server-side with deterministic seeds so the LLM trains
on a consistent set of in-context examples per task. The actual *reward* is
computed by re-submitting the model's code against the env with a fresh fuzz
seed, so the model can't memorise probe outputs.
"""

from __future__ import annotations

import logging
import random
from typing import List

from datasets import Dataset

from .client import EnvClient
from .prompt import build_prompt

log = logging.getLogger("opensleuth.dataset")

FUNCTIONS_FOR_TRAINING: List[str] = [
    "fibonacci",
    "reverse_string",
    "is_palindrome",
    "digit_sum",
    "count_vowels",
    "gcd",
    "sort_unique",
    "caesar_cipher",
    "is_prime",
]


def _sample_probes(client: EnvClient, target_name: str, seed: int, n_probes: int) -> tuple[str, list[tuple[str, str, bool]]]:
    """Open an episode and feed it `n_probes` random valid inputs. Inputs are
    synthesised locally (loosely mirroring the env's fuzz generators) rather
    than fetched from the env, to avoid coupling to a specific spec API."""
    rng = random.Random(seed)
    ep = client.reset(target_name=target_name, seed=seed, max_steps=n_probes + 5)
    sig = ep["target_function_signature"]
    eid = ep["episode_id"]

    inputs = _make_probe_inputs(target_name, rng, n_probes)
    history: list[tuple[str, str, bool]] = []
    for inp_repr in inputs:
        resp = client.probe(eid, inp_repr)
        last = resp["observation"]["probe_history"][-1]
        history.append((last["input_repr"], last["output_repr"], bool(last["is_error"])))
    return sig, history


def _make_probe_inputs(target_name: str, rng: random.Random, n: int) -> list[str]:
    """Generate `n` Python-literal repr strings appropriate for this function.

    Kept in lock-step (loosely) with the env's fuzz generators so probes
    almost always land on the function's valid domain, with a few intentional
    out-of-domain inputs to expose error-handling.
    """
    # String/tuple/list pools hold already-repr'd literals and return
    # directly; int pools fall through to the final repr() call.
    if target_name == "fibonacci":
        pool = [1, 2, 5, 10, 20, 40, 89, -1, 0, 100]
    elif target_name == "reverse_string":
        pool = ['""', "'a'", "'hello'", "'racecar'", "'abc123'", "''", "'ab'"]
        return [rng.choice(pool) for _ in range(n)]
    elif target_name == "is_palindrome":
        pool = ["'racecar'", "'hello'", "'A man a plan a canal Panama'", "''", "'ab'", "'aba'"]
        return [rng.choice(pool) for _ in range(n)]
    elif target_name == "digit_sum":
        pool = [0, 1, 9, 10, 99, 100, 12345, -3]
    elif target_name == "count_vowels":
        pool = ["'hello'", "''", "'rhythm'", "'AEIOU'", "'xyz'", "'queueing'"]
        return [rng.choice(pool) for _ in range(n)]
    elif target_name == "gcd":
        pool = ["(12, 8)", "(7, 13)", "(0, 5)", "[15, 25]", "(100, 75)", "[6, 9]"]
        return [rng.choice(pool) for _ in range(n)]
    elif target_name == "sort_unique":
        pool = ["[3, 1, 2, 1]", "[]", "[5, 5, 5]", "[-1, 0, -1, 2]", "[10]"]
        return [rng.choice(pool) for _ in range(n)]
    elif target_name == "caesar_cipher":
        pool = ["'hello'", "'abc'", "'xyz'", "''", "'Hello!'", "'a b c'"]
        return [rng.choice(pool) for _ in range(n)]
    elif target_name == "is_prime":
        pool = [2, 3, 4, 7, 9, 11, 25, 29, 0, 1, -3]
    else:
        return ["1"] * n
    return [repr(rng.choice(pool)) for _ in range(n)]


def build_synthesis_dataset(
    client: EnvClient,
    *,
    n_per_function: int = 24,
    n_probes: int = 6,
    seed: int = 0,
) -> Dataset:
    """Build a HuggingFace Dataset of {prompt, target_function_name} rows."""
    rows = []
    rng = random.Random(seed)
    for fn_name in FUNCTIONS_FOR_TRAINING:
        for _ in range(n_per_function):
            row_seed = rng.randrange(0, 2**31)
            sig, probes = _sample_probes(client, fn_name, row_seed, n_probes)
            prompt = build_prompt(fn_name, sig, probes)
            rows.append(
                {
                    "prompt": prompt,
                    "target_function_name": fn_name,
                    "row_seed": row_seed,
                }
            )
    rng.shuffle(rows)
    return Dataset.from_list(rows)
```
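
For reference, a small build against a local env (placeholder URL); with 9 functions and `n_per_function=2` this yields 18 rows:

```python
from opensleuth_train import EnvClient, build_synthesis_dataset

ds = build_synthesis_dataset(
    EnvClient("http://127.0.0.1:7860"),  # placeholder URL
    n_per_function=2, n_probes=3, seed=0,
)
print(len(ds))                          # 18 = 9 functions x 2
row = ds[0]
print(row["target_function_name"], row["row_seed"])
print(row["prompt"][:200])              # start of the rendered prompt
```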
opensleuth_train/prompt.py (ADDED, +60 -0)

````python
"""Prompt construction + code extraction for the OpenSleuth agent."""

from __future__ import annotations

import re
from typing import Iterable

SYSTEM_PROMPT = (
    "You are an algorithmic detective. You are given the public signature of a hidden "
    "Python function plus several (input, output) examples observed by probing it. "
    "Your job is to write a Python function that *exactly* reproduces the hidden "
    "function's behavior on all valid inputs. Match its return values AND its "
    "exception types on invalid inputs. Keep your implementation as simple and clean "
    "as possible (it is penalised for being needlessly branchy). Return ONLY the "
    "function definition wrapped in a single ```python ... ``` code block."
)


def build_prompt(target_name: str, signature: str, probes: Iterable[tuple[str, str, bool]]) -> str:
    """Build the user-side prompt.

    `probes` is an iterable of `(input_repr, output_repr, is_error)` tuples,
    typically pre-sampled by the dataset builder.
    """
    lines = [
        f"## Hidden function: {target_name}",
        "",
        "### Public signature & docstring",
        signature.strip() or "(no signature provided)",
        "",
        "### Observed probes",
    ]
    probe_list = list(probes)
    if not probe_list:
        lines.append("(none)")
    else:
        for inp, out, is_err in probe_list:
            tag = "raises" if is_err else "returns"
            lines.append(f"- input={inp} -> {tag} {out}")
    lines += [
        "",
        "### Task",
        f"Write a Python function named `{target_name}` that reproduces the hidden "
        "function's behaviour. Return ONLY the function definition in a single "
        "```python ... ``` code block. Do not add explanations.",
    ]
    return "\n".join(lines)


_CODE_RE = re.compile(r"```(?:python)?\s*(.*?)```", re.DOTALL | re.IGNORECASE)


def extract_code(completion: str) -> str:
    """Pull the Python source from a model completion. If no fenced block is
    present, fall back to the whole completion (the verifier will then judge
    it on its own)."""
    m = _CODE_RE.search(completion)
    if m:
        return m.group(1).strip()
    return completion.strip()
````
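
What `build_prompt` renders, with illustrative probe tuples:

```python
from opensleuth_train import build_prompt

print(build_prompt(
    "gcd",
    "def gcd(a: int, b: int) -> int: ...",
    [("(12, 8)", "4", False), ("('a', 1)", "TypeError('...')", True)],
))
# -> the "## Hidden function: gcd" header, the signature, one bullet per
#    probe ("returns" vs "raises"), then the "### Task" instructions.
```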
opensleuth_train/reward.py (ADDED, +91 -0)

````python
"""Reward functions for GRPO: env-backed verification + cheap shaping signals.

GRPO takes a list of `reward_funcs`. Each must accept `completions` and any
columns from the dataset as kwargs, and return one scalar per completion.
"""

from __future__ import annotations

import logging
import re
from typing import List

from .client import EnvClient
from .prompt import extract_code

log = logging.getLogger("opensleuth.reward")


def make_env_reward(client: EnvClient, *, scale: float = 1.0 / 100.0) -> callable:
    """Verifier-backed reward. Calls the env's `submit` and returns the env's
    reward multiplied by `scale` (default 1/100, i.e. divide by 100, so a
    perfect submission is ~+1.5 and a bad one is around -0.5); this keeps
    GRPO advantages well behaved without needing reward normalisation.
    """

    def env_reward(completions, target_function_name=None, row_seed=None, **kwargs):  # noqa: ANN001
        rewards: List[float] = []
        # GRPO calls the reward fn once per batch; both target_function_name
        # and row_seed come in as lists of length len(completions).
        for i, completion in enumerate(completions):
            text = _extract_text(completion)
            code = extract_code(text)
            tname = _index(target_function_name, i, default="fibonacci")
            seed = _index(row_seed, i, default=0)
            try:
                env_reward_value = client.score_submission(tname, code, seed=seed)
            except Exception as e:  # noqa: BLE001
                log.warning("env scoring failed for %s: %s", tname, e)
                env_reward_value = -50.0
            rewards.append(env_reward_value * scale)
        return rewards

    return env_reward


_FUNC_RE = re.compile(r"^def\s+(\w+)\s*\(", re.MULTILINE)


def format_reward(completions, target_function_name=None, **kwargs):  # noqa: ANN001
    """Cheap shaping reward: up to +0.2 if the completion contains a fenced
    python block AND defines a function with the right name. Encourages the
    model to converge on the expected output format quickly so the env reward
    becomes informative early in training."""
    rewards: List[float] = []
    for i, completion in enumerate(completions):
        text = _extract_text(completion)
        score = 0.0
        if "```python" in text or "```\n" in text:
            score += 0.1
        code = extract_code(text)
        m = _FUNC_RE.search(code)
        tname = _index(target_function_name, i, default=None)
        if m and (tname is None or m.group(1) == tname):
            score += 0.1
        rewards.append(score)
    return rewards


def _extract_text(completion):  # noqa: ANN001
    """GRPO can pass either a string or an OpenAI-style chat list of dicts.
    Normalise to a single string."""
    if isinstance(completion, str):
        return completion
    if isinstance(completion, list):
        # [{role: ..., content: ...}, ...]
        parts = []
        for msg in completion:
            if isinstance(msg, dict) and "content" in msg:
                parts.append(str(msg["content"]))
            else:
                parts.append(str(msg))
        return "\n".join(parts)
    return str(completion)


def _index(value, i: int, default):
    if value is None:
        return default
    if isinstance(value, list):
        return value[i] if i < len(value) else default
    return value
````
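
A quick sanity check of the shaping reward on hand-written completions; the fenced block in the string mirrors what the model is told to emit:

````python
from opensleuth_train.reward import format_reward

good = "```python\ndef gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a\n```"
print(format_reward([good], target_function_name=["gcd"]))            # [0.2]
print(format_reward(["no code here"], target_function_name=["gcd"]))  # [0.0]
````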
requirements.txt (CHANGED, +20 -3)

```text
# Core ML stack. torch is provided by the base image.
transformers==4.46.3
trl==0.13.0
peft==0.13.2
accelerate==1.1.1
bitsandbytes==0.44.1
datasets==3.1.0

# Tokenizer + utility deps
sentencepiece==0.2.0
tiktoken==0.8.0
einops==0.8.0
safetensors==0.4.5

# HTTP + Hub
requests==2.32.3
huggingface_hub==0.26.2

# Misc
numpy==1.26.4
```
train.py (CHANGED, +196 -147)

Replaces the previous hand-rolled PPO loop (per-step `queries`/`responses`/`rewards` collection fed to `gppo_trainer.step(...)`) wholesale with:

```python
"""OpenSleuth GRPO trainer.

Trains a small Qwen2.5 model with TRL's GRPOTrainer to do in-context program
synthesis — given the public signature of a hidden function plus a handful of
(input, output) probe examples, emit a Python function that reproduces it.

Reward comes from the live OpenSleuth env Space: the agent's code is executed
against the hidden reference under domain-aware fuzzing, and the verifier
returns an `execution_reward - complexity_penalty` score that we hand back to
GRPO as the per-completion reward (plus a tiny formatting shaping reward).
"""

from __future__ import annotations

import argparse
import logging
import os
import sys
import time

import torch
from peft import LoraConfig
from transformers import AutoTokenizer, BitsAndBytesConfig
from trl import GRPOConfig, GRPOTrainer

from opensleuth_train import (
    EnvClient,
    SYSTEM_PROMPT,
    build_synthesis_dataset,
)
from opensleuth_train.reward import format_reward, make_env_reward


logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    stream=sys.stdout,
)
log = logging.getLogger("opensleuth.train")


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser()
    p.add_argument("--env-url", default=os.environ.get("ENV_URL", "https://anugrah55-opensleuth-env-gemini-cli.hf.space"))
    p.add_argument("--model-name", default=os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-0.5B-Instruct"))
    p.add_argument("--output-dir", default=os.environ.get("OUTPUT_DIR", "/data/opensleuth-grpo"))
    p.add_argument("--push-to-hub", default=os.environ.get("PUSH_TO_HUB", "anugrah55/opensleuth-qwen2.5-0.5b-grpo"))
    p.add_argument("--n-per-function", type=int, default=int(os.environ.get("N_PER_FUNCTION", "16")))
    p.add_argument("--n-probes", type=int, default=int(os.environ.get("N_PROBES", "6")))
    p.add_argument("--num-generations", type=int, default=int(os.environ.get("NUM_GENERATIONS", "4")))
    p.add_argument("--max-completion-length", type=int, default=int(os.environ.get("MAX_COMPLETION_LENGTH", "320")))
    p.add_argument("--max-prompt-length", type=int, default=int(os.environ.get("MAX_PROMPT_LENGTH", "768")))
    p.add_argument("--learning-rate", type=float, default=float(os.environ.get("LEARNING_RATE", "1e-5")))
    p.add_argument("--num-train-epochs", type=float, default=float(os.environ.get("NUM_TRAIN_EPOCHS", "1")))
    p.add_argument("--per-device-batch-size", type=int, default=int(os.environ.get("PER_DEVICE_BATCH_SIZE", "1")))
    p.add_argument("--gradient-accumulation-steps", type=int, default=int(os.environ.get("GRAD_ACCUM", "8")))
    p.add_argument("--no-4bit", action="store_true", default=os.environ.get("NO_4BIT", "0") == "1")
    p.add_argument("--seed", type=int, default=int(os.environ.get("SEED", "42")))
    return p.parse_args()


def wait_for_env(client: EnvClient, max_wait_s: float = 300.0) -> None:
    log.info("waiting for env at %s ...", client.base_url)
    start = time.time()
    last_err = ""
    while time.time() - start < max_wait_s:
        try:
            h = client.health()
            log.info("env healthy: %s", h)
            return
        except Exception as e:  # noqa: BLE001
            last_err = str(e)
        time.sleep(5)
    raise RuntimeError(f"env never became healthy after {max_wait_s}s. Last error: {last_err}")


def main() -> int:
    args = parse_args()
    log.info("args: %s", vars(args))

    client = EnvClient(base_url=args.env_url, timeout=60.0, retries=4)
    wait_for_env(client)
    fns = client.list_functions()
    log.info("env exposes %d functions: %s", len(fns), [f["name"] for f in fns])

    log.info("building synthesis dataset (n_per_function=%d, n_probes=%d)", args.n_per_function, args.n_probes)
    dataset = build_synthesis_dataset(
        client, n_per_function=args.n_per_function, n_probes=args.n_probes, seed=args.seed
    )
    log.info("dataset size: %d rows", len(dataset))

    # GRPO with chat-templated prompts: each row needs a "prompt" field, which
    # we re-format as a chat message list so the trainer applies the chat
    # template under the hood.
    def to_chat(row):
        return {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": row["prompt"]},
            ],
            "target_function_name": row["target_function_name"],
            "row_seed": row["row_seed"],
        }

    dataset = dataset.map(to_chat, remove_columns=["prompt"])

    # ---- Model + LoRA ----
    log.info("loading model %s (4bit=%s)", args.model_name, not args.no_4bit)
    bnb_config = None
    if not args.no_4bit:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )

    tokenizer = AutoTokenizer.from_pretrained(args.model_name, trust_remote_code=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
        bias="none",
    )

    grpo_config = GRPOConfig(
        output_dir=args.output_dir,
        per_device_train_batch_size=args.per_device_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        learning_rate=args.learning_rate,
        num_train_epochs=args.num_train_epochs,
        max_prompt_length=args.max_prompt_length,
        max_completion_length=args.max_completion_length,
        num_generations=args.num_generations,
        beta=0.04,
        bf16=torch.cuda.is_bf16_supported() if torch.cuda.is_available() else False,
        fp16=False,
        logging_steps=1,
        save_steps=50,
        save_total_limit=2,
        report_to=[],
        seed=args.seed,
        push_to_hub=bool(args.push_to_hub) and bool(os.environ.get("HF_TOKEN")),
        hub_model_id=args.push_to_hub or None,
        hub_strategy="end",
        gradient_checkpointing=True,
    )

    env_reward_fn = make_env_reward(client)
    env_reward_fn.__name__ = "env_verifier_reward"
    format_reward.__name__ = "format_reward"

    log.info("instantiating GRPOTrainer")
    # Newer TRL passes the model name and instantiates internally; this works
    # across recent TRL versions because GRPOTrainer accepts a model id string.
    trainer_kwargs = dict(
        model=args.model_name,
        reward_funcs=[env_reward_fn, format_reward],
        args=grpo_config,
        train_dataset=dataset,
        peft_config=peft_config,
    )
    if bnb_config is not None:
        # Some TRL versions accept model_init_kwargs to pass through to from_pretrained.
        trainer_kwargs.setdefault("model_init_kwargs", {})
        trainer_kwargs["model_init_kwargs"].update(
            {"quantization_config": bnb_config, "torch_dtype": torch.bfloat16}
        )

    try:
        trainer = GRPOTrainer(**trainer_kwargs)
    except TypeError as e:
        # Older TRL (<0.16) doesn't accept model_init_kwargs at GRPOTrainer level;
        # fall back to loading the model first.
        log.warning("GRPOTrainer rejected kwargs (%s); falling back to manual model load", e)
        from transformers import AutoModelForCausalLM
        model_kwargs = {"trust_remote_code": True, "torch_dtype": torch.bfloat16}
        if bnb_config is not None:
            model_kwargs["quantization_config"] = bnb_config
        model = AutoModelForCausalLM.from_pretrained(args.model_name, **model_kwargs)
        trainer = GRPOTrainer(
            model=model,
            reward_funcs=[env_reward_fn, format_reward],
            args=grpo_config,
            train_dataset=dataset,
            peft_config=peft_config,
            processing_class=tokenizer,
        )

    log.info("starting GRPO training")
    trainer.train()
    log.info("training complete; saving to %s", args.output_dir)
    trainer.save_model(args.output_dir)
    if grpo_config.push_to_hub:
        log.info("pushing to hub: %s", args.push_to_hub)
        trainer.push_to_hub()
    return 0


if __name__ == "__main__":
    sys.exit(main())
```
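
After a run, the saved adapter can be loaded for a quick inference smoke test; a minimal sketch, assuming the default `OUTPUT_DIR` and the pinned `peft`/`transformers` versions:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the frozen base model, then attach the trained LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "/data/opensleuth-grpo")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [{"role": "user", "content": "Write a Python function fibonacci(n)."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = model.generate(inputs, max_new_tokens=160)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```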