---
title: OpenSleuth Trainer
emoji: 🛰️
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
suggested_hardware: t4-small
---

# OpenSleuth — Trainer
|
|
A GPU Space that fine-tunes a small Qwen2.5 model with TRL **GRPO** to do
in-context program synthesis against the live OpenSleuth env.
|
|
| ## Pipeline |
|
|
| 1. Wait for the env Space to report healthy. |
| 2. Build a dataset of synthesis prompts: each row pairs one black-box |
| function with N pre-sampled `(input, output)` probes drawn from the env. |
| 3. Load `Qwen/Qwen2.5-0.5B-Instruct` in 4-bit + LoRA via `bitsandbytes` and |
| `peft`. |
| 4. Train with `trl.GRPOTrainer`, generating `num_generations=4` candidate |
| completions per prompt and rewarding each against the env's verifier. |
| 5. Persist the LoRA adapter to `/data/opensleuth-grpo` and (if `HF_TOKEN` is |
| set as a Space secret) push to `anugrah55/opensleuth-qwen2.5-0.5b-grpo`. |
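
Step 2 can be sketched roughly like this; `build_prompt`'s wording and the
`probes_by_fn` shape are illustrative assumptions, not the Space's actual code:

```python
def build_prompt(fn_name, probes):
    # Render one synthesis prompt: the black-box function's name plus its
    # pre-sampled (input, output) probes. Prompt wording is illustrative.
    lines = [f"Implement `{fn_name}` so it matches these observations:"]
    for inp, out in probes:
        lines.append(f"  {fn_name}({inp!r}) == {out!r}")
    lines.append("Reply with a single fenced Python block defining the function.")
    return "\n".join(lines)


def build_dataset(functions, n_per_function, probes_by_fn):
    # One prompt row per (function, repeat) pair; per-row probe
    # re-sampling from the env is omitted here for brevity.
    return [
        {"prompt": build_prompt(fn, probes_by_fn[fn]), "fn_name": fn}
        for fn in functions
        for _ in range(n_per_function)
    ]
```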
|
|
| ## Reward |
|
|
| * `env_verifier_reward = env.score_submission(...) / 100` — the headline |
| shaped reward, ranging roughly `[-0.5, +1.5]`. |
* `format_reward` — small bonus for emitting a fenced `python` code block
  whose `def` matches the target function name; helps the model converge on
  parseable output early.
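
A minimal sketch of the two reward functions, assuming the verifier returns a
shaped score on a 0–100 scale; the regex, bonus value, and signatures are
illustrative assumptions, not the trainer's actual code:

```python
import re


def env_verifier_reward(score: float) -> float:
    # Normalize the env's shaped score into roughly [-0.5, +1.5].
    return score / 100.0


def format_reward(completion: str, fn_name: str, bonus: float = 0.1) -> float:
    # Bonus only when the completion contains a fenced python block
    # whose `def` matches the target function name.
    match = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
    if match is None:
        return 0.0
    body = match.group(1)
    if re.search(rf"^\s*def\s+{re.escape(fn_name)}\s*\(", body, re.MULTILINE):
        return bonus
    return 0.0
```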
|
|
| ## Hardware |
|
|
| `t4-small` is sufficient for 0.5B + LoRA + bnb-4bit. `a10g-small` will train |
| faster if available. |
|
|
| ## Required Space secrets |
|
|
| * `HF_TOKEN` — write token if you want the LoRA adapter pushed to the Hub at |
| the end of training. |
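
The conditional push might look like the sketch below; the `maybe_push` helper
is hypothetical (the repo id is the one named in the pipeline above):

```python
import os


def maybe_push(trainer, repo_id="anugrah55/opensleuth-qwen2.5-0.5b-grpo"):
    # Push the LoRA adapter only when HF_TOKEN is configured as a Space
    # secret; otherwise the adapter just stays on the /data volume.
    token = os.environ.get("HF_TOKEN")
    if token:
        trainer.model.push_to_hub(repo_id, token=token)
        return True
    return False
```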
|
|
| ## Tuning knobs |
|
|
| All knobs are exposed as env vars (defaults shown): |
|
|
| Env var | Default | Meaning |
|---------|---------|---------|
| `ENV_URL` | env Space URL | OpenSleuth env to target |
| `MODEL_NAME` | `Qwen/Qwen2.5-0.5B-Instruct` | Base policy |
| `N_PER_FUNCTION` | `16` | Prompts per black-box function |
| `N_PROBES` | `6` | Probes per prompt |
| `NUM_GENERATIONS` | `4` | GRPO group size |
| `LEARNING_RATE` | `1e-5` | Optimizer learning rate |
| `NUM_TRAIN_EPOCHS` | `1` | Passes over the dataset |
| `PER_DEVICE_BATCH_SIZE` | `1` | Prompts per device per step |
| `GRAD_ACCUM` | `8` | Gradient accumulation steps |
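
The trainer presumably reads these with `os.environ` fallbacks, roughly like
the sketch below (`ENV_URL` is omitted since its default is the env Space's
own URL; exact parsing code may differ):

```python
import os

# Each knob falls back to the documented default when the env var is unset.
MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-0.5B-Instruct")
N_PER_FUNCTION = int(os.environ.get("N_PER_FUNCTION", "16"))
N_PROBES = int(os.environ.get("N_PROBES", "6"))
NUM_GENERATIONS = int(os.environ.get("NUM_GENERATIONS", "4"))
LEARNING_RATE = float(os.environ.get("LEARNING_RATE", "1e-5"))
NUM_TRAIN_EPOCHS = int(os.environ.get("NUM_TRAIN_EPOCHS", "1"))
PER_DEVICE_BATCH_SIZE = int(os.environ.get("PER_DEVICE_BATCH_SIZE", "1"))
GRAD_ACCUM = int(os.environ.get("GRAD_ACCUM", "8"))
```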
|
|