---
title: OpenSleuth Trainer
emoji: 🛰️
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
suggested_hardware: t4-small
---
# OpenSleuth — Trainer
A GPU Space that fine-tunes a small Qwen2.5 model with TRL **GRPO** to do
in-context program synthesis against the live OpenSleuth env.
## Pipeline
1. Wait for the env Space to report healthy.
2. Build a dataset of synthesis prompts: each row pairs one black-box
function with N pre-sampled `(input, output)` probes drawn from the env.
3. Load `Qwen/Qwen2.5-0.5B-Instruct` in 4-bit + LoRA via `bitsandbytes` and
`peft`.
4. Train with `trl.GRPOTrainer`, generating `num_generations=4` candidate
completions per prompt and rewarding each against the env's verifier.
5. Persist the LoRA adapter to `/data/opensleuth-grpo` and (if `HF_TOKEN` is
set as a Space secret) push to `anugrah55/opensleuth-qwen2.5-0.5b-grpo`.
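Step 2 of the pipeline can be sketched roughly as follows. The probe format, prompt wording, and helper names here are illustrative assumptions, not the Space's exact code:

```python
import random

def build_prompt(fn_name: str, probes: list) -> str:
    """Render one synthesis prompt pairing a black-box function with a
    handful of observed (input, output) probes."""
    lines = [
        f"Reimplement the hidden function `{fn_name}` as a Python "
        f"function of the same name, consistent with these observations:"
    ]
    for args, out in probes:
        lines.append(f"  {fn_name}{args!r} -> {out!r}")
    lines.append("Answer with one fenced Python code block.")
    return "\n".join(lines)

def build_rows(fn_name: str, probes: list, n_per_function: int = 16,
               n_probes: int = 6) -> list:
    """One dataset row per prompt; GRPO later samples several candidate
    completions per row (num_generations)."""
    rows = []
    for _ in range(n_per_function):
        sample = random.sample(probes, min(n_probes, len(probes)))
        rows.append({"prompt": build_prompt(fn_name, sample),
                     "fn_name": fn_name})
    return rows
```

Each row keeps the target function name alongside the prompt so the reward functions can check the generated `def` against it.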
## Reward
* `env_verifier_reward = env.score_submission(...) / 100` — the headline
shaped reward, ranging roughly `[-0.5, +1.5]`.
* `format_reward` — small bonus for emitting a fenced `python` code block
  whose `def` matches the target function name; helps the model converge on
  parseable output early.
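A minimal sketch of the two reward terms. Only the score-divided-by-100 scaling is stated above; the regex, the `0.2` bonus value, and the function signatures are assumptions:

```python
import re

def env_verifier_reward(score: float) -> float:
    """Scale the env's shaped verifier score into the headline reward.
    Shaping can push the raw score outside [0, 100], hence the roughly
    [-0.5, +1.5] range."""
    return score / 100.0

def format_reward(completion: str, fn_name: str) -> float:
    """Bonus (0.2 is an assumed value, not the Space's constant) for a
    fenced python block whose def matches the target function name."""
    m = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
    if m and re.search(rf"^def\s+{re.escape(fn_name)}\s*\(",
                       m.group(1), re.MULTILINE):
        return 0.2
    return 0.0
```

Keeping the format bonus strictly smaller than the verifier reward means it shapes early training without dominating once the model produces working code.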
## Hardware
`t4-small` is sufficient for 0.5B + LoRA + bnb-4bit. `a10g-small` will train
faster if available.
## Required Space secrets
* `HF_TOKEN` — write token if you want the LoRA adapter pushed to the Hub at
the end of training.
## Tuning knobs
All knobs are exposed as env vars (defaults shown):
| Env var | Default | Meaning |
|---------|---------|---------|
| `ENV_URL` | env Space URL | OpenSleuth env to target |
| `MODEL_NAME` | `Qwen/Qwen2.5-0.5B-Instruct` | Base policy |
| `N_PER_FUNCTION` | `16` | Prompts per black-box function |
| `N_PROBES` | `6` | Probes per prompt |
| `NUM_GENERATIONS` | `4` | GRPO group size |
| `LEARNING_RATE` | `1e-5` | Optimizer learning rate |
| `NUM_TRAIN_EPOCHS` | `1` | Passes over the prompt dataset |
| `PER_DEVICE_BATCH_SIZE` | `1` | Per-device train batch size |
| `GRAD_ACCUM` | `8` | Gradient accumulation steps |
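The knobs above can be read with a small helper along these lines (a sketch; the Space's actual parsing may differ):

```python
import os

def knob(name: str, default, cast=str):
    """Read one tuning knob from the environment, falling back to the
    documented default when the variable is unset."""
    raw = os.environ.get(name)
    return default if raw is None else cast(raw)

# Examples mirroring the table's defaults.
NUM_GENERATIONS = knob("NUM_GENERATIONS", 4, int)
LEARNING_RATE = knob("LEARNING_RATE", 1e-5, float)
GRAD_ACCUM = knob("GRAD_ACCUM", 8, int)
```

Because Space secrets and variables both surface as env vars, the same helper covers `HF_TOKEN` checks and numeric tuning knobs alike.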