---
title: OpenSleuth Trainer
emoji: 🛰️
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
suggested_hardware: t4-small
---

# OpenSleuth — Trainer

GPU Space that fine-tunes a small Qwen2.5 model with TRL **GRPO** to do in-context program synthesis against the live OpenSleuth env.

## Pipeline

1. Wait for the env Space to report healthy.
2. Build a dataset of synthesis prompts: each row pairs one black-box function with N pre-sampled `(input, output)` probes drawn from the env.
3. Load `Qwen/Qwen2.5-0.5B-Instruct` in 4-bit + LoRA via `bitsandbytes` and `peft`.
4. Train with `trl.GRPOTrainer`, generating `num_generations=4` candidate completions per prompt and rewarding each against the env's verifier.
5. Persist the LoRA adapter to `/data/opensleuth-grpo` and (if `HF_TOKEN` is set as a Space secret) push it to `anugrah55/opensleuth-qwen2.5-0.5b-grpo`.

## Reward

* `env_verifier_reward = env.score_submission(...) / 100` — the headline shaped reward, ranging roughly over `[-0.5, +1.5]`.
* `format_reward` — small bonus for emitting a fenced ```python``` block whose `def` matches the target function name; helps the model converge on parseable output early.

## Hardware

`t4-small` is sufficient for 0.5B + LoRA + bnb-4bit. `a10g-small` will train faster if available.

## Required Space secrets

* `HF_TOKEN` — write token, needed only if you want the LoRA adapter pushed to the Hub at the end of training.

## Tuning knobs

All knobs are exposed as env vars (defaults shown):

| Env var | Default | Meaning |
|---------|---------|---------|
| `ENV_URL` | env Space URL | OpenSleuth env to target |
| `MODEL_NAME` | `Qwen/Qwen2.5-0.5B-Instruct` | Base policy model |
| `N_PER_FUNCTION` | `16` | Prompts per black-box function |
| `N_PROBES` | `6` | Probes per prompt |
| `NUM_GENERATIONS` | `4` | GRPO group size |
| `LEARNING_RATE` | `1e-5` | Optimizer learning rate |
| `NUM_TRAIN_EPOCHS` | `1` | Training epochs |
| `PER_DEVICE_BATCH_SIZE` | `1` | Per-device train batch size |
| `GRAD_ACCUM` | `8` | Gradient accumulation steps |
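The `format_reward` described above could be sketched roughly as follows. This is a hypothetical illustration, not the trainer's actual code: the function name `format_reward`, the `target_name` parameter, the regex, and the `0.2` bonus value are all assumptions.

```python
import re

def format_reward(completion: str, target_name: str) -> float:
    """Hypothetical sketch: small bonus for a fenced ```python``` block
    that defines a function with the expected name."""
    # Extract the first fenced ```python ... ``` block, if any.
    match = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
    if match is None:
        return 0.0
    code = match.group(1)
    # Reward only if the block defines the target function.
    if re.search(rf"^\s*def\s+{re.escape(target_name)}\s*\(", code, re.MULTILINE):
        return 0.2  # small relative to the headline env_verifier_reward
    return 0.0
```

Keeping this bonus small relative to `env_verifier_reward` means it mainly shapes early training, when most completions are not yet parseable.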
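Steps 3–4 of the pipeline might be wired up roughly like this. This is a minimal sketch against TRL's `GRPOTrainer` API under stated assumptions, not this Space's actual training script: the placeholder reward function, the LoRA hyperparameters, and the dataset variable are illustrative only.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import GRPOConfig, GRPOTrainer

def env_verifier_reward(completions, **kwargs):
    # Placeholder: the real function submits each completion to the env
    # and returns env.score_submission(...) / 100 per completion.
    return [0.0 for _ in completions]

args = GRPOConfig(
    output_dir="/data/opensleuth-grpo",
    num_generations=4,              # GRPO group size
    learning_rate=1e-5,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    # Load the base policy in 4-bit via bitsandbytes.
    model_init_kwargs={"quantization_config": BitsAndBytesConfig(load_in_4bit=True)},
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[env_verifier_reward],   # plus format_reward in practice
    args=args,
    train_dataset=dataset,                # built in step 2; not shown here
    peft_config=LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32),
)
trainer.train()
trainer.save_model("/data/opensleuth-grpo")
```

With `per_device_train_batch_size=1` and `gradient_accumulation_steps=8`, the effective batch of 8 is divisible by `num_generations=4`, which GRPO requires so each prompt's group of completions stays together.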
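The tuning knobs in the table are plain environment variables, so reading them reduces to `os.environ` lookups with the documented defaults. A minimal sketch (the `knob` helper is hypothetical; variable names and defaults come from the table):

```python
import os

def knob(name: str, default: str) -> str:
    """Read a tuning knob from the environment, falling back to its default."""
    return os.environ.get(name, default)

MODEL_NAME = knob("MODEL_NAME", "Qwen/Qwen2.5-0.5B-Instruct")
N_PER_FUNCTION = int(knob("N_PER_FUNCTION", "16"))
N_PROBES = int(knob("N_PROBES", "6"))
NUM_GENERATIONS = int(knob("NUM_GENERATIONS", "4"))
LEARNING_RATE = float(knob("LEARNING_RATE", "1e-5"))
NUM_TRAIN_EPOCHS = int(knob("NUM_TRAIN_EPOCHS", "1"))
PER_DEVICE_BATCH_SIZE = int(knob("PER_DEVICE_BATCH_SIZE", "1"))
GRAD_ACCUM = int(knob("GRAD_ACCUM", "8"))
```

In a Space, these are set under Settings → Variables; anything unset falls back to the defaults above.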