---
title: OpenSleuth Trainer
emoji: 🛰️
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
suggested_hardware: t4-small
---
# OpenSleuth — Trainer
A GPU Space that fine-tunes a small Qwen2.5 model with TRL's GRPO to do in-context program synthesis against the live OpenSleuth env Space.
## Pipeline
- Wait for the env Space to report healthy.
- Build a dataset of synthesis prompts: each row pairs one black-box function with N pre-sampled `(input, output)` probes drawn from the env.
- Load `Qwen/Qwen2.5-0.5B-Instruct` in 4-bit + LoRA via `bitsandbytes` and `peft`.
- Train with `trl.GRPOTrainer`, generating `num_generations=4` candidate completions per prompt and rewarding each against the env's verifier.
- Persist the LoRA adapter to `/data/opensleuth-grpo` and (if `HF_TOKEN` is set as a Space secret) push to `anugrah55/opensleuth-qwen2.5-0.5b-grpo`.
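The dataset-building step above can be sketched roughly as follows; the prompt template and the names `build_prompt` / `fn_name` are illustrative assumptions, not the trainer's actual code:

```python
# Sketch of building one synthesis prompt from pre-sampled probes.
# Template wording and function names are assumptions for illustration.

def build_prompt(fn_name: str, probes: list[tuple]) -> str:
    """Pair one black-box function with its (input, output) probes."""
    lines = [f"A hidden Python function `{fn_name}` was probed with these inputs:"]
    for inp, out in probes:
        lines.append(f"  {fn_name}({inp!r}) -> {out!r}")
    lines.append(
        f"Write a fenced python code block defining `{fn_name}` "
        "that reproduces this behavior."
    )
    return "\n".join(lines)

# One dataset row per (function, probe-set) pair; data here is hypothetical.
row = build_prompt("f_3", [(1, 2), (4, 8), (10, 20)])
```

Each black-box function contributes `N_PER_FUNCTION` such rows, each with `N_PROBES` probes.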
## Reward
- `env_verifier_reward = env.score_submission(...) / 100` — the headline shaped reward, ranging roughly over `[-0.5, +1.5]`.
- `format_reward` — a small bonus for emitting a fenced `python` block whose `def` matches the target function name; helps the model converge on parseable output early.
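A minimal sketch of such a format reward, assuming the reward function sees raw completion strings; the regex and the 0.0/0.2 bonus values are illustrative, not the trainer's actual constants:

```python
import re

# Illustrative format reward: small bonus for a fenced python block
# whose `def` matches the target function name. Values are assumptions.
def format_reward(completion: str, fn_name: str) -> float:
    match = re.search(r"```python\s*\n(.*?)```", completion, re.DOTALL)
    if match is None:
        return 0.0  # no fenced python block at all
    body = match.group(1)
    # Reward only when the block defines the right function.
    if re.search(rf"\bdef\s+{re.escape(fn_name)}\s*\(", body):
        return 0.2
    return 0.0

good = "Here:\n```python\ndef f_3(x):\n    return 2 * x\n```"
bad = "def f_3(x): return x"  # correct def, but not fenced
```

TRL's `GRPOTrainer` accepts a list of reward functions, so a shaped verifier reward and a format bonus like this can simply be summed per completion.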
## Hardware
`t4-small` is sufficient for 0.5B + LoRA + bnb-4bit; `a10g-small` will train faster if available.
## Required Space secrets
- `HF_TOKEN` — write token if you want the LoRA adapter pushed to the Hub at the end of training.
## Tuning knobs
All knobs are exposed as env vars (defaults shown):
| Env var | Default | Meaning |
|---|---|---|
| `ENV_URL` | env Space URL | OpenSleuth env to target |
| `MODEL_NAME` | `Qwen/Qwen2.5-0.5B-Instruct` | Base policy |
| `N_PER_FUNCTION` | `16` | Prompts per black-box function |
| `N_PROBES` | `6` | Probes per prompt |
| `NUM_GENERATIONS` | `4` | GRPO group size |
| `LEARNING_RATE` | `1e-5` | Optimizer learning rate |
| `NUM_TRAIN_EPOCHS` | `1` | Training epochs |
| `PER_DEVICE_BATCH_SIZE` | `1` | Per-device train batch size |
| `GRAD_ACCUM` | `8` | Gradient accumulation steps |
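Reading these knobs inside the trainer can be sketched as below; the `knob` helper and the idea that each knob casts to its default's type are illustrative assumptions:

```python
import os

# Illustrative: parse one tuning knob from the environment,
# falling back to a typed default when the var is unset.
def knob(name: str, default, cast=None):
    raw = os.environ.get(name)
    if raw is None:
        return default
    return (cast or type(default))(raw)

NUM_GENERATIONS = knob("NUM_GENERATIONS", 4)   # GRPO group size
LEARNING_RATE = knob("LEARNING_RATE", 1e-5)    # optimizer LR
GRAD_ACCUM = knob("GRAD_ACCUM", 8)             # gradient accumulation steps
MODEL_NAME = knob("MODEL_NAME", "Qwen/Qwen2.5-0.5B-Instruct")
```

Casting from the default's type keeps numeric knobs numeric (`"16"` becomes `16`, `"1e-5"` becomes a float) while string knobs pass through unchanged.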