---
title: OpenSleuth Trainer
emoji: 🛰️
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
suggested_hardware: t4-small
---

# OpenSleuth — Trainer

GPU Space that fine-tunes a small Qwen2.5 model with TRL **GRPO** to do
in-context program synthesis against the live OpenSleuth env.

## Pipeline

1. Wait for the env Space to report healthy.
2. Build a dataset of synthesis prompts: each row pairs one black-box
   function with N pre-sampled `(input, output)` probes drawn from the env.
3. Load `Qwen/Qwen2.5-0.5B-Instruct` in 4-bit + LoRA via `bitsandbytes` and
   `peft`.
4. Train with `trl.GRPOTrainer`, generating `num_generations=4` candidate
   completions per prompt and rewarding each against the env's verifier.
5. Persist the LoRA adapter to `/data/opensleuth-grpo` and (if `HF_TOKEN` is
   set as a Space secret) push to `anugrah55/opensleuth-qwen2.5-0.5b-grpo`.
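
Step 2 can be sketched as follows. This is an illustrative mock, not the trainer's actual code: `build_prompt`, the probe tuple format, and the wording are all assumptions; only the shape (one function name paired with N pre-sampled `(input, output)` probes per row) comes from the pipeline above.

```python
# Hypothetical sketch of step 2 (dataset construction). Names and prompt
# wording are illustrative; the real trainer's helpers may differ.
def build_prompt(fn_name: str, probes: list[tuple]) -> str:
    """Pair one black-box function with its pre-sampled (input, output) probes."""
    lines = [
        f"You are reverse-engineering a black-box function `{fn_name}`.",
        "Observed probes:",
    ]
    for args, out in probes:
        # args is the positional-argument tuple the env was probed with
        lines.append(f"  {fn_name}{args!r} -> {out!r}")
    lines.append(
        f"Write a Python implementation named `{fn_name}` "
        "inside a fenced ```python``` block."
    )
    return "\n".join(lines)

# One dataset row: one function, N_PROBES pre-sampled probes
prompt = build_prompt("mystery_fn", [((2,), 4), ((3,), 9)])
```

With `N_PER_FUNCTION=16`, sixteen such rows (with independently sampled probe sets) would be emitted per black-box function.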

## Reward

* `env_verifier_reward = env.score_submission(...) / 100` — the headline
  shaped reward, ranging roughly `[-0.5, +1.5]`.
* `format_reward` — small bonus for emitting a fenced ```python``` block
  whose `def` matches the target function name; helps the model converge on
  parseable output early.
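
A minimal sketch of what `format_reward` could look like, assuming a flat bonus of `0.2` (the actual bonus value and reward shaping are not specified above):

```python
import re

# Illustrative sketch of format_reward; the 0.2 bonus is an assumed value.
def format_reward(completion: str, target_name: str) -> float:
    """Bonus iff the completion contains a fenced ```python``` block
    whose `def` matches the target function name."""
    m = re.search(r"```python\s*\n(.*?)```", completion, re.DOTALL)
    if not m:
        return 0.0  # no parseable fenced block at all
    body = m.group(1)
    has_def = re.search(
        rf"^\s*def\s+{re.escape(target_name)}\s*\(", body, re.MULTILINE
    )
    return 0.2 if has_def else 0.0
```

Because GRPO advantages are computed within each group of `num_generations` candidates, even a small, sparse bonus like this is enough to separate parseable from unparseable completions early in training.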

## Hardware

`t4-small` is sufficient for 0.5B + LoRA + bnb-4bit. `a10g-small` will train
faster if available.

## Required Space secrets

* `HF_TOKEN` — write token if you want the LoRA adapter pushed to the Hub at
  the end of training.

## Tuning knobs

All knobs are exposed as env vars (defaults shown):

| Env var | Default | Meaning |
|---------|---------|---------|
| `ENV_URL` | env Space URL | OpenSleuth env to target |
| `MODEL_NAME` | `Qwen/Qwen2.5-0.5B-Instruct` | Base policy |
| `N_PER_FUNCTION` | `16` | Prompts per black-box function |
| `N_PROBES` | `6` | Probes per prompt |
| `NUM_GENERATIONS` | `4` | GRPO group size |
| `LEARNING_RATE` | `1e-5` | Optimizer learning rate |
| `NUM_TRAIN_EPOCHS` | `1` | Passes over the prompt dataset |
| `PER_DEVICE_BATCH_SIZE` | `1` | Prompts per device per step |
| `GRAD_ACCUM` | `8` | Gradient-accumulation steps |
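
The table above can be read with a small helper like this. The `knob` function is illustrative (not part of the trainer); the names and defaults match the table:

```python
import os

# Hypothetical helper: read an env-var knob, falling back to the default
# and casting to the default's type. Names/defaults match the table above.
def knob(name: str, default):
    raw = os.environ.get(name)
    if raw is None:
        return default
    return type(default)(raw)

MODEL_NAME = knob("MODEL_NAME", "Qwen/Qwen2.5-0.5B-Instruct")
N_PER_FUNCTION = knob("N_PER_FUNCTION", 16)
N_PROBES = knob("N_PROBES", 6)
NUM_GENERATIONS = knob("NUM_GENERATIONS", 4)
LEARNING_RATE = knob("LEARNING_RATE", 1e-5)
NUM_TRAIN_EPOCHS = knob("NUM_TRAIN_EPOCHS", 1)
PER_DEVICE_BATCH_SIZE = knob("PER_DEVICE_BATCH_SIZE", 1)
GRAD_ACCUM = knob("GRAD_ACCUM", 8)
```

Override any knob as a Space variable or secret; unset knobs fall back to the defaults shown.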