Don Rishabh and Claude Opus 4.7 (1M context) committed on
Commit 6206e8a · 1 Parent(s): dec12b4

Add training/TRAINING.md — end-to-end reproduction recipe


Self-contained guide to reproducing every artifact in the project:
hero training, multi-step training (with the warm-start gotcha),
base+trained eval, demo CSV build, plots, Trackio replay. Includes a
copy-pasteable end-to-end checklist + common pitfalls we hit
(whoami rate limit, adapter-pushed-to-subfolder breaks PEFT, OOM on
multi-step without gradient checkpointing).

README links the new doc from the Links section and the training-
pipeline file table. .gitignore: drop a stray local eval dump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (3)
  1. .gitignore +1 -0
  2. README.md +8 -2
  3. training/TRAINING.md +234 -0
.gitignore CHANGED
@@ -18,3 +18,4 @@ checkpoints/
 wandb/
 runs/
 .ipynb_checkpoints/
+before_after_prompts.csv
README.md CHANGED
@@ -29,6 +29,7 @@ tags:
 - 📊 **Training dashboard (Trackio):** https://huggingface.co/spaces/rishabh16196/prompt-golf-trackio
 - 🐙 **GitHub mirror:** https://github.com/rishabh16196/prompt_golf_env
 - 🛠️ **Training pipeline:** [`training/`](https://github.com/rishabh16196/prompt_golf_env/tree/main/training) — full GRPO trainers, eval harness, profilers, HF Jobs launchers
+- 📖 **How to train end-to-end:** [`training/TRAINING.md`](./training/TRAINING.md) — step-by-step recipe to reproduce hero + multi-step + evals + plots
 - 📝 **Blog post:** [`BLOG_POST.md`](./BLOG_POST.md)
 
 ### Trained adapters & data
@@ -42,14 +43,19 @@ tags:
 
 ### Training pipeline ([`training/`](https://github.com/rishabh16196/prompt_golf_env/tree/main/training))
 
+> **📖 Reproduction recipe: [`training/TRAINING.md`](./training/TRAINING.md)** — end-to-end steps for hero + multi-step training, evals, plots, demo CSV, Trackio replay.
+
 | File | Role |
 |---|---|
-| [`training/train_grpo.py`](./training/train_grpo.py) | TRL GRPO single-step trainer |
+| [`training/TRAINING.md`](./training/TRAINING.md) | Step-by-step reproduction recipe (start here) |
+| [`training/train_grpo.py`](./training/train_grpo.py) | TRL GRPO single-step trainer (the hero recipe) |
+| [`training/train_grpo_multistep.py`](./training/train_grpo_multistep.py) | Trajectory-level GRPO for 3-turn episodes |
 | [`training/eval_before_after.py`](./training/eval_before_after.py) | base + trained-adapter eval harness |
 | [`training/profile_baseline.py`](./training/profile_baseline.py) | per-task target-capability profiler |
 | [`training/build_before_after_csv.py`](./training/build_before_after_csv.py) | merge eval JSONLs into the demo CSV |
+| [`training/make_plots.py`](./training/make_plots.py) | reward / length / breakdown curves from `train_metrics.jsonl` |
 | [`training/replay_to_trackio.py`](./training/replay_to_trackio.py) | post-hoc replay of `train_metrics.jsonl` into the Trackio dashboard Space |
-| [`training/hf_job_train.sh`](./training/hf_job_train.sh) / [`hf_job_eval.sh`](./training/hf_job_eval.sh) / [`hf_job_profile.sh`](./training/hf_job_profile.sh) | HuggingFace Jobs launchers |
+| [`training/hf_job_train.sh`](./training/hf_job_train.sh) / [`hf_job_train_multistep.sh`](./training/hf_job_train_multistep.sh) / [`hf_job_eval.sh`](./training/hf_job_eval.sh) / [`hf_job_profile.sh`](./training/hf_job_profile.sh) | HuggingFace Jobs launchers |
 
 ---
 
training/TRAINING.md ADDED
@@ -0,0 +1,234 @@
# Training & evaluation pipeline

End-to-end recipe for reproducing the Prompt Golf adapters and demo CSVs from scratch — single-step hero, multi-step variant, control runs, plots, and the Trackio dashboard. Every step runs on HuggingFace Jobs (single L40S, 48 GB), so you don't need a local GPU.

> **TL;DR:** profile the target → train → eval base + trained → build demo CSV → render plots → replay metrics to Trackio. Each step is a separate script under `training/` with a `hf_job_*.sh` wrapper.

---

## 0. Prerequisites

- **HuggingFace account + token** with write access to a destination namespace (yours, not `rishabh16196/...`). Log in locally:
  ```bash
  hf auth login
  ```
- **Python 3.10+** with the OpenEnv repo installed (`pip install -e .` from the repo root).
- **Llama-3.2-3B-Instruct access.** The model is gated — request access at <https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct> and confirm `hf auth whoami` shows your account before you dispatch jobs.
- **HF Jobs quota.** A full hero training run takes ≈3h on an L40S; multi-step takes ≈3.5h. Eval pairs are ≈15 min each.

All `hf_job_*.sh` launchers honor these env vars; export them once at the top of your shell session and you're done:

```bash
export PUSH_TO_HUB=your-username/your-adapter-repo    # destination for adapter + plots + metrics + evals
export TARGET_MODEL=meta-llama/Llama-3.2-3B-Instruct  # frozen target the agent learns to whisper to
export AGENT_MODEL=Qwen/Qwen3-1.7B                    # trainable agent base
```

---

## 1. (Optional but recommended) Profile target capability per task

Before burning GPU hours, check which tasks the target can actually do — tasks where even the verbose prompt fails contribute zero gradient and dilute the budget.

```bash
bash training/hf_job_profile.sh
# Pushes profile JSONL to ${PUSH_TO_HUB}/profiles/baseline_<target_slug>.jsonl
```

The output is a per-task `description_baseline` (verbose-prompt accuracy on that target). Any task at 0.0 means the target genuinely can't do it — drop those from your active task list before training.
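As a concrete illustration, a minimal filter over that JSONL, assuming each row carries a `task_id` alongside `description_baseline` (field names beyond the latter are assumptions; adjust to your profiler output):

```python
import json

# Keep only tasks the target can do at all with the verbose prompt;
# 0.0-baseline tasks contribute zero gradient during training.
with open("baseline_llama-3-2-3b-instruct.jsonl") as f:  # filename is illustrative
    rows = [json.loads(line) for line in f]

alive = [r["task_id"] for r in rows if r["description_baseline"] > 0.0]
print(f"{len(alive)}/{len(rows)} tasks usable for training")
```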

---

## 2. Single-step training (the "hero" recipe)

`training/train_grpo.py` — vanilla TRL GRPO, single-step rollouts, LoRA r=16/α=32 on Qwen3-1.7B.

```bash
PUSH_TO_HUB=your-username/your-hero-adapter \
TARGET_MODEL=meta-llama/Llama-3.2-3B-Instruct \
ENABLE_THINKING=false \
bash training/hf_job_train.sh
```

What this does:
- 500 GRPO steps × `num_generations=8` × 6 hidden test inputs per task → ≈24,000 target inferences.
- Reward per rollout: `raw_task_score − 0.5·baseline − 0.002·tokens − leakage_overlap²`, clipped to `[−0.5, 1.3]` (see the sketch after this list).
- Anti-collapse: `MIN_TOKENS_FLOOR=5` (without it the agent finds 1-token tokenizer attacks).
- Pushes to the destination repo: `adapter_final/`, `train_metrics.jsonl`, `plots/{reward_curve,length_curve,breakdown}.png`.
- Wallclock: ≈3h on an L40S.
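A minimal sketch of that shaped reward, assuming the four ingredients are already computed per rollout (the exact implementation lives in `train_grpo.py`):

```python
def rollout_reward(raw_task_score: float, baseline: float,
                   n_tokens: int, leakage_overlap: float) -> float:
    """Shaped reward: beat the zero-shot baseline, stay short, don't leak."""
    r = (raw_task_score
         - 0.5 * baseline          # discount what the target gets for free
         - 0.002 * n_tokens        # per-token brevity pressure
         - leakage_overlap ** 2)   # quadratic penalty for copying hidden examples
    return max(-0.5, min(1.3, r))  # clip to [-0.5, 1.3]
```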

Key flags (override via env vars in the launcher script):
- `--max-completion-length 256` — the agent's per-turn output budget. Bump it to 768+ if you enable thinking (see below).
- `--enable-thinking` / `--no-enable-thinking` — Qwen3's `<think>...</think>` chat template. We ship with it OFF; ON loses on reward and adds 30% more tokens.
- `--num-generations 8` — GRPO group size. Smaller = faster, less stable.

---

## 3. Multi-step training (the 3-turn variant)

`training/train_grpo_multistep.py` — hand-rolled trajectory-level GRPO. TRL's GRPO can't do multi-turn credit assignment cleanly (it expects one prompt → one scalar reward), so this is a separate trainer.

```bash
WARMSTART_ADAPTER=your-username/your-hero-adapter \
PUSH_TO_HUB=your-username/your-multistep-adapter \
TARGET_MODEL=meta-llama/Llama-3.2-3B-Instruct \
bash training/hf_job_train_multistep.sh
```

What's different:
- **Warm-start from the hero adapter** (non-optional in practice — cold starts burn 1000+ steps rediscovering single-turn behavior).
- 150 steps × 8 trajectories × 3 turns = 24 turns/step.
- Each task's 6 hidden test inputs are split into a 2-example *feedback slice* (revealed with target outputs across turns 1+2) and a 4-example *scoring slice* (the final-turn prompt is judged on these only).
- Reward = final-turn additive rubric: a REINFORCE-style policy gradient over full-trajectory action tokens, with a KL penalty against a snapshot of the warm-start LoRA (see the sketch after this list).
- Memory config (required on the L40S): `--gradient-checkpointing` ON, `--update-micro-batch 2`, `--max-prompt-tokens 2048`, `--max-new-tokens 384`.
- Wallclock: ≈3.5h.
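A minimal sketch of the group-relative step at the heart of trajectory-level GRPO, assuming the 8 trajectories sampled for a task form one reward group (the real trainer in `train_grpo_multistep.py` adds the KL term, micro-batching, and action-token masking):

```python
import torch

def trajectory_advantages(final_rewards: torch.Tensor) -> torch.Tensor:
    """final_rewards: shape (8,), one final-turn rubric score per trajectory."""
    # Group-relative baseline: each trajectory is scored against its 7
    # siblings, so only better-than-average rollouts get positive advantage.
    mean = final_rewards.mean()
    std = final_rewards.std().clamp_min(1e-6)  # guard for all-equal groups
    return (final_rewards - mean) / std

# The per-trajectory REINFORCE loss is then -advantage * (sum of action-token
# log-probs across all 3 turns), plus the KL penalty to the warm-start LoRA.
```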

After training, the adapter ships to `${PUSH_TO_HUB}/adapter_final/`. **For the eval step to find it, you must promote it to the repo root** (PEFT looks for `adapter_config.json` at the root):

```bash
mkdir -p /tmp/adapter && \
  hf download $PUSH_TO_HUB --include "adapter_final/*" --local-dir /tmp/adapter
cd /tmp/adapter/adapter_final
hf upload $PUSH_TO_HUB . . --commit-message "Promote adapter to repo root"
```

We hit this in the multi-step run; future versions of `train_grpo_multistep.py` should push to the root directly.

---

## 4. Eval — base vs trained, on the same target

`training/eval_before_after.py` — runs the agent on every task with no adapter (`base`), then again with the trained adapter (`trained`). The output is one JSONL per label, with one row per (task × seed); both are pushed to the adapter repo's `evals/` folder.

```bash
ADAPTER_REPO=your-username/your-hero-adapter \
TARGET_MODEL=meta-llama/Llama-3.2-3B-Instruct \
ENABLE_THINKING=false \
bash training/hf_job_eval.sh both   # dispatches base + trained as two jobs
```

Modes:
- `base` — eval the untrained Qwen3 only (cheaper; run once per target).
- `trained` — eval the adapter only (for when you've added a new adapter and a base eval already exists).
- `both` — convenience for a fresh adapter.

Each job takes ≈15 min on an L40S. Output rows include `task_id`, `category`, `agent_prompt`, `tokens`, `raw_task_score`, `reward`, `leakage_penalty`, `baseline_zero_shot`. Aggregate however you like — see `build_before_after_csv.py` for the canonical roll-up.
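For a quick look before the canonical roll-up, a minimal pandas aggregation over the two JSONLs (field names are the ones listed above; the local paths are illustrative):

```python
import json
import pandas as pd

def load_eval(path: str, label: str) -> pd.DataFrame:
    with open(path) as f:
        df = pd.DataFrame([json.loads(line) for line in f])
    df["label"] = label
    return df

df = pd.concat([load_eval("eval_base.jsonl", "base"),
                load_eval("eval_trained.jsonl", "trained")])
# Mean task score and prompt length, per label and task category
print(df.groupby(["label", "category"])[["raw_task_score", "tokens"]].mean())
```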

---

## 5. Build the demo CSV

`training/build_before_after_csv.py` — joins the verbose human prompts (from the env's task definitions) with the eval JSONLs to produce a 90-row demo CSV: `verbose / base / trained` prompts side by side, plus accuracy/token/reward deltas, plus 3 sample test inputs per task for the Gradio demo.

```bash
python training/build_before_after_csv.py \
  --base-jsonl evals/eval_base.jsonl \
  --trained-jsonl evals/eval_trained.jsonl \
  --out evals/qwen_to_llama_demo.csv \
  --min-verbose-accuracy 0.0   # set to >0 to drop dead-target tasks
```

Then upload the CSV to the adapter repo so the demo Space can fetch it:

```bash
hf upload $ADAPTER_REPO evals/qwen_to_llama_demo.csv evals/qwen_to_llama_demo.csv
```

---

## 6. Plots (already auto-generated by the trainer)

`training/make_plots.py` is invoked by `hf_job_train.sh` at the end of training. If you want to re-render after the fact:

```bash
python training/make_plots.py \
  --metrics-jsonl train_metrics.jsonl \
  --out-dir plots/
```

Produces `reward_curve.png`, `length_curve.png`, and `breakdown.png`. The same plots are embedded in `BLOG_POST.md` and the Trackio dashboard.
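To pull a single curve without the script, a minimal matplotlib sketch; note the metric key names (`step`, `reward`) are assumptions, so inspect your `train_metrics.jsonl` for the actual keys:

```python
import json
import matplotlib.pyplot as plt

steps, rewards = [], []
with open("train_metrics.jsonl") as f:
    for line in f:
        m = json.loads(line)
        steps.append(m["step"])      # assumed key name
        rewards.append(m["reward"])  # assumed key name

plt.plot(steps, rewards)
plt.xlabel("GRPO step")
plt.ylabel("mean reward")
plt.savefig("reward_curve.png", dpi=150)
```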

---

## 7. Replay training metrics to Trackio (optional dashboard)

`training/replay_to_trackio.py` — post-hoc replay of `train_metrics.jsonl` files into the [Trackio](https://huggingface.co/spaces/rishabh16196/prompt-golf-trackio) dashboard Space. Useful for cross-run comparison (hero vs multi-step vs Qwen→Qwen control).

Edit the `RUNS` dict at the top of the script to point at your adapter repos, then:

```bash
python training/replay_to_trackio.py
```

The script logs each run as a separate Trackio experiment with full per-step metrics; the dashboard reflows automatically.
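Under the hood, a per-run replay amounts to roughly the following (a hedged sketch: the project and run names are illustrative, and the real script handles multiple runs plus the dashboard Space config):

```python
import json
import trackio

# Trackio mirrors the wandb-style init/log/finish API.
trackio.init(project="prompt-golf", name="hero")  # illustrative names
with open("train_metrics.jsonl") as f:
    for line in f:
        trackio.log(json.loads(line))  # one dict of metrics per training step
trackio.finish()
```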

---

## 8. End-to-end checklist

For a clean from-scratch reproduction of the hero release:

```bash
# 0) auth
hf auth login

# 1) (optional) profile
bash training/hf_job_profile.sh

# 2) train hero
PUSH_TO_HUB=your-name/prompt-golf-hero \
bash training/hf_job_train.sh          # ≈3h

# 3) eval base + trained
ADAPTER_REPO=your-name/prompt-golf-hero \
bash training/hf_job_eval.sh both      # 2 × ≈15min

# 4) build demo CSV (pulls eval JSONLs from the repo, joins verbose prompts;
#    `hf download` prints the cached file path, which we pass straight in)
python training/build_before_after_csv.py \
  --base-jsonl "$(hf download your-name/prompt-golf-hero evals/eval_base.jsonl)" \
  --trained-jsonl "$(hf download your-name/prompt-golf-hero evals/eval_trained.jsonl)" \
  --out qwen_to_llama_demo.csv

# 5) (optional) Trackio replay
python training/replay_to_trackio.py
```

For the multi-step variant, swap step 2:

```bash
WARMSTART_ADAPTER=your-name/prompt-golf-hero \
PUSH_TO_HUB=your-name/prompt-golf-multistep \
bash training/hf_job_train_multistep.sh  # ≈3.5h
# then promote adapter_final/ to the repo root before eval (see §3)
```

---

## File index

| Script | Role |
|---|---|
| [`train_grpo.py`](./train_grpo.py) | Single-step TRL GRPO trainer (the hero recipe). |
| [`train_grpo_multistep.py`](./train_grpo_multistep.py) | Trajectory-level GRPO for multi-turn episodes. |
| [`eval_before_after.py`](./eval_before_after.py) | Base + trained-adapter eval harness, one JSONL per label. |
| [`profile_baseline.py`](./profile_baseline.py) | Per-task target-capability profiler (verbose-prompt accuracy on a target). |
| [`build_before_after_csv.py`](./build_before_after_csv.py) | Merges eval JSONLs + verbose prompts into the demo CSV. |
| [`make_plots.py`](./make_plots.py) | Reward / length / breakdown curves from `train_metrics.jsonl`. |
| [`replay_to_trackio.py`](./replay_to_trackio.py) | Post-hoc Trackio dashboard replay across runs. |
| [`hf_job_train.sh`](./hf_job_train.sh) | HF Jobs launcher for single-step training. |
| [`hf_job_train_multistep.sh`](./hf_job_train_multistep.sh) | HF Jobs launcher for multi-step training. |
| [`hf_job_eval.sh`](./hf_job_eval.sh) | HF Jobs launcher for base + trained eval. |
| [`hf_job_profile.sh`](./hf_job_profile.sh) | HF Jobs launcher for the capability profiler. |
| [`requirements.txt`](./requirements.txt) | Pinned deps for the training jobs (matches OpenEnv-official torch/transformers/trl). |

---

## Common pitfalls (we hit all of these)

- **HF `whoami` rate limit** blocks back-to-back job dispatches. If you hit it, wait 5–25 min before retrying; don't poll-loop.
- **Adapter pushed to a subfolder** (e.g. `adapter_final/`) breaks `PeftModel.from_pretrained()`, which expects `adapter_config.json` at the repo root. Promote the files to the root before running eval (a possible code-side workaround is sketched after this list).
- **OOM on the L40S during multi-step training** if you forget `--gradient-checkpointing`. Set the four memory flags from §3 from the start.
- **Stale `TASK_NAMES` constant in `inference.py`** silently drops new tasks from training. Use the lazy `_all_task_ids()` helper instead.
- **Cold-starting multi-turn** wastes compute. Always warm-start from a single-turn hero adapter.
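On the PEFT pitfall: if you control the loading code, recent PEFT versions also accept a `subfolder` argument, which can stand in for promoting the files (version-dependent, and not what our eval harness does; promoting to the root remains the robust fix):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
# `subfolder` support varies by peft version; if this raises, fall back
# to promoting adapter_final/ to the repo root as described in §3.
model = PeftModel.from_pretrained(
    base, "your-username/your-multistep-adapter", subfolder="adapter_final"
)
```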

If something else breaks, check the [`BLOG_POST.md`](../BLOG_POST.md) "Notes on training (for the curious)" section and the failure-mode bullets in "What the agent learned." Most issues we ran into are documented there.