Rishabh · Claude Opus 4.7 (1M context) committed
Commit 6206e8a
Parent(s): dec12b4
Add training/TRAINING.md — end-to-end reproduction recipe
Self-contained guide to reproducing every artifact in the project:
hero training, multi-step training (with the warm-start gotcha),
base+trained eval, demo CSV build, plots, Trackio replay. Includes a
copy-pasteable end-to-end checklist + common pitfalls we hit
(whoami rate limit, adapter-pushed-to-subfolder breaks PEFT, OOM on
multi-step without gradient checkpointing).
README links the new doc from the Links section and the training-pipeline
file table. .gitignore: drop a stray local eval dump.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- .gitignore +1 -0
- README.md +8 -2
- training/TRAINING.md +234 -0
.gitignore
CHANGED

```diff
@@ -18,3 +18,4 @@ checkpoints/
 wandb/
 runs/
 .ipynb_checkpoints/
+before_after_prompts.csv
```
README.md
CHANGED

```diff
@@ -29,6 +29,7 @@ tags:
 - 📊 **Training dashboard (Trackio):** https://huggingface.co/spaces/rishabh16196/prompt-golf-trackio
 - 🐙 **GitHub mirror:** https://github.com/rishabh16196/prompt_golf_env
 - 🛠️ **Training pipeline:** [`training/`](https://github.com/rishabh16196/prompt_golf_env/tree/main/training) — full GRPO trainers, eval harness, profilers, HF Jobs launchers
+- 📖 **How to train end-to-end:** [`training/TRAINING.md`](./training/TRAINING.md) — step-by-step recipe to reproduce hero + multi-step + evals + plots
 - 📝 **Blog post:** [`BLOG_POST.md`](./BLOG_POST.md)
 
 ### Trained adapters & data
@@ -42,14 +43,19 @@ tags:
 
 ### Training pipeline ([`training/`](https://github.com/rishabh16196/prompt_golf_env/tree/main/training))
 
+> **📖 Reproduction recipe: [`training/TRAINING.md`](./training/TRAINING.md)** — end-to-end steps for hero + multi-step training, evals, plots, demo CSV, Trackio replay.
+
 | File | Role |
 |---|---|
-| [`training/
+| [`training/TRAINING.md`](./training/TRAINING.md) | Step-by-step reproduction recipe (start here) |
+| [`training/train_grpo.py`](./training/train_grpo.py) | TRL GRPO single-step trainer (the hero recipe) |
+| [`training/train_grpo_multistep.py`](./training/train_grpo_multistep.py) | Trajectory-level GRPO for 3-turn episodes |
 | [`training/eval_before_after.py`](./training/eval_before_after.py) | base + trained-adapter eval harness |
 | [`training/profile_baseline.py`](./training/profile_baseline.py) | per-task target-capability profiler |
 | [`training/build_before_after_csv.py`](./training/build_before_after_csv.py) | merge eval JSONLs into the demo CSV |
+| [`training/make_plots.py`](./training/make_plots.py) | reward / length / breakdown curves from `train_metrics.jsonl` |
 | [`training/replay_to_trackio.py`](./training/replay_to_trackio.py) | post-hoc replay of `train_metrics.jsonl` into the Trackio dashboard Space |
-| [`training/hf_job_train.sh`](./training/hf_job_train.sh) / [`hf_job_eval.sh`](./training/hf_job_eval.sh) / [`hf_job_profile.sh`](./training/hf_job_profile.sh) | HuggingFace Jobs launchers |
+| [`training/hf_job_train.sh`](./training/hf_job_train.sh) / [`hf_job_train_multistep.sh`](./training/hf_job_train_multistep.sh) / [`hf_job_eval.sh`](./training/hf_job_eval.sh) / [`hf_job_profile.sh`](./training/hf_job_profile.sh) | HuggingFace Jobs launchers |
 
 ---
 
```
training/TRAINING.md
ADDED (+234 lines)
# Training & evaluation pipeline

End-to-end recipe for reproducing the Prompt Golf adapters and demo CSVs from scratch — single-step hero, multi-step variant, control runs, plots, and the Trackio dashboard. Every step runs on HuggingFace Jobs (single L40S, 48 GB), so you don't need a local GPU.

> **TL;DR:** profile the target → train → eval base + trained → build demo CSV → render plots → replay metrics to Trackio. Each step is a separate script under `training/` with a `hf_job_*.sh` wrapper.

---

## 0. Prerequisites

- **HuggingFace account + token** with write access to a destination namespace (yours, not `rishabh16196/...`). Log in locally:
  ```bash
  hf auth login
  ```
- **Python 3.10+** with the OpenEnv repo installed (`pip install -e .` from the repo root).
- **Llama-3.2-3B-Instruct access.** The model is gated — request access at <https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct> and confirm `hf auth whoami` shows your account before you dispatch jobs. A programmatic pre-flight check is sketched after this list.
- **HF Jobs quota.** A full hero training run is ≈3h on L40S; multi-step is ≈3.5h. Eval jobs are ≈15 min each (a base + trained pair dispatches two jobs).
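
If you'd rather fail fast than burn a job on a gated-model error, here is a minimal pre-flight check. This is a hedged sketch (it assumes `huggingface_hub` is installed; the script name is ours, not part of the repo):

```python
# preflight.py: verify token + gated-model access before dispatching jobs.
from huggingface_hub import hf_hub_download, whoami

print("Logged in as:", whoami()["name"])  # raises if the token is missing/invalid

# Fetching the tiny config.json fails with a 403-style error if gated access
# hasn't been granted yet; much cheaper to learn that here than in a job.
hf_hub_download("meta-llama/Llama-3.2-3B-Instruct", "config.json")
print("Gated access OK")
```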

All `hf_job_*.sh` launchers honor these env vars; export them once at the top of your shell session and you're done:

```bash
export PUSH_TO_HUB=your-username/your-adapter-repo    # destination for adapter + plots + metrics + evals
export TARGET_MODEL=meta-llama/Llama-3.2-3B-Instruct  # frozen target the agent learns to whisper to
export AGENT_MODEL=Qwen/Qwen3-1.7B                    # trainable agent base
```

---

## 1. (Optional but recommended) Profile target capability per task

Before burning GPU hours, check which tasks the target can actually do — tasks where the verbose prompt also fails contribute zero gradient and dilute the budget.

```bash
bash training/hf_job_profile.sh
# Pushes profile JSONL to ${PUSH_TO_HUB}/profiles/baseline_<target_slug>.jsonl
```

The output is a per-task `description_baseline` (verbose-prompt accuracy on that target). Any task at 0.0 means the target genuinely can't do it — drop those from your active task list before training.
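
As an example of acting on that output, a minimal filter over the profile JSONL. This is a hedged sketch: the `task_id` / `description_baseline` field names follow the description above, and the filename is illustrative:

```python
# Sketch: list dead-target tasks to drop from the active task list.
import json

keep, drop = [], []
with open("profiles/baseline_llama-3-2-3b-instruct.jsonl") as f:  # illustrative path
    for line in f:
        row = json.loads(line)
        (drop if row["description_baseline"] == 0.0 else keep).append(row["task_id"])

print(f"{len(keep)} trainable tasks; dropping dead-target tasks: {drop}")
```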

---

## 2. Single-step training (the "hero" recipe)

`training/train_grpo.py` — vanilla TRL GRPO, single-step rollouts, LoRA r=16/α=32 on Qwen3-1.7B.

```bash
PUSH_TO_HUB=your-username/your-hero-adapter \
TARGET_MODEL=meta-llama/Llama-3.2-3B-Instruct \
ENABLE_THINKING=false \
bash training/hf_job_train.sh
```

What this does:
- 500 GRPO steps × `num_generations=8` × 6 hidden test inputs per task → ≈24,000 target inferences.
- Reward per rollout: `raw_task_score − 0.5·baseline − 0.002·tokens − leakage_overlap²`, clipped to `[−0.5, 1.3]` (see the sketch after this list).
- Anti-collapse: `MIN_TOKENS_FLOOR=5` (without it the agent finds 1-token tokenizer attacks).
- Pushes to the destination repo: `adapter_final/`, `train_metrics.jsonl`, `plots/{reward_curve,length_curve,breakdown}.png`.
- Wallclock: ≈3h on L40S.
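
For readability, here is the reward rule above transcribed into Python: a hedged sketch of the scoring logic, not the trainer's actual code (argument names are ours; enforcement of `MIN_TOKENS_FLOOR` is not shown):

```python
# Sketch of the per-rollout reward: improvement over baseline, taxed per token,
# with a squared penalty for leaking answers into the prompt.
def rollout_reward(raw_task_score: float, baseline: float,
                   tokens: int, leakage_overlap: float) -> float:
    r = raw_task_score - 0.5 * baseline - 0.002 * tokens - leakage_overlap ** 2
    return max(-0.5, min(1.3, r))  # clip to [-0.5, 1.3]
```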

Key flags (override via env vars in the launcher script):
- `--max-completion-length 256` — agent's per-turn output budget. Bump to 768+ if you enable thinking (see below).
- `--enable-thinking` / `--no-enable-thinking` — Qwen3's `<think>...</think>` chat template. We ship it OFF; ON loses on reward and adds ~30% more tokens.
- `--num-generations 8` — GRPO group size. Smaller is faster but less stable.

---

## 3. Multi-step training (the 3-turn variant)

`training/train_grpo_multistep.py` — hand-rolled trajectory-level GRPO. TRL's GRPO can't do multi-turn credit assignment cleanly (it expects one prompt → one scalar reward), so this is a separate trainer.

```bash
WARMSTART_ADAPTER=your-username/your-hero-adapter \
PUSH_TO_HUB=your-username/your-multistep-adapter \
TARGET_MODEL=meta-llama/Llama-3.2-3B-Instruct \
bash training/hf_job_train_multistep.sh
```

What's different:
- **Warm-start from the hero adapter** (non-optional in practice — cold starts burn 1000+ steps rediscovering single-turn behavior).
- 150 steps × 8 trajectories × 3 turns = 24 turns/step.
- Each task's 6 hidden test inputs are split into a 2-example *feedback slice* (revealed across turns 1+2 with target outputs) and a 4-example *scoring slice* (the final-turn prompt is judged on these only); the split is sketched after this list.
- Reward = final-turn additive rubric. REINFORCE-style policy gradient over full-trajectory action tokens, with a KL penalty against a snapshot of the warm-start LoRA.
- Memory config (required for L40S): `--gradient-checkpointing` ON, `--update-micro-batch 2`, `--max-prompt-tokens 2048`, `--max-new-tokens 384`.
- Wallclock: ≈3.5h.
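
The feedback/scoring split in hedged pseudocode form (slice sizes are from the bullet above; the deterministic shuffle and function shape are assumptions, not the trainer's actual API):

```python
# Sketch: partition a task's 6 hidden test inputs into disjoint slices.
import random

def split_slices(test_inputs: list, seed: int) -> tuple[list, list]:
    assert len(test_inputs) == 6
    shuffled = test_inputs[:]
    random.Random(seed).shuffle(shuffled)  # reproducible per (task, seed)
    feedback = shuffled[:2]  # shown with target outputs on turns 1+2
    scoring = shuffled[2:]   # the final-turn prompt is judged on these only
    return feedback, scoring
```

Keeping the two slices disjoint is what stops the agent from simply pasting the revealed examples into its final prompt.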

After training, the adapter ships to `${PUSH_TO_HUB}/adapter_final/`. **For the eval step to find it, you must promote it to the repo root** (PEFT looks for `adapter_config.json` at the root):

```bash
mkdir -p /tmp/adapter && \
hf download $PUSH_TO_HUB --include "adapter_final/*" --local-dir /tmp/adapter
cd /tmp/adapter/adapter_final
hf upload $PUSH_TO_HUB . . --commit-message "Promote adapter to repo root"
```

We hit this in the multi-step run; future versions of `train_grpo_multistep.py` should push to the root directly.
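
To confirm the promotion worked before dispatching eval, a quick check. Hedged sketch (assumes `peft` is installed; the repo name is a placeholder):

```python
# Sketch: PEFT should now resolve adapter_config.json at the repo root.
from peft import PeftConfig

cfg = PeftConfig.from_pretrained("your-username/your-multistep-adapter")
print(cfg.base_model_name_or_path)  # expect the agent base, e.g. Qwen/Qwen3-1.7B
```

If this still complains about a missing `adapter_config.json`, the files are still sitting under `adapter_final/`.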

---

## 4. Eval — base vs trained, on the same target

`training/eval_before_after.py` — runs the agent on every task with no adapter (`base`), then again with the trained adapter (`trained`). Output is one JSONL per label, one row per (task × seed), both pushed to the adapter repo's `evals/` folder.

```bash
ADAPTER_REPO=your-username/your-hero-adapter \
TARGET_MODEL=meta-llama/Llama-3.2-3B-Instruct \
ENABLE_THINKING=false \
bash training/hf_job_eval.sh both   # dispatches base + trained as two jobs
```

Modes:
- `base` — eval the untrained Qwen3 only (cheaper; run once per target).
- `trained` — eval the adapter only (for when you've added a new adapter and a base eval already exists).
- `both` — convenience for a fresh adapter.

Each job is ≈15 min on L40S. Output rows include `task_id`, `category`, `agent_prompt`, `tokens`, `raw_task_score`, `reward`, `leakage_penalty`, `baseline_zero_shot`. Aggregate however you like — `build_before_after_csv.py` is the canonical roll-up, and a minimal version is sketched below.
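
For instance, a minimal roll-up of one eval JSONL. Hedged sketch using only the row fields listed above:

```python
# Sketch: mean score, prompt length, and reward per eval label.
import json
from statistics import mean

def summarize(path: str) -> dict:
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    return {
        "n": len(rows),
        "mean_score": mean(r["raw_task_score"] for r in rows),
        "mean_tokens": mean(r["tokens"] for r in rows),
        "mean_reward": mean(r["reward"] for r in rows),
    }

for label in ("base", "trained"):
    print(label, summarize(f"evals/eval_{label}.jsonl"))
```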

---

## 5. Build the demo CSV

`training/build_before_after_csv.py` — joins the verbose human prompts (from the env's task definitions) with the eval JSONLs to produce a 90-row demo CSV: `verbose / base / trained` prompts side by side, plus accuracy/token/reward deltas, plus 3 sample test inputs per task for the Gradio demo.
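
The heart of that join, in hedged form (column names follow the eval schema in §4; the per-task averaging and delta columns are illustrative, and the real script additionally pulls in the verbose prompts and sample inputs):

```python
# Sketch: per-task base-vs-trained deltas from the two eval JSONLs.
import pandas as pd

base = pd.read_json("evals/eval_base.jsonl", lines=True)
trained = pd.read_json("evals/eval_trained.jsonl", lines=True)

cols = ["raw_task_score", "tokens", "reward"]
demo = base.groupby("task_id")[cols].mean().join(  # average over seeds
    trained.groupby("task_id")[cols].mean(),
    lsuffix="_base", rsuffix="_trained",
)
demo["score_delta"] = demo["raw_task_score_trained"] - demo["raw_task_score_base"]
demo["token_delta"] = demo["tokens_trained"] - demo["tokens_base"]
demo.to_csv("qwen_to_llama_demo_sketch.csv")
```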

```bash
python training/build_before_after_csv.py \
  --base-jsonl evals/eval_base.jsonl \
  --trained-jsonl evals/eval_trained.jsonl \
  --out evals/qwen_to_llama_demo.csv \
  --min-verbose-accuracy 0.0   # set to >0 to drop dead-target tasks
```

Then upload the CSV to the adapter repo so the demo Space can fetch it:

```bash
hf upload $ADAPTER_REPO evals/qwen_to_llama_demo.csv evals/qwen_to_llama_demo.csv
```

---

## 6. Plots (already auto-generated by the trainer)

`training/make_plots.py` is invoked by `hf_job_train.sh` at the end of training. If you want to re-render after the fact:

```bash
python training/make_plots.py \
  --metrics-jsonl train_metrics.jsonl \
  --out-dir plots/
```

Produces `reward_curve.png`, `length_curve.png`, `breakdown.png`. The same plots are embedded in `BLOG_POST.md` and the Trackio dashboard.
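
If you want a custom view instead, `train_metrics.jsonl` is one JSON object per step, so a hand-rolled curve is a few lines. Hedged sketch (the `step` and `reward` field names are our assumption about the schema):

```python
# Sketch: custom reward curve straight from train_metrics.jsonl.
import json
import matplotlib.pyplot as plt

records = [json.loads(line) for line in open("train_metrics.jsonl")]
steps = [r["step"] for r in records]      # field name: schema assumption
rewards = [r["reward"] for r in records]  # field name: schema assumption

plt.plot(steps, rewards)
plt.xlabel("GRPO step")
plt.ylabel("mean reward")
plt.savefig("plots/reward_curve_custom.png", dpi=150)
```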

---

## 7. Replay training metrics to Trackio (optional dashboard)

`training/replay_to_trackio.py` — post-hoc replay of `train_metrics.jsonl` files into the [Trackio](https://huggingface.co/spaces/rishabh16196/prompt-golf-trackio) dashboard Space. Useful for cross-run comparison (hero vs multi-step vs Qwen→Qwen control).

Edit the `RUNS` dict at the top of the script to point at your adapter repos, then:

```bash
python training/replay_to_trackio.py
```

The script logs each run as a separate Trackio experiment with full per-step metrics; the dashboard picks up new runs automatically.
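
The replay is roughly this shape. Hedged sketch using Trackio's wandb-style `init` / `log` / `finish` calls; the `RUNS` mapping and local paths here are illustrative (the real script pulls metrics from the adapter repos):

```python
# Sketch: replay per-step metrics from local JSONLs into Trackio.
import json
import trackio

RUNS = {  # run name -> local metrics file (illustrative)
    "hero": "hero/train_metrics.jsonl",
    "multistep": "multistep/train_metrics.jsonl",
}

for name, path in RUNS.items():
    trackio.init(project="prompt-golf", name=name)
    with open(path) as f:
        for line in f:
            trackio.log(json.loads(line))  # one dict of per-step metrics at a time
    trackio.finish()
```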

---

## 8. End-to-end checklist

For a clean from-scratch reproduction of the hero release:

```bash
# 0) auth
hf auth login

# 1) (optional) profile
bash training/hf_job_profile.sh

# 2) train hero
PUSH_TO_HUB=your-name/prompt-golf-hero \
bash training/hf_job_train.sh            # ≈3h

# 3) eval base + trained
ADAPTER_REPO=your-name/prompt-golf-hero \
bash training/hf_job_eval.sh both        # 2 × ≈15min

# 4) build demo CSV (pulls eval JSONLs from the repo; `hf download` prints the local path)
python training/build_before_after_csv.py \
  --base-jsonl "$(hf download your-name/prompt-golf-hero evals/eval_base.jsonl)" \
  --trained-jsonl "$(hf download your-name/prompt-golf-hero evals/eval_trained.jsonl)" \
  --out qwen_to_llama_demo.csv

# 5) (optional) Trackio replay
python training/replay_to_trackio.py
```

For the multi-step variant, swap step 2:

```bash
WARMSTART_ADAPTER=your-name/prompt-golf-hero \
PUSH_TO_HUB=your-name/prompt-golf-multistep \
bash training/hf_job_train_multistep.sh  # ≈3.5h
# then promote adapter_final/ to the repo root before eval (see §3)
```

---

## File index

| Script | Role |
|---|---|
| [`train_grpo.py`](./train_grpo.py) | Single-step TRL GRPO trainer (the hero recipe). |
| [`train_grpo_multistep.py`](./train_grpo_multistep.py) | Trajectory-level GRPO for multi-turn episodes. |
| [`eval_before_after.py`](./eval_before_after.py) | Base + trained-adapter eval harness; one JSONL per label. |
| [`profile_baseline.py`](./profile_baseline.py) | Per-task target-capability profiler (verbose-prompt accuracy on a target). |
| [`build_before_after_csv.py`](./build_before_after_csv.py) | Merges eval JSONLs + verbose prompts into the demo CSV. |
| [`make_plots.py`](./make_plots.py) | Reward / length / breakdown curves from `train_metrics.jsonl`. |
| [`replay_to_trackio.py`](./replay_to_trackio.py) | Post-hoc Trackio dashboard replay across runs. |
| [`hf_job_train.sh`](./hf_job_train.sh) | HF Jobs launcher for single-step training. |
| [`hf_job_train_multistep.sh`](./hf_job_train_multistep.sh) | HF Jobs launcher for multi-step training. |
| [`hf_job_eval.sh`](./hf_job_eval.sh) | HF Jobs launcher for base + trained eval. |
| [`hf_job_profile.sh`](./hf_job_profile.sh) | HF Jobs launcher for the capability profiler. |
| [`requirements.txt`](./requirements.txt) | Pinned deps for the training jobs (matches OpenEnv's official torch/transformers/trl). |

---

## Common pitfalls (we hit all of these)

- **HF `whoami` rate limit** blocks back-to-back job dispatches. If you hit it, wait 5–25 min before retrying; don't poll-loop.
- **Adapter pushed to a subfolder** (e.g. `adapter_final/`) breaks `PeftModel.from_pretrained()`, which expects `adapter_config.json` at the repo root. Promote the files to the root before running eval (see §3).
- **OOM on L40S during multi-step training** if you forget `--gradient-checkpointing`. Set all four memory flags from §3 from the start.
- **Stale `TASK_NAMES` constant in `inference.py`** silently drops new tasks from training. Use the lazy `_all_task_ids()` helper instead (see the sketch after this list).
- **Cold-starting multi-turn** wastes compute. Always warm-start from a single-turn hero adapter.
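
What "lazy" means in the `TASK_NAMES` fix above, as a hedged illustration (the real helper lives in `inference.py`; this registry shape is an assumption, not the actual code):

```python
# Sketch: why a module-level TASK_NAMES constant goes stale.
TASK_REGISTRY: dict[str, object] = {}  # task_id -> task definition, filled as tasks register

TASK_NAMES = list(TASK_REGISTRY)  # BAD: snapshot frozen at import time; misses later tasks

def _all_task_ids() -> list[str]:
    # GOOD: enumerate at call time, so newly registered tasks are included.
    return sorted(TASK_REGISTRY)
```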

If something else breaks, check the [`BLOG_POST.md`](../BLOG_POST.md) "Notes on training (for the curious)" section and the failure-mode bullets in "What the agent learned." Most issues we ran into are documented there.