Don Rishabh and Claude Opus 4.7 (1M context) committed on
Commit a56bede · 1 Parent(s): 3724e90

remove untested Colab notebook + link training/ folder in README

- Deleted notebooks/prompt_golf_train_minimal.ipynb: it never ran
  end-to-end on a real Colab GPU, and the risk of judges hitting
  confusing errors outweighed the value of keeping the file in the repo
- Stripped notebook references from README (Links section, Files
tree) and BLOG_POST.md (TL;DR + Try-it-yourself section)
- Added a top-level "Training pipeline" link in the README Links
section pointing at github.com/.../tree/main/training (folder view)
- Made the existing "Training pipeline (training/)" sub-section
header itself a folder link

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

README.md CHANGED
@@ -23,8 +23,8 @@ A Qwen3-1.7B agent (trained via TRL GRPO) learns to write **35-token prompts** t
 - 🎛️ **Live demo (Gradio):** https://huggingface.co/spaces/rishabh16196/prompt-golf-demo
 - 📊 **Training dashboard (Trackio):** https://huggingface.co/spaces/rishabh16196/prompt-golf-trackio
 - 🐙 **GitHub mirror:** https://github.com/rishabh16196/prompt_golf_env
+ - 🛠️ **Training pipeline:** [`training/`](https://github.com/rishabh16196/prompt_golf_env/tree/main/training) — full GRPO trainers, eval harness, profilers, HF Jobs launchers
 - 📝 **Blog post:** [`BLOG_POST.md`](./BLOG_POST.md)
- - 📓 **Colab training notebook:** [`notebooks/prompt_golf_train_minimal.ipynb`](./notebooks/prompt_golf_train_minimal.ipynb)
 
 ### Trained adapters & data
 
@@ -36,7 +36,7 @@ A Qwen3-1.7B agent (trained via TRL GRPO) learns to write **35-token prompts** t
 | [`prompt-golf-multistep-llama`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama) | Qwen→Llama multi-turn | trajectory-level GRPO adapter |
 | [`prompt-golf-llama-self`](https://huggingface.co/rishabh16196/prompt-golf-llama-self) | Llama→Llama self-improvement | adapter where Llama writes prompts for itself |
 
- ### Training pipeline (`training/`)
+ ### Training pipeline ([`training/`](https://github.com/rishabh16196/prompt_golf_env/tree/main/training))
 
 | File | Role |
 |---|---|
@@ -166,7 +166,6 @@ prompt_golf_env/
 rubrics.py # additive reward composition
 tasks.py / tasks_v2.py / tasks_tough.py / tasks_policy.py # 90-task bank
 training/ # see Links → Training pipeline
- notebooks/ # Colab smoke training
 ui/ + space-demo/ # Gradio demos
 BLOG_POST.md # writeup
 ```
notebooks/prompt_golf_train_minimal.ipynb DELETED
@@ -1,256 +0,0 @@
- {
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Prompt Golf — Minimal Training Demo\n",
- "\n",
- "Train a Qwen3-1.7B **agent** (LoRA) to write short prompts that steer a frozen **target** LLM.\n",
- "Cross-family RL on the OpenEnv Prompt Golf environment using TRL GRPO.\n",
- "\n",
- "**Hardware**\n",
- "- Recommended: L4 or A100 (Colab Pro+) — runs the headline `Qwen agent → Llama-3.2-3B target` config.\n",
- "- Free T4 (16 GB): downsize the target to `Qwen/Qwen2.5-0.5B-Instruct` so everything fits.\n",
- "\n",
- "This notebook runs a 30-step smoke training so you can verify the pipeline end-to-end on Colab in ~10 min.\n",
- "For the full 500-step training that produced the demo CSVs, use HuggingFace Jobs via `training/hf_job_train.sh`."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 1. Install dependencies\n",
- "\n",
- "Mirrors the OpenEnv-official pin set used by HF Jobs (`pytorch/2.4.0-cuda12.4` base + uv upgrade to torch ≥ 2.8 + `trl==0.22.2`)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!pip install -q -U uv\n",
- "!uv pip install --system -q \\\n",
- " \"torch>=2.8.0\" \"torchvision>=0.25.0\" \"triton>=3.4.0\" bitsandbytes \\\n",
- " \"transformers==4.56.2\" \\\n",
- " \"unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo\" \\\n",
- " \"unsloth[base] @ git+https://github.com/unslothai/unsloth\"\n",
- "!uv pip install --system --upgrade --no-deps -q \\\n",
- " \"transformers==4.56.2\" tokenizers \"trl==0.22.2\" unsloth unsloth_zoo\n",
- "!pip install -q 'openenv-core[core]>=0.2.2' 'peft>=0.13.0' 'datasets>=3.0.0' 'accelerate>=0.34.0'"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 2. Clone the env + install the package"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!rm -rf /content/prompt_golf_env\n",
- "!git clone --depth 1 https://huggingface.co/spaces/rishabh16196/prompt_golf_env /content/prompt_golf_env\n",
- "%cd /content/prompt_golf_env\n",
- "!pip install -q --no-deps -e ."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 3. Log in to HuggingFace\n",
- "\n",
- "Needed to download Qwen3-1.7B and Llama-3.2-3B-Instruct (the latter is gated — accept the license at https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct first)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from huggingface_hub import notebook_login\n",
- "notebook_login()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 4. Verify the env (mock backend, CPU-only — no model load)\n",
- "\n",
- "Quick sanity check that the env imports and resets correctly."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "os.environ['PROMPT_GOLF_TARGET_BACKEND'] = 'mock'\n",
- "os.environ['PROMPT_GOLF_JUDGE_BACKEND'] = 'mock'\n",
- "\n",
- "from prompt_golf_env.server.prompt_golf_environment import PromptGolfEnvironment, _ALL_TASKS\n",
- "from prompt_golf_env.models import GolfAction\n",
- "\n",
- "print(f'task bank: {len(_ALL_TASKS)} tasks (20 v1 + 15 v2 + 52 tough)')\n",
- "\n",
- "env = PromptGolfEnvironment()\n",
- "obs = env.reset(task='sentiment_basic', seed=0)\n",
- "print(f'\\ntask: {obs.task_id} | budget: {obs.prompt_budget_tokens} tokens')\n",
- "print(f'verbose description ({len(obs.task_description)} chars):')\n",
- "print(f' {obs.task_description[:140]}...')\n",
- "\n",
- "# Try a hand-written prompt\n",
- "result = env.step(GolfAction(prompt='Classify the sentiment as positive, negative, or neutral. Output the label only.'))\n",
- "print(f'\\nhand-written prompt: reward={result.reward:+.3f} raw={result.raw_task_score:.2f} '\n",
- " f'tokens={result.submitted_prompt_tokens} leak={result.leakage_penalty:.2f}')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 5. Mini training run (30 steps)\n",
- "\n",
- "Runs the full agent + target pipeline with a small number of steps to verify the loop works on your hardware. Defaults below are sized for **L4 (24 GB)**.\n",
- "\n",
- "**For free T4 (16 GB)**: change `--target-model` to `Qwen/Qwen2.5-0.5B-Instruct`, drop `--num-generations 4` to `2`, and skip the judge (set `PROMPT_GOLF_JUDGE_BACKEND=mock` in the cell below)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Switch off mock backends — we want real model inference now.\n",
- "del os.environ['PROMPT_GOLF_TARGET_BACKEND']\n",
- "# Keep judge on mock for the smoke run unless you have an A100; the\n",
- "# 8B 8-bit judge alone takes ~8 GB on top of agent + target.\n",
- "os.environ['PROMPT_GOLF_JUDGE_BACKEND'] = 'mock'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": "# `train_grpo.py` trains on ALL tasks in the bank by default. The\n# --held-out-tasks flag carves out a small eval split that the GRPO\n# trainer reports on each step. With max_steps=30 the loop sees a\n# tiny fraction of the bank — purpose here is to verify the pipeline\n# runs on your hardware, not to converge.\n!python -u training/train_grpo.py \\\n --agent-model Qwen/Qwen3-1.7B \\\n --target-model meta-llama/Llama-3.2-3B-Instruct \\\n --max-steps 30 \\\n --num-generations 4 \\\n --per-device-batch-size 2 \\\n --gradient-accumulation-steps 2 \\\n --seeds-per-task 2 \\\n --learning-rate 5e-6 \\\n --beta 0.04 \\\n --enable-thinking \\\n --max-completion-length 768 \\\n --output-dir /content/outputs/grpo_demo"
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 6. Inspect training metrics"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import json\n",
- "import matplotlib.pyplot as plt\n",
- "\n",
- "metrics_path = '/content/outputs/grpo_demo/train_metrics.jsonl'\n",
- "rows = [json.loads(l) for l in open(metrics_path)]\n",
- "print(f'{len(rows)} steps logged')\n",
- "\n",
- "fig, axes = plt.subplots(1, 2, figsize=(11, 3.5))\n",
- "steps = [r['step'] for r in rows]\n",
- "axes[0].plot(steps, [r['reward'] for r in rows], color='#1f77b4')\n",
- "axes[0].axhline(0, color='gray', lw=0.5)\n",
- "axes[0].set_title('reward per step'); axes[0].set_xlabel('step'); axes[0].grid(alpha=0.3)\n",
- "axes[1].plot(steps, [r.get('avg_tokens', 0) for r in rows], color='#ff7f0e')\n",
- "axes[1].set_title('avg prompt tokens per step'); axes[1].set_xlabel('step'); axes[1].grid(alpha=0.3)\n",
- "plt.tight_layout(); plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 7. Eval the trained adapter on a few tasks\n",
- "\n",
- "Loads the LoRA adapter you just trained and prints what it now writes for each task vs the verbose hand-written description."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!python -u training/eval_before_after.py \\\n",
- " --agent-model Qwen/Qwen3-1.7B \\\n",
- " --adapter /content/outputs/grpo_demo/adapter_final \\\n",
- " --target-model meta-llama/Llama-3.2-3B-Instruct \\\n",
- " --label trained \\\n",
- " --tasks tough_fallacy_classify,sentiment_basic,ner_people,format_uppercase \\\n",
- " --output-json /content/outputs/eval_trained.jsonl \\\n",
- " --enable-thinking"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "rows = [json.loads(l) for l in open('/content/outputs/eval_trained.jsonl')]\n",
- "for r in rows:\n",
- " print(f\"\\n[{r['task_id']}] reward={r['reward']:+.3f} raw={r['raw_task_score']:.2f} tokens={r['tokens']}\")\n",
- " print(f\" trained agent's prompt: {r['agent_prompt'][:140]!r}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## What's next\n",
- "\n",
- "This notebook ran 30 steps on 4 tasks — enough to verify the pipeline. The adapter checkpoints used in the demo CSVs were produced by 500-step runs over all 87 tasks, which take ~3-4h on L40S.\n",
- "\n",
- "**To reproduce the full results:**\n",
- "1. `bash training/hf_job_train.sh` — same-family Qwen→Qwen baseline (single-turn)\n",
- "2. `ENABLE_THINKING=true PUSH_TO_HUB=rishabh16196/prompt-golf-qwen-to-llama bash training/hf_job_train.sh` — cross-family Qwen→Llama (the hero run)\n",
- "3. `bash training/hf_job_train_multistep.sh` — multi-turn trajectory-level GRPO (warm-started from #2)\n",
- "4. `bash training/hf_job_eval.sh both` — base + trained eval on either adapter\n",
- "5. `python training/build_before_after_csv.py ...` — merge eval JSONLs into the demo CSV\n",
- "\n",
- "Existing artifacts:\n",
- "- Qwen→Qwen demo CSV: https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b/blob/main/evals/qwen_to_qwen_demo.csv\n",
- "- Capability profiles (per task): https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b/tree/main/profiles\n",
- "- Plots: https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b/tree/main/plots"
- ]
- }
- ],
- "metadata": {
- "accelerator": "GPU",
- "colab": {
- "provenance": [],
- "gpuType": "L4"
- },
- "kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
- },
- "language_info": {
- "name": "python"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 0
- }
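
The mock-backend sanity check from section 4 of the deleted notebook is still handy for quickly verifying an install. A minimal standalone sketch, assuming `prompt_golf_env` is installed per the README (`pip install -e .`); the imports and attribute names below are copied from the deleted notebook, not re-verified against the current codebase:

```python
import os

# Mock backends: no model weights are loaded, so this runs CPU-only.
os.environ["PROMPT_GOLF_TARGET_BACKEND"] = "mock"
os.environ["PROMPT_GOLF_JUDGE_BACKEND"] = "mock"

from prompt_golf_env.server.prompt_golf_environment import PromptGolfEnvironment
from prompt_golf_env.models import GolfAction

env = PromptGolfEnvironment()
obs = env.reset(task="sentiment_basic", seed=0)
print(f"task: {obs.task_id} | budget: {obs.prompt_budget_tokens} tokens")

# Score one hand-written prompt against the mock target.
result = env.step(GolfAction(
    prompt="Classify the sentiment as positive, negative, or neutral. Output the label only."
))
print(f"reward={result.reward:+.3f} tokens={result.submitted_prompt_tokens}")
```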