Don Rishabh and Claude Opus 4.7 (1M context) committed on
Commit c3e14ba · 1 Parent(s): f712ee4

README + BLOG: drop multi-step + Llama-self mentions (in-progress runs)


Both runs are still training; we're not surfacing them in public docs
until they land and we have results worth pointing at.

Removed from README:
- prompt-golf-multistep-llama and prompt-golf-llama-self rows in
the trained-adapter table
- train_grpo_multistep.py + hf_job_train_multistep.sh entries in
the training pipeline file table
- "single + multi-turn" annotation on prompt_golf_environment.py
in the Files tree

Removed from BLOG_POST.md:
- "Multi-turn is a research hop" bullet in the lessons section
- "What's next" multi-turn paragraph

Code stays in the repo (training/train_grpo_multistep.py and
hf_job_train_multistep.sh) — the runs continue; we just won't
publicize them until results land.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (2)
  1. BLOG_POST.md +1 -5
  2. README.md +2 -5
BLOG_POST.md CHANGED

@@ -183,8 +183,6 @@ That's **37× compression** on the prompt that's most expensive to ship. The tra
 - **Thinking mode doesn't help GRPO at this scale.** Implicit credit assignment between `<think>` and final tokens is too weak for the agent to exploit. Don't pay for the slowdown unless you have stronger trainer signal.
 - **Cross-family is harder, but better.** Same-family (Qwen→Qwen) gives you a self-distillation problem; cross-family forces the agent to learn target-specific quirks. Llama-3.2-3B turned out to be far more cooperative on strict-format tasks than Qwen3-1.7B (67/87 solvable vs 19/87 with verbose prompts), which moved the dial dramatically on what training could even *attempt*.
 - **Profile before you train.** Running the target on the verbose description of every task ahead of time tells you the headroom: tasks where `description_baseline ≈ 0` will produce zero gradient (no group variance in GRPO) and just dilute the budget.
-- **Multi-turn is a research hop, not a single sprint.** TRL's `GRPOTrainer` is single-step by design. Multi-turn requires a hand-rolled trajectory-level GRPO loop (we wrote one mirroring the recipe from the [Spaces Pipeline env](https://huggingface.co/spaces/rishabh16196/spaces_pipeline_env)). It works, but the training cost scales with `turn_limit`. Worth it only when you have evidence the policy is bottlenecked by single-turn information.
-
 ---
 
 ## Try it yourself
@@ -214,9 +212,7 @@ PUSH_TO_HUB=your-user/your-repo bash training/hf_job_train.sh
 
 ## What's next
 
-We have the trajectory-level GRPO trainer working but haven't yet trained a multi-turn cross-family adapter at scale — that's the next experiment. The hypothesis: when the agent can see its own prompt's outputs across 2-3 turns, it should be able to *debug* its prompt instead of writing it cold. Whether that beats single-turn at equal compute is the open question.
-
-Other directions we'd love community help on:
+Directions we'd love community help on:
 
 - **More targets.** Right now we have Qwen3-1.7B and Llama-3.2-3B profiled. Phi-3, Mistral, Gemma 2 — what does the per-target prompt look like? Is the trained agent's prompt portable?
 - **Larger task banks.** 90 hand-crafted tasks is a starting point. Procedural task generation (e.g. random regex format constraints) would scale this dramatically.
README.md CHANGED

@@ -33,20 +33,17 @@ A Qwen3-1.7B agent (trained via TRL GRPO) learns to write **35-token prompts** t
 | [`prompt-golf-qwen-to-llama-nothink`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink) | Qwen→Llama, thinking=OFF (**hero**) | adapter + plots + train_metrics + base/trained eval JSONLs + demo CSV |
 | [`prompt-golf-qwen-to-llama`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama) | Qwen→Llama, thinking=ON | same artifacts (A/B variant) |
 | [`prompt-golf-grpo-1.5b`](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b) | Qwen→Qwen same-family (control) | same artifacts |
-| [`prompt-golf-multistep-llama`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama) | Qwen→Llama multi-turn | trajectory-level GRPO adapter |
-| [`prompt-golf-llama-self`](https://huggingface.co/rishabh16196/prompt-golf-llama-self) | Llama→Llama self-improvement | adapter where Llama writes prompts for itself |
 
 ### Training pipeline ([`training/`](https://github.com/rishabh16196/prompt_golf_env/tree/main/training))
 
 | File | Role |
 |---|---|
 | [`training/train_grpo.py`](./training/train_grpo.py) | TRL GRPO single-step trainer |
-| [`training/train_grpo_multistep.py`](./training/train_grpo_multistep.py) | hand-rolled trajectory-level GRPO (multi-turn) |
 | [`training/eval_before_after.py`](./training/eval_before_after.py) | base + trained-adapter eval harness |
 | [`training/profile_baseline.py`](./training/profile_baseline.py) | per-task target-capability profiler |
 | [`training/build_before_after_csv.py`](./training/build_before_after_csv.py) | merge eval JSONLs into the demo CSV |
 | [`training/replay_to_trackio.py`](./training/replay_to_trackio.py) | post-hoc replay of `train_metrics.jsonl` into the Trackio dashboard Space |
-| [`training/hf_job_train.sh`](./training/hf_job_train.sh) / [`hf_job_train_multistep.sh`](./training/hf_job_train_multistep.sh) / [`hf_job_eval.sh`](./training/hf_job_eval.sh) / [`hf_job_profile.sh`](./training/hf_job_profile.sh) | HuggingFace Jobs launchers |
+| [`training/hf_job_train.sh`](./training/hf_job_train.sh) / [`hf_job_eval.sh`](./training/hf_job_eval.sh) / [`hf_job_profile.sh`](./training/hf_job_profile.sh) | HuggingFace Jobs launchers |
 
 ---
 
@@ -159,7 +156,7 @@ prompt_golf_env/
   Dockerfile # HF Spaces image
   server/
     app.py # FastAPI app
-    prompt_golf_environment.py # core Env (single + multi-turn)
+    prompt_golf_environment.py # core Env
     target_model.py # frozen-target wrapper (HF + mock)
     judge.py # Qwen3-8B 8-bit judge
     scorer.py # 21+ scorers (structural + LLM judge)