Don Rishabh and Claude Opus 4.7 (1M context) committed on
Commit c3e14ba · 1 Parent(s): f712ee4

README + BLOG: drop multi-step + Llama-self mentions (in-progress runs)


Both runs are still training; we're not surfacing them in public docs
until they land and we have results worth pointing at.

Removed from README:
- prompt-golf-multistep-llama and prompt-golf-llama-self rows in
the trained-adapter table
- train_grpo_multistep.py + hf_job_train_multistep.sh entries in
the training pipeline file table
- "single + multi-turn" annotation on prompt_golf_environment.py
in the Files tree

Removed from BLOG_POST.md:
- "Multi-turn is a research hop" bullet in the lessons section
- "What's next" multi-turn paragraph

Code stays in the repo (training/train_grpo_multistep.py and
hf_job_train_multistep.sh) — the runs continue; we just won't
publicize them until results land.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (2)
  1. BLOG_POST.md +1 -5
  2. README.md +2 -5
BLOG_POST.md CHANGED

@@ -183,8 +183,6 @@ That's **37× compression** on the prompt that's most expensive to ship. The tra
 - **Thinking mode doesn't help GRPO at this scale.** Implicit credit assignment between `<think>` and final tokens is too weak for the agent to exploit. Don't pay for the slowdown unless you have stronger trainer signal.
 - **Cross-family is harder, but better.** Same-family (Qwen→Qwen) gives you a self-distillation problem; cross-family forces the agent to learn target-specific quirks. Llama-3.2-3B turned out to be far more cooperative on strict-format tasks than Qwen3-1.7B (67/87 solvable vs 19/87 with verbose prompts), which moved the dial dramatically on what training could even *attempt*.
 - **Profile before you train.** Running the target on the verbose description of every task ahead of time tells you the headroom: tasks where `description_baseline ≈ 0` will produce zero gradient (no group variance in GRPO) and just dilute the budget.
-- **Multi-turn is a research hop, not a single sprint.** TRL's `GRPOTrainer` is single-step by design. Multi-turn requires a hand-rolled trajectory-level GRPO loop (we wrote one mirroring the recipe from the [Spaces Pipeline env](https://huggingface.co/spaces/rishabh16196/spaces_pipeline_env)). It works, but the training cost scales with `turn_limit`. Worth it only when you have evidence the policy is bottlenecked by single-turn information.
-
 ---
 
 ## Try it yourself
@@ -214,9 +212,7 @@ PUSH_TO_HUB=your-user/your-repo bash training/hf_job_train.sh
 
 ## What's next
 
-We have the trajectory-level GRPO trainer working but haven't yet trained a multi-turn cross-family adapter at scale — that's the next experiment. The hypothesis: when the agent can see its own prompt's outputs across 2-3 turns, it should be able to *debug* its prompt instead of writing it cold. Whether that beats single-turn at equal compute is the open question.
-
-Other directions we'd love community help on:
+Directions we'd love community help on:
 
 - **More targets.** Right now we have Qwen3-1.7B and Llama-3.2-3B profiled. Phi-3, Mistral, Gemma 2 — what does the per-target prompt look like? Is the trained agent's prompt portable?
 - **Larger task banks.** 90 hand-crafted tasks is a starting point. Procedural task generation (e.g. random regex format constraints) would scale this dramatically.
README.md CHANGED

@@ -33,20 +33,17 @@ A Qwen3-1.7B agent (trained via TRL GRPO) learns to write **35-token prompts** t
 | [`prompt-golf-qwen-to-llama-nothink`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink) | Qwen→Llama, thinking=OFF (**hero**) | adapter + plots + train_metrics + base/trained eval JSONLs + demo CSV |
 | [`prompt-golf-qwen-to-llama`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama) | Qwen→Llama, thinking=ON | same artifacts (A/B variant) |
 | [`prompt-golf-grpo-1.5b`](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b) | Qwen→Qwen same-family (control) | same artifacts |
-| [`prompt-golf-multistep-llama`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama) | Qwen→Llama multi-turn | trajectory-level GRPO adapter |
-| [`prompt-golf-llama-self`](https://huggingface.co/rishabh16196/prompt-golf-llama-self) | Llama→Llama self-improvement | adapter where Llama writes prompts for itself |
 
 ### Training pipeline ([`training/`](https://github.com/rishabh16196/prompt_golf_env/tree/main/training))
 
 | File | Role |
 |---|---|
 | [`training/train_grpo.py`](./training/train_grpo.py) | TRL GRPO single-step trainer |
-| [`training/train_grpo_multistep.py`](./training/train_grpo_multistep.py) | hand-rolled trajectory-level GRPO (multi-turn) |
 | [`training/eval_before_after.py`](./training/eval_before_after.py) | base + trained-adapter eval harness |
 | [`training/profile_baseline.py`](./training/profile_baseline.py) | per-task target-capability profiler |
 | [`training/build_before_after_csv.py`](./training/build_before_after_csv.py) | merge eval JSONLs into the demo CSV |
 | [`training/replay_to_trackio.py`](./training/replay_to_trackio.py) | post-hoc replay of `train_metrics.jsonl` into the Trackio dashboard Space |
-| [`training/hf_job_train.sh`](./training/hf_job_train.sh) / [`hf_job_train_multistep.sh`](./training/hf_job_train_multistep.sh) / [`hf_job_eval.sh`](./training/hf_job_eval.sh) / [`hf_job_profile.sh`](./training/hf_job_profile.sh) | HuggingFace Jobs launchers |
+| [`training/hf_job_train.sh`](./training/hf_job_train.sh) / [`hf_job_eval.sh`](./training/hf_job_eval.sh) / [`hf_job_profile.sh`](./training/hf_job_profile.sh) | HuggingFace Jobs launchers |
 
 ---
 
@@ -159,7 +156,7 @@ prompt_golf_env/
   Dockerfile # HF Spaces image
   server/
     app.py # FastAPI app
-    prompt_golf_environment.py # core Env (single + multi-turn)
+    prompt_golf_environment.py # core Env
     target_model.py # frozen-target wrapper (HF + mock)
     judge.py # Qwen3-8B 8-bit judge
     scorer.py # 21+ scorers (structural + LLM judge)