README + BLOG: drop multi-step + Llama-self mentions (in-progress runs)
Both runs are still training; not surfacing them in public docs
until they land + we have results worth pointing at.
Removed from README:
- prompt-golf-multistep-llama and prompt-golf-llama-self rows in
the trained-adapter table
- train_grpo_multistep.py + hf_job_train_multistep.sh entries in
the training pipeline file table
- "single + multi-turn" annotation on prompt_golf_environment.py
in the Files tree
Removed from BLOG_POST.md:
- "Multi-turn is a research hop" bullet in the lessons section
- "What's next" multi-turn paragraph
Code stays in the repo (training/train_grpo_multistep.py and
hf_job_train_multistep.sh) — runs continue, we just don't
publicize until results land.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Files changed:
- BLOG_POST.md +1 -5
- README.md +2 -5

BLOG_POST.md

@@ -183,8 +183,6 @@ That's **37× compression** on the prompt that's most expensive to ship. The tra
 - **Thinking mode doesn't help GRPO at this scale.** Implicit credit assignment between `<think>` and final tokens is too weak for the agent to exploit. Don't pay for the slowdown unless you have stronger trainer signal.
 - **Cross-family is harder, but better.** Same-family (Qwen→Qwen) gives you a self-distillation problem; cross-family forces the agent to learn target-specific quirks. Llama-3.2-3B turned out to be far more cooperative on strict-format tasks than Qwen3-1.7B (67/87 solvable vs 19/87 with verbose prompts), which moved the dial dramatically on what training could even *attempt*.
 - **Profile before you train.** Running the target on the verbose description of every task ahead of time tells you the headroom: tasks where `description_baseline ≈ 0` will produce zero gradient (no group variance in GRPO) and just dilute the budget.
-- **Multi-turn is a research hop, not a single sprint.** TRL's `GRPOTrainer` is single-step by design. Multi-turn requires a hand-rolled trajectory-level GRPO loop (we wrote one mirroring the recipe from the [Spaces Pipeline env](https://huggingface.co/spaces/rishabh16196/spaces_pipeline_env)). It works, but the training cost scales with `turn_limit`. Worth it only when you have evidence the policy is bottlenecked by single-turn information.
-
 ---
 
 ## Try it yourself

@@ -214,9 +212,7 @@ PUSH_TO_HUB=your-user/your-repo bash training/hf_job_train.sh
 
 ## What's next
 
-
-
-Other directions we'd love community help on:
+Directions we'd love community help on:
 
 - **More targets.** Right now we have Qwen3-1.7B and Llama-3.2-3B profiled. Phi-3, Mistral, Gemma 2 — what does the per-target prompt look like? Is the trained agent's prompt portable?
 - **Larger task banks.** 90 hand-crafted tasks is a starting point. Procedural task generation (e.g. random regex format constraints) would scale this dramatically.
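
Note on the "Profile before you train" bullet kept above: the zero-gradient claim is just GRPO's group-normalized advantage going flat when every completion in a sampling group gets the same reward. A minimal sketch of that arithmetic, illustrative only (not the code in training/train_grpo.py; the reward values are made up):

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each completion's reward is normalized
    against the mean/std of its own sampling group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# A task the target can't solve even from the verbose description
# (description_baseline ≈ 0): every sampled prompt scores ~0, the group
# has no variance, and the advantages -- hence the policy gradient --
# are exactly zero.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # [0. 0. 0. 0.]

# A task with headroom: spread inside the group gives a learning signal.
print(group_advantages([0.2, 0.9, 0.4, 0.7]))  # zero-mean, nonzero
```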
README.md

@@ -33,20 +33,17 @@ A Qwen3-1.7B agent (trained via TRL GRPO) learns to write **35-token prompts** t
 | [`prompt-golf-qwen-to-llama-nothink`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink) | Qwen→Llama, thinking=OFF (**hero**) | adapter + plots + train_metrics + base/trained eval JSONLs + demo CSV |
 | [`prompt-golf-qwen-to-llama`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama) | Qwen→Llama, thinking=ON | same artifacts (A/B variant) |
 | [`prompt-golf-grpo-1.5b`](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b) | Qwen→Qwen same-family (control) | same artifacts |
-| [`prompt-golf-multistep-llama`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama) | Qwen→Llama multi-turn | trajectory-level GRPO adapter |
-| [`prompt-golf-llama-self`](https://huggingface.co/rishabh16196/prompt-golf-llama-self) | Llama→Llama self-improvement | adapter where Llama writes prompts for itself |
 
 ### Training pipeline ([`training/`](https://github.com/rishabh16196/prompt_golf_env/tree/main/training))
 
 | File | Role |
 |---|---|
 | [`training/train_grpo.py`](./training/train_grpo.py) | TRL GRPO single-step trainer |
-| [`training/train_grpo_multistep.py`](./training/train_grpo_multistep.py) | hand-rolled trajectory-level GRPO (multi-turn) |
 | [`training/eval_before_after.py`](./training/eval_before_after.py) | base + trained-adapter eval harness |
 | [`training/profile_baseline.py`](./training/profile_baseline.py) | per-task target-capability profiler |
 | [`training/build_before_after_csv.py`](./training/build_before_after_csv.py) | merge eval JSONLs into the demo CSV |
 | [`training/replay_to_trackio.py`](./training/replay_to_trackio.py) | post-hoc replay of `train_metrics.jsonl` into the Trackio dashboard Space |
-| [`training/hf_job_train.sh`](./training/hf_job_train.sh) / [`
+| [`training/hf_job_train.sh`](./training/hf_job_train.sh) / [`hf_job_eval.sh`](./training/hf_job_eval.sh) / [`hf_job_profile.sh`](./training/hf_job_profile.sh) | HuggingFace Jobs launchers |
 
 ---
 

@@ -159,7 +156,7 @@ prompt_golf_env/
 Dockerfile # HF Spaces image
 server/
 app.py # FastAPI app
-prompt_golf_environment.py # core Env (single + multi-turn)
+prompt_golf_environment.py # core Env
 target_model.py # frozen-target wrapper (HF + mock)
 judge.py # Qwen3-8B 8-bit judge
 scorer.py # 21+ scorers (structural + LLM judge)
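
For readers wondering what the still-running multi-turn jobs (training/train_grpo_multistep.py) do at a high level: the blog bullet removed above describes a hand-rolled trajectory-level GRPO loop whose cost grows with `turn_limit`. The sketch below is purely illustrative, with mock classes and made-up scores; it is not the repo's implementation, only the loop shape implied by "trajectory-level":

```python
import random

class MockEnv:
    """Stand-in for the prompt-golf environment: grades a short prompt
    against a task. Illustrative only, not server/prompt_golf_environment.py."""
    def reset(self, task):
        self.task = task
        return f"task: {task}"                 # observation handed to the agent

    def step(self, prompt):
        score = random.random()                # pretend scorer output in [0, 1]
        done = score > 0.9                     # stop early once the prompt is good enough
        return f"last score: {score:.2f}", score, done

def rollout(propose_prompt, env, task, turn_limit=3):
    """One multi-turn episode. Only the final score is returned, so the
    reward (and GRPO advantage) attaches to the whole trajectory, and
    rollout cost scales linearly with turn_limit."""
    obs, score = env.reset(task), 0.0
    for _ in range(turn_limit):
        prompt = propose_prompt(obs)           # policy proposes a revised prompt
        obs, score, done = env.step(prompt)
        if done:
            break
    return score

env = MockEnv()
group = [rollout(lambda o: "answer in JSON only", env, "json_extract") for _ in range(4)]
print(group)  # G trajectory-level rewards; GRPO normalizes these per group as usual
```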