docs: add multi-step variant section to README + BLOG_POST
Multi-step training (Qwen→Llama, 3-turn episodes, 150 steps × 8
trajectories × 3 turns, trajectory-level GRPO) landed and evaluated:
- Trained accuracy 0.576 (vs single-step hero 0.523, +5.3pp)
- Trained reward +0.440 (vs +0.426, slight gain)
- Mean tokens 43.7 (vs 35, +25%)
- Wins on 29/90 tasks vs untrained baseline
Conditional improvement, not a strict upgrade:
- Wins concentrated in reasoning_tough (5/0), classification_tough
(7/2), policy_compression (1/3 — content_moderation 0→0.67)
- Loses on format-strict tasks (2/5) and dead-target tasks (burns
tokens trying)
- Single-step remains the cost-sensitive default; multi-step is the
right pick when accuracy ceilings matter more than tokens.
Adapter: rishabh16196/prompt-golf-multistep-llama
Eval: evals/eval_base.jsonl + evals/eval_trained.jsonl
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- BLOG_POST.md +37 -1
- README.md +24 -0
BLOG_POST.md

```diff
@@ -288,11 +288,47 @@ This looks like a "win" for Qwen→Qwen on win-rate, but the framing is misleading
 
 *Identical training setup with Qwen3's `<think>...</think>` chat template enabled. Trajectory is similar in shape but absolute reward plateaus lower — the extra tokens spent inside `<think>` cost more than the accuracy gain they buy.*
 
-All
+All trained adapters are public, with their own demo CSVs:
 
 - 🥇 **[`prompt-golf-qwen-to-llama-nothink`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink)** (thinking=OFF, hero) — [demo CSV](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/blob/main/evals/qwen_to_llama_demo.csv)
 - 🅰️ **[`prompt-golf-qwen-to-llama`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama)** (thinking=ON variant) — [demo CSV](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama/blob/main/evals/qwen_to_llama_demo.csv)
 - 🎛️ **[`prompt-golf-grpo-1.5b`](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b)** (Qwen→Qwen control) — [demo CSV](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b/blob/main/evals/qwen_to_qwen_demo.csv)
+- 🔁 **[`prompt-golf-multistep-llama`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama)** (multi-turn, 3 turns/episode) — [base eval](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama/blob/main/evals/eval_base.jsonl) · [trained eval](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama/blob/main/evals/eval_trained.jsonl)
+
+### The multi-turn variant: same agent, 3 turns to get it right
+
+The hero is single-step — the agent writes one prompt and the episode ends. Multi-turn loosens that. We ran a 3-turn variant where the agent sees its prior prompts plus per-example feedback on a *separate* feedback slice (2 examples), then the **final-turn** prompt is judged on a held-out 4-example scoring slice. Same task bank. Same target. Trained with a hand-rolled trajectory-level GRPO (`train_grpo_multistep.py`) for 150 steps × 8 trajectories × 3 turns.
+
+| | Single-step hero | Multi-step (3 turns) |
+|---|---|---|
+| Trained accuracy | 0.523 | **0.576** |
+| Trained reward | +0.426 | **+0.440** |
+| Mean tokens | **35** | 43.7 |
+| Trained beats untrained on | — | 29 / 90 tasks |
+
+Five points of accuracy and a slight reward gain, paid for with ~25% more tokens. But the average hides the interesting structure — **multi-step is a conditional improvement, not a strict upgrade**:
+
+| Category | Multi-step wins | Multi-step losses |
+|---|---|---|
+| `reasoning_tough` | **5 / 10** | 0 |
+| `classification_tough` | **7 / 10** | 2 |
+| `extraction_tough` | 4 / 10 | 4 (mixed) |
+| `policy_compression` | 1 / 3 (`policy_content_moderation` 0.00 → **0.67**) | 1 (`policy_finreg_communication_review` 17 → 112 tokens, still 0.00) |
+| `format` | 2 / 8 | **5 / 8** |
+
+Top single-task wins:
+- `tough_policy_stance`: 0.00 → **1.00** (+0.98 reward)
+- `tough_syllogism_check`: 0.00 → **1.00** (+0.97)
+- `sentiment_basic`: 0.00 → **1.00** (+0.93)
+- `json_key_ordering`: 0.11 → **1.00** (+0.89)
+- `tough_contract_obligation`: 0.06 → **0.89** (+0.69; 43 → 117 tokens)
+- `policy_content_moderation`: 0.00 → **0.67** — multi-step *unlocked* a previously dead policy task by using turns 2 and 3 to refine the categorization prompt
+
+Where it hurts: format-strict tasks where short single-shot prompts already win. `format_uppercase` goes 9 → 24 tokens with no accuracy gain. `format_three_bullets` regresses 1.00 → 0.92. The agent uses its turn-2/turn-3 budget to bloat prompts that didn't need bloating.
+
+**The intuition:** multi-turn relaxes length pressure because the agent has room to debug across turns. That helps tasks where the agent needs reasoning room (tough reasoning, tough classification, complex extraction, policy-style compression). It hurts tasks where the optimal answer is "just say the right 8 words." We predicted this exact split in the v3 design plan before the run completed — a clean confirmation that multi-turn isn't a free upgrade; it's a different operating point on the accuracy/length curve.
+
+**Operationally:** single-step is still the better default for inference-cost-sensitive deployments. Multi-step is the right pick when accuracy ceilings matter more than token count — and especially when individual tasks reward step-by-step refinement.
 
 ---
 
```
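The trainer itself is not part of this diff, so here is a minimal sketch of what "trajectory-level GRPO" means for the 150 steps × 8 trajectories × 3 turns setup above. Every name below is illustrative; only `train_grpo_multistep.py` in the repo is authoritative.

```python
import torch

# Illustrative sketch of trajectory-level GRPO credit assignment; these
# names are NOT the actual train_grpo_multistep.py API.

def trajectory_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: shape (8,), one scalar per 3-turn trajectory, computed from
    the final-turn prompt only. Normalized within the group of sibling
    trajectories, the standard GRPO baseline."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(turn_logps: list[torch.Tensor], rewards: torch.Tensor) -> torch.Tensor:
    """turn_logps[i]: per-token log-probs of all three agent turns in
    trajectory i, concatenated. Every token in a trajectory shares that
    trajectory's single advantage, which is what makes this
    trajectory-level rather than per-turn credit assignment."""
    adv = trajectory_advantages(rewards)
    losses = [-(a * lp).mean() for a, lp in zip(adv, turn_logps)]
    return torch.stack(losses).mean()
```

The design choice worth noting: the baseline comes from the 8 sibling trajectories sampled for the same task, so no learned value function is needed, and the one final-turn reward is shared by every token the agent emitted across all three turns.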
README.md

```diff
@@ -33,6 +33,7 @@ A Qwen3-1.7B agent (trained via TRL GRPO) learns to write **35-token prompts** t
 | [`prompt-golf-qwen-to-llama-nothink`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink) | Qwen→Llama, thinking=OFF (**hero**) | adapter + plots + train_metrics + base/trained eval JSONLs + demo CSV |
 | [`prompt-golf-qwen-to-llama`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama) | Qwen→Llama, thinking=ON | same artifacts (A/B variant) |
 | [`prompt-golf-grpo-1.5b`](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b) | Qwen→Qwen same-family (control) | same artifacts |
+| [`prompt-golf-multistep-llama`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama) | Qwen→Llama, **multi-turn (3 turns)** | adapter + train_metrics + base/trained eval JSONLs |
 
 ### Training pipeline ([`training/`](https://github.com/rishabh16196/prompt_golf_env/tree/main/training))
 
@@ -138,6 +139,29 @@ OFF wins on reward and compression; ON wins on accuracy by 1.6pp at a 30% token
 
 📊 **Demo CSV:** [`evals/qwen_to_llama_demo.csv`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama/blob/main/evals/qwen_to_llama_demo.csv)
 
+### Multi-turn variant: 3 turns per episode
+
+Same agent + target as the hero, but each episode runs `turn_limit=3` — the agent sees its prior prompts + per-example feedback on a 2-example feedback slice, and only the final-turn prompt is judged on a held-out 4-example scoring slice. Trained with the trajectory-level GRPO trainer (`train_grpo_multistep.py`), 150 steps × 8 trajectories × 3 turns.
+
+| | Single-step hero | Multi-step (3 turns) |
+|---|---|---|
+| Trained accuracy | 0.523 | **0.576** |
+| Trained reward | +0.426 | **+0.440** |
+| Mean tokens | **35** | 43.7 |
+| Trained beats untrained on | — | 29 / 90 tasks |
+
+**Multi-step is a conditional improvement, not a strict upgrade:**
+
+| Where it wins | Where it loses |
+|---|---|
+| `reasoning_tough` (5 wins / 0 losses) | `format` (2 wins / 5 losses) |
+| `classification_tough` (7 / 2) | `format_three_bullets`, `format_uppercase` (bloat with no accuracy gain) |
+| `policy_compression` (1 win — `policy_content_moderation` 0.00 → 0.67) | Dead-target tasks (agent burns tokens trying anyway, e.g. `policy_finreg_communication_review` 17→112 tokens, both 0.00) |
+
+The intuition: multi-turn relaxes length pressure because the agent has room to debug across turns. That helps tasks where the agent needs reasoning room (tough reasoning, tough classification, complex extraction). It hurts tasks where short single-shot prompts already win (format-strict tasks). **Single-step is the right default for cost-sensitive deployments; multi-step is the right pick when accuracy ceilings matter more than token count.**
+
+📊 **Eval JSONLs:** [`evals/eval_base.jsonl`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama/blob/main/evals/eval_base.jsonl) + [`evals/eval_trained.jsonl`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama/blob/main/evals/eval_trained.jsonl)
+
 ---
 
 ## Quick start
```
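To make the episode mechanics in the README section concrete, here is a hedged sketch of the 3-turn loop. `Task`, `score_one`, `length_penalty`, and `agent.write_prompt` are hypothetical stand-ins, not the actual `prompt_golf_env` API, and the accuracy-minus-length shape of the reward is an assumption.

```python
# Hedged sketch of one 3-turn episode as described above. All names are
# hypothetical stand-ins, not the real prompt_golf_env API.
from dataclasses import dataclass

@dataclass
class Task:
    description: str
    examples: list  # (input, expected_output) pairs; assumes >= 6 per task

def score_one(target, prompt: str, example) -> float:
    """Stand-in: run the frozen target on prompt + example input and
    return accuracy for that single example."""
    raise NotImplementedError

def length_penalty(prompt: str) -> float:
    """Stand-in for the token-length term of the reward."""
    raise NotImplementedError

def run_episode(agent, target, task: Task, turn_limit: int = 3) -> float:
    feedback_slice = task.examples[:2]  # 2-example feedback slice, shown every turn
    scoring_slice = task.examples[2:6]  # held-out 4-example scoring slice
    history = []                        # (prompt, per-example feedback) pairs
    for _ in range(turn_limit):
        # The agent conditions on its own prior prompts plus feedback-slice
        # results; it never observes the scoring slice.
        prompt = agent.write_prompt(task.description, history)
        history.append((prompt, [score_one(target, prompt, ex) for ex in feedback_slice]))
    # Only the final-turn prompt is judged, so the whole trajectory earns a
    # single scalar reward (assumed here to be accuracy minus a length term).
    final_prompt = history[-1][0]
    acc = sum(score_one(target, final_prompt, ex) for ex in scoring_slice) / len(scoring_slice)
    return acc - length_penalty(final_prompt)
```

Scoring only the final turn is what lets intermediate turns act as free "debugging" moves, which is consistent with the win/loss split the tables report: tasks that need refinement gain, while format-strict tasks pay for prompt bloat.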