Don Rishabh (Claude Opus 4.7, 1M context) committed
Commit e1e3cbe · 1 parent: e1f0c20

docs: add multi-step variant section to README + BLOG_POST


Multi-step training (Qwen→Llama, 3-turn episodes, 150 steps × 8
trajectories × 3 turns, trajectory-level GRPO) landed and evaluated:
- Trained accuracy 0.576 (vs single-step hero 0.523, +5.3pp)
- Trained reward +0.440 (vs +0.426, slight gain)
- Mean tokens 43.7 (vs 35, +25%)
- Wins on 29/90 tasks vs untrained baseline

Conditional improvement, not a strict upgrade:
- Wins concentrated in reasoning_tough (5/0), classification_tough
(7/2), policy_compression (1/3 — content_moderation 0→0.67)
- Loses on format-strict tasks (2/5) and dead-target tasks (burns
tokens trying)
- Single-step remains the cost-sensitive default; multi-step is the
right pick when accuracy ceilings matter more than tokens.

Adapter: rishabh16196/prompt-golf-multistep-llama
Eval: evals/eval_base.jsonl + evals/eval_trained.jsonl

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
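The commit message describes a hand-rolled trajectory-level GRPO run (150 steps × 8 trajectories × 3 turns). As a rough illustration only, not the actual code in `train_grpo_multistep.py` (which this page doesn't show), the "trajectory-level" part can be sketched as group-normalizing each trajectory's final-turn reward and broadcasting that single advantage to every turn of the trajectory:

```python
import numpy as np

def trajectory_grpo_advantages(final_rewards, n_turns=3):
    """Sketch of trajectory-level GRPO advantage assignment.

    final_rewards: one reward per trajectory in the group, taken from
    the final-turn prompt (hypothetical interface; the real trainer
    may differ). Advantages are group-normalized, and every turn in a
    trajectory shares its trajectory's advantage.
    """
    r = np.asarray(final_rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)  # normalize within the group
    # one advantage per trajectory, repeated for each of its turns
    return np.repeat(adv, n_turns)

# a group of 8 trajectories, 3 turns each -> 24 per-turn advantages
advs = trajectory_grpo_advantages([0.2, 0.8, 0.5, 0.5, 1.0, 0.0, 0.4, 0.6])
```

The key difference from per-turn credit assignment: intermediate turns get no reward of their own; they inherit the final turn's group-relative advantage, which matches "only the final-turn prompt is judged" below.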

Files changed (2)
  1. BLOG_POST.md +37 -1
  2. README.md +24 -0
BLOG_POST.md CHANGED
@@ -288,11 +288,47 @@ This looks like a "win" for Qwen→Qwen on win-rate, but the framing is misleadi

*Identical training setup with Qwen3's `<think>...</think>` chat template enabled. Trajectory is similar in shape but absolute reward plateaus lower — the extra tokens spent inside `<think>` cost more than the accuracy gain they buy.*

- All three trained adapters are public, with their own demo CSVs:
+ All trained adapters are public, with their own demo CSVs:

- 🥇 **[`prompt-golf-qwen-to-llama-nothink`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink)** (thinking=OFF, hero) — [demo CSV](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/blob/main/evals/qwen_to_llama_demo.csv)
- 🅰️ **[`prompt-golf-qwen-to-llama`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama)** (thinking=ON variant) — [demo CSV](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama/blob/main/evals/qwen_to_llama_demo.csv)
- 🎛️ **[`prompt-golf-grpo-1.5b`](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b)** (Qwen→Qwen control) — [demo CSV](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b/blob/main/evals/qwen_to_qwen_demo.csv)
+ - 🔁 **[`prompt-golf-multistep-llama`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama)** (multi-turn, 3 turns/episode) — [base eval](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama/blob/main/evals/eval_base.jsonl) · [trained eval](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama/blob/main/evals/eval_trained.jsonl)
+
+ ### The multi-turn variant: same agent, 3 turns to get it right
+
+ The hero is single-step — the agent writes one prompt and the episode ends. Multi-turn loosens that. We ran a 3-turn variant where the agent sees its prior prompts plus per-example feedback on a *separate* 2-example feedback slice; only the **final-turn** prompt is judged on a held-out 4-example scoring slice. Same task bank. Same target. Trained with a hand-rolled trajectory-level GRPO (`train_grpo_multistep.py`) for 150 steps × 8 trajectories × 3 turns.
+
+ | | Single-step hero | Multi-step (3 turns) |
+ |---|---|---|
+ | Trained accuracy | 0.523 | **0.576** |
+ | Trained reward | +0.426 | **+0.440** |
+ | Mean tokens | **35** | 43.7 |
+ | Trained beats untrained on | — | 29 / 90 tasks |
+
+ Five points of accuracy and a slight reward gain, paid for with ~25% more tokens. But the average hides the interesting structure — **multi-step is a conditional improvement, not a strict upgrade**:
+
+ | Category | Multi-step wins | Multi-step losses |
+ |---|---|---|
+ | `reasoning_tough` | **5 / 10** | 0 |
+ | `classification_tough` | **7 / 10** | 2 |
+ | `extraction_tough` | 4 / 10 | 4 (mixed) |
+ | `policy_compression` | 1 / 3 (`policy_content_moderation` 0.00 → **0.67**) | 1 (`policy_finreg_communication_review` 17 → 112 tokens, still 0.00) |
+ | `format` | 2 / 8 | **5 / 8** |
+
+ Top single-task wins:
+ - `tough_policy_stance`: 0.00 → **1.00** (+0.98 reward)
+ - `tough_syllogism_check`: 0.00 → **1.00** (+0.97)
+ - `sentiment_basic`: 0.00 → **1.00** (+0.93)
+ - `json_key_ordering`: 0.11 → **1.00** (+0.89)
+ - `tough_contract_obligation`: 0.06 → **0.89** (+0.69; 43 → 117 tokens)
+ - `policy_content_moderation`: 0.00 → **0.67** — multi-step *unlocked* a previously dead policy task by using turns 2 and 3 to refine the categorization prompt
+
+ Where it hurts: format-strict tasks where short single-shot prompts already win. `format_uppercase` goes 9 → 24 tokens with no accuracy gain. `format_three_bullets` regresses 1.00 → 0.92. The agent uses its turn-2 / turn-3 budget to bloat prompts that didn't need bloating.
+
+ **The intuition:** multi-turn relaxes length pressure because the agent has room to debug across turns. That helps tasks where the agent needs reasoning room (tough reasoning, tough classification, complex extraction, policy-style compression). It hurts tasks where the optimal answer is "just say the right 8 words." We predicted this exact split in the v3 design plan before the run completed — a clean confirmation that multi-turn isn't a free upgrade; it's a different operating point on the accuracy/length curve.
+
+ **Operationally:** single-step is still the better default for inference-cost-sensitive deployments. Multi-step is the right pick when accuracy ceilings matter more than token count — and especially when individual tasks reward step-by-step refinement.

 ---
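The 3-turn episode described above (per-example feedback on a 2-example slice, final-turn scoring on a held-out 4-example slice) can be sketched as a plain loop. All names and the task schema here are hypothetical stand-ins; the actual environment code isn't part of this diff:

```python
def run_episode(agent, target, feedback_slice, scoring_slice, turn_limit=3):
    """Multi-turn episode sketch: the agent refines its prompt using
    per-example pass/fail feedback on the feedback slice; only the
    final-turn prompt is scored, on the held-out scoring slice."""
    history = []
    for _ in range(turn_limit):
        prompt = agent(history)  # agent sees its prior prompts + feedback
        feedback = [(ex["input"], target(prompt, ex["input"]) == ex["label"])
                    for ex in feedback_slice]
        history.append({"prompt": prompt, "feedback": feedback})
    final_prompt = history[-1]["prompt"]
    accuracy = sum(target(final_prompt, ex["input"]) == ex["label"]
                   for ex in scoring_slice) / len(scoring_slice)
    return final_prompt, accuracy

# purely illustrative toy stand-ins for the agent and target model:
def toy_target(prompt, text):
    return text.upper() if "uppercase" in prompt else text

def toy_agent(history):
    # refine the prompt once any feedback example has failed
    if any(not ok for turn in history for _, ok in turn["feedback"]):
        return "Echo the input in uppercase."
    return "Echo the input."
```

The shape matters more than the toy details: intermediate turns exist only to generate feedback the agent can react to, which is why the length pressure relaxes relative to the single-step setup.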
 
README.md CHANGED
@@ -33,6 +33,7 @@ A Qwen3-1.7B agent (trained via TRL GRPO) learns to write **35-token prompts** t
| [`prompt-golf-qwen-to-llama-nothink`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink) | Qwen→Llama, thinking=OFF (**hero**) | adapter + plots + train_metrics + base/trained eval JSONLs + demo CSV |
| [`prompt-golf-qwen-to-llama`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama) | Qwen→Llama, thinking=ON | same artifacts (A/B variant) |
| [`prompt-golf-grpo-1.5b`](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b) | Qwen→Qwen same-family (control) | same artifacts |
+ | [`prompt-golf-multistep-llama`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama) | Qwen→Llama, **multi-turn (3 turns)** | adapter + train_metrics + base/trained eval JSONLs |

### Training pipeline ([`training/`](https://github.com/rishabh16196/prompt_golf_env/tree/main/training))

@@ -138,6 +139,29 @@ OFF wins on reward and compression; ON wins on accuracy by 1.6pp at a 30% token

📊 **Demo CSV:** [`evals/qwen_to_llama_demo.csv`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama/blob/main/evals/qwen_to_llama_demo.csv)

+ ### Multi-turn variant: 3 turns per episode
+
+ Same agent + target as the hero, but each episode runs `turn_limit=3` — the agent sees its prior prompts + per-example feedback on a 2-example feedback slice, and only the final-turn prompt is judged on a held-out 4-example scoring slice. Trained with the trajectory-level GRPO trainer (`train_grpo_multistep.py`), 150 steps × 8 trajectories × 3 turns.
+
+ | | Single-step hero | Multi-step (3 turns) |
+ |---|---|---|
+ | Trained accuracy | 0.523 | **0.576** |
+ | Trained reward | +0.426 | **+0.440** |
+ | Mean tokens | **35** | 43.7 |
+ | Trained beats untrained on | — | 29 / 90 tasks |
+
+ **Multi-step is a conditional improvement, not a strict upgrade:**
+
+ | Where it wins | Where it loses |
+ |---|---|
+ | `reasoning_tough` (5 wins / 0 losses) | `format` (2 wins / 5 losses) |
+ | `classification_tough` (7 / 2) | `format_three_bullets`, `format_uppercase` (bloat with no accuracy gain) |
+ | `policy_compression` (1 win — `policy_content_moderation` 0.00 → 0.67) | Dead-target tasks (agent burns tokens trying anyway, e.g. `policy_finreg_communication_review` 17 → 112 tokens, both 0.00) |
+
+ The intuition: multi-turn relaxes length pressure because the agent has room to debug across turns. That helps tasks where the agent needs reasoning room (tough reasoning, tough classification, complex extraction). It hurts tasks where short single-shot prompts already win (format-strict tasks). **Single-step is the right default for cost-sensitive deployments; multi-step is the right pick when accuracy ceilings matter more than token count.**
+
+ 📊 **Eval JSONLs:** [`evals/eval_base.jsonl`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama/blob/main/evals/eval_base.jsonl) + [`evals/eval_trained.jsonl`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama/blob/main/evals/eval_trained.jsonl)
+
---

## Quick start