Don Rishabh, Claude Opus 4.7 (1M context) committed
Commit 802278c · 1 Parent(s): 6206e8a

docs: refresh stale 87-task numbers in BLOG_POST; finish README "~" cleanup


BLOG_POST:
- Tagline: "~55% of tokens" → "80% of accuracy at ~40% of tokens".
- "What we built": 35-token / 48-of-87 (55%) → ~39-token / 63-of-90
(70%) best-of-three. Mean verbose tokens 63 → 94.
- Recipe table: same update.
Now consistent with the 90-task Results section further down.

README:
- Replace remaining "~" approximations with "≈" or plain text in
rendered prose (was producing weird double-tilde rendering on HF).
- Bash code blocks left alone (~3h renders literally inside a fence).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (2)
  1. BLOG_POST.md +3 -3
  2. README.md +3 -3
BLOG_POST.md CHANGED

@@ -7,7 +7,7 @@ authors:

  # Prompt Golf

- > *Same accuracy as the human-written prompt at ~55% of the tokens — learned by an RL agent that never saw the target's weights, only its outputs.*
+ > *80% of human-written-prompt accuracy at ~40% of the tokens — learned by an RL agent that never saw the target's weights, only its outputs.*

  ## How this started

@@ -50,7 +50,7 @@ The research question and the ad-tech problem are the same problem from differen

  We trained a Qwen3-1.7B **agent** (LoRA + TRL GRPO) to write prompts for a frozen Llama-3.2-3B **target**. Different families on purpose: the agent has no gradient access, no shared tokenizer affordance, no architectural shortcut. Just the same view a human prompt engineer has — *I can see what the target does, I can't see why it does it.*

- After 500 GRPO steps on a 90-task bank, the agent compresses verbose human-written prompts (mean ~63 tokens, up to 737 on long-context policy tasks) into **35-token prompts** that retain **80% of the verbose accuracy** and **beat the human prompt outright on 48 of 87 tasks (55%)**. Peak compression: **30× on long-context policy tasks**.
+ After 500 GRPO steps on a 90-task bank, the agent compresses verbose human-written prompts (mean ~94 tokens, up to 737 on long-context policy tasks) into **~39-token prompts** that retain **80% of the verbose accuracy**. On a per-task basis, the trained agent's prompt is **the best of three options on 63 of 90 tasks (70%)** — *cheaper and equal-or-better reward* than both the verbose human prompt and the untrained agent. Peak compression: **30× on long-context policy tasks**.

  Everything is open: [the env](https://huggingface.co/spaces/rishabh16196/prompt_golf_env), the [trained adapter](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink), the [training pipeline](https://github.com/rishabh16196/prompt_golf_env/tree/main/training), and a [live Gradio demo](https://huggingface.co/spaces/rishabh16196/prompt-golf-demo) where you can play prompts against the same target the agent was trained on. The [demo CSV](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/blob/main/evals/qwen_to_llama_demo.csv) has all 90 tasks × verbose / untrained / trained / accuracy side by side.

@@ -59,7 +59,7 @@ Everything is open: [the env](https://huggingface.co/spaces/rishabh16196/prompt_
  | **The capability we're testing** | Can one LLM learn to write the minimum prompt that elicits a specific behavior from a frozen target LLM? |
  | **The environment** | Single-step RL. Agent writes a prompt → frozen target runs it on 6 hidden test inputs → reward = task_success − 0.5·baseline − 0.002·tokens − leakage². |
  | **The recipe** | Qwen3-1.7B (LoRA, r=16) ⟶ Llama-3.2-3B-Instruct (frozen). 500 GRPO steps on a 90-task bank. ~3h on a single L40S. |
- | **The result** | 35-token prompts → 80% of verbose accuracy. Wins on 55% of tasks. |
+ | **The result** | ~39-token prompts → 80% of verbose accuracy. Best of three on 70% of tasks. |
  | **Why care** | First OpenEnv environment for cross-model prompt-writing as a learnable skill. Plugs straight into red-teaming, prompt distillation, capability elicitation. |

  ---
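The reward expression in the environment row above (and in README step 4 below) is compact enough to spell out. A minimal Python sketch, using the README's variable names and clip range; the function name and the worked numbers are illustrative, not lifted from the env source:

```python
def prompt_golf_reward(task_score: float, baseline: float,
                       prompt_tokens: int, leakage_overlap: float) -> float:
    """Reward as documented: task score minus a baseline handicap, a per-token
    cost, and a quadratic leakage penalty, clipped to [-0.5, 1.3]."""
    raw = task_score - 0.5 * baseline - 0.002 * prompt_tokens - leakage_overlap ** 2
    return max(-0.5, min(1.3, raw))

# Illustrative numbers: a 39-token prompt that scores 0.80 on the hidden tests
# against a 0.20 empty-prompt baseline, with zero leakage:
# 0.80 - 0.10 - 0.078 - 0.0 ≈ 0.622
print(round(prompt_golf_reward(0.80, 0.20, 39, 0.0), 3))  # 0.622
```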
README.md CHANGED

@@ -67,10 +67,10 @@ All numbers below are on the same 90-task bank, evaluated against frozen Llama-3
  |---|---|---|---|---|---|
  | **Verbose** (human-written) | 0.631 | — | 94.2 | (the bar) | You don't have an agent and don't mind paying full token cost. |
  | **Base** (Qwen3-1.7B, no adapter) | 0.464 | — | 37.5 | 4 / 90 | Almost never. Untrained Qwen3 over-thinks the task. |
- | **Hero** (1-step trained) | 0.506 | +0.381 | **38.5** | **63 / 90** | **Default.** Cheapest, wins most often, ~3× shorter than verbose at 80% of its accuracy. |
+ | **Hero** (1-step trained) | 0.506 | +0.381 | **38.5** | **63 / 90** | **Default.** Cheapest, wins most often, 3× shorter than verbose at 80% of its accuracy. |
  | **Multistep** (3-turn trained) | **0.576** | **+0.440** | 43.7 | 23 / 90 | Nuanced classifiers (`classification_tough` is its sweet spot). When the +6pp accuracy matters more than +5 tokens. |

- > **Headline:** Hero retains **80% of verbose accuracy at ~40% of the tokens** and wins per-task on 70% of tasks. Multistep gives back compression for accuracy — only worth it for nuanced classification.
+ > **Headline:** Hero retains **80% of verbose accuracy at 40% of the tokens** and wins per-task on 70% of tasks. Multistep gives back compression for accuracy — only worth it for nuanced classification.

  ### Training curves (hero)

@@ -135,7 +135,7 @@ Each episode is one task:

  1. `reset(task=...)` → env returns task description + 3 visible train examples + token budget + target's empty-prompt baseline.
  2. Agent outputs a **prompt string** as its action.
- 3. Env prepends the prompt to ~6 held-out test inputs, runs the **frozen target LLM**, scores each output with the task scorer.
+ 3. Env prepends the prompt to 6 held-out test inputs, runs the **frozen target LLM**, scores each output with the task scorer.
  4. `reward = raw_task_score − 0.5·baseline − 0.002·tokens − leakage_overlap²`, clipped to `[−0.5, 1.3]`.

  **Multi-turn is supported.** With `turn_limit > 1`, the env splits the test pool into a 2-example *feedback slice* (revealed across turns with target outputs) and a 4-example *scoring slice* (only the final-turn prompt is judged). The agent sees its prior prompts + per-example target outputs in the user message, so it can debug across turns without leaking the inputs that grade it. We trained a 3-turn variant — see the [per-category breakdown above](#per-category-breakdown-hero-vs-multistep-90-tasks) for results.
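As a sanity check on the headline this commit keeps pointing at ("80% of verbose accuracy at 40% of the tokens"), both ratios fall straight out of the README table values above. A tiny script, assuming the first numeric column is the accuracy the headline compares and 94.2 / 38.5 are mean prompt tokens:

```python
# Headline ratios recomputed from the README table (column meanings assumed).
verbose_acc, hero_acc = 0.631, 0.506        # Verbose vs Hero, first numeric column
verbose_tokens, hero_tokens = 94.2, 38.5    # mean prompt tokens

print(f"accuracy retained:     {hero_acc / verbose_acc:.0%}")        # 80%
print(f"token cost vs verbose: {hero_tokens / verbose_tokens:.0%}")  # 41%, i.e. ~40%
```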