Rishabh committed (Co-authored-by: Claude Opus 4.7, 1M context)
Commit · 802278c
Parent(s): 6206e8a

docs: refresh BLOG_POST stale 87-task numbers; finish README ~ cleanup
BLOG_POST:
- Tagline: "~55% of tokens" → "80% of accuracy at ~40% of tokens".
- "What we built": 35-token / 48-of-87 (55%) → ~39-token / 63-of-90 (70%) best-of-three. Mean verbose tokens 63 → 94.
- Recipe table: same update.
Now consistent with the 90-task Results section further down.
README:
- Replace remaining "~" approximations with "≈" or plain text in rendered prose (was producing weird double-tilde rendering on HF).
- Bash code blocks left alone (~3h renders literally inside a fence).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- BLOG_POST.md +3 -3
- README.md +3 -3
BLOG_POST.md
CHANGED

@@ -7,7 +7,7 @@ authors:
 
 # Prompt Golf
 
-> *
+> *80% of human-written-prompt accuracy at ~40% of the tokens — learned by an RL agent that never saw the target's weights, only its outputs.*
 
 ## How this started
 
@@ -50,7 +50,7 @@ The research question and the ad-tech problem are the same problem from different
 
 We trained a Qwen3-1.7B **agent** (LoRA + TRL GRPO) to write prompts for a frozen Llama-3.2-3B **target**. Different families on purpose: the agent has no gradient access, no shared tokenizer affordance, no architectural shortcut. Just the same view a human prompt engineer has — *I can see what the target does, I can't see why it does it.*
 
-After 500 GRPO steps on a 90-task bank, the agent compresses verbose human-written prompts (mean ~
+After 500 GRPO steps on a 90-task bank, the agent compresses verbose human-written prompts (mean ~94 tokens, up to 737 on long-context policy tasks) into **~39-token prompts** that retain **80% of the verbose accuracy**. On a per-task basis, the trained agent's prompt is **the best of three options on 63 of 90 tasks (70%)** — *cheaper and equal-or-better reward* than both the verbose human prompt and the untrained agent. Peak compression: **30× on long-context policy tasks**.
 
 Everything is open: [the env](https://huggingface.co/spaces/rishabh16196/prompt_golf_env), the [trained adapter](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink), the [training pipeline](https://github.com/rishabh16196/prompt_golf_env/tree/main/training), and a [live Gradio demo](https://huggingface.co/spaces/rishabh16196/prompt-golf-demo) where you can play prompts against the same target the agent was trained on. The [demo CSV](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/blob/main/evals/qwen_to_llama_demo.csv) has all 90 tasks × verbose / untrained / trained / accuracy side by side.
 
@@ -59,7 +59,7 @@ Everything is open: [the env](https://huggingface.co/spaces/rishabh16196/prompt_
 | **The capability we're testing** | Can one LLM learn to write the minimum prompt that elicits a specific behavior from a frozen target LLM? |
 | **The environment** | Single-step RL. Agent writes a prompt → frozen target runs it on 6 hidden test inputs → reward = task_success − 0.5·baseline − 0.002·tokens − leakage². |
 | **The recipe** | Qwen3-1.7B (LoRA, r=16) ⟶ Llama-3.2-3B-Instruct (frozen). 500 GRPO steps on a 90-task bank. ~3h on a single L40S. |
-| **The result** |
+| **The result** | ~39-token prompts → 80% of verbose accuracy. Best of three on 70% of tasks. |
 | **Why care** | First OpenEnv environment for cross-model prompt-writing as a learnable skill. Plugs straight into red-teaming, prompt distillation, capability elicitation. |
 
 ---
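The reward in the environment row above (task_success − 0.5·baseline − 0.002·tokens − leakage²) is simple enough to sketch in Python. This is a minimal illustration only: the function name and signature are hypothetical, not the env's actual API.

```python
def episode_reward(raw_task_score: float, baseline: float,
                   prompt_tokens: int, leakage_overlap: float) -> float:
    """Sketch of the reward described in the diff: task score, minus half the
    target's empty-prompt baseline, minus a per-token cost, minus squared
    train-example leakage, clipped to [-0.5, 1.3]. Names are illustrative."""
    r = (raw_task_score
         - 0.5 * baseline           # no credit for what the target does anyway
         - 0.002 * prompt_tokens    # pressure toward shorter prompts
         - leakage_overlap ** 2)    # penalize copying visible train examples
    return max(-0.5, min(1.3, r))
```

At these coefficients a perfect ~39-token prompt pays only about 0.078 in token penalty, so the compression pressure is mild but never zero.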
README.md
CHANGED

@@ -67,10 +67,10 @@ All numbers below are on the same 90-task bank, evaluated against frozen Llama-3
 |---|---|---|---|---|---|
 | **Verbose** (human-written) | 0.631 | — | 94.2 | (the bar) | You don't have an agent and don't mind paying full token cost. |
 | **Base** (Qwen3-1.7B, no adapter) | 0.464 | — | 37.5 | 4 / 90 | Almost never. Untrained Qwen3 over-thinks the task. |
-| **Hero** (1-step trained) | 0.506 | +0.381 | **38.5** | **63 / 90** | **Default.** Cheapest, wins most often,
+| **Hero** (1-step trained) | 0.506 | +0.381 | **38.5** | **63 / 90** | **Default.** Cheapest, wins most often, ≈3× shorter than verbose at 80% of its accuracy. |
 | **Multistep** (3-turn trained) | **0.576** | **+0.440** | 43.7 | 23 / 90 | Nuanced classifiers (`classification_tough` is its sweet spot). When the +6pp accuracy matters more than +5 tokens. |
 
-> **Headline:** Hero retains **80% of verbose accuracy at
+> **Headline:** Hero retains **80% of verbose accuracy at ≈40% of the tokens** and wins per-task on 70% of tasks. Multistep gives back compression for accuracy — only worth it for nuanced classification.
 
 ### Training curves (hero)
 
@@ -135,7 +135,7 @@ Each episode is one task:
 
 1. `reset(task=...)` → env returns task description + 3 visible train examples + token budget + target's empty-prompt baseline.
 2. Agent outputs a **prompt string** as its action.
-3. Env prepends the prompt to
+3. Env prepends the prompt to 6 held-out test inputs, runs the **frozen target LLM**, scores each output with the task scorer.
 4. `reward = raw_task_score − 0.5·baseline − 0.002·tokens − leakage_overlap²`, clipped to `[−0.5, 1.3]`.
 
 **Multi-turn is supported.** With `turn_limit > 1`, the env splits the test pool into a 2-example *feedback slice* (revealed across turns with target outputs) and a 4-example *scoring slice* (only the final-turn prompt is judged). The agent sees its prior prompts + per-example target outputs in the user message, so it can debug across turns without leaking the inputs that grade it. We trained a 3-turn variant — see the [per-category breakdown above](#per-category-breakdown-hero-vs-multistep-90-tasks) for results.
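The multi-turn feedback/scoring split described in the README hunk above can be sketched as follows. This is a hypothetical helper assuming the 6-example hidden test pool the README describes; the env's real partitioning logic may differ.

```python
def split_test_pool(test_pool: list) -> tuple:
    """Sketch of the turn_limit > 1 split: 2 examples become the feedback
    slice (revealed with target outputs across turns) and 4 become the
    scoring slice (only the final-turn prompt is judged on these).
    Hypothetical helper, not the env's actual code."""
    assert len(test_pool) == 6, "README describes a 6-example hidden test pool"
    feedback_slice = test_pool[:2]   # shown back to the agent for debugging
    scoring_slice = test_pool[2:]    # held out; grades the final prompt only
    return feedback_slice, scoring_slice
```

Keeping the scoring slice invisible is what lets the agent iterate on real target outputs without being able to overfit the inputs that grade it.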