Spaces:
Sleeping
Sleeping
Don Rishabh commited on
Commit ·
4ea12d8
1
Parent(s): ea78734
BLOG_POST: full rewrite — research framing, 10 sections, citations, image placeholders
Browse files- BLOG_POST.md +355 -108
BLOG_POST.md
CHANGED
|
@@ -1,223 +1,470 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
thumbnail: /blog/assets/prompt_golf/thumbnail.png
|
| 4 |
authors:
|
| 5 |
-
- user: rishabh16196
|
| 6 |
---
|
| 7 |
|
| 8 |
# Prompt Golf
|
| 9 |
|
| 10 |
-
> *Same accuracy as the human-written prompt
|
| 11 |
|
| 12 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 13 |
|
| 14 |
-
|
| 15 |
|
| 16 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 17 |
|
| 18 |
-
|
| 19 |
-
- **Trained agent's prompts** (~35 tokens): **52% accuracy at ~half the tokens**
|
| 20 |
-
- **80% accuracy retention at 60% compression**, with peak compressions of **30× on long-context policy tasks**
|
| 21 |
-
- **Cross-family transfer**: the Qwen agent never saw Llama gradients, only its outputs. It still learned format anchors that work for Llama specifically.
|
| 22 |
|
| 23 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 24 |
|
| 25 |
---
|
| 26 |
|
| 27 |
-
## The
|
|
|
|
|
|
|
| 28 |
|
| 29 |
-
|
|
|
|
|
|
|
|
|
|
| 30 |
|
| 31 |
-
|
| 32 |
|
| 33 |
-
|
| 34 |
-
- **Content moderation**: multi-page community guidelines stuffed into the system prompt of every Llama instance scoring user posts.
|
| 35 |
-
- **Customer support**: a 1500-token persona document that turns every reply into "Hi, this is Bot™ — I'm here to help! 🌟".
|
| 36 |
-
- **Compliance**: FINRA-style review rules that a model has to internalize to flag broker communications.
|
| 37 |
|
| 38 |
-
|
| 39 |
|
| 40 |
-
|
| 41 |
|
| 42 |
-
|
| 43 |
|
| 44 |
---
|
| 45 |
|
| 46 |
-
##
|
|
|
|
|
|
|
| 47 |
|
| 48 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 49 |
|
| 50 |
-
|
| 51 |
-
2. The agent's action is a **prompt string** (typically wrapped in `<prompt>...</prompt>`).
|
| 52 |
-
3. The env prepends that prompt to ~6 *hidden* test inputs, runs the **frozen target LLM** on each, and scores the outputs with a task-specific scorer.
|
| 53 |
-
4. Reward = `raw_task_score − 0.5·baseline − 0.002·tokens − leakage_overlap²`, clipped.
|
| 54 |
|
| 55 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 56 |
|
| 57 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 58 |
|
| 59 |
---
|
| 60 |
|
| 61 |
-
## Why cross-family is the
|
|
|
|
|
|
|
| 62 |
|
| 63 |
-
|
| 64 |
|
| 65 |
-
|
|
|
|
|
|
|
|
|
|
| 66 |
|
| 67 |
-
|
| 68 |
-
- Which words it can drop ("Please carefully consider…")
|
| 69 |
-
- Which compressions break Llama's output even though they look semantically equivalent
|
| 70 |
-
- That Llama-3.2 needs explicit label vocabularies on classification but Llama-3.2 *doesn't* need them on JSON extraction
|
| 71 |
|
| 72 |
-
|
| 73 |
|
| 74 |
---
|
| 75 |
|
| 76 |
-
## The 90-task bank
|
|
|
|
|
|
|
| 77 |
|
| 78 |
-
|
|
|
|
|
|
|
| 79 |
|
| 80 |
| Tier | Count | Examples |
|
| 81 |
|---|---|---|
|
| 82 |
-
| **v1** (
|
| 83 |
-
| **v2** (
|
| 84 |
-
| **tough** (
|
| 85 |
-
| **policy** (
|
| 86 |
|
| 87 |
-
Each task ships 3 visible train examples + 6 hidden test examples + a per-task token budget (60
|
| 88 |
|
| 89 |
-
The **policy tasks are the headline workload**: each has a 500
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 90 |
|
| 91 |
---
|
| 92 |
|
| 93 |
-
##
|
| 94 |
|
| 95 |
The recipe:
|
| 96 |
|
| 97 |
-
- **Agent**
|
| 98 |
-
- **Target**
|
| 99 |
-
- **Judge**
|
| 100 |
-
- **GRPO**
|
| 101 |
-
- **Hardware**
|
| 102 |
-
- **
|
| 103 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 104 |
|
| 105 |
-
|
| 106 |
|
| 107 |
-
|
| 108 |
|
| 109 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 110 |
|
| 111 |
---
|
| 112 |
|
| 113 |
-
## Results
|
| 114 |
|
| 115 |
-
### Headline numbers (90-task average)
|
| 116 |
|
| 117 |
| Stage | Mean accuracy | Mean tokens |
|
| 118 |
|---|---|---|
|
| 119 |
| Verbose human-written prompt | **0.65** | ~63 |
|
| 120 |
-
| Untrained Qwen3-1.7B agent | 0.48 | 38 |
|
| 121 |
| **Trained Qwen3-1.7B + LoRA** | **0.52** | **35** |
|
| 122 |
|
| 123 |
-
→ **80% accuracy retention
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 124 |
|
| 125 |
-
|
| 126 |
|
| 127 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 128 |
|---|---|---|---|
|
| 129 |
-
| `sentiment_basic` | 27 tok / **0.83** | **18 tok** / **1.00** |
|
| 130 |
| `tough_yaml_nested_depth` | 74 tok / 0.96 | **20 tok** / **1.00** | 3.7× compression, accuracy improved |
|
| 131 |
-
| `json_key_ordering` | 47 tok / 0.61 | **38 tok** / **0.78** |
|
| 132 |
-
| `tough_fallacy_classify` | 164 tok / 0.00 | **59 tok** / **0.33** | added label vocabulary the verbose prompt forgot |
|
|
|
|
| 133 |
|
| 134 |
-
The trained
|
| 135 |
-
- Add label vocabulary when the verbose prompt forgets it ("positive / negative / neutral")
|
| 136 |
-
- Drop ceremonial preamble ("In this task you will…")
|
| 137 |
-
- Keep technical anchors that constrain Llama's output format
|
| 138 |
-
- Match or beat the verbose accuracy ceiling on tasks where the verbose prompt is already near-optimal
|
| 139 |
|
| 140 |
### What the agent actually wrote
|
| 141 |
|
| 142 |
-
For sentiment classification
|
| 143 |
-
|
| 144 |
-
> *"For each input review, output exactly one of: positive, negative, neutral. Output the label only — no punctuation, no explanation."* (27 tokens)
|
| 145 |
-
|
| 146 |
-
The trained agent's compressed version:
|
| 147 |
|
| 148 |
-
> *"
|
|
|
|
|
|
|
| 149 |
|
| 150 |
For YAML extraction with strict nesting:
|
| 151 |
|
| 152 |
-
> Verbose: 74 tokens describing depth requirements, entity coverage, format constraints, output instructions.
|
| 153 |
>
|
| 154 |
-
> Trained agent:
|
| 155 |
|
| 156 |
-
For
|
| 157 |
|
| 158 |
-
> Verbose:
|
| 159 |
>
|
| 160 |
-
> Trained agent:
|
| 161 |
|
| 162 |
-
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 163 |
|
| 164 |
-
|
| 165 |
|
| 166 |
-
|
| 167 |
|
| 168 |
-
|
| 169 |
|
| 170 |
-
*
|
| 171 |
|
| 172 |
-
|
| 173 |
|
| 174 |
-
|
|
|
|
|
|
|
| 175 |
|
| 176 |
---
|
| 177 |
|
| 178 |
-
## What
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 179 |
|
| 180 |
-
- **Thinking mode doesn't help GRPO at this scale.** Implicit credit assignment between `<think>` and final tokens is too weak for the agent to exploit. Don't pay for the slowdown unless you have stronger trainer signal.
|
| 181 |
-
- **Cross-family is harder, but better.** Same-family (Qwen→Qwen) gives you a self-distillation problem; cross-family forces the agent to learn target-specific quirks. Llama-3.2-3B turned out to be far more cooperative on strict-format tasks than Qwen3-1.7B (67/87 solvable vs 19/87 with verbose prompts), which moved the dial dramatically on what training could even *attempt*.
|
| 182 |
-
- **Profile before you train.** Running the target on the verbose description of every task ahead of time tells you the headroom: tasks where `description_baseline ≈ 0` will produce zero gradient (no group variance in GRPO) and just dilute the budget.
|
| 183 |
---
|
| 184 |
|
| 185 |
-
## Try it yourself
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 186 |
|
| 187 |
-
**Run the env locally:**
|
| 188 |
```bash
|
| 189 |
git clone https://huggingface.co/spaces/rishabh16196/prompt_golf_env
|
| 190 |
cd prompt_golf_env
|
| 191 |
pip install -e . gradio transformers torch
|
| 192 |
|
| 193 |
-
#
|
| 194 |
-
PROMPT_GOLF_TARGET_BACKEND=mock
|
| 195 |
|
| 196 |
-
#
|
| 197 |
-
|
|
|
|
|
|
|
| 198 |
```
|
| 199 |
|
| 200 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 201 |
```bash
|
| 202 |
PUSH_TO_HUB=your-user/your-repo bash training/hf_job_train.sh
|
| 203 |
-
# ~3h on L40S,
|
| 204 |
```
|
| 205 |
|
| 206 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 207 |
|
| 208 |
---
|
| 209 |
|
| 210 |
-
## What's next
|
| 211 |
|
| 212 |
Directions we'd love community help on:
|
| 213 |
|
| 214 |
-
|
| 215 |
-
|
| 216 |
-
|
| 217 |
-
|
|
|
|
|
|
|
| 218 |
|
| 219 |
---
|
| 220 |
|
| 221 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 222 |
|
| 223 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
+
title: 'Prompt Golf: training one LLM to write the shortest prompts that steer another'
|
| 3 |
thumbnail: /blog/assets/prompt_golf/thumbnail.png
|
| 4 |
authors:
|
| 5 |
+
- user: rishabh16196
|
| 6 |
---
|
| 7 |
|
| 8 |
# Prompt Golf
|
| 9 |
|
| 10 |
+
> *Same accuracy as the human-written prompt at ~55% of the tokens — learned by an RL agent that never saw the target's weights, only its outputs.*
|
| 11 |
|
| 12 |
+
We trained an LLM to be a prompt engineer for *another LLM*.
|
| 13 |
+
|
| 14 |
+
The setup: a Qwen3-1.7B **agent** (LoRA-fine-tuned via TRL GRPO) writes prompts. A frozen Llama-3.2-3B **target** runs them. The reward is task success minus prompt length. After 500 GRPO steps on a 90-task bank, the agent compresses verbose human-written prompts (mean ~63 tokens, up to 737 on long-context policy tasks) into **35-token** prompts that retain **80% of the verbose accuracy** and **beat the human prompt outright on 48 of 87 tasks (55%)**.
|
| 15 |
+
|
| 16 |
+
Peak compression: **37×** on long-context policy tasks — a 737-token MSN ad-creative policy compressed to a 20-token classifier prompt.
|
| 17 |
|
| 18 |
+
Everything is open: the OpenEnv environment, three trained adapters, a live Gradio demo where you can play prompts against the same target, a Trackio dashboard with the full training trajectory, and a reproducible HuggingFace Jobs pipeline.
|
| 19 |
|
| 20 |
+
> 🌍 **[Environment Space](https://huggingface.co/spaces/rishabh16196/prompt_golf_env)**
|
| 21 |
+
> 🎛️ **[Live Gradio demo](https://huggingface.co/spaces/rishabh16196/prompt-golf-demo)**
|
| 22 |
+
> 📊 **[Trackio dashboard](https://huggingface.co/spaces/rishabh16196/prompt-golf-trackio)**
|
| 23 |
+
> 🤗 **[Hero adapter](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink)**
|
| 24 |
+
> 🐙 **[GitHub mirror](https://github.com/rishabh16196/prompt_golf_env)**
|
| 25 |
+
|
| 26 |
+
<!-- IMAGE PLACEHOLDER 1 — Hero
|
| 27 |
+
A side-by-side panel:
|
| 28 |
+
LEFT : 737-token MSN ad-creative policy (truncated with "...")
|
| 29 |
+
RIGHT : 20-token trained-agent compression
|
| 30 |
+
Bottom badge: "37× compression"
|
| 31 |
+
This is the strongest single visual you have. Lead with it. -->
|
| 32 |
+
|
| 33 |
+
---
|
| 34 |
|
| 35 |
+
## TL;DR
|
|
|
|
|
|
|
|
|
|
| 36 |
|
| 37 |
+
| | |
|
| 38 |
+
|---|---|
|
| 39 |
+
| **The capability we're testing** | Can one LLM learn to write the minimum prompt that elicits a specific behavior from a frozen target LLM? |
|
| 40 |
+
| **The environment** | Single-step RL. Agent writes a prompt → frozen target runs it on 6 hidden test inputs → reward = task_success − 0.5·baseline − 0.002·tokens − leakage². |
|
| 41 |
+
| **The recipe** | Qwen3-1.7B (LoRA, r=16) ⟶ Llama-3.2-3B-Instruct (frozen). 500 GRPO steps on a 90-task bank. ~3h on a single L40S. |
|
| 42 |
+
| **The result** | 35-token prompts → 80% of verbose accuracy. Wins on 55% of tasks. 37× peak compression on long-context policy tasks. |
|
| 43 |
+
| **Why care** | First OpenEnv environment for cross-model prompt-writing as a learnable skill. Plugs straight into red-teaming, prompt distillation, capability elicitation. |
|
| 44 |
|
| 45 |
---
|
| 46 |
|
| 47 |
+
## 1. The capability gap: prompts as folklore
|
| 48 |
+
|
| 49 |
+
Modern LLMs are trained to **follow** prompts. They are not trained to **write** them. But every serious deployment ships a prompt-engineering pipeline anyway:
|
| 50 |
|
| 51 |
+
- **Ad tech:** a 700-token policy describing what creatives can serve, prepended to every classification call.
|
| 52 |
+
- **Content moderation:** multi-page community guidelines stuffed into the system prompt of every Llama instance scoring user posts.
|
| 53 |
+
- **Customer support:** a 1500-token persona document that turns every reply into "Hi, this is Bot™ — I'm here to help! 🌟".
|
| 54 |
+
- **Compliance:** FINRA-style review rules that a model has to internalize to flag broker communications correctly.
|
| 55 |
|
| 56 |
+
These prompts get written, version-controlled, A/B tested — by humans, with intuition. They are also the single largest line item in inference cost. **A 700-token policy on 10M daily requests is 7 billion tokens of prefill compute per day** — and we strongly suspect most of those tokens are decorative, not load-bearing.
|
| 57 |
|
| 58 |
+
There's a deeper research problem hiding underneath the cost. We have **no clean way to distinguish "the model can't do X" from "we haven't found the right prompt."** Modern benchmarks conflate the two. The gap between a *minimum* and a *verbose* prompt that elicit the same behavior is empirical evidence about what's stored in weights vs. what must be supplied via context — but no reusable RL environment exists to study this.
|
|
|
|
|
|
|
|
|
|
| 59 |
|
| 60 |
+
There are pieces in the literature that gesture at this. **AutoPrompt** ([Shin et al., 2020](https://arxiv.org/abs/2010.15980)) and **GCG** ([Zou et al., 2023](https://arxiv.org/abs/2307.15043)) search for short prompts but produce gibberish that doesn't generalize. **RLPrompt** ([Deng et al., 2022](https://arxiv.org/abs/2205.12548)) and **PCRL** ([Jung & Kim, 2024](https://arxiv.org/abs/2308.08758)) use RL with length penalties as one-off papers, not reusable environments. **Red-Teaming-with-LMs** ([Perez et al., 2022](https://arxiv.org/abs/2202.03286)) trains an LLM to elicit behaviors from a frozen LLM — exactly our setup — but oriented at safety rather than capability.
|
| 61 |
|
| 62 |
+
Prompt Golf is the missing piece: an open, reusable OpenEnv RL environment for cross-model prompt-writing, with the same algorithmic core but a research framing oriented at capability elicitation, prompt distillation, and behavioral modeling.
|
| 63 |
|
| 64 |
+
The conceptual ancestor we lean on hardest is Rabinowitz et al.'s **[Machine Theory of Mind](https://arxiv.org/abs/1802.07740)** — meta-learn a model of another agent from interaction. That's exactly what the Qwen agent ends up doing. It never sees Llama's gradients. It only sees Llama's outputs. From those, it builds a probabilistic behavioral model of Llama's response surface, encoded in the prompts it learns to write.
|
| 65 |
|
| 66 |
---
|
| 67 |
|
| 68 |
+
## 2. The environment
|
| 69 |
+
|
| 70 |
+
Each episode is one task. The agent sees the task, writes a prompt, gets scored. That's the whole loop.
|
| 71 |
|
| 72 |
+
```
|
| 73 |
+
┌─────────────────────────┐
|
| 74 |
+
│ GolfObservation │
|
| 75 |
+
reset() ─────────► task_description │
|
| 76 |
+
│ 3 visible train ex. │
|
| 77 |
+
│ token budget │
|
| 78 |
+
│ baseline_zero_shot │
|
| 79 |
+
└────────────┬────────────┘
|
| 80 |
+
│
|
| 81 |
+
▼
|
| 82 |
+
Agent writes a prompt string
|
| 83 |
+
│
|
| 84 |
+
▼
|
| 85 |
+
Prepend prompt to 6 hidden test inputs
|
| 86 |
+
│
|
| 87 |
+
▼
|
| 88 |
+
Frozen Llama-3.2-3B runs each
|
| 89 |
+
│
|
| 90 |
+
▼
|
| 91 |
+
Task-specific scorer in [0, 1] per input
|
| 92 |
+
│
|
| 93 |
+
▼
|
| 94 |
+
reward composition (additive)
|
| 95 |
+
│
|
| 96 |
+
▼
|
| 97 |
+
GolfStepResult to agent
|
| 98 |
+
```
|
| 99 |
|
| 100 |
+
### Reward
|
|
|
|
|
|
|
|
|
|
| 101 |
|
| 102 |
+
```
|
| 103 |
+
reward = raw_task_score
|
| 104 |
+
− 0.5 · baseline_zero_shot ← don't reward what the target already does
|
| 105 |
+
− 0.002 · submitted_tokens ← the golf score
|
| 106 |
+
− leakage_overlap² ← anti-cheat: caught pasting test inputs
|
| 107 |
+
− short_penalty (if tokens < 5) ← anti-collapse to 1-token prompts
|
| 108 |
|
| 109 |
+
clipped to [-0.5, 1.3]
|
| 110 |
+
```
|
| 111 |
+
|
| 112 |
+
Three things about this composition matter.
|
| 113 |
+
|
| 114 |
+
**Additive, not multiplicative.** Earlier versions used `length_factor × leakage_factor × raw_score`, which gave brittle gradients (the multiplicative form has dead zones where one factor is small and gradients vanish). The additive form is smoother and what training actually converges on.
|
| 115 |
+
|
| 116 |
+
**Baseline subtraction is load-bearing.** Without it, the agent gets credit for tasks the target already does well at zero-shot — which means it's rewarded for nothing. With it, the reward signal isolates *additional capability elicited by the prompt*, which is what we actually care about.
|
| 117 |
+
|
| 118 |
+
**Anti-collapse floor.** Without `MIN_TOKENS_FLOOR=5`, GRPO inevitably converges on degenerate 1-2 token prompts that exploit specific tokenization artifacts. These aren't prompts in any meaningful sense — they're attacks on the target's tokenizer. The floor penalty turns the search away from those local optima.
|
| 119 |
+
|
| 120 |
+
### Anti-leakage
|
| 121 |
+
|
| 122 |
+
The 6 held-out test inputs are **never shown to the agent**. A trigram-overlap detector zeros the reward if the agent tries to paste held-out inputs into its prompt. Multi-turn mode (when `turn_limit > 1`) splits the test pool into a 2-example *feedback* slice (revealed across turns with the target's outputs) and a 4-example *scoring* slice (only the final-turn prompt is judged) — so the agent can debug across turns without leaking the inputs that ultimately judge it.
|
| 123 |
+
|
| 124 |
+
### Scorers
|
| 125 |
+
|
| 126 |
+
Each task picks one of 21 scorers, grouped into 7 families:
|
| 127 |
+
|
| 128 |
+
| Family | Scorers | What they check |
|
| 129 |
+
|---|---|---|
|
| 130 |
+
| **Exact / membership** | `exact_label`, `contains_label`, `contains_all_substrings`, `uppercase_match` | Closed-vocabulary classifiers; required substrings; case-strict rewrites |
|
| 131 |
+
| **Numeric** | `numeric_match`, `word_count_exact` | Last numeric token within tolerance; word count exactly N |
|
| 132 |
+
| **JSON / YAML** | `json_contains_fields`, `valid_json_object`, `json_key_order`, `valid_yaml_depth` | Required keys/values; key ordering; nesting depth |
|
| 133 |
+
| **Format-strict** | `three_bullets`, `acrostic_match`, `avoid_letter`, `ends_question`, `terminal_output_pattern` | Exactly 3 bullets; first letters spell a word; output avoids a letter; ends with `?`; terminal-session shape |
|
| 134 |
+
| **Multi-step / language** | `stepwise_math`, `translation_match`, `selective_translate` | Numbered steps + numeric answer; token-F1 vs reference; partial-translation rules |
|
| 135 |
+
| **Safety** | `refusal_score` | Whether the output is a refusal (matches expected refuse/comply label) |
|
| 136 |
+
| **LLM judge** (Qwen3-8B 8-bit) | `judge_criteria`, `judge_vs_expected` | Free-form persona / reasoning / Yoda-syntax tasks; deterministic decoding |
|
| 137 |
+
|
| 138 |
+
The scorer is **fixed per task and never seen by the agent** — it has to infer from train examples + task description what gets graded. *Verifiable beats judgeable* is the design principle: every task we can grade with a regex, we do; LLM judges only kick in for genuinely free-form behaviors like persona consistency.
|
| 139 |
|
| 140 |
---
|
| 141 |
|
| 142 |
+
## 3. Why cross-family is the right setup
|
| 143 |
+
|
| 144 |
+
If the agent and target are the same model family, you're really doing self-distillation: the agent has perfect access to its own response surface. We ship that as a control (`prompt-golf-grpo-1.5b`, Qwen→Qwen).
|
| 145 |
|
| 146 |
+
When agent and target are different families, the agent has to **build an empirical model of the target's behavior from outputs alone**. Concretely, it learns:
|
| 147 |
|
| 148 |
+
- Which words Llama needs to constrain its output format (`Output the label only, no punctuation.`)
|
| 149 |
+
- Which words Llama can drop without consequence (`Please carefully consider…`)
|
| 150 |
+
- Which compressions break Llama even when they look semantically equivalent
|
| 151 |
+
- That Llama-3.2 needs explicit label vocabularies on classification but *doesn't* need them on JSON extraction
|
| 152 |
|
| 153 |
+
This is **operationalized behavioral theory-of-mind** — the agent's policy implicitly encodes a probabilistic model of another model's response surface.
|
|
|
|
|
|
|
|
|
|
| 154 |
|
| 155 |
+
Cross-family also turned out to be the *easier* setup empirically, for an unexpected reason. Llama-3.2-3B is significantly more cooperative on strict-format tasks than Qwen3-1.7B: **67/87 tasks have non-zero verbose-prompt accuracy on Llama**, vs only 19/87 on Qwen. That changes what training can attempt at all — cross-family Qwen→Llama gives the agent more "real" tasks with reward variance to learn from.
|
| 156 |
|
| 157 |
---
|
| 158 |
|
| 159 |
+
## 4. The 90-task bank
|
| 160 |
+
|
| 161 |
+
Task quality is the single biggest determinant of whether this kind of environment is interesting or boring. A great training loop on bad tasks teaches the wrong thing. We curated each task against three filters:
|
| 162 |
|
| 163 |
+
1. **Empty-prompt baseline must fail.** No free lunch. We ran every task with an empty prompt and dropped the ones where the target succeeded anyway.
|
| 164 |
+
2. **Verbose prompt must succeed.** A capability ceiling has to exist for there to be room to compress. Run `bash training/hf_job_profile.sh` on your fork to do this check yourself.
|
| 165 |
+
3. **Minimum prompt must be non-obvious.** The whole game is closing the gap between (2) and (3).
|
| 166 |
|
| 167 |
| Tier | Count | Examples |
|
| 168 |
|---|---|---|
|
| 169 |
+
| **v1** (`tasks.py`) | 20 | sentiment classification, NER, JSON extraction, translation, refusal |
|
| 170 |
+
| **v2** (`tasks_v2.py`) | 15 | acrostic, no-letter-e, YAML nested depth, pirate persona, terminal session output |
|
| 171 |
+
| **tough** (`tasks_tough.py`) | 52 | logical fallacy ID, FINRA risk classification, Yoda-with-constraint |
|
| 172 |
+
| **policy** (`tasks_policy.py`) | 3 | MSN ad-creative policy (737 tok), content moderation rules (612 tok), FINRA broker-dealer review (550 tok) |
|
| 173 |
|
| 174 |
+
Each task ships 3 visible train examples + 6 hidden test examples + a per-task token budget (60–250).
|
| 175 |
|
| 176 |
+
The **policy tasks are the headline workload**: each has a 500–700-word real-world-style policy as the verbose prompt, and the agent has to compress it into a ≤250-token classifier prompt that still routes inputs to the right `allow / disallow / review` decision. That's where the inference-cost story is most visible — these are prompts that look like real production system prompts.
|
| 177 |
+
|
| 178 |
+
<!-- IMAGE PLACEHOLDER 2 — Task bank diagram
|
| 179 |
+
A 4-panel grid, one panel per tier, showing one example task each:
|
| 180 |
+
v1 : "sentiment_basic" — input review → label
|
| 181 |
+
v2 : "tough_yaml_nested_depth" — input spec → 4-deep YAML
|
| 182 |
+
tough : "tough_fallacy_classify" — input argument → fallacy name
|
| 183 |
+
policy : "policy_msn_ad_creative" — input creative → allow/disallow
|
| 184 |
+
For each: show a 2-line example input → expected output.
|
| 185 |
+
Helps the reader picture what the agent is being graded on. -->
|
| 186 |
|
| 187 |
---
|
| 188 |
|
| 189 |
+
## 5. Training: GRPO, LoRA, ~3 hours on an L40S
|
| 190 |
|
| 191 |
The recipe:
|
| 192 |
|
| 193 |
+
- **Agent:** Qwen3-1.7B + LoRA (r=16, α=32), trained with TRL GRPO
|
| 194 |
+
- **Target:** `meta-llama/Llama-3.2-3B-Instruct` (frozen)
|
| 195 |
+
- **Judge:** Qwen3-8B in 8-bit via `bitsandbytes` (only for `judge_*` scorers)
|
| 196 |
+
- **GRPO config:** 500 steps, `num_generations=8`, `lr=5e-6`, `β=0.04`, `temperature=0.9`, `max_completion_length=768`
|
| 197 |
+
- **Hardware:** single L40S (48 GB) on HuggingFace Jobs, ~3 hours per run
|
| 198 |
+
- **Anti-collapse guard:** `MIN_TOKENS_FLOOR=5` rubric penalty
|
| 199 |
+
|
| 200 |
+
To reproduce:
|
| 201 |
+
|
| 202 |
+
```bash
|
| 203 |
+
PUSH_TO_HUB=your-user/your-repo bash training/hf_job_train.sh
|
| 204 |
+
```
|
| 205 |
+
|
| 206 |
+
A few practical things that mattered along the way:
|
| 207 |
+
|
| 208 |
+
**Pre-flight capability profiling is non-negotiable.** Before committing GPU hours, we ran each task with the verbose hand-written description and recorded `description_baseline` per task. Tasks where the verbose prompt also fails produce zero gradient (no GRPO group variance) and just dilute the budget. Profile first, train second.
|
| 209 |
|
| 210 |
+
**`frac_reward_zero_std` is the diagnostic to watch.** If a GRPO group has zero intra-group reward variance, it contributes no gradient. The "tough" tier gave the most signal because its reward was widely dispersed within each group — that's a feature, not a bug.
|
| 211 |
|
| 212 |
+
**Format anchors emerge before content compression.** Looking at intermediate checkpoints, the agent first learns that certain trigger tokens (`JSON:`, `psql>`, `Yarrr,`) carry enormous behavioral payload. Content-level compression — dropping ceremonial preamble like *"In this task you will…"* — comes later, around step 200+.
|
| 213 |
|
| 214 |
+
### The thinking-mode A/B
|
| 215 |
+
|
| 216 |
+
Qwen3 supports an optional `<think>...</think>` chat template that gives the model free reasoning scratch space before the final output. Hypothesis: free reasoning would let the agent reason about format anchors before emitting the prompt, since the rubric only counts the *extracted* prompt's tokens.
|
| 217 |
+
|
| 218 |
+
We A/B'd identical training setups, thinking ON vs OFF:
|
| 219 |
+
|
| 220 |
+
| | thinking=OFF (hero) | thinking=ON |
|
| 221 |
+
|---|---|---|
|
| 222 |
+
| Trained accuracy | 0.523 | **0.539** |
|
| 223 |
+
| Trained reward | **+0.426** | +0.379 |
|
| 224 |
+
| Mean tokens | **35** | 46 |
|
| 225 |
+
|
| 226 |
+
OFF wins on reward and compression by a clear margin. ON wins on accuracy by 1.6 percentage points at a 30% token cost. **The implicit credit assignment between `<think>` tokens and the final prompt is too weak for GRPO to exploit at this scale** — the gradient just doesn't flow cleanly across the thinking block. We ship OFF as the hero adapter and ON as a different operating point on the accuracy/length frontier.
|
| 227 |
|
| 228 |
---
|
| 229 |
|
| 230 |
+
## 6. Results
|
| 231 |
|
| 232 |
+
### Headline numbers (Qwen → Llama, 90-task average)
|
| 233 |
|
| 234 |
| Stage | Mean accuracy | Mean tokens |
|
| 235 |
|---|---|---|
|
| 236 |
| Verbose human-written prompt | **0.65** | ~63 |
|
| 237 |
+
| Untrained Qwen3-1.7B agent | 0.48 | ~38 |
|
| 238 |
| **Trained Qwen3-1.7B + LoRA** | **0.52** | **35** |
|
| 239 |
|
| 240 |
+
→ **80% accuracy retention at 55% of the verbose token count**, scored on a frozen Llama target the agent never had gradient access to.
|
| 241 |
+
|
| 242 |
+
The trained agent **beats the human verbose prompt on 48 of 87 tasks (55%)** under the same rubric. On the remaining 39 tasks, the accuracy drop on hard tasks outweighs the length savings — those are cases where the trained agent compressed too aggressively to keep up with Llama's verbose-prompt capability ceiling. **On those tasks the verbose prompt's extra tokens are doing real cognitive work**, not just adding decoration.
|
| 243 |
+
|
| 244 |
+
### The training curves
|
| 245 |
+
|
| 246 |
+

|
| 247 |
+
|
| 248 |
+
*Mean reward per step, climbing from ~0 to +0.43 over 500 steps. The plateau around step 350 is where length compression saturates against accuracy preservation.*
|
| 249 |
+
|
| 250 |
+

|
| 251 |
+
|
| 252 |
+
*Mean prompt tokens per step. The agent finds the compression frontier within the first ~150 steps and then refines it. The trajectory is monotonic but uneven — bigger compression jumps happen on tasks where the agent discovers a new format anchor.*
|
| 253 |
|
| 254 |
+

|
| 255 |
|
| 256 |
+
*Decomposition of the additive reward. The length penalty stays small because the agent quickly stops paying it; the gain comes from raw task score climbing while baseline-subtracted reward stays positive.*
|
| 257 |
+
|
| 258 |
+
For step-by-step exploration, the **[Trackio dashboard](https://huggingface.co/spaces/rishabh16196/prompt-golf-trackio)** has the full per-step metrics replayed from `train_metrics.jsonl`.
|
| 259 |
+
|
| 260 |
+
### Per-task highlights
|
| 261 |
+
|
| 262 |
+
The full row-by-row demo CSV is at **[`evals/qwen_to_llama_demo.csv`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/blob/main/evals/qwen_to_llama_demo.csv)** — every task with verbose / untrained / trained prompts side by side, plus accuracy and reward deltas. A few representative rows:
|
| 263 |
+
|
| 264 |
+
| Task | Verbose | Trained | Notes |
|
| 265 |
|---|---|---|---|
|
| 266 |
+
| `sentiment_basic` | 27 tok / **0.83** | **18 tok** / **1.00** | Shorter AND more accurate |
|
| 267 |
| `tough_yaml_nested_depth` | 74 tok / 0.96 | **20 tok** / **1.00** | 3.7× compression, accuracy improved |
|
| 268 |
+
| `json_key_ordering` | 47 tok / 0.61 | **38 tok** / **0.78** | Shorter AND +17pp accuracy |
|
| 269 |
+
| `tough_fallacy_classify` | 164 tok / 0.00 | **59 tok** / **0.33** | Trained agent added the label vocabulary the verbose prompt forgot |
|
| 270 |
+
| `policy_msn_ad_creative` | **737 tok** / 0.00 | **20 tok** / 0.00 | 37× compression — both fail because Llama-3.2-3B can't reason over the policy hierarchy, but the compression doesn't *cost* anything |
|
| 271 |
|
| 272 |
+
The `policy_msn_ad_creative` row is the most interesting. Both prompts get 0 accuracy, so it looks like a draw — but the verbose prompt was charging 737 tokens of prefill on every request to deliver that 0. The trained agent does it for 20. **Pair the compressed prompt with a stronger target and you'd ship the same behavior at 37× lower input-token cost.**
|
|
|
|
|
|
|
|
|
|
|
|
|
| 273 |
|
| 274 |
### What the agent actually wrote
|
| 275 |
|
| 276 |
+
For sentiment classification:
|
|
|
|
|
|
|
|
|
|
|
|
|
| 277 |
|
| 278 |
+
> *Verbose human prompt:* "For each input review, output exactly one of: positive, negative, neutral. Output the label only — no punctuation, no explanation." (27 tokens)
|
| 279 |
+
>
|
| 280 |
+
> *Trained agent:* "Classify the input review as positive, negative, or neutral. Output only the label." (18 tokens, **1.00 accuracy**)
|
| 281 |
|
| 282 |
For YAML extraction with strict nesting:
|
| 283 |
|
| 284 |
+
> *Verbose:* 74 tokens describing depth requirements, entity coverage, format constraints, output instructions.
|
| 285 |
>
|
| 286 |
+
> *Trained agent:* "Generate a YAML document that meets the specified minimum nesting depth and includes all entities from the given specification." (20 tokens, **1.00 accuracy**)
|
| 287 |
|
| 288 |
+
For policy compliance — the long-context money case:
|
| 289 |
|
| 290 |
+
> *Verbose:* 737 tokens of MSN ad-creative policy listing prohibited content, restricted categories, format standards, decision schemas.
|
| 291 |
>
|
| 292 |
+
> *Trained agent:* "Classify the input creative as allow, disallow, or review based on the given policy guidelines." (20 tokens)
|
| 293 |
|
| 294 |
+
### The same-family control: Qwen → Qwen
|
| 295 |
+
|
| 296 |
+
| | Qwen→Qwen (control) | Qwen→Llama (hero) |
|
| 297 |
+
|---|---|---|
|
| 298 |
+
| Trained beats verbose on | **70/87 tasks (80%)** | 48/87 (55%) |
|
| 299 |
+
| Mean reward advantage vs verbose | **+0.085** | -0.057 |
|
| 300 |
+
| Verbose accuracy ceiling | 0.15 | 0.65 |
|
| 301 |
+
|
| 302 |
+
This looks like a "win" for Qwen→Qwen on win-rate, but the framing is misleading. Qwen3-1.7B as a target only achieves 0.15 average accuracy with verbose prompts — the bar to beat is on the floor. **Cross-family Llama is a much harder bar to clear (0.65 verbose ceiling), but the absolute accuracy delivered by the trained agent is dramatically higher.** This is the kind of nuance you only see when you actually run the comparison.
|
| 303 |
+
|
| 304 |
+

|
| 305 |
|
| 306 |
+
*Same training recipe with Qwen3-1.7B as both agent and target. The curve looks great because the verbose baseline is weak — but absolute accuracy is much lower than the cross-family run.*
|
| 307 |
|
| 308 |
+
### The thinking-ON variant
|
| 309 |
|
| 310 |
+

|
| 311 |
|
| 312 |
+
*Identical training setup with Qwen3's `<think>...</think>` chat template enabled. Trajectory is similar in shape but absolute reward plateaus lower — the extra tokens spent inside `<think>` cost more than the accuracy gain they buy.*
|
| 313 |
|
| 314 |
+
All three trained adapters are public, with their own demo CSVs:
|
| 315 |
|
| 316 |
+
- 🥇 **[`prompt-golf-qwen-to-llama-nothink`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink)** (thinking=OFF, hero) — [demo CSV](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/blob/main/evals/qwen_to_llama_demo.csv)
|
| 317 |
+
- 🅰️ **[`prompt-golf-qwen-to-llama`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama)** (thinking=ON variant) — [demo CSV](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama/blob/main/evals/qwen_to_llama_demo.csv)
|
| 318 |
+
- 🎛️ **[`prompt-golf-grpo-1.5b`](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b)** (Qwen→Qwen control) — [demo CSV](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b/blob/main/evals/qwen_to_qwen_demo.csv)
|
| 319 |
|
| 320 |
---
|
| 321 |
|
| 322 |
+
## 7. What the agent learned
|
| 323 |
+
|
| 324 |
+
Some qualitative observations from inspecting trained-agent outputs:
|
| 325 |
+
|
| 326 |
+
**Format cues are tokens, not sentences.** `JSON:` does the job that `Output your response as a JSON object with the following structure` does — at 50× fewer tokens.
|
| 327 |
+
|
| 328 |
+
**Persona triggers are surprisingly small.** `Yarrr,` for pirate. `psql>` for SQL. `Once upon a time,` for fairy-tale. These tokens carry enormous behavioral payload because they're strong prefix-match anchors in the target's training distribution. The trained agent finds them; humans writing prompts almost never do.
|
| 329 |
+
|
| 330 |
+
**Add the label vocabulary the human forgot.** On classification tasks where the verbose prompt described the task but didn't list the labels, the trained agent learned to *insert the label set* even though that increases length. The reward signal pushes toward explicit label vocabularies because the target needs them. The human prompt was too polite to spell them out; the agent has no such instinct.
|
| 331 |
+
|
| 332 |
+
**Show, don't tell.** For tasks like "respond in exactly 4 numbered steps," the agent learned to *demonstrate* the structure (`1.\n2.\n3.\n4.\nAnswer:`) rather than describe it. This is the kind of insight a generic action space enables that operator-based golfers (with `INSERT`, `DELETE`, `REPLACE` actions) would miss.
|
| 333 |
+
|
| 334 |
+
**Drop ceremonial preamble.** *"In this task you will…"* / *"Please carefully consider…"* / *"Your goal is to…"* — gone, every time, with no measurable accuracy cost. The first chunk of most human-written prompts is almost pure decoration.
|
| 335 |
+
|
| 336 |
+
We also saw failure modes worth flagging:
|
| 337 |
+
|
| 338 |
+
- **Mild gibberish convergence on a few adversarial tasks.** A handful of refusal-related tasks pushed the agent toward GCG-style ungrammatical prompts. The leakage penalty caught the worst cases.
|
| 339 |
+
- **Over-compression on tasks where verbose is doing real work.** The 39 losing tasks share a consistent failure mode: the agent compressed too aggressively to keep up with Llama's verbose-prompt capability ceiling. On these tasks the verbose prompt's extra tokens are doing real cognitive work, not just adding decoration.
|
| 340 |
+
- **The 1-token attractor.** Without `MIN_TOKENS_FLOOR`, RL inevitably found 1-2 token prompts exploiting tokenization artifacts. These weren't prompts in any meaningful sense — they were attacks. The floor penalty is non-optional.
|
| 341 |
|
|
|
|
|
|
|
|
|
|
| 342 |
---
|
| 343 |
|
| 344 |
+
## 8. Try it yourself
|
| 345 |
+
|
| 346 |
+
There's a **[live Gradio demo](https://huggingface.co/spaces/rishabh16196/prompt-golf-demo)** where you pick a task, see the verbose human prompt and the trained agent's compressed prompt side by side, and run either against the same Llama-3.2-3B target on real test inputs. Same UI shows accuracy for both.
|
| 347 |
+
|
| 348 |
+
<!-- IMAGE PLACEHOLDER 3 — Demo screenshot
|
| 349 |
+
Clean shot of the Gradio demo UI:
|
| 350 |
+
- Task selector populated with one of the policy or tough tasks
|
| 351 |
+
- Verbose vs trained prompt panels side by side
|
| 352 |
+
- "Run on Llama-3.2-3B" button
|
| 353 |
+
- Per-input accuracy badges visible
|
| 354 |
+
This is what readers click through to. -->
|
| 355 |
+
|
| 356 |
+
### Run the env locally
|
| 357 |
|
|
|
|
| 358 |
```bash
|
| 359 |
git clone https://huggingface.co/spaces/rishabh16196/prompt_golf_env
|
| 360 |
cd prompt_golf_env
|
| 361 |
pip install -e . gradio transformers torch
|
| 362 |
|
| 363 |
+
# CPU smoke test (mock target, no GPU needed)
|
| 364 |
+
PROMPT_GOLF_TARGET_BACKEND=mock uvicorn server.app:app --port 8000
|
| 365 |
|
| 366 |
+
# Real run with the actual Llama target
|
| 367 |
+
PROMPT_GOLF_TARGET_BACKEND=hf \
|
| 368 |
+
PROMPT_GOLF_TARGET_MODEL=meta-llama/Llama-3.2-3B-Instruct \
|
| 369 |
+
uvicorn server.app:app --port 8000
|
| 370 |
```
|
| 371 |
|
| 372 |
+
### Use the trained adapter
|
| 373 |
+
|
| 374 |
+
```python
|
| 375 |
+
from peft import PeftModel
|
| 376 |
+
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 377 |
+
|
| 378 |
+
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
|
| 379 |
+
agent = PeftModel.from_pretrained(
|
| 380 |
+
base, "rishabh16196/prompt-golf-qwen-to-llama-nothink"
|
| 381 |
+
)
|
| 382 |
+
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
|
| 383 |
+
|
| 384 |
+
# Give it a verbose prompt-golf task description; get back a compressed prompt
|
| 385 |
+
```
|
| 386 |
+
|
| 387 |
+
### Reproduce the hero training run
|
| 388 |
+
|
| 389 |
```bash
|
| 390 |
PUSH_TO_HUB=your-user/your-repo bash training/hf_job_train.sh
|
| 391 |
+
# ~3h on L40S, pushes adapter + plots + train_metrics + eval JSONLs to your repo
|
| 392 |
```
|
| 393 |
|
| 394 |
+
### Hit the env from Python
|
| 395 |
+
|
| 396 |
+
```python
|
| 397 |
+
from prompt_golf_env import GolfAction, PromptGolfEnv
|
| 398 |
+
|
| 399 |
+
async with PromptGolfEnv(base_url="http://localhost:8000") as env:
|
| 400 |
+
result = await env.reset(task="sentiment_basic")
|
| 401 |
+
obs = result.observation
|
| 402 |
+
result = await env.step(GolfAction(prompt="Classify sentiment, one word."))
|
| 403 |
+
print(f"reward={result.reward:.2f} | tokens={result.observation.submitted_prompt_tokens}")
|
| 404 |
+
```
|
| 405 |
+
|
| 406 |
+
---
|
| 407 |
+
|
| 408 |
+
## 9. Why this matters
|
| 409 |
+
|
| 410 |
+
| If you work on… | Prompt Golf gives you… |
|
| 411 |
+
|---|---|
|
| 412 |
+
| **Inference cost in production** | A trained policy that compresses verbose prompts behaviorally — no gradient access to the target needed. Up to 37× compression on real-world policy prompts. |
|
| 413 |
+
| **Capability evaluation** | A black-box minimum-elicitation metric per task per target. Decouples *can the model do X* from *did we find the right prompt*. |
|
| 414 |
+
| **Prompt distillation across targets** | Cross-family training generates a model of the target's response surface. Swap targets, retrain, ship a custom prompt-compressor for your specific deployment. |
|
| 415 |
+
| **Capability elicitation research** | A black-box analog of password-locked-model elicitation ([Greenblatt et al., 2024](https://arxiv.org/abs/2405.19550)). What's the minimum input that surfaces a latent capability? |
|
| 416 |
+
| **Red-teaming / robustness** | Same machinery, different rubric. Adversarial scoring → red-teaming. Refusal rubric → jailbreak hardening. |
|
| 417 |
+
| **LLM ↔ LLM behavioral modeling** | Machine Theory of Mind ([Rabinowitz et al., 2018](https://arxiv.org/abs/1802.07740)) for LLMs as targets. The agent's policy implicitly encodes a model of the target. |
|
| 418 |
+
|
| 419 |
+
### What this is — and isn't
|
| 420 |
+
|
| 421 |
+
- ✅ **Is** the first open OpenEnv RL environment where the agent learns to write prompts for another LLM.
|
| 422 |
+
- ✅ **Is** a calibrated middle: GCG/RLPrompt-style mechanics, Machine ToM-style framing, and reusable infrastructure.
|
| 423 |
+
- ❌ **Isn't** a generative simulator of LLM behavior — we never touch activations.
|
| 424 |
+
- ❌ **Isn't** a new prompt-optimization algorithm. The algorithmic core is RL+length; the contribution is the framing + reusable env + cross-family experiments.
|
| 425 |
+
- ❌ **Isn't** a claim that we've "solved world modeling for LLMs." Episodes are short; the analogy to Dreamer/JEPA/Genie is structural, not algorithmic.
|
| 426 |
|
| 427 |
---
|
| 428 |
|
| 429 |
+
## 10. What's next
|
| 430 |
|
| 431 |
Directions we'd love community help on:
|
| 432 |
|
| 433 |
+
1. **More targets.** We have Qwen3-1.7B and Llama-3.2-3B profiled. Phi-3, Mistral, Gemma 2 — what does the per-target prompt look like? Is the trained agent's policy portable, or is it Llama-specific? This is the cross-target transfer experiment that would substantiate the Machine ToM framing.
|
| 434 |
+
2. **Larger task banks.** 90 hand-crafted tasks is a starting point. Procedural task generation (random format constraints, synthetic policies) would scale this to thousands of holes.
|
| 435 |
+
3. **Different reward shapes.** The current additive reward is one choice. KL-as-reward (output-distribution matching the verbose prompt's) is another. Each captures a different definition of "good."
|
| 436 |
+
4. **Real-world deployment study.** Pick an actual production prompt (with permission), train a compressor for it, measure compression-vs-accuracy in shadow traffic. We'd love to hear what breaks and what holds up.
|
| 437 |
+
|
| 438 |
+
If you have a 1000-token prompt that's eating your inference budget, train a compressor for it. **That's the whole point.**
|
| 439 |
|
| 440 |
---
|
| 441 |
|
| 442 |
+
## Acknowledgments & citations
|
| 443 |
+
|
| 444 |
+
This work draws on four converging research lines:
|
| 445 |
+
|
| 446 |
+
- **[Machine Theory of Mind](https://arxiv.org/abs/1802.07740)** (Rabinowitz et al., 2018) — the conceptual ancestor.
|
| 447 |
+
- **[Red Teaming Language Models with Language Models](https://arxiv.org/abs/2202.03286)** (Perez et al., 2022) — the direct algorithmic ancestor.
|
| 448 |
+
- **[Stress-Testing Capability Elicitation With Password-Locked Models](https://arxiv.org/abs/2405.19550)** (Greenblatt et al., 2024) — the motivation for treating minimum elicitation as a meaningful capability metric.
|
| 449 |
+
- **[AutoPrompt](https://arxiv.org/abs/2010.15980)**, **[GCG](https://arxiv.org/abs/2307.15043)**, **[RLPrompt](https://arxiv.org/abs/2205.12548)**, **[PCRL](https://arxiv.org/abs/2308.08758)** — the algorithmic toolkit.
|
| 450 |
+
|
| 451 |
+
Built for the [OpenEnv Hackathon](https://pytorch.org/event/openenv-ai-hackathon/) (Meta + Hugging Face + PyTorch, India 2026), using TRL GRPO, HuggingFace Jobs, and the OpenEnv spec.
|
| 452 |
+
|
| 453 |
+
```bibtex
|
| 454 |
+
@misc{promptgolf2026,
|
| 455 |
+
author = {Rishabh},
|
| 456 |
+
title = {Prompt Golf: An OpenEnv RL environment for cross-model prompt compression},
|
| 457 |
+
year = {2026},
|
| 458 |
+
howpublished = {\url{https://huggingface.co/spaces/rishabh16196/prompt_golf_env}}
|
| 459 |
+
}
|
| 460 |
+
```
|
| 461 |
+
|
| 462 |
+
⛳
|
| 463 |
|
| 464 |
+
<!-- IMAGE PLACEHOLDER 4 — Closing graphic (optional)
|
| 465 |
+
Either:
|
| 466 |
+
(a) a wordcloud of the actual short prompts the agent
|
| 467 |
+
discovered, sized by frequency across the bank, or
|
| 468 |
+
(b) a clean trophy/golf-ball graphic with the headline
|
| 469 |
+
number "37×" in the foreground.
|
| 470 |
+
Optional. The post lands fine without it. -->
|