Don Rishabh committed on
Commit
4ea12d8
·
1 Parent(s): ea78734

BLOG_POST: full rewrite — research framing, 10 sections, citations, image placeholders

  1. BLOG_POST.md +355 -108
BLOG_POST.md CHANGED
@@ -1,223 +1,470 @@
1
  ---
2
- title: "Prompt Golf: Teaching one LLM to write shorter, sharper prompts for another"
3
  thumbnail: /blog/assets/prompt_golf/thumbnail.png
4
  authors:
5
- - user: rishabh16196
6
  ---
7
 
8
  # Prompt Golf
9
 
10
- > *Same accuracy as the human-written prompt, ~40% fewer tokens, learned by an RL agent that never saw the target's weights.*
11
 
12
- ## TL;DR
 
 
 
 
13
 
14
- We built **Prompt Golf**, an OpenEnv environment where an LLM agent's *action* is a prompt and the *reward* is how well that prompt steers a frozen target LLM to do the right thing divided by how long the prompt is.
15
 
16
- We trained a Qwen3-1.7B **agent** to write prompts for a frozen Llama-3.2-3B **target** using TRL GRPO. The result:
 
 
 
 
 
 
 
 
 
 
 
 
 
17
 
18
- - **Verbose human-written prompts** (200-700 tokens): 65% accuracy on a 90-task bank
19
- - **Trained agent's prompts** (~35 tokens): **52% accuracy at ~half the tokens**
20
- - **80% accuracy retention at 60% compression**, with peak compressions of **30× on long-context policy tasks**
21
- - **Cross-family transfer**: the Qwen agent never saw Llama gradients, only its outputs. It still learned format anchors that work for Llama specifically.
22
 
23
- Everything is open: [the env](https://huggingface.co/spaces/rishabh16196/prompt_golf_env), [the trained adapter](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink), [the demo CSV](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/blob/main/evals/qwen_to_llama_demo.csv), [the training pipeline](https://github.com/rishabh16196/prompt_golf_env/tree/main/training), and a [live Gradio demo](https://huggingface.co/spaces/rishabh16196/prompt-golf-demo).
 
 
 
 
 
 
24
 
25
  ---
26
 
27
- ## The Problem
 
 
28
 
29
- Modern LLMs are trained to **follow** prompts. They are not trained to **write** them.
 
 
 
30
 
31
- But every serious LLM deployment ends up with a prompt-engineering pipeline anyway:
32
 
33
- - **Ad tech**: a 700-token policy describing what creatives can serve, prepended to every classification call.
34
- - **Content moderation**: multi-page community guidelines stuffed into the system prompt of every Llama instance scoring user posts.
35
- - **Customer support**: a 1500-token persona document that turns every reply into "Hi, this is Bot™ — I'm here to help! 🌟".
36
- - **Compliance**: FINRA-style review rules that a model has to internalize to flag broker communications.
37
 
38
- These prompts get shipped, version-controlled, optimized by humans, with intuition. They are also the single largest line item in inference cost. **Every API call pays for the full prompt every time.**
39
 
40
- There is no standard benchmark for "**can a model learn to write prompts that elicit desired behavior from another model?**" The capabilities this tests cut across red-teaming, system-prompt distillation, jailbreak hardening, behavioral probing, and prompt compression — all on frontier labs' roadmaps, none with a clean RL environment.
41
 
42
- Prompt Golf is the missing environment.
43
 
44
  ---
45
 
46
- ## How it works
 
 
47
 
48
- One episode = one task = one prompt:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
49
 
50
- 1. The env hands the agent a **task description** (verbose, hand-written), 3 visible **train examples**, and a **token budget**.
51
- 2. The agent's action is a **prompt string** (typically wrapped in `<prompt>...</prompt>`).
52
- 3. The env prepends that prompt to ~6 *hidden* test inputs, runs the **frozen target LLM** on each, and scores the outputs with a task-specific scorer.
53
- 4. Reward = `raw_task_score − 0.5·baseline − 0.002·tokens − leakage_overlap²`, clipped.
54
 
55
- The held-out test inputs are **never shown to the agent**. An n-gram leakage detector scales reward toward zero if the agent tries to paste answers into its prompt. Multi-turn mode (turn_limit > 1) splits the test pool into a small *feedback* slice (revealed across turns) and a held-out *scoring* slice (only the final-turn prompt is judged).
 
 
 
 
 
56
 
57
- The agent and the target live in the same process. We picked **Qwen3-1.7B as the agent** (trainable, LoRA fine-tuned) and **Llama-3.2-3B-Instruct as the target** (frozen). The judge for fuzzy scorers is **Qwen3-8B** in 8-bit. Cross-family pairings are deliberate.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
58
 
59
  ---
60
 
61
- ## Why cross-family is the interesting setup
 
 
62
 
63
- If the agent and target are the same model, you're really doing self-distillation: the agent has perfect access to its own response surface.
64
 
65
- When they're different families, the agent has to **build an empirical model of the target's behavior from outputs alone**. It learns:
 
 
 
66
 
67
- - Which words Llama needs to constrain its output format ("Output the label only, no punctuation.")
68
- - Which words it can drop ("Please carefully consider…")
69
- - Which compressions break Llama's output even though they look semantically equivalent
70
- - That Llama-3.2 needs explicit label vocabularies on classification but Llama-3.2 *doesn't* need them on JSON extraction
71
 
72
- This is **operationalized behavioral theory-of-mind**: the agent learns a probabilistic model of another model's response surface, encoded in the prompts it writes.
73
 
74
  ---
75
 
76
- ## The 90-task bank
 
 
77
 
78
- We hand-crafted 90 tasks across 18 categories spanning four difficulty tiers:
 
 
79
 
80
  | Tier | Count | Examples |
81
  |---|---|---|
82
- | **v1** (easy/medium) | 20 | sentiment classification, NER, JSON extraction, translation, refusal |
83
- | **v2** (hard) | 15 | acrostic, no-letter-e, YAML nested depth, pirate persona, terminal session output |
84
- | **tough** (hand-crafted hard) | 52 | logical fallacy ID, FINRA risk classification, Yoda-style with constraint, etc. |
85
- | **policy** (long-context compression) | 3 | MSN ad creative policy (737 tok), content moderation rules (612 tok), FINRA broker-dealer review (550 tok) |
86
 
87
- Each task ships 3 visible train examples + 6 hidden test examples + a per-task token budget (60-250). Scorers are a mix of structural (`exact_label`, `valid_yaml_depth`, `json_contains_fields`) and LLM-judge (`judge_criteria` against Qwen3-8B 8-bit).
88
 
89
- The **policy tasks are the headline workload**: each has a 500-700-word real-world-style policy as the verbose prompt, and the agent has to compress it into a 250-token classifier prompt that still routes inputs to the right `allow / disallow / review` decision.
 
 
 
 
 
 
 
 
 
90
 
91
  ---
92
 
93
- ## What we trained
94
 
95
  The recipe:
96
 
97
- - **Agent**: Qwen3-1.7B + LoRA (r=16, α=32), trained with TRL GRPO
98
- - **Target**: meta-llama/Llama-3.2-3B-Instruct (frozen)
99
- - **Judge**: Qwen3-8B (8-bit via `bitsandbytes`) for fuzzy scorers
100
- - **GRPO**: 500 steps, num_generations=8, lr=5e-6, β=0.04, temperature=0.9
101
- - **Hardware**: single L40S (48 GB) on HuggingFace Jobs, ~3 hours per run
102
- - **Seeds**: 4 per task → ~360 dataset rows
103
- - **Anti-collapse guard**: `MIN_TOKENS_FLOOR=5` rubric penalty against degenerate 1-token policies
 
 
 
 
 
 
 
 
 
104
 
105
- Training cost: about $5-7 of GPU time per run on L40S.
106
 
107
- We also tried Qwen3's **thinking mode** (`<think>...</think>` reasoning before the final prompt). The hypothesis was that free reasoning scratch space would let the agent reason about format anchors before emitting the prompt, since the rubric only counts the *extracted* prompt's tokens. We A/B tested it on the same 90-task bank.
108
 
109
- The verdict: thinking mode **did not help**. Thinking-OFF beat thinking-ON by **+0.05 reward at -23% token count** at end of training. The implicit credit-assignment between `<think>` tokens and the final prompt is too weak for GRPO to exploit at this scale. We use the thinking-OFF adapter for everything else.
 
 
 
 
 
 
 
 
 
 
 
 
110
 
111
  ---
112
 
113
- ## Results
114
 
115
- ### Headline numbers (90-task average)
116
 
117
  | Stage | Mean accuracy | Mean tokens |
118
  |---|---|---|
119
  | Verbose human-written prompt | **0.65** | ~63 |
120
- | Untrained Qwen3-1.7B agent | 0.48 | 38 |
121
  | **Trained Qwen3-1.7B + LoRA** | **0.52** | **35** |
122
 
123
- → **80% accuracy retention** at **55% of the verbose token count**, scored on a frozen Llama-3.2-3B target the agent never saw the gradients of.
 
 
 
 
 
 
 
 
 
 
 
 
124
 
125
- ### Where it shines: per-task highlights
126
 
127
- | Task | Verbose | Trained | Win |
 
 
 
 
 
 
 
 
128
  |---|---|---|---|
129
- | `sentiment_basic` | 27 tok / **0.83** | **18 tok** / **1.00** | shorter AND more accurate |
130
  | `tough_yaml_nested_depth` | 74 tok / 0.96 | **20 tok** / **1.00** | 3.7× compression, accuracy improved |
131
- | `json_key_ordering` | 47 tok / 0.61 | **38 tok** / **0.78** | shorter AND +17pp accuracy |
132
- | `tough_fallacy_classify` | 164 tok / 0.00 | **59 tok** / **0.33** | added label vocabulary the verbose prompt forgot |
 
133
 
134
- The trained adapter learned **a nuanced strategy**, not just "shorter":
135
- - Add label vocabulary when the verbose prompt forgets it ("positive / negative / neutral")
136
- - Drop ceremonial preamble ("In this task you will…")
137
- - Keep technical anchors that constrain Llama's output format
138
- - Match or beat the verbose accuracy ceiling on tasks where the verbose prompt is already near-optimal
139
 
140
  ### What the agent actually wrote
141
 
142
- For sentiment classification, the verbose hand-written prompt is:
143
-
144
- > *"For each input review, output exactly one of: positive, negative, neutral. Output the label only — no punctuation, no explanation."* (27 tokens)
145
-
146
- The trained agent's compressed version:
147
 
148
- > *"Classify the input review as positive, negative, or neutral. Output only the label."* (18 tokens, 1.00 accuracy)
 
 
149
 
150
  For YAML extraction with strict nesting:
151
 
152
- > Verbose: 74 tokens describing depth requirements, entity coverage, format constraints, output instructions.
153
  >
154
- > Trained agent: *"Generate a YAML document that meets the specified minimum nesting depth and includes all entities from the given specification."* (20 tokens, 1.00 accuracy)
155
 
156
- For JSON key ordering:
157
 
158
- > Verbose: 47 tokens describing key-order specification format, default-sorting behavior, output rules.
159
  >
160
- > Trained agent: *"Given the input, output a JSON object with keys in the exact order specified. Ignore default sorting."* (38 tokens — 0.78 accuracy vs 0.61 verbose; **the trained prompt is shorter AND more accurate**)
161
 
162
- ---
 
 
 
 
 
 
 
 
 
 
163
 
164
- ## Why you should care
165
 
166
- **1. Inference cost.** If your production pipeline ships a 700-token policy prompt with every request and you're serving 10M requests/day, that's a 10×-100× cost saving on input tokens. At GPT-4o or Llama-API rates that's real money. At hyperscaler-internal-throughput rates it's even more — input tokens dominate the prefill compute that dominates serving cost.
167
 
168
- **2. Prompt-engineering becomes a learned skill.** Today, prompt engineering is a craft: humans iterate, A/B test, write blog posts. Prompt Golf operationalizes it as RL. You can swap targets (Llama → Mistral → Phi → your fine-tune), run training, get a custom prompt-compressor for your specific deployment.
169
 
170
- **3. Behavioral probing for safety.** The same setup, with adversarial scoring, becomes red-teaming. With a refusal rubric, it becomes jailbreak hardening. The env is the substrate; the rubric is the question.
171
 
172
- **4. Cross-model distillation without weights.** Big-target prompts often have to be served by smaller targets at inference. A trained prompt-compressor knows which clauses the smaller target needs vs. can drop, weight-free.
173
 
174
- **5. It's a real RL environment with real signal.** Most LLM RL benchmarks are toy or saturated. Prompt Golf has 90 tasks across structural, judge-based, and long-context regimes. The env is OpenEnv-compliant; you can hook it to TRL, your own trainer, or Unsloth (with the cross-family setup we've validated).
 
 
175
 
176
  ---
177
 
178
- ## What we learned (research-flavored notes)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
179
 
180
- - **Thinking mode doesn't help GRPO at this scale.** Implicit credit assignment between `<think>` and final tokens is too weak for the agent to exploit. Don't pay for the slowdown unless you have stronger trainer signal.
181
- - **Cross-family is harder, but better.** Same-family (Qwen→Qwen) gives you a self-distillation problem; cross-family forces the agent to learn target-specific quirks. Llama-3.2-3B turned out to be far more cooperative on strict-format tasks than Qwen3-1.7B (67/87 solvable vs 19/87 with verbose prompts), which moved the dial dramatically on what training could even *attempt*.
182
- - **Profile before you train.** Running the target on the verbose description of every task ahead of time tells you the headroom: tasks where `description_baseline ≈ 0` will produce zero gradient (no group variance in GRPO) and just dilute the budget.
183
  ---
184
 
185
- ## Try it yourself
 
 
 
 
 
 
 
 
 
 
 
 
186
 
187
- **Run the env locally:**
188
  ```bash
189
  git clone https://huggingface.co/spaces/rishabh16196/prompt_golf_env
190
  cd prompt_golf_env
191
  pip install -e . gradio transformers torch
192
 
193
- # Quick smoke test (mock backend, CPU)
194
- PROMPT_GOLF_TARGET_BACKEND=mock python -m server.tasks_tough
195
 
196
- # Local Gradio demo with the trained adapter
197
- python ui/demo_app.py # opens http://localhost:7860
 
 
198
  ```
199
 
200
- **Reproduce the training run** (HF Jobs):
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
201
  ```bash
202
  PUSH_TO_HUB=your-user/your-repo bash training/hf_job_train.sh
203
- # ~3h on L40S, push adapter + plots + metrics to your repo
204
  ```
205
 
206
- **See the actual learned prompts:** the [demo CSV](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/blob/main/evals/qwen_to_llama_demo.csv) has all 90 tasks × verbose / untrained / trained / accuracy columns, side by side.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
207
 
208
  ---
209
 
210
- ## What's next
211
 
212
  Directions we'd love community help on:
213
 
214
- - **More targets.** Right now we have Qwen3-1.7B and Llama-3.2-3B profiled. Phi-3, Mistral, Gemma 2 — what does the per-target prompt look like? Is the trained agent's prompt portable?
215
- - **Larger task banks.** 90 hand-crafted tasks is a starting point. Procedural task generation (e.g. random regex format constraints) would scale this dramatically.
216
- - **Different reward shapes.** The current additive reward `raw - 0.5·baseline - 0.002·tokens - leak²` is one choice. KL-as-reward (output distribution matching the verbose prompt's) is another. Each captures a different definition of "good".
217
- - **Real-world deployment study.** Pick an actual production prompt (with permission), run prompt golf, measure the compression-vs-accuracy tradeoff in shadow traffic. We'd love to hear what breaks and what holds up.
 
 
218
 
219
  ---
220
 
221
- The env, the trained adapter, the demo CSVs, and the Gradio UI are all open. **If you have a 1000-token prompt that's eating your inference budget, train a compressor for it.** That's the whole point.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
222
 
223
- Connect: [HuggingFace](https://huggingface.co/rishabh16196) · [GitHub mirror](https://github.com/rishabh16196/prompt_golf_env)
 
 
 
 
 
 
 
1
  ---
2
+ title: 'Prompt Golf: training one LLM to write the shortest prompts that steer another'
3
  thumbnail: /blog/assets/prompt_golf/thumbnail.png
4
  authors:
5
+ - user: rishabh16196
6
  ---
7
 
8
  # Prompt Golf
9
 
10
+ > *80% of the human-written prompt's accuracy at ~55% of the tokens, learned by an RL agent that never saw the target's weights, only its outputs.*
11
 
12
+ We trained an LLM to be a prompt engineer for *another LLM*.
13
+
14
+ The setup: a Qwen3-1.7B **agent** (LoRA-fine-tuned via TRL GRPO) writes prompts. A frozen Llama-3.2-3B **target** runs them. The reward is task success minus prompt length. After 500 GRPO steps on a 90-task bank, the agent compresses verbose human-written prompts (mean ~63 tokens, up to 737 on long-context policy tasks) into **35-token** prompts that retain **80% of the verbose accuracy** and **beat the human prompt outright on 48 of 87 tasks (55%)**.
15
+
16
+ Peak compression: **37×** on long-context policy tasks — a 737-token MSN ad-creative policy compressed to a 20-token classifier prompt.
17
 
18
+ Everything is open: the OpenEnv environment, three trained adapters, a live Gradio demo where you can play prompts against the same target, a Trackio dashboard with the full training trajectory, and a reproducible HuggingFace Jobs pipeline.
19
 
20
+ > 🌍 **[Environment Space](https://huggingface.co/spaces/rishabh16196/prompt_golf_env)**
21
+ > 🎛️ **[Live Gradio demo](https://huggingface.co/spaces/rishabh16196/prompt-golf-demo)**
22
+ > 📊 **[Trackio dashboard](https://huggingface.co/spaces/rishabh16196/prompt-golf-trackio)**
23
+ > 🤗 **[Hero adapter](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink)**
24
+ > 🐙 **[GitHub mirror](https://github.com/rishabh16196/prompt_golf_env)**
25
+
26
+ <!-- IMAGE PLACEHOLDER 1 — Hero
27
+ A side-by-side panel:
28
+ LEFT : 737-token MSN ad-creative policy (truncated with "...")
29
+ RIGHT : 20-token trained-agent compression
30
+ Bottom badge: "37× compression"
31
+ This is the strongest single visual you have. Lead with it. -->
32
+
33
+ ---
34
 
35
+ ## TL;DR
 
 
 
36
 
37
+ | | |
38
+ |---|---|
39
+ | **The capability we're testing** | Can one LLM learn to write the minimum prompt that elicits a specific behavior from a frozen target LLM? |
40
+ | **The environment** | Single-step RL. Agent writes a prompt → frozen target runs it on 6 hidden test inputs → reward = task_success − 0.5·baseline − 0.002·tokens − leakage². |
41
+ | **The recipe** | Qwen3-1.7B (LoRA, r=16) ⟶ Llama-3.2-3B-Instruct (frozen). 500 GRPO steps on a 90-task bank. ~3h on a single L40S. |
42
+ | **The result** | 35-token prompts → 80% of verbose accuracy. Wins on 55% of tasks. 37× peak compression on long-context policy tasks. |
43
+ | **Why care** | First OpenEnv environment for cross-model prompt-writing as a learnable skill. Plugs straight into red-teaming, prompt distillation, capability elicitation. |
44
 
45
  ---
46
 
47
+ ## 1. The capability gap: prompts as folklore
48
+
49
+ Modern LLMs are trained to **follow** prompts. They are not trained to **write** them. But every serious deployment ships a prompt-engineering pipeline anyway:
50
 
51
+ - **Ad tech:** a 700-token policy describing what creatives can serve, prepended to every classification call.
52
+ - **Content moderation:** multi-page community guidelines stuffed into the system prompt of every Llama instance scoring user posts.
53
+ - **Customer support:** a 1500-token persona document that turns every reply into "Hi, this is Bot™ — I'm here to help! 🌟".
54
+ - **Compliance:** FINRA-style review rules that a model has to internalize to flag broker communications correctly.
55
 
56
+ These prompts get written, version-controlled, A/B tested — by humans, with intuition. They are also the single largest line item in inference cost. **A 700-token policy on 10M daily requests is 7 billion tokens of prefill compute per day** — and we strongly suspect most of those tokens are decorative, not load-bearing.
57
 
58
+ There's a deeper research problem hiding underneath the cost. We have **no clean way to distinguish "the model can't do X" from "we haven't found the right prompt."** Modern benchmarks conflate the two. The gap between a *minimum* and a *verbose* prompt that elicit the same behavior is empirical evidence about what's stored in weights vs. what must be supplied via context — but no reusable RL environment exists to study this.
 
 
 
59
 
60
+ There are pieces in the literature that gesture at this. **AutoPrompt** ([Shin et al., 2020](https://arxiv.org/abs/2010.15980)) and **GCG** ([Zou et al., 2023](https://arxiv.org/abs/2307.15043)) search for short prompts but produce gibberish that doesn't generalize. **RLPrompt** ([Deng et al., 2022](https://arxiv.org/abs/2205.12548)) and **PCRL** ([Jung & Kim, 2024](https://arxiv.org/abs/2308.08758)) use RL with length penalties as one-off papers, not reusable environments. **Red-Teaming-with-LMs** ([Perez et al., 2022](https://arxiv.org/abs/2202.03286)) trains an LLM to elicit behaviors from a frozen LLM — exactly our setup — but oriented at safety rather than capability.
61
 
62
+ Prompt Golf is the missing piece: an open, reusable OpenEnv RL environment for cross-model prompt-writing, with the same algorithmic core but a research framing oriented at capability elicitation, prompt distillation, and behavioral modeling.
63
 
64
+ The conceptual ancestor we lean on hardest is Rabinowitz et al.'s **[Machine Theory of Mind](https://arxiv.org/abs/1802.07740)** — meta-learn a model of another agent from interaction. That's exactly what the Qwen agent ends up doing. It never sees Llama's gradients. It only sees Llama's outputs. From those, it builds a probabilistic behavioral model of Llama's response surface, encoded in the prompts it learns to write.
65
 
66
  ---
67
 
68
+ ## 2. The environment
69
+
70
+ Each episode is one task. The agent sees the task, writes a prompt, gets scored. That's the whole loop.
71
 
72
+ ```
+                    ┌─────────────────────────┐
+                    │ GolfObservation         │
+ reset() ─────────► │   task_description      │
+                    │   3 visible train ex.   │
+                    │   token budget          │
+                    │   baseline_zero_shot    │
+                    └────────────┬────────────┘
+                                 │
+                                 ▼
+                   Agent writes a prompt string
+                                 │
+                                 ▼
+              Prepend prompt to 6 hidden test inputs
+                                 │
+                                 ▼
+                   Frozen Llama-3.2-3B runs each
+                                 │
+                                 ▼
+             Task-specific scorer in [0, 1] per input
+                                 │
+                                 ▼
+                   reward composition (additive)
+                                 │
+                                 ▼
+                      GolfStepResult to agent
+ ```
99
 
100
+ ### Reward
 
 
 
101
 
102
+ ```
103
+ reward = raw_task_score
104
+ − 0.5 · baseline_zero_shot ← don't reward what the target already does
105
+ − 0.002 · submitted_tokens ← the golf score
106
+ − leakage_overlap² ← anti-cheat: caught pasting test inputs
107
+ − short_penalty (if tokens < 5) ← anti-collapse to 1-token prompts
108
 
109
+ clipped to [-0.5, 1.3]
110
+ ```
111
+
112
+ Three things about this composition matter.
113
+
114
+ **Additive, not multiplicative.** Earlier versions used `length_factor × leakage_factor × raw_score`, which gave brittle gradients (the multiplicative form has dead zones where one factor is small and gradients vanish). The additive form is smoother and what training actually converges on.
115
+
116
+ **Baseline subtraction is load-bearing.** Without it, the agent gets credit for tasks the target already does well at zero-shot — which means it's rewarded for nothing. With it, the reward signal isolates *additional capability elicited by the prompt*, which is what we actually care about.
117
+
118
+ **Anti-collapse floor.** Without `MIN_TOKENS_FLOOR=5`, GRPO inevitably converges on degenerate 1-2 token prompts that exploit specific tokenization artifacts. These aren't prompts in any meaningful sense — they're attacks on the target's tokenizer. The floor penalty turns the search away from those local optima.
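
The reward composition above can be sketched in a few lines of Python. This is a minimal illustration, not the environment's actual code: the function name `golf_reward` and the size of `short_penalty` (0.5 here) are assumptions, since the post only says a penalty applies below 5 tokens.

```python
def golf_reward(raw_task_score, baseline_zero_shot, submitted_tokens,
                leakage_overlap, short_penalty=0.5, min_tokens_floor=5):
    """Additive reward as described in the post.

    short_penalty's magnitude is an assumption; the post only states that
    prompts under MIN_TOKENS_FLOOR tokens are penalized.
    """
    reward = raw_task_score
    reward -= 0.5 * baseline_zero_shot      # don't reward what the target already does
    reward -= 0.002 * submitted_tokens      # the golf score
    reward -= leakage_overlap ** 2          # anti-cheat term
    if submitted_tokens < min_tokens_floor:
        reward -= short_penalty             # anti-collapse floor
    return max(-0.5, min(1.3, reward))      # clip to [-0.5, 1.3]


clean = golf_reward(1.0, 0.2, 35, 0.0)       # perfect score, 35-token prompt
leaky = golf_reward(1.0, 0.2, 35, 1.0)       # same prompt, fully leaked inputs
degenerate = golf_reward(1.0, 0.0, 1, 0.0)   # 1-token prompt hits the floor
print(clean, leaky, degenerate)
```

Because every term is subtracted rather than multiplied, each penalty shifts the reward by a fixed amount regardless of the others, which is the smoothness property the additive form is chosen for.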
119
+
120
+ ### Anti-leakage
121
+
122
+ The 6 held-out test inputs are **never shown to the agent**. A trigram-overlap detector penalizes the reward through the `leakage_overlap²` term if the agent tries to paste held-out inputs into its prompt. Multi-turn mode (when `turn_limit > 1`) splits the test pool into a 2-example *feedback* slice (revealed across turns with the target's outputs) and a 4-example *scoring* slice (only the final-turn prompt is judged), so the agent can debug across turns without leaking the inputs that ultimately judge it.
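
One way a trigram-overlap check like this could work is sketched below. The details are assumptions made for illustration: word-level trigrams, lowercased whitespace tokenization, and an overlap defined as the fraction of the prompt's trigrams that also occur in any hidden input. The helper names are hypothetical.

```python
def word_trigrams(text):
    """Set of lowercased word-level trigrams (assumed tokenization)."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def leakage_overlap(prompt, hidden_inputs):
    """Fraction of the prompt's trigrams that appear in any hidden test input."""
    prompt_grams = word_trigrams(prompt)
    if not prompt_grams:
        return 0.0  # prompts shorter than three words cannot overlap
    hidden_grams = set()
    for text in hidden_inputs:
        hidden_grams |= word_trigrams(text)
    return len(prompt_grams & hidden_grams) / len(prompt_grams)


hidden = ["the battery died after two days"]
honest = leakage_overlap("Classify the review as positive, negative, or neutral.", hidden)
pasted = leakage_overlap("the battery died after two days", hidden)
print(honest, pasted)
```

Squaring the overlap, as the reward does, keeps incidental phrase matches cheap while making wholesale copying expensive.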
123
+
124
+ ### Scorers
125
+
126
+ Each task picks one of 21 scorers, grouped into 7 families:
127
+
128
+ | Family | Scorers | What they check |
129
+ |---|---|---|
130
+ | **Exact / membership** | `exact_label`, `contains_label`, `contains_all_substrings`, `uppercase_match` | Closed-vocabulary classifiers; required substrings; case-strict rewrites |
131
+ | **Numeric** | `numeric_match`, `word_count_exact` | Last numeric token within tolerance; word count exactly N |
132
+ | **JSON / YAML** | `json_contains_fields`, `valid_json_object`, `json_key_order`, `valid_yaml_depth` | Required keys/values; key ordering; nesting depth |
133
+ | **Format-strict** | `three_bullets`, `acrostic_match`, `avoid_letter`, `ends_question`, `terminal_output_pattern` | Exactly 3 bullets; first letters spell a word; output avoids a letter; ends with `?`; terminal-session shape |
134
+ | **Multi-step / language** | `stepwise_math`, `translation_match`, `selective_translate` | Numbered steps + numeric answer; token-F1 vs reference; partial-translation rules |
135
+ | **Safety** | `refusal_score` | Whether the output is a refusal (matches expected refuse/comply label) |
136
+ | **LLM judge** (Qwen3-8B 8-bit) | `judge_criteria`, `judge_vs_expected` | Free-form persona / reasoning / Yoda-syntax tasks; deterministic decoding |
137
+
138
+ The scorer is **fixed per task and never seen by the agent** — it has to infer from train examples + task description what gets graded. *Verifiable beats judgeable* is the design principle: every task we can grade with a regex, we do; LLM judges only kick in for genuinely free-form behaviors like persona consistency.
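
To make "verifiable" concrete, here is a hedged sketch of what two structural scorers might look like. The names `exact_label` and `json_key_order` come from the table above, but these bodies are illustrative guesses, not the environment's implementations; the real scorers may normalize or award partial credit differently.

```python
import json


def exact_label(output, expected):
    """1.0 if the output is exactly the expected label (case/whitespace-insensitive)."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def json_key_order(output, expected_keys):
    """1.0 if the output parses as a JSON object whose keys appear in the expected order."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    # json.loads preserves the key order found in the document.
    return 1.0 if list(parsed.keys()) == list(expected_keys) else 0.0


print(exact_label(" Positive\n", "positive"))
print(json_key_order('{"b": 1, "a": 2}', ["b", "a"]))
```

A scorer this strict is why format anchors such as "Output only the label" matter: a correct answer with trailing punctuation scores zero.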
139
 
140
  ---
141
 
142
+ ## 3. Why cross-family is the right setup
143
+
144
+ If the agent and target are the same model family, you're really doing self-distillation: the agent has perfect access to its own response surface. We ship that as a control (`prompt-golf-grpo-1.5b`, Qwen→Qwen).
145
 
146
+ When agent and target are different families, the agent has to **build an empirical model of the target's behavior from outputs alone**. Concretely, it learns:
147
 
148
+ - Which words Llama needs to constrain its output format (`Output the label only, no punctuation.`)
149
+ - Which words Llama can drop without consequence (`Please carefully consider…`)
150
+ - Which compressions break Llama even when they look semantically equivalent
151
+ - That Llama-3.2 needs explicit label vocabularies on classification but *doesn't* need them on JSON extraction
152
 
153
+ This is **operationalized behavioral theory-of-mind**: the agent's policy implicitly encodes a probabilistic model of another model's response surface.
 
 
 
154
 
155
+ Cross-family also turned out to be the *easier* setup empirically, for an unexpected reason. Llama-3.2-3B is significantly more cooperative on strict-format tasks than Qwen3-1.7B: **67/87 tasks have non-zero verbose-prompt accuracy on Llama**, vs only 19/87 on Qwen. That changes what training can attempt at all: cross-family Qwen→Llama gives the agent more "real" tasks with reward variance to learn from.
156
 
157
  ---
158
 
159
+ ## 4. The 90-task bank
160
+
161
+ Task quality is the single biggest determinant of whether this kind of environment is interesting or boring. A great training loop on bad tasks teaches the wrong thing. We curated each task against three filters:
162
 
163
+ 1. **Empty-prompt baseline must fail.** No free lunch. We ran every task with an empty prompt and dropped the ones where the target succeeded anyway.
164
+ 2. **Verbose prompt must succeed.** A capability ceiling has to exist for there to be room to compress. Run `bash training/hf_job_profile.sh` on your fork to do this check yourself.
165
+ 3. **Minimum prompt must be non-obvious.** The whole game is closing the gap between (2) and (3).
166
 
167
  | Tier | Count | Examples |
168
  |---|---|---|
169
+ | **v1** (`tasks.py`) | 20 | sentiment classification, NER, JSON extraction, translation, refusal |
170
+ | **v2** (`tasks_v2.py`) | 15 | acrostic, no-letter-e, YAML nested depth, pirate persona, terminal session output |
171
+ | **tough** (`tasks_tough.py`) | 52 | logical fallacy ID, FINRA risk classification, Yoda-with-constraint |
172
+ | **policy** (`tasks_policy.py`) | 3 | MSN ad-creative policy (737 tok), content moderation rules (612 tok), FINRA broker-dealer review (550 tok) |
173
 
174
+ Each task ships 3 visible train examples + 6 hidden test examples + a per-task token budget (60-250).
175
 
176
+ The **policy tasks are the headline workload**: each has a 500-700-word real-world-style policy as the verbose prompt, and the agent has to compress it into a 250-token classifier prompt that still routes inputs to the right `allow / disallow / review` decision. That's where the inference-cost story is most visible: these are prompts that look like real production system prompts.
177
+
178
+ <!-- IMAGE PLACEHOLDER 2 — Task bank diagram
179
+ A 4-panel grid, one panel per tier, showing one example task each:
180
+ v1 : "sentiment_basic" — input review → label
181
+ v2 : "tough_yaml_nested_depth" — input spec → 4-deep YAML
182
+ tough : "tough_fallacy_classify" — input argument → fallacy name
183
+ policy : "policy_msn_ad_creative" — input creative → allow/disallow
184
+ For each: show a 2-line example input → expected output.
185
+ Helps the reader picture what the agent is being graded on. -->
186
 
187
  ---
188
 
189
+ ## 5. Training: GRPO, LoRA, ~3 hours on an L40S
190
 
191
  The recipe:
192
 
193
+ - **Agent:** Qwen3-1.7B + LoRA (r=16, α=32), trained with TRL GRPO
194
+ - **Target:** `meta-llama/Llama-3.2-3B-Instruct` (frozen)
195
+ - **Judge:** Qwen3-8B in 8-bit via `bitsandbytes` (only for `judge_*` scorers)
196
+ - **GRPO config:** 500 steps, `num_generations=8`, `lr=5e-6`, `β=0.04`, `temperature=0.9`, `max_completion_length=768`
197
+ - **Hardware:** single L40S (48 GB) on HuggingFace Jobs, ~3 hours per run
198
+ - **Anti-collapse guard:** `MIN_TOKENS_FLOOR=5` rubric penalty
199
+
200
+ To reproduce:
201
+
202
+ ```bash
203
+ PUSH_TO_HUB=your-user/your-repo bash training/hf_job_train.sh
204
+ ```
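
For readers wiring this up by hand, the hyperparameters listed above map onto TRL and PEFT configuration objects roughly as follows. This is a sketch under assumptions, not the contents of `training/hf_job_train.sh`: the `output_dir` name and the batch size are hypothetical, and the actual script may set additional options.

```python
from peft import LoraConfig
from trl import GRPOConfig

# LoRA adapter settings from the recipe (r=16, alpha=32).
lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

# GRPO settings from the recipe.
grpo_config = GRPOConfig(
    output_dir="prompt-golf-grpo",     # hypothetical output directory
    max_steps=500,
    num_generations=8,
    per_device_train_batch_size=8,     # assumption: one group of 8 per step
    learning_rate=5e-6,
    beta=0.04,                         # KL penalty coefficient
    temperature=0.9,
    max_completion_length=768,
)
```

Both objects would then be passed to a `GRPOTrainer` along with a reward function that calls the environment.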
205
+
206
+ A few practical things that mattered along the way:
207
+
208
+ **Pre-flight capability profiling is non-negotiable.** Before committing GPU hours, we ran each task with the verbose hand-written description and recorded `description_baseline` per task. Tasks where the verbose prompt also fails produce zero gradient (no GRPO group variance) and just dilute the budget. Profile first, train second.
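
The profiling step boils down to a filter over per-task baselines. The sketch below assumes the profile is a plain dict from task id to `description_baseline`; the task ids, scores, and the `min_baseline` threshold are made up for illustration.

```python
def split_by_headroom(description_baselines, min_baseline=0.0):
    """Separate tasks the verbose prompt can solve from those it cannot."""
    trainable, skipped = [], []
    for task_id, baseline in sorted(description_baselines.items()):
        (trainable if baseline > min_baseline else skipped).append(task_id)
    return trainable, skipped


# Hypothetical profile: task ids and scores are illustrative only.
profile = {"task_a": 0.83, "task_b": 0.0, "task_c": 0.61}
trainable, skipped = split_by_headroom(profile)
print("train on:", trainable)
print("skip:", skipped)
```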
209
 
210
+ **`frac_reward_zero_std` is the diagnostic to watch.** If a GRPO group has zero intra-group reward variance, it contributes no gradient. The "tough" tier gave the most signal because its reward was widely dispersed within each group — that's a feature, not a bug.
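
Why zero variance means zero gradient follows from how GRPO normalizes rewards within a group. The sketch below is a simplified stand-in for what a trainer computes, using the population standard deviation and a small epsilon; the function names mirror the metric but are written here for illustration only.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def frac_reward_zero_std(groups):
    """Fraction of groups whose rewards are all identical."""
    return sum(1 for g in groups if pstdev(g) == 0) / len(groups)


flat = [0.4] * 8                                      # every completion scored the same
spread = [0.0, 1.0, 0.0, 1.0, 0.5, 0.5, 0.2, 0.8]     # dispersed rewards
print(group_advantages(flat))
print(frac_reward_zero_std([flat, spread]))
```

In the flat group every advantage is zero, so those eight completions contribute nothing to the update no matter how good or bad they were.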
211
 
212
+ **Format anchors emerge before content compression.** Looking at intermediate checkpoints, the agent first learns that certain trigger tokens (`JSON:`, `psql>`, `Yarrr,`) carry enormous behavioral payload. Content-level compression, such as dropping ceremonial preamble like *"In this task you will…"*, comes later, from around step 200 onward.
213
 
214
+ ### The thinking-mode A/B
215
+
216
+ Qwen3 supports an optional `<think>...</think>` chat template that gives the model free reasoning scratch space before the final output. Hypothesis: free reasoning would let the agent reason about format anchors before emitting the prompt, since the rubric only counts the *extracted* prompt's tokens.
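
Counting only the extracted prompt implies a parsing step between the agent's completion and the rubric. A minimal sketch of that step is below; it assumes the completion may contain a `<think>...</think>` block and wraps the final answer in `<prompt>...</prompt>` tags, with a fallback to the remaining text. The regexes and the function name are illustrative, not the environment's parser.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
PROMPT_TAG = re.compile(r"<prompt>(.*?)</prompt>", re.DOTALL)


def extract_prompt(completion):
    """Drop any reasoning block, then return the tagged prompt (or the leftover text)."""
    visible = THINK_BLOCK.sub("", completion)
    match = PROMPT_TAG.search(visible)
    return (match.group(1) if match else visible).strip()


completion = (
    "<think>Llama needs the label list spelled out.</think>\n"
    "<prompt>Output only the label.</prompt>"
)
print(extract_prompt(completion))
```

Only the returned string would be tokenized for the length penalty, which is why reasoning tokens are "free" under the rubric.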
217
+
218
+ We A/B'd identical training setups, thinking ON vs OFF:
219
+
220
+ | | thinking=OFF (hero) | thinking=ON |
221
+ |---|---|---|
222
+ | Trained accuracy | 0.523 | **0.539** |
223
+ | Trained reward | **+0.426** | +0.379 |
224
+ | Mean tokens | **35** | 46 |
225
+
226
+ OFF wins on reward and compression by a clear margin. ON wins on accuracy by 1.6 percentage points but needs about 31% more prompt tokens (46 vs 35). Our interpretation is that **the implicit credit assignment between `<think>` tokens and the final prompt is too weak for GRPO to exploit at this scale**: the learning signal does not propagate cleanly across the thinking block. We ship OFF as the hero adapter and ON as a different operating point on the accuracy/length frontier.
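Since only the extracted prompt is scored, the thinking span has to be removed before counting tokens. The sketch below shows one way that extraction could look; the regex, the `extract_prompt` helper, and the sample completion are assumptions for illustration, and the environment's actual parsing may differ.

```python
import re

# Non-greedy match so multiple think spans are each removed separately.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def extract_prompt(completion: str) -> str:
    """Drop any <think>...</think> span and return the remaining text, trimmed."""
    return THINK_BLOCK.sub("", completion).strip()


completion = (
    "<think>Labels matter; keep them explicit.</think>\n"
    "Classify as positive, negative, or neutral."
)
prompt = extract_prompt(completion)
print(prompt)
# Whitespace splitting is a crude stand-in for the target tokenizer's count.
print(len(prompt.split()))
```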
227
 
228
  ---
229
 
230
+ ## 6. Results
231
 
232
+ ### Headline numbers (Qwen → Llama, 90-task average)
233
 
234
  | Stage | Mean accuracy | Mean tokens |
235
  |---|---|---|
236
  | Verbose human-written prompt | **0.65** | ~63 |
237
+ | Untrained Qwen3-1.7B agent | 0.48 | ~38 |
238
  | **Trained Qwen3-1.7B + LoRA** | **0.52** | **35** |
239
 
240
+ → **80% accuracy retention at 55% of the verbose token count**, scored on a frozen Llama target the agent never had gradient access to.
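The two headline ratios follow directly from the table. A quick check, keeping in mind that the verbose token count is itself approximate (~63):

```python
verbose_acc, trained_acc = 0.65, 0.52
verbose_tokens, trained_tokens = 63, 35  # verbose count is approximate

retention = trained_acc / verbose_acc          # share of verbose accuracy kept
token_fraction = trained_tokens / verbose_tokens  # share of verbose length used

print(f"accuracy retention: {retention:.0%}")
print(f"token fraction:     {token_fraction:.1%}")
```

The token fraction lands between 55% and 56%, which the summary line reports as 55%.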
241
+
242
+ The trained agent **beats the human verbose prompt on 48 of 87 tasks (55%)** under the same rubric. On the remaining 39 tasks, the accuracy drop outweighs the length savings: the trained agent compressed too aggressively to match the accuracy Llama reaches with the verbose prompt. **On those tasks the verbose prompt's extra tokens are doing real cognitive work**, not just adding decoration.
243
+
244
+ ### The training curves
245
+
246
+ ![Reward curve over 500 GRPO steps](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/resolve/main/plots/reward_curve.png)
247
+
248
+ *Mean reward per step, climbing from ~0 to +0.43 over 500 steps. The plateau around step 350 is where length compression saturates against accuracy preservation.*
249
+
250
+ ![Mean prompt length over 500 GRPO steps](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/resolve/main/plots/length_curve.png)
251
+
252
+ *Mean prompt tokens per step. The agent finds the compression frontier within the first ~150 steps and then refines it. The trajectory is monotonic but uneven — bigger compression jumps happen on tasks where the agent discovers a new format anchor.*
253
 
254
+ ![Reward component breakdown](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/resolve/main/plots/breakdown.png)
255
 
256
+ *Decomposition of the additive reward. The length penalty stays small because the agent quickly stops paying it; the gain comes from raw task score climbing while baseline-subtracted reward stays positive.*
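As a rough illustration of how such an additive reward can be decomposed, here is a sketch with three components: gain over a baseline, a length cost, and a floor penalty. Only `MIN_TOKENS_FLOOR=5` comes from the training setup; the function name, the `length_weight` and `floor_penalty` values, and the exact form of each term are assumptions, not the environment's rubric.

```python
MIN_TOKENS_FLOOR = 5  # from the training setup


def sketch_reward(task_score, baseline_score, n_tokens,
                  length_weight=0.002, floor_penalty=0.5):
    """Hypothetical additive reward: baseline-subtracted score minus length and floor costs."""
    gain = task_score - baseline_score        # baseline-subtracted task score
    length_cost = length_weight * n_tokens    # grows linearly with prompt length
    floor_cost = floor_penalty if n_tokens < MIN_TOKENS_FLOOR else 0.0
    return gain - length_cost - floor_cost


print(f"{sketch_reward(1.0, 0.2, n_tokens=20):.3f}")  # ordinary short prompt
print(f"{sketch_reward(1.0, 0.2, n_tokens=2):.3f}")   # below the floor, penalised
```

Under these assumed weights the length term is small next to the score term, which matches the shape described in the caption: once prompts are short, further reward has to come from task score.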
257
+
258
+ For step-by-step exploration, the **[Trackio dashboard](https://huggingface.co/spaces/rishabh16196/prompt-golf-trackio)** has the full per-step metrics replayed from `train_metrics.jsonl`.
259
+
260
+ ### Per-task highlights
261
+
262
+ The full row-by-row demo CSV is at **[`evals/qwen_to_llama_demo.csv`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/blob/main/evals/qwen_to_llama_demo.csv)** — every task with verbose / untrained / trained prompts side by side, plus accuracy and reward deltas. A few representative rows:
263
+
264
+ | Task | Verbose | Trained | Notes |
265
  |---|---|---|---|
266
+ | `sentiment_basic` | 27 tok / **0.83** | **18 tok** / **1.00** | Shorter AND more accurate |
267
  | `tough_yaml_nested_depth` | 74 tok / 0.96 | **20 tok** / **1.00** | 3.7× compression, accuracy improved |
268
+ | `json_key_ordering` | 47 tok / 0.61 | **38 tok** / **0.78** | Shorter AND +17pp accuracy |
269
+ | `tough_fallacy_classify` | 164 tok / 0.00 | **59 tok** / **0.33** | Trained agent added the label vocabulary the verbose prompt forgot |
270
+ | `policy_msn_ad_creative` | **737 tok** / 0.00 | **20 tok** / 0.00 | 37× compression — both fail because Llama-3.2-3B can't reason over the policy hierarchy, but the compression doesn't *cost* anything |
271
 
272
+ The `policy_msn_ad_creative` row is the most interesting. Both prompts get 0 accuracy, so it looks like a draw — but the verbose prompt was charging 737 tokens of prefill on every request to deliver that 0. The trained agent does it for 20. **Pair the compressed prompt with a stronger target and you'd ship the same behavior at 37× lower input-token cost.**
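The compression ratios in the table, and what they mean for prefill volume, can be recomputed from the token counts. The token counts come from the table above; the request volume is a hypothetical figure chosen only to make the saving tangible.

```python
def compression_ratio(verbose_tokens: int, trained_tokens: int) -> float:
    """How many times shorter the trained prompt is than the verbose one."""
    return verbose_tokens / trained_tokens


# (verbose tokens, trained tokens) per task, taken from the table.
rows = {
    "sentiment_basic": (27, 18),
    "tough_yaml_nested_depth": (74, 20),
    "json_key_ordering": (47, 38),
    "tough_fallacy_classify": (164, 59),
    "policy_msn_ad_creative": (737, 20),
}
for task, (verbose, trained) in rows.items():
    ratio = compression_ratio(verbose, trained)
    print(f"{task:26s} {ratio:5.1f}x  saves {verbose - trained} tokens/request")

requests = 1_000_000  # hypothetical traffic volume
tokens_saved = (737 - 20) * requests
print(f"{tokens_saved:,} input tokens saved on the policy prompt")
```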
 
 
 
 
273
 
274
  ### What the agent actually wrote
275
 
276
+ For sentiment classification:
 
 
 
 
277
 
278
+ > *Verbose human prompt:* "For each input review, output exactly one of: positive, negative, neutral. Output the label only — no punctuation, no explanation." (27 tokens)
279
+ >
280
+ > *Trained agent:* "Classify the input review as positive, negative, or neutral. Output only the label." (18 tokens, **1.00 accuracy**)
281
 
282
  For YAML extraction with strict nesting:
283
 
284
+ > *Verbose:* 74 tokens describing depth requirements, entity coverage, format constraints, output instructions.
285
  >
286
+ > *Trained agent:* "Generate a YAML document that meets the specified minimum nesting depth and includes all entities from the given specification." (20 tokens, **1.00 accuracy**)
287
 
288
+ For policy compliance — the long-context money case:
289
 
290
+ > *Verbose:* 737 tokens of MSN ad-creative policy listing prohibited content, restricted categories, format standards, decision schemas.
291
  >
292
+ > *Trained agent:* "Classify the input creative as allow, disallow, or review based on the given policy guidelines." (20 tokens)
293
 
294
+ ### The same-family control: Qwen → Qwen
295
+
296
+ | | Qwen→Qwen (control) | Qwen→Llama (hero) |
297
+ |---|---|---|
298
+ | Trained beats verbose on | **70/87 tasks (80%)** | 48/87 (55%) |
299
+ | Mean reward advantage vs verbose | **+0.085** | -0.057 |
300
+ | Verbose accuracy ceiling | 0.15 | 0.65 |
301
+
302
+ This looks like a "win" for Qwen→Qwen on win-rate, but the framing is misleading. Qwen3-1.7B as a target only achieves 0.15 average accuracy with verbose prompts — the bar to beat is on the floor. **Cross-family Llama is a much harder bar to clear (0.65 verbose ceiling), but the absolute accuracy delivered by the trained agent is dramatically higher.** This is the kind of nuance you only see when you actually run the comparison.
303
+
304
+ ![Qwen→Qwen reward curve](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b/resolve/main/plots/reward_curve.png)
305
 
306
+ *Same training recipe with Qwen3-1.7B as both agent and target. The curve looks great because the verbose baseline is weak — but absolute accuracy is much lower than the cross-family run.*
307
 
308
+ ### The thinking-ON variant
309
 
310
+ ![Qwen→Llama thinking-ON reward curve](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama/resolve/main/plots/reward_curve.png)
311
 
312
+ *Identical training setup with Qwen3's `<think>...</think>` chat template enabled. Trajectory is similar in shape but absolute reward plateaus lower: the extra tokens spent inside `<think>` cost more than the accuracy gain they buy.*
313
 
314
+ All three trained adapters are public, with their own demo CSVs:
315
 
316
+ - 🥇 **[`prompt-golf-qwen-to-llama-nothink`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink)** (thinking=OFF, hero): [demo CSV](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/blob/main/evals/qwen_to_llama_demo.csv)
317
+ - 🅰️ **[`prompt-golf-qwen-to-llama`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama)** (thinking=ON variant): [demo CSV](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama/blob/main/evals/qwen_to_llama_demo.csv)
318
+ - 🎛️ **[`prompt-golf-grpo-1.5b`](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b)** (Qwen→Qwen control): [demo CSV](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b/blob/main/evals/qwen_to_qwen_demo.csv)
319
 
320
  ---
321
 
322
+ ## 7. What the agent learned
323
+
324
+ Some qualitative observations from inspecting trained-agent outputs:
325
+
326
+ **Format cues are tokens, not sentences.** `JSON:` does the job that `Output your response as a JSON object with the following structure` does, at a fraction of the token count.
327
+
328
+ **Persona triggers are surprisingly small.** `Yarrr,` for pirate. `psql>` for SQL. `Once upon a time,` for fairy-tale. These tokens carry enormous behavioral payload because they're strong prefix-match anchors in the target's training distribution. The trained agent finds them; humans writing prompts almost never do.
329
+
330
+ **Add the label vocabulary the human forgot.** On classification tasks where the verbose prompt described the task but didn't list the labels, the trained agent learned to *insert the label set* even though that increases length. The reward signal pushes toward explicit label vocabularies because the target needs them. The human prompt was too polite to spell them out; the agent has no such instinct.
331
+
332
+ **Show, don't tell.** For tasks like "respond in exactly 4 numbered steps," the agent learned to *demonstrate* the structure (`1.\n2.\n3.\n4.\nAnswer:`) rather than describe it. This is the kind of insight a generic action space enables that operator-based golfers (with `INSERT`, `DELETE`, `REPLACE` actions) would miss.
333
+
334
+ **Drop ceremonial preamble.** *"In this task you will…"* / *"Please carefully consider…"* / *"Your goal is to…"* — gone, every time, with no measurable accuracy cost. The first chunk of most human-written prompts is almost pure decoration.
335
+
336
+ We also saw failure modes worth flagging:
337
+
338
+ - **Mild gibberish convergence on a few adversarial tasks.** A handful of refusal-related tasks pushed the agent toward GCG-style ungrammatical prompts. The leakage penalty caught the worst cases.
339
+ - **Over-compression on tasks where verbose is doing real work.** The 39 losing tasks share a consistent failure mode: the agent compressed too aggressively to match the accuracy Llama reaches with the verbose prompt, because the extra tokens there carry real information rather than decoration.
340
+ - **The 1-token attractor.** Without `MIN_TOKENS_FLOOR`, RL inevitably found 1-2 token prompts exploiting tokenization artifacts. These weren't prompts in any meaningful sense — they were attacks. The floor penalty is non-optional.
341
 
 
 
 
342
  ---
343
 
344
+ ## 8. Try it yourself
345
+
346
+ There's a **[live Gradio demo](https://huggingface.co/spaces/rishabh16196/prompt-golf-demo)** where you pick a task, see the verbose human prompt and the trained agent's compressed prompt side by side, and run either against the same Llama-3.2-3B target on real test inputs. The same UI shows accuracy for both.
347
+
348
+ <!-- IMAGE PLACEHOLDER 3 — Demo screenshot
349
+ Clean shot of the Gradio demo UI:
350
+ - Task selector populated with one of the policy or tough tasks
351
+ - Verbose vs trained prompt panels side by side
352
+ - "Run on Llama-3.2-3B" button
353
+ - Per-input accuracy badges visible
354
+ This is what readers click through to. -->
355
+
356
+ ### Run the env locally
357
 
 
358
  ```bash
359
  git clone https://huggingface.co/spaces/rishabh16196/prompt_golf_env
360
  cd prompt_golf_env
361
  pip install -e . gradio transformers torch
362
 
363
+ # CPU smoke test (mock target, no GPU needed)
364
+ PROMPT_GOLF_TARGET_BACKEND=mock uvicorn server.app:app --port 8000
365
 
366
+ # Real run with the actual Llama target
367
+ PROMPT_GOLF_TARGET_BACKEND=hf \
368
+ PROMPT_GOLF_TARGET_MODEL=meta-llama/Llama-3.2-3B-Instruct \
369
+ uvicorn server.app:app --port 8000
370
  ```
371
 
372
+ ### Use the trained adapter
373
+
374
+ ```python
375
+ from peft import PeftModel
376
+ from transformers import AutoModelForCausalLM, AutoTokenizer
377
+
378
+ base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
379
+ agent = PeftModel.from_pretrained(
380
+     base, "rishabh16196/prompt-golf-qwen-to-llama-nothink"
381
+ )
382
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
383
+
384
+ # Give it a verbose prompt-golf task description; get back a compressed prompt
385
+ ```
386
+
387
+ ### Reproduce the hero training run
388
+
389
  ```bash
390
  PUSH_TO_HUB=your-user/your-repo bash training/hf_job_train.sh
391
+ # ~3h on L40S, pushes adapter + plots + train_metrics + eval JSONLs to your repo
392
  ```
393
 
394
+ ### Hit the env from Python
395
+
396
+ ```python
397
+ from prompt_golf_env import GolfAction, PromptGolfEnv
398
+
399
+ async with PromptGolfEnv(base_url="http://localhost:8000") as env:  # needs an async context, e.g. inside an `async def main()` run via asyncio.run
400
+     result = await env.reset(task="sentiment_basic")
401
+     obs = result.observation
402
+     result = await env.step(GolfAction(prompt="Classify sentiment, one word."))
403
+     print(f"reward={result.reward:.2f} | tokens={result.observation.submitted_prompt_tokens}")
404
+ ```
405
+
406
+ ---
407
+
408
+ ## 9. Why this matters
409
+
410
+ | If you work on… | Prompt Golf gives you… |
411
+ |---|---|
412
+ | **Inference cost in production** | A trained policy that compresses verbose prompts behaviorally — no gradient access to the target needed. Up to 37× compression on real-world policy prompts. |
413
+ | **Capability evaluation** | A black-box minimum-elicitation metric per task per target. Decouples *can the model do X* from *did we find the right prompt*. |
414
+ | **Prompt distillation across targets** | Cross-family training generates a model of the target's response surface. Swap targets, retrain, ship a custom prompt-compressor for your specific deployment. |
415
+ | **Capability elicitation research** | A black-box analog of password-locked-model elicitation ([Greenblatt et al., 2024](https://arxiv.org/abs/2405.19550)). What's the minimum input that surfaces a latent capability? |
416
+ | **Red-teaming / robustness** | Same machinery, different rubric. Adversarial scoring → red-teaming. Refusal rubric → jailbreak hardening. |
417
+ | **LLM ↔ LLM behavioral modeling** | Machine Theory of Mind ([Rabinowitz et al., 2018](https://arxiv.org/abs/1802.07740)) for LLMs as targets. The agent's policy implicitly encodes a model of the target. |
418
+
419
+ ### What this is — and isn't
420
+
421
+ - ✅ **Is** the first open OpenEnv RL environment where the agent learns to write prompts for another LLM.
422
+ - ✅ **Is** a calibrated middle: GCG/RLPrompt-style mechanics, Machine ToM-style framing, and reusable infrastructure.
423
+ - ❌ **Isn't** a generative simulator of LLM behavior — we never touch activations.
424
+ - ❌ **Isn't** a new prompt-optimization algorithm. The algorithmic core is RL+length; the contribution is the framing + reusable env + cross-family experiments.
425
+ - ❌ **Isn't** a claim that we've "solved world modeling for LLMs." Episodes are short; the analogy to Dreamer/JEPA/Genie is structural, not algorithmic.
426
 
427
  ---
428
 
429
+ ## 10. What's next
430
 
431
  Directions we'd love community help on:
432
 
433
+ 1. **More targets.** We have Qwen3-1.7B and Llama-3.2-3B profiled. Phi-3, Mistral, Gemma 2 — what does the per-target prompt look like? Is the trained agent's policy portable, or is it Llama-specific? This is the cross-target transfer experiment that would substantiate the Machine ToM framing.
434
+ 2. **Larger task banks.** 90 hand-crafted tasks is a starting point. Procedural task generation (random format constraints, synthetic policies) would scale this to thousands of holes.
435
+ 3. **Different reward shapes.** The current additive reward is one choice. KL-as-reward (output-distribution matching the verbose prompt's) is another. Each captures a different definition of "good."
436
+ 4. **Real-world deployment study.** Pick an actual production prompt (with permission), train a compressor for it, measure compression-vs-accuracy in shadow traffic. We'd love to hear what breaks and what holds up.
437
+
438
+ If you have a 1000-token prompt that's eating your inference budget, train a compressor for it. **That's the whole point.**
439
 
440
  ---
441
 
442
+ ## Acknowledgments & citations
443
+
444
+ This work draws on four converging research lines:
445
+
446
+ - **[Machine Theory of Mind](https://arxiv.org/abs/1802.07740)** (Rabinowitz et al., 2018) — the conceptual ancestor.
447
+ - **[Red Teaming Language Models with Language Models](https://arxiv.org/abs/2202.03286)** (Perez et al., 2022) — the direct algorithmic ancestor.
448
+ - **[Stress-Testing Capability Elicitation With Password-Locked Models](https://arxiv.org/abs/2405.19550)** (Greenblatt et al., 2024) — the motivation for treating minimum elicitation as a meaningful capability metric.
449
+ - **[AutoPrompt](https://arxiv.org/abs/2010.15980)**, **[GCG](https://arxiv.org/abs/2307.15043)**, **[RLPrompt](https://arxiv.org/abs/2205.12548)**, **[PCRL](https://arxiv.org/abs/2308.08758)** — the algorithmic toolkit.
450
+
451
+ Built for the [OpenEnv Hackathon](https://pytorch.org/event/openenv-ai-hackathon/) (Meta + Hugging Face + PyTorch, India 2026), using TRL GRPO, Hugging Face Jobs, and the OpenEnv spec.
452
+
453
+ ```bibtex
454
+ @misc{promptgolf2026,
455
+ author = {Rishabh},
456
+ title = {Prompt Golf: An OpenEnv RL environment for cross-model prompt compression},
457
+ year = {2026},
458
+ howpublished = {\url{https://huggingface.co/spaces/rishabh16196/prompt_golf_env}}
459
+ }
460
+ ```
461
+
462
+
463
 
464
+ <!-- IMAGE PLACEHOLDER 4 — Closing graphic (optional)
465
+ Either:
466
+ (a) a wordcloud of the actual short prompts the agent
467
+ discovered, sized by frequency across the bank, or
468
+ (b) a clean trophy/golf-ball graphic with the headline
469
+ number "37×" in the foreground.
470
+ Optional. The post lands fine without it. -->