---
title: Prompt Golf Environment Server
emoji: ⛳
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Prompt Golf

> **Can one LLM learn to whisper to another?**
> An OpenEnv RL environment where the agent's *action* is a prompt and the *reward* is how well that prompt steers a *frozen, different-family* target LLM to do the right thing — minus how long the prompt is.

**The result.** A Qwen3-1.7B agent (LoRA + TRL GRPO, \~3h on a single L40S) learns to write **\~39-token prompts** that retain **80% of the accuracy** of \~94-token human-written prompts on a frozen Llama-3.2-3B target — *cross-family, black-box, learned from outputs alone, no gradient access*. On **63/90 (70%) of tasks the agent's compressed prompt is the best of the three** we evaluated (verbose, untrained agent, trained agent) — cheaper *and* equal-or-better reward.

[▶ Try the live demo](https://huggingface.co/spaces/rishabh16196/prompt-golf-demo) · [📝 Read the blog](./BLOG_POST.md)

**Why this matters.** Production LLM systems prepend 1000-token policies to every classification call (creative compliance, content moderation, regulated-comm review). Today the only way to compress them is for a human prompt engineer to iterate by hand. If one LLM can build a behavioral model of another LLM accurately enough — the same way humans model each other — **the LLM can find the minimum policy itself.** Train once, ship the compressor, save 30× per call. The same env generalizes to red-teaming (swap the rubric), capability elicitation (swap the target), and prompt distillation (swap the bank).

**What's in this repo:** a reusable OpenEnv environment, a 90-task bank with 21 scorers, four trained-adapter releases with full eval data, a training pipeline reproducible on HuggingFace Jobs, a live Gradio demo, and a Trackio dashboard with replayed training metrics — all linked below.
## Links

- 🌍 **Env (this Space):** https://huggingface.co/spaces/rishabh16196/prompt_golf_env
- 🎛️ **Live demo (Gradio):** https://huggingface.co/spaces/rishabh16196/prompt-golf-demo
- 📊 **Training dashboard (Trackio):** https://huggingface.co/spaces/rishabh16196/prompt-golf-trackio
- 🐙 **GitHub mirror:** https://github.com/rishabh16196/prompt_golf_env
- 🛠️ **Training pipeline:** [`training/`](https://github.com/rishabh16196/prompt_golf_env/tree/main/training) — full GRPO trainers, eval harness, profilers, HF Jobs launchers
- 📖 **How to train end-to-end:** [`training/TRAINING.md`](./training/TRAINING.md) — step-by-step recipe to reproduce hero + multi-step + evals + plots
- 📝 **Blog post:** [`BLOG_POST.md`](./BLOG_POST.md)

### Trained adapters & data

| Repo | Setup | What's in it |
|---|---|---|
| [`prompt-golf-qwen-to-llama-nothink`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink) | Qwen→Llama, thinking=OFF (**hero**) | adapter + plots + train_metrics + base/trained eval JSONLs + demo CSV |
| [`prompt-golf-qwen-to-llama`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama) | Qwen→Llama, thinking=ON | same artifacts (A/B variant) |
| [`prompt-golf-grpo-1.5b`](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b) | Qwen→Qwen same-family (control) | same artifacts |
| [`prompt-golf-multistep-llama`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama) | Qwen→Llama, **multi-turn (3 turns)** | adapter + train_metrics + base/trained eval JSONLs |

### Training pipeline ([`training/`](https://github.com/rishabh16196/prompt_golf_env/tree/main/training))

> **📖 Reproduction recipe: [`training/TRAINING.md`](./training/TRAINING.md)** — end-to-end steps for hero + multi-step training, evals, plots, demo CSV, Trackio replay.
| File | Role |
|---|---|
| [`training/TRAINING.md`](./training/TRAINING.md) | Step-by-step reproduction recipe (start here) |
| [`training/train_grpo.py`](./training/train_grpo.py) | TRL GRPO single-step trainer (the hero recipe) |
| [`training/train_grpo_multistep.py`](./training/train_grpo_multistep.py) | Trajectory-level GRPO for 3-turn episodes |
| [`training/eval_before_after.py`](./training/eval_before_after.py) | Base + trained-adapter eval harness |
| [`training/profile_baseline.py`](./training/profile_baseline.py) | Per-task target-capability profiler |
| [`training/build_before_after_csv.py`](./training/build_before_after_csv.py) | Merge eval JSONLs into the demo CSV |
| [`training/make_plots.py`](./training/make_plots.py) | Reward / length / breakdown curves from `train_metrics.jsonl` |
| [`training/replay_to_trackio.py`](./training/replay_to_trackio.py) | Post-hoc replay of `train_metrics.jsonl` into the Trackio dashboard Space |
| [`training/hf_job_train.sh`](./training/hf_job_train.sh) / [`hf_job_train_multistep.sh`](./training/hf_job_train_multistep.sh) / [`hf_job_eval.sh`](./training/hf_job_eval.sh) / [`hf_job_profile.sh`](./training/hf_job_profile.sh) | HuggingFace Jobs launchers |

---

## Results — the table to look at

All numbers below are on the same 90-task bank, evaluated against frozen Llama-3.2-3B. Verbose = human-written; base/hero/multistep = agent-generated.

| Setup | Mean accuracy | Reward vs base | Mean tokens | Per-task wins | Use when |
|---|---|---|---|---|---|
| **Verbose** (human-written) | 0.631 | — | 94.2 | (the bar) | You don't have an agent and don't mind paying full token cost. |
| **Base** (Qwen3-1.7B, no adapter) | 0.464 | — | 37.5 | 4 / 90 | Almost never. Untrained Qwen3 over-thinks the task. |
| **Hero** (1-step trained) | 0.506 | +0.381 | **38.5** | **63 / 90** | **Default.** Cheapest, wins most often, \~3× shorter than verbose at 80% of its accuracy. |
| **Multistep** (3-turn trained) | **0.576** | **+0.440** | 43.7 | 23 / 90 | Nuanced classifiers (`classification_tough` is its sweet spot). When the +6pp accuracy matters more than +5 tokens. |

> **Headline:** Hero retains **80% of verbose accuracy at \~40% of the tokens** and wins per-task on 70% of tasks. Multistep gives back compression for accuracy — only worth it for nuanced classification.

### Training curves (hero)

![reward curve](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/resolve/main/plots/reward_curve.png)
![length curve](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/resolve/main/plots/length_curve.png)
![reward-component breakdown](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/resolve/main/plots/breakdown.png)

📊 **Demo CSV:** [`evals/qwen_to_llama_demo.csv`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/blob/main/evals/qwen_to_llama_demo.csv) — every task × verbose / untrained / hero prompts side by side.

### Training curves (multistep)

![multistep reward curve](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama/resolve/main/plots/reward_curve.png)
![multistep length curve](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama/resolve/main/plots/length_curve.png)
![multistep breakdown](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama/resolve/main/plots/breakdown.png)

150 steps × 8 trajectories × 3 turns. Warm-started from the hero adapter, so reward starts at +0.58 and climbs to +0.89 by the final step. KL stays under 0.03 against the warm-start LoRA snapshot.

### Per-category breakdown (hero vs multistep, 90 tasks)

For each task we pick the model with the best reward (ε = 0.05 reward margin counts as a tie, broken in favour of the cheaper option).
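This selection rule can be sketched as a small function. A sketch only, assuming each model's mean reward per task is known; `best_model` and `cost_rank` are illustrative names, not code from this repo:

```python
def best_model(rewards: dict[str, float], cost_rank: list[str], eps: float = 0.05) -> str:
    """Pick the per-task winner: highest reward, where any model within
    eps of the best counts as tied, and ties go to the cheaper option."""
    top = max(rewards.values())
    # models within the eps reward margin of the best
    tied = [m for m, r in rewards.items() if top - r <= eps]
    # cost_rank lists models cheapest-first, e.g. ["hero", "multistep", "verbose"]
    return min(tied, key=cost_rank.index)
```

With ε = 0.05, a multistep reward only 0.03 above hero's still counts as a tie, so hero wins as the cheaper option.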
| Category | n | Hero | Multi | Use |
|---|---|---|---|---|
| `classification_tough` | 10 | 4 | **6** | **multistep** — only category where multi clearly leads |
| `format`, `meta`, `arithmetic` | 13 | 7 | 5 | tie → **hero** (cheaper) |
| `classification`, `extraction`, `persona`, `style`, `translation`, `refusal`, `discrimination` | 19 | **19** | 0 | **hero** — clean sweep |
| `extraction_tough` | 10 | **7** | 2 | **hero** |
| `reasoning_tough` | 10 | **7** | 3 | **hero** (multi wins big *when* it does, less often) |
| `persona_tough` | 8 | **7** | 1 | **hero** |
| `format_tough` | 8 | **6** | 2 | **hero** |
| `adversarial_tough` | 6 | **5** | 1 | **hero** |
| `reasoning` | 4 | **3** | 1 | **hero** |
| `policy_compression` | 3 | **2** | 1 | **hero** (multi gets 1 dramatic unlock — `policy_content_moderation` 0.00 → 0.67) |

Multistep wins on aggregate (+7pp accuracy) by landing **a small number of dramatic 0→1 unlocks on tough tasks** (e.g. `tough_policy_stance` 0→1.00, `tough_syllogism_check` 0→1.00). Hero wins more often head-to-head.

### Side-by-side: verbose vs hero vs multistep

**`sentiment_basic`** *(classification — clean hero win)*

| | Tokens | Acc | Prompt |
|---|---:|---:|---|
| Verbose | 27 | 0.83 | *"For each input review, output exactly one of: positive, negative, neutral. Output the label only — no punctuation, no explanation."* |
| **Hero** | **18** | **1.00** | *"Classify the input review as positive, negative, or neutral. Output only the label."* |
| Multistep | 45 | 1.00 | *"Given the input review, classify it as positive, negative, or neutral. Output only the label. Use the meta-llama/Llama-3.2-3B-Instruct model. Token budget: 50."* |

**`tough_policy_stance`** *(classification_tough — multistep unlock)*

| | Tokens | Acc | Prompt |
|---|---:|---:|---|
| Verbose | 136 | 0.83 | *"Classify the speaker's stance on the policy proposal mentioned in the quote. Output exactly one label, lowercase, hyphenated..."* |
| Hero | 26 | 0.00 | *"classify the speaker's stance on the policy proposal in the quote as support, oppose, neutral, conditional-support, or conditional-oppose"* |
| **Multistep** | 35 | **1.00** | *"Classify the speaker's stance on the policy proposal. Output exactly one label: support, oppose, neutral, conditional-support, or conditional-oppose. Do not include explanations."* |

**`format_uppercase`** *(format — multistep regression)*

| | Tokens | Acc | Prompt |
|---|---:|---:|---|
| Verbose | 25 | 1.00 | *"Repeat the input sentence back in ALL UPPERCASE letters, with no other changes. Do not add quotes or commentary."* |
| **Hero** | **9** | 1.00 | *"Format the input sentence to all uppercase letters."* |
| Multistep | 24 | 1.00 | *"Format the input sentence to all uppercase, without any changes except capitalization. Output only the uppercase version of the input."* |

📊 **Eval JSONLs (multistep):** [`evals/eval_base.jsonl`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama/blob/main/evals/eval_base.jsonl) · [`evals/eval_trained.jsonl`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama/blob/main/evals/eval_trained.jsonl)

---

## How it works

Each episode is one task:

1. `reset(task=...)` → env returns task description + 3 visible train examples + token budget + target's empty-prompt baseline.
2. Agent outputs a **prompt string** as its action.
3. Env prepends the prompt to 6 held-out test inputs, runs the **frozen target LLM**, scores each output with the task scorer.
4. `reward = raw_task_score − 0.5·baseline − 0.002·tokens − leakage_overlap²`, clipped to `[−0.5, 1.3]`.

**Multi-turn is supported.** With `turn_limit > 1`, the env splits the test pool into a 2-example *feedback slice* (revealed across turns with target outputs) and a 4-example *scoring slice* (only the final-turn prompt is judged).
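The single-step reward in step 4 can be written as a tiny function. A minimal sketch: the constants are the ones in the formula, but the function name and signature are illustrative, not the env's actual API:

```python
def golf_reward(raw_task_score: float, baseline: float,
                prompt_tokens: int, leakage_overlap: float) -> float:
    """Sketch of the episode reward (illustrative names, constants from above)."""
    # raw task score, discounted by the target's empty-prompt baseline,
    # a per-token length penalty, and a squared train-example-leakage penalty
    r = raw_task_score - 0.5 * baseline - 0.002 * prompt_tokens - leakage_overlap ** 2
    return max(-0.5, min(1.3, r))  # clip to [-0.5, 1.3]
```

The squared leakage term makes copying train examples into the prompt sharply unprofitable, while the 0.002/token term supplies the steady pressure toward shorter prompts.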
The agent sees its prior prompts + per-example target outputs in the user message, so it can debug across turns without leaking the inputs that grade it. We trained a 3-turn variant — see the [per-category breakdown above](#per-category-breakdown-hero-vs-multistep-90-tasks) for results.

**The task bank.** 90 hand-crafted tasks across 4 tiers: 20 v1 (easy/medium classification, extraction, format), 15 v2 (hard — acrostic, YAML depth, persona), 52 tough (logical fallacy ID, FINRA review, Yoda-with-constraint), 3 long-context policy compression (MSN ad policy, content moderation, broker-dealer review).

---

## Scorers

Each task picks one of 21 scorers from [`server/scorer.py`](./server/scorer.py). The scorer takes the target's output + the task's expected output and returns a value in `[0, 1]`. Per-task score is the mean across held-out test examples.

| Family | Scorers | What they check |
|---|---|---|
| **Exact / membership** | `exact_label`, `contains_label`, `contains_all_substrings`, `uppercase_match` | Closed-vocabulary classifiers; "must include these substrings"; case-strict rewrites |
| **Numeric** | `numeric_match`, `word_count_exact` | Last numeric token within tolerance; word count exactly N |
| **JSON / YAML** | `json_contains_fields`, `valid_json_object`, `json_key_order`, `valid_yaml_depth` | Structural extraction; required keys/values; key ordering; YAML nesting depth |
| **Format-strict** | `three_bullets`, `acrostic_match`, `avoid_letter`, `ends_question`, `terminal_output_pattern` | Exactly 3 bullets; first letters spell a word; output avoids a letter; ends with `?`; terminal-session shape |
| **Multi-step / language** | `stepwise_math`, `translation_match`, `selective_translate` | Numbered steps + numeric answer; token-F1 vs reference; partial-translation rules |
| **Safety** | `refusal_score` | Detects whether the output is a refusal (matches expected refuse / comply label) |
| **LLM judge** (Qwen3-8B 8-bit) | `judge_criteria`, `judge_vs_expected` | Free-form persona / reasoning / Yoda-syntax tasks where regex can't grade. Judge returns a score on the first line; deterministic decoding. |

The scorer is fixed per task and never seen by the agent — the agent has to infer from train examples + task description what gets graded. New scorers are registered by adding them to the `SCORERS` registry at the bottom of `scorer.py`.

---

## Other variants we ran

- **Qwen → Qwen same-family control** ([`prompt-golf-grpo-1.5b`](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b)) — looks great on win-rate (70/87 vs verbose) but the verbose-Qwen ceiling is only 0.15. Cross-family Llama is the harder bar.
- **Thinking-ON A/B** ([`prompt-golf-qwen-to-llama`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama)) — `...` chat template enabled. Loses to OFF on reward (+0.379 vs +0.426) at +30% tokens. We ship OFF as default.

---

## Prior work

This env stands on four research lines:

| Line | Paper | What we use |
|---|---|---|
| **Machine Theory of Mind** | [Rabinowitz et al., 2018](https://arxiv.org/abs/1802.07740) (ToMnet) | Conceptual frame: one model learns to model another from observed outputs alone, no internals access. |
| **LLM-on-LLM red teaming** | [Perez et al., 2022](https://arxiv.org/abs/2202.03286) | Direct algorithmic ancestor — same RL-on-LLM-prompts loop. We swap adversarial reward for task-success + length. |
| **Capability elicitation** | [Greenblatt et al., 2024](https://arxiv.org/abs/2405.19550) (password-locked models) | "Minimum input that surfaces a latent capability" as a measurable RL objective. |
| **Prompt optimization** | [AutoPrompt](https://arxiv.org/abs/2010.15980) · [GCG](https://arxiv.org/abs/2307.15043) · [RLPrompt](https://arxiv.org/abs/2205.12548) · [PCRL](https://arxiv.org/abs/2308.08758) | Algorithmic toolkit (gradient search over discrete tokens, RL-trained policies for prompt search). |

We're not the first to compress prompts with RL.
We're trying to be the first place where you can *go to do this experiment* — fork the env, swap in your target, run it, get a number.

---

## Quick start

```python
import asyncio

from prompt_golf_env import GolfAction, PromptGolfEnv


async def main():
    async with PromptGolfEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(task="sentiment_basic")
        obs = result.observation
        result = await env.step(GolfAction(prompt="Classify sentiment, one word."))
        print(f"reward={result.reward:.2f} raw={result.observation.raw_task_score:.2f} "
              f"tokens={result.observation.submitted_prompt_tokens}")

asyncio.run(main())
```

Run the env locally:

```bash
PROMPT_GOLF_TARGET_BACKEND=mock uvicorn server.app:app --port 8000   # CPU smoke test

PROMPT_GOLF_TARGET_BACKEND=hf PROMPT_GOLF_TARGET_MODEL=meta-llama/Llama-3.2-3B-Instruct \
  uvicorn server.app:app --port 8000                                 # real GPU
```

Reproduce the hero training run on HuggingFace Jobs:

```bash
PUSH_TO_HUB=your-user/your-repo bash training/hf_job_train.sh   # ~3h on L40S
```

---

## Files

```
prompt_golf_env/
  models.py        # GolfAction, GolfObservation, constants
  client.py        # PromptGolfEnv (EnvClient)
  inference.py     # OpenEnv-spec inference script
pyproject.toml
Dockerfile         # HF Spaces image
server/
  app.py                       # FastAPI app
  prompt_golf_environment.py   # core Env
  target_model.py              # frozen-target wrapper (HF + mock)
  judge.py                     # Qwen3-8B 8-bit judge
  scorer.py                    # 21+ scorers (structural + LLM judge)
  rubrics.py                   # additive reward composition
  tasks.py / tasks_v2.py / tasks_tough.py / tasks_policy.py   # 90-task bank
training/          # see Links → Training pipeline
ui/ + space-demo/  # Gradio demos
BLOG_POST.md       # writeup
```
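To close, here is the scorer contract from the Scorers section made concrete: target output and expected output in, a value in `[0, 1]` out, averaged over held-out test examples. This is an illustrative sketch only, not the code in `server/scorer.py`; `exact_label` here merely mimics the registered scorer of the same name:

```python
def exact_label(output: str, expected: str) -> float:
    # full credit only when the normalized output equals the expected label
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def task_score(outputs: list[str], expected: list[str], scorer=exact_label) -> float:
    # per-task score = mean scorer value over the held-out test examples
    return sum(scorer(o, e) for o, e in zip(outputs, expected)) / len(outputs)
```

Any callable with this shape can be dropped into the `SCORERS` registry, which is what makes swapping in your own rubric a one-function change.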