---
title: Prompt Golf Environment Server
emoji: ⛳
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
---
# Prompt Golf
> **Can one LLM learn to whisper to another?**
> An OpenEnv RL environment where the agent's *action* is a prompt and the *reward* is how well that prompt steers a *frozen, different-family* target LLM to do the right thing — minus how long the prompt is.
**The result.** A Qwen3-1.7B agent (LoRA + TRL GRPO, \~3h on a single L40S) learns to write **\~39-token prompts** that retain **80% of the accuracy** of \~94-token human-written prompts on a frozen Llama-3.2-3B target — *cross-family, black-box, learned from outputs alone, no gradient access*. On **63/90 (70%) of tasks the agent's compressed prompt is the best of the three** we evaluated (verbose, untrained agent, trained agent) — cheaper *and* equal-or-better reward. [▶ Try the live demo](https://huggingface.co/spaces/rishabh16196/prompt-golf-demo) · [📝 Read the blog](./BLOG_POST.md)
**Why this matters.** Production LLM systems prepend 1000-token policies to every classification call (creative compliance, content moderation, regulated-comm review). Today the only way to compress them is for a human prompt engineer to iterate by hand. If one LLM can build a behavioral model of another LLM accurately enough — the same way humans model each other — **the LLM can find the minimum policy itself.** Train once, ship the compressor, save 30× per call. Same env generalizes to red-teaming (swap the rubric), capability elicitation (swap the target), and prompt distillation (swap the bank).
**What's in this repo:** a reusable OpenEnv environment, a 90-task bank with 21 scorers, four trained-adapter releases with full eval data, a training pipeline reproducible on HuggingFace Jobs, a live Gradio demo, and a Trackio dashboard with replayed training metrics — all linked below.
## Links
- 🌍 **Env (this Space):** https://huggingface.co/spaces/rishabh16196/prompt_golf_env
- 🎛️ **Live demo (Gradio):** https://huggingface.co/spaces/rishabh16196/prompt-golf-demo
- 📊 **Training dashboard (Trackio):** https://huggingface.co/spaces/rishabh16196/prompt-golf-trackio
- 🐙 **GitHub mirror:** https://github.com/rishabh16196/prompt_golf_env
- 🛠️ **Training pipeline:** [`training/`](https://github.com/rishabh16196/prompt_golf_env/tree/main/training) — full GRPO trainers, eval harness, profilers, HF Jobs launchers
- 📖 **How to train end-to-end:** [`training/TRAINING.md`](./training/TRAINING.md) — step-by-step recipe to reproduce hero + multi-step + evals + plots
- 📝 **Blog post:** [`BLOG_POST.md`](./BLOG_POST.md)
### Trained adapters & data
| Repo | Setup | What's in it |
|---|---|---|
| [`prompt-golf-qwen-to-llama-nothink`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink) | Qwen→Llama, thinking=OFF (**hero**) | adapter + plots + train_metrics + base/trained eval JSONLs + demo CSV |
| [`prompt-golf-qwen-to-llama`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama) | Qwen→Llama, thinking=ON | same artifacts (A/B variant) |
| [`prompt-golf-grpo-1.5b`](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b) | Qwen→Qwen same-family (control) | same artifacts |
| [`prompt-golf-multistep-llama`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama) | Qwen→Llama, **multi-turn (3 turns)** | adapter + train_metrics + base/trained eval JSONLs |
### Training pipeline ([`training/`](https://github.com/rishabh16196/prompt_golf_env/tree/main/training))
> **📖 Reproduction recipe: [`training/TRAINING.md`](./training/TRAINING.md)** — end-to-end steps for hero + multi-step training, evals, plots, demo CSV, Trackio replay.
| File | Role |
|---|---|
| [`training/TRAINING.md`](./training/TRAINING.md) | Step-by-step reproduction recipe (start here) |
| [`training/train_grpo.py`](./training/train_grpo.py) | TRL GRPO single-step trainer (the hero recipe) |
| [`training/train_grpo_multistep.py`](./training/train_grpo_multistep.py) | Trajectory-level GRPO for 3-turn episodes |
| [`training/eval_before_after.py`](./training/eval_before_after.py) | base + trained-adapter eval harness |
| [`training/profile_baseline.py`](./training/profile_baseline.py) | per-task target-capability profiler |
| [`training/build_before_after_csv.py`](./training/build_before_after_csv.py) | merge eval JSONLs into the demo CSV |
| [`training/make_plots.py`](./training/make_plots.py) | reward / length / breakdown curves from `train_metrics.jsonl` |
| [`training/replay_to_trackio.py`](./training/replay_to_trackio.py) | post-hoc replay of `train_metrics.jsonl` into the Trackio dashboard Space |
| [`training/hf_job_train.sh`](./training/hf_job_train.sh) / [`hf_job_train_multistep.sh`](./training/hf_job_train_multistep.sh) / [`hf_job_eval.sh`](./training/hf_job_eval.sh) / [`hf_job_profile.sh`](./training/hf_job_profile.sh) | HuggingFace Jobs launchers |
---
## Results — the table to look at
All numbers below are on the same 90-task bank, evaluated against frozen Llama-3.2-3B. Verbose = human-written; base/hero/multistep = agent-generated.
| Setup | Mean accuracy | Reward vs base | Mean tokens | Wins per-task | Use when |
|---|---|---|---|---|---|
| **Verbose** (human-written) | 0.631 | — | 94.2 | (the bar) | You don't have an agent and don't mind paying full token cost. |
| **Base** (Qwen3-1.7B, no adapter) | 0.464 | — | 37.5 | 4 / 90 | Almost never. Untrained Qwen3 over-thinks the task. |
| **Hero** (1-step trained) | 0.506 | +0.381 | **38.5** | **63 / 90** | **Default.** Cheapest, wins most often, \~40% of the verbose token count at 80% of its accuracy. |
| **Multistep** (3-turn trained) | **0.576** | **+0.440** | 43.7 | 23 / 90 | Nuanced classifiers (`classification_tough` is its sweet spot). When the +7pp accuracy matters more than +5 tokens. |
> **Headline:** Hero retains **80% of verbose accuracy at \~40% of the tokens** and wins per-task on 70% of tasks. Multistep gives back compression for accuracy — only worth it for nuanced classification.
### Training curves (hero)
*(Training curve images not included here. The reward, prompt-length, and reward-breakdown curves ship with the hero adapter repo and can be regenerated with `training/make_plots.py` from `train_metrics.jsonl`.)*
📊 **Demo CSV:** [`evals/qwen_to_llama_demo.csv`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/blob/main/evals/qwen_to_llama_demo.csv) — every task × verbose / untrained / hero prompts side by side.
### Training curves (multistep)
*(Training curve images not included here. Regenerate them with `training/make_plots.py` from the multistep repo's `train_metrics.jsonl`.)*
150 steps × 8 trajectories × 3 turns. Warm-started from the hero adapter, so reward starts at +0.58 and climbs to +0.89 by the final step. KL stays under 0.03 against the warm-start LoRA snapshot.
### Per-category breakdown (hero vs multistep, 90 tasks)
For each task we pick the model with the best reward (ε = 0.05 reward margin counts as a tie, broken in favour of the cheaper option).
| Category | n | Hero | Multi | Use |
|---|---|---|---|---|
| `classification_tough` | 10 | 4 | **6** | **multistep** — only category where multi clearly leads |
| `format`, `meta`, `arithmetic` | 13 | 7 | 5 | tie → **hero** (cheaper) |
| `classification`, `extraction`, `persona`, `style`, `translation`, `refusal`, `discrimination` | 19 | **19** | 0 | **hero** — clean sweep |
| `extraction_tough` | 10 | **7** | 2 | **hero** |
| `reasoning_tough` | 10 | **7** | 3 | **hero** (multi wins big *when* it does, less often) |
| `persona_tough` | 8 | **7** | 1 | **hero** |
| `format_tough` | 8 | **6** | 2 | **hero** |
| `adversarial_tough` | 6 | **5** | 1 | **hero** |
| `reasoning` | 4 | **3** | 1 | **hero** |
| `policy_compression` | 3 | **2** | 1 | **hero** (multi gets 1 dramatic unlock — `policy_content_moderation` 0.00 → 0.67) |
Multistep wins on aggregate (+7pp accuracy) by landing **a small number of dramatic 0→1 unlocks on tough tasks** (e.g. `tough_policy_stance` 0→1.00, `tough_syllogism_check` 0→1.00). Hero wins more often head-to-head.
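The selection rule above, as a small sketch (the per-task rewards and token counts come from the eval JSONLs; this is a restatement of the rule for clarity, not the repo's eval code):

```python
EPSILON = 0.05  # reward margin that counts as a tie

def pick_winner(hero: dict, multi: dict) -> str:
    """Best reward wins; rewards within EPSILON count as a tie, broken toward the cheaper prompt."""
    if abs(hero["reward"] - multi["reward"]) <= EPSILON:
        return "hero" if hero["tokens"] <= multi["tokens"] else "multistep"
    return "hero" if hero["reward"] > multi["reward"] else "multistep"

print(pick_winner({"reward": 0.82, "tokens": 26}, {"reward": 0.80, "tokens": 35}))  # hero: tie, so cheaper wins
```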
### Side-by-side: verbose vs hero vs multistep
**`sentiment_basic`** *(classification — clean hero win)*
| | Tokens | Acc | Prompt |
|---|---:|---:|---|
| Verbose | 27 | 0.83 | *"For each input review, output exactly one of: positive, negative, neutral. Output the label only — no punctuation, no explanation."* |
| **Hero** | **18** | **1.00** | *"Classify the input review as positive, negative, or neutral. Output only the label."* |
| Multistep | 45 | 1.00 | *"Given the input review, classify it as positive, negative, or neutral. Output only the label. Use the meta-llama/Llama-3.2-3B-Instruct model. Token budget: 50."* |
**`tough_policy_stance`** *(classification_tough — multistep unlock)*
| | Tokens | Acc | Prompt |
|---|---:|---:|---|
| Verbose | 136 | 0.83 | *"Classify the speaker's stance on the policy proposal mentioned in the quote. Output exactly one label, lowercase, hyphenated..."* |
| Hero | 26 | 0.00 | *"classify the speaker's stance on the policy proposal in the quote as support, oppose, neutral, conditional-support, or conditional-oppose"* |
| **Multistep** | 35 | **1.00** | *"Classify the speaker's stance on the policy proposal. Output exactly one label: support, oppose, neutral, conditional-support, or conditional-oppose. Do not include explanations."* |
**`format_uppercase`** *(format — multistep regression)*
| | Tokens | Acc | Prompt |
|---|---:|---:|---|
| Verbose | 25 | 1.00 | *"Repeat the input sentence back in ALL UPPERCASE letters, with no other changes. Do not add quotes or commentary."* |
| **Hero** | **9** | 1.00 | *"Format the input sentence to all uppercase letters."* |
| Multistep | 24 | 1.00 | *"Format the input sentence to all uppercase, without any changes except capitalization. Output only the uppercase version of the input."* |
📊 **Eval JSONLs (multistep):** [`evals/eval_base.jsonl`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama/blob/main/evals/eval_base.jsonl) · [`evals/eval_trained.jsonl`](https://huggingface.co/rishabh16196/prompt-golf-multistep-llama/blob/main/evals/eval_trained.jsonl)
---
## How it works
Each episode is one task:
1. `reset(task=...)` → env returns task description + 3 visible train examples + token budget + target's empty-prompt baseline.
2. Agent outputs a **prompt string** as its action.
3. Env prepends the prompt to 6 held-out test inputs, runs the **frozen target LLM**, scores each output with the task scorer.
4. `reward = raw_task_score − 0.5·baseline − 0.002·tokens − leakage_overlap²`, clipped to `[−0.5, 1.3]`.
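A minimal sketch of that reward composition (illustrative only; the actual additive composition, including how `leakage_overlap` and the baseline are computed, lives in `server/rubrics.py`):

```python
def golf_reward(raw_task_score: float, baseline: float,
                prompt_tokens: int, leakage_overlap: float) -> float:
    """Illustrative sketch of the reward formula above, not the repo's implementation."""
    reward = (
        raw_task_score            # mean scorer value over the held-out test examples
        - 0.5 * baseline          # only improvement over the target's empty-prompt baseline counts
        - 0.002 * prompt_tokens   # length penalty: every prompt token costs reward
        - leakage_overlap ** 2    # quadratic penalty for copying held-out inputs into the prompt
    )
    return max(-0.5, min(1.3, reward))  # clip to [-0.5, 1.3]

# e.g. a perfect score with a 0.3 empty-prompt baseline and a 40-token, leak-free prompt:
print(golf_reward(1.0, 0.3, 40, 0.0))  # 0.77
```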
**Multi-turn is supported.** With `turn_limit > 1`, the env splits the test pool into a 2-example *feedback slice* (revealed across turns with target outputs) and a 4-example *scoring slice* (only the final-turn prompt is judged). The agent sees its prior prompts + per-example target outputs in the user message, so it can debug across turns without leaking the inputs that grade it. We trained a 3-turn variant — see the [per-category breakdown above](#per-category-breakdown-hero-vs-multistep-90-tasks) for results.
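As a rough sketch, a 3-turn episode looks like this from the client side (field names such as `result.done`, and how `turn_limit` is configured, are assumptions here; check `GolfObservation` in `models.py` and the server config for the real interface):

```python
import asyncio

from prompt_golf_env import GolfAction, PromptGolfEnv

async def run_episode(task: str, prompts: list[str]) -> float:
    # Assumes the server is configured with turn_limit=3 for this task.
    async with PromptGolfEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(task=task)
        for prompt in prompts:                     # one prompt submission per turn
            result = await env.step(GolfAction(prompt=prompt))
            if result.done:                        # final turn: scored on the 4-example scoring slice
                break
            # Intermediate turns: the observation carries the prior prompts and the target's
            # outputs on the 2-example feedback slice, so the agent can revise before being graded.
        return result.reward

reward = asyncio.run(run_episode("sentiment_basic", [
    "Classify sentiment.",
    "Classify the review as positive, negative, or neutral.",
    "Classify the review as positive, negative, or neutral. Output only the label.",
]))
print(f"final reward: {reward:.2f}")
```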
90 hand-crafted tasks across 4 tiers: 20 v1 (easy/medium classification, extraction, format), 15 v2 (hard — acrostic, YAML depth, persona), 52 tough (logical fallacy ID, FINRA review, Yoda-with-constraint), 3 long-context policy compression (MSN ad policy, content moderation, broker-dealer review).
---
## Scorers
Each task picks one of 21 scorers from [`server/scorer.py`](./server/scorer.py). The scorer takes the target's output + the task's expected output and returns a value in `[0, 1]`. Per-task score is the mean across held-out test examples.
| Family | Scorers | What they check |
|---|---|---|
| **Exact / membership** | `exact_label`, `contains_label`, `contains_all_substrings`, `uppercase_match` | Closed-vocabulary classifiers; "must include these substrings"; case-strict rewrites |
| **Numeric** | `numeric_match`, `word_count_exact` | Last numeric token within tolerance; word count exactly N |
| **JSON / YAML** | `json_contains_fields`, `valid_json_object`, `json_key_order`, `valid_yaml_depth` | Structural extraction; required keys/values; key ordering; YAML nesting depth |
| **Format-strict** | `three_bullets`, `acrostic_match`, `avoid_letter`, `ends_question`, `terminal_output_pattern` | Exactly 3 bullets; first letters spell a word; output avoids a letter; ends with `?`; terminal-session shape |
| **Multi-step / language** | `stepwise_math`, `translation_match`, `selective_translate` | Numbered steps + numeric answer; token-F1 vs reference; partial-translation rules |
| **Safety** | `refusal_score` | Detects whether the output is a refusal (matches expected refuse / comply label) |
| **LLM judge** (Qwen3-8B 8-bit) | `judge_criteria`, `judge_vs_expected` | Free-form persona / reasoning / Yoda-syntax tasks where regex can't grade. Judge returns a score on the first line; deterministic decoding. |
The scorer is fixed per task and never seen by the agent — the agent has to infer from train examples + task description what gets graded. New scorers are added to the `SCORERS` registry at the bottom of `scorer.py`.
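For illustration, adding a scorer looks roughly like this, assuming the contract described above (target output plus expected output in, a float in `[0, 1]` out) and a plain name-to-function `SCORERS` mapping; the real registry signature is in `server/scorer.py`:

```python
# Hypothetical scorer for illustration; not one of the shipped 21.
def starts_with_expected(output: str, expected: str) -> float:
    """1.0 if the target's output begins with the expected string (case-insensitive), else 0.0."""
    return 1.0 if output.strip().lower().startswith(expected.strip().lower()) else 0.0

# Register it at the bottom of server/scorer.py, then reference the name from a task in tasks*.py.
SCORERS["starts_with_expected"] = starts_with_expected
```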
---
## Other variants we ran
- **Qwen → Qwen same-family control** ([`prompt-golf-grpo-1.5b`](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b)) — looks great on win-rate (70/87 vs verbose), but the verbose-prompt ceiling on the Qwen target is only 0.15, so there is little headroom to beat. Cross-family Llama is the harder bar.
- **Thinking-ON A/B** ([`prompt-golf-qwen-to-llama`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama)) — `<think>...</think>` chat template enabled. Loses to OFF on reward (+0.379 vs +0.426) at +30% tokens. We ship OFF as default.
---
## Prior work
This env stands on four research lines:
| Line | Paper | What we use |
|---|---|---|
| **Machine Theory of Mind** | [Rabinowitz et al., 2018](https://arxiv.org/abs/1802.07740) (ToMnet) | Conceptual frame: one model learns to model another from observed outputs alone, no internals access. |
| **LLM-on-LLM red teaming** | [Perez et al., 2022](https://arxiv.org/abs/2202.03286) | Direct algorithmic ancestor — same RL-on-LLM-prompts loop. We swap adversarial reward for task-success + length. |
| **Capability elicitation** | [Greenblatt et al., 2024](https://arxiv.org/abs/2405.19550) (password-locked models) | "Minimum input that surfaces a latent capability" as a measurable RL objective. |
| **Prompt optimization** | [AutoPrompt](https://arxiv.org/abs/2010.15980) · [GCG](https://arxiv.org/abs/2307.15043) · [RLPrompt](https://arxiv.org/abs/2205.12548) · [PCRL](https://arxiv.org/abs/2308.08758) | Algorithmic toolkit (gradient-search over discrete tokens, RL-trained policies for prompt search). |
We're not the first to compress prompts with RL. We're trying to be the first place where you can *go to do this experiment* — fork the env, swap in your target, run it, get a number.
---
## Quick start
```python
import asyncio

from prompt_golf_env import GolfAction, PromptGolfEnv

async def main() -> None:
    async with PromptGolfEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(task="sentiment_basic")
        obs = result.observation  # task description, train examples, token budget, baseline
        result = await env.step(GolfAction(prompt="Classify sentiment, one word."))
        print(f"reward={result.reward:.2f} raw={result.observation.raw_task_score:.2f} "
              f"tokens={result.observation.submitted_prompt_tokens}")

asyncio.run(main())
```
Run the env locally:
```bash
PROMPT_GOLF_TARGET_BACKEND=mock uvicorn server.app:app --port 8000 # CPU smoke test
PROMPT_GOLF_TARGET_BACKEND=hf PROMPT_GOLF_TARGET_MODEL=meta-llama/Llama-3.2-3B-Instruct \
uvicorn server.app:app --port 8000 # real GPU
```
Reproduce the hero training run on HuggingFace Jobs:
```bash
PUSH_TO_HUB=your-user/your-repo bash training/hf_job_train.sh # ~3h on L40S
```
---
## Files
```
prompt_golf_env/
  models.py                    # GolfAction, GolfObservation, constants
  client.py                    # PromptGolfEnv (EnvClient)
  inference.py                 # OpenEnv-spec inference script
pyproject.toml
Dockerfile                     # HF Spaces image
server/
  app.py                       # FastAPI app
  prompt_golf_environment.py   # core Env
  target_model.py              # frozen-target wrapper (HF + mock)
  judge.py                     # Qwen3-8B 8-bit judge
  scorer.py                    # 21+ scorers (structural + LLM judge)
  rubrics.py                   # additive reward composition
  tasks.py / tasks_v2.py / tasks_tough.py / tasks_policy.py   # 90-task bank
training/                      # see Links → Training pipeline
ui/ + space-demo/              # Gradio demos
BLOG_POST.md                   # writeup
```