---
title: Prompt Golf Environment Server
emoji: ⛳
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
---
# Prompt Golf
Can one LLM learn to whisper to another? An OpenEnv RL environment where the agent's action is a prompt and the reward is how well that prompt steers a frozen, different-family target LLM to do the right thing — minus how long the prompt is.
The result. A Qwen3-1.7B agent (LoRA + TRL GRPO, ~3h on a single L40S) learns to write ~39-token prompts that retain 80% of the accuracy of ~94-token human-written prompts on a frozen Llama-3.2-3B target — cross-family, black-box, learned from outputs alone, no gradient access. On 63/90 (70%) of tasks the agent's compressed prompt is the best of the three we evaluated (verbose, untrained agent, trained agent) — cheaper and equal-or-better reward. ▶ Try the live demo · 📝 Read the blog
Why this matters. Production LLM systems prepend 1000-token policies to every classification call (creative compliance, content moderation, regulated-comm review). Today the only way to compress them is for a human prompt engineer to iterate by hand. If one LLM can build a behavioral model of another LLM accurately enough — the same way humans model each other — the LLM can find the minimum policy itself. Train once, ship the compressor, save 30× per call. Same env generalizes to red-teaming (swap the rubric), capability elicitation (swap the target), and prompt distillation (swap the bank).
What's in this repo: a reusable OpenEnv environment, a 90-task bank with 21 scorers, four trained-adapter releases with full eval data, a training pipeline reproducible on HuggingFace Jobs, a live Gradio demo, and a Trackio dashboard with replayed training metrics — all linked below.
## Links
- 🌍 Env (this Space): https://huggingface.co/spaces/rishabh16196/prompt_golf_env
- 🎛️ Live demo (Gradio): https://huggingface.co/spaces/rishabh16196/prompt-golf-demo
- 📊 Training dashboard (Trackio): https://huggingface.co/spaces/rishabh16196/prompt-golf-trackio
- 🐙 GitHub mirror: https://github.com/rishabh16196/prompt_golf_env
- 🛠️ Training pipeline: `training/` — full GRPO trainers, eval harness, profilers, HF Jobs launchers
- 📖 How to train end-to-end: `training/TRAINING.md` — step-by-step recipe to reproduce hero + multi-step + evals + plots
- 📝 Blog post: `BLOG_POST.md`
## Trained adapters & data
| Repo | Setup | What's in it |
|---|---|---|
| `prompt-golf-qwen-to-llama-nothink` | Qwen→Llama, thinking=OFF (hero) | adapter + plots + train_metrics + base/trained eval JSONLs + demo CSV |
| `prompt-golf-qwen-to-llama` | Qwen→Llama, thinking=ON | same artifacts (A/B variant) |
| `prompt-golf-grpo-1.5b` | Qwen→Qwen same-family (control) | same artifacts |
| `prompt-golf-multistep-llama` | Qwen→Llama, multi-turn (3 turns) | adapter + train_metrics + base/trained eval JSONLs |
## Training pipeline (`training/`)

📖 Reproduction recipe: `training/TRAINING.md` — end-to-end steps for hero + multi-step training, evals, plots, demo CSV, Trackio replay.
| File | Role |
|---|---|
| `training/TRAINING.md` | Step-by-step reproduction recipe (start here) |
| `training/train_grpo.py` | TRL GRPO single-step trainer (the hero recipe) |
| `training/train_grpo_multistep.py` | Trajectory-level GRPO for 3-turn episodes |
| `training/eval_before_after.py` | Base + trained-adapter eval harness |
| `training/profile_baseline.py` | Per-task target-capability profiler |
| `training/build_before_after_csv.py` | Merge eval JSONLs into the demo CSV |
| `training/make_plots.py` | Reward / length / breakdown curves from `train_metrics.jsonl` |
| `training/replay_to_trackio.py` | Post-hoc replay of `train_metrics.jsonl` into the Trackio dashboard Space |
| `training/hf_job_train.sh` / `hf_job_train_multistep.sh` / `hf_job_eval.sh` / `hf_job_profile.sh` | HuggingFace Jobs launchers |
## Results — the table to look at
All numbers below are on the same 90-task bank, evaluated against frozen Llama-3.2-3B. Verbose = human-written; base/hero/multistep = agent-generated.
| Setup | Mean accuracy | Reward vs base | Mean tokens | Wins per-task | Use when |
|---|---|---|---|---|---|
| Verbose (human-written) | 0.631 | — | 94.2 | (the bar) | You don't have an agent and don't mind paying full token cost. |
| Base (Qwen3-1.7B, no adapter) | 0.464 | — | 37.5 | 4 / 90 | Almost never. Untrained Qwen3 over-thinks the task. |
| Hero (1-step trained) | 0.506 | +0.381 | 38.5 | 63 / 90 | Default. Cheapest, wins most often, ~3× shorter than verbose at 80% of its accuracy. |
| Multistep (3-turn trained) | 0.576 | +0.440 | 43.7 | 23 / 90 | Nuanced classifiers (classification_tough is its sweet spot). When the +6pp accuracy matters more than +5 tokens. |
Headline: Hero retains 80% of verbose accuracy at ~40% of the tokens and wins per-task on 70% of tasks. Multistep gives back compression for accuracy — only worth it for nuanced classification.
### Training curves (hero)

📊 Demo CSV: `evals/qwen_to_llama_demo.csv` — every task × verbose / untrained / hero prompts side by side.
### Training curves (multistep)
150 steps × 8 trajectories × 3 turns. Warm-started from the hero adapter, so reward starts at +0.58 and climbs to +0.89 by the final step. KL stays under 0.03 against the warm-start LoRA snapshot.
### Per-category breakdown (hero vs multistep, 90 tasks)
For each task we pick the model with the best reward (ε = 0.05 reward margin counts as a tie, broken in favour of the cheaper option).
| Category | n | Hero | Multi | Use |
|---|---|---|---|---|
| `classification_tough` | 10 | 4 | 6 | multistep — only category where multi clearly leads |
| `format`, `meta`, `arithmetic` | 13 | 7 | 5 | tie → hero (cheaper) |
| `classification`, `extraction`, `persona`, `style`, `translation`, `refusal`, `discrimination` | 19 | 19 | 0 | hero — clean sweep |
| `extraction_tough` | 10 | 7 | 2 | hero |
| `reasoning_tough` | 10 | 7 | 3 | hero (multi wins big when it does, less often) |
| `persona_tough` | 8 | 7 | 1 | hero |
| `format_tough` | 8 | 6 | 2 | hero |
| `adversarial_tough` | 6 | 5 | 1 | hero |
| `reasoning` | 4 | 3 | 1 | hero |
| `policy_compression` | 3 | 2 | 1 | hero (multi gets 1 dramatic unlock — `policy_content_moderation` 0.00 → 0.67) |
Multistep wins on aggregate (+7pp accuracy) by landing a small number of dramatic 0→1 unlocks on tough tasks (e.g. tough_policy_stance 0→1.00, tough_syllogism_check 0→1.00). Hero wins more often head-to-head.
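For clarity, a minimal sketch of the per-task winner rule described above the table; the field names and helper are illustrative, not the evaluator's actual code:

```python
def pick_winner(hero: dict, multi: dict, eps: float = 0.05) -> str:
    """Winner rule from the breakdown: rewards within eps count as a tie,
    and ties go to the cheaper (shorter) prompt."""
    if abs(hero["reward"] - multi["reward"]) <= eps:
        return "hero" if hero["tokens"] <= multi["tokens"] else "multistep"
    return "hero" if hero["reward"] > multi["reward"] else "multistep"

# e.g. pick_winner({"reward": 0.62, "tokens": 38}, {"reward": 0.65, "tokens": 44}) -> "hero"
```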
### Side-by-side: verbose vs hero vs multistep

#### `sentiment_basic` (classification — clean hero win)

| | Tokens | Acc | Prompt |
|---|---|---|---|
| Verbose | 27 | 0.83 | "For each input review, output exactly one of: positive, negative, neutral. Output the label only — no punctuation, no explanation." |
| Hero | 18 | 1.00 | "Classify the input review as positive, negative, or neutral. Output only the label." |
| Multistep | 45 | 1.00 | "Given the input review, classify it as positive, negative, or neutral. Output only the label. Use the meta-llama/Llama-3.2-3B-Instruct model. Token budget: 50." |
#### `tough_policy_stance` (classification_tough — multistep unlock)

| | Tokens | Acc | Prompt |
|---|---|---|---|
| Verbose | 136 | 0.83 | "Classify the speaker's stance on the policy proposal mentioned in the quote. Output exactly one label, lowercase, hyphenated..." |
| Hero | 26 | 0.00 | "classify the speaker's stance on the policy proposal in the quote as support, oppose, neutral, conditional-support, or conditional-oppose" |
| Multistep | 35 | 1.00 | "Classify the speaker's stance on the policy proposal. Output exactly one label: support, oppose, neutral, conditional-support, or conditional-oppose. Do not include explanations." |
#### `format_uppercase` (format — multistep regression)

| | Tokens | Acc | Prompt |
|---|---|---|---|
| Verbose | 25 | 1.00 | "Repeat the input sentence back in ALL UPPERCASE letters, with no other changes. Do not add quotes or commentary." |
| Hero | 9 | 1.00 | "Format the input sentence to all uppercase letters." |
| Multistep | 24 | 1.00 | "Format the input sentence to all uppercase, without any changes except capitalization. Output only the uppercase version of the input." |
📊 Eval JSONLs (multistep): `evals/eval_base.jsonl` · `evals/eval_trained.jsonl`
## How it works
Each episode is one task:
- `reset(task=...)` → env returns task description + 3 visible train examples + token budget + target's empty-prompt baseline.
- Agent outputs a prompt string as its action.
- Env prepends the prompt to 6 held-out test inputs, runs the frozen target LLM, scores each output with the task scorer.
- `reward = raw_task_score − 0.5·baseline − 0.002·tokens − leakage_overlap²`, clipped to `[−0.5, 1.3]` (sketched below).
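A minimal sketch of that reward shaping, with illustrative variable names (the real additive composition lives in `server/rubrics.py`):

```python
def golf_reward(raw_task_score: float, baseline: float,
                prompt_tokens: int, leakage_overlap: float) -> float:
    """raw_task_score: mean scorer value over the held-out test inputs.
    baseline: the target's empty-prompt score on the same task.
    leakage_overlap: how much of the graded test inputs the prompt copied verbatim."""
    reward = (raw_task_score
              - 0.5 * baseline            # only credit what the prompt adds over no prompt
              - 0.002 * prompt_tokens     # length penalty: ~0.2 per 100 prompt tokens
              - leakage_overlap ** 2)     # quadratic penalty for leaking graded inputs
    return max(-0.5, min(1.3, reward))    # clip to [-0.5, 1.3]
```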
Multi-turn is supported. With turn_limit > 1, the env splits the test pool into a 2-example feedback slice (revealed across turns with target outputs) and a 4-example scoring slice (only the final-turn prompt is judged). The agent sees its prior prompts + per-example target outputs in the user message, so it can debug across turns without leaking the inputs that grade it. We trained a 3-turn variant — see the per-category breakdown above for results.
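From the client side, a multi-turn episode is just a reset-then-step loop. A sketch, assuming the server is configured with `turn_limit=3`, step results expose the usual `observation` / `reward` / `done` fields, and `draft_prompt` stands in for a real agent:

```python
import asyncio
from prompt_golf_env import GolfAction, PromptGolfEnv

def draft_prompt(obs) -> str:
    # Placeholder agent: a real policy would read the task description, train
    # examples, and the prior turns' target outputs carried in obs.
    return "Classify the speaker's stance. Output exactly one label."

async def run_episode(task: str) -> float:
    async with PromptGolfEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(task=task)
        while True:
            result = await env.step(GolfAction(prompt=draft_prompt(result.observation)))
            if result.done:  # env ends the episode after turn_limit turns;
                return result.reward  # only the final prompt is scored on the 4-example slice

print(asyncio.run(run_episode("tough_policy_stance")))
```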
90 hand-crafted tasks across 4 tiers: 20 v1 (easy/medium classification, extraction, format), 15 v2 (hard — acrostic, YAML depth, persona), 52 tough (logical fallacy ID, FINRA review, Yoda-with-constraint), 3 long-context policy compression (MSN ad policy, content moderation, broker-dealer review).
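For concreteness, a hypothetical task entry; field names are illustrative only, the real definitions live in `server/tasks.py` / `tasks_v2.py` / `tasks_tough.py` / `tasks_policy.py`:

```python
EXAMPLE_TASK = {
    "name": "sentiment_basic",
    "description": "Classify a product review as positive, negative, or neutral.",
    "scorer": "exact_label",   # one of the 21 registered scorers
    "token_budget": 50,
    "train_examples": [        # the 3 examples revealed to the agent at reset()
        {"input": "Loved it, five stars!", "expected": "positive"},
        {"input": "Broke after a day.", "expected": "negative"},
        {"input": "It's a phone case. It holds a phone.", "expected": "neutral"},
    ],
    "test_examples": [
        # 6 held-out input/expected pairs the env actually grades against
    ],
}
```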
## Scorers

Each task picks one of 21 scorers from `server/scorer.py`. The scorer takes the target's output + the task's expected output and returns a value in `[0, 1]`. Per-task score is the mean across held-out test examples.
| Family | Scorers | What they check |
|---|---|---|
| Exact / membership | `exact_label`, `contains_label`, `contains_all_substrings`, `uppercase_match` | Closed-vocabulary classifiers; "must include these substrings"; case-strict rewrites |
| Numeric | `numeric_match`, `word_count_exact` | Last numeric token within tolerance; word count exactly N |
| JSON / YAML | `json_contains_fields`, `valid_json_object`, `json_key_order`, `valid_yaml_depth` | Structural extraction; required keys/values; key ordering; YAML nesting depth |
| Format-strict | `three_bullets`, `acrostic_match`, `avoid_letter`, `ends_question`, `terminal_output_pattern` | Exactly 3 bullets; first letters spell a word; output avoids a letter; ends with `?`; terminal-session shape |
| Multi-step / language | `stepwise_math`, `translation_match`, `selective_translate` | Numbered steps + numeric answer; token-F1 vs reference; partial-translation rules |
| Safety | `refusal_score` | Detects whether the output is a refusal (matches expected refuse / comply label) |
| LLM judge (Qwen3-8B 8-bit) | `judge_criteria`, `judge_vs_expected` | Free-form persona / reasoning / Yoda-syntax tasks where regex can't grade. Judge returns a score on the first line; deterministic decoding. |
The scorer is fixed per task and never seen by the agent — the agent has to infer from the train examples + task description what gets graded. New scorers are added to the `SCORERS` registry at the bottom of `scorer.py`.
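A hypothetical example of what adding a scorer might look like, assuming the interface implied above (target output and expected output in, score in `[0, 1]` out); check `server/scorer.py` for the real registry shape:

```python
def ends_with_exclamation(output: str, expected: str) -> float:
    """Illustrative scorer: full credit only if the target's output ends with '!'."""
    return 1.0 if output.strip().endswith("!") else 0.0

# Stand-in for the real registry at the bottom of server/scorer.py;
# tasks then reference the scorer by its registered name.
SCORERS = {"ends_with_exclamation": ends_with_exclamation}
```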
## Other variants we ran
- Qwen → Qwen same-family control (`prompt-golf-grpo-1.5b`) — looks great on win-rate (70/87 vs verbose) but verbose-Qwen ceiling is only 0.15. Cross-family Llama is the harder bar.
- Thinking-ON A/B (`prompt-golf-qwen-to-llama`) — `<think>...</think>` chat template enabled. Loses to OFF on reward (+0.379 vs +0.426) at +30% tokens. We ship OFF as default.
## Prior work
This env stands on four research lines:
| Line | Paper | What we use |
|---|---|---|
| Machine Theory of Mind | Rabinowitz et al., 2018 (ToMnet) | Conceptual frame: one model learns to model another from observed outputs alone, no internals access. |
| LLM-on-LLM red teaming | Perez et al., 2022 | Direct algorithmic ancestor — same RL-on-LLM-prompts loop. We swap adversarial reward for task-success + length. |
| Capability elicitation | Greenblatt et al., 2024 (password-locked models) | "Minimum input that surfaces a latent capability" as a measurable RL objective. |
| Prompt optimization | AutoPrompt · GCG · RLPrompt · PCRL | Algorithmic toolkit (gradient-search over discrete tokens, RL-trained policies for prompt search). |
We're not the first to compress prompts with RL. We're trying to be the first place you can go to run this experiment — fork the env, swap in your target, run it, get a number.
## Quick start
```python
import asyncio
from prompt_golf_env import GolfAction, PromptGolfEnv

async def main():
    async with PromptGolfEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(task="sentiment_basic")
        obs = result.observation
        result = await env.step(GolfAction(prompt="Classify sentiment, one word."))
        print(f"reward={result.reward:.2f} raw={result.observation.raw_task_score:.2f} "
              f"tokens={result.observation.submitted_prompt_tokens}")

asyncio.run(main())
```
Run the env locally:
```bash
PROMPT_GOLF_TARGET_BACKEND=mock uvicorn server.app:app --port 8000   # CPU smoke test

PROMPT_GOLF_TARGET_BACKEND=hf PROMPT_GOLF_TARGET_MODEL=meta-llama/Llama-3.2-3B-Instruct \
  uvicorn server.app:app --port 8000                                 # real GPU
```
Reproduce the hero training run on HuggingFace Jobs:
```bash
PUSH_TO_HUB=your-user/your-repo bash training/hf_job_train.sh   # ~3h on L40S
```
## Files
```
prompt_golf_env/
  models.py                    # GolfAction, GolfObservation, constants
  client.py                    # PromptGolfEnv (EnvClient)
  inference.py                 # OpenEnv-spec inference script
pyproject.toml
Dockerfile                     # HF Spaces image
server/
  app.py                       # FastAPI app
  prompt_golf_environment.py   # core Env
  target_model.py              # frozen-target wrapper (HF + mock)
  judge.py                     # Qwen3-8B 8-bit judge
  scorer.py                    # 21+ scorers (structural + LLM judge)
  rubrics.py                   # additive reward composition
  tasks.py / tasks_v2.py / tasks_tough.py / tasks_policy.py   # 90-task bank
training/                      # see Links → Training pipeline
ui/ + space-demo/              # Gradio demos
BLOG_POST.md                   # writeup
```