---
title: Prompt Golf Environment Server
emoji:
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Prompt Golf

Can one LLM learn to whisper to another? An OpenEnv RL environment where the agent's action is a prompt and the reward is how well that prompt steers a frozen, different-family target LLM to do the right thing — minus how long the prompt is.

The result. A Qwen3-1.7B agent (LoRA + TRL GRPO, ~3h on a single L40S) learns to write ~39-token prompts that retain 80% of the accuracy of ~94-token human-written prompts on a frozen Llama-3.2-3B target — cross-family, black-box, learned from outputs alone, no gradient access. On 63/90 (70%) of tasks the agent's compressed prompt is the best of the three we evaluated (verbose, untrained agent, trained agent) — cheaper and equal-or-better reward. ▶ Try the live demo · 📝 Read the blog

Why this matters. Production LLM systems prepend 1000-token policies to every classification call (creative compliance, content moderation, regulated-comm review). Today the only way to compress them is for a human prompt engineer to iterate by hand. If one LLM can build a behavioral model of another LLM accurately enough — the same way humans model each other — the LLM can find the minimum policy itself. Train once, ship the compressor, save 30× per call. Same env generalizes to red-teaming (swap the rubric), capability elicitation (swap the target), and prompt distillation (swap the bank).

What's in this repo: a reusable OpenEnv environment, a 90-task bank with 21 scorers, four trained-adapter releases with full eval data, a training pipeline reproducible on HuggingFace Jobs, a live Gradio demo, and a Trackio dashboard with replayed training metrics — all linked below.

## Links

### Trained adapters & data

| Repo | Setup | What's in it |
|---|---|---|
| prompt-golf-qwen-to-llama-nothink | Qwen→Llama, thinking=OFF (hero) | adapter + plots + train_metrics + base/trained eval JSONLs + demo CSV |
| prompt-golf-qwen-to-llama | Qwen→Llama, thinking=ON | same artifacts (A/B variant) |
| prompt-golf-grpo-1.5b | Qwen→Qwen same-family (control) | same artifacts |
| prompt-golf-multistep-llama | Qwen→Llama, multi-turn (3 turns) | adapter + train_metrics + base/trained eval JSONLs |

### Training pipeline (`training/`)

📖 Reproduction recipe: training/TRAINING.md — end-to-end steps for hero + multi-step training, evals, plots, demo CSV, Trackio replay.

| File | Role |
|---|---|
| training/TRAINING.md | Step-by-step reproduction recipe (start here) |
| training/train_grpo.py | TRL GRPO single-step trainer (the hero recipe) |
| training/train_grpo_multistep.py | Trajectory-level GRPO for 3-turn episodes |
| training/eval_before_after.py | Base + trained-adapter eval harness |
| training/profile_baseline.py | Per-task target-capability profiler |
| training/build_before_after_csv.py | Merge eval JSONLs into the demo CSV |
| training/make_plots.py | Reward / length / breakdown curves from train_metrics.jsonl |
| training/replay_to_trackio.py | Post-hoc replay of train_metrics.jsonl into the Trackio dashboard Space |
| training/hf_job_train.sh / hf_job_train_multistep.sh / hf_job_eval.sh / hf_job_profile.sh | HuggingFace Jobs launchers |

## Results — the table to look at

All numbers below are on the same 90-task bank, evaluated against frozen Llama-3.2-3B. Verbose = human-written; base/hero/multistep = agent-generated.

| Setup | Mean accuracy | Reward vs base | Mean tokens | Wins per-task | Use when |
|---|---|---|---|---|---|
| Verbose (human-written) | 0.631 | — | 94.2 | — (the bar) | You don't have an agent and don't mind paying full token cost. |
| Base (Qwen3-1.7B, no adapter) | 0.464 | — | 37.5 | 4 / 90 | Almost never. Untrained Qwen3 over-thinks the task. |
| Hero (1-step trained) | 0.506 | +0.381 | 38.5 | 63 / 90 | Default. Cheapest, wins most often, ~2.5× shorter than verbose at 80% of its accuracy. |
| Multistep (3-turn trained) | 0.576 | +0.440 | 43.7 | 23 / 90 | Nuanced classifiers (classification_tough is its sweet spot). When the +7pp accuracy matters more than +5 tokens. |

Headline: Hero retains 80% of verbose accuracy at ~40% of the tokens and wins per-task on 70% of tasks. Multistep gives back compression for accuracy — only worth it for nuanced classification.

### Training curves (hero)

*(plots: reward curve · length curve · reward-component breakdown)*

📊 Demo CSV: evals/qwen_to_llama_demo.csv — every task × verbose / untrained / hero prompts side by side.
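
To skim it locally, a quick sketch (this assumes only that the file is a plain CSV; the column names are whatever `build_before_after_csv.py` wrote, so print them rather than assume them):

```python
# Peek at the side-by-side demo CSV. No column names are assumed here;
# list them first, then show the first few rows.
import pandas as pd

df = pd.read_csv("evals/qwen_to_llama_demo.csv")
print(df.columns.tolist())
print(df.head(3).to_string())
```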

### Training curves (multistep)

*(plots: multistep reward curve · multistep length curve · multistep breakdown)*

150 steps × 8 trajectories × 3 turns. Warm-started from the hero adapter, so reward starts at +0.58 and climbs to +0.89 by the final step. KL stays under 0.03 against the warm-start LoRA snapshot.

### Per-category breakdown (hero vs multistep, 90 tasks)

For each task we pick the model with the best reward (ε = 0.05 reward margin counts as a tie, broken in favour of the cheaper option).

| Category | n | Hero | Multi | Use |
|---|---|---|---|---|
| classification_tough | 10 | 4 | 6 | multistep — only category where multi clearly leads |
| format, meta, arithmetic | 13 | 7 | 5 | tie → hero (cheaper) |
| classification, extraction, persona, style, translation, refusal, discrimination | 19 | 19 | 0 | hero — clean sweep |
| extraction_tough | 10 | 7 | 2 | hero |
| reasoning_tough | 10 | 7 | 3 | hero (multi wins big when it does, less often) |
| persona_tough | 8 | 7 | 1 | hero |
| format_tough | 8 | 6 | 2 | hero |
| adversarial_tough | 6 | 5 | 1 | hero |
| reasoning | 4 | 3 | 1 | hero |
| policy_compression | 3 | 2 | 1 | hero (multi gets 1 dramatic unlock — policy_content_moderation 0.00 → 0.67) |
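
The winner-selection rule described above, as a minimal sketch (the candidate tuples are illustrative, not the eval script's actual data structures):

```python
# Pick the per-task winner: best reward wins; anything within EPS of the best
# counts as a tie, and ties go to the cheaper (shorter) prompt.
EPS = 0.05

def pick_winner(candidates: list[tuple[str, float, int]]) -> str:
    """candidates: (name, reward, prompt_tokens) per model."""
    best_reward = max(reward for _, reward, _ in candidates)
    tied = [c for c in candidates if c[1] >= best_reward - EPS]
    return min(tied, key=lambda c: c[2])[0]  # cheapest among the tied

# Multi's reward edge is within the 0.05 margin, so the cheaper hero prompt wins:
# pick_winner([("hero", 0.52, 38), ("multistep", 0.55, 44)]) -> "hero"
```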

Multistep wins on aggregate (+7pp accuracy) by landing a small number of dramatic 0→1 unlocks on tough tasks (e.g. tough_policy_stance 0→1.00, tough_syllogism_check 0→1.00). Hero wins more often head-to-head.

### Side-by-side: verbose vs hero vs multistep

#### `sentiment_basic` (classification — clean hero win)

| | Tokens | Acc | Prompt |
|---|---|---|---|
| Verbose | 27 | 0.83 | "For each input review, output exactly one of: positive, negative, neutral. Output the label only — no punctuation, no explanation." |
| Hero | 18 | 1.00 | "Classify the input review as positive, negative, or neutral. Output only the label." |
| Multistep | 45 | 1.00 | "Given the input review, classify it as positive, negative, or neutral. Output only the label. Use the meta-llama/Llama-3.2-3B-Instruct model. Token budget: 50." |

#### `tough_policy_stance` (classification_tough — multistep unlock)

| | Tokens | Acc | Prompt |
|---|---|---|---|
| Verbose | 136 | 0.83 | "Classify the speaker's stance on the policy proposal mentioned in the quote. Output exactly one label, lowercase, hyphenated..." |
| Hero | 26 | 0.00 | "classify the speaker's stance on the policy proposal in the quote as support, oppose, neutral, conditional-support, or conditional-oppose" |
| Multistep | 35 | 1.00 | "Classify the speaker's stance on the policy proposal. Output exactly one label: support, oppose, neutral, conditional-support, or conditional-oppose. Do not include explanations." |

#### `format_uppercase` (format — multistep regression)

| | Tokens | Acc | Prompt |
|---|---|---|---|
| Verbose | 25 | 1.00 | "Repeat the input sentence back in ALL UPPERCASE letters, with no other changes. Do not add quotes or commentary." |
| Hero | 9 | 1.00 | "Format the input sentence to all uppercase letters." |
| Multistep | 24 | 1.00 | "Format the input sentence to all uppercase, without any changes except capitalization. Output only the uppercase version of the input." |

📊 Eval JSONLs (multistep): evals/eval_base.jsonl · evals/eval_trained.jsonl


## How it works

Each episode is one task:

  1. reset(task=...) → env returns task description + 3 visible train examples + token budget + target's empty-prompt baseline.
  2. Agent outputs a prompt string as its action.
  3. Env prepends the prompt to 6 held-out test inputs, runs the frozen target LLM, scores each output with the task scorer.
  4. `reward = raw_task_score − 0.5·baseline − 0.002·tokens − leakage_overlap²`, clipped to [−0.5, 1.3] (see the sketch below).
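
A minimal sketch of that reward, assuming `raw_task_score` is the mean scorer value over the held-out inputs and `baseline` is the target's empty-prompt score; how `leakage_overlap` is measured (copying graded inputs into the prompt) is an illustrative assumption here.

```python
# Sketch of the per-episode reward; see server/rubrics.py for the real composition.
def golf_reward(raw_task_score: float, baseline: float,
                prompt_tokens: int, leakage_overlap: float) -> float:
    reward = (
        raw_task_score
        - 0.5 * baseline          # credit improvement over the empty-prompt baseline
        - 0.002 * prompt_tokens   # length penalty: ~0.2 per 100 prompt tokens
        - leakage_overlap ** 2    # quadratic penalty for leaking graded inputs
    )
    return max(-0.5, min(1.3, reward))  # clip to [-0.5, 1.3]

# e.g. perfect raw score, 0.2 baseline, 38-token prompt, no leakage:
# golf_reward(1.0, 0.2, 38, 0.0) == 1.0 - 0.1 - 0.076 - 0.0 == 0.824
```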

Multi-turn is supported. With turn_limit > 1, the env splits the test pool into a 2-example feedback slice (revealed across turns with target outputs) and a 4-example scoring slice (only the final-turn prompt is judged). The agent sees its prior prompts + per-example target outputs in the user message, so it can debug across turns without leaking the inputs that grade it. We trained a 3-turn variant — see the per-category breakdown above for results.
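
A minimal multi-turn client loop, assuming the same `PromptGolfEnv` client as in the quick start below; whether `turn_limit` is a reset-time option (as written here) or a server-side setting is an assumption, and the drafts stand in for an agent that revises its prompt from the env's per-turn feedback:

```python
import asyncio

from prompt_golf_env import GolfAction, PromptGolfEnv

# Hand-written drafts standing in for a trained agent revising its prompt.
DRAFTS = [
    "Classify the speaker's stance on the policy proposal.",
    "Classify the stance: support, oppose, neutral, conditional-support, conditional-oppose.",
    "Output exactly one label: support, oppose, neutral, conditional-support, or conditional-oppose.",
]

async def run_episode():
    async with PromptGolfEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(task="tough_policy_stance", turn_limit=3)  # assumed knob
        for turn, prompt in enumerate(DRAFTS, start=1):
            result = await env.step(GolfAction(prompt=prompt))
            obs = result.observation  # prior prompts + feedback-slice target outputs
            print(f"turn {turn}: reward={result.reward:.2f}")
        # Only the final turn's prompt is judged on the 4-example scoring slice.

asyncio.run(run_episode())
```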

90 hand-crafted tasks across 4 tiers: 20 v1 (easy/medium classification, extraction, format), 15 v2 (hard — acrostic, YAML depth, persona), 52 tough (logical fallacy ID, FINRA review, Yoda-with-constraint), 3 long-context policy compression (MSN ad policy, content moderation, broker-dealer review).


## Scorers

Each task picks one of 21 scorers from server/scorer.py. The scorer takes the target's output + the task's expected output and returns a value in [0, 1]. Per-task score is the mean across held-out test examples.

| Family | Scorers | What they check |
|---|---|---|
| Exact / membership | `exact_label`, `contains_label`, `contains_all_substrings`, `uppercase_match` | Closed-vocabulary classifiers; "must include these substrings"; case-strict rewrites |
| Numeric | `numeric_match`, `word_count_exact` | Last numeric token within tolerance; word count exactly N |
| JSON / YAML | `json_contains_fields`, `valid_json_object`, `json_key_order`, `valid_yaml_depth` | Structural extraction; required keys/values; key ordering; YAML nesting depth |
| Format-strict | `three_bullets`, `acrostic_match`, `avoid_letter`, `ends_question`, `terminal_output_pattern` | Exactly 3 bullets; first letters spell a word; output avoids a letter; ends with ?; terminal-session shape |
| Multi-step / language | `stepwise_math`, `translation_match`, `selective_translate` | Numbered steps + numeric answer; token-F1 vs reference; partial-translation rules |
| Safety | `refusal_score` | Detects whether the output is a refusal (matches expected refuse / comply label) |
| LLM judge (Qwen3-8B 8-bit) | `judge_criteria`, `judge_vs_expected` | Free-form persona / reasoning / Yoda-syntax tasks where regex can't grade. Judge returns a score on the first line; deterministic decoding. |

The scorer is fixed per task and never seen by the agent — the agent has to infer from the train examples + task description what gets graded. New scorers are added to the `SCORERS` registry at the bottom of `scorer.py` (see the sketch below).
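
A hedged sketch of what registering a new scorer could look like, appended at the bottom of `server/scorer.py` where `SCORERS` lives; the name `ends_exclamation` and the exact `(output, expected) -> [0, 1]` signature are illustrative:

```python
# Illustrative custom scorer: 1.0 if the target's output ends with "!", else 0.0.
def ends_exclamation(output: str, expected: str) -> float:
    return 1.0 if output.strip().endswith("!") else 0.0

# Tasks reference scorers by name, so registering it makes it available to the bank.
SCORERS["ends_exclamation"] = ends_exclamation
```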


## Other variants we ran

- Qwen → Qwen same-family control (prompt-golf-grpo-1.5b) — looks great on win-rate (70/87 vs verbose), but the verbose-Qwen ceiling is only 0.15. Cross-family Llama is the harder bar.
- Thinking-ON A/B (prompt-golf-qwen-to-llama) — `<think>...</think>` chat template enabled. Loses to OFF on reward (+0.379 vs +0.426) at +30% tokens. We ship OFF as default.

## Prior work

This env stands on four research lines:

| Research line | Paper | What we use |
|---|---|---|
| Machine Theory of Mind | Rabinowitz et al., 2018 (ToMnet) | Conceptual frame: one model learns to model another from observed outputs alone, no internals access. |
| LLM-on-LLM red teaming | Perez et al., 2022 | Direct algorithmic ancestor — same RL-on-LLM-prompts loop. We swap adversarial reward for task-success + length. |
| Capability elicitation | Greenblatt et al., 2024 (password-locked models) | "Minimum input that surfaces a latent capability" as a measurable RL objective. |
| Prompt optimization | AutoPrompt · GCG · RLPrompt · PCRL | Algorithmic toolkit (gradient search over discrete tokens, RL-trained policies for prompt search). |

We're not the first to compress prompts with RL. We're trying to be the first place you can go to run this experiment — fork the env, swap in your target, run it, get a number.


## Quick start

```python
import asyncio

from prompt_golf_env import GolfAction, PromptGolfEnv

async def main():
    async with PromptGolfEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(task="sentiment_basic")
        obs = result.observation  # task description, train examples, token budget, baseline
        result = await env.step(GolfAction(prompt="Classify sentiment, one word."))
        print(f"reward={result.reward:.2f}  raw={result.observation.raw_task_score:.2f}  "
              f"tokens={result.observation.submitted_prompt_tokens}")

asyncio.run(main())
```

Run the env locally:

```bash
PROMPT_GOLF_TARGET_BACKEND=mock uvicorn server.app:app --port 8000   # CPU smoke test
PROMPT_GOLF_TARGET_BACKEND=hf PROMPT_GOLF_TARGET_MODEL=meta-llama/Llama-3.2-3B-Instruct \
    uvicorn server.app:app --port 8000                                # real GPU
```

Reproduce the hero training run on HuggingFace Jobs:

```bash
PUSH_TO_HUB=your-user/your-repo bash training/hf_job_train.sh   # ~3h on L40S
```

## Files

```
prompt_golf_env/
  models.py                            # GolfAction, GolfObservation, constants
  client.py                            # PromptGolfEnv (EnvClient)
  inference.py                         # OpenEnv-spec inference script
  pyproject.toml
  Dockerfile                           # HF Spaces image
  server/
    app.py                             # FastAPI app
    prompt_golf_environment.py         # core Env
    target_model.py                    # frozen-target wrapper (HF + mock)
    judge.py                           # Qwen3-8B 8-bit judge
    scorer.py                          # 21+ scorers (structural + LLM judge)
    rubrics.py                         # additive reward composition
    tasks.py / tasks_v2.py / tasks_tough.py / tasks_policy.py    # 90-task bank
  training/                            # see Links → Training pipeline
  ui/ + space-demo/                    # Gradio demos
  BLOG_POST.md                         # writeup
```