---
title: Prompt Golf Environment Server
emoji:
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Prompt Golf

Can one LLM learn to whisper to another? An OpenEnv RL environment where the agent's action is a prompt and the reward is how well that prompt steers a frozen, different-family target LLM to do the right thing — minus how long the prompt is.

The result. A Qwen3-1.7B agent (LoRA + TRL GRPO, ~3h on a single L40S) learns to write ~39-token prompts that retain 80% of the accuracy of ~94-token human-written prompts on a frozen Llama-3.2-3B target — cross-family, black-box, learned from outputs alone, no gradient access. On 63/90 (70%) of tasks the agent's compressed prompt is the best of the three we evaluated (verbose, untrained agent, trained agent) — cheaper and equal-or-better reward. ▶ Try the live demo · 📝 Read the blog

Why this matters. Production LLM systems prepend 1000-token policies to every classification call (creative compliance, content moderation, regulated-comm review). Today the only way to compress them is for a human prompt engineer to iterate by hand. If one LLM can build a behavioral model of another LLM accurately enough — the same way humans model each other — the LLM can find the minimum policy itself. Train once, ship the compressor, save 30× per call. Same env generalizes to red-teaming (swap the rubric), capability elicitation (swap the target), and prompt distillation (swap the bank).

What's in this repo: a reusable OpenEnv environment, a 90-task bank with 21 scorers, four trained-adapter releases with full eval data, a training pipeline reproducible on HuggingFace Jobs, a live Gradio demo, and a Trackio dashboard with replayed training metrics — all linked below.

## Links

### Trained adapters & data

| Repo | Setup | What's in it |
|---|---|---|
| prompt-golf-qwen-to-llama-nothink | Qwen→Llama, thinking=OFF (hero) | adapter + plots + train_metrics + base/trained eval JSONLs + demo CSV |
| prompt-golf-qwen-to-llama | Qwen→Llama, thinking=ON | same artifacts (A/B variant) |
| prompt-golf-grpo-1.5b | Qwen→Qwen same-family (control) | same artifacts |
| prompt-golf-multistep-llama | Qwen→Llama, multi-turn (3 turns) | adapter + train_metrics + base/trained eval JSONLs |

### Training pipeline (`training/`)

📖 Reproduction recipe: training/TRAINING.md — end-to-end steps for hero + multi-step training, evals, plots, demo CSV, Trackio replay.

| File | Role |
|---|---|
| training/TRAINING.md | Step-by-step reproduction recipe (start here) |
| training/train_grpo.py | TRL GRPO single-step trainer (the hero recipe) |
| training/train_grpo_multistep.py | Trajectory-level GRPO for 3-turn episodes |
| training/eval_before_after.py | Base + trained-adapter eval harness |
| training/profile_baseline.py | Per-task target-capability profiler |
| training/build_before_after_csv.py | Merge eval JSONLs into the demo CSV |
| training/make_plots.py | Reward / length / breakdown curves from train_metrics.jsonl |
| training/replay_to_trackio.py | Post-hoc replay of train_metrics.jsonl into the Trackio dashboard Space |
| training/hf_job_train.sh / hf_job_train_multistep.sh / hf_job_eval.sh / hf_job_profile.sh | HuggingFace Jobs launchers |

## Results — the table to look at

All numbers below are on the same 90-task bank, evaluated against frozen Llama-3.2-3B. Verbose = human-written; base/hero/multistep = agent-generated.

| Setup | Mean accuracy | Reward vs base | Mean tokens | Wins per-task | Use when |
|---|---|---|---|---|---|
| Verbose (human-written) | 0.631 | — | 94.2 | — (the bar) | You don't have an agent and don't mind paying full token cost. |
| Base (Qwen3-1.7B, no adapter) | 0.464 | — | 37.5 | 4 / 90 | Almost never. Untrained Qwen3 over-thinks the task. |
| Hero (1-step trained) | 0.506 | +0.381 | 38.5 | 63 / 90 | Default. Cheapest, wins most often, ~2.5× shorter than verbose at 80% of its accuracy. |
| Multistep (3-turn trained) | 0.576 | +0.440 | 43.7 | 23 / 90 | Nuanced classifiers (classification_tough is its sweet spot). When the +7pp accuracy matters more than +5 tokens. |

Headline: Hero retains 80% of verbose accuracy at ~40% of the tokens and wins per-task on 70% of tasks. Multistep gives back compression for accuracy — only worth it for nuanced classification.

### Training curves (hero)

*(plots: reward curve · length curve · reward-component breakdown)*

📊 Demo CSV: evals/qwen_to_llama_demo.csv — every task × verbose / untrained / hero prompts side by side.
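
To skim it locally, a quick sketch (this assumes only that the file is a plain CSV; the column names are whatever `build_before_after_csv.py` wrote, so print them rather than assume them):

```python
# Peek at the side-by-side demo CSV. No column names are assumed here;
# list them first, then show the first few rows.
import pandas as pd

df = pd.read_csv("evals/qwen_to_llama_demo.csv")
print(df.columns.tolist())
print(df.head(3).to_string())
```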

### Training curves (multistep)

*(plots: multistep reward curve · multistep length curve · multistep breakdown)*

150 steps × 8 trajectories × 3 turns. Warm-started from the hero adapter, so reward starts at +0.58 and climbs to +0.89 by the final step. KL stays under 0.03 against the warm-start LoRA snapshot.

### Per-category breakdown (hero vs multistep, 90 tasks)

For each task we pick the model with the best reward (ε = 0.05 reward margin counts as a tie, broken in favour of the cheaper option).

| Category | n | Hero | Multi | Use |
|---|---|---|---|---|
| classification_tough | 10 | 4 | 6 | multistep — only category where multi clearly leads |
| format, meta, arithmetic | 13 | 7 | 5 | tie → hero (cheaper) |
| classification, extraction, persona, style, translation, refusal, discrimination | 19 | 19 | 0 | hero — clean sweep |
| extraction_tough | 10 | 7 | 2 | hero |
| reasoning_tough | 10 | 7 | 3 | hero (multi wins big when it does, less often) |
| persona_tough | 8 | 7 | 1 | hero |
| format_tough | 8 | 6 | 2 | hero |
| adversarial_tough | 6 | 5 | 1 | hero |
| reasoning | 4 | 3 | 1 | hero |
| policy_compression | 3 | 2 | 1 | hero (multi gets 1 dramatic unlock — policy_content_moderation 0.00 → 0.67) |
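
The winner-selection rule described above, as a minimal sketch (the candidate tuples are illustrative, not the eval script's actual data structures):

```python
# Pick the per-task winner: best reward wins; anything within EPS of the best
# counts as a tie, and ties go to the cheaper (shorter) prompt.
EPS = 0.05

def pick_winner(candidates: list[tuple[str, float, int]]) -> str:
    """candidates: (name, reward, prompt_tokens) per model."""
    best_reward = max(reward for _, reward, _ in candidates)
    tied = [c for c in candidates if c[1] >= best_reward - EPS]
    return min(tied, key=lambda c: c[2])[0]  # cheapest among the tied

# Multi's reward edge is within the 0.05 margin, so the cheaper hero prompt wins:
# pick_winner([("hero", 0.52, 38), ("multistep", 0.55, 44)]) -> "hero"
```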

Multistep wins on aggregate (+7pp accuracy) by landing a small number of dramatic 0→1 unlocks on tough tasks (e.g. tough_policy_stance 0→1.00, tough_syllogism_check 0→1.00). Hero wins more often head-to-head.

### Side-by-side: verbose vs hero vs multistep

#### `sentiment_basic` (classification — clean hero win)

| | Tokens | Acc | Prompt |
|---|---|---|---|
| Verbose | 27 | 0.83 | "For each input review, output exactly one of: positive, negative, neutral. Output the label only — no punctuation, no explanation." |
| Hero | 18 | 1.00 | "Classify the input review as positive, negative, or neutral. Output only the label." |
| Multistep | 45 | 1.00 | "Given the input review, classify it as positive, negative, or neutral. Output only the label. Use the meta-llama/Llama-3.2-3B-Instruct model. Token budget: 50." |

#### `tough_policy_stance` (classification_tough — multistep unlock)

| | Tokens | Acc | Prompt |
|---|---|---|---|
| Verbose | 136 | 0.83 | "Classify the speaker's stance on the policy proposal mentioned in the quote. Output exactly one label, lowercase, hyphenated..." |
| Hero | 26 | 0.00 | "classify the speaker's stance on the policy proposal in the quote as support, oppose, neutral, conditional-support, or conditional-oppose" |
| Multistep | 35 | 1.00 | "Classify the speaker's stance on the policy proposal. Output exactly one label: support, oppose, neutral, conditional-support, or conditional-oppose. Do not include explanations." |

#### `format_uppercase` (format — multistep regression)

| | Tokens | Acc | Prompt |
|---|---|---|---|
| Verbose | 25 | 1.00 | "Repeat the input sentence back in ALL UPPERCASE letters, with no other changes. Do not add quotes or commentary." |
| Hero | 9 | 1.00 | "Format the input sentence to all uppercase letters." |
| Multistep | 24 | 1.00 | "Format the input sentence to all uppercase, without any changes except capitalization. Output only the uppercase version of the input." |

📊 Eval JSONLs (multistep): evals/eval_base.jsonl · evals/eval_trained.jsonl


## How it works

Each episode is one task:

  1. reset(task=...) → env returns task description + 3 visible train examples + token budget + target's empty-prompt baseline.
  2. Agent outputs a prompt string as its action.
  3. Env prepends the prompt to 6 held-out test inputs, runs the frozen target LLM, scores each output with the task scorer.
  4. `reward = raw_task_score − 0.5·baseline − 0.002·tokens − leakage_overlap²`, clipped to [−0.5, 1.3] (see the sketch below).
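
A minimal sketch of that reward, assuming `raw_task_score` is the mean scorer value over the held-out inputs and `baseline` is the target's empty-prompt score; how `leakage_overlap` is measured (copying graded inputs into the prompt) is an illustrative assumption here.

```python
# Sketch of the per-episode reward; see server/rubrics.py for the real composition.
def golf_reward(raw_task_score: float, baseline: float,
                prompt_tokens: int, leakage_overlap: float) -> float:
    reward = (
        raw_task_score
        - 0.5 * baseline          # credit improvement over the empty-prompt baseline
        - 0.002 * prompt_tokens   # length penalty: ~0.2 per 100 prompt tokens
        - leakage_overlap ** 2    # quadratic penalty for leaking graded inputs
    )
    return max(-0.5, min(1.3, reward))  # clip to [-0.5, 1.3]

# e.g. perfect raw score, 0.2 baseline, 38-token prompt, no leakage:
# golf_reward(1.0, 0.2, 38, 0.0) == 1.0 - 0.1 - 0.076 - 0.0 == 0.824
```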

Multi-turn is supported. With turn_limit > 1, the env splits the test pool into a 2-example feedback slice (revealed across turns with target outputs) and a 4-example scoring slice (only the final-turn prompt is judged). The agent sees its prior prompts + per-example target outputs in the user message, so it can debug across turns without leaking the inputs that grade it. We trained a 3-turn variant — see the per-category breakdown above for results.
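
A minimal multi-turn client loop, assuming the same `PromptGolfEnv` client as in the quick start below; whether `turn_limit` is a reset-time option (as written here) or a server-side setting is an assumption, and the drafts stand in for an agent that revises its prompt from the env's per-turn feedback:

```python
import asyncio

from prompt_golf_env import GolfAction, PromptGolfEnv

# Hand-written drafts standing in for a trained agent revising its prompt.
DRAFTS = [
    "Classify the speaker's stance on the policy proposal.",
    "Classify the stance: support, oppose, neutral, conditional-support, conditional-oppose.",
    "Output exactly one label: support, oppose, neutral, conditional-support, or conditional-oppose.",
]

async def run_episode():
    async with PromptGolfEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(task="tough_policy_stance", turn_limit=3)  # assumed knob
        for turn, prompt in enumerate(DRAFTS, start=1):
            result = await env.step(GolfAction(prompt=prompt))
            obs = result.observation  # prior prompts + feedback-slice target outputs
            print(f"turn {turn}: reward={result.reward:.2f}")
        # Only the final turn's prompt is judged on the 4-example scoring slice.

asyncio.run(run_episode())
```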

90 hand-crafted tasks across 4 tiers: 20 v1 (easy/medium classification, extraction, format), 15 v2 (hard — acrostic, YAML depth, persona), 52 tough (logical fallacy ID, FINRA review, Yoda-with-constraint), 3 long-context policy compression (MSN ad policy, content moderation, broker-dealer review).


## Scorers

Each task picks one of 21 scorers from server/scorer.py. The scorer takes the target's output + the task's expected output and returns a value in [0, 1]. Per-task score is the mean across held-out test examples.

| Family | Scorers | What they check |
|---|---|---|
| Exact / membership | `exact_label`, `contains_label`, `contains_all_substrings`, `uppercase_match` | Closed-vocabulary classifiers; "must include these substrings"; case-strict rewrites |
| Numeric | `numeric_match`, `word_count_exact` | Last numeric token within tolerance; word count exactly N |
| JSON / YAML | `json_contains_fields`, `valid_json_object`, `json_key_order`, `valid_yaml_depth` | Structural extraction; required keys/values; key ordering; YAML nesting depth |
| Format-strict | `three_bullets`, `acrostic_match`, `avoid_letter`, `ends_question`, `terminal_output_pattern` | Exactly 3 bullets; first letters spell a word; output avoids a letter; ends with ?; terminal-session shape |
| Multi-step / language | `stepwise_math`, `translation_match`, `selective_translate` | Numbered steps + numeric answer; token-F1 vs reference; partial-translation rules |
| Safety | `refusal_score` | Detects whether the output is a refusal (matches expected refuse / comply label) |
| LLM judge (Qwen3-8B 8-bit) | `judge_criteria`, `judge_vs_expected` | Free-form persona / reasoning / Yoda-syntax tasks where regex can't grade. Judge returns a score on the first line; deterministic decoding. |

The scorer is fixed per task and never seen by the agent — the agent has to infer from the train examples + task description what gets graded. New scorers are added to the `SCORERS` registry at the bottom of `scorer.py` (see the sketch below).
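
A hedged sketch of what registering a new scorer could look like, appended at the bottom of `server/scorer.py` where `SCORERS` lives; the name `ends_exclamation` and the exact `(output, expected) -> [0, 1]` signature are illustrative:

```python
# Illustrative custom scorer: 1.0 if the target's output ends with "!", else 0.0.
def ends_exclamation(output: str, expected: str) -> float:
    return 1.0 if output.strip().endswith("!") else 0.0

# Tasks reference scorers by name, so registering it makes it available to the bank.
SCORERS["ends_exclamation"] = ends_exclamation
```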


## Other variants we ran

- Qwen → Qwen same-family control (prompt-golf-grpo-1.5b) — looks great on win-rate (70/87 vs verbose), but the verbose-Qwen ceiling is only 0.15. Cross-family Llama is the harder bar.
- Thinking-ON A/B (prompt-golf-qwen-to-llama) — `<think>...</think>` chat template enabled. Loses to OFF on reward (+0.379 vs +0.426) at +30% tokens. We ship OFF as default.

## Prior work

This env stands on four research lines:

| Research line | Paper | What we use |
|---|---|---|
| Machine Theory of Mind | Rabinowitz et al., 2018 (ToMnet) | Conceptual frame: one model learns to model another from observed outputs alone, no internals access. |
| LLM-on-LLM red teaming | Perez et al., 2022 | Direct algorithmic ancestor — same RL-on-LLM-prompts loop. We swap adversarial reward for task-success + length. |
| Capability elicitation | Greenblatt et al., 2024 (password-locked models) | "Minimum input that surfaces a latent capability" as a measurable RL objective. |
| Prompt optimization | AutoPrompt · GCG · RLPrompt · PCRL | Algorithmic toolkit (gradient search over discrete tokens, RL-trained policies for prompt search). |

We're not the first to compress prompts with RL. We're trying to be the first place you can go to run this experiment — fork the env, swap in your target, run it, get a number.


## Quick start

```python
import asyncio

from prompt_golf_env import GolfAction, PromptGolfEnv

async def main():
    async with PromptGolfEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(task="sentiment_basic")
        obs = result.observation  # task description, train examples, token budget, baseline
        result = await env.step(GolfAction(prompt="Classify sentiment, one word."))
        print(f"reward={result.reward:.2f}  raw={result.observation.raw_task_score:.2f}  "
              f"tokens={result.observation.submitted_prompt_tokens}")

asyncio.run(main())
```

Run the env locally:

```bash
PROMPT_GOLF_TARGET_BACKEND=mock uvicorn server.app:app --port 8000   # CPU smoke test
PROMPT_GOLF_TARGET_BACKEND=hf PROMPT_GOLF_TARGET_MODEL=meta-llama/Llama-3.2-3B-Instruct \
    uvicorn server.app:app --port 8000                                # real GPU
```

Reproduce the hero training run on HuggingFace Jobs:

```bash
PUSH_TO_HUB=your-user/your-repo bash training/hf_job_train.sh   # ~3h on L40S
```

## Files

```
prompt_golf_env/
  models.py                            # GolfAction, GolfObservation, constants
  client.py                            # PromptGolfEnv (EnvClient)
  inference.py                         # OpenEnv-spec inference script
  pyproject.toml
  Dockerfile                           # HF Spaces image
  server/
    app.py                             # FastAPI app
    prompt_golf_environment.py         # core Env
    target_model.py                    # frozen-target wrapper (HF + mock)
    judge.py                           # Qwen3-8B 8-bit judge
    scorer.py                          # 21+ scorers (structural + LLM judge)
    rubrics.py                         # additive reward composition
    tasks.py / tasks_v2.py / tasks_tough.py / tasks_policy.py    # 90-task bank
  training/                            # see Links → Training pipeline
  ui/ + space-demo/                    # Gradio demos
  BLOG_POST.md                         # writeup
```