Don Rishabh and Claude Opus 4.7 (1M context) committed
Commit cc1bf10 · 1 Parent(s): 34b5069

space-demo: bundle for HF Spaces Gradio demo

Self-contained 4-file bundle that you push to a separate
rishabh16196/prompt-golf-demo Space:

  app.py            Gradio app (3-column verbose/untrained/trained,
                    batched target inference, lazy agent + LoRA loader
                    for live "Regenerate" mode)
  requirements.txt  torch + transformers + peft + gradio + accelerate
  README.md         Spaces metadata frontmatter (sdk: gradio,
                    hardware: t4-small) + headline numbers
  .gitignore        __pycache__ etc.

Pulls the demo CSV directly from
rishabh16196/prompt-golf-qwen-to-llama-nothink at startup.
Llama-3.2-3B as the target requires HF_TOKEN as a Space secret.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
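
The push step described in the commit message could be scripted roughly as below. This is a minimal sketch, not part of the commit: `BUNDLE_FILES`, `missing_bundle_files`, and `push_bundle` are hypothetical names, the local folder name `space-demo` is taken from the diff paths, and the sketch assumes the Space already exists and that a write token is available (for example via `HF_TOKEN` or a prior login). The upload itself uses `HfApi.upload_folder` from `huggingface_hub` with `repo_type="space"`.

```python
from pathlib import Path

# The four files named in the commit message.
BUNDLE_FILES = ("app.py", "requirements.txt", "README.md", ".gitignore")


def missing_bundle_files(folder):
    """Return the bundle files that are absent from `folder` (hypothetical helper)."""
    root = Path(folder)
    return [name for name in BUNDLE_FILES if not (root / name).is_file()]


def push_bundle(folder="space-demo", repo_id="rishabh16196/prompt-golf-demo"):
    """Upload the bundle to an existing Space; raise if a file is missing."""
    missing = missing_bundle_files(folder)
    if missing:
        raise FileNotFoundError(f"bundle incomplete, missing: {missing}")
    # Imported lazily so the completeness check works without huggingface_hub.
    from huggingface_hub import HfApi

    HfApi().upload_folder(folder_path=folder, repo_id=repo_id, repo_type="space")
```

Calling `push_bundle()` from the repository root would upload the folder in one commit; checking completeness first avoids pushing a Space that cannot start.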

space-demo/.gitignore ADDED
@@ -0,0 +1,4 @@
+ __pycache__/
+ *.pyc
+ .gradio/
+ .venv/
space-demo/README.md ADDED
@@ -0,0 +1,65 @@
+ ---
+ title: Prompt Golf Demo
+ emoji: ⛳
+ colorFrom: green
+ colorTo: blue
+ sdk: gradio
+ sdk_version: 4.44.0
+ app_file: app.py
+ pinned: false
+ license: mit
+ hardware: t4-small
+ short_description: "Compressed prompts for Llama 3.2 — a Qwen agent learned to write 12-token prompts that match 250-token human ones"
+ tags:
+ - prompt-engineering
+ - rl
+ - grpo
+ - prompt-compression
+ - openenv
+ ---
+
+ # Prompt Golf — Compression Demo
+
+ An interactive demo of a Qwen3-1.7B agent (with LoRA adapter, trained via TRL GRPO on the [Prompt Golf environment](https://huggingface.co/spaces/rishabh16196/prompt_golf_env)) that writes short prompts to steer a frozen Llama-3.2-3B-Instruct target.
+
+ ## How to use
+
+ 1. Pick a task from the dropdown (sorted by reward gain — top entries show the biggest training wins).
+ 2. Three prompts populate side-by-side:
+    - **Verbose**: the human-written task description
+    - **Untrained**: what raw Qwen3-1.7B writes when asked to compress
+    - **Trained**: what the GRPO-tuned Qwen3-1.7B + LoRA writes
+ 3. Type a test input and click **Run target with all three prompts** — the demo runs Llama-3.2-3B with each prompt prepended (in one batched forward pass) and shows the three outputs side by side.
+ 4. Optionally click **Regenerate prompts live** to load the agent and have it produce fresh untrained / trained prompts on the fly.
+
+ ## Headline numbers (90-task bank)
+
+ | Stage | Mean accuracy | Mean tokens |
+ |---|---|---|
+ | Verbose human prompt | 0.65 | ~63 |
+ | Untrained Qwen3-1.7B | 0.48 | ~38 |
+ | Trained Qwen3-1.7B + LoRA | 0.52 | ~35 |
+
+ → **80% accuracy retention at 55% of the verbose token count.** Peak compression: **37× on long-context policy tasks** (e.g. 737-token MSN ad-creative policy → 20-token classifier prompt).
+
+ ## Hardware
+
+ This Space is configured for **T4-small** ($0.40/hr). Llama-3.2-3B in bf16 fits comfortably; the agent (Qwen3-1.7B + LoRA) loads lazily on the first "Regenerate" click.
+
+ ## Links
+
+ - Environment: https://huggingface.co/spaces/rishabh16196/prompt_golf_env
+ - Trained adapter: https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink
+ - Demo CSV (90 tasks × all 3 prompt columns): https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/blob/main/evals/qwen_to_llama_demo.csv
+ - Blog post: https://huggingface.co/spaces/rishabh16196/prompt_golf_env/blob/main/BLOG_POST.md
+ - Training notebook: https://huggingface.co/spaces/rishabh16196/prompt_golf_env/blob/main/notebooks/prompt_golf_train_minimal.ipynb
+
+ ## Configuration (Space env vars)
+
+ | Var | Default | Purpose |
+ |---|---|---|
+ | `HF_TOKEN` (required, secret) | — | Auth for downloading gated Llama-3.2 |
+ | `DEMO_TARGET_MODEL` | `meta-llama/Llama-3.2-3B-Instruct` | Frozen target |
+ | `DEMO_AGENT_MODEL` | `Qwen/Qwen3-1.7B` | Agent base for live regen |
+ | `DEMO_AGENT_ADAPTER` | `rishabh16196/prompt-golf-qwen-to-llama-nothink` | Trained LoRA |
+ | `DEMO_CSV_URL` | hub URL above | Source of precomputed prompts |
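
The headline summary in the README above can be re-derived from its own table. The snippet below is only a worked check, using the figures exactly as the README states them; the variable names are illustrative and nothing here comes from the evaluation code.

```python
# Figures copied from the README's headline table and summary line.
verbose = {"accuracy": 0.65, "tokens": 63}
trained = {"accuracy": 0.52, "tokens": 35}

# Fraction of the verbose prompt's accuracy that the trained prompt keeps.
accuracy_retention = trained["accuracy"] / verbose["accuracy"]

# Trained prompt length as a fraction of the verbose prompt length.
token_fraction = trained["tokens"] / verbose["tokens"]

# Peak example: 737-token policy compressed to a 20-token prompt.
peak_compression = 737 / 20

print(f"accuracy retention: {accuracy_retention:.2f}")
print(f"token fraction:     {token_fraction:.2f}")
print(f"peak compression:   {peak_compression:.2f}x")
```

The retention works out to about 0.80 and the peak ratio to just under 37, matching the README. The token fraction comes out nearer 0.56 than 0.55, which is plausibly a rounding effect since the table's token means are approximate (marked `~`).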
space-demo/app.py ADDED
@@ -0,0 +1,538 @@
+ """
+ Prompt Golf — Hugging Face Spaces demo (Gradio).
+
+ Loads:
+   - Llama-3.2-3B-Instruct as the frozen TARGET model
+   - Qwen3-1.7B + LoRA adapter as the trained AGENT (lazy, on first
+     "Regenerate live" click)
+   - The demo CSV (verbose / untrained / trained prompts × 90 tasks)
+     fetched from the trained-adapter repo on first launch
+
+ For each task selected: shows the three prompts side-by-side, and
+ runs the target on a user-provided test input with all three in one
+ batched forward pass — so the demo's punch is "watch the same model
+ produce the same answer with a 12-token prompt that the human had to
+ write 250 tokens for."
+
+ Designed for a HuggingFace Space with GPU (T4 / A10G / L4 / L40S).
+ HF_TOKEN must be configured as a Space secret (Llama-3.2 is gated).
+ """
+
+ from __future__ import annotations
+
+ import csv
+ import io
+ import os
+ import re
+ import textwrap
+ import time
+ import urllib.request
+ from dataclasses import dataclass
+ from typing import Dict, List, Optional
+
+ import torch
+ import gradio as gr
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+
+ # ---------------------------------------------------------------------------
+ # Defaults — override via Space secrets / env vars if needed
+ # ---------------------------------------------------------------------------
+
+ DEFAULTS = {
+     "target_model": os.environ.get(
+         "DEMO_TARGET_MODEL", "meta-llama/Llama-3.2-3B-Instruct"
+     ),
+     "agent_model": os.environ.get(
+         "DEMO_AGENT_MODEL", "Qwen/Qwen3-1.7B"
+     ),
+     "agent_adapter": os.environ.get(
+         "DEMO_AGENT_ADAPTER",
+         "rishabh16196/prompt-golf-qwen-to-llama-nothink",
+     ),
+     "demo_csv_url": os.environ.get(
+         "DEMO_CSV_URL",
+         "https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/"
+         "resolve/main/evals/qwen_to_llama_demo.csv",
+     ),
+     "max_new_tokens": 64,
+     "agent_max_new_tokens": 256,
+     "enable_thinking": False,  # matches the trained adapter's training config
+ }
+
+
+ # ---------------------------------------------------------------------------
+ # Demo CSV loader
+ # ---------------------------------------------------------------------------
+
+ def load_demo_rows() -> List[Dict]:
+     url = DEFAULTS["demo_csv_url"]
+     print(f"[demo] fetching CSV from {url}", flush=True)
+     headers = {}
+     token = os.environ.get("HF_TOKEN")
+     if token:
+         headers["Authorization"] = f"Bearer {token}"
+     req = urllib.request.Request(url, headers=headers)
+     with urllib.request.urlopen(req) as r:
+         text = r.read().decode("utf-8")
+     rows = list(csv.DictReader(io.StringIO(text)))
+
+     def _delta(r: Dict) -> float:
+         try:
+             return float(r.get("reward_delta_trained_minus_base") or 0)
+         except ValueError:
+             return 0.0
+
+     rows.sort(key=_delta, reverse=True)
+     print(f"[demo] loaded {len(rows)} rows", flush=True)
+     return rows
+
+
+ # ---------------------------------------------------------------------------
+ # Target / agent singletons
+ # ---------------------------------------------------------------------------
+
+ _TOK = None
+ _MODEL = None
+ _DEVICE = None
+ _AGENT_TOK = None
+ _AGENT_BASE = None
+ _AGENT_TRAINED = None
+
+
+ def _device() -> str:
+     if torch.cuda.is_available():
+         return "cuda"
+     if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
+         return "mps"
+     return "cpu"
+
+
+ def load_target() -> None:
+     global _TOK, _MODEL, _DEVICE
+     if _MODEL is not None:
+         return
+     _DEVICE = _device()
+     name = DEFAULTS["target_model"]
+     print(f"[demo] loading target {name} on {_DEVICE}...", flush=True)
+     t0 = time.time()
+     _TOK = AutoTokenizer.from_pretrained(name)
+     _TOK.padding_side = "left"
+     if _TOK.pad_token is None:
+         _TOK.pad_token = _TOK.eos_token
+     dtype = torch.bfloat16 if _DEVICE in ("cuda", "mps") else torch.float32
+     _MODEL = AutoModelForCausalLM.from_pretrained(
+         name, dtype=dtype,
+         device_map="auto" if _DEVICE == "cuda" else None,
+     )
+     if _DEVICE != "cuda":
+         _MODEL = _MODEL.to(_DEVICE)
+     _MODEL.eval()
+     print(f"[demo] target loaded in {time.time()-t0:.1f}s ({dtype})",
+           flush=True)
+
+
+ @torch.inference_mode()
+ def run_target_batch(prompts: List[str], test_input: str) -> List[str]:
+     load_target()
+     full_texts = []
+     keep_idx = []
+     for i, p in enumerate(prompts):
+         if p and p.strip():
+             full_texts.append(f"{p}\n\n{test_input}".strip())
+             keep_idx.append(i)
+     if not full_texts:
+         return ["" for _ in prompts]
+
+     enc = _TOK(full_texts, return_tensors="pt", padding=True,
+                truncation=True, max_length=4096).to(_DEVICE)
+     out = _MODEL.generate(
+         **enc,
+         max_new_tokens=DEFAULTS["max_new_tokens"],
+         do_sample=False,
+         temperature=1.0,
+         pad_token_id=_TOK.pad_token_id,
+     )
+     in_len = enc["input_ids"].shape[1]
+     decoded = []
+     for i in range(out.shape[0]):
+         new_ids = out[i][in_len:]
+         decoded.append(_TOK.decode(new_ids, skip_special_tokens=True).strip())
+
+     results = ["" for _ in prompts]
+     for j, idx in enumerate(keep_idx):
+         results[idx] = decoded[j]
+     return results
+
+
+ def count_tokens(text: str) -> int:
+     load_target()
+     return len(_TOK.encode(text or "", add_special_tokens=False))
+
+
+ # ---------------------------------------------------------------------------
+ # Agent (lazy) — for the "Regenerate live" button
+ # ---------------------------------------------------------------------------
+
+ # Inlined utilities (copied from training/train_grpo.py so the Space stays
+ # self-contained — no need to install the full env package).
+ SYSTEM_PROMPT = textwrap.dedent("""
+     You are a prompt engineer. Your job: write a system prompt that makes a
+     separate, frozen target LLM solve the task on HIDDEN test inputs.
+
+     Rules:
+     - Output ONLY your prompt, wrapped in <prompt>...</prompt>.
+     - Keep it SHORT. Shorter prompts score higher.
+     - DO NOT copy train examples verbatim into your prompt — a leakage
+       detector scales the reward toward zero if you do.
+     - Use imperative voice. Anchor the output format tightly.
+ """).strip()
+
+ PROMPT_TAG_RE = re.compile(r"<prompt>(.*?)</prompt>", re.DOTALL | re.IGNORECASE)
+ THINK_BLOCK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
+
+
+ def extract_prompt(text: str) -> str:
+     text = text or ""
+     stripped = THINK_BLOCK_RE.sub("", text).strip()
+     m = PROMPT_TAG_RE.search(stripped)
+     if m and m.group(1).strip():
+         return m.group(1).strip()
+     for line in stripped.split("\n"):
+         line = line.strip()
+         if line and not line.lower().startswith(("<think>", "</think>")):
+             return line
+     return "Follow the instruction. Output only the answer."
+
+
+ def build_user_message(task_id: str, category: str, description: str,
+                        budget: int, target_model_id: str) -> str:
+     return textwrap.dedent(f"""
+         TASK: {task_id} (category: {category})
+         DESCRIPTION: {description}
+         TOKEN BUDGET: {budget}
+         TARGET: {target_model_id}
+         BASELINE (empty prompt) SCORE: 0.00
+
+         Visible train examples (do not copy verbose):
+         (none)
+
+         Write your prompt inside <prompt>...</prompt>.
+     """).strip()
+
+
+ def load_agents() -> bool:
+     global _AGENT_TOK, _AGENT_BASE, _AGENT_TRAINED
+     if _AGENT_TRAINED is not None:
+         return True
+     if not DEFAULTS.get("agent_adapter"):
+         return False
+     name = DEFAULTS["agent_model"]
+     adapter = DEFAULTS["agent_adapter"]
+     print(f"[demo] loading agent {name} + adapter {adapter}...", flush=True)
+     t0 = time.time()
+     _AGENT_TOK = AutoTokenizer.from_pretrained(name)
+     _AGENT_TOK.padding_side = "left"
+     if _AGENT_TOK.pad_token is None:
+         _AGENT_TOK.pad_token = _AGENT_TOK.eos_token
+     dev = _device()
+     dtype = torch.bfloat16 if dev in ("cuda", "mps") else torch.float32
+     _AGENT_BASE = AutoModelForCausalLM.from_pretrained(
+         name, dtype=dtype,
+         device_map="auto" if dev == "cuda" else None,
+     )
+     if dev != "cuda":
+         _AGENT_BASE = _AGENT_BASE.to(dev)
+     _AGENT_BASE.eval()
+
+     from peft import PeftModel
+     base_for_adapter = AutoModelForCausalLM.from_pretrained(
+         name, dtype=dtype,
+         device_map="auto" if dev == "cuda" else None,
+     )
+     if dev != "cuda":
+         base_for_adapter = base_for_adapter.to(dev)
+     _AGENT_TRAINED = PeftModel.from_pretrained(base_for_adapter, adapter)
+     _AGENT_TRAINED.eval()
+     print(f"[demo] agents loaded in {time.time()-t0:.1f}s", flush=True)
+     return True
+
+
+ @torch.inference_mode()
+ def _agent_generate(model, tok, chat_str: str, max_new_tokens: int) -> str:
+     enc = tok(chat_str, return_tensors="pt").to(_device())
+     out = model.generate(
+         **enc, max_new_tokens=max_new_tokens, do_sample=False,
+         temperature=1.0, pad_token_id=tok.pad_token_id,
+     )
+     new_ids = out[0][enc["input_ids"].shape[1]:]
+     return tok.decode(new_ids, skip_special_tokens=True).strip()
+
+
+ def regenerate_live(task_id: str, category: str, verbose_prompt: str,
+                     budget_str: str):
+     if not task_id:
+         return "", "", "(no task selected)"
+     if not load_agents():
+         return "", "", ("agent loading disabled — set DEMO_AGENT_ADAPTER "
+                         "to enable live regeneration")
+     try:
+         budget = int(budget_str)
+     except (ValueError, TypeError):
+         budget = 60
+
+     user_msg = build_user_message(
+         task_id=task_id, category=category,
+         description=verbose_prompt, budget=budget,
+         target_model_id=DEFAULTS["target_model"],
+     )
+     messages = [
+         {"role": "system", "content": SYSTEM_PROMPT},
+         {"role": "user", "content": user_msg},
+     ]
+     try:
+         chat_str = _AGENT_TOK.apply_chat_template(
+             messages, tokenize=False, add_generation_prompt=True,
+             enable_thinking=DEFAULTS["enable_thinking"],
+         )
+     except TypeError:
+         chat_str = _AGENT_TOK.apply_chat_template(
+             messages, tokenize=False, add_generation_prompt=True,
+         )
+
+     t0 = time.time()
+     raw_base = _agent_generate(
+         _AGENT_BASE, _AGENT_TOK, chat_str,
+         max_new_tokens=DEFAULTS["agent_max_new_tokens"],
+     )
+     t1 = time.time()
+     raw_trained = _agent_generate(
+         _AGENT_TRAINED, _AGENT_TOK, chat_str,
+         max_new_tokens=DEFAULTS["agent_max_new_tokens"],
+     )
+     t2 = time.time()
+
+     base_p = extract_prompt(raw_base)
+     trained_p = extract_prompt(raw_trained)
+     msg = (
+         f"agents regenerated in {t2-t0:.1f}s "
+         f"(base {t1-t0:.1f}s, trained {t2-t1:.1f}s) | "
+         f"base: {count_tokens(base_p)} tok | "
+         f"trained: {count_tokens(trained_p)} tok"
+     )
+     return base_p, trained_p, msg
+
+
+ # ---------------------------------------------------------------------------
+ # Gradio handlers
+ # ---------------------------------------------------------------------------
+
+ ROWS: List[Dict] = []
+
+
+ def task_choices() -> List[str]:
+     out = []
+     for r in ROWS:
+         try:
+             cr = float(r.get("compression_ratio_trained_vs_verbose") or 0)
+             rd = float(r.get("reward_delta_trained_minus_base") or 0)
+             tag = (f" [{int(round(1/cr))}× compress, Δr={rd:+.2f}]"
+                    if cr else "")
+         except (ValueError, ZeroDivisionError):
+             tag = ""
+         out.append(f"{r['task_id']}{tag}")
+     return out
+
+
+ def _row_for_label(label: str) -> Optional[Dict]:
+     if not label:
+         return None
+     tid = label.split()[0]
+     for r in ROWS:
+         if r["task_id"] == tid:
+             return r
+     return None
+
+
+ def select_task(label: str):
+     r = _row_for_label(label) or {}
+     return (
+         r.get("verbose_prompt", ""),
+         r.get("base_prompt", ""),
+         r.get("trained_prompt", ""),
+         r.get("category", ""),
+         r.get("scorer", ""),
+         r.get("verbose_tokens", "?"),
+         r.get("base_tokens", "?"),
+         r.get("trained_tokens", "?"),
+         r.get("verbose_accuracy", "?"),
+         r.get("base_accuracy", "?"),
+         r.get("trained_accuracy", "?"),
+         r.get("budget_tokens", "?"),
+         r.get("task_id", ""),
+         "",  # test_input — start blank
+     )
+
+
+ def generate_three(verbose_prompt: str, base_prompt: str, trained_prompt: str,
+                    test_input: str):
+     if not test_input.strip():
+         empty = "(enter a test input above)"
+         return empty, empty, empty, ""
+     t0 = time.time()
+     outs = run_target_batch(
+         [verbose_prompt, base_prompt, trained_prompt], test_input,
+     )
+     elapsed = time.time() - t0
+     metrics = (
+         f"batched in {elapsed:.1f}s | "
+         f"verbose: {count_tokens(verbose_prompt)} tok | "
+         f"untrained: {count_tokens(base_prompt)} tok | "
+         f"trained: {count_tokens(trained_prompt)} tok"
+     )
+     return outs[0], outs[1], outs[2], metrics
+
+
+ # ---------------------------------------------------------------------------
+ # Build app
+ # ---------------------------------------------------------------------------
+
+ def build_app() -> gr.Blocks:
+     global ROWS
+     ROWS = load_demo_rows()
+     initial = task_choices()[0] if ROWS else ""
+
+     with gr.Blocks(
+         title="Prompt Golf — Compression Demo",
+         theme=gr.themes.Soft(),
+     ) as app:
+         gr.Markdown(
+             f"# Prompt Golf — Compression Demo\n"
+             f"Compressed prompts from a Qwen3-1.7B agent (trained via GRPO), "
+             f"scored against **`{DEFAULTS['target_model']}`** as the target. "
+             f"Tasks ordered by reward gain (top = biggest improvement).\n\n"
+             f"Three columns: **verbose** (the human-written task description), "
+             f"**untrained** (raw Qwen3 output), and **trained** (after RL "
+             f"fine-tuning). Pick a task, type a test input, watch the target "
+             f"produce outputs with each prompt side by side."
+         )
+
+         with gr.Row():
+             task_dd = gr.Dropdown(
+                 choices=task_choices(),
+                 value=initial,
+                 label="Task",
+                 scale=4,
+             )
+             cat = gr.Textbox(label="category", interactive=False, scale=1)
+             scorer = gr.Textbox(label="scorer", interactive=False, scale=1)
+
+         # Hidden state for live regen
+         _task_id_state = gr.Textbox(visible=False)
+         _budget_state = gr.Textbox(visible=False)
+
+         with gr.Row():
+             with gr.Column():
+                 gr.Markdown("### Verbose (human-written)")
+                 verbose_box = gr.Textbox(
+                     label="prompt", lines=8, interactive=True,
+                 )
+                 with gr.Row():
+                     v_tok = gr.Textbox(label="tokens", interactive=False)
+                     v_acc = gr.Textbox(label="accuracy", interactive=False)
+             with gr.Column():
+                 gr.Markdown("### Untrained agent (base)")
+                 base_box = gr.Textbox(
+                     label="prompt", lines=8, interactive=True,
+                 )
+                 with gr.Row():
+                     b_tok = gr.Textbox(label="tokens", interactive=False)
+                     b_acc = gr.Textbox(label="accuracy", interactive=False)
+             with gr.Column():
+                 gr.Markdown("### Trained agent (compressed)")
+                 trained_box = gr.Textbox(
+                     label="prompt", lines=8, interactive=True,
+                 )
+                 with gr.Row():
+                     t_tok = gr.Textbox(label="tokens", interactive=False)
+                     t_acc = gr.Textbox(label="accuracy", interactive=False)
+
+         gr.Markdown("### Test input — edit to try your own")
+         test_input = gr.Textbox(
+             label="input",
+             lines=3,
+             placeholder=("Type or paste a test input. The three prompts above "
+                          "will each be prepended to it before the target "
+                          "generates."),
+         )
+
+         with gr.Row():
+             regen_btn = gr.Button(
+                 "Regenerate prompts live (loads agent + LoRA)",
+                 variant="secondary",
+             )
+             run_btn = gr.Button(
+                 "Run target with all three prompts", variant="primary"
+             )
+         regen_status = gr.Textbox(label="agent status", interactive=False)
+
+         with gr.Row():
+             with gr.Column():
+                 gr.Markdown("### Target output — VERBOSE")
+                 out_v = gr.Textbox(label="output", lines=4, interactive=False)
+             with gr.Column():
+                 gr.Markdown("### Target output — UNTRAINED")
+                 out_b = gr.Textbox(label="output", lines=4, interactive=False)
+             with gr.Column():
+                 gr.Markdown("### Target output — TRAINED")
+                 out_t = gr.Textbox(label="output", lines=4, interactive=False)
+
+         metrics = gr.Textbox(label="metrics", interactive=False)
+
+         gr.Markdown(
+             "---\n"
+             "**About**: this is the demo artifact for "
+             "[`prompt_golf_env`](https://huggingface.co/spaces/rishabh16196/prompt_golf_env), "
+             "an OpenEnv environment where the agent's *action* is a prompt "
+             "and the *reward* is how well that prompt steers a frozen target "
+             "LLM. The trained adapter shown here was fine-tuned with GRPO on "
+             "a 90-task bank including 3 long-context policy-compression "
+             "tasks (~700-token policies → ~25-token classifier prompts).\n"
+             "- 📝 [Blog post](https://huggingface.co/spaces/rishabh16196/prompt_golf_env/blob/main/BLOG_POST.md)\n"
+             "- 📊 [Demo CSV](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/blob/main/evals/qwen_to_llama_demo.csv)\n"
+             "- 🤖 [Trained adapter](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink)"
+         )
+
+         # Wire events
+         select_outputs = [
+             verbose_box, base_box, trained_box, cat, scorer,
+             v_tok, b_tok, t_tok, v_acc, b_acc, t_acc,
+             _budget_state, _task_id_state, test_input,
+         ]
+         task_dd.change(select_task, inputs=[task_dd], outputs=select_outputs)
+         regen_btn.click(
+             regenerate_live,
+             inputs=[_task_id_state, cat, verbose_box, _budget_state],
+             outputs=[base_box, trained_box, regen_status],
+         )
+         run_btn.click(
+             generate_three,
+             inputs=[verbose_box, base_box, trained_box, test_input],
+             outputs=[out_v, out_b, out_t, metrics],
+         )
+         app.load(select_task, inputs=[task_dd], outputs=select_outputs)
+
+     return app
+
+
+ def main() -> None:
+     print(f"[demo] target = {DEFAULTS['target_model']}", flush=True)
+     print(f"[demo] adapter = {DEFAULTS['agent_adapter']}", flush=True)
+     print(f"[demo] csv url = {DEFAULTS['demo_csv_url']}", flush=True)
+     load_target()
+     app = build_app()
+     app.launch()
+
+
+ if __name__ == "__main__":
+     main()
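
The index bookkeeping in `run_target_batch` above (skip blank prompts, run one batch, then put each output back in its original slot) can be exercised without loading a model. The following is a sketch under stated assumptions, not code from the commit: `run_batch_skipping_blanks` is a hypothetical name, and the `generate` callable stands in for the tokenizer plus `_MODEL.generate` round trip.

```python
from typing import Callable, List


def run_batch_skipping_blanks(
    prompts: List[str],
    test_input: str,
    generate: Callable[[List[str]], List[str]],
) -> List[str]:
    """Call `generate` once on the non-blank prompts, keeping outputs aligned."""
    full_texts: List[str] = []
    keep_idx: List[int] = []
    for i, p in enumerate(prompts):
        if p and p.strip():
            # Same concatenation as app.py: prompt, blank line, test input.
            full_texts.append(f"{p}\n\n{test_input}".strip())
            keep_idx.append(i)

    # Blank prompts keep an empty-string result in their original position.
    results = ["" for _ in prompts]
    if not full_texts:
        return results

    decoded = generate(full_texts)  # one batched call for all kept prompts
    for j, idx in enumerate(keep_idx):
        results[idx] = decoded[j]
    return results


def fake_generate(texts: List[str]) -> List[str]:
    """Stand-in for the model: upper-cases each input text."""
    return [t.upper() for t in texts]


outputs = run_batch_skipping_blanks(
    ["Classify:", "", "Label it"], "great movie", fake_generate
)
print(outputs)
```

Because the middle prompt is blank, only two texts reach `fake_generate`, yet the returned list still has three entries, so the UI columns stay matched to their prompts.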
space-demo/requirements.txt ADDED
@@ -0,0 +1,7 @@
+ torch>=2.1
+ transformers>=4.45
+ peft>=0.13
+ accelerate>=0.34
+ huggingface_hub>=0.26
+ gradio>=4.40
+ sentencepiece