Don Rishabh Claude Opus 4.7 (1M context) committed on
Commit dec12b4 · 1 Parent(s): 8a2a589

README: replace ~ with ≈ in intro to fix accidental strikethrough


GitHub-flavored markdown / HF rendering interpreted "~3h on a single
L40S, ... ~39-token prompts ... ~94-token" as strikethrough spans.
Switch to ≈ which renders identically and parses unambiguously.
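For illustration (an assumption about the renderer's behavior, consistent with the message above): GitHub-flavored markdown's strikethrough extension treats text delimited by tildes as struck-through, and some renderers, GitHub's and apparently HF's among them, accept single-tilde delimiters. A span like the original intro therefore contained accidental strikethrough runs, while the ≈ form parses as plain text:

```markdown
Renders with strikethrough between the tildes:
~3h on a single L40S, ... with ~39-token prompts

Renders literally, no strikethrough:
≈3h on a single L40S, ... with ≈39-token prompts
```

Escaping the tildes (`\~3h`) would also work, but ≈ reads identically and needs no escaping.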

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -16,7 +16,7 @@ tags:
16    > **Can one LLM learn to whisper to another?**
17    > An OpenEnv RL environment where the agent's *action* is a prompt and the *reward* is how well that prompt steers a *frozen, different-family* target LLM to do the right thing — minus how long the prompt is.
18
19  - **The result.** A Qwen3-1.7B agent (LoRA + TRL GRPO, ~3h on a single L40S) learns to write **~39-token prompts** that retain **80% of the accuracy** of ~94-token human-written prompts on a frozen Llama-3.2-3B target — *cross-family, black-box, learned from outputs alone, no gradient access*. On **63/90 (70%) of tasks the agent's compressed prompt is the best of the three** we evaluated (verbose, untrained agent, trained agent) — cheaper *and* equal-or-better reward. [▶ Try the live demo](https://huggingface.co/spaces/rishabh16196/prompt-golf-demo) · [📝 Read the blog](./BLOG_POST.md)
19  + **The result.** A Qwen3-1.7B agent (LoRA + TRL GRPO, ≈3h on a single L40S) learns to write **≈39-token prompts** that retain **80% of the accuracy** of ≈94-token human-written prompts on a frozen Llama-3.2-3B target — *cross-family, black-box, learned from outputs alone, no gradient access*. On **63/90 (70%) of tasks the agent's compressed prompt is the best of the three** we evaluated (verbose, untrained agent, trained agent) — cheaper *and* equal-or-better reward. [▶ Try the live demo](https://huggingface.co/spaces/rishabh16196/prompt-golf-demo) · [📝 Read the blog](./BLOG_POST.md)
20
21    **Why this matters.** Production LLM systems prepend 1000-token policies to every classification call (creative compliance, content moderation, regulated-comm review). Today the only way to compress them is for a human prompt engineer to iterate by hand. If one LLM can build a behavioral model of another LLM accurately enough — the same way humans model each other — **the LLM can find the minimum policy itself.** Train once, ship the compressor, save 30× per call. Same env generalizes to red-teaming (swap the rubric), capability elicitation (swap the target), and prompt distillation (swap the bank).
22