Don Rishabh Claude Opus 4.7 (1M context) committed on
Commit 9867aa7 · 1 Parent(s): c3e14ba

docs: drop the misleading 37× compression anecdote (0-accuracy task)

The MSN ad creative policy headline (737 tok → 20 tok, "37×
compression") cited a task where BOTH verbose and trained prompts
score 0 accuracy on Llama-3.2-3B. Calling that a 'compression win'
was disingenuous: at 0 accuracy neither prompt works, so there is
no deployable result.

Removed:
- README hero-result line: 'Peak compression: 37× on long-context
policy tasks (e.g. 737-token MSN... → 20-token...)'
- BLOG_POST per-task highlights row for policy_msn_ad_creative
- BLOG_POST 'For policy compliance — the long-context money case'
paragraph (the actual 37× anecdote)

Replaced the BLOG_POST policy paragraph with a JSON key-ordering
case study where the trained prompt is BOTH shorter AND more
accurate — a real win, not a 0-accuracy compression.

Also adds a 📊 Demo CSV link under each variant section in the
README so per-task numbers are one click away.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
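
For reference, the "win" rubric cited in the README diff below is
`raw_score − 0.5·baseline − 0.002·tokens`. A minimal sketch of why the
dropped anecdote still scored well on paper at 0 accuracy; the function
name, baseline value, and argument semantics are illustrative
assumptions, not the repo's actual scoring code:

```python
def rubric_reward(raw_score: float, baseline: float, tokens: int) -> float:
    """README rubric: raw_score - 0.5*baseline - 0.002*tokens.

    raw_score/baseline are taken here as accuracies in [0, 1] and tokens
    as the prompt length; names and semantics are assumptions for
    illustration, not the repo's API.
    """
    return raw_score - 0.5 * baseline - 0.002 * tokens

# On policy_msn_ad_creative both prompts score 0.00 accuracy, so the
# short prompt "wins" purely through the -0.002*tokens penalty:
baseline = 0.0  # illustrative; the repo's baseline term is task-specific
print(rubric_reward(0.00, baseline, 20))   # trained, 20 tok:  -0.04
print(rubric_reward(0.00, baseline, 737))  # verbose, 737 tok: -1.474
# Higher reward, zero capability: exactly the non-result this commit drops.
```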

Files changed (2)
  1. BLOG_POST.md +3 -6
  2. README.md +7 -1
BLOG_POST.md CHANGED
@@ -130,7 +130,6 @@ The verdict: thinking mode **did not help**. Thinking-OFF beat thinking-ON by **
 | `tough_yaml_nested_depth` | 74 tok / 0.96 | **20 tok** / **1.00** | 3.7× compression, accuracy improved |
 | `json_key_ordering` | 47 tok / 0.61 | **38 tok** / **0.78** | shorter AND +17pp accuracy |
 | `tough_fallacy_classify` | 164 tok / 0.00 | **59 tok** / **0.33** | added label vocabulary the verbose prompt forgot |
-| `policy_msn_ad_creative` | **737 tok** / 0.00 | **20 tok** / 0.00 | 37× compression (both fail; target capability gap, not prompt issue) |
 
 The trained adapter learned **a nuanced strategy**, not just "shorter":
 - Add label vocabulary when the verbose prompt forgets it ("positive / negative / neutral")
@@ -154,13 +153,11 @@ For YAML extraction with strict nesting:
 >
 > Trained agent: *"Generate a YAML document that meets the specified minimum nesting depth and includes all entities from the given specification."* (20 tokens, 1.00 accuracy)
 
-For policy compliance — the long-context money case:
+For JSON key ordering:
 
-> Verbose: 737 tokens of MSN ad policy listing prohibited content, restricted categories, format standards, decision schemas.
+> Verbose: 47 tokens describing key-order specification format, default-sorting behavior, output rules.
 >
-> Trained agent: *"Classify the input creative as allow, disallow, or review based on the given policy guidelines."* (20 tokens)
-
-That's **37× compression** on the prompt that's most expensive to ship. The trained agent's accuracy on this specific task is 0 — but so is the verbose prompt's, because Llama-3.2-3B isn't quite strong enough to reason over implicit policy hierarchy. **The compression is "free" — it doesn't lose accuracy the verbose prompt didn't already lack.**
+> Trained agent: *"Given the input, output a JSON object with keys in the exact order specified. Ignore default sorting."* (38 tokens — 0.78 accuracy vs 0.61 verbose; **the trained prompt is shorter AND more accurate**)
 
 ---
 
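The replacement case study is mechanically checkable, which is part of
why it makes a cleaner example. A sketch of the kind of exact-order
grader the `json_key_ordering` task implies; this checker is a guess at
the task's shape, not the repo's actual scorer:

```python
import json

def keys_in_exact_order(output: str, expected_keys: list[str]) -> bool:
    """True iff `output` parses to a JSON object whose top-level keys
    appear in exactly the specified order (i.e. no default sorting)."""
    try:
        obj = json.loads(output)  # Python 3.7+ dicts preserve key order
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and list(obj) == expected_keys

# The trained 38-token prompt must stop the target from alphabetizing:
print(keys_in_exact_order('{"z": 1, "a": 2}', ["z", "a"]))  # True
print(keys_in_exact_order('{"a": 2, "z": 1}', ["z", "a"]))  # False
```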
README.md CHANGED
@@ -63,10 +63,12 @@ A Qwen3-1.7B agent (trained via TRL GRPO) learns to write **35-token prompts** t
 | Untrained Qwen3-1.7B | 0.48 | ~38 |
 | **Trained Qwen3-1.7B + LoRA** | **0.52** | **35** |
 
-→ **80% accuracy retention at 55% of the verbose token count.** Peak compression: **37× on long-context policy tasks** (e.g. a 737-token MSN ad-creative policy → a 20-token classifier prompt).
+→ **80% accuracy retention at 55% of the verbose token count.**
 
 The trained prompt **beats the human verbose prompt on 48 of 87 tasks (55%)** under the same rubric (`raw_score − 0.5·baseline − 0.002·tokens`). On the rest, the accuracy drop on hard tasks outweighs the length savings — those are the cases where the trained agent compressed too aggressively to keep up with Llama's verbose-prompt capability ceiling.
 
+📊 **Demo CSV:** [`evals/qwen_to_llama_demo.csv`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/blob/main/evals/qwen_to_llama_demo.csv) — 90 rows × verbose / untrained / trained prompts side by side, with per-task accuracy + reward-advantage columns.
+
 ---
 
 ## How it works
@@ -100,6 +102,8 @@ Same agent (Qwen3-1.7B + LoRA) but target is also Qwen3-1.7B. Different framing:
 
 Qwen target's weakness makes this the easier comparison to "win" — the trained agent's compressed prompts beat verbose on 80% of tasks because verbose itself is failing. Cross-family Llama target is a much harder bar to clear, but the absolute accuracy is far higher.
 
+📊 **Demo CSV:** [`evals/qwen_to_qwen_demo.csv`](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b/blob/main/evals/qwen_to_qwen_demo.csv)
+
 ### Cross-family thinking=ON variant
 
 Identical training setup but with Qwen3's `<think>...</think>` chat template enabled.
@@ -114,6 +118,8 @@ Identical training setup but with Qwen3's `<think>...</think>` chat template ena
 
 OFF wins on reward and compression; ON wins on accuracy by 1.6pp at a 30% token cost. We ship OFF as the default; ON is a different operating point on the accuracy/length frontier.
 
+📊 **Demo CSV:** [`evals/qwen_to_llama_demo.csv`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama/blob/main/evals/qwen_to_llama_demo.csv)
+
 ---
 
 ## Quick start
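
The Demo CSV links added above point at plain files in the model repos,
so the per-task numbers can be pulled locally. A minimal sketch using
the standard `huggingface_hub` download call; the printed shape and
columns are whatever the CSV ships with (the README blurb suggests 90
rows), not verified here:

```python
import pandas as pd
from huggingface_hub import hf_hub_download

# Fetch the cross-family thinking=OFF demo CSV linked in the README.
path = hf_hub_download(
    repo_id="rishabh16196/prompt-golf-qwen-to-llama-nothink",
    filename="evals/qwen_to_llama_demo.csv",
)
df = pd.read_csv(path)
print(df.shape)          # README blurb says 90 rows
print(list(df.columns))  # verbose/untrained/trained prompts + accuracy/reward-advantage
```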