Don Rishabh Claude Opus 4.7 (1M context) committed on
Commit 9867aa7 · 1 Parent(s): c3e14ba

docs: drop the misleading 37× compression anecdote (0-accuracy task)

The MSN ad creative policy headline (737 tok → 20 tok, "37×
compression") cited a task where BOTH verbose and trained prompts
score 0 accuracy on Llama-3.2-3B. Calling that a 'compression win'
was disingenuous: at 0 accuracy neither prompt works, so there is
no deployable result.

Removed:
- README hero-result line: 'Peak compression: 37× on long-context
policy tasks (e.g. 737-token MSN... → 20-token...)'
- BLOG_POST per-task highlights row for policy_msn_ad_creative
- BLOG_POST 'For policy compliance — the long-context money case'
paragraph (the actual 37× anecdote)

Replaced the BLOG_POST policy paragraph with a JSON key-ordering
case study where the trained prompt is BOTH shorter AND more
accurate — a real win, not a 0-accuracy compression.

Also adds a 📊 Demo CSV link under each variant section in the
README so per-task numbers are one click away.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
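
For reference, the "win" rubric cited in the README diff below is
`raw_score − 0.5·baseline − 0.002·tokens`. A minimal sketch of why the
dropped anecdote still scored well on paper at 0 accuracy; the function
name, baseline value, and argument semantics are illustrative
assumptions, not the repo's actual scoring code:

```python
def rubric_reward(raw_score: float, baseline: float, tokens: int) -> float:
    """README rubric: raw_score - 0.5*baseline - 0.002*tokens.

    raw_score/baseline are taken here as accuracies in [0, 1] and tokens
    as the prompt length; names and semantics are assumptions for
    illustration, not the repo's API.
    """
    return raw_score - 0.5 * baseline - 0.002 * tokens

# On policy_msn_ad_creative both prompts score 0.00 accuracy, so the
# short prompt "wins" purely through the -0.002*tokens penalty:
baseline = 0.0  # illustrative; the repo's baseline term is task-specific
print(rubric_reward(0.00, baseline, 20))   # trained, 20 tok:  -0.04
print(rubric_reward(0.00, baseline, 737))  # verbose, 737 tok: -1.474
# Higher reward, zero capability: exactly the non-result this commit drops.
```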

Files changed (2)
  1. BLOG_POST.md +3 -6
  2. README.md +7 -1
BLOG_POST.md CHANGED
@@ -130,7 +130,6 @@ The verdict: thinking mode **did not help**. Thinking-OFF beat thinking-ON by **
 | `tough_yaml_nested_depth` | 74 tok / 0.96 | **20 tok** / **1.00** | 3.7× compression, accuracy improved |
 | `json_key_ordering` | 47 tok / 0.61 | **38 tok** / **0.78** | shorter AND +17pp accuracy |
 | `tough_fallacy_classify` | 164 tok / 0.00 | **59 tok** / **0.33** | added label vocabulary the verbose prompt forgot |
-| `policy_msn_ad_creative` | **737 tok** / 0.00 | **20 tok** / 0.00 | 37× compression (both fail; target capability gap, not prompt issue) |
 
 The trained adapter learned **a nuanced strategy**, not just "shorter":
 - Add label vocabulary when the verbose prompt forgets it ("positive / negative / neutral")
@@ -154,13 +153,11 @@ For YAML extraction with strict nesting:
 >
 > Trained agent: *"Generate a YAML document that meets the specified minimum nesting depth and includes all entities from the given specification."* (20 tokens, 1.00 accuracy)
 
-For policy compliance — the long-context money case:
+For JSON key ordering:
 
-> Verbose: 737 tokens of MSN ad policy listing prohibited content, restricted categories, format standards, decision schemas.
+> Verbose: 47 tokens describing key-order specification format, default-sorting behavior, output rules.
 >
-> Trained agent: *"Classify the input creative as allow, disallow, or review based on the given policy guidelines."* (20 tokens)
-
-That's **37× compression** on the prompt that's most expensive to ship. The trained agent's accuracy on this specific task is 0 — but so is the verbose prompt's, because Llama-3.2-3B isn't quite strong enough to reason over implicit policy hierarchy. **The compression is "free" — it doesn't lose accuracy the verbose prompt didn't already lack.**
+> Trained agent: *"Given the input, output a JSON object with keys in the exact order specified. Ignore default sorting."* (38 tokens — 0.78 accuracy vs 0.61 verbose; **the trained prompt is shorter AND more accurate**)
 
 ---
 
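The replacement case study is mechanically checkable, which is part of
why it makes a cleaner example. A sketch of the kind of exact-order
grader the `json_key_ordering` task implies; this checker is a guess at
the task's shape, not the repo's actual scorer:

```python
import json

def keys_in_exact_order(output: str, expected_keys: list[str]) -> bool:
    """True iff `output` parses to a JSON object whose top-level keys
    appear in exactly the specified order (i.e. no default sorting)."""
    try:
        obj = json.loads(output)  # Python 3.7+ dicts preserve key order
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and list(obj) == expected_keys

# The trained 38-token prompt must stop the target from alphabetizing:
print(keys_in_exact_order('{"z": 1, "a": 2}', ["z", "a"]))  # True
print(keys_in_exact_order('{"a": 2, "z": 1}', ["z", "a"]))  # False
```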
README.md CHANGED
@@ -63,10 +63,12 @@ A Qwen3-1.7B agent (trained via TRL GRPO) learns to write **35-token prompts** t
 | Untrained Qwen3-1.7B | 0.48 | ~38 |
 | **Trained Qwen3-1.7B + LoRA** | **0.52** | **35** |
 
-→ **80% accuracy retention at 55% of the verbose token count.** Peak compression: **37× on long-context policy tasks** (e.g. a 737-token MSN ad-creative policy → a 20-token classifier prompt).
+→ **80% accuracy retention at 55% of the verbose token count.**
 
 The trained prompt **beats the human verbose prompt on 48 of 87 tasks (55%)** under the same rubric (`raw_score − 0.5·baseline − 0.002·tokens`). On the rest, the accuracy drop on hard tasks outweighs the length savings — those are the cases where the trained agent compressed too aggressively to keep up with Llama's verbose-prompt capability ceiling.
 
+📊 **Demo CSV:** [`evals/qwen_to_llama_demo.csv`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/blob/main/evals/qwen_to_llama_demo.csv) — 90 rows × verbose / untrained / trained prompts side by side, with per-task accuracy + reward-advantage columns.
+
 ---
 
 ## How it works
@@ -100,6 +102,8 @@ Same agent (Qwen3-1.7B + LoRA) but target is also Qwen3-1.7B. Different framing:
 
 Qwen target's weakness makes this the easier comparison to "win" — the trained agent's compressed prompts beat verbose on 80% of tasks because verbose itself is failing. Cross-family Llama target is a much harder bar to clear, but the absolute accuracy is far higher.
 
+📊 **Demo CSV:** [`evals/qwen_to_qwen_demo.csv`](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b/blob/main/evals/qwen_to_qwen_demo.csv)
+
 ### Cross-family thinking=ON variant
 
 Identical training setup but with Qwen3's `<think>...</think>` chat template enabled.
@@ -114,6 +118,8 @@ Identical training setup but with Qwen3's `<think>...</think>` chat template ena
 
 OFF wins on reward and compression; ON wins on accuracy by 1.6pp at a 30% token cost. We ship OFF as the default; ON is a different operating point on the accuracy/length frontier.
 
+📊 **Demo CSV:** [`evals/qwen_to_llama_demo.csv`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama/blob/main/evals/qwen_to_llama_demo.csv)
+
 ---
 
 ## Quick start
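
The Demo CSV links added above point at plain files in the model repos,
so the per-task numbers can be pulled locally. A minimal sketch using
the standard `huggingface_hub` download call; the printed shape and
columns are whatever the CSV ships with (the README blurb suggests 90
rows), not verified here:

```python
import pandas as pd
from huggingface_hub import hf_hub_download

# Fetch the cross-family thinking=OFF demo CSV linked in the README.
path = hf_hub_download(
    repo_id="rishabh16196/prompt-golf-qwen-to-llama-nothink",
    filename="evals/qwen_to_llama_demo.csv",
)
df = pd.read_csv(path)
print(df.shape)          # README blurb says 90 rows
print(list(df.columns))  # verbose/untrained/trained prompts + accuracy/reward-advantage
```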