docs: drop the misleading 37× compression anecdote (0-accuracy task)
The MSN ad creative policy headline (737 tok → 20 tok, "37×
compression") was citing a task where BOTH verbose and trained
prompts get 0 accuracy on Llama-3.2-3B. Calling that a 'compression
win' was disingenuous — at 0 accuracy neither prompt is working,
and it's not a deployable result.
Removed:
- README hero-result line: 'Peak compression: 37× on long-context
policy tasks (e.g. 737-token MSN... → 20-token...)'
- BLOG_POST per-task highlights row for policy_msn_ad_creative
- BLOG_POST 'For policy compliance — the long-context money case'
paragraph (the actual 37× anecdote)
Replaced the BLOG_POST policy paragraph with a JSON key-ordering
case study where the trained prompt is BOTH shorter AND more
accurate — a real win, not a 0-accuracy compression.
Also added a 📊 Demo CSV link under each variant section in the
README so per-task numbers are one click away.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- BLOG_POST.md +3 -6
- README.md +7 -1
```diff
--- a/BLOG_POST.md
+++ b/BLOG_POST.md
@@ -130,7 +130,6 @@ The verdict: thinking mode **did not help**. Thinking-OFF beat thinking-ON by **
 | `tough_yaml_nested_depth` | 74 tok / 0.96 | **20 tok** / **1.00** | 3.7× compression, accuracy improved |
 | `json_key_ordering` | 47 tok / 0.61 | **38 tok** / **0.78** | shorter AND +17pp accuracy |
 | `tough_fallacy_classify` | 164 tok / 0.00 | **59 tok** / **0.33** | added label vocabulary the verbose prompt forgot |
-| `policy_msn_ad_creative` | **737 tok** / 0.00 | **20 tok** / 0.00 | 37× compression (both fail; target capability gap, not prompt issue) |
 
 The trained adapter learned **a nuanced strategy**, not just "shorter":
 - Add label vocabulary when the verbose prompt forgets it ("positive / negative / neutral")
@@ -154,13 +153,11 @@ For YAML extraction with strict nesting:
 >
 > Trained agent: *"Generate a YAML document that meets the specified minimum nesting depth and includes all entities from the given specification."* (20 tokens, 1.00 accuracy)
 
-For policy compliance — the long-context money case:
+For JSON key ordering:
 
-> Verbose:
+> Verbose: 47 tokens describing key-order specification format, default-sorting behavior, output rules.
 >
-> Trained agent: *"
-
-That's **37× compression** on the prompt that's most expensive to ship. The trained agent's accuracy on this specific task is 0 — but so is the verbose prompt's, because Llama-3.2-3B isn't quite strong enough to reason over implicit policy hierarchy. **The compression is "free" — it doesn't lose accuracy the verbose prompt didn't already lack.**
+> Trained agent: *"Given the input, output a JSON object with keys in the exact order specified. Ignore default sorting."* (38 tokens — 0.78 accuracy vs 0.61 verbose; **the trained prompt is shorter AND more accurate**)
 
 ---
 
```
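To make the new case study concrete: below is a minimal sketch of the kind of check `json_key_ordering` implies — hypothetical grader code (the function name and signature are illustrative, not the repo's actual eval harness). The target model must emit a JSON object whose keys appear in exactly the specified order, resisting the alphabetized "default sorting" the trained prompt warns against.

```python
import json

def score_key_ordering(model_output: str, expected_order: list[str]) -> float:
    """Return 1.0 iff the output parses as a JSON object whose keys
    appear in exactly the specified order. Python dicts preserve
    insertion order, so json.loads keeps the order as emitted."""
    try:
        obj = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    return 1.0 if list(obj.keys()) == expected_order else 0.0

# The default-sorting trap: same content, wrong key order scores 0.
print(score_key_ordering('{"zip": "02139", "city": "Cambridge"}', ["zip", "city"]))  # 1.0
print(score_key_ordering('{"city": "Cambridge", "zip": "02139"}', ["zip", "city"]))  # 0.0 — alphabetized
```

The comparison is a strict order check, not a set check — which is why a prompt that never mentions ordering can score 0.61 while a shorter prompt that says "Ignore default sorting" scores 0.78.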
```diff
--- a/README.md
+++ b/README.md
@@ -63,10 +63,12 @@ A Qwen3-1.7B agent (trained via TRL GRPO) learns to write **35-token prompts** t
 | Untrained Qwen3-1.7B | 0.48 | ~38 |
 | **Trained Qwen3-1.7B + LoRA** | **0.52** | **35** |
 
-→ **80% accuracy retention at 55% of the verbose token count.** Peak compression: 37× on long-context policy tasks (e.g. 737-token MSN... → 20-token...)
+→ **80% accuracy retention at 55% of the verbose token count.**
 
 The trained prompt **beats the human verbose prompt on 48 of 87 tasks (55%)** under the same rubric (`raw_score − 0.5·baseline − 0.002·tokens`). On the rest, the accuracy drop on hard tasks outweighs the length savings — those are the cases where the trained agent compressed too aggressively to keep up with Llama's verbose-prompt capability ceiling.
 
+📊 **Demo CSV:** [`evals/qwen_to_llama_demo.csv`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama-nothink/blob/main/evals/qwen_to_llama_demo.csv) — 90 rows × verbose / untrained / trained prompts side by side, with per-task accuracy + reward-advantage columns.
+
 ---
 
 ## How it works
@@ -100,6 +102,8 @@ Same agent (Qwen3-1.7B + LoRA) but target is also Qwen3-1.7B. Different framing:
 
 Qwen target's weakness makes this the easier comparison to "win" — the trained agent's compressed prompts beat verbose on 80% of tasks because verbose itself is failing. Cross-family Llama target is a much harder bar to clear, but the absolute accuracy is far higher.
 
+📊 **Demo CSV:** [`evals/qwen_to_qwen_demo.csv`](https://huggingface.co/rishabh16196/prompt-golf-grpo-1.5b/blob/main/evals/qwen_to_qwen_demo.csv)
+
 ### Cross-family thinking=ON variant
 
 Identical training setup but with Qwen3's `<think>...</think>` chat template enabled.
@@ -114,6 +118,8 @@ Identical training setup but with Qwen3's `<think>...</think>` chat template ena
 
 OFF wins on reward and compression; ON wins on accuracy by 1.6pp at a 30% token cost. We ship OFF as the default; ON is a different operating point on the accuracy/length frontier.
 
+📊 **Demo CSV:** [`evals/qwen_to_llama_demo.csv`](https://huggingface.co/rishabh16196/prompt-golf-qwen-to-llama/blob/main/evals/qwen_to_llama_demo.csv)
+
 ---
 
 ## Quick start
```
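The rubric quoted in the README diff (`raw_score − 0.5·baseline − 0.002·tokens`) is easy to replay against a demo CSV. A minimal sketch follows — the column names (`verbose_score`, `trained_score`, `baseline`, `verbose_tokens`, `trained_tokens`) are assumptions for illustration; check the actual CSV headers before running:

```python
import pandas as pd

def rubric(raw_score, baseline, tokens):
    # Reward = task accuracy, minus a baseline offset, minus a per-token length penalty.
    return raw_score - 0.5 * baseline - 0.002 * tokens

df = pd.read_csv("evals/qwen_to_llama_demo.csv")  # assumed local copy of the demo CSV

# Reward under the rubric for the trained vs. verbose prompt on each task
# (column names are hypothetical — substitute the real headers).
df["verbose_reward"] = rubric(df["verbose_score"], df["baseline"], df["verbose_tokens"])
df["trained_reward"] = rubric(df["trained_score"], df["baseline"], df["trained_tokens"])

wins = (df["trained_reward"] > df["verbose_reward"]).sum()
print(f"trained beats verbose on {wins}/{len(df)} tasks")
```

Note that the baseline term cancels when comparing two prompts on the same task, so for `json_key_ordering` the trained prompt's advantage is (0.78 − 0.002·38) − (0.61 − 0.002·47) ≈ +0.19 reward, whatever the baseline.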
|