Don Rishabh and Claude Opus 4.7 (1M context) committed
Commit: 5f71cca · Parent(s) (1): 86d20a9
BLOG_POST: drop 37× MSN policy compression callout
Removes the policy_msn_ad_creative row + "long-context money case"
quote. Both prompts score 0 on that task, so the 37× compression is
a no-cost-no-benefit number that overstates the result.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
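
The arithmetic behind this commit message can be checked with a short sketch. The token counts and accuracies below are copied from the table rows in the diff; the `summarize` helper and its `informative` rule (a compression ratio is only worth reporting if at least one prompt scores above 0) are illustrative assumptions, not code from the repository.

```python
# Rows copied from the BLOG_POST.md table shown in this diff:
# (task, verbose_tokens, verbose_accuracy, trained_tokens, trained_accuracy)
ROWS = [
    ("tough_yaml_nested_depth", 74, 0.96, 20, 1.00),
    ("json_key_ordering", 47, 0.61, 38, 0.78),
    ("tough_fallacy_classify", 164, 0.00, 59, 0.33),
    ("policy_msn_ad_creative", 737, 0.00, 20, 0.00),
]


def summarize(row):
    """Return compression ratio, accuracy change, and a 'worth reporting' flag.

    The flag is a hypothetical rule used only for illustration: if both
    prompts score 0, the compression ratio says nothing about quality.
    """
    task, v_tok, v_acc, t_tok, t_acc = row
    return {
        "task": task,
        "compression": v_tok / t_tok,
        "delta_pp": round((t_acc - v_acc) * 100),
        "informative": max(v_acc, t_acc) > 0,
    }


if __name__ == "__main__":
    for row in ROWS:
        s = summarize(row)
        print(
            f"{s['task']}: {s['compression']:.1f}x, "
            f"{s['delta_pp']:+d}pp, informative={s['informative']}"
        )
```

Under that assumed rule, `policy_msn_ad_creative` is the only row flagged as uninformative, which matches the reason given above for dropping it.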
Changed files: BLOG_POST.md (+0 -9)
@@ -262,9 +262,6 @@ The full row-by-row demo CSV is at **[`evals/qwen_to_llama_demo.csv`](https://hu
 | `tough_yaml_nested_depth` | 74 tok / 0.96 | **20 tok** / **1.00** | 3.7× compression, accuracy improved |
 | `json_key_ordering` | 47 tok / 0.61 | **38 tok** / **0.78** | Shorter AND +17pp accuracy |
 | `tough_fallacy_classify` | 164 tok / 0.00 | **59 tok** / **0.33** | Trained agent added the label vocabulary the verbose prompt forgot |
-| `policy_msn_ad_creative` | **737 tok** / 0.00 | **20 tok** / 0.00 | 37× compression — both fail because Llama-3.2-3B can't reason over the policy hierarchy, but the compression doesn't *cost* anything |
-
-The `policy_msn_ad_creative` row is the most interesting. Both prompts get 0 accuracy, so it looks like a draw — but the verbose prompt was charging 737 tokens of prefill on every request to deliver that 0. The trained agent does it for 20. **Pair the compressed prompt with a stronger target and you'd ship the same behavior at 37× lower input-token cost.**

 ### What the agent actually wrote

@@ -280,12 +277,6 @@ For YAML extraction with strict nesting:
 >
 > *Trained agent:* "Generate a YAML document that meets the specified minimum nesting depth and includes all entities from the given specification." (20 tokens, **1.00 accuracy**)

-For policy compliance — the long-context money case:
-
-> *Verbose:* 737 tokens of MSN ad-creative policy listing prohibited content, restricted categories, format standards, decision schemas.
->
-> *Trained agent:* "Classify the input creative as allow, disallow, or review based on the given policy guidelines." (20 tokens)
-
 ### When the agent doesn't just shrink — it improves
