Don Rishabh Claude Opus 4.7 (1M context) committed on
Commit 5f71cca · 1 Parent(s): 86d20a9

BLOG_POST: drop 37× MSN policy compression callout

Removes the policy_msn_ad_creative row + "long-context money case"
quote. Both prompts score 0 on that task, so the 37× compression is
a no-cost-no-benefit number that overstates the result.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
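
The reasoning in the commit message can be checked with a minimal sketch. The token counts and accuracies are copied from the table rows in the diff below; the function name `summarize` and the `informative` flag are hypothetical, introduced here only for illustration, and are not part of the repository.

```python
def summarize(verbose_tokens, verbose_acc, agent_tokens, agent_acc):
    """Return (compression ratio, accuracy delta in pp, informative?).

    `informative` is a hypothetical flag: a row is treated as evidence
    only if at least one of the two prompts scores above zero.
    """
    ratio = verbose_tokens / agent_tokens
    delta_pp = round((agent_acc - verbose_acc) * 100)
    informative = verbose_acc > 0 or agent_acc > 0
    return ratio, delta_pp, informative


# Rows as they appear in the BLOG_POST.md table (including the removed one).
rows = {
    "tough_yaml_nested_depth": (74, 0.96, 20, 1.00),
    "json_key_ordering": (47, 0.61, 38, 0.78),
    "tough_fallacy_classify": (164, 0.00, 59, 0.33),
    "policy_msn_ad_creative": (737, 0.00, 20, 0.00),
}

for task, values in rows.items():
    ratio, delta_pp, informative = summarize(*values)
    note = "" if informative else "  (both prompts score 0: ratio says nothing about quality)"
    print(f"{task}: {ratio:.1f}x compression, {delta_pp:+d}pp{note}")
```

Under this reading, `policy_msn_ad_creative` is the only row where the compression ratio is not backed by any non-zero accuracy, which is the reason the commit gives for dropping it.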

Files changed (1)
  1. BLOG_POST.md +0 -9
BLOG_POST.md CHANGED
@@ -262,9 +262,6 @@ The full row-by-row demo CSV is at **[`evals/qwen_to_llama_demo.csv`](https://hu
  | `tough_yaml_nested_depth` | 74 tok / 0.96 | **20 tok** / **1.00** | 3.7× compression, accuracy improved |
  | `json_key_ordering` | 47 tok / 0.61 | **38 tok** / **0.78** | Shorter AND +17pp accuracy |
  | `tough_fallacy_classify` | 164 tok / 0.00 | **59 tok** / **0.33** | Trained agent added the label vocabulary the verbose prompt forgot |
- | `policy_msn_ad_creative` | **737 tok** / 0.00 | **20 tok** / 0.00 | 37× compression — both fail because Llama-3.2-3B can't reason over the policy hierarchy, but the compression doesn't *cost* anything |
-
- The `policy_msn_ad_creative` row is the most interesting. Both prompts get 0 accuracy, so it looks like a draw — but the verbose prompt was charging 737 tokens of prefill on every request to deliver that 0. The trained agent does it for 20. **Pair the compressed prompt with a stronger target and you'd ship the same behavior at 37× lower input-token cost.**
 
  ### What the agent actually wrote
 
@@ -280,12 +277,6 @@ For YAML extraction with strict nesting:
  >
  > *Trained agent:* "Generate a YAML document that meets the specified minimum nesting depth and includes all entities from the given specification." (20 tokens, **1.00 accuracy**)
 
- For policy compliance — the long-context money case:
-
- > *Verbose:* 737 tokens of MSN ad-creative policy listing prohibited content, restricted categories, format standards, decision schemas.
- >
- > *Trained agent:* "Classify the input creative as allow, disallow, or review based on the given policy guidelines." (20 tokens)
-
  ### When the agent doesn't just shrink — it improves
 
  ![Live demo: shakespearean_response, 1× compress, +0.15 reward gain](./assets/demo_shakespearean.png)
 