Don Rishabh Claude Opus 4.7 (1M context) committed on
Commit 86d20a9 · 1 Parent(s): 8d3be14

BLOG_POST: drop policy-tasks-as-headline-workload framing

The policy tasks aren't the workload the eval numbers are reported on,
so calling them "the headline" overstates their role. Drop the paragraph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (1)
  1. BLOG_POST.md +0 -2
BLOG_POST.md CHANGED
@@ -173,8 +173,6 @@ Task quality is the single biggest determinant of whether this kind of environme
 
 Each task ships 3 visible train examples + 6 hidden test examples + a per-task token budget (60–250).
 
-The **policy tasks are the headline workload**: each has a 500–700-word real-world-style policy as the verbose prompt, and the agent has to compress it into a ≤250-token classifier prompt that still routes inputs to the right `allow / disallow / review` decision. That's where the inference-cost story is most visible — these are prompts that look like real production system prompts.
-
 <!-- IMAGE 2 — A second worked example from the demo
 Use: Image 3 (sentiment_nuanced) — best mid-tier example
 Caption emphasizes: trained drops the explanatory 'mixed' clause but keeps the labels. -->