Don Rishabh, Claude Opus 4.7 (1M context) committed on
Commit · 86d20a9
1 Parent(s): 8d3be14
BLOG_POST: drop policy-tasks-as-headline-workload framing
The policy tasks aren't the workload the eval numbers are reported on,
so calling them "the headline" overstates their role. Drop the paragraph.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- BLOG_POST.md +0 -2
BLOG_POST.md
CHANGED
@@ -173,8 +173,6 @@ Task quality is the single biggest determinant of whether this kind of environme
 
 Each task ships 3 visible train examples + 6 hidden test examples + a per-task token budget (60–250).
 
-The **policy tasks are the headline workload**: each has a 500–700-word real-world-style policy as the verbose prompt, and the agent has to compress it into a ≤250-token classifier prompt that still routes inputs to the right `allow / disallow / review` decision. That's where the inference-cost story is most visible — these are prompts that look like real production system prompts.
-
 <!-- IMAGE 2 — A second worked example from the demo
 Use: Image 3 (sentiment_nuanced) — best mid-tier example
 Caption emphasizes: trained drops the explanatory 'mixed' clause but keeps the labels. -->