---
title: HyperBrickCaseOps Agent Guide
---

# HyperBrickCaseOps Agent Guide

This environment evaluates real-world customer support triage. Agents must classify the ticket, request missing info when required, draft the customer reply, add an internal note, and submit only when the workflow is complete.

## Quick Start (Agent Strategy)

Recommended action order:

1. `classify` — set `queue`, `priority`, `issue_type`
2. `request_info` if `required_next_actions` includes it
3. `wait` if the customer follow-up is pending
4. `draft_reply`
5. `add_internal_note`
6. `submit`
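
The recommended order can be sketched as a small selection helper. This is a minimal illustration, not the environment's own logic: it assumes `required_next_actions` is a list of operation names (including `wait` while a customer follow-up is pending), and `next_operation` is a hypothetical name.

```python
from typing import Any, Dict, List

# Recommended order from the list above.
ACTION_ORDER: List[str] = [
    "classify",
    "request_info",
    "wait",
    "draft_reply",
    "add_internal_note",
    "submit",
]


def next_operation(observation: Dict[str, Any]) -> str:
    """Return the earliest outstanding operation in the recommended order.

    Assumption: `required_next_actions` holds operation names. If nothing
    is outstanding, the workflow is treated as complete and we submit.
    """
    required = observation.get("required_next_actions") or []
    for operation in ACTION_ORDER:
        if operation in required:
            return operation
    return "submit"
```

For example, an observation whose `required_next_actions` is `["draft_reply", "classify"]` yields `classify`, because classification comes first in the recommended order.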

## Environment API

The environment follows the standard OpenEnv API:

- `reset()` -> initial observation
- `step(action)` -> next observation, reward, done
- `state()` -> internal state snapshot

Server entrypoint:

- `server.app:app`
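
The `reset()` / `step(action)` / `state()` cycle can be exercised with a loop like the one below. This is a sketch under assumptions: `StubEnv` is a hypothetical stand-in so the example runs without the server, actions are plain dicts, and `step` is assumed to return an `(observation, reward, done)` tuple. The real client may wrap these values differently.

```python
from typing import Any, Dict, List, Tuple


class StubEnv:
    """Hypothetical stand-in for the real environment (illustration only)."""

    def __init__(self) -> None:
        self._steps = 0
        self._done = False

    def reset(self) -> Dict[str, Any]:
        self._steps = 0
        self._done = False
        return {"required_next_actions": ["classify"]}

    def step(self, action: Dict[str, Any]) -> Tuple[Dict[str, Any], float, bool]:
        self._steps += 1
        # The stub ends the episode as soon as the agent submits.
        self._done = action["operation"] == "submit"
        return {"required_next_actions": []}, 0.1, self._done

    def state(self) -> Dict[str, Any]:
        return {"steps": self._steps, "done": self._done}


def run_episode(env: Any, actions: List[Dict[str, Any]], max_steps: int = 10) -> List[float]:
    """Reset the environment, apply actions in order, and stop when done."""
    env.reset()
    rewards: List[float] = []
    for action in actions[:max_steps]:
        _observation, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards
```

Calling `run_episode(StubEnv(), [{"operation": "classify"}, {"operation": "submit"}])` collects one reward per step and stops once `done` is true.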

## Action Schema

Each step takes a typed `SupportDeskAction`:

- `operation`: `classify|request_info|draft_reply|add_internal_note|submit|wait`
- `queue`: string or null
- `priority`: string or null
- `issue_type`: string or null
- `status`: string or null
- `resolution_code`: string or null
- `requested_fields`: list of strings
- `reply`: string or null
- `internal_note`: string or null
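
For reference, the fields above could be modelled roughly as follows. The class name `SupportDeskActionSketch` is hypothetical and the real `SupportDeskAction` type may differ (for example, it may be a Pydantic model); only the field names and the allowed operations come from this guide.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional

# Allowed values for `operation`, as listed in the schema above.
OPERATIONS = {
    "classify",
    "request_info",
    "draft_reply",
    "add_internal_note",
    "submit",
    "wait",
}


@dataclass
class SupportDeskActionSketch:
    """Illustrative mirror of the action schema (not the real class)."""

    operation: str
    queue: Optional[str] = None
    priority: Optional[str] = None
    issue_type: Optional[str] = None
    status: Optional[str] = None
    resolution_code: Optional[str] = None
    requested_fields: List[str] = field(default_factory=list)
    reply: Optional[str] = None
    internal_note: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject operations that are not part of the documented schema.
        if self.operation not in OPERATIONS:
            raise ValueError(f"unknown operation: {self.operation!r}")


classify = SupportDeskActionSketch(
    operation="classify",
    queue="billing_ops",
    priority="high",
    issue_type="duplicate_charge",
)
payload = asdict(classify)  # plain dict, e.g. for JSON serialization
```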

## Observation Highlights

The observation includes:

- `task_id`, `difficulty`, `objective`
- `ticket` (customer, tier, region, business impact)
- `knowledge_base` (policy snippets)
- `case` (current triage state)
- `workflow_stage`, `required_next_actions`, `risk_flags`

## Tasks and Difficulty

There are 4 tasks across three difficulty levels:

- `billing_refund_easy` (easy)
- `account_takeover_medium` (medium)
- `api_incident_hard` (hard)
- `regulated_export_exception_hard` (hard)

## Grading and Reward

- Deterministic graders score task completion
- Final scores are clamped to `(0.01, 0.99)`, so a final score is never exactly 0 or 1
- Reward provides dense progress signals across the episode
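
The clamping step amounts to the following. This is a sketch only: `clamp_score` is a hypothetical name, and treating the bounds 0.01 and 0.99 as inclusive is an assumption, since the grader's actual implementation is not shown in this guide.

```python
def clamp_score(raw: float, low: float = 0.01, high: float = 0.99) -> float:
    """Clamp a raw grader score into [low, high] (inclusive bounds assumed)."""
    return max(low, min(high, raw))
```

Under this assumption a raw score of `1.0` becomes `0.99`, a raw score of `0.0` becomes `0.01`, and values already inside the range are unchanged.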

## Routing Guide (High-Level)

- Duplicate charge -> `billing_ops`, `high`, `duplicate_charge`
- Suspicious login -> `trust_and_safety`, `urgent`, `account_compromise`
- Production 500s -> `platform_engineering`, `urgent`, `production_incident`
- Export policy bypass -> `compliance_ops`, `high`, `regulated_exception`
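
The routing table above can be encoded as a lookup that produces the fields for a `classify` action. The dictionary keys and the `classify_fields` helper are hypothetical. Detecting which scenario a ticket belongs to is left to the agent.

```python
from typing import Dict, Tuple

# scenario -> (queue, priority, issue_type), copied from the routing guide.
ROUTING: Dict[str, Tuple[str, str, str]] = {
    "duplicate_charge": ("billing_ops", "high", "duplicate_charge"),
    "suspicious_login": ("trust_and_safety", "urgent", "account_compromise"),
    "production_500s": ("platform_engineering", "urgent", "production_incident"),
    "export_policy_bypass": ("compliance_ops", "high", "regulated_exception"),
}


def classify_fields(scenario: str) -> Dict[str, str]:
    """Build the payload of a `classify` action for a known scenario."""
    queue, priority, issue_type = ROUTING[scenario]
    return {
        "operation": "classify",
        "queue": queue,
        "priority": priority,
        "issue_type": issue_type,
    }
```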

## Required Environment Variables

Baseline inference uses:

- `API_BASE_URL`
- `MODEL_NAME`
- `HF_TOKEN`
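
A baseline script might read these variables up front and fail early if any is missing. The `load_config` helper below is a hypothetical sketch, and how the values are consumed afterwards depends on the inference client.

```python
import os
from typing import Dict, Mapping

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_config(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, raising if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Passing a mapping explicitly, as in `load_config({"API_BASE_URL": ..., "MODEL_NAME": ..., "HF_TOKEN": ...})`, makes the helper easy to test without touching the process environment.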

## Mandatory Stdout Format

The inference script must emit exactly:

```
[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
```

Rules:

- One `[START]` at episode begin
- One `[STEP]` per env step
- One `[END]` after episode close
- `reward` and `rewards` formatted to 2 decimals
- `done`/`success` are lowercase booleans
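
The three line formats can be produced with small helpers such as these. The function names are hypothetical. Because the rules above do not specify how `score` is formatted, the sketch accepts it as a preformatted string rather than guessing.

```python
from typing import List, Optional


def fmt_bool(value: bool) -> str:
    """Render a boolean in lowercase, as the format requires."""
    return "true" if value else "false"


def start_line(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def step_line(step: int, action: str, reward: float, done: bool,
              error: Optional[str] = None) -> str:
    # A missing error is printed as the literal `null`.
    error_text = error if error is not None else "null"
    return (f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={fmt_bool(done)} error={error_text}")


def end_line(success: bool, steps: int, score: str, rewards: List[float]) -> str:
    # Each reward is formatted to 2 decimals and joined with commas.
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return f"[END] success={fmt_bool(success)} steps={steps} score={score} rewards={joined}"
```

For instance, `step_line(1, "classify", 0.1, False)` produces `[STEP] step=1 action=classify reward=0.10 done=false error=null`.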