Commit c6e8802 by modelbuilderhq · verified · 1 Parent(s): 85333b3

Upload folder using huggingface_hub

  1. README.md +86 -6
README.md CHANGED
@@ -15,6 +15,49 @@ HyperBrickCaseOps is an OpenEnv environment for enterprise support operations. T
15
 
16
  The main idea is simple: good support work is not just writing a polite reply. It also means making the right operational decision.
17
 
18
  ## Environment description and motivation
19
 
20
  This environment was built around a gap that shows up in a lot of support benchmarks. Many benchmarks check whether a model can produce a plausible response, but real support work also needs correct routing, escalation, information gathering, and final case handling.
@@ -169,27 +212,58 @@ Difficulty: easy
169
 
170
  A customer was charged twice after cancellation. The right workflow is to route the case to billing, confirm the refund path, leave a useful note, and resolve the case without asking for unnecessary extra information.
171
 
172
  ### 2. `account_takeover_medium`
173
 
174
  Difficulty: medium
175
 
176
  This is a suspicious-login recovery case. The agent has to route it to trust and safety, request verification details, handle a delayed partial follow-up from the customer, and keep the case open until the missing information is provided. Unlocking the account immediately would be unsafe.
177
 
178
  ### 3. `api_incident_hard`
179
 
180
  Difficulty: hard
181
 
182
  This task simulates a live enterprise API incident. The ticket includes a secondary compliance concern, but the primary issue is the outage. The agent needs to escalate to engineering, request the right diagnostics, communicate clearly, and keep the incident open rather than marking it resolved.
183
 
184
  ### 4. `regulated_export_exception_hard`
185
 
186
  Difficulty: hard
187
 
188
  This is a regulated exception request. The customer wants a shortcut around an export restriction, but the correct workflow is to route the case to compliance, request legal approval details, and keep the case open pending review. Sending it straight to engineering for a workaround is the wrong move.
189
 
190
  ## Reward and grader design
191
 
192
- Each task has a deterministic grader that returns a score in `[0.0, 1.0]`.
193
 
194
  The grader checks:
195
 
@@ -204,6 +278,12 @@ The grader checks:
204
 
205
  The environment uses the grader score delta as the main dense reward signal. On top of that, it adds smaller process-aware bonuses and penalties so that the full trajectory matters, not just the final snapshot.
206
 
207
  Examples:
208
 
209
  - bonus for early correct routing on urgent tasks
@@ -373,11 +453,11 @@ Pre-submission script:
373
 
374
  Current deterministic fallback baseline:
375
 
376
- - `billing_refund_easy`: `1.00`
377
- - `account_takeover_medium`: `1.00`
378
- - `api_incident_hard`: `1.00`
379
- - `regulated_export_exception_hard`: `1.00`
380
- - average: `1.00`
381
 
382
  These scores are intentionally reproducible. The fallback policy exists to show that the environment, reward shaping, and graders all work end to end. Model-backed runs can be lower, which is useful for evaluation.
383
 
 
15
 
16
  The main idea is simple: good support work is not just writing a polite reply. It also means making the right operational decision.
17
 
18
+ ## Agent quickstart
19
+
20
+ If you are a generic agent being evaluated on this environment, the safest default strategy is:
21
+
22
+ 1. Read `objective`, `ticket`, `knowledge_base`, `workflow_stage`, and `required_next_actions`.
23
+ 2. Classify the case first by setting `queue`, `priority`, and `issue_type`.
24
+ 3. If the task requires missing details, use `request_info` before drafting a final answer.
25
+ 4. If customer follow-up is pending, use `wait` before assuming the missing fields arrived.
26
+ 5. Draft the customer-facing reply only after the routing and verification logic are correct.
27
+ 6. Add the internal note before final submission.
28
+ 7. Use `submit` only when the workflow really is complete.
29
+
30
+ High-level rules:
31
+
32
+ - primary issue first, secondary concerns second
33
+ - safe workflow over fast workflow
34
+ - do not resolve or unlock cases early just because the customer sounds urgent
35
+
36
+ ## Agent playbook
37
+
38
+ The environment is easiest to solve if the agent follows this action order:
39
+
40
+ - `classify`
41
+ - `request_info` if `required_next_actions` includes it
42
+ - `wait` if customer follow-up is pending
43
+ - `draft_reply`
44
+ - `add_internal_note`
45
+ - `submit`
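The action order above can be sketched as a small selection function. This is a minimal illustration, not the environment's own code: the action names come from this README, but `next_action` is a hypothetical helper, and it assumes `required_next_actions` is a plain list of action names.

```python
# Action names are taken from the README; the selection logic is an assumption.
PLAYBOOK_ORDER = [
    "classify",
    "request_info",
    "wait",
    "draft_reply",
    "add_internal_note",
    "submit",
]


def next_action(required_next_actions):
    """Pick the earliest playbook action that is still required.

    Falls back to "submit" only when no earlier step remains.
    """
    pending = set(required_next_actions)
    for name in PLAYBOOK_ORDER[:-1]:  # every step except "submit"
        if name in pending:
            return name
    return "submit"


print(next_action(["draft_reply", "request_info"]))  # request_info
print(next_action([]))  # submit
```

Choosing by playbook position rather than by the order of the incoming list keeps the agent from drafting a reply before routing and information gathering are done.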
46
+
47
+ Common failure modes:
48
+
49
+ - asking for unnecessary information on the easy billing task
50
+ - resolving a security or compliance case before required verification is complete
51
+ - routing the task based on a distracting secondary issue instead of the primary issue
52
+ - using `submit` while `required_next_actions` is still non-empty
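Two of these failure modes are mechanical enough to check before calling `submit`. The sketch below is a hypothetical pre-submit check, not part of the environment: `playbook_warnings` and its arguments are illustrative names, and it assumes `required_next_actions` lists only the steps that must happen before `submit`.

```python
def playbook_warnings(task_id, actions_taken, required_next_actions):
    """Return human-readable warnings for two of the failure modes above."""
    warnings = []
    # Submitting while required steps remain is listed as a failure mode.
    if required_next_actions:
        warnings.append(
            "required_next_actions is not empty: " + ", ".join(required_next_actions)
        )
    # The easy billing task should not ask for extra information.
    if task_id == "billing_refund_easy" and "request_info" in actions_taken:
        warnings.append("request_info is unnecessary on the billing task")
    return warnings


print(playbook_warnings("api_incident_hard", ["classify"], ["request_info"]))
```

The other two failure modes (resolving too early, routing on a secondary issue) depend on ticket content, so they are left to the agent's own reasoning.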
53
+
54
+ Quick routing guide:
55
+
56
+ - duplicate charge after cancellation -> `billing_ops`, `high`, `duplicate_charge`
57
+ - suspicious login / locked out -> `trust_and_safety`, `urgent`, `account_compromise`
58
+ - production 500s / outage -> `platform_engineering`, `urgent`, `production_incident`
59
+ - export restriction / policy bypass request -> `compliance_ops`, `high`, `regulated_exception`
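The routing guide can be expressed as an ordered rule table. This is a toy sketch under stated assumptions: the queue, priority, and issue-type values are the ones listed above, while the keyword lists, the `Routing` tuple, and the `route` function are hypothetical and far cruder than what a real agent should use. The outage rule is placed first so that a secondary compliance remark does not override the primary issue.

```python
from typing import NamedTuple, Optional


class Routing(NamedTuple):
    queue: str
    priority: str
    issue_type: str


# Rule order encodes "primary issue first": an outage wins over other matches.
# Keywords are illustrative assumptions, not taken from the environment.
RULES = [
    (("outage", "500s"), Routing("platform_engineering", "urgent", "production_incident")),
    (("suspicious login", "locked out"), Routing("trust_and_safety", "urgent", "account_compromise")),
    (("export restriction", "bypass"), Routing("compliance_ops", "high", "regulated_exception")),
    (("charged twice", "duplicate charge"), Routing("billing_ops", "high", "duplicate_charge")),
]


def route(ticket_text: str) -> Optional[Routing]:
    """Return the first matching routing, or None if no rule applies."""
    text = ticket_text.lower()
    for keywords, routing in RULES:
        if any(keyword in text for keyword in keywords):
            return routing
    return None


ticket = "Production 500s since 09:00. Separately, a question about the export restriction."
print(route(ticket).queue)  # platform_engineering
```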
60
+
61
  ## Environment description and motivation
62
 
63
  This environment was built around a gap that shows up in a lot of support benchmarks. Many benchmarks check whether a model can produce a plausible response, but real support work also needs correct routing, escalation, information gathering, and final case handling.
 
212
 
213
  A customer was charged twice after cancellation. The right workflow is to route the case to billing, confirm the refund path, leave a useful note, and resolve the case without asking for unnecessary extra information.
214
 
215
+ Best action pattern:
216
+
217
+ - classify to billing first
218
+ - do not request extra fields
219
+ - confirm refund timing in the reply
220
+ - add a note that the duplicate charge was verified
221
+ - resolve the case with the refund resolution code
222
+
223
  ### 2. `account_takeover_medium`
224
 
225
  Difficulty: medium
226
 
227
  This is a suspicious-login recovery case. The agent has to route it to trust and safety, request verification details, handle a delayed partial follow-up from the customer, and keep the case open until the missing information is provided. Unlocking the account immediately would be unsafe.
228
 
229
+ Best action pattern:
230
+
231
+ - classify to trust and safety with urgent priority
232
+ - request `workspace_id`, `last_successful_login`, and `billing_email`
233
+ - wait for the partial follow-up
234
+ - reply with safe security steps
235
+ - keep the case open with `waiting_on_customer`
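The partial follow-up is the part that is easy to get wrong, so here is a minimal sketch of the check. The three field names come from the pattern above; the helper names and the idea that the follow-up arrives as a dictionary are assumptions made for illustration.

```python
# Field names are from the README; everything else here is hypothetical.
REQUIRED_FIELDS = ("workspace_id", "last_successful_login", "billing_email")


def missing_fields(received):
    """List the required verification fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not received.get(field)]


def should_keep_open(received):
    """The case stays open (waiting_on_customer) while any field is missing."""
    return bool(missing_fields(received))


partial_follow_up = {"workspace_id": "ws-123"}  # made-up example value
print(missing_fields(partial_follow_up))  # ['last_successful_login', 'billing_email']
print(should_keep_open(partial_follow_up))  # True
```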
236
+
237
  ### 3. `api_incident_hard`
238
 
239
  Difficulty: hard
240
 
241
  This task simulates a live enterprise API incident. The ticket includes a secondary compliance concern, but the primary issue is the outage. The agent needs to escalate to engineering, request the right diagnostics, communicate clearly, and keep the incident open rather than marking it resolved.
242
 
243
+ Best action pattern:
244
+
245
+ - classify to platform engineering with urgent priority
246
+ - request `request_ids`, `timestamp_utc`, and `region`
247
+ - make clear that engineering is engaged
248
+ - do not resolve the case
249
+ - submit as an open incident / escalated case
250
+
251
  ### 4. `regulated_export_exception_hard`
252
 
253
  Difficulty: hard
254
 
255
  This is a regulated exception request. The customer wants a shortcut around an export restriction, but the correct workflow is to route the case to compliance, request legal approval details, and keep the case open pending review. Sending it straight to engineering for a workaround is the wrong move.
256
 
257
+ Best action pattern:
258
+
259
+ - classify to compliance operations
260
+ - request `tenant_region`, `dpa_amendment_id`, and `legal_contact_email`
261
+ - explicitly say no temporary bypass can be granted yet
262
+ - keep the case open pending legal/compliance review
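The four action patterns differ mainly in which fields to request and whether the case is resolved at submission. The table below collects those facts from the task descriptions; the dictionary layout and helper names are hypothetical and do not reflect how the environment stores them.

```python
# Field names and task ids are from the README; the structure is an assumption.
FIELDS_TO_REQUEST = {
    "billing_refund_easy": [],
    "account_takeover_medium": ["workspace_id", "last_successful_login", "billing_email"],
    "api_incident_hard": ["request_ids", "timestamp_utc", "region"],
    "regulated_export_exception_hard": ["tenant_region", "dpa_amendment_id", "legal_contact_email"],
}

# Only the billing task is resolved on submit; the other three stay open.
RESOLVED_ON_SUBMIT = {"billing_refund_easy"}


def needs_request_info(task_id):
    """True when the task's pattern includes a request_info step."""
    return bool(FIELDS_TO_REQUEST[task_id])


def resolves_on_submit(task_id):
    """True when the correct final state is a resolved case."""
    return task_id in RESOLVED_ON_SUBMIT


for task_id in FIELDS_TO_REQUEST:
    print(task_id, needs_request_info(task_id), resolves_on_submit(task_id))
```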
263
+
264
  ## Reward and grader design
265
 
266
+ Each task has a deterministic grader that returns a score in `[0.01, 0.99]`, kept strictly inside `(0, 1)` for submission compatibility.
267
 
268
  The grader checks:
269
 
 
278
 
279
  The environment uses the grader score delta as the main dense reward signal. On top of that, it adds smaller process-aware bonuses and penalties so that the full trajectory matters, not just the final snapshot.
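As a rough illustration of delta-based shaping, the sketch below computes a step reward as the change in grader score plus a small shaping term. The function name, the example scores, and the shaping magnitude are all hypothetical; the environment's actual bonus and penalty values are not stated here.

```python
def step_reward(prev_score, new_score, shaping=0.0):
    """Grader score delta plus an optional process-aware bonus or penalty."""
    return (new_score - prev_score) + shaping


# Hypothetical grader scores observed after each step of one episode.
scores = [0.0, 0.3, 0.3, 0.7]
deltas = [step_reward(a, b) for a, b in zip(scores, scores[1:])]

# Without shaping, the deltas telescope to (final score - initial score).
print(abs(sum(deltas) - (scores[-1] - scores[0])) < 1e-9)  # True

# A step that does not move the grader score but draws a penalty goes negative.
print(step_reward(0.3, 0.3, shaping=-0.05))  # -0.05
```

Because the deltas sum to the final score, the dense signal stays aligned with the grader, and the shaping terms are what make the path through the episode matter.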
280
 
281
+ Important:
282
+
283
+ - step rewards may go slightly negative when the agent makes a clearly suboptimal or unsafe move
284
+ - final deterministic grader outputs are clamped to `[0.01, 0.99]`, so they always fall strictly inside `(0, 1)`
285
+ - `inference.py` also clamps the final emitted submission score to `[0.01, 0.99]`
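The clamping described above amounts to a one-line bound check. This is a minimal sketch assuming the bounds are `0.01` and `0.99`; `clamp_score` is an illustrative name, not necessarily the function used in the grader or in `inference.py`.

```python
def clamp_score(raw, low=0.01, high=0.99):
    """Keep a raw grader score within the submission-safe bounds."""
    return max(low, min(high, raw))


print(clamp_score(1.0))  # 0.99
print(clamp_score(0.0))  # 0.01
print(clamp_score(0.5))  # 0.5
```

This is consistent with the fallback baseline moving from `1.00` to `0.99` per task in this change.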
286
+
287
  Examples:
288
 
289
  - bonus for early correct routing on urgent tasks
 
453
 
454
  Current deterministic fallback baseline:
455
 
456
+ - `billing_refund_easy`: `0.99`
457
+ - `account_takeover_medium`: `0.99`
458
+ - `api_incident_hard`: `0.99`
459
+ - `regulated_export_exception_hard`: `0.99`
460
+ - average: `0.99`
461
 
462
  These scores are intentionally reproducible. The fallback policy exists to show that the environment, reward shaping, and graders all work end to end. Model-backed runs can be lower, which is useful for evaluation.
463