Upload folder using huggingface_hub
README.md
CHANGED
|
@@ -15,6 +15,49 @@ HyperBrickCaseOps is an OpenEnv environment for enterprise support operations. T
|
|
| 15 |
|
| 16 |
The main idea is simple: good support work is not just writing a polite reply. It also means making the right operational decision.
|
| 17 |
|
| 18 |
## Environment description and motivation
|
| 19 |
|
| 20 |
This environment was built around a gap that shows up in a lot of support benchmarks. Many benchmarks check whether a model can produce a plausible response, but real support work also needs correct routing, escalation, information gathering, and final case handling.
|
|
@@ -169,27 +212,58 @@ Difficulty: easy
|
|
| 169 |
|
| 170 |
A customer was charged twice after cancellation. The right workflow is to route the case to billing, confirm the refund path, leave a useful note, and resolve the case without asking for unnecessary extra information.
|
| 171 |
|
| 172 |
### 2. `account_takeover_medium`
|
| 173 |
|
| 174 |
Difficulty: medium
|
| 175 |
|
| 176 |
This is a suspicious-login recovery case. The agent has to route it to trust and safety, request verification details, handle a delayed partial follow-up from the customer, and keep the case open until the missing information is provided. Unlocking the account immediately would be unsafe.
|
| 177 |
|
| 178 |
### 3. `api_incident_hard`
|
| 179 |
|
| 180 |
Difficulty: hard
|
| 181 |
|
| 182 |
This task simulates a live enterprise API incident. The ticket includes a secondary compliance concern, but the primary issue is the outage. The agent needs to escalate to engineering, request the right diagnostics, communicate clearly, and keep the incident open rather than marking it resolved.
|
| 183 |
|
| 184 |
### 4. `regulated_export_exception_hard`
|
| 185 |
|
| 186 |
Difficulty: hard
|
| 187 |
|
| 188 |
This is a regulated exception request. The customer wants a shortcut around an export restriction, but the correct workflow is to route the case to compliance, request legal approval details, and keep the case open pending review. Sending it straight to engineering for a workaround is the wrong move.
|
| 189 |
|
| 190 |
## Reward and grader design
|
| 191 |
|
| 192 |
-
Each task has a deterministic grader that returns a score in `
|
| 193 |
|
| 194 |
The grader checks:
|
| 195 |
|
|
@@ -204,6 +278,12 @@ The grader checks:
|
|
| 204 |
|
| 205 |
The environment uses the grader score delta as the main dense reward signal. On top of that, it adds smaller process-aware bonuses and penalties so that the full trajectory matters, not just the final snapshot.
|
| 206 |
| 207 |
Examples:
|
| 208 |
|
| 209 |
- bonus for early correct routing on urgent tasks
|
|
@@ -373,11 +453,11 @@ Pre-submission script:
|
|
| 373 |
|
| 374 |
Current deterministic fallback baseline:
|
| 375 |
|
| 376 |
-
- `billing_refund_easy`: `
|
| 377 |
-
- `account_takeover_medium`: `
|
| 378 |
-
- `api_incident_hard`: `
|
| 379 |
-
- `regulated_export_exception_hard`: `
|
| 380 |
-
- average: `
|
| 381 |
|
| 382 |
These scores are intentionally reproducible. The fallback policy exists to show that the environment, reward shaping, and graders all work end to end. Model-backed runs can be lower, which is useful for evaluation.
|
| 383 |
|
|
|
|
| 15 |
|
| 16 |
The main idea is simple: good support work is not just writing a polite reply. It also means making the right operational decision.
|
| 17 |
|
| 18 |
+
## Agent quickstart
|
| 19 |
+
|
| 20 |
+
If you are a generic agent being evaluated on this environment, the safest default strategy is:
|
| 21 |
+
|
| 22 |
+
1. Read `objective`, `ticket`, `knowledge_base`, `workflow_stage`, and `required_next_actions`.
|
| 23 |
+
2. Classify the case first by setting `queue`, `priority`, and `issue_type`.
|
| 24 |
+
3. If the task requires missing details, use `request_info` before drafting a final answer.
|
| 25 |
+
4. If customer follow-up is pending, use `wait` before assuming the missing fields arrived.
|
| 26 |
+
5. Draft the customer-facing reply only after the routing and verification logic are correct.
|
| 27 |
+
6. Add the internal note before final submission.
|
| 28 |
+
7. Use `submit` only when the workflow really is complete.
|
| 29 |
+
|
| 30 |
+
High-level rule:
|
| 31 |
+
|
| 32 |
+
- primary issue first, secondary concerns second
|
| 33 |
+
- safe workflow over fast workflow
|
| 34 |
+
- do not resolve or unlock cases early just because the customer sounds urgent
|
| 35 |
+
|
| 36 |
+
## Agent playbook
|
| 37 |
+
|
| 38 |
+
The environment is easiest to solve if the agent follows this action order:
|
| 39 |
+
|
| 40 |
+
- `classify`
|
| 41 |
+
- `request_info` if `required_next_actions` includes it
|
| 42 |
+
- `wait` if customer follow-up is pending
|
| 43 |
+
- `draft_reply`
|
| 44 |
+
- `add_internal_note`
|
| 45 |
+
- `submit`
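
The action order above can be sketched as a small policy function. This is only an illustration, not the environment's own logic: `required_next_actions` comes from the README, while `completed_actions` and `followup_pending` are hypothetical observation fields introduced here to keep the example self-contained.

```python
from typing import Any, Dict


def next_action(obs: Dict[str, Any]) -> str:
    """Pick the next action name following the playbook order.

    `completed_actions` and `followup_pending` are hypothetical fields;
    adapt them to whatever the real observation exposes.
    """
    required = list(obs.get("required_next_actions", []))
    done = set(obs.get("completed_actions", []))

    # 1. Always classify first.
    if "classify" not in done:
        return "classify"
    # 2. Request missing details only when the environment asks for it.
    if "request_info" in required:
        return "request_info"
    # 3. Wait while a customer follow-up is still pending.
    if obs.get("followup_pending", False):
        return "wait"
    # 4. Reply, then leave the internal note.
    for action in ("draft_reply", "add_internal_note"):
        if action not in done:
            return action
    # 5. Submit only when nothing other than `submit` is still required.
    blockers = [a for a in required if a != "submit"]
    return blockers[0] if blockers else "submit"
```

The final guard mirrors the failure mode listed below: the sketch never returns `submit` while another required action is outstanding.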
|
| 46 |
+
|
| 47 |
+
Common failure modes:
|
| 48 |
+
|
| 49 |
+
- asking for unnecessary information on the easy billing task
|
| 50 |
+
- resolving a security or compliance case before required verification is complete
|
| 51 |
+
- routing the task based on a distracting secondary issue instead of the primary issue
|
| 52 |
+
- using `submit` while `required_next_actions` is still non-empty
|
| 53 |
+
|
| 54 |
+
Quick routing guide:
|
| 55 |
+
|
| 56 |
+
- duplicate charge after cancellation -> `billing_ops`, `high`, `duplicate_charge`
|
| 57 |
+
- suspicious login / locked out -> `trust_and_safety`, `urgent`, `account_compromise`
|
| 58 |
+
- production 500s / outage -> `platform_engineering`, `urgent`, `production_incident`
|
| 59 |
+
- export restriction / policy bypass request -> `compliance_ops`, `high`, `regulated_exception`
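
As a rough sketch, the routing guide can be kept as a lookup table. The queue, priority, and issue type values are copied from the list above; keying the table by task id is an inference from the task descriptions, and the `action_type` payload shape is a hypothetical schema, not the environment's confirmed action format.

```python
from typing import Dict, NamedTuple


class Routing(NamedTuple):
    queue: str
    priority: str
    issue_type: str


# Values from the quick routing guide; task-id keys are inferred.
ROUTING_GUIDE: Dict[str, Routing] = {
    "billing_refund_easy": Routing("billing_ops", "high", "duplicate_charge"),
    "account_takeover_medium": Routing("trust_and_safety", "urgent", "account_compromise"),
    "api_incident_hard": Routing("platform_engineering", "urgent", "production_incident"),
    "regulated_export_exception_hard": Routing("compliance_ops", "high", "regulated_exception"),
}


def classify_action(task_id: str) -> Dict[str, str]:
    """Build a classify payload (hypothetical schema) for a known task id."""
    routing = ROUTING_GUIDE[task_id]
    return {"action_type": "classify", **routing._asdict()}
```

A real agent should still derive the routing from the ticket text, since the primary issue may be hidden behind a distracting secondary concern.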
|
| 60 |
+
|
| 61 |
## Environment description and motivation
|
| 62 |
|
| 63 |
This environment was built around a gap that shows up in a lot of support benchmarks. Many benchmarks check whether a model can produce a plausible response, but real support work also needs correct routing, escalation, information gathering, and final case handling.
|
|
|
|
| 212 |
|
| 213 |
A customer was charged twice after cancellation. The right workflow is to route the case to billing, confirm the refund path, leave a useful note, and resolve the case without asking for unnecessary extra information.
|
| 214 |
|
| 215 |
+
Best action pattern:
|
| 216 |
+
|
| 217 |
+
- classify to billing first
|
| 218 |
+
- do not request extra fields
|
| 219 |
+
- confirm refund timing in the reply
|
| 220 |
+
- add a note that the duplicate charge was verified
|
| 221 |
+
- resolve the case with the refund resolution code
|
| 222 |
+
|
| 223 |
### 2. `account_takeover_medium`
|
| 224 |
|
| 225 |
Difficulty: medium
|
| 226 |
|
| 227 |
This is a suspicious-login recovery case. The agent has to route it to trust and safety, request verification details, handle a delayed partial follow-up from the customer, and keep the case open until the missing information is provided. Unlocking the account immediately would be unsafe.
|
| 228 |
|
| 229 |
+
Best action pattern:
|
| 230 |
+
|
| 231 |
+
- classify to trust and safety with urgent priority
|
| 232 |
+
- request `workspace_id`, `last_successful_login`, and `billing_email`
|
| 233 |
+
- wait for the partial follow-up
|
| 234 |
+
- reply with safe security steps
|
| 235 |
+
- keep the case open with `waiting_on_customer`
|
| 236 |
+
|
| 237 |
### 3. `api_incident_hard`
|
| 238 |
|
| 239 |
Difficulty: hard
|
| 240 |
|
| 241 |
This task simulates a live enterprise API incident. The ticket includes a secondary compliance concern, but the primary issue is the outage. The agent needs to escalate to engineering, request the right diagnostics, communicate clearly, and keep the incident open rather than marking it resolved.
|
| 242 |
|
| 243 |
+
Best action pattern:
|
| 244 |
+
|
| 245 |
+
- classify to platform engineering with urgent priority
|
| 246 |
+
- request `request_ids`, `timestamp_utc`, and `region`
|
| 247 |
+
- make clear that engineering is engaged
|
| 248 |
+
- do not resolve the case
|
| 249 |
+
- submit as an open incident / escalated case
|
| 250 |
+
|
| 251 |
### 4. `regulated_export_exception_hard`
|
| 252 |
|
| 253 |
Difficulty: hard
|
| 254 |
|
| 255 |
This is a regulated exception request. The customer wants a shortcut around an export restriction, but the correct workflow is to route the case to compliance, request legal approval details, and keep the case open pending review. Sending it straight to engineering for a workaround is the wrong move.
|
| 256 |
|
| 257 |
+
Best action pattern:
|
| 258 |
+
|
| 259 |
+
- classify to compliance operations
|
| 260 |
+
- request `tenant_region`, `dpa_amendment_id`, and `legal_contact_email`
|
| 261 |
+
- explicitly say no temporary bypass can be granted yet
|
| 262 |
+
- keep the case open pending legal/compliance review
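
The fields each task asks the agent to request can be tracked with a small helper. The field names are taken from the best action patterns above; the helper itself and the idea of passing in a dict of received values are assumptions made for illustration.

```python
from typing import Dict, List

# Fields to request per task, as listed in the best action patterns.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "billing_refund_easy": [],  # do not request extra fields
    "account_takeover_medium": ["workspace_id", "last_successful_login", "billing_email"],
    "api_incident_hard": ["request_ids", "timestamp_utc", "region"],
    "regulated_export_exception_hard": ["tenant_region", "dpa_amendment_id", "legal_contact_email"],
}


def missing_fields(task_id: str, received: Dict[str, str]) -> List[str]:
    """Return the required fields the customer has not supplied yet."""
    return [name for name in REQUIRED_FIELDS[task_id] if name not in received]
```

For example, after a partial follow-up on the account takeover task that only contains `workspace_id`, the helper returns the two remaining fields, which signals that the case should stay open.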
|
| 263 |
+
|
| 264 |
## Reward and grader design
|
| 265 |
|
| 266 |
+
Each task has a deterministic grader that returns a score in `(0.01, 0.99)` for submission compatibility.
|
| 267 |
|
| 268 |
The grader checks:
|
| 269 |
|
|
|
|
| 278 |
|
| 279 |
The environment uses the grader score delta as the main dense reward signal. On top of that, it adds smaller process-aware bonuses and penalties so that the full trajectory matters, not just the final snapshot.
|
| 280 |
|
| 281 |
+
Important:
|
| 282 |
+
|
| 283 |
+
- step rewards may go slightly negative when the agent makes a clearly suboptimal or unsafe move
|
| 284 |
+
- final deterministic grader outputs are clamped strictly inside `(0.01, 0.99)`
|
| 285 |
+
- `inference.py` also clamps the final emitted submission score to `(0.01, 0.99)`
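
A minimal sketch of how these two signals could fit together, assuming the step reward is the grader score delta plus a shaping term. The function names and the `shaping` parameter are hypothetical and not taken from the repository. The bounds are treated as inclusive here, which matches the `0.99` baseline scores reported later.

```python
def clamp_score(score: float, low: float = 0.01, high: float = 0.99) -> float:
    """Clamp a final grader score into the [low, high] range."""
    return max(low, min(high, score))


def step_reward(prev_score: float, new_score: float, shaping: float = 0.0) -> float:
    """Dense reward: grader score delta plus a small bonus or penalty.

    The result is not clamped, so it can be slightly negative after an
    unsafe or clearly suboptimal move.
    """
    return (new_score - prev_score) + shaping
```

Only the final score is clamped in this sketch; per-step rewards are left free to go below zero, as the notes above describe.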
|
| 286 |
+
|
| 287 |
Examples:
|
| 288 |
|
| 289 |
- bonus for early correct routing on urgent tasks
|
|
|
|
| 453 |
|
| 454 |
Current deterministic fallback baseline:
|
| 455 |
|
| 456 |
+
- `billing_refund_easy`: `0.99`
|
| 457 |
+
- `account_takeover_medium`: `0.99`
|
| 458 |
+
- `api_incident_hard`: `0.99`
|
| 459 |
+
- `regulated_export_exception_hard`: `0.99`
|
| 460 |
+
- average: `0.99`
|
| 461 |
|
| 462 |
These scores are intentionally reproducible. The fallback policy exists to show that the environment, reward shaping, and graders all work end to end. Model-backed runs can be lower, which is useful for evaluation.
|
| 463 |
|