---
title: HyperBrickCaseOps Agent Guide
---

# HyperBrickCaseOps Agent Guide

This environment evaluates real-world customer support triage. Agents must classify the ticket, request missing information when required, draft the customer reply, add an internal note, and submit only when the workflow is complete.

## Quick Start (Agent Strategy)

Recommended action order:

1. `classify`: set `queue`, `priority`, and `issue_type`
2. `request_info` if `required_next_actions` includes it
3. `wait` if the customer follow-up is pending
4. `draft_reply`
5. `add_internal_note`
6. `submit`
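
The ordering above can be sketched as a small decision function. This is only an illustration: `next_operation`, the `completed` set, and the `follow_up_pending` flag are hypothetical names, and how the environment actually signals a pending follow-up is not specified here.

```python
from typing import Iterable, Set


def next_operation(
    completed: Set[str],
    required_next_actions: Iterable[str],
    follow_up_pending: bool = False,
) -> str:
    """Pick the next operation following the recommended order.

    `completed` and `follow_up_pending` are hypothetical bookkeeping kept
    by the agent; only `required_next_actions` comes from the observation.
    """
    if "classify" not in completed:
        return "classify"
    # Only request info when the environment asks for it.
    if "request_info" in required_next_actions and "request_info" not in completed:
        return "request_info"
    # Hold position while the customer has not answered yet.
    if follow_up_pending:
        return "wait"
    for op in ("draft_reply", "add_internal_note"):
        if op not in completed:
            return op
    return "submit"
```

For example, `next_operation({"classify"}, ["request_info"])` returns `"request_info"`, while the same call with an empty `required_next_actions` moves straight to `"draft_reply"`.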

## Environment API

The environment follows the standard OpenEnv API:

- `reset()` -> initial observation
- `step(action)` -> next observation, reward, done
- `state()` -> internal state snapshot

Server entrypoint:

- `server.app:app`
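
A minimal sketch of an episode loop over these three calls follows. `StubEnv` is a hypothetical stand-in so the example runs on its own; it assumes `step` returns a plain `(observation, reward, done)` tuple, which may differ from the shape the real client returns.

```python
from typing import Callable, Dict, List, Tuple


class StubEnv:
    """Hypothetical stand-in for the real environment (assumption)."""

    def __init__(self, episode_length: int = 3) -> None:
        self.episode_length = episode_length
        self.steps = 0

    def reset(self) -> Dict[str, str]:
        self.steps = 0
        return {"workflow_stage": "triage"}

    def step(self, action: Dict[str, str]) -> Tuple[Dict[str, str], float, bool]:
        self.steps += 1
        done = self.steps >= self.episode_length
        return {"workflow_stage": "triage"}, 0.1, done

    def state(self) -> Dict[str, int]:
        return {"steps": self.steps}


def run_episode(env, policy: Callable[[dict], dict], max_steps: int = 20) -> List[float]:
    """Call reset once, then step until done or the step budget runs out."""
    obs = env.reset()
    rewards: List[float] = []
    for _ in range(max_steps):
        obs, reward, done = env.step(policy(obs))
        rewards.append(reward)
        if done:
            break
    return rewards


rewards = run_episode(StubEnv(), lambda obs: {"operation": "wait"})
```

The `max_steps` cap is a defensive choice in this sketch so that a policy that never reaches `submit` cannot loop forever.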

## Action Schema

Each step takes a typed `SupportDeskAction`:

- `operation`: `classify|request_info|draft_reply|add_internal_note|submit|wait`
- `queue`: string or null
- `priority`: string or null
- `issue_type`: string or null
- `status`: string or null
- `resolution_code`: string or null
- `requested_fields`: list of strings
- `reply`: string or null
- `internal_note`: string or null
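
To make the field list concrete, here is a hedged mirror of the schema as a dataclass. `SupportDeskActionSketch` is a hypothetical illustration, not the environment's own class; the defaults and the validation in `__post_init__` are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The six operations listed in the schema above.
OPERATIONS = {
    "classify", "request_info", "draft_reply",
    "add_internal_note", "submit", "wait",
}


@dataclass
class SupportDeskActionSketch:
    """Illustrative mirror of the action fields (not the real class)."""

    operation: str
    queue: Optional[str] = None
    priority: Optional[str] = None
    issue_type: Optional[str] = None
    status: Optional[str] = None
    resolution_code: Optional[str] = None
    requested_fields: List[str] = field(default_factory=list)
    reply: Optional[str] = None
    internal_note: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject operations outside the documented set early.
        if self.operation not in OPERATIONS:
            raise ValueError(f"unknown operation: {self.operation!r}")


action = SupportDeskActionSketch(
    operation="classify",
    queue="billing_ops",
    priority="high",
    issue_type="duplicate_charge",
)
```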

## Observation Highlights

The observation includes:

- `task_id`, `difficulty`, `objective`
- `ticket` (customer, tier, region, business impact)
- `knowledge_base` (policy snippets)
- `case` (current triage state)
- `workflow_stage`, `required_next_actions`, `risk_flags`

## Tasks and Difficulty

There are four tasks across three difficulty levels:

- `billing_refund_easy` (easy)
- `account_takeover_medium` (medium)
- `api_incident_hard` (hard)
- `regulated_export_exception_hard` (hard)

## Grading and Reward

- Deterministic graders score task completion
- Final scores are clamped to the range `[0.01, 0.99]`
- Per-step rewards provide dense progress signals across the episode
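
The clamping rule can be illustrated in a few lines. `clamp_score` is a hypothetical helper, and treating both bounds as inclusive is an assumption about how the graders apply the limits.

```python
def clamp_score(raw: float, low: float = 0.01, high: float = 0.99) -> float:
    """Clamp a raw grader score into the documented range (bounds assumed inclusive)."""
    return max(low, min(high, raw))
```

Under this sketch a perfect raw score of `1.0` is reported as `0.99`, and a raw score of `0.0` is reported as `0.01`.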

## Routing Guide (High-Level)

Each entry maps a symptom to `queue`, `priority`, `issue_type`:

- Duplicate charge -> `billing_ops`, `high`, `duplicate_charge`
- Suspicious login -> `trust_and_safety`, `urgent`, `account_compromise`
- Production 500s -> `platform_engineering`, `urgent`, `production_incident`
- Export policy bypass -> `compliance_ops`, `high`, `regulated_exception`
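
The routing guide can be kept as a lookup table keyed by `issue_type`. The `ROUTING` dict and `classify_fields` helper below are hypothetical names; the values are copied from the list above, read as queue, priority, issue type.

```python
from typing import Dict, Tuple

# issue_type -> (queue, priority), transcribed from the routing guide.
ROUTING: Dict[str, Tuple[str, str]] = {
    "duplicate_charge": ("billing_ops", "high"),
    "account_compromise": ("trust_and_safety", "urgent"),
    "production_incident": ("platform_engineering", "urgent"),
    "regulated_exception": ("compliance_ops", "high"),
}


def classify_fields(issue_type: str) -> Dict[str, str]:
    """Build the fields of a `classify` action for a known issue type."""
    queue, priority = ROUTING[issue_type]  # raises KeyError if unknown
    return {
        "operation": "classify",
        "queue": queue,
        "priority": priority,
        "issue_type": issue_type,
    }
```

Deciding which `issue_type` a ticket belongs to is still the agent's job; the table only fills in the queue and priority once that decision is made.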

## Required Environment Variables

Baseline inference uses:

- `API_BASE_URL`
- `MODEL_NAME`
- `HF_TOKEN`
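
One way to fail fast when a variable is missing is sketched below. `load_config` is a hypothetical helper; the baseline script may read these variables differently.

```python
import os
from typing import Dict, Mapping, Optional

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_config(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the three required settings or raise if any is unset or empty."""
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Accepting an optional mapping keeps the helper easy to test without touching the real process environment.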

## Mandatory Stdout Format

The inference script must emit exactly:

```
[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
```

Rules:

- One `[START]` at episode begin
- One `[STEP]` per env step
- One `[END]` after episode close
- `reward` and `rewards` formatted to 2 decimals
- `done`/`success` are lowercase booleans
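
The three line formats can be produced with small helpers like the ones below. The function names are hypothetical, and since the rules above do not state a precision for `score`, this sketch passes it through unformatted.

```python
from typing import Iterable, Optional


def fmt_bool(value: bool) -> str:
    """Render a boolean in lowercase, as the format requires."""
    return "true" if value else "false"


def start_line(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def step_line(step: int, action: str, reward: float, done: bool,
              error: Optional[str] = None) -> str:
    # A missing error is written as the literal string "null".
    err = "null" if error is None else error
    return (f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={fmt_bool(done)} error={err}")


def end_line(success: bool, steps: int, score: float,
             rewards: Iterable[float]) -> str:
    joined = ",".join(f"{r:.2f}" for r in rewards)
    # Score precision is not specified in the rules; left as-is (assumption).
    return (f"[END] success={fmt_bool(success)} steps={steps} "
            f"score={score} rewards={joined}")
```

For instance, `step_line(1, "classify", 0.1, False)` yields `[STEP] step=1 action=classify reward=0.10 done=false error=null`.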