---
title: HyperBrickCaseOps
sdk: docker
app_port: 8000
tags:
  - openenv
  - reinforcement-learning
  - customer-support
base_path: /web
---
# HyperBrickCaseOps

One-sentence summary: HyperBrickCaseOps is a deterministic OpenEnv customer-support operations environment that evaluates whether an agent can triage, communicate, escalate, and resolve enterprise cases correctly end to end.

SupportDesk is a real-world RL environment for enterprise support operations, and it is best thought of as an operations desk rather than a generic support classifier. The agent receives a realistic inbound ticket, a small internal knowledge base, and the live case state. It must route the case, set the right priority, decide whether to request more information, draft the customer response, add an internal note, and submit the case with the correct final status.

This environment is intentionally built around work humans actually do every day in B2B SaaS support queues. It is not a toy chat task and it is not a game. The environment includes enterprise mechanics such as SLA countdowns, business-impact context, and distracting secondary concerns, so the agent has to prioritize the primary operational issue instead of just pattern-matching keywords.
## Environment Description and Motivation

The goal of this environment is to model a real operational gap in agent evaluation: many support benchmarks only test whether a model can produce a plausible reply, but real support work also requires correct routing, escalation, information gathering, and final disposition decisions. SupportDesk is designed to evaluate whether an agent can handle enterprise support operations end to end rather than just generate support-sounding text.

This makes the environment useful for both:

- training agents to improve multi-step support operations behavior
- evaluating whether an agent can make safe and business-correct support decisions under pressure
## Why this should score well

- Real-world utility: customer support triage is a real production workflow with immediate evaluation value.
- Deterministic grading: every task has an explicit gold queue, priority, issue type, required follow-up fields, reply markers, note markers, status, and resolution code.
- Dense rewards: each step is rewarded from the delta in the deterministic grader, which gives partial progress rather than only a binary terminal signal.
- Reproducible baseline: `inference.py` runs all tasks in a fixed order and falls back to a deterministic heuristic policy if model credentials are unavailable.
- Novel mechanics: observations expose SLA pressure, business impact, and secondary concerns, which makes the environment closer to an enterprise operations desk than a plain support classifier.
## Architecture Diagram

```text
Inbound Task Spec + Ticket + KB
              |
              v
SupportDeskEnvironment
  - reset()
  - step(action)
  - state()
              |
              +--> SupportDeskObservation
              +--> dense reward shaping
              +--> episode termination
              |
              v
Deterministic Grader
  - queue correctness
  - priority correctness
  - issue type correctness
  - requested fields
  - reply coverage
  - internal note coverage
  - status / resolution
              |
              v
Baseline in inference.py
  - OpenAI-compatible client path
  - deterministic fallback path
```
## Why this is more novel than a standard support benchmark

- It is not just routing or intent classification. The agent has to combine queueing, urgency, customer communication, internal notes, and final disposition in one trajectory.
- It models primary-vs-secondary issue prioritization. The hardest task includes a tempting compliance side-question that should not override the live outage.
- It encodes enterprise pressure directly in the observation through SLA countdowns, affected-user counts, and business-impact context.
- It evaluates operational judgment, not just answer quality. A polished reply with the wrong queue, wrong escalation choice, or premature resolution still scores poorly.
- It is built specifically for OpenEnv-style agent learning and evaluation, where the same environment can be used for baseline runs, external agents, and RL experiments.
## Action Space

Each `step()` takes a typed `SupportDeskAction` with:

- `operation`: one of `classify`, `request_info`, `draft_reply`, `add_internal_note`, `submit`
- `queue`
- `priority`
- `issue_type`
- `status`
- `resolution_code`
- `requested_fields`
- `reply`
- `internal_note`

The environment allows the agent to update multiple fields in one structured action, which keeps the workflow realistic and helps training.
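As a minimal sketch of what a multi-field action might look like: the dataclass below stands in for the real Pydantic model, and the field values (`billing`, `high`, `duplicate_charge`) are invented for illustration.

```python
# Illustrative sketch of a SupportDeskAction payload. Field names follow the
# list above; this dataclass is a stand-in for the actual Pydantic model.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SupportDeskActionSketch:
    operation: str                            # classify | request_info | draft_reply | add_internal_note | submit
    queue: Optional[str] = None
    priority: Optional[str] = None
    issue_type: Optional[str] = None
    status: Optional[str] = None
    resolution_code: Optional[str] = None
    requested_fields: List[str] = field(default_factory=list)
    reply: Optional[str] = None
    internal_note: Optional[str] = None

# A single classify action can set several case fields at once.
action = SupportDeskActionSketch(
    operation="classify",
    queue="billing",            # invented example values
    priority="high",
    issue_type="duplicate_charge",
)
print(action.operation, action.queue)
```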
## Observation Space

Each observation contains:

- `task_id`, `difficulty`, and the agent objective
- the inbound `ticket`
- ticket-level urgency metadata such as `affected_users`, `sla_minutes_remaining`, `business_impact`, and `secondary_concerns`
- `knowledge_base` policy snippets
- allowed queues, priorities, statuses, and issue types
- the mutable `case` snapshot
- `action_history`
- `feedback`
- `remaining_steps`
- the standard OpenEnv `reward` and `done`
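For orientation, a trimmed observation payload might look roughly like this; the field names follow the list above, but every value (and the ticket content) is invented for illustration.

```python
# Hypothetical, trimmed observation shape. The real environment returns a
# typed Pydantic observation with these field names; all values here are
# invented examples.
observation = {
    "task_id": "billing_refund_easy",
    "difficulty": "easy",
    "ticket": {"subject": "Charged twice this month", "body": "..."},
    "affected_users": 1,                  # urgency metadata
    "sla_minutes_remaining": 120,
    "allowed_queues": ["billing", "trust_and_safety", "platform_engineering"],
    "remaining_steps": 8,
    "reward": 0.0,                        # standard OpenEnv fields
    "done": False,
}
print(observation["task_id"], observation["done"])
```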
## OpenEnv Interface

The environment implements the standard OpenEnv API:

- `reset()` returns the initial typed observation for a new case
- `step(action)` returns the next typed observation together with reward and done status
- `state()` returns the current typed environment state
- `openenv.yaml` provides environment metadata used by validators and deployment tooling

The implementation uses typed Pydantic models for action, observation, and state.
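The reset/step loop can be sketched as follows; `StubSupportDeskEnv` is a minimal stand-in defined inline so the example is self-contained, not the real client from `supportdesk_env`, and the reward values are invented.

```python
# Minimal stub illustrating the OpenEnv reset()/step()/state() loop shape.
# The real environment returns typed Pydantic models, not plain dicts.
class StubSupportDeskEnv:
    def __init__(self):
        self.steps = 0

    def reset(self):
        self.steps = 0
        return {"remaining_steps": 3, "reward": 0.0, "done": False}

    def step(self, action):
        self.steps += 1
        done = self.steps >= 3          # stub episode ends after 3 steps
        return {"remaining_steps": 3 - self.steps, "reward": 0.1, "done": done}

    def state(self):
        return {"steps_taken": self.steps}

env = StubSupportDeskEnv()
obs = env.reset()
total = 0.0
while not obs["done"]:
    # A real agent would choose actions from the observation; the stub ignores them.
    obs = env.step({"operation": "classify"})
    total += obs["reward"]
print(round(total, 1), env.state()["steps_taken"])
```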
## Task Descriptions with Expected Difficulty

1. `billing_refund_easy` - Expected difficulty: easy
   Duplicate-charge billing ticket. The correct path is immediate billing routing, a refund confirmation, and case resolution.
2. `account_takeover_medium` - Expected difficulty: medium
   Suspicious-login security ticket. The agent must escalate to trust and safety, request verification details, and keep the case waiting on the customer.
3. `api_incident_hard` - Expected difficulty: hard
   Enterprise production API incident with a distracting compliance mention. The agent must escalate to platform engineering, request the right diagnostics, and open the incident instead of resolving it.

What makes these tasks less generic than ordinary support-routing demos:

- They mix queueing, priority, customer communication, internal note-taking, and close-vs-escalate decisions in one trajectory.
- They include operational context like customer tier, affected-user count, SLA pressure, and business impact.
- The harder tasks contain conflicting or distracting signals, so a frontier model has to identify the primary issue instead of treating every mention as equally important.
## Deterministic Graders

The final task score is a weighted total in `[0.0, 1.0]`:

- Queue correctness: `0.15`
- Priority correctness: `0.10`
- Issue-type correctness: `0.10`
- Requested-fields correctness: `0.15`
- Reply coverage: `0.25`
- Internal-note coverage: `0.10`
- Final status: `0.10`
- Resolution code: `0.05`

The same grader also drives dense reward shaping during the episode by comparing the current score to the previous score and then subtracting small penalties for no-op or low-signal actions.
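The scoring scheme above can be sketched directly: the weights are the ones listed, while the component scores, the penalty value, and the function names below are illustrative, not the actual `graders.py` internals.

```python
# Weights copied from the grader table above; they sum to 1.00.
WEIGHTS = {
    "queue": 0.15, "priority": 0.10, "issue_type": 0.10,
    "requested_fields": 0.15, "reply": 0.25, "internal_note": 0.10,
    "status": 0.10, "resolution_code": 0.05,
}

def task_score(components):
    """Weighted total in [0.0, 1.0]; each component score is in [0.0, 1.0]."""
    return sum(WEIGHTS[k] * components.get(k, 0.0) for k in WEIGHTS)

def shaped_reward(prev_score, new_score, noop_penalty=0.0):
    """Dense per-step reward: grader delta minus a small penalty for
    no-op or low-signal actions (penalty value is illustrative)."""
    return new_score - prev_score - noop_penalty

perfect = {k: 1.0 for k in WEIGHTS}
print(round(task_score(perfect), 10))       # all components correct
print(round(shaped_reward(0.25, 0.50), 10)) # positive delta, no penalty
```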
## Project Layout

```text
.
|-- inference.py
|-- openenv.yaml
|-- pyproject.toml
|-- requirements.txt
|-- supportdesk_env
|   |-- __init__.py
|   |-- client.py
|   |-- graders.py
|   |-- models.py
|   |-- tasks.py
|   `-- server
|       |-- app.py
|       `-- supportdesk_environment.py
|-- tests
|   `-- test_supportdesk.py
`-- uv.lock
```
## Local Setup

```bash
pip install -r requirements.txt
```

Or with uv:

```bash
uv sync
```

Optional environment variables for the baseline:

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="openai/gpt-oss-120b"
export OPENAI_API_KEY="sk-..."  # Or use HF_TOKEN with a compatible router
export HF_TOKEN="hf_..."
```

The baseline uses the OpenAI Python client and supports both `OPENAI_API_KEY` and `HF_TOKEN`.
## Setup and Usage Instructions

Typical local workflow:

```bash
pip install -r requirements.txt
python -m openenv.cli validate .
python inference.py
python -m supportdesk_env.server.app
```
## Local RL Playground

If you want to import the package directly and train against the local environment without going through the HTTP server, use the tabular Q-learning example:

```bash
python examples/rl/train_q_agent.py
```

This script imports the package, instantiates `SupportDeskEnvironment` directly, trains a tiny Q-learning agent over a compact discrete action library, and then prints greedy evaluation results for all three tasks. It is meant as a local experimentation playground, not as the official submission baseline.
## Run the Server

```bash
python -m supportdesk_env.server.app
```

Or with the OpenEnv entrypoint:

```bash
server
```
## Run the Baseline

```bash
python inference.py
```

When model credentials are present, the script uses the OpenAI client against `API_BASE_URL` and `MODEL_NAME`. If credentials are missing or a request fails, it falls back to a deterministic heuristic policy so the script still completes and prints reproducible scores.
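The fallback behavior can be sketched like this; the function names and the heuristic action below are invented for illustration and do not mirror the actual `inference.py` internals.

```python
# Sketch of the try-model-then-fall-back pattern described above.
# `call_model` stands in for an OpenAI-client request; names are illustrative.
def choose_action(observation, call_model=None):
    if call_model is not None:
        try:
            return call_model(observation)
        except Exception:
            pass  # request failed: fall through to the deterministic heuristic
    # Deterministic heuristic stand-in, so the run always completes.
    return {"operation": "classify"}

def failing_model(obs):
    raise RuntimeError("simulated request failure")

# No credentials and a failing request both land on the heuristic path.
print(choose_action({}, call_model=None)["operation"])
print(choose_action({}, call_model=failing_model)["operation"])
```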
## Docker

```bash
docker build -t supportdesk-env .
docker run -p 8000:8000 supportdesk-env
```
## Hugging Face Space Deployment

Deploy this repo as a Docker Space and keep it public for submission. The Space should include the `openenv` tag and the following environment configuration values:

- `API_BASE_URL`
- `MODEL_NAME`
- `HF_TOKEN`

If the OpenEnv CLI is installed, deployment can be done with:

```bash
openenv push --repo-id your-username/HyperBrickCaseOps
```
## Validation

```bash
openenv validate .
```

For a full pre-submission pass against a deployed Space:

```bash
./scripts/validate-submission.sh https://your-space.hf.space .
```
## Submission Checklist

- Public GitHub repository with this codebase
- Root `inference.py`
- Working Docker build
- Deployed Hugging Face Docker Space tagged `openenv`
- Space secrets configured: `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`
- README present with environment overview, action/observation definitions, tasks, setup, and baseline scores
## Baseline Scores

Expected deterministic fallback baseline:

- `billing_refund_easy`: `1.00`
- `account_takeover_medium`: `1.00`
- `api_incident_hard`: `1.00`
- Average: `1.00`

These scores are deliberately reproducible because the fallback policy follows the gold workflow exactly. A model-backed run will typically score lower unless the prompt or model is improved, which makes the environment useful for both training and evaluation.