---
title: HyperBrickCaseOps
sdk: docker
app_port: 8000
tags:
  - openenv
  - reinforcement-learning
  - customer-support
base_path: /web
---


# HyperBrickCaseOps

HyperBrickCaseOps (implemented in the `supportdesk_env` package and referred to below as SupportDesk) is best thought of as an enterprise operations-desk environment, not a generic support classifier.

SupportDesk is a real-world RL environment for enterprise support operations. The agent receives a realistic inbound ticket, a small internal knowledge base, and the live case state. It must route the case, set the right priority, decide whether to request more information, draft the customer response, add an internal note, and submit the case with the correct final status.

One-sentence summary: HyperBrickCaseOps is a deterministic OpenEnv customer-support operations environment that evaluates whether an agent can triage, communicate, escalate, and resolve enterprise cases correctly end to end.

This environment is intentionally built around work humans actually do every day in B2B SaaS support queues. It is not a toy chat task and it is not a game. The environment includes enterprise mechanics such as SLA countdowns, business-impact context, and distracting secondary concerns, so the agent has to prioritize the primary operational issue instead of just pattern-matching keywords.

## Environment Description and Motivation

The goal of this environment is to model a real operational gap in agent evaluation: many support benchmarks only test whether a model can produce a plausible reply, but real support work also requires correct routing, escalation, information gathering, and final disposition decisions. SupportDesk is designed to evaluate whether an agent can handle enterprise support operations end to end rather than just generate support-sounding text.

This makes the environment useful for both:

- training agents to improve multi-step support operations behavior
- evaluating whether an agent can make safe and business-correct support decisions under pressure

## Why this should score well

- Real-world utility: customer support triage is a real production workflow with immediate evaluation value.
- Deterministic grading: every task has an explicit gold queue, priority, issue type, required follow-up fields, reply markers, note markers, status, and resolution code.
- Dense rewards: each step gets rewarded from the delta in the deterministic grader, which gives partial progress rather than only a binary terminal signal.
- Reproducible baseline: `inference.py` runs all tasks in a fixed order and falls back to a deterministic heuristic policy if model credentials are unavailable.
- Novel mechanics: observations expose SLA pressure, business impact, and secondary concerns, which makes the environment closer to an enterprise operations desk than a plain support classifier.

## Architecture Diagram

```text
Inbound Task Spec + Ticket + KB
            |
            v
  SupportDeskEnvironment
  - reset()
  - step(action)
  - state()
            |
            +--> SupportDeskObservation
            +--> dense reward shaping
            +--> episode termination
            |
            v
     Deterministic Grader
     - queue correctness
     - priority correctness
     - issue type correctness
     - requested fields
     - reply coverage
     - internal note coverage
     - status / resolution
            |
            v
   Baseline in inference.py
   - OpenAI-compatible client path
   - deterministic fallback path
```

## Why this is more novel than a standard support benchmark

- It is not just routing or intent classification. The agent has to combine queueing, urgency, customer communication, internal notes, and final disposition in one trajectory.
- It models primary-vs-secondary issue prioritization. The hardest task includes a tempting compliance side-question that should not override the live outage.
- It encodes enterprise pressure directly in the observation through SLA countdowns, affected-user counts, and business-impact context.
- It evaluates operational judgment, not just answer quality. A polished reply with the wrong queue, wrong escalation choice, or premature resolution still scores poorly.
- It is built specifically for OpenEnv-style agent learning and evaluation, where the same environment can be used for baseline runs, external agents, and RL experiments.

## Action Space

Each `step()` takes a typed `SupportDeskAction` with:

- `operation`: one of `classify`, `request_info`, `draft_reply`, `add_internal_note`, `submit`
- `queue`
- `priority`
- `issue_type`
- `status`
- `resolution_code`
- `requested_fields`
- `reply`
- `internal_note`

The environment allows the agent to update multiple fields in one structured action, which keeps the workflow realistic and helps training.
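
As a rough illustration of the action shape, the sketch below mirrors the field list above. It is not the project's `SupportDeskAction`: the real model is Pydantic, while this stand-in uses stdlib dataclasses to stay dependency-free. The class name `ActionSketch`, the field types, the defaults, and the example queue, priority, and issue-type values are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Operation names are taken from the list above.
OPERATIONS = ("classify", "request_info", "draft_reply", "add_internal_note", "submit")


@dataclass
class ActionSketch:
    """Hypothetical stand-in for SupportDeskAction; types and defaults are assumed."""

    operation: str
    queue: Optional[str] = None
    priority: Optional[str] = None
    issue_type: Optional[str] = None
    status: Optional[str] = None
    resolution_code: Optional[str] = None
    requested_fields: List[str] = field(default_factory=list)
    reply: Optional[str] = None
    internal_note: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject operations outside the documented set.
        if self.operation not in OPERATIONS:
            raise ValueError(f"unknown operation: {self.operation!r}")


# One structured action can update several case fields at once.
# The string values here are made-up placeholders, not the environment's real labels.
action = ActionSketch(
    operation="classify",
    queue="billing",
    priority="high",
    issue_type="duplicate_charge",
)
```

Fields left unset stay `None` (or an empty list), so an action only carries the updates the agent actually intends to make.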

## Observation Space

Each observation contains:

- `task_id`, `difficulty`, and the agent objective
- the inbound `ticket`
- ticket-level urgency metadata such as `affected_users`, `sla_minutes_remaining`, `business_impact`, and `secondary_concerns`
- `knowledge_base` policy snippets
- allowed queues, priorities, statuses, and issue types
- the mutable `case` snapshot
- `action_history`
- `feedback`
- `remaining_steps`
- the standard OpenEnv `reward` and `done`

## OpenEnv Interface

The environment implements the standard OpenEnv API:

- `reset()` returns the initial typed observation for a new case
- `step(action)` returns the next typed observation together with reward and done status
- `state()` returns the current typed environment state
- `openenv.yaml` provides environment metadata used by validators and deployment tooling

The implementation uses typed Pydantic models for action, observation, and state.
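
To make the call pattern concrete, here is a minimal sketch of an episode loop over the `reset()` / `step(action)` / `state()` contract. `ToyEnv` is a hypothetical stand-in written for this example, not `SupportDeskEnvironment`: it returns plain dicts instead of typed models, always rewards `0.0`, and ends the episode on `submit` or when the step budget runs out. Those behaviours are assumptions chosen only to keep the loop runnable.

```python
from typing import Any, Callable, Dict, Tuple


class ToyEnv:
    """Hypothetical environment that mimics the reset/step/state contract."""

    def __init__(self, max_steps: int = 3) -> None:
        self.max_steps = max_steps
        self.steps_taken = 0

    def reset(self) -> Dict[str, Any]:
        self.steps_taken = 0
        return {"remaining_steps": self.max_steps, "reward": 0.0, "done": False}

    def step(self, action: Dict[str, Any]) -> Dict[str, Any]:
        self.steps_taken += 1
        # The episode ends on submit or when the step budget is exhausted.
        done = action.get("operation") == "submit" or self.steps_taken >= self.max_steps
        return {
            "remaining_steps": self.max_steps - self.steps_taken,
            "reward": 0.0,
            "done": done,
        }

    def state(self) -> Dict[str, Any]:
        return {"steps_taken": self.steps_taken}


def run_episode(
    env: ToyEnv, policy: Callable[[Dict[str, Any]], Dict[str, Any]]
) -> Tuple[float, Dict[str, Any]]:
    """Roll out one episode and return (total reward, final state)."""
    obs = env.reset()
    total = 0.0
    while not obs["done"]:
        obs = env.step(policy(obs))
        total += obs["reward"]
    return total, env.state()


total, final_state = run_episode(ToyEnv(), lambda obs: {"operation": "submit"})
```

The same loop shape should carry over to the real environment once the dict accesses are replaced with attribute access on the typed observation.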

## Task Descriptions with Expected Difficulty

1. `billing_refund_easy` - Expected difficulty: easy
   Duplicate-charge billing ticket. The correct path is immediate billing routing, a refund confirmation, and case resolution.
2. `account_takeover_medium` - Expected difficulty: medium
   Suspicious-login security ticket. The agent must escalate to trust and safety, request verification details, and keep the case waiting on the customer.
3. `api_incident_hard` - Expected difficulty: hard
   Enterprise production API incident with a distracting compliance mention. The agent must escalate to platform engineering, request the right diagnostics, and open the incident instead of resolving it.

What makes these tasks less generic than ordinary support-routing demos:

- They mix queueing, priority, customer communication, internal note-taking, and close-vs-escalate decisions in one trajectory.
- They include operational context like customer tier, affected-user count, SLA pressure, and business impact.
- The harder tasks contain conflicting or distracting signals, so a frontier model has to identify the primary issue instead of treating every mention as equally important.

## Deterministic Graders

The final task score is a weighted total in `[0.0, 1.0]`:

- Queue correctness: `0.15`
- Priority correctness: `0.10`
- Issue-type correctness: `0.10`
- Requested-fields correctness: `0.15`
- Reply coverage: `0.25`
- Internal-note coverage: `0.10`
- Final status: `0.10`
- Resolution code: `0.05`
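
The weights above sum to `1.00`, so the total stays within `[0.0, 1.0]` as long as each component score does. The sketch below shows that arithmetic only. The dictionary keys, the function name `weighted_score`, and the treatment of missing components as `0.0` are assumptions; how each component score is computed inside `graders.py` is not shown here.

```python
from typing import Dict

# Weights copied from the list above; the key names are illustrative.
WEIGHTS: Dict[str, float] = {
    "queue": 0.15,
    "priority": 0.10,
    "issue_type": 0.10,
    "requested_fields": 0.15,
    "reply": 0.25,
    "internal_note": 0.10,
    "status": 0.10,
    "resolution_code": 0.05,
}


def weighted_score(components: Dict[str, float]) -> float:
    """Combine per-component scores in [0, 1] into one weighted total.

    Missing components count as 0.0 and out-of-range values are clamped
    (both are assumptions made for this sketch).
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(max(components.get(name, 0.0), 0.0), 1.0)
        total += weight * value
    return round(total, 6)


# Correct queue and priority plus a half-covered reply:
# 0.15 + 0.10 + 0.25 * 0.5 = 0.375
partial = weighted_score({"queue": 1.0, "priority": 1.0, "reply": 0.5})
```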

The same grader also drives dense reward shaping during the episode: each step's reward is the change in grader score since the previous step, minus a small penalty for no-op or low-signal actions.
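
The score-delta idea can be sketched as follows. The function names and the `0.01` penalty are hypothetical; the actual penalty sizes and the rule for deciding what counts as a no-op live in the environment code and are not reproduced here.

```python
from typing import List


def shaped_reward(previous_score: float, current_score: float, penalty: float = 0.0) -> float:
    """Reward for one step: change in grader score minus any penalty."""
    return (current_score - previous_score) - penalty


def episode_rewards(scores: List[float], penalties: List[float]) -> List[float]:
    """Per-step rewards for a sequence of grader scores, starting from 0.0."""
    rewards = []
    previous = 0.0
    for score, penalty in zip(scores, penalties):
        rewards.append(shaped_reward(previous, score, penalty))
        previous = score
    return rewards


# Three steps: progress, a no-op with an assumed 0.01 penalty, then more progress.
rewards = episode_rewards([0.25, 0.25, 0.60], [0.0, 0.01, 0.0])
```

Because the deltas telescope, the rewards over an episode sum to the final grader score minus the accumulated penalties, so the dense signal stays consistent with the terminal score.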

## Project Layout

```text
.
|-- inference.py
|-- openenv.yaml
|-- pyproject.toml
|-- requirements.txt
|-- supportdesk_env
|   |-- __init__.py
|   |-- client.py
|   |-- graders.py
|   |-- models.py
|   |-- tasks.py
|   `-- server
|       |-- app.py
|       `-- supportdesk_environment.py
|-- tests
|   `-- test_supportdesk.py
`-- uv.lock
```

## Local Setup

```bash
pip install -r requirements.txt
```

Or with uv:

```bash
uv sync
```

Optional environment variables for the baseline:

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="openai/gpt-oss-120b"
export OPENAI_API_KEY="sk-..."  # Or use HF_TOKEN with a compatible router
export HF_TOKEN="hf_..."
```

The baseline uses the OpenAI Python client and supports both `OPENAI_API_KEY` and `HF_TOKEN`.

## Setup and Usage Instructions

Typical local workflow:

```bash
pip install -r requirements.txt
python -m openenv.cli validate .
python inference.py
python -m supportdesk_env.server.app
```

## Local RL Playground

If you want to import the package directly and train against the local environment without going through the HTTP server, use the tabular Q-learning example:

```bash
python examples/rl/train_q_agent.py
```

This script imports the package, instantiates `SupportDeskEnvironment` directly, trains a tiny Q-learning agent over a compact discrete action library, and then prints greedy evaluation results for all three tasks. It is meant as a local experimentation playground, not as the official submission baseline.

## Run the Server

```bash
python -m supportdesk_env.server.app
```

Or with the OpenEnv entrypoint:

```bash
server
```

## Run the Baseline

```bash
python inference.py
```

When model credentials are present, the script uses the OpenAI client against `API_BASE_URL` and `MODEL_NAME`. If credentials are missing or a request fails, it falls back to a deterministic heuristic policy so the script still completes and prints reproducible scores.
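
The fallback behaviour described above can be sketched as a small wrapper. This is not the code in `inference.py`: the helper names, the order in which `OPENAI_API_KEY` and `HF_TOKEN` are checked, and the broad `except Exception` are assumptions made for illustration.

```python
import os
from typing import Any, Callable, Mapping, Optional

Policy = Callable[[Any], Any]


def pick_credentials(env: Mapping[str, str]) -> Optional[str]:
    """Return the first available credential, or None (lookup order is assumed)."""
    return env.get("OPENAI_API_KEY") or env.get("HF_TOKEN") or None


def choose_action(
    observation: Any,
    model_policy: Policy,
    heuristic_policy: Policy,
    env: Mapping[str, str] = os.environ,
) -> Any:
    """Use the model when credentials exist; otherwise, or on failure, use the heuristic."""
    if pick_credentials(env) is None:
        return heuristic_policy(observation)
    try:
        return model_policy(observation)
    except Exception:
        # Any request failure falls back to the deterministic path.
        return heuristic_policy(observation)


def failing_model(observation: Any) -> Any:
    raise RuntimeError("simulated request failure")


# With credentials present but a failing request, the heuristic result is used.
result = choose_action({}, failing_model, lambda obs: "heuristic", env={"HF_TOKEN": "hf_..."})
```

Passing `env` explicitly keeps the wrapper easy to test without touching the real process environment.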

## Docker

```bash
docker build -t supportdesk-env .
docker run -p 8000:8000 supportdesk-env
```

## Hugging Face Space Deployment

Deploy this repo as a Docker Space and keep it public for submission. The Space should include the `openenv` tag and the following environment configuration values:

- `API_BASE_URL`
- `MODEL_NAME`
- `HF_TOKEN`

If the OpenEnv CLI is installed, deployment can be done with:

```bash
openenv push --repo-id your-username/HyperBrickCaseOps
```

## Validation

```bash
openenv validate .
```

For a full pre-submission pass against a deployed Space:

```bash
./scripts/validate-submission.sh https://your-space.hf.space .
```

## Submission Checklist

- Public GitHub repository with this codebase
- Root `inference.py`
- Working Docker build
- Deployed Hugging Face Docker Space tagged `openenv`
- Space secrets configured: `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`
- README present with environment overview, action/observation definitions, tasks, setup, and baseline scores

## Baseline Scores

Expected deterministic fallback baseline:

- `billing_refund_easy`: `1.00`
- `account_takeover_medium`: `1.00`
- `api_incident_hard`: `1.00`
- Average: `1.00`

These scores are reproducible by design: the fallback policy follows the gold workflow exactly, so it reaches the maximum score on every task. A model-backed run will typically score lower until the prompt or model is improved, and that headroom is what makes the environment useful for both training and evaluation.