---
title: Support Ops OpenEnv
emoji: πŸ“¦
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Support Ops OpenEnv

`support_ops_env` is a real-world OpenEnv benchmark for customer support operations. The agent is not answering trivia or playing a game; it is working a realistic support queue where it must inspect business artifacts, look up operational policy, draft a customer reply, and submit a final case resolution.

## Why this environment

Modern tool-using agents often fail on operational workflows that require evidence gathering, policy compliance, and safe escalation. This environment targets that gap with deterministic tasks that resemble what ecommerce support, trust-and-safety, and operations agents do every day.

## Task set

The benchmark ships with three deterministic tasks and matching deterministic graders:

1. `damaged-mug-replacement` (`easy`)
   Resolve a damaged-item replacement request.
2. `duplicate-charge-refund` (`medium`)
   Investigate a duplicate billing complaint and refund the extra capture.
3. `account-takeover-fraud` (`hard`)
   Handle a suspected account takeover with a security-first fraud escalation.

Each task has a fixed expected resolution, required evidence, and reply keywords. The grader returns a score in `[0.0, 1.0]` computed as a weighted combination of resolution accuracy, evidence coverage, reply quality, and efficiency.

## Action space

The environment uses a typed `ToolUseAction` model with these actions:

- `review_ticket`
- `inspect_artifact`
- `search_policy`
- `draft_reply`
- `submit_resolution`

Optional fields on the action are `artifact_id`, `query`, `message`, and `resolution_code`.
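
As an illustrative sketch only, the action shape resembles the dataclass below; the real typed model is defined in `tool_use_env/models.py` and may carry additional validation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolUseAction:
    """Illustrative shape only; see tool_use_env/models.py for the real model."""
    action_type: str                       # one of the five actions above
    artifact_id: Optional[str] = None      # used by inspect_artifact
    query: Optional[str] = None            # used by search_policy
    message: Optional[str] = None          # used by draft_reply
    resolution_code: Optional[str] = None  # used by submit_resolution
```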

## Observation space

The typed `ToolUseObservation` includes:

- `task_id`, `difficulty`, `objective`
- `customer_message`
- `workspace_summary`
- `available_actions`
- `available_resolution_codes`
- `collected_evidence`
- `last_tool_result`
- `last_action_error`
- `remaining_steps`
- `current_score`

The typed `ToolUseState` exposes internal progress such as `final_score`, `drafted_reply`, `resolution_code`, `required_evidence`, `collected_evidence`, and action history.

## How to use

Each episode is a support case. The agent should usually follow this flow:

1. Read the customer ticket.
2. Inspect the relevant business artifacts.
3. Look up the matching policy.
4. Draft a customer-facing reply.
5. Submit the final resolution code.
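
The five-step flow above can be sketched as a generator of action payloads. This is a hypothetical helper, not part of the package; the dicts it yields match the JSON examples shown later in this section:

```python
from typing import Iterator

def scripted_actions(artifact_ids: list, policy_query: str,
                     reply: str, resolution_code: str) -> Iterator[dict]:
    """Yield the five-step support flow as action payloads (illustrative)."""
    yield {"action_type": "review_ticket"}                # 1. read the ticket
    for artifact_id in artifact_ids:                      # 2. gather evidence
        yield {"action_type": "inspect_artifact", "artifact_id": artifact_id}
    yield {"action_type": "search_policy",                # 3. look up policy
           "query": policy_query}
    yield {"action_type": "draft_reply",                  # 4. draft the reply
           "message": reply}
    yield {"action_type": "submit_resolution",            # 5. resolve the case
           "resolution_code": resolution_code}
```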

### What each action field means

- `action_type`
  The operation you want the environment to perform.
- `artifact_id`
  The internal record you want to inspect. Examples: `order`, `payment`, `account`, `risk_log`.
- `query`
  The policy lookup term. Examples: `damaged_items`, `duplicate_charge`, `account_takeover`.
- `message`
  The reply draft that would be sent to the customer.
- `resolution_code`
  The final case outcome you want to submit. Examples: `send_replacement`, `refund_duplicate_charge`, `lock_account_and_escalate_fraud`.

### Typical action examples

Review the ticket:

```json
{
  "action_type": "review_ticket"
}
```

Inspect an order record:

```json
{
  "action_type": "inspect_artifact",
  "artifact_id": "order"
}
```

Look up a policy:

```json
{
  "action_type": "search_policy",
  "query": "duplicate_charge"
}
```

Save a reply draft:

```json
{
  "action_type": "draft_reply",
  "message": "We confirmed the duplicate charge and issued a refund. You should see it in 3-5 business days."
}
```

Submit the final resolution:

```json
{
  "action_type": "submit_resolution",
  "resolution_code": "refund_duplicate_charge"
}
```

### How the playground works

If `/web` is enabled, the playground lets you send one action at a time.

- Start with `Reset`.
- Enter the action fields for the next step.
- Use `Get state` to inspect internal progress.
- Keep stepping until you submit a final resolution or run out of steps.

The observation will show:

- which evidence you have already collected
- the last tool result
- any action validation error
- your current partial score
- how many steps remain

## Reward design

The reward is shaped over the full trajectory:

- Positive reward for first-time collection of relevant artifacts and policies
- Smaller reward for drafting a reply that includes required customer-facing details
- Very small or zero reward for repeated or invalid actions
- Final step reward equal to the deterministic grader score

This gives agents signal before the final submission while still anchoring the episode outcome to task completion quality.
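
A minimal sketch of this shaping, with invented magnitudes (the environment defines the real values):

```python
def step_reward(event: str, grader_score: float = 0.0) -> float:
    """Illustrative per-step shaping; magnitudes here are made up."""
    rewards = {
        "new_evidence": 0.1,           # first-time artifact/policy collection
        "good_draft": 0.05,            # reply covers required details
        "repeat_or_invalid": 0.0,      # no reward for wasted steps
        "final_submit": grader_score,  # final step: deterministic grader score
    }
    return rewards[event]
```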

## Setup

### Local Python

```bash
UV_CACHE_DIR=/tmp/uv-cache uv sync
.venv/bin/pip install -e .
```

### Run the server

```bash
.venv/bin/python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Docker

```bash
docker build -t support-ops-openenv .
docker run --rm -p 8000:8000 support-ops-openenv
```

## Baseline inference

The root-level `inference.py` (required by the benchmark) uses the OpenAI client for model calls and emits the mandatory `[START]`, `[STEP]`, and `[END]` logs.

Environment variables:

- `HF_TOKEN` or `OPENAI_API_KEY`
- `API_BASE_URL`
- `MODEL_NAME`
- `LOCAL_IMAGE_NAME` if you want to run via `from_docker_image()`
- `ENV_BASE_URL` if you want to connect to a running server
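
The credential fallback can be sketched as follows; this is an illustrative helper, not the actual `inference.py` code:

```python
import os
from typing import Mapping, Optional

def resolve_api_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """HF_TOKEN takes precedence; None means use the scripted fallback policy."""
    return env.get("HF_TOKEN") or env.get("OPENAI_API_KEY")
```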

Example:

```bash
export HF_TOKEN=...
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
python inference.py
```

The script evaluates all three tasks in a fixed order for reproducible scoring. If no API key is available, it falls back to a deterministic scripted policy so the benchmark remains runnable offline.

## Expected baseline behavior

The bundled fallback policy should solve all three tasks with high scores because it follows the intended evidence path exactly. Frontier LLMs should also perform well on the easy and medium tasks and show larger variance on the hard fraud-escalation task if they over-index on issuing refunds instead of following policy.

## Project structure

```text
.
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ README.md
β”œβ”€β”€ inference.py
β”œβ”€β”€ openenv.yaml
β”œβ”€β”€ pyproject.toml
β”œβ”€β”€ server/
β”‚   └── app.py
└── tool_use_env/
    β”œβ”€β”€ client.py
    β”œβ”€β”€ grader.py
    β”œβ”€β”€ models.py
    β”œβ”€β”€ tasks.py
    └── server/
        β”œβ”€β”€ app.py
        └── tool_use_env_environment.py
```