---
title: Support Ops OpenEnv
emoji: π¦
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
---
# Support Ops OpenEnv
`support_ops_env` is a real-world OpenEnv benchmark for customer support operations. The agent is not answering trivia or playing a game; it is working a realistic support queue where it must inspect business artifacts, look up operational policy, draft a customer reply, and submit a final case resolution.
## Why this environment
Modern tool-using agents often fail on operational workflows that require evidence gathering, policy compliance, and safe escalation. This environment targets that gap with deterministic tasks that resemble what ecommerce support, trust-and-safety, and operations agents do every day.
## Task set
The benchmark ships with three deterministic tasks and matching deterministic graders:
1. `damaged-mug-replacement` (`easy`)
Resolve a damaged-item replacement request.
2. `duplicate-charge-refund` (`medium`)
Investigate a duplicate billing complaint and refund the extra capture.
3. `account-takeover-fraud` (`hard`)
Handle a suspected account takeover with a security-first fraud escalation.
Each task has a fixed expected resolution, required evidence, and reply keywords. The grader returns a score in `[0.0, 1.0]` from weighted resolution accuracy, evidence coverage, reply quality, and efficiency.
## Action space
The environment uses a typed `ToolUseAction` model with these actions:
- `review_ticket`
- `inspect_artifact`
- `search_policy`
- `draft_reply`
- `submit_resolution`
Optional fields on the action are `artifact_id`, `query`, `message`, and `resolution_code`.
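A minimal sketch of such a typed action model, using a plain dataclass for illustration (the real `ToolUseAction` lives in `tool_use_env/models.py` and may validate differently):

```python
from dataclasses import dataclass
from typing import Optional

# Allowed operations, mirroring the action list above.
ACTION_TYPES = {"review_ticket", "inspect_artifact", "search_policy",
                "draft_reply", "submit_resolution"}

@dataclass
class ToolUseAction:
    action_type: str
    artifact_id: Optional[str] = None
    query: Optional[str] = None
    message: Optional[str] = None
    resolution_code: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject action types outside the fixed action space.
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")
```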
## Observation space
The typed `ToolUseObservation` includes:
- `task_id`, `difficulty`, `objective`
- `customer_message`
- `workspace_summary`
- `available_actions`
- `available_resolution_codes`
- `collected_evidence`
- `last_tool_result`
- `last_action_error`
- `remaining_steps`
- `current_score`
The typed `ToolUseState` exposes internal progress such as `final_score`, `drafted_reply`, `resolution_code`, `required_evidence`, `collected_evidence`, and action history.
## How to use
Each episode is a support case. The agent should usually follow this flow:
1. Read the customer ticket.
2. Inspect the relevant business artifacts.
3. Look up the matching policy.
4. Draft a customer-facing reply.
5. Submit the final resolution code.
### What each action field means
- `action_type`
The operation you want the environment to perform.
- `artifact_id`
The internal record you want to inspect. Examples: `order`, `payment`, `account`, `risk_log`.
- `query`
The policy lookup term. Examples: `damaged_items`, `duplicate_charge`, `account_takeover`.
- `message`
The reply draft that would be sent to the customer.
- `resolution_code`
The final case outcome you want to submit. Examples: `send_replacement`, `refund_duplicate_charge`, `lock_account_and_escalate_fraud`.
### Typical action examples
Review the ticket:
```json
{
"action_type": "review_ticket"
}
```
Inspect an order record:
```json
{
"action_type": "inspect_artifact",
"artifact_id": "order"
}
```
Look up a policy:
```json
{
"action_type": "search_policy",
"query": "duplicate_charge"
}
```
Save a reply draft:
```json
{
"action_type": "draft_reply",
"message": "We confirmed the duplicate charge and issued a refund. You should see it in 3-5 business days."
}
```
Submit the final resolution:
```json
{
"action_type": "submit_resolution",
"resolution_code": "refund_duplicate_charge"
}
```
### How the playground works
If `/web` is enabled, the playground lets you send one action at a time.
- Start with `Reset`.
- Enter the action fields for the next step.
- Use `Get state` to inspect internal progress.
- Keep stepping until you submit a final resolution or run out of steps.
The observation will show:
- which evidence you have already collected
- the last tool result
- any action validation error
- your current partial score
- how many steps remain
## Reward design
The reward is shaped over the full trajectory:
- Positive reward for first-time collection of relevant artifacts and policies
- Smaller reward for drafting a reply that includes required customer-facing details
- Very small or zero reward for repeated or invalid actions
- Final step reward equal to the deterministic grader score
This gives agents signal before the final submission while still anchoring the episode outcome to task completion quality.
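As a toy sketch of that shaping scheme (all magnitudes and the function shape are hypothetical; the environment's actual values differ):

```python
from typing import Optional

def shaped_reward(action_type: str, target: Optional[str], seen: set,
                  is_valid: bool, final_score: Optional[float] = None) -> float:
    """Toy per-step reward following the shaping described above.

    Magnitudes are illustrative, not the environment's real constants.
    """
    if final_score is not None:            # terminal step: grader score
        return final_score
    if not is_valid:                       # invalid action: no reward
        return 0.0
    key = f"{action_type}:{target}"
    if action_type in ("inspect_artifact", "search_policy"):
        if key in seen:                    # repeats earn nothing
            return 0.0
        seen.add(key)                      # first-time evidence collection
        return 0.1
    if action_type == "draft_reply":       # smaller reward for drafting
        return 0.05
    return 0.0
```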
## Setup
### Local Python
```bash
UV_CACHE_DIR=/tmp/uv-cache uv sync
.venv/bin/pip install -e .
```
### Run the server
```bash
.venv/bin/python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```
### Docker
```bash
docker build -t support-ops-openenv .
docker run --rm -p 8000:8000 support-ops-openenv
```
## Baseline inference
The required root `inference.py` uses the OpenAI client for model calls and emits the mandatory `[START]`, `[STEP]`, and `[END]` logs.
Environment variables:
- `HF_TOKEN` or `OPENAI_API_KEY`
- `API_BASE_URL`
- `MODEL_NAME`
- `LOCAL_IMAGE_NAME` if you want to run via `from_docker_image()`
- `ENV_BASE_URL` if you want to connect to a running server
Example:
```bash
export HF_TOKEN=...
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
python inference.py
```
The script evaluates all three tasks in a fixed order for reproducible scoring. If no API key is available, it falls back to a deterministic scripted policy so the benchmark remains runnable offline.
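Such a scripted fallback can be as simple as replaying the intended evidence path for each task. A sketch, with one task's script shown (the table below is illustrative, not the shipped policy):

```python
# Illustrative scripted policy: replay the intended evidence path per task.
# This table is an example, not the actual fallback in inference.py.
SCRIPTS = {
    "duplicate-charge-refund": [
        {"action_type": "review_ticket"},
        {"action_type": "inspect_artifact", "artifact_id": "payment"},
        {"action_type": "search_policy", "query": "duplicate_charge"},
        {"action_type": "draft_reply",
         "message": "We confirmed the duplicate charge and issued a refund."},
        {"action_type": "submit_resolution",
         "resolution_code": "refund_duplicate_charge"},
    ],
}

def scripted_action(task_id: str, step: int) -> dict:
    """Return the next scripted action, holding the last one if steps remain."""
    script = SCRIPTS[task_id]
    return script[min(step, len(script) - 1)]
```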
## Expected baseline behavior
The bundled fallback policy should solve all three tasks with high scores because it follows the intended evidence path exactly. Frontier LLMs should also perform well on the easy and medium tasks and show larger variance on the hard fraud-escalation task if they over-index on issuing refunds instead of following policy.
## Project structure
```text
.
├── Dockerfile
├── README.md
├── inference.py
├── openenv.yaml
├── pyproject.toml
├── server/
│   └── app.py
└── tool_use_env/
    ├── client.py
    ├── grader.py
    ├── models.py
    ├── tasks.py
    ├── server/
    │   └── app.py
    └── tool_use_env_environment.py
```