---
title: Support Ops OpenEnv
emoji: 📦
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Support Ops OpenEnv

`support_ops_env` is a real-world OpenEnv benchmark for customer support operations. The agent is not answering trivia or playing a game; it is working a realistic support queue where it must inspect business artifacts, look up operational policy, draft a customer reply, and submit a final case resolution.

## Why this environment

Modern tool-using agents often fail on operational workflows that require evidence gathering, policy compliance, and safe escalation. This environment targets that gap with deterministic tasks that resemble what ecommerce support, trust-and-safety, and operations agents do every day.

## Task set

The benchmark ships with three deterministic tasks and matching deterministic graders:

1. `damaged-mug-replacement` (`easy`)
   Resolve a damaged-item replacement request.
2. `duplicate-charge-refund` (`medium`)
   Investigate a duplicate billing complaint and refund the extra capture.
3. `account-takeover-fraud` (`hard`)
   Handle a suspected account takeover with a security-first fraud escalation.

Each task has a fixed expected resolution, required evidence, and reply keywords. The grader returns a score in `[0.0, 1.0]` from weighted resolution accuracy, evidence coverage, reply quality, and efficiency.

## Action space

The environment uses a typed `ToolUseAction` model with these actions:

- `review_ticket`
- `inspect_artifact`
- `search_policy`
- `draft_reply`
- `submit_resolution`

Optional fields on the action are `artifact_id`, `query`, `message`, and `resolution_code`.

## Observation space

The typed `ToolUseObservation` includes:

- `task_id`, `difficulty`, `objective`
- `customer_message`
- `workspace_summary`
- `available_actions`
- `available_resolution_codes`
- `collected_evidence`
- `last_tool_result`
- `last_action_error`
- `remaining_steps`
- `current_score`

The typed `ToolUseState` exposes internal progress such as `final_score`, `drafted_reply`, `resolution_code`, `required_evidence`, `collected_evidence`, and action history.

## How to use

Each episode is a support case. The agent should usually follow this flow:

1. Read the customer ticket.
2. Inspect the relevant business artifacts.
3. Look up the matching policy.
4. Draft a customer-facing reply.
5. Submit the final resolution code.

### What each action field means

- `action_type`
  The operation you want the environment to perform.
- `artifact_id`
  The internal record you want to inspect. Examples: `order`, `payment`, `account`, `risk_log`.
- `query`
  The policy lookup term. Examples: `damaged_items`, `duplicate_charge`, `account_takeover`.
- `message`
  The reply draft that would be sent to the customer.
- `resolution_code`
  The final case outcome you want to submit. Examples: `send_replacement`, `refund_duplicate_charge`, `lock_account_and_escalate_fraud`.

### Typical action examples

Review the ticket:

```json
{ "action_type": "review_ticket" }
```

Inspect an order record:

```json
{ "action_type": "inspect_artifact", "artifact_id": "order" }
```

Look up a policy:

```json
{ "action_type": "search_policy", "query": "duplicate_charge" }
```

Save a reply draft:

```json
{
  "action_type": "draft_reply",
  "message": "We confirmed the duplicate charge and issued a refund. You should see it in 3-5 business days."
}
```

Submit the final resolution:

```json
{ "action_type": "submit_resolution", "resolution_code": "refund_duplicate_charge" }
```
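The same actions can also be driven programmatically against a running server. The sketch below is illustrative only: the client class name `ToolUseEnv`, its constructor arguments, and the `.observation` / `.reward` / `.done` result fields are assumptions based on the usual OpenEnv client shape, so check `tool_use_env/client.py` and `tool_use_env/models.py` for the exact interface in this repo.

```python
# A minimal sketch of one episode driven from Python.
# Assumed names: ToolUseEnv, its base_url argument, and the result fields
# (.observation, .reward, .done); verify against tool_use_env/client.py.
from tool_use_env.client import ToolUseEnv
from tool_use_env.models import ToolUseAction

env = ToolUseEnv(base_url="http://localhost:8000")  # or ToolUseEnv.from_docker_image(...)

result = env.reset()
print(result.observation.objective)

# Follow the intended flow: ticket -> artifacts -> policy -> reply -> resolution.
actions = [
    ToolUseAction(action_type="review_ticket"),
    ToolUseAction(action_type="inspect_artifact", artifact_id="payment"),
    ToolUseAction(action_type="search_policy", query="duplicate_charge"),
    ToolUseAction(
        action_type="draft_reply",
        message="We confirmed the duplicate charge and issued a refund.",
    ),
    ToolUseAction(
        action_type="submit_resolution",
        resolution_code="refund_duplicate_charge",
    ),
]

for action in actions:
    result = env.step(action)
    print(result.observation.last_tool_result, result.reward, result.done)
```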
### How the playground works

If `/web` is enabled, the playground lets you send one action at a time.

- Start with `Reset`.
- Enter the action fields for the next step.
- Use `Get state` to inspect internal progress.
- Keep stepping until you submit a final resolution or run out of steps.

The observation will show:

- which evidence you have already collected
- the last tool result
- any action validation error
- your current partial score
- how many steps remain

## Reward design

The reward is shaped over the full trajectory:

- Positive reward for first-time collection of relevant artifacts and policies
- Smaller reward for drafting a reply that includes required customer-facing details
- Very small or zero reward for repeated or invalid actions
- Final step reward equal to the deterministic grader score

This gives agents signal before the final submission while still anchoring the episode outcome to task completion quality.

## Setup

### Local Python

```bash
UV_CACHE_DIR=/tmp/uv-cache uv sync
.venv/bin/pip install -e .
```

### Run the server

```bash
.venv/bin/python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Docker

```bash
docker build -t support-ops-openenv .
docker run --rm -p 8000:8000 support-ops-openenv
```

## Baseline inference

The required root `inference.py` uses the OpenAI client for model calls and emits the mandatory `[START]`, `[STEP]`, and `[END]` logs.

Environment variables:

- `HF_TOKEN` or `OPENAI_API_KEY`
- `API_BASE_URL`
- `MODEL_NAME`
- `LOCAL_IMAGE_NAME` if you want to run via `from_docker_image()`
- `ENV_BASE_URL` if you want to connect to a running server

Example:

```bash
export HF_TOKEN=...
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
python inference.py
```

The script evaluates all three tasks in a fixed order for reproducible scoring. If no API key is available, it falls back to a deterministic scripted policy so the benchmark remains runnable offline.

## Expected baseline behavior

The bundled fallback policy should solve all three tasks with high scores because it follows the intended evidence path exactly. Frontier LLMs should also perform well on the easy and medium tasks; expect larger variance on the hard fraud-escalation task when models over-index on issuing refunds instead of following policy.

## Project structure

```text
.
├── Dockerfile
├── README.md
├── inference.py
├── openenv.yaml
├── pyproject.toml
├── server/
│   └── app.py
└── tool_use_env/
    ├── client.py
    ├── grader.py
    ├── models.py
    ├── tasks.py
    └── server/
        ├── app.py
        └── tool_use_env_environment.py
```
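For a fully offline end-to-end check, one option is to build the image and let `inference.py` start the environment via `from_docker_image()` by setting `LOCAL_IMAGE_NAME`; with no API key set, the deterministic scripted policy takes over. This is a sketch that assumes `inference.py` prefers the Docker path when `ENV_BASE_URL` is not set, so verify the selection logic in `inference.py`.

```bash
# Offline smoke run: no API key, so the deterministic fallback policy is used.
# Assumes inference.py launches the environment from LOCAL_IMAGE_NAME when
# ENV_BASE_URL is not set; check inference.py for the exact behavior.
docker build -t support-ops-openenv .
export LOCAL_IMAGE_NAME=support-ops-openenv
python inference.py
```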