---
title: Support Ops OpenEnv
emoji: π¦
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---
# Support Ops OpenEnv

`support_ops_env` is a real-world OpenEnv benchmark for customer support operations. The agent is not answering trivia or playing a game; it is working a realistic support queue where it must inspect business artifacts, look up operational policy, draft a customer reply, and submit a final case resolution.
## Why this environment
Modern tool-using agents often fail on operational workflows that require evidence gathering, policy compliance, and safe escalation. This environment targets that gap with deterministic tasks that resemble what ecommerce support, trust-and-safety, and operations agents do every day.
## Task set
The benchmark ships with three deterministic tasks and matching deterministic graders:
- `damaged-mug-replacement` (easy): Resolve a damaged-item replacement request.
- `duplicate-charge-refund` (medium): Investigate a duplicate billing complaint and refund the extra capture.
- `account-takeover-fraud` (hard): Handle a suspected account takeover with a security-first fraud escalation.
Each task has a fixed expected resolution, required evidence, and reply keywords. The grader returns a score in [0.0, 1.0] from weighted resolution accuracy, evidence coverage, reply quality, and efficiency.
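As an illustration of how those four weighted components can combine into a score in [0.0, 1.0], here is a minimal sketch. The weight values and function signature are assumptions for illustration, not the environment's actual grader constants:

```python
# Hypothetical weighted grader: resolution accuracy, evidence coverage,
# reply quality, and efficiency combined into a score in [0.0, 1.0].
# The 0.5/0.2/0.2/0.1 weights are illustrative assumptions.
def grade(resolution_correct: bool,
          evidence_found: int, evidence_required: int,
          reply_keywords_hit: int, reply_keywords_total: int,
          steps_used: int, max_steps: int) -> float:
    resolution = 1.0 if resolution_correct else 0.0
    evidence = evidence_found / evidence_required if evidence_required else 1.0
    reply = reply_keywords_hit / reply_keywords_total if reply_keywords_total else 1.0
    efficiency = 1.0 - (steps_used / max_steps)   # fewer steps, higher bonus
    score = 0.5 * resolution + 0.2 * evidence + 0.2 * reply + 0.1 * efficiency
    return max(0.0, min(1.0, score))              # clamp to [0.0, 1.0]
```

A perfect resolution with full evidence and reply coverage in few steps scores near 1.0; a wrong resolution is capped well below it regardless of evidence gathered.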
## Action space
The environment uses a typed `ToolUseAction` model with these actions:

- `review_ticket`
- `inspect_artifact`
- `search_policy`
- `draft_reply`
- `submit_resolution`

Optional fields on the action are `artifact_id`, `query`, `message`, and `resolution_code`.
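A minimal stdlib sketch of that shape (the real model in `tool_use_env/models.py` is likely a Pydantic class; this dataclass version only mirrors the field names and action set documented here):

```python
from dataclasses import dataclass
from typing import Optional

# Action names as documented in this README.
VALID_ACTIONS = {
    "review_ticket", "inspect_artifact", "search_policy",
    "draft_reply", "submit_resolution",
}

@dataclass
class ToolUseAction:
    action_type: str
    artifact_id: Optional[str] = None
    query: Optional[str] = None
    message: Optional[str] = None
    resolution_code: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject unknown action types up front, mirroring typed validation.
        if self.action_type not in VALID_ACTIONS:
            raise ValueError(f"unknown action_type: {self.action_type}")
```

For example, `ToolUseAction(action_type="inspect_artifact", artifact_id="order")` builds a valid inspection action, while an unknown `action_type` raises immediately.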
## Observation space
The typed `ToolUseObservation` includes:

- `task_id`, `difficulty`, `objective`
- `customer_message`
- `workspace_summary`
- `available_actions`
- `available_resolution_codes`
- `collected_evidence`
- `last_tool_result`
- `last_action_error`
- `remaining_steps`
- `current_score`

The typed `ToolUseState` exposes internal progress such as `final_score`, `drafted_reply`, `resolution_code`, `required_evidence`, `collected_evidence`, and action history.
## How to use
Each episode is a support case. The agent should usually follow this flow:
- Read the customer ticket.
- Inspect the relevant business artifacts.
- Look up the matching policy.
- Draft a customer-facing reply.
- Submit the final resolution code.
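The flow above can be sketched as a single episode loop. The `env` object is assumed to expose `reset()`/`step()` in the usual OpenEnv client style and to return dict-like results; the concrete client class and return types in this repo may differ:

```python
def run_episode(env):
    """Drive one support case end to end through the documented flow."""
    result = env.reset()  # start the episode
    steps = [
        {"action_type": "review_ticket"},                               # 1. read the ticket
        {"action_type": "inspect_artifact", "artifact_id": "payment"},  # 2. gather evidence
        {"action_type": "search_policy", "query": "duplicate_charge"},  # 3. check policy
        {"action_type": "draft_reply",                                  # 4. draft the reply
         "message": "We confirmed the duplicate charge and issued a refund."},
        {"action_type": "submit_resolution",                            # 5. close the case
         "resolution_code": "refund_duplicate_charge"},
    ]
    for action in steps:
        result = env.step(action)
        if result.get("done"):
            break
    return result
```

The artifact, query, and resolution values shown are the ones this README documents for the duplicate-charge task; other tasks swap in their own evidence path and resolution code.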
## What each action field means

- `action_type`: The operation you want the environment to perform.
- `artifact_id`: The internal record you want to inspect. Examples: `order`, `payment`, `account`, `risk_log`.
- `query`: The policy lookup term. Examples: `damaged_items`, `duplicate_charge`, `account_takeover`.
- `message`: The reply draft that would be sent to the customer.
- `resolution_code`: The final case outcome you want to submit. Examples: `send_replacement`, `refund_duplicate_charge`, `lock_account_and_escalate_fraud`.
## Typical action examples

Review the ticket:

```json
{
  "action_type": "review_ticket"
}
```

Inspect an order record:

```json
{
  "action_type": "inspect_artifact",
  "artifact_id": "order"
}
```

Look up a policy:

```json
{
  "action_type": "search_policy",
  "query": "duplicate_charge"
}
```

Save a reply draft:

```json
{
  "action_type": "draft_reply",
  "message": "We confirmed the duplicate charge and issued a refund. You should see it in 3-5 business days."
}
```

Submit the final resolution:

```json
{
  "action_type": "submit_resolution",
  "resolution_code": "refund_duplicate_charge"
}
```
## How the playground works
If /web is enabled, the playground lets you send one action at a time.
- Start with **Reset**.
- Enter the action fields for the next step.
- Use **Get state** to inspect internal progress.
- Keep stepping until you submit a final resolution or run out of steps.
The observation will show:
- which evidence you have already collected
- the last tool result
- any action validation error
- your current partial score
- how many steps remain
## Reward design
The reward is shaped over the full trajectory:
- Positive reward for first-time collection of relevant artifacts and policies
- Smaller reward for drafting a reply that includes required customer-facing details
- Very small or zero reward for repeated or invalid actions
- Final step reward equal to the deterministic grader score
This gives agents signal before the final submission while still anchoring the episode outcome to task completion quality.
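The shaping rules above can be sketched as a per-step reward function. The bonus magnitudes (0.1, 0.05) are illustrative assumptions, not the environment's actual constants:

```python
# Hypothetical per-step reward shaping following the documented rules.
def step_reward(action_type: str,
                is_new_evidence: bool,
                is_valid: bool,
                is_final: bool,
                grader_score: float) -> float:
    if not is_valid:
        return 0.0                  # invalid actions earn nothing
    if is_final:
        return grader_score         # final step anchored to the grader score
    if action_type in ("inspect_artifact", "search_policy") and is_new_evidence:
        return 0.1                  # first-time collection of a relevant artifact/policy
    if action_type == "draft_reply":
        return 0.05                 # smaller bonus for drafting a reply
    return 0.0                      # repeated actions earn nothing
```

Note how repeated evidence collection falls through to zero: only the first successful lookup of each artifact or policy pays out, which discourages reward farming by looping over the same tool.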
## Setup

### Local Python

```bash
UV_CACHE_DIR=/tmp/uv-cache uv sync
.venv/bin/pip install -e .
```

Run the server:

```bash
.venv/bin/python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Docker

```bash
docker build -t support-ops-openenv .
docker run --rm -p 8000:8000 support-ops-openenv
```
## Baseline inference

The required root `inference.py` uses the OpenAI client for model calls and emits the mandatory `[START]`, `[STEP]`, and `[END]` logs.
Environment variables:
- `HF_TOKEN` or `OPENAI_API_KEY`
- `API_BASE_URL`
- `MODEL_NAME`
- `LOCAL_IMAGE_NAME` if you want to run via `from_docker_image()`
- `ENV_BASE_URL` if you want to connect to a running server
Example:

```bash
export HF_TOKEN=...
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
python inference.py
```
The script evaluates all three tasks in a fixed order for reproducible scoring. If no API key is available, it falls back to a deterministic scripted policy so the benchmark remains runnable offline.
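A scripted fallback of that kind can be as simple as a fixed action sequence per task. The task ids match this README, but the exact scripts and the `next_action` helper are illustrative assumptions, not the code in `inference.py`:

```python
# Hypothetical deterministic fallback: one fixed action script per task id.
SCRIPTED_POLICIES = {
    "duplicate-charge-refund": [
        {"action_type": "review_ticket"},
        {"action_type": "inspect_artifact", "artifact_id": "payment"},
        {"action_type": "search_policy", "query": "duplicate_charge"},
        {"action_type": "draft_reply",
         "message": "We confirmed the duplicate charge and refunded the extra capture."},
        {"action_type": "submit_resolution",
         "resolution_code": "refund_duplicate_charge"},
    ],
    # damaged-mug-replacement and account-takeover-fraud follow the same shape
    # with their own artifacts, policy queries, and resolution codes.
}

def next_action(task_id: str, step_index: int) -> dict:
    """Return the scripted action for this step, or a safe no-op review."""
    script = SCRIPTED_POLICIES.get(task_id, [])
    if step_index < len(script):
        return script[step_index]
    return {"action_type": "review_ticket"}
```

Because each script is a literal list, the fallback is fully deterministic and needs no API key, which is what keeps the benchmark runnable offline.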
## Expected baseline behavior
The bundled fallback policy should solve all three tasks with high scores because it follows the intended evidence path exactly. Frontier LLMs should also perform well on the easy and medium tasks and show larger variance on the hard fraud-escalation task if they over-index on issuing refunds instead of following policy.
## Project structure

```text
.
├── Dockerfile
├── README.md
├── inference.py
├── openenv.yaml
├── pyproject.toml
├── server/
│   └── app.py
└── tool_use_env/
    ├── client.py
    ├── grader.py
    ├── models.py
    ├── tasks.py
    ├── server/
    │   └── app.py
    └── tool_use_env_environment.py
```