---
title: Support Ops OpenEnv
emoji: π¦
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---
# Support Ops OpenEnv

`support_ops_env` is a real-world OpenEnv benchmark for customer support operations. The agent is not answering trivia or playing a game; it is working a realistic support queue where it must inspect business artifacts, look up operational policy, draft a customer reply, and submit a final case resolution.
## Why this environment
Modern tool-using agents often fail on operational workflows that require evidence gathering, policy compliance, and safe escalation. This environment targets that gap with deterministic tasks that resemble what ecommerce support, trust-and-safety, and operations agents do every day.
## Task set
The benchmark ships with three deterministic tasks and matching deterministic graders:
- `damaged-mug-replacement` (easy): Resolve a damaged-item replacement request.
- `duplicate-charge-refund` (medium): Investigate a duplicate billing complaint and refund the extra capture.
- `account-takeover-fraud` (hard): Handle a suspected account takeover with a security-first fraud escalation.
Each task has a fixed expected resolution, required evidence, and reply keywords. The grader returns a score in [0.0, 1.0] from weighted resolution accuracy, evidence coverage, reply quality, and efficiency.
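As an illustration of how those four weighted components can combine into a score in [0.0, 1.0], here is a minimal sketch. The weight values and function signature are assumptions for illustration, not the environment's actual grader constants:

```python
# Hypothetical weighted grader: resolution accuracy, evidence coverage,
# reply quality, and efficiency combined into a score in [0.0, 1.0].
# The 0.5/0.2/0.2/0.1 weights are illustrative assumptions.
def grade(resolution_correct: bool,
          evidence_found: int, evidence_required: int,
          reply_keywords_hit: int, reply_keywords_total: int,
          steps_used: int, max_steps: int) -> float:
    resolution = 1.0 if resolution_correct else 0.0
    evidence = evidence_found / evidence_required if evidence_required else 1.0
    reply = reply_keywords_hit / reply_keywords_total if reply_keywords_total else 1.0
    efficiency = 1.0 - (steps_used / max_steps)   # fewer steps, higher bonus
    score = 0.5 * resolution + 0.2 * evidence + 0.2 * reply + 0.1 * efficiency
    return max(0.0, min(1.0, score))              # clamp to [0.0, 1.0]
```

A perfect resolution with full evidence and reply coverage in few steps scores near 1.0; a wrong resolution is capped well below it regardless of evidence gathered.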
## Action space
The environment uses a typed `ToolUseAction` model with these actions:

- `review_ticket`
- `inspect_artifact`
- `search_policy`
- `draft_reply`
- `submit_resolution`

Optional fields on the action are `artifact_id`, `query`, `message`, and `resolution_code`.
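A minimal stdlib sketch of that shape (the real model in `tool_use_env/models.py` is likely a Pydantic class; this dataclass version only mirrors the field names and action set documented here):

```python
from dataclasses import dataclass
from typing import Optional

# Action names as documented in this README.
VALID_ACTIONS = {
    "review_ticket", "inspect_artifact", "search_policy",
    "draft_reply", "submit_resolution",
}

@dataclass
class ToolUseAction:
    action_type: str
    artifact_id: Optional[str] = None
    query: Optional[str] = None
    message: Optional[str] = None
    resolution_code: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject unknown action types up front, mirroring typed validation.
        if self.action_type not in VALID_ACTIONS:
            raise ValueError(f"unknown action_type: {self.action_type}")
```

For example, `ToolUseAction(action_type="inspect_artifact", artifact_id="order")` builds a valid inspection action, while an unknown `action_type` raises immediately.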
## Observation space
The typed `ToolUseObservation` includes:

- `task_id`, `difficulty`, `objective`
- `customer_message`
- `workspace_summary`
- `available_actions`
- `available_resolution_codes`
- `collected_evidence`
- `last_tool_result`
- `last_action_error`
- `remaining_steps`
- `current_score`

The typed `ToolUseState` exposes internal progress such as `final_score`, `drafted_reply`, `resolution_code`, `required_evidence`, `collected_evidence`, and action history.
## How to use
Each episode is a support case. The agent should usually follow this flow:
- Read the customer ticket.
- Inspect the relevant business artifacts.
- Look up the matching policy.
- Draft a customer-facing reply.
- Submit the final resolution code.
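The flow above can be sketched as a single episode loop. The `env` object is assumed to expose `reset()`/`step()` in the usual OpenEnv client style and to return dict-like results; the concrete client class and return types in this repo may differ:

```python
def run_episode(env):
    """Drive one support case end to end through the documented flow."""
    result = env.reset()  # start the episode
    steps = [
        {"action_type": "review_ticket"},                               # 1. read the ticket
        {"action_type": "inspect_artifact", "artifact_id": "payment"},  # 2. gather evidence
        {"action_type": "search_policy", "query": "duplicate_charge"},  # 3. check policy
        {"action_type": "draft_reply",                                  # 4. draft the reply
         "message": "We confirmed the duplicate charge and issued a refund."},
        {"action_type": "submit_resolution",                            # 5. close the case
         "resolution_code": "refund_duplicate_charge"},
    ]
    for action in steps:
        result = env.step(action)
        if result.get("done"):
            break
    return result
```

The artifact, query, and resolution values shown are the ones this README documents for the duplicate-charge task; other tasks swap in their own evidence path and resolution code.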
## What each action field means

- `action_type`: The operation you want the environment to perform.
- `artifact_id`: The internal record you want to inspect. Examples: `order`, `payment`, `account`, `risk_log`.
- `query`: The policy lookup term. Examples: `damaged_items`, `duplicate_charge`, `account_takeover`.
- `message`: The reply draft that would be sent to the customer.
- `resolution_code`: The final case outcome you want to submit. Examples: `send_replacement`, `refund_duplicate_charge`, `lock_account_and_escalate_fraud`.
## Typical action examples

Review the ticket:

```json
{
  "action_type": "review_ticket"
}
```

Inspect an order record:

```json
{
  "action_type": "inspect_artifact",
  "artifact_id": "order"
}
```

Look up a policy:

```json
{
  "action_type": "search_policy",
  "query": "duplicate_charge"
}
```

Save a reply draft:

```json
{
  "action_type": "draft_reply",
  "message": "We confirmed the duplicate charge and issued a refund. You should see it in 3-5 business days."
}
```

Submit the final resolution:

```json
{
  "action_type": "submit_resolution",
  "resolution_code": "refund_duplicate_charge"
}
```
## How the playground works
If /web is enabled, the playground lets you send one action at a time.
- Start with **Reset**.
- Enter the action fields for the next step.
- Use **Get state** to inspect internal progress.
- Keep stepping until you submit a final resolution or run out of steps.
The observation will show:
- which evidence you have already collected
- the last tool result
- any action validation error
- your current partial score
- how many steps remain
## Reward design
The reward is shaped over the full trajectory:
- Positive reward for first-time collection of relevant artifacts and policies
- Smaller reward for drafting a reply that includes required customer-facing details
- Very small or zero reward for repeated or invalid actions
- Final step reward equal to the deterministic grader score
This gives agents signal before the final submission while still anchoring the episode outcome to task completion quality.
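The shaping rules above can be sketched as a per-step reward function. The bonus magnitudes (0.1, 0.05) are illustrative assumptions, not the environment's actual constants:

```python
# Hypothetical per-step reward shaping following the documented rules.
def step_reward(action_type: str,
                is_new_evidence: bool,
                is_valid: bool,
                is_final: bool,
                grader_score: float) -> float:
    if not is_valid:
        return 0.0                  # invalid actions earn nothing
    if is_final:
        return grader_score         # final step anchored to the grader score
    if action_type in ("inspect_artifact", "search_policy") and is_new_evidence:
        return 0.1                  # first-time collection of a relevant artifact/policy
    if action_type == "draft_reply":
        return 0.05                 # smaller bonus for drafting a reply
    return 0.0                      # repeated actions earn nothing
```

Note how repeated evidence collection falls through to zero: only the first successful lookup of each artifact or policy pays out, which discourages reward farming by looping over the same tool.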
## Setup

### Local Python

```bash
UV_CACHE_DIR=/tmp/uv-cache uv sync
.venv/bin/pip install -e .
```

Run the server:

```bash
.venv/bin/python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Docker

```bash
docker build -t support-ops-openenv .
docker run --rm -p 8000:8000 support-ops-openenv
```
## Baseline inference

The required root `inference.py` uses the OpenAI client for model calls and emits the mandatory `[START]`, `[STEP]`, and `[END]` logs.
Environment variables:
- `HF_TOKEN` or `OPENAI_API_KEY`
- `API_BASE_URL`
- `MODEL_NAME`
- `LOCAL_IMAGE_NAME` if you want to run via `from_docker_image()`
- `ENV_BASE_URL` if you want to connect to a running server
Example:

```bash
export HF_TOKEN=...
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
python inference.py
```
The script evaluates all three tasks in a fixed order for reproducible scoring. If no API key is available, it falls back to a deterministic scripted policy so the benchmark remains runnable offline.
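A scripted fallback of that kind can be as simple as a fixed action sequence per task. The task ids match this README, but the exact scripts and the `next_action` helper are illustrative assumptions, not the code in `inference.py`:

```python
# Hypothetical deterministic fallback: one fixed action script per task id.
SCRIPTED_POLICIES = {
    "duplicate-charge-refund": [
        {"action_type": "review_ticket"},
        {"action_type": "inspect_artifact", "artifact_id": "payment"},
        {"action_type": "search_policy", "query": "duplicate_charge"},
        {"action_type": "draft_reply",
         "message": "We confirmed the duplicate charge and refunded the extra capture."},
        {"action_type": "submit_resolution",
         "resolution_code": "refund_duplicate_charge"},
    ],
    # damaged-mug-replacement and account-takeover-fraud follow the same shape
    # with their own artifacts, policy queries, and resolution codes.
}

def next_action(task_id: str, step_index: int) -> dict:
    """Return the scripted action for this step, or a safe no-op review."""
    script = SCRIPTED_POLICIES.get(task_id, [])
    if step_index < len(script):
        return script[step_index]
    return {"action_type": "review_ticket"}
```

Because each script is a literal list, the fallback is fully deterministic and needs no API key, which is what keeps the benchmark runnable offline.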
## Expected baseline behavior
The bundled fallback policy should solve all three tasks with high scores because it follows the intended evidence path exactly. Frontier LLMs should also perform well on the easy and medium tasks and show larger variance on the hard fraud-escalation task if they over-index on issuing refunds instead of following policy.
## Project structure

```text
.
├── Dockerfile
├── README.md
├── inference.py
├── openenv.yaml
├── pyproject.toml
├── server/
│   └── app.py
└── tool_use_env/
    ├── client.py
    ├── grader.py
    ├── models.py
    ├── tasks.py
    ├── server/
    │   └── app.py
    └── tool_use_env_environment.py
```