---
title: HyperBrickCaseOps
sdk: docker
app_port: 8000
tags:
  - openenv
  - reinforcement-learning
  - customer-support
base_path: /web
---

# HyperBrickCaseOps

HyperBrickCaseOps is an OpenEnv environment for enterprise support operations. The agent gets a real support ticket, a few policy snippets, and the current case state. From there it has to do the same kind of work a human support or operations teammate would do: route the case, set urgency, ask for missing details, write the customer reply, leave an internal note, and decide whether the case should stay open, be resolved, or be escalated.

The main idea is simple: good support work is not just writing a polite reply. It also means making the right operational decision.

## Agent quickstart

If you are a generic agent being evaluated on this environment, the safest default strategy is:

1. Read `objective`, `ticket`, `knowledge_base`, `workflow_stage`, and `required_next_actions`.
2. Classify the case first by setting `queue`, `priority`, and `issue_type`.
3. If the task requires missing details, use `request_info` before drafting a final answer.
4. If customer follow-up is pending, use `wait` before assuming the missing fields arrived.
5. Draft the customer-facing reply only after the routing and verification logic are correct.
6. Add the internal note before final submission.
7. Use `submit` only when the workflow really is complete.

High-level rules:

- primary issue first, secondary concerns second
- safe workflow over fast workflow
- do not resolve or unlock cases early just because the customer sounds urgent

## Agent playbook

The environment is easiest to solve if the agent follows this action order:

- `classify`
- `request_info` if `required_next_actions` includes it
- `wait` if customer follow-up is pending
- `draft_reply`
- `add_internal_note`
- `submit`
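The action order above can be sketched as a small heuristic policy. This is an illustrative sketch only: it assumes the observation has been serialized to a plain dict with the fields described in this README, and the function name is not part of the environment API.

```python
def choose_operation(obs: dict) -> str:
    """Pick the next operation from an observation dict.

    Illustrative sketch: field names follow this README, and the
    priority order mirrors the playbook (classify -> request_info
    -> wait -> draft_reply -> add_internal_note -> submit).
    """
    required = obs.get("required_next_actions", [])
    case = obs.get("case", {})

    if "classify" in required:
        return "classify"
    if "request_info" in required:
        return "request_info"
    # Never assume missing fields arrived; wait out pending follow-ups.
    if case.get("customer_follow_up") == "pending":
        return "wait"
    if "draft_reply" in required:
        return "draft_reply"
    if "add_internal_note" in required:
        return "add_internal_note"
    if not required:
        return "submit"
    return required[0]
```

The key design choice is that `submit` is only reached once `required_next_actions` is empty, which matches the "do not submit early" failure mode described below.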

Common failure modes:

- asking for unnecessary information on the easy billing task
- resolving a security or compliance case before required verification is complete
- routing the task based on a distracting secondary issue instead of the primary issue
- using `submit` while `required_next_actions` is still non-empty

Quick routing guide:

- duplicate charge after cancellation -> `billing_ops`, high, `duplicate_charge`
- suspicious login / locked out -> `trust_and_safety`, urgent, `account_compromise`
- production 500s / outage -> `platform_engineering`, urgent, `production_incident`
- export restriction / policy bypass request -> `compliance_ops`, high, `regulated_exception`
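The routing guide can be expressed as a lookup table. In this sketch the dictionary keys are informal case descriptions chosen for illustration; only the queue, priority, and issue-type values come from the guide above.

```python
# Routing table: informal case kind -> (queue, priority, issue_type).
# The keys are illustrative labels, not identifiers from the environment.
ROUTING_GUIDE = {
    "duplicate_charge_after_cancellation": ("billing_ops", "high", "duplicate_charge"),
    "suspicious_login_locked_out": ("trust_and_safety", "urgent", "account_compromise"),
    "production_500s_outage": ("platform_engineering", "urgent", "production_incident"),
    "export_restriction_bypass_request": ("compliance_ops", "high", "regulated_exception"),
}


def classify_action(case_kind: str) -> dict:
    """Build a classify action payload from the routing table."""
    queue, priority, issue_type = ROUTING_GUIDE[case_kind]
    return {
        "operation": "classify",
        "queue": queue,
        "priority": priority,
        "issue_type": issue_type,
    }
```

For example, `classify_action("production_500s_outage")` produces a classify payload routed to `platform_engineering` with urgent priority.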

## Environment description and motivation

This environment was built around a gap that shows up in a lot of support benchmarks. Many benchmarks check whether a model can produce a plausible response, but real support work also needs correct routing, escalation, information gathering, and final case handling.

HyperBrickCaseOps is meant to test that full workflow.

It is not a toy game and it is not a chat-only task. The cases include things like:

- SLA pressure
- affected user counts
- customer tier
- secondary concerns that should not distract the agent from the main issue
- delayed customer follow-up turns
- unsafe requests that should not be approved just because the customer sounds urgent

## OpenEnv interface

The environment uses the standard OpenEnv flow:

- `reset()` starts a new case and returns the first observation
- `step(action)` applies one typed action and returns the next observation
- `state()` returns the current typed internal state

The metadata is defined in `openenv.yaml`, and the HTTP app is created through `create_app(...)`.
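The reset/step/state flow looks like the loop below. The environment class here is a minimal local stub standing in for the real server, so the example is self-contained; the real environment returns typed observations rather than plain dicts.

```python
class StubSupportDeskEnv:
    """Minimal stand-in for the real environment, used only to
    illustrate the reset -> step -> state flow described above."""

    def __init__(self, max_steps: int = 3):
        self.max_steps = max_steps
        self.step_count = 0

    def reset(self) -> dict:
        self.step_count = 0
        return {"workflow_stage": "classify", "done": False, "reward": 0.0}

    def step(self, action: dict) -> dict:
        self.step_count += 1
        done = self.step_count >= self.max_steps
        return {
            "workflow_stage": "submit" if done else "in_progress",
            "done": done,
            "reward": 0.1,
        }

    def state(self) -> dict:
        return {"step_count": self.step_count,
                "done": self.step_count >= self.max_steps}


# Standard episode loop: reset once, then step until done.
env = StubSupportDeskEnv()
obs = env.reset()
while not obs["done"]:
    obs = env.step({"operation": "classify"})
```

An agent would replace the hardcoded `{"operation": "classify"}` with an action chosen from the current observation.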

## Action space

Each step takes a typed `SupportDeskAction`.

Fields:

- `operation`
- `queue`
- `priority`
- `issue_type`
- `status`
- `resolution_code`
- `requested_fields`
- `reply`
- `internal_note`

Supported operations:

- `classify`: sets `queue`, `priority`, and `issue_type`
- `request_info`: requests missing fields from the customer
- `draft_reply`: writes the customer-facing reply
- `add_internal_note`: writes the internal note for handoff or auditability
- `submit`: sets the final `status` and `resolution_code`
- `wait`: advances the environment when a customer follow-up is pending

Example action:

```json
{
  "operation": "classify",
  "queue": "trust_and_safety",
  "priority": "urgent",
  "issue_type": "account_compromise",
  "status": null,
  "resolution_code": null,
  "requested_fields": [],
  "reply": null,
  "internal_note": null
}
```

## Observation space

Each observation is a typed `SupportDeskObservation`.

Main fields:

- `task_id`
- `difficulty`
- `objective`
- `ticket`
- `knowledge_base`
- `available_queues`
- `available_priorities`
- `available_statuses`
- `available_issue_types`
- `case`
- `current_sla_minutes_remaining`
- `workflow_stage`
- `required_next_actions`
- `risk_flags`
- `action_history`
- `feedback`
- `remaining_steps`
- `reward`
- `done`

The `case` object is the mutable operational state. It contains:

- current queue, priority, and issue type
- requested fields
- reply draft
- internal note
- final status and resolution code
- customer follow-up state

Customer follow-up can move through:

- `none`
- `pending`
- `partial`
- `complete`
- `incorrect`

The observation is designed to help the agent reason about process, not just text:

- `workflow_stage` shows whether the agent is still classifying, waiting on a reply, drafting communication, or ready to submit
- `required_next_actions` tells the agent which steps are still missing
- `risk_flags` surfaces urgency and safety issues like SLA risk, unsafe unlock pressure, and irrelevant customer follow-up

## State space

`state()` returns the typed `SupportDeskState`.

Main fields:

- `episode_id`
- `task_id`
- `difficulty`
- `step_count`
- `reward`
- `done`
- `current_score`
- `max_steps`
- `case`
- `current_sla_minutes_remaining`
- `workflow_stage`
- `required_next_actions`
- `risk_flags`
- `action_history`
- `completed_milestones`
- `last_feedback`

## Task descriptions

There are four deterministic tasks in a fixed order.

### 1. billing_refund_easy

Difficulty: easy

A customer was charged twice after cancellation. The right workflow is to route the case to billing, confirm the refund path, leave a useful note, and resolve the case without asking for unnecessary extra information.

Best action pattern:

- classify to billing first
- do not request extra fields
- confirm refund timing in the reply
- add a note that the duplicate charge was verified
- resolve the case with the refund resolution code
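The pattern above maps onto a four-action episode. This is an illustrative sketch only: the `resolution_code` value, the reply wording, and the note text are assumptions, not the graded answers, and the real environment takes typed actions rather than plain dicts.

```python
# Illustrative action sequence for billing_refund_easy.
# Field values (resolution code, reply text, note text) are assumed,
# not taken from the graders.
BILLING_REFUND_ACTIONS = [
    {"operation": "classify", "queue": "billing_ops",
     "priority": "high", "issue_type": "duplicate_charge"},
    {"operation": "draft_reply",
     "reply": "We confirmed the duplicate charge and issued a refund; "
              "it should appear within 5-10 business days."},
    {"operation": "add_internal_note",
     "internal_note": "Duplicate charge after cancellation verified; refund issued."},
    {"operation": "submit", "status": "resolved",
     "resolution_code": "refund_issued"},
]
```

Note what is absent: there is no `request_info` step, because asking for unnecessary information on this easy task is one of the listed failure modes.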

### 2. account_takeover_medium

Difficulty: medium

This is a suspicious-login recovery case. The agent has to route it to trust and safety, request verification details, handle a delayed partial follow-up from the customer, and keep the case open until the missing information is provided. Unlocking the account immediately would be unsafe.

Best action pattern:

- classify to trust and safety with urgent priority
- request `workspace_id`, `last_successful_login`, and `billing_email`
- wait for the partial follow-up
- reply with safe security steps
- keep the case open with `waiting_on_customer`

### 3. api_incident_hard

Difficulty: hard

This task simulates a live enterprise API incident. The ticket includes a secondary compliance concern, but the primary issue is the outage. The agent needs to escalate to engineering, request the right diagnostics, communicate clearly, and keep the incident open rather than marking it resolved.

Best action pattern:

- classify to platform engineering with urgent priority
- request `request_ids`, `timestamp_utc`, and `region`
- make clear that engineering is engaged
- do not resolve the case
- submit as an open incident / escalated case

### 4. regulated_export_exception_hard

Difficulty: hard

This is a regulated exception request. The customer wants a shortcut around an export restriction, but the correct workflow is to route the case to compliance, request legal approval details, and keep the case open pending review. Sending it straight to engineering for a workaround is the wrong move.

Best action pattern:

- classify to compliance operations
- request `tenant_region`, `dpa_amendment_id`, and `legal_contact_email`
- explicitly say no temporary bypass can be granted yet
- keep the case open pending legal/compliance review

## Reward and grader design

Each task has a deterministic grader that returns a score in (0.01, 0.99) for submission compatibility.

The grader checks:

- queue correctness
- priority correctness
- issue type correctness
- requested fields
- reply coverage
- internal note coverage
- final status
- resolution code

The environment uses the grader score delta as the main dense reward signal. On top of that, it adds smaller process-aware bonuses and penalties so that the full trajectory matters, not just the final snapshot.

Important:

- step rewards may go slightly negative when the agent makes a clearly suboptimal or unsafe move
- final deterministic grader outputs are clamped strictly inside (0.01, 0.99)
- `inference.py` also clamps the final emitted submission score to (0.01, 0.99)

Examples:

- bonus for early correct routing on urgent tasks
- bonus for moving through the workflow in the right order
- bonus when `wait` correctly reveals a scripted customer follow-up
- penalty for premature `submit`
- penalty for over-escalation
- penalty for mixed or sloppy actions
- penalty when the SLA gets critically low
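The score clamping described above amounts to a one-liner. This is a sketch, not the code in `graders.py` or `inference.py`: the function name and default bounds are illustrative, with the bounds taken from the interval quoted in this README.

```python
def clamp_score(score: float, low: float = 0.01, high: float = 0.99) -> float:
    """Clamp a grader score into the submission-compatible range.

    Illustrative helper: keeps any raw score within [low, high] so
    out-of-range grader outputs cannot leak into the final submission.
    """
    return max(low, min(high, score))
```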

## Project layout

```
.
|-- inference.py
|-- openenv.yaml
|-- pyproject.toml
|-- Dockerfile
|-- uv.lock
|-- __init__.py
|-- client.py
|-- graders.py
|-- models.py
|-- openenv_compat.py
|-- policies.py
|-- tasks.py
|-- server
|   |-- __init__.py
|   |-- app.py
|   `-- supportdesk_environment.py
|-- tests
|   `-- test_supportdesk.py
`-- examples
    `-- rl
        `-- train_q_agent.py
```

## Setup instructions

Option 1: pip

```bash
pip install -r requirements.txt
```

Option 2: uv

```bash
uv sync
```

## Usage instructions

Validate the repo:

```bash
python -m openenv.cli validate .
```

Start the local server:

```bash
python -m server.app
```

Or use the entrypoint:

```bash
server
```

Run the baseline:

```bash
python inference.py
```

There is also a small local RL example:

```bash
python examples/rl/train_q_agent.py
```

## Baseline and environment variables

`inference.py` uses the OpenAI Python client when model configuration is supplied externally at runtime.

Supported variables:

- `API_BASE_URL`
- `MODEL_NAME`
- `HF_TOKEN`
- `OPENAI_API_KEY`
- `MAX_STEPS`
- `TEMPERATURE`

Example:

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your-token-here"
python inference.py
```

Important:

- the repo does not depend on hardcoded credentials
- the expected evaluation setup is environment-variable driven
- if credentials are missing or the model call fails, the baseline falls back to a deterministic heuristic policy so the script still completes

## Docker

Build:

```bash
docker build -t supportdesk-env .
```

Run:

```bash
docker run -p 8000:8000 supportdesk-env
```

## Hugging Face Space deployment

This repo is meant to run as a Docker Space. Keep both the GitHub repository and the Hugging Face Space public for submission.

If you have the OpenEnv CLI installed, a typical deployment command is:

```bash
openenv push --repo-id your-username/HyperBrickCaseOps
```

## Validation

Local validation:

```bash
openenv validate .
```

Validation against a running environment:

```bash
openenv validate http://127.0.0.1:8000
```

Pre-submission script:

```bash
./scripts/validate-submission.sh https://your-space.hf.space .
```

## Submission checklist

- real-world environment, not a toy or game
- typed OpenEnv action, observation, and state models
- working `reset`, `step`, and `state`
- at least 3 tasks with deterministic graders
- meaningful reward over the trajectory
- root `inference.py`
- working `Dockerfile`
- `openenv.yaml` present
- README includes environment description, motivation, action space, observation space, task descriptions, setup instructions, and baseline scores

## Baseline scores

Current deterministic fallback baseline:

- `billing_refund_easy`: 0.99
- `account_takeover_medium`: 0.99
- `api_incident_hard`: 0.99
- `regulated_export_exception_hard`: 0.99
- average: 0.99

These scores are intentionally reproducible. The fallback policy exists to show that the environment, reward shaping, and graders all work end to end. Model-backed runs can be lower, which is useful for evaluation.