---
title: HyperBrickCaseOps
sdk: docker
app_port: 8000
tags:
  - openenv
  - reinforcement-learning
  - customer-support
base_path: /web
---

# HyperBrickCaseOps

HyperBrickCaseOps, shipped as the `supportdesk_env` package and referred to as SupportDesk below, is best thought of as an enterprise operations-desk environment, not a generic support classifier.

SupportDesk is a real-world RL environment for enterprise support operations. The agent receives a realistic inbound ticket, a small internal knowledge base, and the live case state. It must route the case, set the right priority, decide whether to request more information, draft the customer response, add an internal note, and submit the case with the correct final status.

**One-sentence summary:** HyperBrickCaseOps is a deterministic OpenEnv customer-support operations environment that evaluates whether an agent can triage, communicate, escalate, and resolve enterprise cases correctly end to end.

This environment is intentionally built around work humans actually do every day in B2B SaaS support queues. It is not a toy chat task and it is not a game. The environment includes enterprise mechanics such as SLA countdowns, business-impact context, and distracting secondary concerns, so the agent has to prioritize the primary operational issue instead of just pattern-matching keywords.

## Environment Description and Motivation

The goal of this environment is to model a real operational gap in agent evaluation: many support benchmarks only test whether a model can produce a plausible reply, but real support work also requires correct routing, escalation, information gathering, and final disposition decisions. SupportDesk is designed to evaluate whether an agent can handle enterprise support operations end to end rather than just generate support-sounding text.

This makes the environment useful for both:

- training agents to improve multi-step support operations behavior
- evaluating whether an agent can make safe and business-correct support decisions under pressure

## Why this should score well

- Real-world utility: customer support triage is a real production workflow with immediate evaluation value.
- Deterministic grading: every task has an explicit gold queue, priority, issue type, required follow-up fields, reply markers, note markers, status, and resolution code.
- Dense rewards: each step is rewarded with the delta in the deterministic grader's score, which provides a partial-progress signal rather than only a binary terminal reward.
- Reproducible baseline: `inference.py` runs all tasks in a fixed order and falls back to a deterministic heuristic policy if model credentials are unavailable.
- Novel mechanics: observations expose SLA pressure, business impact, and secondary concerns, which makes the environment closer to an enterprise operations desk than a plain support classifier.

## Architecture Diagram

```text
Inbound Task Spec + Ticket + KB
            |
            v
  SupportDeskEnvironment
  - reset()
  - step(action)
  - state()
            |
            +--> SupportDeskObservation
            +--> dense reward shaping
            +--> episode termination
            |
            v
     Deterministic Grader
     - queue correctness
     - priority correctness
     - issue type correctness
     - requested fields
     - reply coverage
     - internal note coverage
     - status / resolution
            |
            v
   Baseline in inference.py
   - OpenAI-compatible client path
   - deterministic fallback path
```

## Why this is more novel than a standard support benchmark

- It is not just routing or intent classification. The agent has to combine queueing, urgency, customer communication, internal notes, and final disposition in one trajectory.
- It models primary-vs-secondary issue prioritization. The hardest task includes a tempting compliance side-question that should not override the live outage.
- It encodes enterprise pressure directly in the observation through SLA countdowns, affected-user counts, and business-impact context.
- It evaluates operational judgment, not just answer quality. A polished reply with the wrong queue, wrong escalation choice, or premature resolution still scores poorly.
- It is built specifically for OpenEnv-style agent learning and evaluation, where the same environment can be used for baseline runs, external agents, and RL experiments.

## Action Space

Each `step()` call takes a typed `SupportDeskAction` with:

- `operation`: one of `classify`, `request_info`, `draft_reply`, `add_internal_note`, `submit`
- `queue`
- `priority`
- `issue_type`
- `status`
- `resolution_code`
- `requested_fields`
- `reply`
- `internal_note`

The environment allows the agent to update multiple fields in one structured action, which keeps the workflow realistic and helps training.
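
For illustration, a single structured action might look like the sketch below. The import path follows the project layout (`supportdesk_env/models.py`), but the exact field types and enum values are assumptions; check `models.py` for the authoritative definitions.

```python
# Hypothetical example action; the exact enum strings and optional fields are
# defined in supportdesk_env/models.py and may differ from this sketch.
from supportdesk_env.models import SupportDeskAction  # assumed import path

action = SupportDeskAction(
    operation="classify",           # one of the five operations listed above
    queue="billing",                # must be one of the observation's allowed queues
    priority="high",
    issue_type="duplicate_charge",  # illustrative value
)
```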

## Observation Space

Each observation contains:

- `task_id`, `difficulty`, and the agent objective
- the inbound ticket
- ticket-level urgency metadata such as `affected_users`, `sla_minutes_remaining`, `business_impact`, and `secondary_concerns`
- `knowledge_base` policy snippets
- allowed queues, priorities, statuses, and issue types
- the mutable case snapshot
- `action_history`
- `feedback`
- `remaining_steps`
- the standard OpenEnv `reward` and `done`
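
As a rough sketch of how an agent might read these fields (attribute names follow the list above; whether the urgency metadata hangs off the ticket object or the observation root is an assumption here):

```python
# Sketch only: attribute access assumes the typed observation exposes the
# fields listed above directly; verify names in supportdesk_env/models.py.
obs = env.reset()  # `env` is an environment or client instance (see below)

print(obs.task_id, obs.difficulty)
print(obs.ticket)                   # the inbound ticket
print(obs.sla_minutes_remaining)    # urgency metadata (placement assumed)
print(obs.remaining_steps, obs.reward, obs.done)
```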

## OpenEnv Interface

The environment implements the standard OpenEnv API:

- `reset()` returns the initial typed observation for a new case
- `step(action)` returns the next typed observation together with reward and done status
- `state()` returns the current typed environment state
- `openenv.yaml` provides environment metadata used by validators and deployment tooling

The implementation uses typed Pydantic models for action, observation, and state.
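
Putting these together, a minimal episode loop against a running server might look like the sketch below. The client class name and constructor are assumptions; `supportdesk_env/client.py` holds the real entry point.

```python
# Minimal episode loop sketch; the class name and constructor are assumptions.
from supportdesk_env.client import SupportDeskClient  # assumed entry point

env = SupportDeskClient(base_url="http://localhost:8000")  # assumed constructor

obs = env.reset()               # initial typed observation for a new case
while not obs.done:
    action = my_policy(obs)     # your agent's decision logic (hypothetical)
    obs = env.step(action)      # next typed observation, with reward and done
    print(obs.reward, obs.feedback)
```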

## Task Descriptions with Expected Difficulty

1. `billing_refund_easy` (expected difficulty: easy). Duplicate-charge billing ticket. The correct path is immediate billing routing, a refund confirmation, and case resolution.
2. `account_takeover_medium` (expected difficulty: medium). Suspicious-login security ticket. The agent must escalate to trust and safety, request verification details, and keep the case waiting on the customer.
3. `api_incident_hard` (expected difficulty: hard). Enterprise production API incident with a distracting compliance mention. The agent must escalate to platform engineering, request the right diagnostics, and open the incident instead of resolving it.

What makes these tasks less generic than ordinary support-routing demos:

- They mix queueing, priority, customer communication, internal note-taking, and close-vs-escalate decisions in one trajectory.
- They include operational context like customer tier, affected-user count, SLA pressure, and business impact.
- The harder tasks contain conflicting or distracting signals, so a frontier model has to identify the primary issue instead of treating every mention as equally important.

## Deterministic Graders

The final task score is a weighted total in [0.0, 1.0]:

- Queue correctness: 0.15
- Priority correctness: 0.10
- Issue-type correctness: 0.10
- Requested-fields correctness: 0.15
- Reply coverage: 0.25
- Internal-note coverage: 0.10
- Final status: 0.10
- Resolution code: 0.05

The same grader also drives dense reward shaping during the episode by comparing the current score to the previous score and then subtracting small penalties for no-op or low-signal actions.
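
A compact sketch of that arithmetic, using the weights above (the component-score inputs and the penalty magnitude are illustrative; the real logic lives in `supportdesk_env/graders.py`):

```python
# Illustrative reconstruction of the weighted score and delta-based shaping;
# see supportdesk_env/graders.py for the actual implementation.
WEIGHTS = {
    "queue": 0.15, "priority": 0.10, "issue_type": 0.10,
    "requested_fields": 0.15, "reply": 0.25,
    "internal_note": 0.10, "status": 0.10, "resolution_code": 0.05,
}

def total_score(components: dict[str, float]) -> float:
    """Each component score is in [0, 1], so the weighted total stays in [0, 1]."""
    return sum(WEIGHTS[k] * components.get(k, 0.0) for k in WEIGHTS)

def shaped_reward(prev_score: float, curr_score: float, low_signal: bool) -> float:
    """Dense per-step reward: grader-score delta minus a small no-op penalty."""
    penalty = 0.02 if low_signal else 0.0  # illustrative magnitude
    return (curr_score - prev_score) - penalty
```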

## Project Layout

```text
.
|-- inference.py
|-- openenv.yaml
|-- pyproject.toml
|-- requirements.txt
|-- supportdesk_env
|   |-- __init__.py
|   |-- client.py
|   |-- graders.py
|   |-- models.py
|   |-- tasks.py
|   `-- server
|       |-- app.py
|       `-- supportdesk_environment.py
|-- tests
|   `-- test_supportdesk.py
`-- uv.lock
```

## Local Setup

```bash
pip install -r requirements.txt
```

Or with uv:

```bash
uv sync
```

Optional environment variables for the baseline:

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="openai/gpt-oss-120b"
export OPENAI_API_KEY="sk-..."  # Or use HF_TOKEN with a compatible router
export HF_TOKEN="hf_..."
```

The baseline uses the OpenAI Python client and supports both `OPENAI_API_KEY` and `HF_TOKEN`.

## Setup and Usage Instructions

Typical local workflow:

```bash
pip install -r requirements.txt
python -m openenv.cli validate .
python inference.py
python -m supportdesk_env.server.app
```

## Local RL Playground

If you want to import the package directly and train against the local environment without going through the HTTP server, use the tabular Q-learning example:

```bash
python examples/rl/train_q_agent.py
```

This script imports the package, instantiates SupportDeskEnvironment directly, trains a tiny Q-learning agent over a compact discrete action library, and then prints greedy evaluation results for all three tasks. It is meant as a local experimentation playground, not as the official submission baseline.
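
In spirit, the training loop looks roughly like the sketch below. The action-library builder and the state key are assumptions about the script's internals; the actual implementation is in `examples/rl/train_q_agent.py`.

```python
# Rough sketch of the tabular Q-learning loop; build_action_library() and the
# state key are hypothetical stand-ins for what the real script does.
import random
from collections import defaultdict

from supportdesk_env.server.supportdesk_environment import SupportDeskEnvironment

ACTIONS = build_action_library()  # hypothetical: small list of SupportDeskAction
q = defaultdict(float)            # Q[(state_key, action_index)]
alpha, gamma, eps = 0.5, 0.95, 0.2

env = SupportDeskEnvironment()
for _ in range(200):
    obs = env.reset()
    state = (obs.task_id, len(obs.action_history))  # assumed state key
    while not obs.done:
        if random.random() < eps:
            a = random.randrange(len(ACTIONS))                         # explore
        else:
            a = max(range(len(ACTIONS)), key=lambda i: q[(state, i)])  # exploit
        obs = env.step(ACTIONS[a])
        next_state = (obs.task_id, len(obs.action_history))
        best_next = 0.0 if obs.done else max(
            q[(next_state, i)] for i in range(len(ACTIONS)))
        q[(state, a)] += alpha * (obs.reward + gamma * best_next - q[(state, a)])
        state = next_state
```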

## Run the Server

```bash
python -m supportdesk_env.server.app
```

Or with the OpenEnv entrypoint:

```bash
server
```

## Run the Baseline

```bash
python inference.py
```

When model credentials are present, the script uses the OpenAI client against `API_BASE_URL` and `MODEL_NAME`. If credentials are missing or a request fails, it falls back to a deterministic heuristic policy so the script still completes and prints reproducible scores.
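
The credential check and fallback follow a pattern like this sketch (the helper names are illustrative, not the script's actual functions):

```python
# Illustrative fallback pattern; see inference.py for the real implementation.
import os

def choose_action(obs):
    token = os.getenv("OPENAI_API_KEY") or os.getenv("HF_TOKEN")
    if token:
        try:
            return model_backed_action(obs)  # hypothetical OpenAI-client call
        except Exception:
            pass  # fall through to the deterministic path on request failure
    return heuristic_action(obs)             # hypothetical deterministic policy
```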

## Docker

```bash
docker build -t supportdesk-env .
docker run -p 8000:8000 supportdesk-env
```

## Hugging Face Space Deployment

Deploy this repo as a Docker Space and keep it public for submission. The Space should include the `openenv` tag and the following environment configuration values:

- `API_BASE_URL`
- `MODEL_NAME`
- `HF_TOKEN`

If the OpenEnv CLI is installed, deployment can be done with:

```bash
openenv push --repo-id your-username/HyperBrickCaseOps
```

## Validation

```bash
openenv validate .
```

For a full pre-submission pass against a deployed Space:

```bash
./scripts/validate-submission.sh https://your-space.hf.space .
```

## Submission Checklist

- Public GitHub repository with this codebase
- Root `inference.py`
- Working Docker build
- Deployed Hugging Face Docker Space tagged `openenv`
- Space secrets configured: `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`
- README present with environment overview, action/observation definitions, tasks, setup, and baseline scores

## Baseline Scores

Expected deterministic fallback baseline:

- `billing_refund_easy`: 1.00
- `account_takeover_medium`: 1.00
- `api_incident_hard`: 1.00
- Average: 1.00

These scores are deliberately reproducible because the fallback policy follows the gold workflow exactly. A model-backed run will typically be lower unless the prompt or model is improved, which makes the environment useful for both training and evaluation.