	---
	title: HyperBrickCaseOps
	sdk: docker
	app_port: 8000
	tags:
	- openenv
	- reinforcement-learning
	- customer-support
	base_path: /web
	---

	# HyperBrickCaseOps

	HyperBrickCaseOps is an OpenEnv environment for enterprise support operations. The agent receives a realistic support ticket, a few policy snippets, and the current case state. It then has to do the same kind of work a human support or operations teammate would do: route the case, set urgency, ask for missing details, write the customer reply, leave an internal note, and decide whether the case should stay open, be resolved, or be escalated.

	The main idea is simple: good support work is not just writing a polite reply. It also means making the right operational decision.

	## Environment description and motivation

	This environment addresses a gap common to support benchmarks. Many of them check whether a model can produce a plausible response, but real support work also requires correct routing, escalation, information gathering, and final case handling.

	HyperBrickCaseOps is meant to test that full workflow.

	It is neither a toy game nor a chat-only task. The cases include factors such as:

	- SLA pressure
	- affected user counts
	- customer tier
	- secondary concerns that should not distract the agent from the main issue
	- delayed customer follow-up turns
	- unsafe requests that should not be approved just because the customer sounds urgent

	## OpenEnv interface

	The environment uses the standard OpenEnv flow:

	- `reset()` starts a new case and returns the first observation
	- `step(action)` applies one typed action and returns the next observation
	- `state()` returns the current typed internal state

	The metadata is defined in `openenv.yaml`, and the HTTP app is created through `create_app(...)`.
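
The `reset()` / `step(action)` / `state()` flow above can be sketched with a small stand-in environment. Everything in this block (`StubEnv`, `StubObservation`, the four-step workflow, the `0.25` reward) is a hypothetical illustration of the call pattern, not the repo's actual classes or values.

```python
from dataclasses import dataclass, field


@dataclass
class StubObservation:
    """Tiny stand-in for SupportDeskObservation (hypothetical subset of fields)."""
    required_next_actions: list[str]
    reward: float = 0.0
    done: bool = False


@dataclass
class StubEnv:
    """Hypothetical in-memory environment exposing reset/step/state."""
    pending: list[str] = field(default_factory=list)
    step_count: int = 0

    def reset(self) -> StubObservation:
        # Start a new case: every workflow step is still outstanding.
        self.pending = ["classify", "draft_reply", "add_internal_note", "submit"]
        self.step_count = 0
        return StubObservation(required_next_actions=list(self.pending))

    def step(self, action: dict) -> StubObservation:
        # Apply one action; reward it only if it was the next expected step.
        self.step_count += 1
        reward = 0.0
        if self.pending and action["operation"] == self.pending[0]:
            self.pending.pop(0)
            reward = 0.25
        return StubObservation(list(self.pending), reward, done=not self.pending)

    def state(self) -> dict:
        # Internal bookkeeping, separate from what the observation exposes.
        return {"step_count": self.step_count, "pending": list(self.pending)}


def run_episode(env: StubEnv, max_steps: int = 10) -> float:
    """Drive one episode by always taking the first required action."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        if obs.done:
            break
        obs = env.step({"operation": obs.required_next_actions[0]})
        total += obs.reward
    return total
```

A real agent would replace the "take the first required action" rule with a policy that reads the ticket and knowledge base, but the loop shape stays the same.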

	## Action space

	Each step takes a typed `SupportDeskAction`.

	Fields:

	- `operation`
	- `queue`
	- `priority`
	- `issue_type`
	- `status`
	- `resolution_code`
	- `requested_fields`
	- `reply`
	- `internal_note`

	Supported operations:

	- `classify`: sets `queue`, `priority`, and `issue_type`.
	- `request_info`: requests missing fields from the customer.
	- `draft_reply`: writes the customer-facing reply.
	- `add_internal_note`: writes the internal note for handoff or auditability.
	- `submit`: sets the final `status` and `resolution_code`.
	- `wait`: advances the environment when a customer follow-up is pending.
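
The mapping between operations and the fields they set can be written down as a lookup table, which makes it easy to spot actions that populate unrelated fields. This is a minimal sketch: `OPERATION_FIELDS` and `stray_fields` are hypothetical names, and the check itself is an assumption about how one might detect a mixed action, not the environment's own validation.

```python
# Which action fields each operation is expected to populate,
# taken from the list of supported operations above.
OPERATION_FIELDS = {
    "classify": {"queue", "priority", "issue_type"},
    "request_info": {"requested_fields"},
    "draft_reply": {"reply"},
    "add_internal_note": {"internal_note"},
    "submit": {"status", "resolution_code"},
    "wait": set(),
}


def stray_fields(action: dict) -> set[str]:
    """Return populated fields that do not belong to the action's operation."""
    allowed = OPERATION_FIELDS[action["operation"]] | {"operation"}
    return {
        name
        for name, value in action.items()
        # None, empty lists, and empty strings count as "not populated".
        if value not in (None, [], "") and name not in allowed
    }
```

For example, a `classify` action that also carries a `reply` would return `{"reply"}`, while a clean action returns an empty set.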

	Example action:

	```json
	{
	"operation": "classify",
	"queue": "trust_and_safety",
	"priority": "urgent",
	"issue_type": "account_compromise",
	"status": null,
	"resolution_code": null,
	"requested_fields": [],
	"reply": null,
	"internal_note": null
	}
	```
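
To build the same payload from Python, one option is a small dataclass that mirrors the action fields. `SupportDeskActionSketch` and `classify` are illustrative names; the repo's real `SupportDeskAction` lives in `supportdesk_env/models.py` and may be defined differently (for example with stricter types or validation).

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class SupportDeskActionSketch:
    """Illustrative mirror of the action fields; not the repo's actual model."""
    operation: str
    queue: Optional[str] = None
    priority: Optional[str] = None
    issue_type: Optional[str] = None
    status: Optional[str] = None
    resolution_code: Optional[str] = None
    requested_fields: list[str] = field(default_factory=list)
    reply: Optional[str] = None
    internal_note: Optional[str] = None


def classify(queue: str, priority: str, issue_type: str) -> SupportDeskActionSketch:
    """Build a classify action; unrelated fields keep their empty defaults."""
    return SupportDeskActionSketch(
        operation="classify", queue=queue, priority=priority, issue_type=issue_type
    )


action = classify("trust_and_safety", "urgent", "account_compromise")
# asdict() turns the dataclass into a plain dict that serializes to JSON.
payload = json.dumps(asdict(action), indent=2)
print(payload)
```

Keeping one helper per operation is a simple way to avoid filling fields that belong to a different operation.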

	## Observation space

	Each observation is a typed `SupportDeskObservation`.

	Main fields:

	- `task_id`
	- `difficulty`
	- `objective`
	- `ticket`
	- `knowledge_base`
	- `available_queues`
	- `available_priorities`
	- `available_statuses`
	- `available_issue_types`
	- `case`
	- `current_sla_minutes_remaining`
	- `workflow_stage`
	- `required_next_actions`
	- `risk_flags`
	- `action_history`
	- `feedback`
	- `remaining_steps`
	- `reward`
	- `done`

	The `case` object is the mutable operational state. It contains:

	- current queue, priority, and issue type
	- requested fields
	- reply draft
	- internal note
	- final status and resolution code
	- customer follow-up state

	Customer follow-up can move through:

	- `none`
	- `pending`
	- `partial`
	- `complete`
	- `incorrect`

	The observation is designed to help the agent reason about process, not just text:

	- `workflow_stage` shows whether the agent is still classifying, waiting on a reply, drafting communication, or ready to submit
	- `required_next_actions` tells the agent which steps are still missing
	- `risk_flags` surfaces urgency and safety issues like SLA risk, unsafe unlock pressure, and irrelevant customer follow-up
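
As a rough illustration of how these process fields can drive a simple policy, the sketch below picks the next operation from the follow-up state and `required_next_actions`. The key name `customer_follow_up` inside `case`, and the assumption that `required_next_actions` is ordered and contains operation names, are guesses made for this example.

```python
def choose_next_operation(observation: dict) -> str:
    """Pick the next operation from process signals in the observation.

    Assumes `case["customer_follow_up"]` holds the follow-up state and that
    `required_next_actions` is ordered; both are guesses for illustration.
    """
    follow_up = observation.get("case", {}).get("customer_follow_up", "none")
    if follow_up == "pending":
        # A customer reply is outstanding, so advance the environment.
        return "wait"
    required = observation.get("required_next_actions", [])
    if required:
        # Otherwise take the first workflow step that is still missing.
        return required[0]
    # Nothing is missing any more: finalize the case.
    return "submit"
```

A stronger policy would also consult `risk_flags`, for instance to avoid resolving a case while an unsafe-unlock flag is raised.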

	## State space

	`state()` returns the typed `SupportDeskState`.

	Main fields:

	- `episode_id`
	- `task_id`
	- `difficulty`
	- `step_count`
	- `reward`
	- `done`
	- `current_score`
	- `max_steps`
	- `case`
	- `current_sla_minutes_remaining`
	- `workflow_stage`
	- `required_next_actions`
	- `risk_flags`
	- `action_history`
	- `completed_milestones`
	- `last_feedback`

	## Task descriptions

	There are four deterministic tasks in a fixed order.

	### 1. `billing_refund_easy`

	Difficulty: easy

	A customer was charged twice after cancellation. The right workflow is to route the case to billing, confirm the refund path, leave a useful note, and resolve the case without asking for unnecessary extra information.

	### 2. `account_takeover_medium`

	Difficulty: medium

	This is a suspicious-login recovery case. The agent has to route it to trust and safety, request verification details, handle a delayed partial follow-up from the customer, and keep the case open until the missing information is provided. Unlocking the account immediately would be unsafe.

	### 3. `api_incident_hard`

	Difficulty: hard

	This task simulates a live enterprise API incident. The ticket includes a secondary compliance concern, but the primary issue is the outage. The agent needs to escalate to engineering, request the right diagnostics, communicate clearly, and keep the incident open rather than marking it resolved.

	### 4. `regulated_export_exception_hard`

	Difficulty: hard

	This is a regulated exception request. The customer wants a shortcut around an export restriction, but the correct workflow is to route the case to compliance, request legal approval details, and keep the case open pending review. Sending it straight to engineering for a workaround is the wrong move.

	## Reward and grader design

	Each task has a deterministic grader that returns a score in `[0.0, 1.0]`.

	The grader checks:

	- queue correctness
	- priority correctness
	- issue type correctness
	- requested fields
	- reply coverage
	- internal note coverage
	- final status
	- resolution code
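
A checklist grader of this kind can be sketched as an average over per-field checks. The equal weighting, the keyword-based coverage measure, and the `reply_keywords` / `note_keywords` keys are assumptions for illustration; the actual scoring rules are in `supportdesk_env/graders.py`.

```python
from typing import Optional


def keyword_coverage(text: Optional[str], keywords: list[str]) -> float:
    """Fraction of required keywords that appear in the text (case-insensitive)."""
    if not keywords:
        return 1.0
    lowered = (text or "").lower()
    return sum(k.lower() in lowered for k in keywords) / len(keywords)


def grade(case: dict, expected: dict) -> float:
    """Average eight equally weighted checks into a score in [0.0, 1.0]."""
    exact = ["queue", "priority", "issue_type", "status", "resolution_code"]
    checks = [1.0 if case.get(name) == expected.get(name) else 0.0 for name in exact]
    # Requested fields are compared as sets so ordering does not matter.
    asked = set(case.get("requested_fields", []))
    wanted = set(expected.get("requested_fields", []))
    checks.append(1.0 if asked == wanted else 0.0)
    # Free-text fields are scored by how many required keywords they cover.
    checks.append(keyword_coverage(case.get("reply"), expected.get("reply_keywords", [])))
    checks.append(keyword_coverage(case.get("internal_note"), expected.get("note_keywords", [])))
    return sum(checks) / len(checks)
```

Because every check is a pure function of the case state, the same case always gets the same score, which is what makes the grader deterministic.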

	The environment uses the grader score delta as the main dense reward signal. On top of that, it adds smaller process-aware bonuses and penalties so that the full trajectory matters, not just the final snapshot.

	Examples:

	- bonus for early correct routing on urgent tasks
	- bonus for moving through the workflow in the right order
	- bonus when `wait` correctly reveals a scripted customer follow-up
	- penalty for premature submit
	- penalty for over-escalation
	- penalty for mixed or sloppy actions
	- penalty when the SLA gets critically low

	## Project layout

	```text
	.
	|-- inference.py
	|-- openenv.yaml
	|-- pyproject.toml
	|-- Dockerfile
	|-- uv.lock
	|-- supportdesk_env
	|   |-- __init__.py
	|   |-- graders.py
	|   |-- models.py
	|   |-- policies.py
	|   |-- tasks.py
	|   `-- server
	|       |-- app.py
	|       `-- supportdesk_environment.py
	|-- tests
	|   `-- test_supportdesk.py
	`-- examples
	    `-- rl
	        `-- train_q_agent.py
	```

	## Setup instructions

	### Option 1: pip

	```bash
	pip install -r requirements.txt
	```

	### Option 2: uv

	```bash
	uv sync
	```

	## Usage instructions

	Validate the repo:

	```bash
	python -m openenv.cli validate .
	```

	Start the local server:

	```bash
	python -m supportdesk_env.server.app
	```

	Or use the entrypoint:

	```bash
	server
	```

	Run the baseline:

	```bash
	python inference.py
	```

	There is also a small local RL example:

	```bash
	python examples/rl/train_q_agent.py
	```

	## Baseline and environment variables

	`inference.py` calls the model through the OpenAI Python client. Model configuration is supplied through environment variables at runtime.

	Supported variables:

	- `API_BASE_URL`
	- `MODEL_NAME`
	- `HF_TOKEN`
	- `OPENAI_API_KEY`
	- `MAX_STEPS`
	- `TEMPERATURE`

	Example:

	```bash
	export API_BASE_URL="https://router.huggingface.co/v1"
	export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
	export HF_TOKEN="your-token-here"
	python inference.py
	```

	Important:

	- the repo does not depend on hardcoded credentials
	- the expected evaluation setup is environment-variable driven
	- if credentials are missing or the model call fails, the baseline falls back to a deterministic heuristic policy so the script still completes
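
The environment-variable-driven setup with a fallback can be sketched like this. `load_model_config`, the preference for `HF_TOKEN` over `OPENAI_API_KEY`, and the default values for `MAX_STEPS` and `TEMPERATURE` are assumptions for illustration; `inference.py` may resolve these differently.

```python
import os
from typing import Mapping, Optional


def load_model_config(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return model settings, or None when the heuristic fallback should run."""
    # Assumption: either token variable can serve as the API key.
    api_key = env.get("HF_TOKEN") or env.get("OPENAI_API_KEY")
    model = env.get("MODEL_NAME")
    if not (api_key and model):
        # Missing credentials or model name: signal the caller to fall back.
        return None
    return {
        "api_key": api_key,
        "base_url": env.get("API_BASE_URL"),
        "model": model,
        # Hypothetical defaults, used only when the variable is unset.
        "max_steps": int(env.get("MAX_STEPS", "12")),
        "temperature": float(env.get("TEMPERATURE", "0.0")),
    }


config = load_model_config()
policy = "model" if config else "heuristic"
print(f"Using {policy} policy")
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.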

	## Docker

	Build:

	```bash
	docker build -t supportdesk-env .
	```

	Run:

	```bash
	docker run -p 8000:8000 supportdesk-env
	```

	## Hugging Face Space deployment

	This repo is meant to run as a Docker Space. Keep both the GitHub repository and the Hugging Face Space public for submission.

	If you have the OpenEnv CLI installed, a typical deployment command is:

	```bash
	openenv push --repo-id your-username/HyperBrickCaseOps
	```

	## Validation

	Local validation:

	```bash
	openenv validate .
	```

	Validation against a running environment:

	```bash
	openenv validate http://127.0.0.1:8000
	```

	Pre-submission script:

	```bash
	./scripts/validate-submission.sh https://your-space.hf.space .
	```

	## Submission checklist

	- real-world environment, not a toy or game
	- typed OpenEnv action, observation, and state models
	- working `reset`, `step`, and `state`
	- at least 3 tasks with deterministic graders
	- meaningful reward over the trajectory
	- root `inference.py`
	- working `Dockerfile`
	- `openenv.yaml` present
	- README includes environment description, motivation, action space, observation space, task descriptions, setup instructions, and baseline scores

	## Baseline scores

	Current deterministic fallback baseline:

	- `billing_refund_easy`: `1.00`
	- `account_takeover_medium`: `1.00`
	- `api_incident_hard`: `1.00`
	- `regulated_export_exception_hard`: `1.00`
	- average: `1.00`

	These scores are intentionally reproducible. The fallback policy exists to show that the environment, reward shaping, and graders all work end to end. Model-backed runs can score lower, which is what makes the environment useful for evaluation.