---
title: WorkflowArena
emoji: 🏗️
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
models:
  - qwen/qwen3.5-9b
app_port: 8000
base_path: /
tags:
  - openenv
  - workflow-orchestration
  - reinforcement-learning
---

# WorkflowArena

WorkflowArena is an OpenEnv benchmark for scheduling dependency-constrained work on limited workers. Each episode is a seeded workflow DAG. The agent must decide when to dispatch ready tasks, when to wait, and how to trade off deadline pressure, worker utilization, critical-path protection, and unfinished work.

## Problem

This environment models a common orchestration problem:

- tasks have dependencies, so not everything can start immediately
- workers are limited, so not every ready task can run at once
- deadlines and priorities are uneven, so the obvious greedy move is not always best
- higher difficulties add time pressure and failure dynamics

The action space is intentionally small:

1. `dispatch(task_ids=[...])`
2. `wait()`

That keeps the challenge focused on decision quality rather than action syntax.
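As a concrete sketch, both actions can be built as small JSON payloads. The payload shape matches the examples later in this README; the helper names here are hypothetical, not part of the environment's API.

```python
import json

def make_wait() -> str:
    # wait must always carry an empty task_ids list
    return json.dumps({"action_type": "wait", "task_ids": []})

def make_dispatch(task_ids) -> str:
    # dispatch a batch of ready task ids
    return json.dumps({"action_type": "dispatch", "task_ids": list(task_ids)})
```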

## Episode Loop

1. `reset()` generates a deterministic episode from `preset`, `seed`, and `worker_count`.
2. The observation exposes ready, running, blocked, and completed tasks plus planner hints.
3. The agent either dispatches a legal batch of ready tasks or waits for the next completion event.
4. Time advances only on `wait()`.
5. The episode ends when:
   - all tasks complete, or
   - the preset time budget is exhausted, or
   - the safety step limit is hit.
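The loop above can be sketched as a minimal rollout driver. Note that the `reset()`/`step()` methods and the `done`/`reward` fields on the observation are assumptions based on this description, not a verified client interface.

```python
# Hypothetical rollout loop for WorkflowArena-style environments.
# env.reset(), env.step(), obs.done, and obs.reward are assumed names.
def run_episode(env, policy, max_steps=10_000):
    obs = env.reset()
    total_reward = 0.0
    steps = 0
    while not obs.done and steps < max_steps:
        action = policy(obs)      # a dispatch(...) or wait() action
        obs = env.step(action)
        total_reward += obs.reward
        steps += 1
    return total_reward, steps
```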

## Difficulty Presets

### easy

- smaller DAGs
- softer deadlines
- no fixed time budget
- no failure events

This is the baseline teaching mode. Good play mostly means keeping workers busy and avoiding obviously bad waits.

### medium

- larger DAGs
- tighter deadlines
- fixed episode time budget
- terminal penalty for unfinished work

This is where the environment becomes a real tradeoff problem. The agent may not be able to finish everything, so it must decide what is worth finishing before time runs out.

### hard

- denser DAGs
- tighter deadlines
- tighter time budget than medium
- temporary worker outages
- task retry failures

In hard mode, usable capacity can shrink temporarily and a task may fail at completion and return to the ready queue.

## Rewards

WorkflowArena uses shaped rewards so local decisions have immediate feedback, while terminal scoring still matters.

### Per-step reward channels

The observation exposes `last_reward_breakdown` with these channels:

- `completion_reward`: reward for tasks that finished on the latest `wait()`
- `utilization_reward`: reward for keeping workers occupied
- `deadline_reward`: positive for on-time completion, negative for lateness
- `criticality_reward`: reward for progress on high-impact work
- `idle_penalty`: penalty for avoidable waiting or leaving useful capacity idle
- `invalid_action_penalty`: penalty for malformed or infeasible actions
- `terminal_makespan_score`: terminal efficiency score at episode end
- `unfinished_task_penalty`: terminal penalty for incomplete work when the episode ends before all tasks finish
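A per-step scalar reward can be recovered by summing the channels. The channel names come from the list above; the dictionary shape and the sign convention (penalty channels already negative) are assumptions.

```python
# Channel names as documented in last_reward_breakdown.
CHANNELS = (
    "completion_reward", "utilization_reward", "deadline_reward",
    "criticality_reward", "idle_penalty", "invalid_action_penalty",
    "terminal_makespan_score", "unfinished_task_penalty",
)

def step_reward(breakdown: dict) -> float:
    # Assumes penalty channels are reported as negative values,
    # so a plain sum yields the net step reward.
    return sum(breakdown.get(name, 0.0) for name in CHANNELS)
```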

### Reward design intent

The reward is set up to encourage:

- filling worker capacity when good work is available
- respecting deadlines
- protecting high-priority and critical-path tasks
- avoiding pointless waits
- finishing as much important work as possible before the time budget expires

The terminal score is bounded and deterministic. Higher values correspond to stronger schedules.

## Failures and Constraints

The environment keeps the action space fixed, but higher presets change the transition dynamics.

### Capacity constraint

- `dispatch(task_ids=[...])` cannot exceed current free capacity
- only tasks in `ready_tasks` are legal to dispatch
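The capacity constraint can be expressed as a small feasibility check. This helper is a hypothetical client-side guard, not part of the environment itself.

```python
def is_legal_dispatch(task_ids, ready_ids, free_workers) -> bool:
    # A dispatch batch is legal only if it is non-empty, has no
    # duplicates, fits in free capacity, and names only ready tasks.
    ids = list(task_ids)
    return (
        len(ids) > 0
        and len(ids) == len(set(ids))     # no duplicate ids
        and len(ids) <= free_workers      # fits current free capacity
        and set(ids) <= set(ready_ids)    # only ready tasks
    )
```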

### Hard-mode worker outages

- a temporary outage can reduce usable workers
- `total_workers` stays constant
- `effective_workers` reflects usable workers after degradation
- `free_workers` is computed from `effective_workers`, not from the original total
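The relationship between these worker counts can be sketched as follows. The field names come from the observation contract; the exact arithmetic is an assumption consistent with the bullets above.

```python
def free_workers(total_workers, degraded_workers, running_tasks) -> int:
    # Usable capacity after outages, then subtract busy workers.
    effective = total_workers - degraded_workers
    return max(0, effective - len(running_tasks))
```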

### Hard-mode retry failures

- a running task may fail at completion
- it consumes time but does not complete
- it returns to `ready_tasks`
- `attempt_count` shows how many retry failures that task has already consumed

## Observation Contract

The main observation type is `WorkflowArenaObservation`. Important fields include:

- `current_time`
- `total_workers`
- `effective_workers`
- `degraded_workers`
- `free_workers`
- `time_budget`
- `time_remaining`
- `progress`
- `ready_tasks`
- `running_tasks`
- `completed_tasks`
- `blocked_tasks`
- `recent_failure_events`
- `last_reward_breakdown`
- `success_metrics`
- `validation_error`

Each task view includes:

- `task_id`
- `duration`
- `priority`
- `deadline`
- `criticality`
- `slack`
- `downstream_count`
- `dependencies`
- `attempt_count`
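These task fields support simple baseline policies, for example ranking ready tasks by slack and priority. This is a hypothetical greedy baseline, not the environment's recommended strategy; plain dicts stand in for the real task views.

```python
def pick_batch(ready_tasks, free_workers):
    # Dispatch the lowest-slack, highest-priority tasks first, up to
    # the number of free workers.
    ranked = sorted(ready_tasks, key=lambda t: (t["slack"], -t["priority"]))
    return [t["task_id"] for t in ranked[:free_workers]]
```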

## Expected Agent Output

Agents are expected to return compact JSON actions in one of these exact forms:

```json
{ "action_type": "wait", "task_ids": [] }
{ "action_type": "dispatch", "task_ids": ["task_01", "task_02"] }
```

Rules:

- dispatch only task ids that appear in `ready_tasks`
- do not exceed `free_workers`
- do not send duplicate ids
- `wait` actions must use an empty `task_ids` list
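The rules above can be enforced on a raw model reply before it is sent to the environment. This validator is a hypothetical client-side sketch based on the documented action shapes.

```python
import json

def validate_action(raw: str, ready_ids, free_workers) -> bool:
    # Parse the model's JSON reply and apply the action rules.
    action = json.loads(raw)
    kind, ids = action.get("action_type"), action.get("task_ids")
    if kind == "wait":
        return ids == []                      # wait must carry an empty list
    if kind == "dispatch":
        return (
            isinstance(ids, list) and len(ids) > 0
            and len(ids) == len(set(ids))     # no duplicate ids
            and len(ids) <= free_workers      # respect free capacity
            and set(ids) <= set(ready_ids)    # only ready tasks
        )
    return False                              # unknown action_type
```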

## Success Metrics

The environment reports schedule quality through `success_metrics`:

- `makespan`
- `worker_utilization`
- `deadline_miss_count`
- `unfinished_task_count`
- `weighted_priority_completion`
- `benchmark_score`

Interpretation:

- higher `benchmark_score` is better
- lower `deadline_miss_count` is better
- lower `unfinished_task_count` is better
- `makespan` is only populated when all tasks completed

## Expected Outputs for Evaluation

For benchmark use, an agent should produce:

1. a legal JSON action at every step
2. a full episode rollout until termination
3. a final observation containing the terminal score and success metrics

Typical downstream evaluation reads:

- cumulative reward
- final `benchmark_score`
- whether the agent completed all tasks
- how many deadlines were missed
- how much important work remained unfinished
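A downstream harness might condense those reads into a single record. The metric keys follow `success_metrics` above; the helper itself is a hypothetical sketch.

```python
def summarize(metrics: dict, cumulative_reward: float) -> dict:
    # Condense the final success_metrics into an evaluation record.
    return {
        "reward": cumulative_reward,
        "score": metrics["benchmark_score"],
        "completed_all": metrics["unfinished_task_count"] == 0,
        "deadlines_missed": metrics["deadline_miss_count"],
    }
```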

## Benchmarks

Results come from a verified, self-contained inference run using:

1. qwen/qwen3.5-9b

Results:

| Preset | Success | Steps | Score |
|--------|---------|-------|-------|
| easy   | true    | 11    | 0.952 |
| medium | true    | 20    | 0.945 |
| hard   | true    | 45    | 0.652 |

## Local Development

Validate the environment:

```shell
.venv/bin/python -m openenv.cli.__main__ validate workflow_arena
```

Run the server locally:

```shell
cd workflow_arena
uv run --project . server
```