---
title: WorkflowArena
emoji: 🏗️
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
models:
  - qwen/qwen3.5-9b
app_port: 8000
base_path: /
tags:
  - openenv
  - workflow-orchestration
  - reinforcement-learning
---

# WorkflowArena

WorkflowArena is an OpenEnv benchmark for scheduling dependency-constrained work on limited workers. Each episode is a seeded workflow DAG. The agent must decide when to dispatch ready tasks, when to wait, and how to trade off deadline pressure, worker utilization, critical-path protection, and unfinished work.

## Problem

This environment models a common orchestration problem:

- tasks have dependencies, so not everything can start immediately
- workers are limited, so not every ready task can run at once
- deadlines and priorities are uneven, so the obvious greedy move is not always best
- higher difficulties add time pressure and failure dynamics

The action space is intentionally small:

1. `dispatch(task_ids=[...])`
2. `wait()`

That keeps the challenge focused on decision quality rather than action syntax.
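As a concrete sketch, both actions can be built as small JSON payloads. The payload shape matches the examples later in this README; the helper names here are hypothetical, not part of the environment's API.

```python
import json

def make_wait() -> str:
    # wait must always carry an empty task_ids list
    return json.dumps({"action_type": "wait", "task_ids": []})

def make_dispatch(task_ids) -> str:
    # dispatch a batch of ready task ids
    return json.dumps({"action_type": "dispatch", "task_ids": list(task_ids)})
```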

## Episode Loop

1. `reset()` generates a deterministic episode from `preset`, `seed`, and `worker_count`.
2. The observation exposes ready, running, blocked, and completed tasks plus planner hints.
3. The agent either dispatches a legal batch of ready tasks or waits for the next completion event.
4. Time advances only on `wait()`.
5. The episode ends when:
   - all tasks complete, or
   - the preset time budget is exhausted, or
   - the safety step limit is hit.
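The loop above can be sketched as a minimal rollout driver. Note that the `reset()`/`step()` methods and the `done`/`reward` fields on the observation are assumptions based on this description, not a verified client interface.

```python
# Hypothetical rollout loop for WorkflowArena-style environments.
# env.reset(), env.step(), obs.done, and obs.reward are assumed names.
def run_episode(env, policy, max_steps=10_000):
    obs = env.reset()
    total_reward = 0.0
    steps = 0
    while not obs.done and steps < max_steps:
        action = policy(obs)      # a dispatch(...) or wait() action
        obs = env.step(action)
        total_reward += obs.reward
        steps += 1
    return total_reward, steps
```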

## Difficulty Presets

### easy

- smaller DAGs
- softer deadlines
- no fixed time budget
- no failure events

This is the baseline teaching mode. Good play mostly means keeping workers busy and avoiding obviously bad waits.

### medium

- larger DAGs
- tighter deadlines
- fixed episode time budget
- terminal penalty for unfinished work

This is where the environment becomes a real tradeoff problem. The agent may not be able to finish everything, so it must decide what is worth finishing before time runs out.

### hard

- denser DAGs
- tighter deadlines
- tighter time budget than medium
- temporary worker outages
- task retry failures

In hard mode, usable capacity can shrink temporarily and a task may fail at completion and return to the ready queue.

## Rewards

WorkflowArena uses shaped rewards so local decisions have immediate feedback, while terminal scoring still matters.

### Per-step reward channels

The observation exposes `last_reward_breakdown` with these channels:

- `completion_reward`: reward for tasks that finished on the latest `wait()`
- `utilization_reward`: reward for keeping workers occupied
- `deadline_reward`: positive for on-time completion, negative for lateness
- `criticality_reward`: reward for progress on high-impact work
- `idle_penalty`: penalty for avoidable waiting or leaving useful capacity idle
- `invalid_action_penalty`: penalty for malformed or infeasible actions
- `terminal_makespan_score`: terminal efficiency score at episode end
- `unfinished_task_penalty`: terminal penalty for incomplete work when the episode ends before all tasks finish
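A per-step scalar reward can be recovered by summing the channels. The channel names come from the list above; the dictionary shape and the sign convention (penalty channels already negative) are assumptions.

```python
# Channel names as documented in last_reward_breakdown.
CHANNELS = (
    "completion_reward", "utilization_reward", "deadline_reward",
    "criticality_reward", "idle_penalty", "invalid_action_penalty",
    "terminal_makespan_score", "unfinished_task_penalty",
)

def step_reward(breakdown: dict) -> float:
    # Assumes penalty channels are reported as negative values,
    # so a plain sum yields the net step reward.
    return sum(breakdown.get(name, 0.0) for name in CHANNELS)
```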

### Reward design intent

The reward is set up to encourage:

- filling worker capacity when good work is available
- respecting deadlines
- protecting high-priority and critical-path tasks
- avoiding pointless waits
- finishing as much important work as possible before the time budget expires

The terminal score is bounded and deterministic. Higher values correspond to stronger schedules.

## Failures and Constraints

The environment keeps the action space fixed, but higher presets change the transition dynamics.

### Capacity constraint

- `dispatch(task_ids=[...])` cannot exceed current free capacity
- only tasks in `ready_tasks` are legal to dispatch
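The capacity constraint can be expressed as a small feasibility check. This helper is a hypothetical client-side guard, not part of the environment itself.

```python
def is_legal_dispatch(task_ids, ready_ids, free_workers) -> bool:
    # A dispatch batch is legal only if it is non-empty, has no
    # duplicates, fits in free capacity, and names only ready tasks.
    ids = list(task_ids)
    return (
        len(ids) > 0
        and len(ids) == len(set(ids))     # no duplicate ids
        and len(ids) <= free_workers      # fits current free capacity
        and set(ids) <= set(ready_ids)    # only ready tasks
    )
```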

### Hard-mode worker outages

- a temporary outage can reduce usable workers
- `total_workers` stays constant
- `effective_workers` reflects usable workers after degradation
- `free_workers` is computed from `effective_workers`, not from the original total
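The relationship between these worker counts can be sketched as follows. The field names come from the observation contract; the exact arithmetic is an assumption consistent with the bullets above.

```python
def free_workers(total_workers, degraded_workers, running_tasks) -> int:
    # Usable capacity after outages, then subtract busy workers.
    effective = total_workers - degraded_workers
    return max(0, effective - len(running_tasks))
```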

### Hard-mode retry failures

- a running task may fail at completion
- it consumes time but does not complete
- it returns to `ready_tasks`
- `attempt_count` shows how many retry failures that task has already consumed

## Observation Contract

The main observation type is `WorkflowArenaObservation`. Important fields include:

- `current_time`
- `total_workers`
- `effective_workers`
- `degraded_workers`
- `free_workers`
- `time_budget`
- `time_remaining`
- `progress`
- `ready_tasks`
- `running_tasks`
- `completed_tasks`
- `blocked_tasks`
- `recent_failure_events`
- `last_reward_breakdown`
- `success_metrics`
- `validation_error`

Each task view includes:

- `task_id`
- `duration`
- `priority`
- `deadline`
- `criticality`
- `slack`
- `downstream_count`
- `dependencies`
- `attempt_count`
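These task fields support simple baseline policies, for example ranking ready tasks by slack and priority. This is a hypothetical greedy baseline, not the environment's recommended strategy; plain dicts stand in for the real task views.

```python
def pick_batch(ready_tasks, free_workers):
    # Dispatch the lowest-slack, highest-priority tasks first, up to
    # the number of free workers.
    ranked = sorted(ready_tasks, key=lambda t: (t["slack"], -t["priority"]))
    return [t["task_id"] for t in ranked[:free_workers]]
```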

## Expected Agent Output

Agents are expected to return compact JSON actions in one of these exact forms:

```json
{ "action_type": "wait", "task_ids": [] }
{ "action_type": "dispatch", "task_ids": ["task_01", "task_02"] }
```

Rules:

- dispatch only task ids that appear in `ready_tasks`
- do not exceed `free_workers`
- do not send duplicate ids
- `wait` actions must use an empty `task_ids` list
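The rules above can be enforced on a raw model reply before it is sent to the environment. This validator is a hypothetical client-side sketch based on the documented action shapes.

```python
import json

def validate_action(raw: str, ready_ids, free_workers) -> bool:
    # Parse the model's JSON reply and apply the action rules.
    action = json.loads(raw)
    kind, ids = action.get("action_type"), action.get("task_ids")
    if kind == "wait":
        return ids == []                      # wait must carry an empty list
    if kind == "dispatch":
        return (
            isinstance(ids, list) and len(ids) > 0
            and len(ids) == len(set(ids))     # no duplicate ids
            and len(ids) <= free_workers      # respect free capacity
            and set(ids) <= set(ready_ids)    # only ready tasks
        )
    return False                              # unknown action_type
```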

## Success Metrics

The environment reports schedule quality through `success_metrics`:

- `makespan`
- `worker_utilization`
- `deadline_miss_count`
- `unfinished_task_count`
- `weighted_priority_completion`
- `benchmark_score`

Interpretation:

- higher `benchmark_score` is better
- lower `deadline_miss_count` is better
- lower `unfinished_task_count` is better
- `makespan` is only populated when all tasks completed

## Expected Outputs for Evaluation

For benchmark use, an agent should produce:

1. a legal JSON action at every step
2. a full episode rollout until termination
3. a final observation containing the terminal score and success metrics

Typical downstream evaluation reads:

- cumulative reward
- final `benchmark_score`
- whether the agent completed all tasks
- how many deadlines were missed
- how much important work remained unfinished
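A downstream harness might condense those reads into a single record. The metric keys follow `success_metrics` above; the helper itself is a hypothetical sketch.

```python
def summarize(metrics: dict, cumulative_reward: float) -> dict:
    # Condense the final success_metrics into an evaluation record.
    return {
        "reward": cumulative_reward,
        "score": metrics["benchmark_score"],
        "completed_all": metrics["unfinished_task_count"] == 0,
        "deadlines_missed": metrics["deadline_miss_count"],
    }
```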

## Benchmarks

Results come from a verified, self-contained inference run using:

1. qwen/qwen3.5-9b

Results:

| Preset | Success | Steps | Score |
|--------|---------|-------|-------|
| easy   | true    | 11    | 0.952 |
| medium | true    | 20    | 0.945 |
| hard   | true    | 45    | 0.652 |

## Local Development

Validate the environment:

```shell
.venv/bin/python -m openenv.cli.__main__ validate workflow_arena
```

Run the server locally:

```shell
cd workflow_arena
uv run --project . server
```