---
title: WorkflowArena
emoji: 🏗️
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
models:
  - qwen/qwen3.5-9b
app_port: 8000
base_path: /
tags:
  - openenv
  - workflow-orchestration
  - reinforcement-learning
---

# WorkflowArena

WorkflowArena is an OpenEnv benchmark for scheduling dependency-constrained work on limited workers. Each episode is a seeded workflow DAG. The agent must decide when to dispatch ready tasks, when to wait, and how to trade off deadline pressure, worker utilization, critical-path protection, and unfinished work.

## Problem

This environment models a common orchestration problem:

- tasks have dependencies, so not everything can start immediately
- workers are limited, so not every ready task can run at once
- deadlines and priorities are uneven, so the obvious greedy move is not always best
- higher difficulties add time pressure and failure dynamics

The action space is intentionally small:

1. `dispatch(task_ids=[...])`
2. `wait()`

That keeps the challenge focused on decision quality rather than action syntax.

## Episode Loop

1. `reset()` generates a deterministic episode from `preset`, `seed`, and `worker_count`.
2. The observation exposes ready, running, blocked, and completed tasks plus planner hints.
3. The agent either dispatches a legal batch of ready tasks or waits for the next completion event.
4. Time advances only on `wait()`.
5. The episode ends when:
   - all tasks complete, or
   - the preset time budget is exhausted, or
   - the safety step limit is hit

## Difficulty Presets

### `easy`

- smaller DAGs
- softer deadlines
- no fixed time budget
- no failure events

This is the baseline teaching mode. Good play mostly means keeping workers busy and avoiding obviously bad waits.

### `medium`

- larger DAGs
- tighter deadlines
- fixed episode time budget
- terminal penalty for unfinished work

This is where the environment becomes a real tradeoff problem.
The agent may not be able to finish everything, so it must decide what is worth finishing before time runs out.

### `hard`

- denser DAGs
- tighter deadlines
- tighter time budget than `medium`
- temporary worker outages
- task retry failures

In hard mode, usable capacity can shrink temporarily, and a task may fail at completion and return to the ready queue.

## Rewards

WorkflowArena uses shaped rewards so local decisions have immediate feedback, while terminal scoring still matters.

### Per-step reward channels

The observation exposes `last_reward_breakdown` with these channels:

- `completion_reward`: reward for tasks that finished on the latest `wait()`
- `utilization_reward`: reward for keeping workers occupied
- `deadline_reward`: positive for on-time completion, negative for lateness
- `criticality_reward`: reward for progress on high-impact work
- `idle_penalty`: penalty for avoidable waiting or leaving useful capacity idle
- `invalid_action_penalty`: penalty for malformed or infeasible actions
- `terminal_makespan_score`: terminal efficiency score at episode end
- `unfinished_task_penalty`: terminal penalty for incomplete work when the episode ends before all tasks finish

### Reward design intent

The reward is set up to encourage:

- filling worker capacity when good work is available
- respecting deadlines
- protecting high-priority and critical-path tasks
- avoiding pointless waits
- finishing as much important work as possible before the time budget expires

The terminal score is bounded and deterministic. Higher values correspond to stronger schedules.

## Failures and Constraints

The environment keeps the action space fixed, but higher presets change the transition dynamics.
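These dynamics shape what a legal action looks like at each step. As a rough illustration, a dispatch batch can be built directly from the capacity and readiness fields of the observation. This is a sketch only: the plain-dict observation shape and the slack-then-priority ordering below are illustrative assumptions, not part of the environment's API.

```python
# Sketch: build a legal dispatch batch from an observation snapshot.
# Field names follow the observation contract in this README; the
# dict-based shape and the scheduling heuristic are assumptions.

def choose_dispatch(obs: dict) -> dict:
    """Return a JSON-shaped action that respects capacity and readiness."""
    free = obs["free_workers"]
    ready = obs["ready_tasks"]
    if free <= 0 or not ready:
        # Nothing can legally start, so advance time instead.
        return {"action_type": "wait", "task_ids": []}
    # Prefer low-slack (deadline-pressured) tasks, then higher priority.
    ordered = sorted(ready, key=lambda t: (t["slack"], -t["priority"]))
    batch = [t["task_id"] for t in ordered[:free]]
    return {"action_type": "dispatch", "task_ids": batch}
```

Because the batch is truncated to `free_workers` and drawn only from `ready_tasks`, the resulting action never triggers the capacity or readiness violations described below.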
### Capacity constraint

- `dispatch(task_ids=[...])` cannot exceed current free capacity
- only tasks in `ready_tasks` are legal to dispatch

### Hard-mode worker outages

- a temporary outage can reduce usable workers
- `total_workers` stays constant
- `effective_workers` reflects usable workers after degradation
- `free_workers` is computed from `effective_workers`, not from the original total

### Hard-mode retry failures

- a running task may fail at completion
- it consumes time but does not complete
- it returns to `ready_tasks`
- `attempt_count` shows how many retry failures that task has already consumed

## Observation Contract

The main observation type is [`WorkflowArenaObservation`](workflow_arena/models.py). Important fields include:

- `current_time`
- `total_workers`
- `effective_workers`
- `degraded_workers`
- `free_workers`
- `time_budget`
- `time_remaining`
- `progress`
- `ready_tasks`
- `running_tasks`
- `completed_tasks`
- `blocked_tasks`
- `recent_failure_events`
- `last_reward_breakdown`
- `success_metrics`
- `validation_error`

Each task view includes:

- `task_id`
- `duration`
- `priority`
- `deadline`
- `criticality`
- `slack`
- `downstream_count`
- `dependencies`
- `attempt_count`

## Expected Agent Output

Agents are expected to return compact JSON actions in one of these exact forms:

```json
{ "action_type": "wait", "task_ids": [] }
```

```json
{ "action_type": "dispatch", "task_ids": ["task_01", "task_02"] }
```

Rules:

- dispatch only task ids that appear in `ready_tasks`
- do not exceed `free_workers`
- do not send duplicate ids
- `wait()` must use an empty `task_ids` list

## Success Metrics

The environment reports schedule quality through `success_metrics`:

- `makespan`
- `worker_utilization`
- `deadline_miss_count`
- `unfinished_task_count`
- `weighted_priority_completion`
- `benchmark_score`

Interpretation:

- higher `benchmark_score` is better
- lower `deadline_miss_count` is better
- lower `unfinished_task_count` is better
- `makespan` is only populated when all tasks complete

## Expected Outputs for Evaluation

For benchmark use, an agent should produce:

1. a legal JSON action at every step
2. a full episode rollout until termination
3. a final observation containing the terminal score and success metrics

Typical downstream evaluation reads:

- cumulative reward
- final `benchmark_score`
- whether the agent completed all tasks
- how many deadlines were missed
- how much important work remained unfinished

## Benchmarks

Verified self-contained inference run using:

1. `qwen/qwen3.5-9b`

Results:

| Preset   | Success | Steps | Score   |
| -------- | ------- | ----- | ------- |
| `easy`   | `true`  | `11`  | `0.952` |
| `medium` | `true`  | `20`  | `0.945` |
| `hard`   | `true`  | `45`  | `0.652` |

## Local Development

Validate the environment:

```bash
.venv/bin/python -m openenv.cli.__main__ validate workflow_arena
```

Run the server locally:

```bash
cd workflow_arena
uv run --project . server
```