---
title: WorkflowArena
emoji: 🏗️
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
models:
  - qwen/qwen3.5-9b
app_port: 8000
base_path: /
tags:
  - openenv
  - workflow-orchestration
  - reinforcement-learning
---
# WorkflowArena
WorkflowArena is an OpenEnv benchmark for scheduling dependency-constrained work on limited workers. Each episode is a seeded workflow DAG. The agent must decide when to dispatch ready tasks, when to wait, and how to trade off deadline pressure, worker utilization, critical-path protection, and unfinished work.
## Problem
This environment models a common orchestration problem:
- tasks have dependencies, so not everything can start immediately
- workers are limited, so not every ready task can run at once
- deadlines and priorities are uneven, so the obvious greedy move is not always best
- higher difficulties add time pressure and failure dynamics
The action space is intentionally small:

- `dispatch(task_ids=[...])`
- `wait()`

That keeps the challenge focused on decision quality rather than action syntax.
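Both action forms serialize to compact JSON. A minimal sketch of builder helpers — the helper names `dispatch` and `wait` are illustrative, but the JSON shapes match the Expected Agent Output section of this README:

```python
import json

def dispatch(task_ids):
    """Build a dispatch action for a batch of ready task ids."""
    return {"action_type": "dispatch", "task_ids": list(task_ids)}

def wait():
    """Build a wait action; task_ids must be empty for wait."""
    return {"action_type": "wait", "task_ids": []}

print(json.dumps(dispatch(["task_01", "task_02"])))
print(json.dumps(wait()))
```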
## Episode Loop
- `reset()` generates a deterministic episode from `preset`, `seed`, and `worker_count`.
- The observation exposes ready, running, blocked, and completed tasks plus planner hints.
- The agent either dispatches a legal batch of ready tasks or waits for the next completion event.
- Time advances only on `wait()`.
- The episode ends when:
  - all tasks complete, or
  - the preset time budget is exhausted, or
  - the safety step limit is hit
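This loop admits a simple greedy baseline: dispatch as many ready tasks as free workers allow, otherwise wait. A sketch of such a policy, assuming the observation is available as a dict carrying `ready_tasks` (a list of task views with a `task_id` field) and `free_workers` — the dict shape is an assumption, not the environment's actual Python API:

```python
def choose_action(obs):
    """Greedy baseline policy: fill free capacity with ready tasks,
    otherwise wait for the next completion event (time only advances on wait)."""
    ready = obs["ready_tasks"]
    free = obs["free_workers"]
    if ready and free > 0:
        batch = [task["task_id"] for task in ready[:free]]
        return {"action_type": "dispatch", "task_ids": batch}
    return {"action_type": "wait", "task_ids": []}
```

On `medium` and `hard` presets this baseline is deliberately weak: it ignores deadlines and criticality, which is exactly the tradeoff the environment is designed to expose.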
## Difficulty Presets
### easy
- smaller DAGs
- softer deadlines
- no fixed time budget
- no failure events
This is the baseline teaching mode. Good play mostly means keeping workers busy and avoiding obviously bad waits.
### medium
- larger DAGs
- tighter deadlines
- fixed episode time budget
- terminal penalty for unfinished work
This is where the environment becomes a real tradeoff problem. The agent may not be able to finish everything, so it must decide what is worth finishing before time runs out.
### hard
- denser DAGs
- tighter deadlines
- tighter time budget than `medium`
- temporary worker outages
- task retry failures
In hard mode, usable capacity can shrink temporarily and a task may fail at completion and return to the ready queue.
## Rewards
WorkflowArena uses shaped rewards so local decisions have immediate feedback, while terminal scoring still matters.
### Per-step reward channels
The observation exposes `last_reward_breakdown` with these channels:

- `completion_reward`: reward for tasks that finished on the latest `wait()`
- `utilization_reward`: reward for keeping workers occupied
- `deadline_reward`: positive for on-time completion, negative for lateness
- `criticality_reward`: reward for progress on high-impact work
- `idle_penalty`: penalty for avoidable waiting or leaving useful capacity idle
- `invalid_action_penalty`: penalty for malformed or infeasible actions
- `terminal_makespan_score`: terminal efficiency score at episode end
- `unfinished_task_penalty`: terminal penalty for incomplete work when the episode ends before all tasks finish
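Assuming the breakdown arrives as a flat mapping from channel name to float (an assumption about the shape, not confirmed by this README), the scalar step reward can be recovered by summing the channels, treating absent channels as zero:

```python
CHANNELS = [
    "completion_reward", "utilization_reward", "deadline_reward",
    "criticality_reward", "idle_penalty", "invalid_action_penalty",
    "terminal_makespan_score", "unfinished_task_penalty",
]

def total_step_reward(breakdown):
    """Sum all reward channels; channels absent on this step count as 0."""
    return sum(breakdown.get(channel, 0.0) for channel in CHANNELS)
```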
### Reward design intent
The reward is set up to encourage:
- filling worker capacity when good work is available
- respecting deadlines
- protecting high-priority and critical-path tasks
- avoiding pointless waits
- finishing as much important work as possible before the time budget expires
The terminal score is bounded and deterministic. Higher values correspond to stronger schedules.
## Failures and Constraints
The environment keeps the action space fixed, but higher presets change the transition dynamics.
### Capacity constraint
- `dispatch(task_ids=[...])` cannot exceed current free capacity
- only tasks in `ready_tasks` are legal to dispatch
### Hard-mode worker outages
- a temporary outage can reduce usable workers
- `total_workers` stays constant
- `effective_workers` reflects usable workers after degradation
- `free_workers` is computed from `effective_workers`, not from the original total
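Under these definitions, free capacity during an outage can be sketched as a pure function of the published counts. This is an illustration of the relationship, not the environment's internal code; the `running_count` parameter is a hypothetical stand-in for the number of currently running tasks:

```python
def compute_free_workers(total_workers, degraded_workers, running_count):
    """Free capacity derives from effective workers, not the total:
    a temporary outage shrinks what can be dispatched right now."""
    effective_workers = total_workers - degraded_workers
    return max(0, effective_workers - running_count)
```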
### Hard-mode retry failures
- a running task may fail at completion
- it consumes time but does not complete
- it returns to `ready_tasks`
- `attempt_count` shows how many retry failures that task has already consumed
## Observation Contract
The main observation type is `WorkflowArenaObservation`.
Important fields include:
- `current_time`
- `total_workers`
- `effective_workers`
- `degraded_workers`
- `free_workers`
- `time_budget`
- `time_remaining`
- `progress`
- `ready_tasks`
- `running_tasks`
- `completed_tasks`
- `blocked_tasks`
- `recent_failure_events`
- `last_reward_breakdown`
- `success_metrics`
- `validation_error`
Each task view includes:
- `task_id`
- `duration`
- `priority`
- `deadline`
- `criticality`
- `slack`
- `downstream_count`
- `dependencies`
- `attempt_count`
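These per-task fields support simple prioritization heuristics. A sketch of one such ordering (tightest slack first, ties broken by higher criticality, then larger downstream fan-out), assuming each task view is readable as a dict keyed by the field names above:

```python
def rank_ready_tasks(ready_tasks):
    """Order dispatch candidates: least slack first, then higher
    criticality, then more downstream dependents."""
    return sorted(
        ready_tasks,
        key=lambda t: (t["slack"], -t["criticality"], -t["downstream_count"]),
    )
```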
## Expected Agent Output
Agents are expected to return compact JSON actions in one of these exact forms:
```json
{ "action_type": "wait", "task_ids": [] }
{ "action_type": "dispatch", "task_ids": ["task_01", "task_02"] }
```
Rules:

- dispatch only task ids that appear in `ready_tasks`
- do not exceed `free_workers`
- do not send duplicate ids
- `wait()` must use an empty `task_ids` list
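A client can pre-check these rules before sending an action and fall back to a wait instead of eating the `invalid_action_penalty`. A minimal validator sketch (the helper name and dict shapes are illustrative):

```python
def is_legal_action(action, ready_ids, free_workers):
    """Check the rules above: wait requires empty task_ids; dispatch
    requires unique ids, all ready, fitting within free capacity."""
    ids = action["task_ids"]
    if action["action_type"] == "wait":
        return ids == []
    if action["action_type"] != "dispatch":
        return False
    return (len(ids) == len(set(ids))
            and len(ids) <= free_workers
            and all(i in ready_ids for i in ids))
```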
## Success Metrics
The environment reports schedule quality through `success_metrics`:
- `makespan`
- `worker_utilization`
- `deadline_miss_count`
- `unfinished_task_count`
- `weighted_priority_completion`
- `benchmark_score`
Interpretation:

- higher `benchmark_score` is better
- lower `deadline_miss_count` is better
- lower `unfinished_task_count` is better
- `makespan` is only populated when everything completed
## Expected Outputs for Evaluation
For benchmark use, an agent should produce:
- a legal JSON action at every step
- a full episode rollout until termination
- a final observation containing the terminal score and success metrics
Typical downstream evaluation reads:
- cumulative reward
- final `benchmark_score`
- whether the agent completed all tasks
- how many deadlines were missed
- how much important work remained unfinished
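Putting those reads together, a downstream harness might reduce a finished episode to one summary record. A sketch assuming `success_metrics` is available as a flat dict with the keys listed under Success Metrics (the function name and output shape are illustrative):

```python
def summarize_episode(success_metrics, cumulative_reward):
    """Collapse a terminal observation into the values evaluation reads."""
    return {
        "benchmark_score": success_metrics["benchmark_score"],
        "completed_all": success_metrics["unfinished_task_count"] == 0,
        "deadline_misses": success_metrics["deadline_miss_count"],
        "unfinished_tasks": success_metrics["unfinished_task_count"],
        "cumulative_reward": cumulative_reward,
    }
```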
## Benchmarks
Verified self-contained inference run using `qwen/qwen3.5-9b`.
Results:
| Preset | Success | Steps | Score |
|---|---|---|---|
| easy | true | 11 | 0.952 |
| medium | true | 20 | 0.945 |
| hard | true | 45 | 0.652 |
## Local Development
Validate the environment:
```bash
.venv/bin/python -m openenv.cli.__main__ validate workflow_arena
```
Run the server locally:
```bash
cd workflow_arena
uv run --project . server
```