File size: 4,001 Bytes
1c73813
 
 
 
 
 
 
 
 
ede4899
1c73813
 
 
ede4899
1c73813
 
 
d7937ef
 
336fd86
 
ede4899
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8169c7f
ede4899
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8169c7f
 
 
 
 
 
d7937ef
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
---
license: mit
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-VL-8B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers/
tags:
- toolcua
- VLM
- Computer-Use-Agent
- OS-Agent
- GUI
- MLLM-Agent
- agentic-rl
- sandbox-rl
---

<h1 style="
  font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Helvetica,Arial,sans-serif;
  font-size:48px;
  font-weight:700;
  line-height:1.25;
  text-align:center;
  margin:0 0 24px;">
  ToolCUA-8B
</h1>

<div style="
  display:flex;
  justify-content:center;
  gap:12px;
  flex-wrap:wrap;
  margin-bottom:28px;">

  <a href="https://x-plug.github.io/ToolCUA/" style="
     display:inline-block;
     padding:8px 24px;
     background:#2b2b2b;
     color:#ffffff;
     border-radius:36px;
     text-decoration:none;
     font-weight:600;
     font-size:16px;">
    🌐 Website
  </a>

  <a href="https://arxiv.org/abs/2605.12481" style="
     display:inline-block;
     padding:8px 24px;
     background:#2b2b2b;
     color:#ffffff;
     border-radius:36px;
     text-decoration:none;
     font-weight:600;
     font-size:16px;">
    📑 Paper 
  </a>

  <a href="https://github.com/X-PLUG/ToolCUA" style="
     display:inline-block;
     padding:8px 24px;
     background:#2b2b2b;
     color:#ffffff;
     border-radius:36px;
     text-decoration:none;
     font-weight:600;
     font-size:16px;">
    💻 Code
  </a>

</div>

ToolCUA-8B is an end-to-end computer-use agent for orchestrating GUI actions and structured tool calls. It learns when to continue through GUI interaction, when to invoke tools, and when to switch back, enabling shorter and more reliable desktop task trajectories.

<p align="center">
  <img src="https://github.com/X-PLUG/ToolCUA/raw/main/assets/main_teaser.png" width="760" alt="ToolCUA teaser">
</p>

## Method

ToolCUA uses a staged training pipeline for GUI-Tool path selection:

1. Scale interleaved GUI-Tool trajectories from existing GUI-only data via trajectory-aware tool synthesis.
2. Apply Tool-Bootstrapped GUI RFT to learn tool-calling behavior and calibrate local switching decisions.
3. Optimize with Online Agentic RL in a GUI-Tool environment using a Tool-Efficient Path Reward for appropriate tool use and shorter execution paths.

<p align="center">
  <img src="https://github.com/X-PLUG/ToolCUA/raw/main/assets/method_overview.png" width="760" alt="ToolCUA method overview">
</p>

## Results

On feasible OSWorld-MCP tasks, ToolCUA-8B reaches **46.85%** overall accuracy, **24.32%** Tool Invocation Rate (TIR), and **14.93** Average Completion Steps (ACS). Compared with Qwen3-VL-8B-Instruct, it improves accuracy by **+18.62**, improves TIR by **+15.91**, and reduces ACS by **4.41** steps.

<p align="center">
  <img src="https://github.com/X-PLUG/ToolCUA/raw/main/assets/main_results.png" width="760" alt="ToolCUA main results">
</p>

<p align="center">
  <img src="https://github.com/X-PLUG/ToolCUA/raw/main/assets/app_results.png" width="760" alt="ToolCUA application results">
</p>

## vLLM Serve

We recommend vLLM for deployment. Use `vllm>=0.12.0` and enable `--trust-remote-code`.

```bash
MAX_IMAGE=${MAX_IMAGE:-5}
IMAGE_LIMIT_ARGS='{"image": '"$MAX_IMAGE"'}'
PIXEL_ARGS='{"size": {"longest_edge": 3072000, "shortest_edge": 65536}}'

vllm serve X-PLUG/ToolCUA-8B \
  --trust-remote-code \
  --max-model-len 32768 \
  --mm-processor-kwargs "$PIXEL_ARGS" \
  --limit-mm-per-prompt "$IMAGE_LIMIT_ARGS" \
  --tensor-parallel-size 1 \
  --allowed-local-media-path '/' \
  --port 4243 \
  --gpu-memory-utilization 0.85 \
  --mm-processor-cache-gb 0 \
  --no-enable-prefix-caching \
  --enforce-eager \
  --max-logprobs 50
```

## Citation

```bibtex
@article{hu2026toolcua,
  title={ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents},
  author={Hu, Xuhao and Zhang, Xi and Xu, Haiyang and Qiao, Kyle and Yang, Jingyi and Huang, Xuanjing and Shao, Jing and Yan, Ming and Ye, Jieping},
  journal={arXiv preprint arXiv:2605.12481},
  year={2026}
}
```