Update README.md
Browse files
README.md
CHANGED
|
@@ -7,206 +7,132 @@ metrics:
|
|
| 7 |
base_model:
|
| 8 |
- Qwen/Qwen3-VL-8B-Instruct
|
| 9 |
pipeline_tag: image-text-to-text
|
| 10 |
-
library_name: transformers
|
| 11 |
tags:
|
| 12 |
- toolcua
|
| 13 |
- VLM
|
| 14 |
-
-
|
| 15 |
- OS-Agent
|
| 16 |
- GUI
|
| 17 |
- MLLM-Agent
|
| 18 |
---
|
| 19 |
-
# Model Card for Model ID
|
| 20 |
|
| 21 |
-
<
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
<
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
-
|
| 34 |
-
|
| 35 |
-
-
|
| 36 |
-
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
#
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
-
|
| 46 |
-
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
#
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
#
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
<
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
<
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
##
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
### Results
|
| 141 |
-
|
| 142 |
-
[More Information Needed]
|
| 143 |
-
|
| 144 |
-
#### Summary
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
## Model Examination [optional]
|
| 149 |
-
|
| 150 |
-
<!-- Relevant interpretability work for the model goes here -->
|
| 151 |
-
|
| 152 |
-
[More Information Needed]
|
| 153 |
-
|
| 154 |
-
## Environmental Impact
|
| 155 |
-
|
| 156 |
-
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
|
| 157 |
-
|
| 158 |
-
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
|
| 159 |
-
|
| 160 |
-
- **Hardware Type:** [More Information Needed]
|
| 161 |
-
- **Hours used:** [More Information Needed]
|
| 162 |
-
- **Cloud Provider:** [More Information Needed]
|
| 163 |
-
- **Compute Region:** [More Information Needed]
|
| 164 |
-
- **Carbon Emitted:** [More Information Needed]
|
| 165 |
-
|
| 166 |
-
## Technical Specifications [optional]
|
| 167 |
-
|
| 168 |
-
### Model Architecture and Objective
|
| 169 |
-
|
| 170 |
-
[More Information Needed]
|
| 171 |
-
|
| 172 |
-
### Compute Infrastructure
|
| 173 |
-
|
| 174 |
-
[More Information Needed]
|
| 175 |
-
|
| 176 |
-
#### Hardware
|
| 177 |
-
|
| 178 |
-
[More Information Needed]
|
| 179 |
-
|
| 180 |
-
#### Software
|
| 181 |
-
|
| 182 |
-
[More Information Needed]
|
| 183 |
-
|
| 184 |
-
## Citation [optional]
|
| 185 |
-
|
| 186 |
-
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
|
| 187 |
-
|
| 188 |
-
**BibTeX:**
|
| 189 |
-
|
| 190 |
-
[More Information Needed]
|
| 191 |
-
|
| 192 |
-
**APA:**
|
| 193 |
-
|
| 194 |
-
[More Information Needed]
|
| 195 |
-
|
| 196 |
-
## Glossary [optional]
|
| 197 |
-
|
| 198 |
-
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
|
| 199 |
-
|
| 200 |
-
[More Information Needed]
|
| 201 |
-
|
| 202 |
-
## More Information [optional]
|
| 203 |
-
|
| 204 |
-
[More Information Needed]
|
| 205 |
-
|
| 206 |
-
## Model Card Authors [optional]
|
| 207 |
-
|
| 208 |
-
[More Information Needed]
|
| 209 |
-
|
| 210 |
-
## Model Card Contact
|
| 211 |
-
|
| 212 |
-
[More Information Needed]
|
|
|
|
| 7 |
base_model:
|
| 8 |
- Qwen/Qwen3-VL-8B-Instruct
|
| 9 |
pipeline_tag: image-text-to-text
|
| 10 |
+
library_name: transformers/
|
| 11 |
tags:
|
| 12 |
- toolcua
|
| 13 |
- VLM
|
| 14 |
+
- Computer-Use-Agent
|
| 15 |
- OS-Agent
|
| 16 |
- GUI
|
| 17 |
- MLLM-Agent
|
| 18 |
---
|
|
|
|
| 19 |
|
| 20 |
+
<h1 style="
|
| 21 |
+
font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Helvetica,Arial,sans-serif;
|
| 22 |
+
font-size:48px;
|
| 23 |
+
font-weight:700;
|
| 24 |
+
line-height:1.25;
|
| 25 |
+
text-align:center;
|
| 26 |
+
margin:0 0 24px;">
|
| 27 |
+
ToolCUA-8B
|
| 28 |
+
</h1>
|
| 29 |
+
|
| 30 |
+
<div style="
|
| 31 |
+
display:flex;
|
| 32 |
+
justify-content:center;
|
| 33 |
+
gap:12px;
|
| 34 |
+
flex-wrap:wrap;
|
| 35 |
+
margin-bottom:28px;">
|
| 36 |
+
|
| 37 |
+
<a href="https://x-plug.github.io/ToolCUA/" style="
|
| 38 |
+
display:inline-block;
|
| 39 |
+
padding:8px 24px;
|
| 40 |
+
background:#2b2b2b;
|
| 41 |
+
color:#ffffff;
|
| 42 |
+
border-radius:36px;
|
| 43 |
+
text-decoration:none;
|
| 44 |
+
font-weight:600;
|
| 45 |
+
font-size:16px;">
|
| 46 |
+
🌐 Website
|
| 47 |
+
</a>
|
| 48 |
+
|
| 49 |
+
<a href="https://arxiv.org/abs/2508.09123" style="
|
| 50 |
+
display:inline-block;
|
| 51 |
+
padding:8px 24px;
|
| 52 |
+
background:#2b2b2b;
|
| 53 |
+
color:#ffffff;
|
| 54 |
+
border-radius:36px;
|
| 55 |
+
text-decoration:none;
|
| 56 |
+
font-weight:600;
|
| 57 |
+
font-size:16px;">
|
| 58 |
+
📑 Paper
|
| 59 |
+
</a>
|
| 60 |
+
|
| 61 |
+
<a href="https://github.com/X-PLUG/ToolCUA" style="
|
| 62 |
+
display:inline-block;
|
| 63 |
+
padding:8px 24px;
|
| 64 |
+
background:#2b2b2b;
|
| 65 |
+
color:#ffffff;
|
| 66 |
+
border-radius:36px;
|
| 67 |
+
text-decoration:none;
|
| 68 |
+
font-weight:600;
|
| 69 |
+
font-size:16px;">
|
| 70 |
+
💻 Code
|
| 71 |
+
</a>
|
| 72 |
+
|
| 73 |
+
</div>
|
| 74 |
+
|
| 75 |
+
ToolCUA-8B is an end-to-end computer-use agent for orchestrating GUI actions and structured tool calls. It learns when to continue through GUI interaction, when to invoke tools, and when to switch back, enabling shorter and more reliable desktop task trajectories.
|
| 76 |
+
|
| 77 |
+
<p align="center">
|
| 78 |
+
<img src="https://github.com/X-PLUG/ToolCUA/raw/main/assets/main_teaser.png" width="760" alt="ToolCUA teaser">
|
| 79 |
+
</p>
|
| 80 |
+
|
| 81 |
+
## Method
|
| 82 |
+
|
| 83 |
+
ToolCUA uses a staged training pipeline for GUI-Tool path selection:
|
| 84 |
+
|
| 85 |
+
1. Scale interleaved GUI-Tool trajectories from existing GUI-only data via trajectory-aware tool synthesis.
|
| 86 |
+
2. Apply Tool-Bootstrapped GUI RFT to learn tool-calling behavior and calibrate local switching decisions.
|
| 87 |
+
3. Optimize with Online Agentic RL in a GUI-Tool environment using a Tool-Efficient Path Reward for appropriate tool use and shorter execution paths.
|
| 88 |
+
|
| 89 |
+
<p align="center">
|
| 90 |
+
<img src="https://github.com/X-PLUG/ToolCUA/raw/main/assets/method_overview.png" width="760" alt="ToolCUA method overview">
|
| 91 |
+
</p>
|
| 92 |
+
|
| 93 |
+
## Results
|
| 94 |
+
|
| 95 |
+
On feasible OSWorld-MCP tasks, ToolCUA-8B reaches **46.85%** overall accuracy, **24.32%** Tool Invocation Rate (TIR), and **14.93** Average Completion Steps (ACS). Compared with Qwen3-VL-8B-Instruct, it improves accuracy by **+18.62**, improves TIR by **+15.91**, and reduces ACS by **4.41** steps.
|
| 96 |
+
|
| 97 |
+
<p align="center">
|
| 98 |
+
<img src="https://github.com/X-PLUG/ToolCUA/raw/main/assets/main_results.png" width="760" alt="ToolCUA main results">
|
| 99 |
+
</p>
|
| 100 |
+
|
| 101 |
+
<p align="center">
|
| 102 |
+
<img src="https://github.com/X-PLUG/ToolCUA/raw/main/assets/app_results.png" width="760" alt="ToolCUA application results">
|
| 103 |
+
</p>
|
| 104 |
+
|
| 105 |
+
## vLLM Serve
|
| 106 |
+
|
| 107 |
+
We recommend vLLM for deployment. Use `vllm>=0.12.0` and enable `--trust-remote-code`.
|
| 108 |
+
|
| 109 |
+
```bash
|
| 110 |
+
MAX_IMAGE=${MAX_IMAGE:-5}
|
| 111 |
+
IMAGE_LIMIT_ARGS='{"image": '"$MAX_IMAGE"'}'
|
| 112 |
+
PIXEL_ARGS='{"size": {"longest_edge": 3072000, "shortest_edge": 65536}}'
|
| 113 |
+
|
| 114 |
+
vllm serve X-PLUG/ToolCUA-8B \
|
| 115 |
+
--trust-remote-code \
|
| 116 |
+
--max-model-len 32768 \
|
| 117 |
+
--mm-processor-kwargs "$PIXEL_ARGS" \
|
| 118 |
+
--limit-mm-per-prompt "$IMAGE_LIMIT_ARGS" \
|
| 119 |
+
--tensor-parallel-size 1 \
|
| 120 |
+
--allowed-local-media-path '/' \
|
| 121 |
+
--port 4243 \
|
| 122 |
+
--gpu-memory-utilization 0.85 \
|
| 123 |
+
--mm-processor-cache-gb 0 \
|
| 124 |
+
--no-enable-prefix-caching \
|
| 125 |
+
--enforce-eager \
|
| 126 |
+
--max-logprobs 50
|
| 127 |
+
```
|
| 128 |
+
|
| 129 |
+
## Citation
|
| 130 |
+
|
| 131 |
+
```bibtex
|
| 132 |
+
@article{hu2026toolcua,
|
| 133 |
+
title={ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents},
|
| 134 |
+
author={Hu, Xuhao and Zhang, Xi and Xu, Haiyang and Qiao, Kyle and Yang, Jingyi and Huang, Xuanjing and Shao, Jing and Yan, Ming and Ye, Jieping},
|
| 135 |
+
journal={arXiv preprint arXiv:2508.09123},
|
| 136 |
+
year={2026}
|
| 137 |
+
}
|
| 138 |
+
```
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|