Foreshhh committed on
Commit
e08f4aa
·
verified ·
1 Parent(s): 9f6d88b

Create README.md

<h1 style="
font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Helvetica,Arial,sans-serif;
font-size:48px;
font-weight:700;
line-height:1.25;
text-align:center;
margin:0 0 24px;">
OpenCUA-7B
</h1>

<div style="
display:flex;
justify-content:center;
gap:12px;
flex-wrap:wrap;
margin-bottom:28px;">

<a href="https://opencua.xlang.ai/" style="
display:inline-block;
padding:8px 24px;
background:#2b2b2b;
color:#ffffff;
border-radius:36px;
text-decoration:none;
font-weight:600;
font-size:16px;">
🌐 Website
</a>

<a href="https://arxiv.org/abs/2508.09123" style="
display:inline-block;
padding:8px 24px;
background:#2b2b2b;
color:#ffffff;
border-radius:36px;
text-decoration:none;
font-weight:600;
font-size:16px;">
📝 Paper
</a>

<a href="https://github.com/xlang-ai/OpenCUA" style="
display:inline-block;
padding:8px 24px;
background:#2b2b2b;
color:#ffffff;
border-radius:36px;
text-decoration:none;
font-weight:600;
font-size:16px;">
💻 Code
</a>
</div>

<div style="max-width:900px;margin:0 auto;">

# 🚀 vLLM Serve (Recommended)

We recommend vLLM for production deployment. It requires **vllm>=0.12.0** and the `--trust-remote-code` flag.

```bash
vllm serve xlangai/OpenCUA-7B \
  --trust-remote-code \
  --served-model-name opencua-7b \
  --host 0.0.0.0 \
  --port 8000
```

Adjust `--gpu-memory-utilization` based on your hardware configuration.
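
As a minimal sketch of how the launch command above could be assembled from Python, the helper below builds the argument list and optionally appends `--gpu-memory-utilization`, which takes the fraction of GPU memory vLLM may use. The helper name `build_serve_command` and the example value `0.85` are illustrative assumptions, not part of the OpenCUA tooling.

```python
import shlex
from typing import List, Optional


def build_serve_command(
    model: str = "xlangai/OpenCUA-7B",
    served_name: str = "opencua-7b",
    host: str = "0.0.0.0",
    port: int = 8000,
    gpu_memory_utilization: Optional[float] = None,
) -> List[str]:
    """Build the `vllm serve` argument list (hypothetical helper)."""
    cmd = [
        "vllm", "serve", model,
        "--trust-remote-code",
        "--served-model-name", served_name,
        "--host", host,
        "--port", str(port),
    ]
    if gpu_memory_utilization is not None:
        # The flag is a fraction of GPU memory, so reject values outside (0, 1].
        if not 0.0 < gpu_memory_utilization <= 1.0:
            raise ValueError("gpu_memory_utilization must be in (0, 1]")
        cmd += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    return cmd


# Print a copy-pasteable command; pass the list to subprocess.run to launch it.
print(shlex.join(build_serve_command(gpu_memory_utilization=0.85)))
```

Lowering the fraction leaves more headroom for other processes on the same GPU; raising it gives vLLM more room for its KV cache.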

---

# Introduction
<div style="
max-width: 880px; /* adjust the overall width as needed */
margin: 0 auto; /* center the container */
text-align: justify; /* key setting: justify the text */
text-justify: inter-word; /* improves justification of English text */
line-height: 1.6;">

OpenCUA models (OpenCUA-7B, OpenCUA-32B, and OpenCUA-72B) are end-to-end computer-use foundation models that produce executable actions in computer environments, with strong planning and grounding capabilities. They are based on the Qwen2.5-VL model family.

With the help of the OpenCUA framework, our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, <b>OpenCUA-72B</b> achieves an average success rate of **45.0%** on [OSWorld-Verified](https://os-world.github.io/), establishing a new state of the art (SOTA) among open-source models. OpenCUA-72B also has strong grounding ability, achieving 37.3% (SOTA) on [UI-Vision](https://arxiv.org/abs/2504.07981) and 60.8% on [ScreenSpot-Pro](https://arxiv.org/abs/2504.07981).
</div>

## 📢 Updates
- 2026-01-17: 🎉 **vLLM now fully supports OpenCUA-7B, OpenCUA-32B, and OpenCUA-72B!** Thanks to the [Meituan EvoCUA Team](https://github.com/meituan) for their contributions to vLLM integration.

- 2025-10-12: <span style="font-weight:bold">[OpenCUA-7B-exl2](https://huggingface.co/sujitvasanth/OpenCUA-7B-exl2) is now live!</span> ⚡️
Thanks to [Sujit Vasanth](https://huggingface.co/sujitvasanth) for producing a quantized **exllamav2** version of OpenCUA-7B — enabling much faster inference with lower VRAM usage.

### Key Features

- **Superior Computer-Use Capability**: Able to execute multi-step computer-use actions with effective planning and reasoning
- **Multi-OS Support**: Trained on demonstrations across Ubuntu, Windows, and macOS
- **Visual Grounding**: Strong GUI element recognition and spatial reasoning capabilities
- **Multi-Image Context**: Processes a history of up to 3 screenshots for better context understanding
- **Reflective Reasoning**: Enhanced with reflective long Chain-of-Thought that identifies errors and provides corrective reasoning
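
To illustrate the multi-image context feature, here is a minimal sketch of keeping a rolling window of the 3 most recent screenshots and packing them into one OpenAI-style user message. The helper names (`image_part`, `build_user_content`) and the placeholder bytes are assumptions for illustration; the real agent may arrange images and previous turns differently.

```python
import base64
from collections import deque
from typing import Deque, Dict, List

MAX_SCREENSHOTS = 3  # history size named in the feature list above


def image_part(png_bytes: bytes) -> Dict:
    """Wrap raw PNG bytes as an OpenAI-style image content part."""
    b64 = base64.b64encode(png_bytes).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}


def build_user_content(history: Deque[bytes], instruction: str) -> List[Dict]:
    """Place the retained screenshots (oldest first) before the instruction."""
    parts = [image_part(png) for png in history]
    parts.append({"type": "text", "text": instruction})
    return parts


# deque(maxlen=3) drops the oldest screenshot when a fourth one arrives.
history: Deque[bytes] = deque(maxlen=MAX_SCREENSHOTS)
for fake_png in (b"shot-1", b"shot-2", b"shot-3", b"shot-4"):  # placeholder bytes
    history.append(fake_png)

content = build_user_content(history, "Click on the submit button")
print(len(content))  # 3 image parts + 1 text part
```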


# Performance

### Online Agent Evaluation
OpenCUA models achieve strong performance on **[OSWorld-Verified](https://os-world.github.io/)**.
OpenCUA-72B achieves the best performance among all open-source models with an average success rate of 45.0%, establishing a new state-of-the-art (SOTA).
<div align="center">

| **Model** | **15 Steps** | **50 Steps** | **100 Steps** |
|-------------------------------|:--------:|:--------:|:---------:|
| **Proprietary** | | | |
| OpenAI CUA | 26.0 | 31.3 | 31.4 |
| Seed 1.5-VL | 27.9 | — | 34.1 |
| Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
| Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
| **Open-Source** | | | |
| Qwen 2.5-VL-32B-Instruct | 3.0 | — | 3.9 |
| Qwen 2.5-VL-72B-Instruct | 4.4 | — | 5.0 |
| Kimi-VL-A3B | 9.7 | — | 10.3 |
| UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
| UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
| OpenCUA-7B *(Ours)* | 24.3 | 27.9 | 26.6 |
| OpenCUA-32B *(Ours)* | 29.7 | 34.1 | 34.8 |
| **OpenCUA-72B *(Ours)*** | **39.0** | **44.9** | **45.0** |
</div>

*OpenCUA scores are the mean of 3 independent runs.*

### GUI Grounding Performance
<div align="center">

| **Model** | **OSWorld-G** | **ScreenSpot-V2** | **ScreenSpot-Pro** | **UI-Vision** |
|-------|-----------|---------------|----------------|----------|
| Qwen2.5-VL-7B | 31.4 | 88.8 | 27.6 | 0.85 |
| Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 | - |
| UI-TARS-72B | 57.1 | 90.3 | 38.1 | 25.5 |
| **OpenCUA-7B** | 55.3 | 92.3 | 50.0 | 29.7 |
| **OpenCUA-32B** | **59.6** | **93.4** | 55.3 | 33.3 |
| **OpenCUA-72B** | 59.2 | 92.9 | **60.8** | **37.3** |
</div>


### AgentNetBench (Offline Evaluation)
<div align="center">

| **Model** | **Coordinate Actions** | **Content Actions** | **Function Actions** | **Average** |
|-------|-------------------|-----------------|------------------|---------|
| Qwen2.5-VL-7B | 50.7 | 40.8 | 3.1 | 48.0 |
| Qwen2.5-VL-32B | 66.6 | 47.2 | 41.5 | 64.8 |
| Qwen2.5-VL-72B | 67.2 | 52.6 | 50.5 | 67.0 |
| OpenAI CUA | 71.7 | 57.3 | **80.0** | 73.1 |
| **OpenCUA-7B** | 79.0 | 62.0 | 44.3 | 75.2 |
| **OpenCUA-32B** | **81.9** | **66.1** | 55.7 | **79.1** |
</div>

# 🚀 Quick Start
<div style="border-left: 6px solid #f28c28; background: #fff8e6; padding: 12px 16px; margin: 16px 0;">
<strong>⚠️ Important for Qwen-based Models (OpenCUA-7B, OpenCUA-32B, OpenCUA-72B):</strong>

To align with our training infrastructure, we have modified the model in two places:
<ul style="margin-top: 8px;">
<li>1. Multimodal Rotary Position Embedding (M-RoPE) has been replaced with <strong>1D RoPE</strong>.</li>
<li>2. The tokenizer and chat template are the same as those used by Kimi-VL.</li>
<li>vLLM is supported via the <code>--trust-remote-code</code> flag. If you train the models, keep the tokenizer and chat template aligned with these changes.</li>
</ul>
</div>
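
For readers unfamiliar with the term, 1D RoPE indexes every token by a single scalar position, whereas M-RoPE in Qwen2.5-VL splits the position into temporal, height, and width components. The sketch below is a generic textbook formulation (rotating consecutive pairs of dimensions, with an assumed base of 10000), not the code used in the OpenCUA checkpoints. It shows the defining property: attention scores depend only on the offset between two positions.

```python
import math
from typing import List


def rope_1d(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle position * base**(-2i/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out: List[float] = []
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]   # toy query vector
k = [0.2, -1.0, 0.7, 0.4]   # toy key vector

# Both pairs of positions are 2 apart, so the scores should match.
score_near = dot(rope_1d(q, 3), rope_1d(k, 1))
score_far = dot(rope_1d(q, 10), rope_1d(k, 8))
print(abs(score_near - score_far) < 1e-9)
```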


## Installation & Download

First, install the required dependencies:

```bash
conda create -n opencua python=3.12
conda activate opencua
# Quote the requirement so the shell does not treat ">" as a redirect
pip install "openai>=1.0.0"
```

Download the model weights from Hugging Face (optional; vLLM can download them automatically):
```python
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="xlangai/OpenCUA-7B",
    local_dir="OpenCUA-7B",
    local_dir_use_symlinks=False
)
```

## 🎯 GUI Grounding

First, start the vLLM server:

```bash
vllm serve xlangai/OpenCUA-7B \
  --trust-remote-code \
  --served-model-name opencua-7b \
  --host 0.0.0.0 \
  --port 8000
```

Then run the following code to test GUI grounding:

```python
import base64
from openai import OpenAI

# vLLM server configuration
VLLM_BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "opencua-7b"  # Should match --served-model-name in vllm serve


def encode_image(image_path: str) -> str:
    """Encode image to base64 string."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode()


def run_grounding(image_path: str, instruction: str) -> str:
    """Run GUI grounding inference via vLLM."""
    client = OpenAI(base_url=VLLM_BASE_URL, api_key="EMPTY")

    system_prompt = (
        "You are a GUI agent. You are given a task and a screenshot of the screen. "
        "You need to perform a series of pyautogui actions to complete the task."
    )

    messages = [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"},
                },
                {"type": "text", "text": instruction},
            ],
        },
    ]

    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages,
        max_tokens=512,
        temperature=0,
    )

    return response.choices[0].message.content


# Example usage
image_path = "screenshot.png"
instruction = "Click on the submit button"

result = run_grounding(image_path, instruction)
print("Model output:", result)
```

<div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
<em>Expected result (raw model output):</em> <code>```python\npyautogui.click(x=1443, y=343)\n```</code>
</div>
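
Because the model replies with a fenced `pyautogui` snippet rather than bare numbers, downstream code usually needs to pull the coordinates out. The following is a minimal sketch of one way to do that with a regular expression; `parse_click` is a hypothetical helper, it only handles the `pyautogui.click(x=..., y=...)` form shown above, and it does not rescale coordinates.

```python
import re
from typing import Optional, Tuple

# Matches pyautogui.click(x=<number>, y=<number>) with optional whitespace.
CLICK_RE = re.compile(
    r"pyautogui\.click\(\s*x\s*=\s*(-?\d+(?:\.\d+)?)\s*,\s*y\s*=\s*(-?\d+(?:\.\d+)?)"
)


def parse_click(model_output: str) -> Optional[Tuple[float, float]]:
    """Return (x, y) from the first click call, or None if there is none."""
    match = CLICK_RE.search(model_output)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


fence = "`" * 3  # built programmatically to keep this example easy to embed
sample = f"{fence}python\npyautogui.click(x=1443, y=343)\n{fence}"
print(parse_click(sample))  # (1443.0, 343.0)
```

Returning `None` instead of raising lets the caller decide how to handle replies that contain a different action type.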

You can also run the grounding examples in [OpenCUA/model/inference/](https://github.com/xlang-ai/OpenCUA/blob/main/model/inference/):
```bash
cd ./model/inference/

# vLLM (requires running vLLM server first)
python vllm_inference.py

# HuggingFace Transformers
python huggingface_inference.py
```

## 🖥️ Computer Use Agent
**[OpenCUAAgent](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/opencua_agent.py)** is developed in the [OSWorld](https://github.com/xlang-ai/OSWorld) environment based on OpenCUA models. It iteratively perceives the environment via screenshots, produces reflective long CoT as inner monologue, and predicts the next action to be executed. OpenCUAAgent uses 3 images in total and the L2 CoT format by default.
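
The perceive, reason, act cycle described above can be sketched roughly as follows. This is not the OpenCUAAgent implementation: `run_episode`, its callable parameters, and the `"DONE"` terminal marker are all hypothetical, and stub functions stand in for the environment and the model so the loop can run on its own.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def run_episode(
    task: str,
    take_screenshot: Callable[[], bytes],
    predict: Callable[[str, List[bytes]], Tuple[str, str]],
    execute: Callable[[str], None],
    max_steps: int = 15,
    max_images: int = 3,
) -> List[str]:
    """Loop until the model signals completion or the step budget runs out."""
    screenshots: Deque[bytes] = deque(maxlen=max_images)  # keep the latest 3 images
    actions: List[str] = []
    for _ in range(max_steps):
        screenshots.append(take_screenshot())                 # perceive
        thought, action = predict(task, list(screenshots))    # CoT + next action
        if action == "DONE":  # hypothetical terminal marker
            break
        execute(action)                                       # act
        actions.append(action)
    return actions


# Stub callables replace the real environment and model for this sketch.
scripted = iter([
    ("I see a form with a submit button.", "pyautogui.click(x=10, y=20)"),
    ("The form was submitted.", "DONE"),
])
executed = run_episode(
    task="Submit the form",
    take_screenshot=lambda: b"fake-png",
    predict=lambda task, shots: next(scripted),
    execute=lambda action: None,
)
print(executed)  # ['pyautogui.click(x=10, y=20)']
```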

Command for running OpenCUA-7B in OSWorld:
```
python run_multienv_opencua.py \
--headless \
--observation_type screenshot \
--model OpenCUA-7B \
--re
```

Files changed (1)
1. README.md +18 -0

README.md ADDED
@@ -0,0 +1,18 @@
+ ---
+ license: mit
+ language:
+ - en
+ metrics:
+ - accuracy
+ base_model:
+ - Qwen/Qwen3-VL-8B-Instruct
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ tags:
+ - toolcua
+ - VLM
+ - Compuer-Use-Agent
+ - OS-Agent
+ - GUI
+ - MLLM-Agent
+ ---