---
license_link: https://www.apache.org/licenses/LICENSE-2.0
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-generation
library_name: transformers
tags:
- WebWorld
- web-agent
- world-model
- simulator
- browser
- a11y
- html
- xml
- markdown
- long-horizon
- long-context
- synthetic-trajectories
- instruction-tuning
base_model_relation: finetune
base_model:
- Qwen/Qwen3-8B
datasets:
- Qwen/WebWorldData
---

# WebWorld

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0) 
[![GitHub](https://img.shields.io/badge/GitHub-WebWorld-4b32c3?logo=github)](https://github.com/QwenLM/WebWorld) 
[![Dataset](https://img.shields.io/badge/HF%20Dataset-WebWorldData-yellow?logo=huggingface)](https://huggingface.co/datasets/Qwen/WebWorldData) 
[![MS Dataset](https://img.shields.io/badge/ModelScope-Dataset-7B42BC)](https://modelscope.cn/datasets/Qwen/WebWorldData) 
[![8B](https://img.shields.io/badge/Model-8B-green?logo=huggingface)](https://huggingface.co/Qwen/WebWorld-8B) 
[![MS 8B](https://img.shields.io/badge/ModelScope-8B-7B42BC)](https://modelscope.cn/models/Qwen/WebWorld-8B) 
[![14B](https://img.shields.io/badge/Model-14B-green?logo=huggingface)](https://huggingface.co/Qwen/WebWorld-14B) 
[![MS 14B](https://img.shields.io/badge/ModelScope-14B-7B42BC)](https://modelscope.cn/models/Qwen/WebWorld-14B) 
[![32B](https://img.shields.io/badge/Model-32B-green?logo=huggingface)](https://huggingface.co/Qwen/WebWorld-32B) 
[![MS 32B](https://img.shields.io/badge/ModelScope-32B-7B42BC)](https://modelscope.cn/models/Qwen/WebWorld-32B)


## 📚 Introduction

**WebWorld** is a large-scale **open-web world model** series for training and evaluating web agents. It is trained on **1M+ real-world web interaction trajectories** via a scalable hierarchical data pipeline, supporting:

- **Long-horizon simulation** (30+ steps)
- **Multi-format state representations**: A11y Tree, HTML, XML, Markdown, and natural language
- **CoT-activated reasoning** for transition prediction
- **Cross-domain generalization** to code, GUI, and game environments

Agents trained on WebWorld-synthesized trajectories improve their success rate by **+9.9 points on MiniWob++** and **+10.9 points on WebArena** (absolute gains over the Qwen3-8B base). When used for inference-time lookahead search, WebWorld **outperforms GPT-5** as a world model.

## 🎯 Model Series

| Model | Base Model | HuggingFace Link | ModelScope Link |
|---|---|---|---|
| **WebWorld-8B** | [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) | [🤗 HuggingFace](https://huggingface.co/Qwen/WebWorld-8B) | [🤖 ModelScope](https://modelscope.cn/models/Qwen/WebWorld-8B) |
| **WebWorld-14B** | [Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B) | [🤗 HuggingFace](https://huggingface.co/Qwen/WebWorld-14B) | [🤖 ModelScope](https://modelscope.cn/models/Qwen/WebWorld-14B) |
| **WebWorld-32B** | [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) | [🤗 HuggingFace](https://huggingface.co/Qwen/WebWorld-32B) | [🤖 ModelScope](https://modelscope.cn/models/Qwen/WebWorld-32B) |

**WebWorldData**: [Hugging Face: Qwen/WebWorldData](https://huggingface.co/datasets/Qwen/WebWorldData), [ModelScope: Qwen/WebWorldData](https://modelscope.cn/datasets/Qwen/WebWorldData)

💡 **Recommendation**: Use 8B for fast simulation and data synthesis; use 14B/32B for higher-fidelity simulation and better long-horizon robustness. For best results in a specific environment, we recommend task-specific fine-tuning on in-domain trajectories.

## ๐Ÿ› ๏ธ Requirements

- `transformers` (recommended: latest version)
- `torch`
- Optional: `accelerate`, `vllm` for efficient serving

## 🚀 Quick Start

**Key Notes:**
- WebWorld predicts the next page state given the current state and an action.
- It strictly preserves the input/output format (A11y / HTML / XML / Markdown / NL).
- It supports multi-turn trajectory simulation over 30+ steps.

### Single-Step Prediction

<details>
<summary>💻 Click to expand code</summary>

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Qwen/WebWorld-8B"  # or WebWorld-14B, WebWorld-32B
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()

system_prompt = (
    "You are a web world model. I will provide you with an initial page state "
    "and a sequence of actions. For each action, predict the resulting page state.\n"
    "Strictly maintain the original format. Output only the full page state "
    "without explanations, code, or truncation."
)

current_state = """RootWebArea 'Global Start - Your Daily Portal', focused
\t[1] banner 'Top Header', visible
\t\t[2] link 'Set as Homepage', clickable, visible
\t\t[3] link 'Feedback', clickable, visible
\t\t[5] region 'Weather Widget', visible
\t\t\tStaticText 'New York, USA'
\t\t\t[6] image 'Sunny', visible
\t\t\tStaticText '24°C'
\t\t[8] link 'Sign In', clickable, visible
\t[10] region 'Search Area', visible
\t\t[11] image 'Global Start Logo', visible
\t\tStaticText 'Search the entire web'
\t\t[12] tablist 'Search Engine Selector', orientation='horizontal'
\t\t\t[13] tab 'Google', selected=True, clickable
\t\t\t[14] tab 'Bing', selected=False, clickable
\t\t\t[15] tab 'DuckDuckGo', selected=False, clickable
\t\t[18] combobox 'Web Search', clickable, visible, autocomplete='both', expanded=False
\t\t\t[19] textbox 'Type keywords or URL...', clickable, visible, editable, value=''
\t\t[20] button 'Search', clickable, visible
\t[30] navigation 'Category Bar', visible
\t\t[31] link 'Home', clickable, selected=True
\t\t[32] link 'News', clickable
\t\t[33] link 'Video', clickable
\t\t[34] link 'Shopping', clickable
\t\t[35] link 'Social', clickable
\t[50] main 'Site Directory', visible
\t\t[51] region 'Top Recommended', visible
\t\t\t[52] heading 'Most Popular', visible
\t\t\t[53] list 'Top Sites Grid', visible
\t\t\t\t[54] link 'Facebook', clickable
\t\t\t\t[56] link 'YouTube', clickable
\t\t\t\t[58] link 'Amazon', clickable
\t\t\t\t[60] link 'Twitter / X', clickable
\t\t\t\t[62] link 'Instagram', clickable
\t\t\t\t[64] link 'Wikipedia', clickable
\t\t\t\t[66] link 'Netflix', clickable
\t\t\t\t[68] link 'LinkedIn', clickable
\t\t[80] region 'News & Media', visible
\t\t\t[81] heading 'Latest News', visible
\t\t\t[82] link 'CNN', clickable
\t\t\t[83] link 'BBC', clickable
\t\t\t[84] link 'The Verge', clickable
\t\t[90] region 'Shopping', visible
\t\t\t[91] heading 'E-Commerce', visible
\t\t\t[92] link 'eBay', clickable
\t\t\t[93] link 'Walmart', clickable
\t\t\t[94] link 'Best Buy', clickable
\t[200] complementary 'Ads', visible
\t\t[201] image 'Ad: Travel to Japan'
\t\t[202] link 'Book Now', clickable
\t[300] contentinfo 'Footer', visible
\t\tStaticText '© 2026 Global Start Inc.'"""

user_message = (
    f"Initial Page State:\n{current_state}\n\n"
    f"First Action: 'click([32])'\n\n"
    f"Next Page State:"
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_message},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=4096,
        do_sample=False,
    )

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
```

</details>

### Multi-Turn Simulation

The first turn provides the initial state and first action. Each subsequent turn uses a fixed continuation prompt:

<details>
<summary>💻 Click to expand code</summary>

```python
CONTINUE_PROMPT = (
    "Continue the trajectory. Given the previous state, "
    "predict the next page state after this action.\n\n"
    "Action: '{action}'\n\nNext Page State:"
)

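# Placeholders you must supply yourself:
#   generate(messages) -> str : runs the model on the chat history and returns its reply
#   state_0                   : the initial page state (same format as `current_state` above)
#   action_0, action_1, ...   : action strings such as "click([32])"
# `system_prompt` is the one defined in the single-step example.

# Turn 1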
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Initial Page State:\n{state_0}\n\nFirst Action: '{action_0}'\n\nNext Page State:"},
]
state_1 = generate(messages)  # your generate function

# Turn 2
messages.append({"role": "assistant", "content": state_1})
messages.append({"role": "user", "content": CONTINUE_PROMPT.format(action=action_1)})
state_2 = generate(messages)

# Turn 3, 4, ... up to 30+ turns: repeat the same pattern
messages.append({"role": "assistant", "content": state_2})
messages.append({"role": "user", "content": CONTINUE_PROMPT.format(action=action_2)})
state_3 = generate(messages)
```

</details>
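
The turn pattern above can be folded into a loop. The following is a minimal sketch, not an official WebWorld API: `rollout`, `FIRST_PROMPT`, and `fake_generate` are hypothetical names introduced here for illustration, and `generate_fn` stands for whatever function you use to run the model on a chat history (for example, a wrapper around the `model.generate` call from the single-step example).

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

# Prompt templates copied from the examples above.
FIRST_PROMPT = (
    "Initial Page State:\n{state}\n\nFirst Action: '{action}'\n\nNext Page State:"
)
CONTINUE_PROMPT = (
    "Continue the trajectory. Given the previous state, "
    "predict the next page state after this action.\n\n"
    "Action: '{action}'\n\nNext Page State:"
)


def rollout(
    generate_fn: Callable[[List[Message]], str],  # assumption: user-supplied
    system_prompt: str,
    initial_state: str,
    actions: List[str],
) -> List[str]:
    """Simulate one trajectory; return the predicted state after each action."""
    messages: List[Message] = [{"role": "system", "content": system_prompt}]
    states: List[str] = []
    for step, action in enumerate(actions):
        if step == 0:
            # Only the first turn carries the full initial page state.
            content = FIRST_PROMPT.format(state=initial_state, action=action)
        else:
            content = CONTINUE_PROMPT.format(action=action)
        messages.append({"role": "user", "content": content})
        predicted = generate_fn(messages)
        # Feed the prediction back as the assistant turn for the next step.
        messages.append({"role": "assistant", "content": predicted})
        states.append(predicted)
    return states


def fake_generate(messages: List[Message]) -> str:
    """Stand-in for the model: reports how many user turns it has seen."""
    n_user = sum(1 for m in messages if m["role"] == "user")
    return f"<state {n_user}>"


states = rollout(
    fake_generate,
    "You are a web world model.",
    "RootWebArea 'Demo'",
    ["click([32])", "go_back()"],
)
print(states)
```

Because every predicted state is appended to the history, the context grows with each step; for long trajectories you may need to check the total length against the model's context window.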

## 🎮 Action Space

WebWorld supports a unified action space, expressed as Python-style function calls:

| Category | Action | Description |
|---|---|---|
| **Element** | `click(bid, button, modifiers)` | Click a DOM element by its ID |
| | `fill(bid, text, press_enter)` | Type text into an input field |
| | `select_option(bid, options)` | Select from a dropdown / combobox |
| | `hover(bid)` | Hover over an element |
| **Mouse** | `mouse_move(x, y)` | Move cursor to coordinates |
| | `mouse_click(x, y, button)` | Click at coordinates |
| | `mouse_down(x, y)` / `mouse_up(x, y)` | Press / release (drag-and-drop) |
| **Keyboard** | `keyboard_press(key)` | Press a key (e.g., `Enter`, `Tab`) |
| | `keyboard_type(text)` | Type a string sequentially |
| **Browser** | `scroll(dx, dy)` | Scroll the viewport |
| | `goto(url)` | Navigate to a URL |
| | `go_back()` / `go_forward()` | Browser history navigation |
| | `tab_new()` / `tab_close()` / `tab_focus(index)` | Manage browser tabs |
| **Meta** | `send_msg_to_user(text)` | Send a message to the user |
| | `noop(wait_ms)` | Wait for a duration |
| | `infeasible(reason)` | Declare the task impossible |
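
Since actions are written as Python-style calls, an action string can be checked before it is sent to the model. The sketch below is an illustration only and is not part of WebWorld: `parse_action` and `ACTION_NAMES` are hypothetical names, the name set is transcribed from the table above, and argument types and counts are not validated.

```python
import ast
from typing import Any, Dict, List, Tuple

# Action names transcribed from the table above.
ACTION_NAMES = {
    "click", "fill", "select_option", "hover",
    "mouse_move", "mouse_click", "mouse_down", "mouse_up",
    "keyboard_press", "keyboard_type",
    "scroll", "goto", "go_back", "go_forward",
    "tab_new", "tab_close", "tab_focus",
    "send_msg_to_user", "noop", "infeasible",
}


def parse_action(action: str) -> Tuple[str, List[Any], Dict[str, Any]]:
    """Split an action string into (name, positional args, keyword args).

    Raises ValueError for anything that is not a call to a known action.
    Arguments must be Python literals; nothing is executed.
    """
    tree = ast.parse(action.strip(), mode="eval")
    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"not a function call: {action!r}")
    name = call.func.id
    if name not in ACTION_NAMES:
        raise ValueError(f"unknown action: {name!r}")
    args = [ast.literal_eval(arg) for arg in call.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return name, args, kwargs


print(parse_action("click([32])"))
print(parse_action("fill('19', 'weather', press_enter=True)"))
```

Using `ast.literal_eval` rather than `eval` keeps the check safe for untrusted, agent-generated strings.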

## 📊 Performance

### Intrinsic Evaluation (WebWorld-Bench)

WebWorld-Bench evaluates models using **Factuality Score** (functional correctness) and **Web Turing Score** (perceptual realism) across nine dimensions:

| Model | Avg Factuality | Avg Turing |
|---|---|---|
| GPT-4o | 59.5 | 35.4 |
| Claude-Opus-4.1 | **71.3** | **47.4** |
| Gemini-3-Pro | 70.3 | 43.2 |
| Qwen3-8B (base) | 26.9 | 17.4 |
| **WebWorld-8B** | **70.1** | **42.2** |
| **WebWorld-14B** | 70.7 | 44.7 |
| **WebWorld-32B** | **71.0** | **45.6** |

### Extrinsic Evaluation (Agent Training)

| Model | MiniWob++ SR | WebArena SR |
|---|---|---|
| GPT-4o | 64.3% | 26.6% |
| Qwen3-8B (base) | 49.4% | 9.8% |
| **Qwen3-8B + WebWorld** | **59.3%** (+9.9%) | **20.7%** (+10.9%) |
| Qwen3-14B (base) | 54.9% | 15.1% |
| **Qwen3-14B + WebWorld** | **63.2%** (+8.3%) | **24.3%** (+9.2%) |
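
The gains in parentheses are absolute differences in success rate (percentage points), not relative improvements. A short check using only the numbers from the table (the variable and function names are illustrative):

```python
# (MiniWob++ SR, WebArena SR) in percent, copied from the table above.
base = {"Qwen3-8B": (49.4, 9.8), "Qwen3-14B": (54.9, 15.1)}
with_webworld = {"Qwen3-8B": (59.3, 20.7), "Qwen3-14B": (63.2, 24.3)}


def absolute_gains(before, after):
    """Per-benchmark difference in percentage points, rounded to one decimal."""
    return tuple(round(a - b, 1) for b, a in zip(before, after))


gains = {name: absolute_gains(base[name], with_webworld[name]) for name in base}
for name, (miniwob, webarena) in gains.items():
    print(f"{name}: MiniWob++ +{miniwob}, WebArena +{webarena}")
```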

### Cross-Domain Generalization

| Environment | Qwen3-8B | WebWorld-8B | Gain |
|---|---|---|---|
| API Services | 0.088 | **0.299** | +0.211 |
| Code | 0.147 | **0.396** | +0.249 |
| Game | 0.253 | **0.473** | +0.220 |
| GUI Desktop | 0.322 | **0.705** | +0.383 |

## โš ๏ธ Limitations

- **Sycophancy / optimism bias**: the model may generate outcomes that are overly favorable to the agent's intended action.
- **Content generation fidelity**: long-form, high-precision content (e.g., scientific articles) is not the primary target.
- **Text-only**: WebWorld does not simulate visual / pixel-level rendering.

## ๐Ÿ“ Citation

```bibtex
@misc{xiao2026webworldlargescaleworldmodel,
      title={WebWorld: A Large-Scale World Model for Web Agent Training}, 
      author={Zikai Xiao and Jianhong Tu and Chuhang Zou and Yuxin Zuo and Zhi Li and Peng Wang and Bowen Yu and Fei Huang and Junyang Lin and Zuozhu Liu},
      year={2026},
      eprint={2602.14721},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.14721}, 
}
```