---
license: apache-2.0
language:
  - en
  - zh
tags:
  - world-model
  - video-generation
  - action-conditioned
  - game-world-model
  - first-person-shooter
  - diffusion
  - transformer
  - wan2.2
library_name: diffsynth
pipeline_tag: image-to-video
base_model:
  - Wan-AI/Wan2.2-TI2V-5B
---

# <span style="color:rgb(30,100,200)">**S**</span><span style="color:rgb(60,160,220)">**C**</span><span style="color:rgb(60,160,220)">**O**</span><span style="color:rgb(210,120,40)">**P**</span><span style="color:rgb(180,80,30)">**E**</span>: <span style="color:rgb(30,100,200)">**S**</span>imulating <span style="color:rgb(60,160,220)">**C**</span>ross-game <span style="color:rgb(60,160,220)">**O**</span>perations in <span style="color:rgb(210,120,40)">**P**</span>layable <span style="color:rgb(180,80,30)">**E**</span>nvironments for FPS World Models

<div align="center">
  <img src="assets/teaser.jpg" alt="SCOPE Teaser" width="90%">

  <p><b><span style="color:rgb(30,100,200)">S</span><span style="color:rgb(60,160,220)">C</span><span style="color:rgb(60,160,220)">O</span><span style="color:rgb(210,120,40)">P</span><span style="color:rgb(180,80,30)">E</span></b> is an interactive world model for FPS games with 10-DoF action control, trained on 69K clips across 7 games.</p>

  [![Project Page](https://img.shields.io/badge/Project%20Page-SCOPE-blue)](https://z2tong.github.io/SCOPE/)
  [![GitHub](https://img.shields.io/badge/GitHub-Code-black?logo=github)](https://github.com/z2tong/SCOPE)
  [![arXiv](https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv)](https://arxiv.org/abs/2605.23345)
  [![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)

</div>

## Highlights

- **Hybrid Action Space**: Jointly processes continuous (4D dual-joystick) and discrete (6 binary buttons) control signals within a unified framework, the first FPS world model to do so.
- **Dense Per-Frame Conditioning**: Resolves overlapping actions at every single frame, enabling simultaneous multi-action composition (e.g., moving + aiming + firing) that reflects real gameplay complexity.
- **Cross-Game Generalization**: Trained on 7 diverse FPS titles, a single model generalizes zero-shot to entirely unseen game environments without fine-tuning.
- **In-Scope / Out-of-Scope Decoupling**: Spatially selective conditioning that separates localized in-scope effects (weapon recoil, HUD) from stable out-of-scope world generation, without any segmentation labels.

## Model Overview

SCOPE is an interactive world model for first-person shooter (FPS) games. Built on [Wan2.2-TI2V-5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B), SCOPE inserts a conditioning module into each transformer block that reshapes features into per-pixel temporal sequences. Each spatial position computes its action response from local visual content, naturally separating in-scope effects (e.g., weapon firing, reloading) from out-of-scope world generation (e.g., stable surroundings), without any segmentation labels.

Trained on [**CrossFPS**](https://huggingface.co/collections/zizhaotong/crossfps) (69K clips, 7 games, 10-DoF), SCOPE learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes.

## Architecture

| Component | Details |
|-----------|---------|
| **Base Model** | Wan2.2-TI2V-5B (DiT, 30 transformer layers) |
| **Action Module** | Per-block conditioning with per-pixel temporal sequences |
| **Text Encoder** | UMT5-XXL (4096-dim hidden) |
| **VAE** | Wan2.2 Video VAE (4× temporal compression, 8× spatial compression) |
| **Total Parameters** | ~5B (1575 tensors, of which 750 are action-related) |
| **Precision** | BFloat16 |

### ActionModule Design

Each of the 30 DiT blocks contains an `ActionModule` with two conditioning paths:

- **Mouse/Joystick Path**: Sliding-window temporal features → MLP fusion → pixel-wise temporal self-attention with RoPE
- **Keyboard/Button Path**: Button embedding → temporal windowing → cross-attention (video queries, keyboard keys/values)

Both output projections are zero-initialized for stable residual training on top of frozen pretrained weights.
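
The effect of zero-initialization can be illustrated with a minimal, framework-free sketch. It is not the SCOPE `ActionModule` itself: the function names, the 4-dimensional features, and the single linear projection are assumptions chosen for illustration. The point is only that a residual branch whose output projection starts at zero leaves the pretrained features unchanged at the first training step.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector (plain lists, no framework)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def residual_with_projection(hidden, condition, weight, bias):
    """Add the projected conditioning signal onto the hidden features."""
    delta = linear(condition, weight, bias)
    return [h + d for h, d in zip(hidden, delta)]


dim = 4  # hypothetical feature size, for illustration only
random.seed(0)
hidden = [random.gauss(0, 1) for _ in range(dim)]     # stands in for pretrained features
condition = [random.gauss(0, 1) for _ in range(dim)]  # stands in for action features

# Zero-initialized output projection: the residual branch contributes nothing yet.
zero_weight = [[0.0] * dim for _ in range(dim)]
zero_bias = [0.0] * dim

out = residual_with_projection(hidden, condition, zero_weight, zero_bias)
print(out == hidden)
```

As training updates the projection away from zero, the action branch gradually starts to modify the features instead of disturbing them from the first step.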

### Generation Specs

| Property | Value |
|----------|-------|
| Resolution | 480 × 832 |
| Frame Count | 81 frames |
| Frame Rate | 20 FPS |
| Duration | ~4 seconds |
| Inference Steps | 30 (default) |
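
As a rough sanity check, the clip duration and an approximate latent size can be derived from these numbers. The sketch below takes the compression factors from the Architecture table and assumes the first frame is encoded on its own, so that `T` frames map to `(T - 1) / 4 + 1` latent frames, a common convention for causal video VAEs that is not confirmed here for SCOPE. The function name is hypothetical.

```python
def latent_shape(frames, height, width, t_stride=4, s_stride=8):
    """Approximate latent (frames, height, width) for a video VAE.

    Assumes the first frame is kept separately and the remaining frames are
    compressed in groups of `t_stride` (an assumption, not taken from the
    SCOPE code).
    """
    latent_frames = (frames - 1) // t_stride + 1
    return latent_frames, height // s_stride, width // s_stride


frames, fps = 81, 20
duration_seconds = frames / fps  # 4.05, matching the "~4 seconds" in the table

print(duration_seconds)
print(latent_shape(frames, 480, 832))  # (21, 60, 104) under the assumptions above
```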

## Action Input Format

SCOPE accepts 10-DoF action inputs per frame via a Parquet file:

**Controller Buttons (6D binary):**

| Index | Column | Action |
|:-----:|--------|--------|
| 0 | `right_trigger` | Fire (RT) |
| 1 | `left_trigger` | Aim Down Sights (LT) |
| 2 | `south` | Jump (A) |
| 3 | `right_thumb` | Melee (R3) |
| 4 | `west` | Reload (X) |
| 5 | `north` | Weapon Switch (Y) |

**Dual Joystick (4D continuous):**

| Column | Axes | Function |
|--------|------|----------|
| `j_left` | [x, y] | Character movement (left stick) |
| `j_right` | [x, y] | Camera rotation (right stick) |
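
A minimal sketch of building per-frame action rows with the column names above is shown below. The helper names, the joystick range of [-1, 1], and the buttons-then-sticks ordering of the flattened 10-D vector are assumptions for illustration; check the GitHub repository for the exact schema that `inference.py` expects.

```python
BUTTON_COLUMNS = [
    "right_trigger",  # Fire (RT)
    "left_trigger",   # Aim Down Sights (LT)
    "south",          # Jump (A)
    "right_thumb",    # Melee (R3)
    "west",           # Reload (X)
    "north",          # Weapon Switch (Y)
]


def make_frame(pressed=(), j_left=(0.0, 0.0), j_right=(0.0, 0.0)):
    """Build one frame's action row: 6 binary buttons plus two [x, y] sticks."""
    unknown = set(pressed) - set(BUTTON_COLUMNS)
    if unknown:
        raise ValueError(f"unknown buttons: {sorted(unknown)}")
    for name, stick in (("j_left", j_left), ("j_right", j_right)):
        # Assumed range: each axis is normalized to [-1, 1].
        if len(stick) != 2 or any(abs(v) > 1.0 for v in stick):
            raise ValueError(f"{name} must be [x, y] with values in [-1, 1]")
    row = {name: int(name in pressed) for name in BUTTON_COLUMNS}
    row["j_left"] = [float(v) for v in j_left]
    row["j_right"] = [float(v) for v in j_right]
    return row


def to_vector(row):
    """Flatten one row into a 10-D vector (assumed order: buttons, then sticks)."""
    return [float(row[name]) for name in BUTTON_COLUMNS] + row["j_left"] + row["j_right"]


# 81 frames of "push the left stick up, pan right slowly, hold fire".
rows = [
    make_frame(pressed=("right_trigger",), j_left=(0.0, 1.0), j_right=(0.25, 0.0))
    for _ in range(81)
]
print(len(rows), to_vector(rows[0]))
```

If pandas and a Parquet engine such as pyarrow are installed, rows like these could be written with `pandas.DataFrame(rows).to_parquet("action.parquet")`.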

## Quick Start

### Requirements

- Python >= 3.10
- PyTorch >= 2.0 with CUDA support
- GPU: NVIDIA with >= 24 GB VRAM (single-GPU inference with CPU offload)

### Installation

```bash
git clone https://github.com/z2tong/SCOPE.git
cd SCOPE
pip install -e .
```

### Download Weights

```bash
# Download all weights (SCOPE DiT + Text Encoder + VAE + Tokenizer) in one command
huggingface-cli download zizhaotong/SCOPE --local-dir ./SCOPE
```

### Inference

**Single image + action sequence:**

```bash
python inference.py \
    --model_dir ./SCOPE \
    --input_image input.png \
    --action_path action.parquet \
    --prompt "First-person shooter perspective in a modern city" \
    --seed 42
```

**Batch processing (directory of images):**

```bash
python inference.py \
    --model_dir ./SCOPE \
    --input_image_dir ./images \
    --action_path action.parquet \
    --prompt "First-person view in a battlefield" \
    --output_dir ./outputs
```

For full usage details and advanced options, see the [GitHub repository](https://github.com/z2tong/SCOPE).

## Repository Contents

This repo contains **all weights** needed for inference in a single download:

| File | Component | Size |
|------|-----------|------|
| `model-00001-of-00003.safetensors` | SCOPE DiT shard 1 | ~5.0 GB |
| `model-00002-of-00003.safetensors` | SCOPE DiT shard 2 | ~5.0 GB |
| `model-00003-of-00003.safetensors` | SCOPE DiT shard 3 | ~4.6 GB |
| `model.safetensors.index.json` | Shard index mapping | N/A |
| `models_t5_umt5-xxl-enc-bf16.pth` | Text Encoder (UMT5-XXL) | ~20 GB |
| `Wan2.2_VAE.pth` | Video VAE | ~700 MB |
| `google/umt5-xxl/` | Tokenizer | ~10 MB |
| `config.json` | Model architecture config | N/A |

> **Inference code** is available at [github.com/z2tong/SCOPE](https://github.com/z2tong/SCOPE).
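
After downloading, a quick check that every entry in the table is present can catch an interrupted download before inference starts. This is a small sketch using only the file names listed above; the function name and the `./SCOPE` default directory (taken from the download command) are illustrative, and the check does not verify file sizes or contents.

```python
from pathlib import Path

# File and directory names copied from the Repository Contents table.
EXPECTED_ENTRIES = [
    "model-00001-of-00003.safetensors",
    "model-00002-of-00003.safetensors",
    "model-00003-of-00003.safetensors",
    "model.safetensors.index.json",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2_VAE.pth",
    "google/umt5-xxl",
    "config.json",
]


def missing_entries(model_dir="./SCOPE"):
    """Return the expected files or directories not found under model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


missing = missing_entries("./SCOPE")
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All expected files are present.")
```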

## CrossFPS Dataset

SCOPE is trained on [**CrossFPS**](https://huggingface.co/collections/zizhaotong/crossfps), the first multi-game FPS dataset with frame-aligned action telemetry:

| Property | Value |
|----------|-------|
| Games | 7 diverse FPS titles |
| Total Clips | 69,000+ |
| Action Dimensions | 10-DoF (6 buttons + 4D joystick) |
| Annotation | Frame-aligned action telemetry |
| Curation | Gameplay-bias removal for general visual-to-action mapping |

## Citation

```bibtex
@misc{scope2026,
      title={SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models},
      author={Zizhao Tong and Hongfeng Lai and Zeqing Wang and Zhaohu Xing and Kexu Cheng and Haoran Xu and Zhao Pu and Shangwen Zhu and Ruili Feng and Jian Zhao and Yan Zhang and Hao Tang and Yeying Jin and Ling Shao},
      year={2026},
      eprint={2605.23345},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.23345}, 
}
```

## Acknowledgements

We thank the [Wan Team](https://huggingface.co/Wan-AI) for open-sourcing Wan2.2 and the [DiffSynth](https://github.com/modelscope/DiffSynth-Studio) team for the inference framework.