zizhaotong committed
Commit 816cb27 · verified · 1 Parent(s): 9c72715

Upload 2 files

Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +195 -1
  3. assets/teaser.jpg +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ assets/teaser.jpg filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,197 @@
1
  ---
2
  license: apache-2.0
3
- ---
3
+ language:
4
+ - en
5
+ - zh
6
+ tags:
7
+ - world-model
8
+ - video-generation
9
+ - action-conditioned
10
+ - game-world-model
11
+ - first-person-shooter
12
+ - diffusion
13
+ - transformer
14
+ - wan2.2
15
+ library_name: diffsynth
16
+ pipeline_tag: image-to-video
17
+ base_model:
18
+ - Wan-AI/Wan2.2-TI2V-5B
19
+ ---
20
+
21
+ # <span style="color:rgb(30,100,200)">S</span><span style="color:rgb(60,160,220)">C</span><span style="color:rgb(60,160,220)">O</span><span style="color:rgb(210,120,40)">P</span><span style="color:rgb(180,80,30)">E</span>: Simulating Cross-game Operations in Playable Environments for FPS World Models
22
+
23
+ <div align="center">
24
+ <img src="assets/teaser.jpg" alt="SCOPE Teaser" width="90%">
25
+
26
+ <p><i><b><span style="color:rgb(30,100,200)">S</span><span style="color:rgb(60,160,220)">C</span><span style="color:rgb(60,160,220)">O</span><span style="color:rgb(210,120,40)">P</span><span style="color:rgb(180,80,30)">E</span></b> is an interactive world model for FPS games with 10-DoF action control, trained on 69K clips across 7 games.</i></p>
27
+
28
+ [![Project Page](https://img.shields.io/badge/Project%20Page-SCOPE-blue)](https://github.com/z2tong/SCOPE)
29
+ [![GitHub](https://img.shields.io/badge/GitHub-Code-black?logo=github)](https://github.com/z2tong/SCOPE)
30
+ [![arXiv](https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv)](https://arxiv.org/abs/XXXX.XXXXX)
31
+ [![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)
32
+
33
+ </div>
34
+
35
+ ## Highlights
36
+
37
+ - **Hybrid Action Space**: First FPS world model to handle both continuous (4D joystick) and discrete (6 binary buttons) actions simultaneously in a unified framework
38
+ - **Dense High-Frequency Control**: Resolves overlapping control signals at every frame, unlike prior methods limited to sparse inputs
39
+ - **Action Composition**: Supports simultaneous multi-action combinations (e.g., moving + aiming + firing), reflecting real gameplay complexity
40
+ - **Cross-Game Generalization**: A single world model trained on 7 diverse FPS games that generalizes zero-shot to unseen game environments
41
+ - **In-Scope / Out-of-Scope Decoupling**: Spatially selective conditioning that separates localized in-scope effects (weapon, HUD) from stable out-of-scope world generation, without segmentation labels
42
+
43
+ ## Model Overview
44
+
45
+ SCOPE is an interactive world model for first-person shooter (FPS) games. Unlike prior game world models that only handle sparse, single-modality actions, SCOPE processes **hybrid action inputs** (continuous joystick + discrete buttons) at **dense, high-frequency** rates, supporting real-time **action composition**: multiple simultaneous inputs such as moving, aiming, and firing in the same frame.
46
+
47
+ Built on [Wan2.2-TI2V-5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B), SCOPE inserts a conditioning module into each transformer block that reshapes features into per-pixel temporal sequences. Each spatial position computes its action response from local visual content, naturally separating in-scope effects (e.g., weapon firing, reloading) from out-of-scope world generation (e.g., stable surroundings), without any segmentation labels. Trained on [**CrossFPS**](https://huggingface.co/collections/zizhaotong/crossfps) (69K clips, 7 games, 10-DoF), SCOPE learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes.
48
+
49
+ ### Architecture
50
+
51
+ | Component | Details |
52
+ |-----------|---------|
53
+ | **Base Model** | Wan2.2-TI2V-5B (DiT, 30 layers) |
54
+ | **Action Module** | Per-block conditioning, per-pixel temporal sequences |
55
+ | **Text Encoder** | UMT5-XXL |
56
+ | **VAE** | Wan2.2 Video VAE (temporal compression 4x, spatial compression 16x) |
57
+ | **Parameters** | ~5B total (1575 tensors) |
58
+ | **Precision** | BFloat16 |
59
+
60
+ ### Generation Specs
61
+
62
+ | Property | Value |
63
+ |----------|-------|
64
+ | Resolution | 480 x 832 |
65
+ | Frame Count | 81 frames |
66
+ | Frame Rate | 20 FPS |
67
+ | Duration | ~4 seconds |
68
+ | Inference Steps | 30 (default) |
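
The duration in the table follows directly from the frame count and frame rate. The short sketch below reproduces that arithmetic and also estimates the number of latent frames. The latent-frame formula assumes the VAE keeps the first frame and compresses the remaining frames 4x in time (the common `4k + 1` convention for causal video VAEs); that convention is an assumption here, not something this README states, and the function names are illustrative.

```python
def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Length of a clip in seconds for a given frame count and frame rate."""
    return num_frames / fps


def latent_frame_count(num_frames: int, temporal_stride: int = 4) -> int:
    """Latent frames under the ASSUMED 4k+1 convention (first frame kept as-is)."""
    if (num_frames - 1) % temporal_stride != 0:
        raise ValueError("num_frames must have the form temporal_stride * k + 1")
    return (num_frames - 1) // temporal_stride + 1


print(clip_duration_seconds(81, 20))  # 81 frames at 20 FPS
print(latent_frame_count(81))         # latent frames for an 81-frame clip
```

Under these assumptions an 81-frame clip lasts 4.05 seconds, which matches the "~4 seconds" entry above.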
69
+
70
+ ## Action Input Format
71
+
72
+ SCOPE accepts 10-DoF action inputs per frame via a Parquet file:
73
+
74
+ **Controller Buttons (6D binary):**
75
+
76
+ | Index | Column | Action |
77
+ |:-----:|--------|--------|
78
+ | 0 | `right_trigger` | Fire (RT) |
79
+ | 1 | `left_trigger` | Aim Down Sights (LT) |
80
+ | 2 | `south` | Jump (A) |
81
+ | 3 | `right_thumb` | Melee (R3) |
82
+ | 4 | `west` | Reload (X) |
83
+ | 5 | `north` | Weapon Switch (Y) |
84
+
85
+ **Dual Joystick (4D continuous):**
86
+
87
+ | Column | Axes | Function |
88
+ |--------|------|----------|
89
+ | `j_left` | [x, y] | Character movement (left stick) |
90
+ | `j_right` | [x, y] | Camera rotation (right stick) |
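
To make the column layout above concrete, here is a minimal sketch that builds one action row per frame and optionally writes the rows to a Parquet file. Several details are assumptions rather than facts from this README: buttons are stored as integers 0/1, joystick values lie in `[-1, 1]`, `j_left` and `j_right` are stored as two-element lists, and a positive `y` on the left stick means "forward". The helper names `make_action_rows` and `write_action_parquet` are hypothetical; check the GitHub repository for the exact schema expected by `inference.py`.

```python
import math

# Column names taken from the tables above.
BUTTON_COLUMNS = ["right_trigger", "left_trigger", "south", "right_thumb", "west", "north"]


def make_action_rows(num_frames: int = 81) -> list[dict]:
    """Build one 10-DoF action row per frame (hypothetical example trajectory)."""
    rows = []
    for t in range(num_frames):
        row = {name: 0 for name in BUTTON_COLUMNS}
        # Hold fire during the second half of the clip.
        row["right_trigger"] = 1 if t >= num_frames // 2 else 0
        # ASSUMPTION: [x, y] in [-1, 1]; here the left stick is pushed fully along +y.
        row["j_left"] = [0.0, 1.0]
        # Gentle horizontal camera sweep on the right stick.
        row["j_right"] = [0.3 * math.sin(2 * math.pi * t / num_frames), 0.0]
        rows.append(row)
    return rows


def write_action_parquet(rows: list[dict], path: str) -> None:
    """Write rows to Parquet; needs pandas with a Parquet engine such as pyarrow."""
    import pandas as pd

    pd.DataFrame(rows).to_parquet(path, index=False)


rows = make_action_rows()
print(len(rows), sorted(rows[0]))
```

Calling `write_action_parquet(rows, "action.parquet")` would then produce a file that can be passed as `--action_path`, provided the assumed schema matches what the inference script reads.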
91
+
92
+ ## Quick Start
93
+
94
+ ### Requirements
95
+
96
+ - Python >= 3.10
97
+ - PyTorch >= 2.5 with CUDA support
98
+ - GPU: NVIDIA with >= 24GB VRAM (single GPU inference with CPU offload)
99
+ - [DiffSynth](https://github.com/modelscope/DiffSynth-Studio) framework
100
+
101
+ ### Installation
102
+
103
+ ```bash
104
+ git clone https://github.com/z2tong/SCOPE.git
105
+ cd SCOPE
106
+ pip install -e .
107
+ ```
108
+
109
+ ### Download Weights
110
+
111
+ ```bash
112
+ # Download all weights (SCOPE DiT + Text Encoder + VAE + Tokenizer) in one command
113
+ huggingface-cli download z2tong/SCOPE --local-dir ./SCOPE
114
+ ```
115
+
116
+ The model directory contains everything needed for inference:
117
+
118
+ ```
119
+ SCOPE/
120
+ ├── model-00001-of-00003.safetensors # SCOPE DiT shard 1
121
+ ├── model-00002-of-00003.safetensors # SCOPE DiT shard 2
122
+ ├── model-00003-of-00003.safetensors # SCOPE DiT shard 3
123
+ ├── model.safetensors.index.json # Shard index
124
+ ├── models_t5_umt5-xxl-enc-bf16.pth # Text Encoder (UMT5-XXL)
125
+ ├── Wan2.2_VAE.pth # Video VAE
126
+ ├── google/umt5-xxl/ # Tokenizer
127
+ └── config.json # Model config
128
+ ```
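
After downloading, it can be useful to confirm that the directory matches the layout listed above before starting a long inference run. The following sketch only checks that each listed entry exists; it does not validate file contents or sizes. The helper name `missing_entries` is hypothetical and not part of the SCOPE code base.

```python
from pathlib import Path

# Entries copied from the directory listing above.
EXPECTED_ENTRIES = [
    "model-00001-of-00003.safetensors",
    "model-00002-of-00003.safetensors",
    "model-00003-of-00003.safetensors",
    "model.safetensors.index.json",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2_VAE.pth",
    "google/umt5-xxl",
    "config.json",
]


def missing_entries(model_dir: str) -> list[str]:
    """Return the expected files or directories that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


missing = missing_entries("./SCOPE")
print("missing:", missing if missing else "none")
```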
129
+
130
+ ### Inference
131
+
132
+ **Single image + action sequence:**
133
+
134
+ ```bash
135
+ python inference.py \
136
+ --model_dir ./SCOPE \
137
+ --input_image input.png \
138
+ --action_path action.parquet \
139
+ --prompt "First-person shooter perspective in a modern city" \
140
+ --seed 42
141
+ ```
142
+
143
+ **Batch processing (directory of images):**
144
+
145
+ ```bash
146
+ python inference.py \
147
+ --model_dir ./SCOPE \
148
+ --input_image_dir ./images \
149
+ --action_path action.parquet \
150
+ --prompt "First-person view in a battlefield" \
151
+ --output_dir ./outputs
152
+ ```
153
+
154
+ For full usage details and advanced options, see the [GitHub repository](https://github.com/z2tong/SCOPE).
155
+
156
+ ## Repository Contents
157
+
158
+ This repo contains **all weights** needed for inference in a single download:
159
+
160
+ | File | Component | Size |
161
+ |------|-----------|------|
162
+ | `model-00001-of-00003.safetensors` | SCOPE DiT shard 1 | ~5.0 GB |
163
+ | `model-00002-of-00003.safetensors` | SCOPE DiT shard 2 | ~5.0 GB |
164
+ | `model-00003-of-00003.safetensors` | SCOPE DiT shard 3 | ~4.6 GB |
165
+ | `model.safetensors.index.json` | Shard index mapping | - |
166
+ | `models_t5_umt5-xxl-enc-bf16.pth` | Text Encoder (UMT5-XXL) | ~20 GB |
167
+ | `Wan2.2_VAE.pth` | Video VAE | ~700 MB |
168
+ | `google/umt5-xxl/` | Tokenizer | ~10 MB |
169
+ | `config.json` | Model architecture config | - |
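
For planning disk space, the approximate sizes in the table can simply be summed. The sketch below uses only the figures listed above; the two entries without a listed size (the shard index and `config.json`) are left out, so the result is a rough lower bound rather than an exact download size.

```python
# Approximate sizes in GB, copied from the table above.
SIZES_GB = {
    "model-00001-of-00003.safetensors": 5.0,
    "model-00002-of-00003.safetensors": 5.0,
    "model-00003-of-00003.safetensors": 4.6,
    "models_t5_umt5-xxl-enc-bf16.pth": 20.0,
    "Wan2.2_VAE.pth": 0.7,       # listed as ~700 MB
    "google/umt5-xxl/": 0.01,    # listed as ~10 MB
}

total_gb = sum(SIZES_GB.values())
print(f"approximate download size: {total_gb:.1f} GB")
```

With these figures the total comes to roughly 35 GB, most of it from the text encoder.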
170
+
171
+ > **Inference code** is available at [github.com/z2tong/SCOPE](https://github.com/z2tong/SCOPE).
172
+
173
+ ## CrossFPS Dataset
174
+
175
+ SCOPE is trained on [**CrossFPS**](https://huggingface.co/collections/zizhaotong/crossfps), the first multi-game FPS dataset with frame-aligned action telemetry:
176
+
177
+ | Property | Value |
178
+ |----------|-------|
179
+ | Games | 7 diverse FPS titles |
180
+ | Total Clips | 69,000+ |
181
+ | Action Dimensions | 10-DoF (6 buttons + 4D joystick) |
182
+ | Annotation | Frame-aligned action telemetry |
183
+ | Curation | Gameplay-bias removal for general visual-to-action mapping |
184
+
185
+ ## Citation
186
+
187
+ ```bibtex
188
+ @article{scope2025,
189
+ title={SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models},
190
+ author={Tong, Zizhao and Lai, Hongfeng and Wang, Zeqing and Xing, Zhaohu and Cheng, Kexu and Xu, Haoran and Pu, Zhao and Zhu, Shangwen and Feng, Ruili and Zhao, Jian and Zhang, Yan and Tang, Hao and Jin, Yeying and Shao, Ling},
191
+ year={2025}
192
+ }
193
+ ```
194
+
195
+ ## Acknowledgements
196
+
197
+ We thank the [Wan Team](https://huggingface.co/Wan-AI) for open-sourcing Wan2.2 and the [DiffSynth](https://github.com/modelscope/DiffSynth-Studio) team for the inference framework.
assets/teaser.jpg ADDED

Git LFS Details

  • SHA256: 5c6b4bcf841ab7c9fd0597cd9089481341a5f9a782c9256a76ec4aabdcf7c196
  • Pointer size: 132 Bytes
  • Size of remote file: 6.06 MB