zizhaotong committed · Commit 723e34e · verified · 1 Parent(s): f6c34e4

Upload README.md

  1. README.md +28 -35
README.md CHANGED
@@ -34,34 +34,42 @@ base_model:
34
 
35
  ## Highlights
36
 
37
- - **Hybrid Action Space**: First FPS world model to handle both continuous (4D joystick) and discrete (6 binary buttons) actions simultaneously in a unified framework
38
- - **Dense High-Frequency Control**: Resolves overlapping control signals at every frame, unlike prior methods limited to sparse inputs
39
- - **Action Composition**: Supports simultaneous multi-action combinations (e.g., moving + aiming + firing), reflecting real gameplay complexity
40
- - **Cross-Game Generalization**: A single world model trained on 7 diverse FPS games that generalizes zero-shot to unseen game environments
41
- - **In-Scope / Out-of-Scope Decoupling**: Spatially selective conditioning that separates localized in-scope effects (weapon, HUD) from stable out-of-scope world generation, without segmentation labels
42
 
43
  ## Model Overview
44
 
45
- SCOPE is an interactive world model for first-person shooter (FPS) games. Unlike prior game world models that only handle sparse, single-modality actions, SCOPE processes **hybrid action inputs** (continuous joystick + discrete buttons) at **dense, high-frequency** rates, supporting real-time **action composition**: multiple simultaneous inputs such as moving, aiming, and firing in the same frame.
46
 
47
- Built on [Wan2.2-TI2V-5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B), SCOPE inserts a conditioning module into each transformer block that reshapes features into per-pixel temporal sequences. Each spatial position computes its action response from local visual content, naturally separating in-scope effects (e.g., weapon firing, reloading) from out-of-scope world generation (e.g., stable surroundings), without any segmentation labels. Trained on [**CrossFPS**](https://huggingface.co/collections/zizhaotong/crossfps) (69K clips, 7 games, 10-DoF), SCOPE learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes.
48
 
49
- ### Architecture
50
 
51
  | Component | Details |
52
  |-----------|---------|
53
- | **Base Model** | Wan2.2-TI2V-5B (DiT, 30 layers) |
54
- | **Action Module** | Per-block conditioning, per-pixel temporal sequences |
55
- | **Text Encoder** | UMT5-XXL |
56
- | **VAE** | Wan2.2 Video VAE (temporal compression 4x, spatial compression 8x) |
57
- | **Parameters** | ~5B total (1575 tensors) |
58
  | **Precision** | BFloat16 |
59
 
60
  ### Generation Specs
61
 
62
  | Property | Value |
63
  |----------|-------|
64
- | Resolution | 480 x 832 |
65
  | Frame Count | 81 frames |
66
  | Frame Rate | 20 FPS |
67
  | Duration | ~4 seconds |
@@ -94,9 +102,8 @@ SCOPE accepts 10-DoF action inputs per frame via a Parquet file:
94
  ### Requirements
95
 
96
  - Python >= 3.10
97
- - PyTorch >= 2.5 with CUDA support
98
- - GPU: NVIDIA with >= 24GB VRAM (single GPU inference with CPU offload)
99
- - [DiffSynth](https://github.com/modelscope/DiffSynth-Studio) framework
100
 
101
  ### Installation
102
 
@@ -110,21 +117,7 @@ pip install -e .
110
 
111
  ```bash
112
  # Download all weights (SCOPE DiT + Text Encoder + VAE + Tokenizer) in one command
113
- huggingface-cli download z2tong/SCOPE --local-dir ./SCOPE
114
- ```
115
-
116
- The model directory contains everything needed for inference:
117
-
118
- ```
119
- SCOPE/
120
- ├── model-00001-of-00003.safetensors # SCOPE DiT shard 1
121
- ├── model-00002-of-00003.safetensors # SCOPE DiT shard 2
122
- ├── model-00003-of-00003.safetensors # SCOPE DiT shard 3
123
- ├── model.safetensors.index.json # Shard index
124
- ├── models_t5_umt5-xxl-enc-bf16.pth # Text Encoder (UMT5-XXL)
125
- ├── Wan2.2_VAE.pth # Video VAE
126
- ├── google/umt5-xxl/ # Tokenizer
127
- └── config.json # Model config
128
  ```
129
 
130
  ### Inference
@@ -185,10 +178,10 @@ SCOPE is trained on [**CrossFPS**](https://huggingface.co/collections/zizhaotong
185
  ## Citation
186
 
187
  ```bibtex
188
- @article{scope2025,
189
  title={SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models},
190
- author={Tong, Zizhao and Lai, Hongfeng and Wang, Zeqing and Xing, Zhaohu and Cheng, Kexu and Xu, Haoran and Pu, Zhao and Zhu, Shangwen and Feng, Ruili and Zhao, Jian and Zhang, Yan and Tang, Hao and Jin, Yeying and Shao, Ling},
191
- year={2025}
192
  }
193
  ```
194
 
 
34
 
35
  ## Highlights
36
 
37
+ - **Hybrid Action Space**: Jointly processes continuous (4D dual-joystick) and discrete (6 binary buttons) control signals within a unified framework, the first FPS world model to do so.
38
+ - **Dense Per-Frame Conditioning**: Resolves overlapping actions at every single frame, enabling simultaneous multi-action composition (e.g., moving + aiming + firing) that reflects real gameplay complexity.
39
+ - **Cross-Game Generalization**: Trained on 7 diverse FPS titles, a single model generalizes zero-shot to entirely unseen game environments without fine-tuning.
40
+ - **In-Scope / Out-of-Scope Decoupling**: Spatially selective conditioning that separates localized in-scope effects (weapon recoil, HUD) from stable out-of-scope world generation, without any segmentation labels.
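
The hybrid action space described in the highlights (4 continuous joystick axes plus 6 binary buttons, 10 values per frame) can be pictured with a small sketch. The class name, the field names, the `[-1, 1]` axis range, and the ordering of the flattened vector are all assumptions made for illustration; the diff does not show SCOPE's actual Parquet schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FrameAction:
    """One frame of hybrid control input (hypothetical layout, not SCOPE's schema)."""

    joystick: Tuple[float, float, float, float]  # continuous axes, assumed in [-1, 1]
    buttons: Tuple[bool, bool, bool, bool, bool, bool]  # six binary buttons

    def __post_init__(self) -> None:
        if len(self.joystick) != 4 or len(self.buttons) != 6:
            raise ValueError("expected 4 joystick axes and 6 buttons")
        if any(not -1.0 <= axis <= 1.0 for axis in self.joystick):
            raise ValueError("joystick axes are assumed to lie in [-1, 1]")

    def to_vector(self) -> List[float]:
        """Flatten to 10 values: 4 continuous axes, then 6 zeros or ones."""
        return [float(a) for a in self.joystick] + [1.0 if b else 0.0 for b in self.buttons]


# Moving, aiming, and firing in the same frame (which button fires is an assumption).
action = FrameAction(
    joystick=(0.0, 1.0, 0.3, -0.1),
    buttons=(True, False, False, False, False, False),
)
vector = action.to_vector()
```

Keeping the continuous and discrete parts as separate fields mirrors the two modalities named in the README, while the flattened vector shows where the "10-DoF" count comes from.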
 
41
 
42
  ## Model Overview
43
 
44
+ SCOPE is an interactive world model for first-person shooter (FPS) games. Built on [Wan2.2-TI2V-5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B), SCOPE inserts a conditioning module into each transformer block that reshapes features into per-pixel temporal sequences. Each spatial position computes its action response from local visual content, naturally separating in-scope effects (e.g., weapon firing, reloading) from out-of-scope world generation (e.g., stable surroundings), without any segmentation labels.
45
 
46
+ Trained on [**CrossFPS**](https://huggingface.co/collections/zizhaotong/crossfps) (69K clips, 7 games, 10-DoF), SCOPE learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes.
47
 
48
+ ## Architecture
49
 
50
  | Component | Details |
51
  |-----------|---------|
52
+ | **Base Model** | Wan2.2-TI2V-5B (DiT, 30 transformer layers) |
53
+ | **Action Module** | Per-block conditioning with per-pixel temporal sequences |
54
+ | **Text Encoder** | UMT5-XXL (4096-dim hidden) |
55
+ | **VAE** | Wan2.2 Video VAE (4× temporal compression, 8× spatial compression) |
56
+ | **Total Parameters** | ~5B (1575 tensors, of which 750 are action-related) |
57
  | **Precision** | BFloat16 |
58
 
59
+ ### ActionModule Design
60
+
61
+ Each of the 30 DiT blocks contains an `ActionModule` with two conditioning paths:
62
+
63
+ - **Mouse/Joystick Path**: Sliding-window temporal features → MLP fusion → pixel-wise temporal self-attention with RoPE
64
+ - **Keyboard/Button Path**: Button embedding → temporal windowing → cross-attention (video queries, keyboard keys/values)
65
+
66
+ Both output projections are zero-initialized for stable residual training on top of frozen pretrained weights.
67
+
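
The effect of a zero-initialized output projection can be shown with a minimal, dependency-free sketch. `ResidualBranch` is a hypothetical stand-in, not SCOPE's `ActionModule`: it assumes a simple ReLU MLP in place of the attention paths and a scalar action signal. The point it illustrates is only that a residual branch whose last projection is all zeros leaves the pretrained features unchanged at the start of training.

```python
import random


def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


class ResidualBranch:
    """Hypothetical conditioning branch with a zero-initialized output projection."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        # Inner projection: small random weights over features plus one action value.
        self.w_in = [[rng.gauss(0.0, 0.02) for _ in range(dim + 1)] for _ in range(hidden)]
        self.b_in = [0.0] * hidden
        # Output projection: all zeros, so the branch adds nothing at step 0.
        self.w_out = [[0.0] * hidden for _ in range(dim)]
        self.b_out = [0.0] * dim

    def forward(self, features, action):
        hidden = [max(0.0, h) for h in linear(self.w_in, self.b_in, features + [action])]
        delta = linear(self.w_out, self.b_out, hidden)
        # Residual connection: pretrained features plus the branch's correction.
        return [f + d for f, d in zip(features, delta)]


branch = ResidualBranch(dim=4, hidden=8)
features = [0.5, -1.0, 2.0, 0.25]
output = branch.forward(features, action=1.0)
identity_at_init = output == features
```

Because the correction starts at exactly zero, the base model's behavior is preserved on the first step, and the action signal only gains influence as the output projection is trained.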
68
  ### Generation Specs
69
 
70
  | Property | Value |
71
  |----------|-------|
72
+ | Resolution | 480 × 832 |
73
  | Frame Count | 81 frames |
74
  | Frame Rate | 20 FPS |
75
  | Duration | ~4 seconds |
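
The figures in the table can be cross-checked with a little arithmetic. The sketch below uses the frame count, frame rate, and resolution from the table together with the 4× temporal and 8× spatial compression factors stated in the Architecture table. Two things are assumptions: that the VAE keeps the first frame and compresses the remaining frames in groups (a common convention for causal video VAEs), and that 480 is the height and 832 the width.

```python
def clip_duration_seconds(frame_count, fps):
    """Duration of a clip in seconds."""
    return frame_count / fps


def latent_shape(frame_count, height, width, temporal_factor=4, spatial_factor=8):
    """Latent (frames, height, width) under the stated compression factors.

    Assumes the first frame is kept and the rest are compressed in groups of
    `temporal_factor`; this convention is an assumption, not taken from the README.
    """
    latent_frames = (frame_count - 1) // temporal_factor + 1
    return latent_frames, height // spatial_factor, width // spatial_factor


duration = clip_duration_seconds(81, 20)   # 81 frames at 20 FPS
shape = latent_shape(81, 480, 832)         # compressed frames, height, width
```

Under these assumptions the clip lasts 4.05 seconds, consistent with the "~4 seconds" entry, and the latent grid is 21 × 60 × 104.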
 
102
  ### Requirements
103
 
104
  - Python >= 3.10
105
+ - PyTorch >= 2.0 with CUDA support
106
+ - GPU: NVIDIA with >= 24 GB VRAM (single GPU inference with CPU offload)
 
107
 
108
  ### Installation
109
 
 
117
 
118
  ```bash
119
  # Download all weights (SCOPE DiT + Text Encoder + VAE + Tokenizer) in one command
120
+ huggingface-cli download zizhaotong/SCOPE --local-dir ./SCOPE
121
  ```
122
 
123
  ### Inference
 
178
  ## Citation
179
 
180
  ```bibtex
181
+ @article{scope2026,
182
  title={SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models},
183
+ author={Zizhao Tong and Hongfeng Lai and Zeqing Wang and Zhaohu Xing and Kexu Cheng and Haoran Xu and Zhao Pu and Shangwen Zhu and Ruili Feng and Jian Zhao and Yan Zhang and Hao Tang and Yeying Jin and Ling Shao},
184
+ year={2026}
185
  }
186
  ```
187