Upload 2 files
- .gitattributes +1 -0
- README.md +195 -1
- assets/teaser.jpg +3 -0
.gitattributes
CHANGED
|
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
assets/teaser.jpg filter=lfs diff=lfs merge=lfs -text
|
README.md
CHANGED
|
@@ -1,3 +1,197 @@
|
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
-
|
| 3 |
+
language:
|
| 4 |
+
- en
|
| 5 |
+
- zh
|
| 6 |
+
tags:
|
| 7 |
+
- world-model
|
| 8 |
+
- video-generation
|
| 9 |
+
- action-conditioned
|
| 10 |
+
- game-world-model
|
| 11 |
+
- first-person-shooter
|
| 12 |
+
- diffusion
|
| 13 |
+
- transformer
|
| 14 |
+
- wan2.2
|
| 15 |
+
library_name: diffsynth
|
| 16 |
+
pipeline_tag: image-to-video
|
| 17 |
+
base_model:
|
| 18 |
+
- Wan-AI/Wan2.2-TI2V-5B
|
| 19 |
+
---
|
| 20 |
+
|
| 21 |
+
# <span style="color:rgb(30,100,200)">S</span><span style="color:rgb(60,160,220)">C</span><span style="color:rgb(60,160,220)">O</span><span style="color:rgb(210,120,40)">P</span><span style="color:rgb(180,80,30)">E</span>: Simulating Cross-game Operations in Playable Environments for FPS World Models
|
| 22 |
+
|
| 23 |
+
<div align="center">
|
| 24 |
+
<img src="assets/teaser.jpg" alt="SCOPE Teaser" width="90%">
|
| 25 |
+
|
| 26 |
+
<p><i><b><span style="color:rgb(30,100,200)">S</span><span style="color:rgb(60,160,220)">C</span><span style="color:rgb(60,160,220)">O</span><span style="color:rgb(210,120,40)">P</span><span style="color:rgb(180,80,30)">E</span></b> is an interactive world model for FPS games with 10-DoF action control, trained on 69K clips across 7 games.</i></p>
|
| 27 |
+
|
| 28 |
+
[GitHub](https://github.com/z2tong/SCOPE)
|
| 29 |
+
[Code](https://github.com/z2tong/SCOPE)
|
| 30 |
+
[arXiv](https://arxiv.org/abs/XXXX.XXXXX)
|
| 31 |
+
[License](LICENSE)
|
| 32 |
+
|
| 33 |
+
</div>
|
| 34 |
+
|
| 35 |
+
## Highlights
|
| 36 |
+
|
| 37 |
+
- **Hybrid Action Space**: First FPS world model to handle both continuous (4D joystick) and discrete (6 binary buttons) actions simultaneously in a unified framework
|
| 38 |
+
- **Dense High-Frequency Control**: Resolves overlapping control signals at every frame, unlike prior methods limited to sparse inputs
|
| 39 |
+
- **Action Composition**: Supports simultaneous multi-action combinations (e.g., moving + aiming + firing), reflecting real gameplay complexity
|
| 40 |
+
- **Cross-Game Generalization**: A single world model trained on 7 diverse FPS games that generalizes zero-shot to unseen game environments
|
| 41 |
+
- **In-Scope / Out-of-Scope Decoupling**: Spatially selective conditioning that separates localized in-scope effects (weapon, HUD) from stable out-of-scope world generation, without segmentation labels
|
| 42 |
+
|
| 43 |
+
## Model Overview
|
| 44 |
+
|
| 45 |
+
SCOPE is an interactive world model for first-person shooter (FPS) games. Unlike prior game world models that only handle sparse, single-modality actions, SCOPE processes **hybrid action inputs** (continuous joystick + discrete buttons) at **dense, high-frequency** rates, supporting real-time **action composition**: multiple simultaneous inputs such as moving, aiming, and firing in the same frame.
|
| 46 |
+
|
| 47 |
+
Built on [Wan2.2-TI2V-5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B), SCOPE inserts a conditioning module into each transformer block that reshapes features into per-pixel temporal sequences. Each spatial position computes its action response from local visual content, naturally separating in-scope effects (e.g., weapon firing, reloading) from out-of-scope world generation (e.g., stable surroundings), without any segmentation labels. Trained on [**CrossFPS**](https://huggingface.co/collections/zizhaotong/crossfps) (69K clips, 7 games, 10-DoF), SCOPE learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes.
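
To make the "per-pixel temporal sequences" idea concrete, here is a minimal NumPy sketch of the reshape it implies. This is not the SCOPE implementation: the function names are hypothetical, and the frame-major token order `(t, h, w)` is an assumption about how the DiT flattens its latent grid.

```python
import numpy as np


def to_per_pixel_sequences(tokens, t, h, w):
    """Regroup (B, T*H*W, C) tokens into (B*H*W, T, C).

    Assumes tokens are flattened frame-major, i.e. index = t*(H*W) + h*W + w.
    Each output row is the temporal sequence of one spatial position.
    """
    b, n, c = tokens.shape
    if n != t * h * w:
        raise ValueError(f"expected {t * h * w} tokens, got {n}")
    x = tokens.reshape(b, t, h, w, c)
    x = x.transpose(0, 2, 3, 1, 4)  # (B, H, W, T, C)
    return x.reshape(b * h * w, t, c)


def from_per_pixel_sequences(seqs, b, h, w):
    """Inverse of to_per_pixel_sequences: (B*H*W, T, C) -> (B, T*H*W, C)."""
    _, t, c = seqs.shape
    x = seqs.reshape(b, h, w, t, c).transpose(0, 3, 1, 2, 4)  # (B, T, H, W, C)
    return x.reshape(b, t * h * w, c)
```

With this layout, a per-frame action embedding of shape `(T, C)` can be combined with every spatial position's sequence independently, which is one plausible way for each position to compute its own action response.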
|
| 48 |
+
|
| 49 |
+
### Architecture
|
| 50 |
+
|
| 51 |
+
| Component | Details |
|
| 52 |
+
|-----------|---------|
|
| 53 |
+
| **Base Model** | Wan2.2-TI2V-5B (DiT, 30 layers) |
|
| 54 |
+
| **Action Module** | Per-block conditioning, per-pixel temporal sequences |
|
| 55 |
+
| **Text Encoder** | UMT5-XXL |
|
| 56 |
+
| **VAE** | Wan2.2 Video VAE (temporal compression 4x, spatial compression 16x) |
|
| 57 |
+
| **Parameters** | ~5B total (1575 tensors) |
|
| 58 |
+
| **Precision** | BFloat16 |
|
| 59 |
+
|
| 60 |
+
### Generation Specs
|
| 61 |
+
|
| 62 |
+
| Property | Value |
|
| 63 |
+
|----------|-------|
|
| 64 |
+
| Resolution | 480 x 832 |
|
| 65 |
+
| Frame Count | 81 frames |
|
| 66 |
+
| Frame Rate | 20 FPS |
|
| 67 |
+
| Duration | ~4 seconds |
|
| 68 |
+
| Inference Steps | 30 (default) |
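
The numbers in this table are related by simple arithmetic, sketched below. The clip duration follows directly from the frame count and frame rate. The latent frame count additionally assumes the common causal video VAE convention (first frame encoded alone, then one latent frame per group of 4 frames), which this README does not state explicitly.

```python
NUM_FRAMES = 81
FPS = 20
TEMPORAL_COMPRESSION = 4  # from the Architecture table


def clip_duration_seconds(num_frames=NUM_FRAMES, fps=FPS):
    """Length of the generated clip in seconds."""
    return num_frames / fps


def latent_frame_count(num_frames=NUM_FRAMES, stride=TEMPORAL_COMPRESSION):
    """Latent frames under the assumed causal-VAE rule (T - 1) / stride + 1."""
    if (num_frames - 1) % stride != 0:
        raise ValueError("num_frames must be of the form stride * k + 1")
    return (num_frames - 1) // stride + 1


if __name__ == "__main__":
    print(clip_duration_seconds())  # 4.05
    print(latent_frame_count())     # 21
```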
|
| 69 |
+
|
| 70 |
+
## Action Input Format
|
| 71 |
+
|
| 72 |
+
SCOPE accepts 10-DoF action inputs per frame via a Parquet file:
|
| 73 |
+
|
| 74 |
+
**Controller Buttons (6D binary):**
|
| 75 |
+
|
| 76 |
+
| Index | Column | Action |
|
| 77 |
+
|:-----:|--------|--------|
|
| 78 |
+
| 0 | `right_trigger` | Fire (RT) |
|
| 79 |
+
| 1 | `left_trigger` | Aim Down Sights (LT) |
|
| 80 |
+
| 2 | `south` | Jump (A) |
|
| 81 |
+
| 3 | `right_thumb` | Melee (R3) |
|
| 82 |
+
| 4 | `west` | Reload (X) |
|
| 83 |
+
| 5 | `north` | Weapon Switch (Y) |
|
| 84 |
+
|
| 85 |
+
**Dual Joystick (4D continuous):**
|
| 86 |
+
|
| 87 |
+
| Column | Axes | Function |
|
| 88 |
+
|--------|------|----------|
|
| 89 |
+
| `j_left` | [x, y] | Character movement (left stick) |
|
| 90 |
+
| `j_right` | [x, y] | Camera rotation (right stick) |
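
As an illustration of the format, the sketch below builds one row per frame using the column names from the two tables above. Several details are assumptions rather than documented behavior: one row per generated frame (81 rows), stick values in `[-1, 1]`, and sticks stored as two-element `[x, y]` lists. The helper names are hypothetical; check the GitHub repository for the exact schema expected by `inference.py`.

```python
BUTTONS = ["right_trigger", "left_trigger", "south", "right_thumb", "west", "north"]
STICKS = ["j_left", "j_right"]


def make_action_rows(num_frames=81):
    """Walk forward while panning right; hold fire in the second half of the clip."""
    rows = []
    for t in range(num_frames):
        row = {name: 0 for name in BUTTONS}
        row["right_trigger"] = int(t >= num_frames // 2)
        row["j_left"] = [0.0, 1.0]   # [x, y]; assumed to mean full forward
        row["j_right"] = [0.3, 0.0]  # [x, y]; assumed to mean a slow pan right
        rows.append(row)
    return rows


def validate(rows):
    """Check the assumed value ranges and return the number of frames."""
    for i, row in enumerate(rows):
        for b in BUTTONS:
            if row[b] not in (0, 1):
                raise ValueError(f"frame {i}: {b} must be 0 or 1")
        for s in STICKS:
            if len(row[s]) != 2 or not all(-1.0 <= v <= 1.0 for v in row[s]):
                raise ValueError(f"frame {i}: {s} must be [x, y] within [-1, 1]")
    return len(rows)


def write_parquet(rows, path="action.parquet"):
    """Write the rows with pandas (requires pandas plus pyarrow or fastparquet)."""
    import pandas as pd

    pd.DataFrame(rows).to_parquet(path)
```

Calling `write_parquet(make_action_rows())` would produce a file that can be passed as `--action_path`, provided the assumptions above match the loader.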
|
| 91 |
+
|
| 92 |
+
## Quick Start
|
| 93 |
+
|
| 94 |
+
### Requirements
|
| 95 |
+
|
| 96 |
+
- Python >= 3.10
|
| 97 |
+
- PyTorch >= 2.5 with CUDA support
|
| 98 |
+
- GPU: NVIDIA with >= 24GB VRAM (single GPU inference with CPU offload)
|
| 99 |
+
- [DiffSynth](https://github.com/modelscope/DiffSynth-Studio) framework
|
| 100 |
+
|
| 101 |
+
### Installation
|
| 102 |
+
|
| 103 |
+
```bash
|
| 104 |
+
git clone https://github.com/z2tong/SCOPE.git
|
| 105 |
+
cd SCOPE
|
| 106 |
+
pip install -e .
|
| 107 |
+
```
|
| 108 |
+
|
| 109 |
+
### Download Weights
|
| 110 |
+
|
| 111 |
+
```bash
|
| 112 |
+
# Download all weights (SCOPE DiT + Text Encoder + VAE + Tokenizer) in one command
|
| 113 |
+
huggingface-cli download zizhaotong/SCOPE --local-dir ./SCOPE
|
| 114 |
+
```
|
| 115 |
+
|
| 116 |
+
The model directory contains everything needed for inference:
|
| 117 |
+
|
| 118 |
+
```
|
| 119 |
+
SCOPE/
|
| 120 |
+
├── model-00001-of-00003.safetensors # SCOPE DiT shard 1
|
| 121 |
+
├── model-00002-of-00003.safetensors # SCOPE DiT shard 2
|
| 122 |
+
├── model-00003-of-00003.safetensors # SCOPE DiT shard 3
|
| 123 |
+
├── model.safetensors.index.json # Shard index
|
| 124 |
+
├── models_t5_umt5-xxl-enc-bf16.pth # Text Encoder (UMT5-XXL)
|
| 125 |
+
├── Wan2.2_VAE.pth # Video VAE
|
| 126 |
+
├── google/umt5-xxl/ # Tokenizer
|
| 127 |
+
└── config.json # Model config
|
| 128 |
+
```
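
Before running inference, it can help to confirm that the download is complete. The following is a small standard-library sketch that checks a local directory against the listing above. The helper name `missing_files` is hypothetical and is not part of the SCOPE codebase.

```python
from pathlib import Path

# Entries taken from the directory listing above.
EXPECTED_ENTRIES = [
    "model-00001-of-00003.safetensors",
    "model-00002-of-00003.safetensors",
    "model-00003-of-00003.safetensors",
    "model.safetensors.index.json",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2_VAE.pth",
    "google/umt5-xxl",
    "config.json",
]


def missing_files(model_dir):
    """Return the expected entries that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_files("./SCOPE")
    print("All files present" if not missing else f"Missing: {missing}")
```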
|
| 129 |
+
|
| 130 |
+
### Inference
|
| 131 |
+
|
| 132 |
+
**Single image + action sequence:**
|
| 133 |
+
|
| 134 |
+
```bash
|
| 135 |
+
python inference.py \
|
| 136 |
+
--model_dir ./SCOPE \
|
| 137 |
+
--input_image input.png \
|
| 138 |
+
--action_path action.parquet \
|
| 139 |
+
--prompt "First-person shooter perspective in a modern city" \
|
| 140 |
+
--seed 42
|
| 141 |
+
```
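
To run the same input with several seeds, the command above can be assembled programmatically. This sketch only uses the flags shown in this README (`--model_dir`, `--input_image`, `--action_path`, `--prompt`, `--seed`). The wrapper functions are hypothetical, and the sketch assumes `inference.py` is in the current working directory.

```python
import sys


def build_inference_command(model_dir, input_image, action_path, prompt,
                            seed=42, script="inference.py"):
    """Return the argv list for one inference.py run."""
    return [
        sys.executable, script,
        "--model_dir", str(model_dir),
        "--input_image", str(input_image),
        "--action_path", str(action_path),
        "--prompt", prompt,
        "--seed", str(seed),
    ]


def seed_sweep(seeds, **kwargs):
    """Build one command per seed, sharing all other arguments."""
    return [build_inference_command(seed=s, **kwargs) for s in seeds]


# Usage (runs the model, so it is left as a comment):
#   import subprocess
#   for cmd in seed_sweep([0, 1, 2], model_dir="./SCOPE", input_image="input.png",
#                         action_path="action.parquet", prompt="First-person view"):
#       subprocess.run(cmd, check=True)
```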
|
| 142 |
+
|
| 143 |
+
**Batch processing (directory of images):**
|
| 144 |
+
|
| 145 |
+
```bash
|
| 146 |
+
python inference.py \
|
| 147 |
+
--model_dir ./SCOPE \
|
| 148 |
+
--input_image_dir ./images \
|
| 149 |
+
--action_path action.parquet \
|
| 150 |
+
--prompt "First-person view in a battlefield" \
|
| 151 |
+
--output_dir ./outputs
|
| 152 |
+
```
|
| 153 |
+
|
| 154 |
+
For full usage details and advanced options, see the [GitHub repository](https://github.com/z2tong/SCOPE).
|
| 155 |
+
|
| 156 |
+
## Repository Contents
|
| 157 |
+
|
| 158 |
+
This repo contains **all weights** needed for inference in a single download:
|
| 159 |
+
|
| 160 |
+
| File | Component | Size |
|
| 161 |
+
|------|-----------|------|
|
| 162 |
+
| `model-00001-of-00003.safetensors` | SCOPE DiT shard 1 | ~5.0 GB |
|
| 163 |
+
| `model-00002-of-00003.safetensors` | SCOPE DiT shard 2 | ~5.0 GB |
|
| 164 |
+
| `model-00003-of-00003.safetensors` | SCOPE DiT shard 3 | ~4.6 GB |
|
| 165 |
+
| `model.safetensors.index.json` | Shard index mapping | - |
|
| 166 |
+
| `models_t5_umt5-xxl-enc-bf16.pth` | Text Encoder (UMT5-XXL) | ~20 GB |
|
| 167 |
+
| `Wan2.2_VAE.pth` | Video VAE | ~700 MB |
|
| 168 |
+
| `google/umt5-xxl/` | Tokenizer | ~10 MB |
|
| 169 |
+
| `config.json` | Model architecture config | - |
|
| 170 |
+
|
| 171 |
+
> **Inference code** is available at [github.com/z2tong/SCOPE](https://github.com/z2tong/SCOPE).
|
| 172 |
+
|
| 173 |
+
## CrossFPS Dataset
|
| 174 |
+
|
| 175 |
+
SCOPE is trained on [**CrossFPS**](https://huggingface.co/collections/zizhaotong/crossfps), the first multi-game FPS dataset with frame-aligned action telemetry:
|
| 176 |
+
|
| 177 |
+
| Property | Value |
|
| 178 |
+
|----------|-------|
|
| 179 |
+
| Games | 7 diverse FPS titles |
|
| 180 |
+
| Total Clips | 69,000+ |
|
| 181 |
+
| Action Dimensions | 10-DoF (6 buttons + 4D joystick) |
|
| 182 |
+
| Annotation | Frame-aligned action telemetry |
|
| 183 |
+
| Curation | Gameplay-bias removal for general visual-to-action mapping |
|
| 184 |
+
|
| 185 |
+
## Citation
|
| 186 |
+
|
| 187 |
+
```bibtex
|
| 188 |
+
@article{scope2025,
|
| 189 |
+
title={SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models},
|
| 190 |
+
author={Tong, Zizhao and Lai, Hongfeng and Wang, Zeqing and Xing, Zhaohu and Cheng, Kexu and Xu, Haoran and Pu, Zhao and Zhu, Shangwen and Feng, Ruili and Zhao, Jian and Zhang, Yan and Tang, Hao and Jin, Yeying and Shao, Ling},
|
| 191 |
+
year={2025}
|
| 192 |
+
}
|
| 193 |
+
```
|
| 194 |
+
|
| 195 |
+
## Acknowledgements
|
| 196 |
+
|
| 197 |
+
We thank the [Wan Team](https://huggingface.co/Wan-AI) for open-sourcing Wan2.2 and the [DiffSynth](https://github.com/modelscope/DiffSynth-Studio) team for the inference framework.
|
assets/teaser.jpg
ADDED