Image-to-Video
Safetensors
Wan2.2
English
Chinese
diffsynth
scope
world-model
video-generation
action-conditioned
game-world-model
first-person-shooter
diffusion
transformer
Upload README.md
README.md
CHANGED
|
@@ -34,34 +34,42 @@ base_model:
|
|
| 34 |
|
| 35 |
## Highlights
|
| 36 |
|
| 37 |
-
- **Hybrid Action Space**:
|
| 38 |
-
- **Dense
|
| 39 |
-
- **
|
| 40 |
-
- **
|
| 41 |
-
- **In-Scope / Out-of-Scope Decoupling**: Spatially selective conditioning that separates localized in-scope effects (weapon, HUD) from stable out-of-scope world generation, without segmentation labels
|
| 42 |
|
| 43 |
## Model Overview
|
| 44 |
|
| 45 |
-
SCOPE is an interactive world model for first-person shooter (FPS) games.
|
| 46 |
|
| 47 |
-
|
| 48 |
|
| 49 |
-
##
|
| 50 |
|
| 51 |
| Component | Details |
|
| 52 |
|-----------|---------|
|
| 53 |
-
| **Base Model** | Wan2.2-TI2V-5B (DiT, 30 layers) |
|
| 54 |
-
| **Action Module** | Per-block conditioning
|
| 55 |
-
| **Text Encoder** | UMT5-XXL |
|
| 56 |
-
| **VAE** | Wan2.2 Video VAE (temporal compression
|
| 57 |
-
| **Parameters** | ~5B
|
| 58 |
| **Precision** | BFloat16 |
|
| 59 |
|
| 60 |
### Generation Specs
|
| 61 |
|
| 62 |
| Property | Value |
|
| 63 |
|----------|-------|
|
| 64 |
-
| Resolution | 480
|
| 65 |
| Frame Count | 81 frames |
|
| 66 |
| Frame Rate | 20 FPS |
|
| 67 |
| Duration | ~4 seconds |
|
|
@@ -94,9 +102,8 @@ SCOPE accepts 10-DoF action inputs per frame via a Parquet file:
|
|
| 94 |
### Requirements
|
| 95 |
|
| 96 |
- Python >= 3.10
|
| 97 |
-
- PyTorch >= 2.
|
| 98 |
-
- GPU: NVIDIA with >=
|
| 99 |
-
- [DiffSynth](https://github.com/modelscope/DiffSynth-Studio) framework
|
| 100 |
|
| 101 |
### Installation
|
| 102 |
|
|
@@ -110,21 +117,7 @@ pip install -e .
|
|
| 110 |
|
| 111 |
```bash
|
| 112 |
# Download all weights (SCOPE DiT + Text Encoder + VAE + Tokenizer) in one command
|
| 113 |
-
huggingface-cli download
|
| 114 |
-
```
|
| 115 |
-
|
| 116 |
-
The model directory contains everything needed for inference:
|
| 117 |
-
|
| 118 |
-
```
|
| 119 |
-
SCOPE/
|
| 120 |
-
├── model-00001-of-00003.safetensors # SCOPE DiT shard 1
|
| 121 |
-
├── model-00002-of-00003.safetensors # SCOPE DiT shard 2
|
| 122 |
-
├── model-00003-of-00003.safetensors # SCOPE DiT shard 3
|
| 123 |
-
├── model.safetensors.index.json # Shard index
|
| 124 |
-
├── models_t5_umt5-xxl-enc-bf16.pth # Text Encoder (UMT5-XXL)
|
| 125 |
-
├── Wan2.2_VAE.pth # Video VAE
|
| 126 |
-
├── google/umt5-xxl/ # Tokenizer
|
| 127 |
-
└── config.json # Model config
|
| 128 |
```
|
| 129 |
|
| 130 |
### Inference
|
|
@@ -185,10 +178,10 @@ SCOPE is trained on [**CrossFPS**](https://huggingface.co/collections/zizhaotong
|
|
| 185 |
## Citation
|
| 186 |
|
| 187 |
```bibtex
|
| 188 |
-
@article{
|
| 189 |
title={SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models},
|
| 190 |
-
author={
|
| 191 |
-
year={
|
| 192 |
}
|
| 193 |
```
|
| 194 |
|
|
|
|
| 34 |
|
| 35 |
## Highlights
|
| 36 |
|
| 37 |
+
- **Hybrid Action Space**: Jointly processes continuous (4D dual-joystick) and discrete (6 binary buttons) control signals within a unified framework, the first FPS world model to do so.
|
| 38 |
+
- **Dense Per-Frame Conditioning**: Resolves overlapping actions at every frame, enabling simultaneous multi-action composition (e.g., moving + aiming + firing) that reflects real gameplay complexity.
|
| 39 |
+
- **Cross-Game Generalization**: Trained on 7 diverse FPS titles, a single model generalizes zero-shot to entirely unseen game environments without fine-tuning.
|
| 40 |
+
- **In-Scope / Out-of-Scope Decoupling**: Spatially selective conditioning that separates localized in-scope effects (weapon recoil, HUD) from stable out-of-scope world generation, without any segmentation labels.
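To make the hybrid action space concrete, the sketch below packs one frame's control signal into a 10-element vector: four continuous joystick axes followed by six binary buttons. The field names, the assumed axis range of [-1, 1], and the ordering are illustrative assumptions, not the layout that SCOPE's Parquet files actually use.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_AXES = 4      # continuous dual-joystick axes (from the README)
NUM_BUTTONS = 6   # binary buttons (from the README)


@dataclass(frozen=True)
class FrameAction:
    """One frame of a 10-DoF action. Field names and ordering are hypothetical."""
    axes: Tuple[float, ...]    # e.g. move x/y and look x/y (assumed meaning)
    buttons: Tuple[bool, ...]  # e.g. fire, jump, reload, ... (assumed meaning)

    def __post_init__(self) -> None:
        if len(self.axes) != NUM_AXES or len(self.buttons) != NUM_BUTTONS:
            raise ValueError("expected 4 axes and 6 buttons")
        if any(not -1.0 <= a <= 1.0 for a in self.axes):
            raise ValueError("axes are assumed to lie in [-1, 1]")

    def to_vector(self) -> List[float]:
        """Flatten to [axis0..axis3, button0..button5], buttons as 0.0 or 1.0."""
        return [float(a) for a in self.axes] + [1.0 if b else 0.0 for b in self.buttons]


# Moving forward while turning and firing: several actions overlap in one frame.
action = FrameAction(
    axes=(0.0, 1.0, 0.5, 0.0),
    buttons=(True, False, False, False, False, False),
)
vector = action.to_vector()
```

A clip would then be a sequence of such vectors, one per generated frame.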
|
|
|
|
| 41 |
|
| 42 |
## Model Overview
|
| 43 |
|
| 44 |
+
SCOPE is an interactive world model for first-person shooter (FPS) games. Built on [Wan2.2-TI2V-5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B), SCOPE inserts a conditioning module into each transformer block that reshapes features into per-pixel temporal sequences. Each spatial position computes its action response from local visual content, naturally separating in-scope effects (e.g., weapon firing, reloading) from out-of-scope world generation (e.g., stable surroundings), without any segmentation labels.
|
| 45 |
|
| 46 |
+
Trained on [**CrossFPS**](https://huggingface.co/collections/zizhaotong/crossfps) (69K clips, 7 games, 10-DoF), SCOPE learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes.
|
| 47 |
|
| 48 |
+
## Architecture
|
| 49 |
|
| 50 |
| Component | Details |
|
| 51 |
|-----------|---------|
|
| 52 |
+
| **Base Model** | Wan2.2-TI2V-5B (DiT, 30 transformer layers) |
|
| 53 |
+
| **Action Module** | Per-block conditioning with per-pixel temporal sequences |
|
| 54 |
+
| **Text Encoder** | UMT5-XXL (4096-dim hidden) |
|
| 55 |
+
| **VAE** | Wan2.2 Video VAE (4× temporal compression, 8× spatial compression) |
|
| 56 |
+
| **Total Parameters** | ~5B (1575 tensors, of which 750 are action-related) |
|
| 57 |
| **Precision** | BFloat16 |
|
| 58 |
|
| 59 |
+
### ActionModule Design
|
| 60 |
+
|
| 61 |
+
Each of the 30 DiT blocks contains an `ActionModule` with two conditioning paths:
|
| 62 |
+
|
| 63 |
+
- **Mouse/Joystick Path**: Sliding-window temporal features → MLP fusion → pixel-wise temporal self-attention with RoPE
|
| 64 |
+
- **Keyboard/Button Path**: Button embedding → temporal windowing → cross-attention (video queries, keyboard keys/values)
|
| 65 |
+
|
| 66 |
+
Both output projections are zero-initialized for stable residual training on top of frozen pretrained weights.
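The effect of a zero-initialized output projection can be seen in a small sketch: when the last projection of a residual branch starts at zero, the branch contributes nothing, so the block initially reproduces the pretrained features exactly. The code below is pure Python for illustration only; `residual_branch` and its toy weights are hypothetical and do not mirror the actual `ActionModule` implementation.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(weight: Matrix, x: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def residual_branch(x: Vector, w_in: Matrix, w_out: Matrix) -> Vector:
    """Return x + W_out * relu(W_in * x), a toy stand-in for a conditioning branch."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    update = matvec(w_out, hidden)
    return [xi + ui for xi, ui in zip(x, update)]


random.seed(0)
dim, hidden_dim = 4, 8
features = [random.uniform(-1.0, 1.0) for _ in range(dim)]  # stands in for pretrained features
w_in = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(hidden_dim)]
w_out_zero = [[0.0] * hidden_dim for _ in range(dim)]       # zero-initialized output projection

# With a zero output projection the branch is an exact identity at initialization.
output_at_init = residual_branch(features, w_in, w_out_zero)
```

Because the update starts at zero, training can grow the action response gradually instead of perturbing the pretrained features from the first step.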
|
| 67 |
+
|
| 68 |
### Generation Specs
|
| 69 |
|
| 70 |
| Property | Value |
|
| 71 |
|----------|-------|
|
| 72 |
+
| Resolution | 480 × 832 |
|
| 73 |
| Frame Count | 81 frames |
|
| 74 |
| Frame Rate | 20 FPS |
|
| 75 |
| Duration | ~4 seconds |
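The figures in the Architecture and Generation Specs tables can be cross-checked with a few lines of arithmetic. The sketch assumes the common causal video-VAE convention in which the first frame is encoded on its own and each following group of 4 frames becomes one latent frame; that convention and the helper names are assumptions, not something this README states. The compression factors are taken as written in the Architecture table.

```python
from typing import Tuple

FRAME_COUNT = 81          # from the Generation Specs table
FRAME_RATE = 20           # frames per second
HEIGHT, WIDTH = 480, 832  # output resolution
TEMPORAL_FACTOR = 4       # temporal compression as stated in the Architecture table
SPATIAL_FACTOR = 8        # spatial compression as stated in the Architecture table


def clip_duration_seconds(frames: int, fps: int) -> float:
    """Length of the clip in seconds."""
    return frames / fps


def latent_frames(frames: int, temporal_factor: int) -> int:
    """Assumed causal-VAE rule: one latent for frame 0, then one per group."""
    if (frames - 1) % temporal_factor != 0:
        raise ValueError("frame count must be 1 plus a multiple of the temporal factor")
    return 1 + (frames - 1) // temporal_factor


def latent_size(height: int, width: int, spatial_factor: int) -> Tuple[int, int]:
    """Spatial size of the latent grid before any patchification."""
    return height // spatial_factor, width // spatial_factor


duration = clip_duration_seconds(FRAME_COUNT, FRAME_RATE)
n_latent = latent_frames(FRAME_COUNT, TEMPORAL_FACTOR)
latent_hw = latent_size(HEIGHT, WIDTH, SPATIAL_FACTOR)
```

Under these assumptions, 81 frames at 20 FPS last 4.05 seconds, which matches the "~4 seconds" entry, and the clip maps to 21 latent frames on a 60 × 104 grid.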
|
|
|
|
| 102 |
### Requirements
|
| 103 |
|
| 104 |
- Python >= 3.10
|
| 105 |
+
- PyTorch >= 2.0 with CUDA support
|
| 106 |
+
- GPU: NVIDIA with >= 24 GB VRAM (single GPU inference with CPU offload)
|
|
|
|
| 107 |
|
| 108 |
### Installation
|
| 109 |
|
|
|
|
| 117 |
|
| 118 |
```bash
|
| 119 |
# Download all weights (SCOPE DiT + Text Encoder + VAE + Tokenizer) in one command
|
| 120 |
+
huggingface-cli download zizhaotong/SCOPE --local-dir ./SCOPE
|
| 121 |
```
|
| 122 |
|
| 123 |
### Inference
|
|
|
|
| 178 |
## Citation
|
| 179 |
|
| 180 |
```bibtex
|
| 181 |
+
@article{scope2026,
|
| 182 |
title={SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models},
|
| 183 |
+
author={Zizhao Tong and Hongfeng Lai and Zeqing Wang and Zhaohu Xing and Kexu Cheng and Haoran Xu and Zhao Pu and Shangwen Zhu and Ruili Feng and Jian Zhao and Yan Zhang and Hao Tang and Yeying Jin and Ling Shao},
|
| 184 |
+
year={2026}
|
| 185 |
}
|
| 186 |
```
|
| 187 |
|