Jayeep BianYx committed on
Commit b4b5bd2 · 0 Parent(s)

Duplicate from TencentARC/VideoPainter

Co-authored-by: Yuxuan BIAN <BianYx@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,37 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ assets/method.jpg filter=lfs diff=lfs merge=lfs -text
37
+ assets/teaser.jpg filter=lfs diff=lfs merge=lfs -text
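The attribute rules above decide which files are stored through Git LFS. The following is a minimal sketch of how such a check could look, assuming a simplified matching rule (patterns without a slash match the file name only, patterns with a slash match the whole relative path). `is_lfs_tracked` and `LFS_PATTERNS` are illustrative names, only a few of the 37 patterns are listed, and this is not a full implementation of gitattributes matching.

```python
from fnmatch import fnmatch

# A subset of the patterns from the .gitattributes hunk above.
LFS_PATTERNS = (
    "*.safetensors",
    "*.pt",
    "*.npz",
    "*.zip",
    "*tfevents*",
    "assets/method.jpg",
    "assets/teaser.jpg",
)


def is_lfs_tracked(path, patterns=LFS_PATTERNS):
    """Approximate check: does the relative `path` match any LFS pattern?

    Simplification (an assumption, not Git's exact algorithm): a pattern
    without "/" is compared with the file name, a pattern with "/" is
    compared with the full relative path.
    """
    name = path.rsplit("/", 1)[-1]
    for pattern in patterns:
        target = path if "/" in pattern else name
        if fnmatch(target, pattern):
            return True
    return False
```

Under this sketch, `is_lfs_tracked("i3d_rgb_imagenet.pt")` is true while `is_lfs_tracked("README.md")` is false, which matches how the files in this commit appear (pointer files versus plain text).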
License.txt ADDED
@@ -0,0 +1,99 @@
1
+ This project, "VideoPainter", is fine-tuned with the assistance of "CogVideo 5B", which is subject to the CogVideoX License. Details of the CogVideoX License can be found in this file.
2
+
3
+ In addition, usage of any components originally developed or modified by us, is also subject to the following requirement:
4
+
5
+ Copyright (C) 2025 THL A29 Limited, a Tencent company. All rights reserved.
6
+
7
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this Software and associated documentation files, to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sublicense copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
8
+
9
+ - You agree to use the VideoPainter only for academic, research and education purposes, and refrain from using it for any commercial or production purposes under any circumstances.
10
+
11
+ - The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
12
+
13
+ For avoidance of doubts, "Software" means the VideoPainter model inference code, training code, parameters and weights made available under this license.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
16
+
17
+
18
+
19
+
20
+
21
+ Open Source Model Licensed under the CogVideoX License:
22
+ --------------------------------------------------------------------
23
+ 1. CogVideo 5B
24
+ Copyright 2024 CogVideo Model Team @ Zhipu AI
25
+
26
+
27
+ Terms of the CogVideoX License:
28
+ --------------------------------------------------------------------
29
+ The CogVideoX License
30
+
31
+ 1. Definitions
32
+
33
+ “Licensor” means the CogVideoX Model Team that distributes its Software.
34
+
35
+ “Software” means the CogVideoX model parameters made available under this license.
36
+
37
+ 2. License Grant
38
+
39
+ Under the terms and conditions of this license, the licensor hereby grants you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license. The intellectual property rights of the generated content belong to the user to the extent permitted by applicable local laws.
40
+ This license allows you to freely use all open-source models in this repository for academic research. Users who wish to use the models for commercial purposes must register and obtain a basic commercial license in https://open.bigmodel.cn/mla/form .
41
+ Users who have registered and obtained the basic commercial license can use the models for commercial activities for free, but must comply with all terms and conditions of this license. Additionally, the number of service users (visits) for your commercial activities must not exceed 1 million visits per month.
42
+ If the number of service users (visits) for your commercial activities exceeds 1 million visits per month, you need to contact our business team to obtain more commercial licenses.
43
+ The above copyright statement and this license statement should be included in all copies or significant portions of this software.
44
+
45
+ 3. Restriction
46
+
47
+ You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any military, or illegal purposes.
48
+
49
+ You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings.
50
+
51
+ 4. Disclaimer
52
+
53
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
54
+
55
+ 5. Limitation of Liability
56
+
57
+ EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER BASED IN TORT, NEGLIGENCE, CONTRACT, LIABILITY, OR OTHERWISE WILL ANY LICENSOR BE LIABLE TO YOU FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES, OR ANY OTHER COMMERCIAL LOSSES, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
58
+
59
+ 6. Dispute Resolution
60
+
61
+ This license shall be governed and construed in accordance with the laws of People’s Republic of China. Any dispute arising from or in connection with this License shall be submitted to Haidian District People's Court in Beijing.
62
+
63
+ Note that the license is subject to update to a more comprehensive version. For any questions related to the license and copyright, please contact us at license@zhipuai.cn.
64
+
65
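+ 1. Definitions (translated from the Chinese-language text)
66
+
67
+ "Licensor" means the CogVideoX Model Team that distributes its Software.
68
+
69
+ "Software" means the CogVideoX model parameters made available under this license.
70
+
71
+ 2. License Grant
72
+
73
+ Under the terms and conditions of this license, the Licensor hereby grants you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license. Ownership of the intellectual property in generated content is determined by applicable local law; to the extent the law permits, the user holds the intellectual property rights or other rights in the generated content.
74
+ This license allows you to use all open-source models in this repository free of charge for academic research. Users who wish to use the models for commercial purposes must register at https://open.bigmodel.cn/mla/form and obtain a basic commercial license.
75
+
76
+ Users who have registered and obtained the basic commercial license may use the models for commercial activities free of charge, but must comply with all terms and conditions of this license.
77
+ Under this license, the number of service users (visits) for your commercial activities must not exceed 1 million visits per month. If it does, you need to contact our business team to obtain additional commercial licenses.
78
+ The above copyright notice and this license notice shall be included in all copies or significant portions of the Software.
79
+
80
+ 3. Restrictions
81
+
82
+ You may not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any military or illegal purpose.
83
+
84
+ You may not use the Software for any act that endangers national security and national unity, harms the public interest of society, or infringes upon personal rights and interests.
85
+
86
+ 4. Disclaimer
87
+
88
+ The Software is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and non-infringement.
89
+ In no event shall the authors or copyright holders be liable for any claim, damages, or other liability, whether in an action of contract, tort, or otherwise, arising from, out of, or in connection with the Software or the use of or other dealings in the Software.
90
+
91
+ 5. Limitation of Liability
92
+
93
+ Except to the extent prohibited by applicable law, in no event and under no legal theory, whether based in tort, negligence, contract, liability, or otherwise, will any Licensor be liable to you for any direct, indirect, special, incidental, exemplary, or consequential damages, or any other commercial losses, even if the Licensor has been advised of the possibility of such damages.
94
+
95
+ 6. Dispute Resolution
96
+
97
+ This license shall be governed by and construed in accordance with the laws of the People's Republic of China. Any dispute arising from or in connection with this license shall be submitted to the Haidian District People's Court in Beijing.
98
+
99
+ Please note that the license may be updated to a more comprehensive version. For any questions related to the license and copyright, please contact us at license@zhipuai.cn.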
README.md ADDED
@@ -0,0 +1,489 @@
1
+ ---
2
+ language:
3
+ - en
4
+ base_model:
5
+ - THUDM/CogVideoX-5b
6
+ - THUDM/CogVideoX-5b-I2V
7
+ - THUDM/CogVideoX1.5-5B
8
+ - THUDM/CogVideoX1.5-5B-I2V
9
+ tags:
10
+ - video
11
+ - video inpainting
12
+ - video editing
13
+ ---
14
+
15
+
16
+ # VideoPainter
17
+
18
+ This repository contains the implementation of the paper "VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control"
19
+
20
+ Keywords: Video Inpainting, Video Editing, Video Generation
21
+
22
+ > [Yuxuan Bian](https://yxbian23.github.io/)<sup>12</sup>, [Zhaoyang Zhang](https://zzyfd.github.io/#/)<sup>1‡</sup>, [Xuan Ju](https://juxuan27.github.io/)<sup>2</sup>, [Mingdeng Cao](https://openreview.net/profile?id=~Mingdeng_Cao1)<sup>3</sup>, [Liangbin Xie](https://liangbinxie.github.io/)<sup>4</sup>, [Ying Shan](https://www.linkedin.com/in/YingShanProfile/)<sup>1</sup>, [Qiang Xu](https://cure-lab.github.io/)<sup>2✉</sup><br>
23
+ > <sup>1</sup>ARC Lab, Tencent PCG <sup>2</sup>The Chinese University of Hong Kong <sup>3</sup>The University of Tokyo <sup>4</sup>University of Macau <sup>‡</sup>Project Lead <sup>✉</sup>Corresponding Author
24
+
25
+
26
+
27
+ <p align="center">
28
+ <a href='https://yxbian23.github.io/project/video-painter'><img src='https://img.shields.io/badge/Project-Page-Green'></a> &nbsp;
29
+ <a href="https://arxiv.org/abs/2503.05639"><img src="https://img.shields.io/badge/arXiv-2503.05639-b31b1b.svg"></a> &nbsp;
30
+ <a href="https://github.com/TencentARC/VideoPainter"><img src="https://img.shields.io/badge/GitHub-Code-black?logo=github"></a> &nbsp;
31
+ <a href="https://youtu.be/HYzNfsD3A0s"><img src="https://img.shields.io/badge/YouTube-Video-red?logo=youtube"></a> &nbsp;
32
+ <a href='https://huggingface.co/datasets/TencentARC/VPData'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue'></a> &nbsp;
33
+ <a href='https://huggingface.co/datasets/TencentARC/VPBench'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Benchmark-blue'></a> &nbsp;
34
+ <a href="https://huggingface.co/TencentARC/VideoPainter"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue"></a>
35
+ </p>
36
+
37
+ **Your star means a lot to us as we develop this project!** ⭐⭐⭐
38
+
39
+ **VPData and VPBench have been fully uploaded (390K mask sequences with video captions). You are welcome to use VPData, our largest video segmentation dataset with video captions!** 🔥🔥🔥
40
+
41
+
42
+ **📖 Table of Contents**
43
+
44
+
45
+ - [VideoPainter](#videopainter)
46
+ - [🔥 Update Log](#-update-log)
47
+ - [📌 TODO](#todo)
48
+ - [🛠️ Method Overview](#️-method-overview)
49
+ - [🚀 Getting Started](#-getting-started)
50
+ - [Environment Requirement 🌍](#environment-requirement-)
51
+ - [Data Download ⬇️](#data-download-️)
52
+ - [🏃🏼 Running Scripts](#-running-scripts)
53
+ - [Training 🤯](#training-)
54
+ - [Inference 📜](#inference-)
55
+ - [Evaluation 📏](#evaluation-)
56
+ - [🤝🏼 Cite Us](#-cite-us)
57
+ - [💖 Acknowledgement](#-acknowledgement)
58
+
59
+
60
+
61
+ ## 🔥 Update Log
62
+ - [2025/3/09] 📢 📢 [VideoPainter](https://huggingface.co/TencentARC/VideoPainter) is released: an efficient, any-length video inpainting and editing framework with plug-and-play context control.
63
+ - [2025/3/09] 📢 📢 [VPData](https://huggingface.co/datasets/TencentARC/VPData) and [VPBench](https://huggingface.co/datasets/TencentARC/VPBench) are released: the largest video inpainting dataset and benchmark, with precise segmentation masks and dense video captions (>390K clips).
64
+ - [2025/3/25] 📢 📢 The 390K+ high-quality video segmentation masks of [VPData](https://huggingface.co/datasets/TencentARC/VPData) have been fully released.
65
+ - [2025/3/25] 📢 📢 The raw videos of the Videovo subset have been uploaded to [VPData](https://huggingface.co/datasets/TencentARC/VPData) to resolve the expired raw-video links.
66
+
67
+ ## TODO
68
+
69
+ - [x] Release training and inference code
70
+ - [x] Release evaluation code
71
+ - [x] Release [VideoPainter checkpoints](https://huggingface.co/TencentARC/VideoPainter) (based on CogVideoX-5B)
72
+ - [x] Release [VPData and VPBench](https://huggingface.co/collections/TencentARC/videopainter-67cc49c6146a48a2ba93d159) for large-scale training and evaluation.
73
+ - [x] Release gradio demo
74
+ - [ ] Data preprocessing code
75
+ ## 🛠️ Method Overview
76
+
77
+ We propose VideoPainter, a novel dual-stream paradigm that incorporates an efficient context encoder (comprising only 6% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues into any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target region ID resampling technique that enables any-length video inpainting, greatly enhancing its practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench, the largest video inpainting dataset and benchmark to date with over 390K diverse clips, to facilitate segmentation-based inpainting training and assessment. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video editing pair data generation, demonstrating competitive performance and significant practical potential.
78
+ ![](assets/teaser.jpg)
79
+
80
+
81
+
82
+ ## 🚀 Getting Started
83
+
84
+ <details>
85
+ <summary><b>Environment Requirement 🌍</b></summary>
86
+
87
+
88
+ Clone the repo:
89
+
90
+ ```
91
+ git clone https://github.com/TencentARC/VideoPainter.git
92
+ ```
93
+
94
+ We recommend first creating a virtual environment with `conda` and installing the required libraries. For example:
95
+
96
+
97
+ ```
98
+ conda create -n videopainter python=3.10 -y
99
+ conda activate videopainter
100
+ pip install -r requirements.txt
101
+ ```
102
+
103
+ Then, you can install diffusers (implemented in this repo) with:
104
+
105
+ ```
106
+ cd ./diffusers
107
+ pip install -e .
108
+ ```
109
+
110
+ After that, install the required ffmpeg through:
111
+
112
+ ```
113
+ conda install -c conda-forge ffmpeg -y
114
+ ```
115
+
116
+ Optionally, you can install sam2 for the gradio demo through:
117
+
118
+ ```
119
+ cd ./app
120
+ pip install -e .
121
+ ```
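The setup steps above call several command-line tools directly. As a rough pre-flight sketch, the snippet below reports which of them are not on `PATH`. The tool list and the helper names `find_tools` and `missing_tools` are illustrative assumptions, not part of the repository.

```python
import shutil

# Command-line tools that the setup steps above invoke directly.
REQUIRED_TOOLS = ("git", "conda", "ffmpeg")


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that could not be found on PATH, in input order."""
    return [tool for tool, path in find_tools(tools).items() if path is None]
```

Calling `missing_tools()` before running the install commands gives an early hint if, for example, ffmpeg has not been installed into the active environment yet.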
122
+ </details>
123
+
124
+ <details>
125
+ <summary><b>VPBench and VPData Download ⬇️</b></summary>
126
+
127
+ You can download VPBench [here](https://huggingface.co/datasets/TencentARC/VPBench) and VPData [here](https://huggingface.co/datasets/TencentARC/VPData) (as well as the Davis data we re-processed), which are used for training and testing VideoPainter. By downloading the data, you agree to the terms and conditions of the license. The data should be organized as follows:
128
+
129
+ ```
130
+ |-- data
131
+ |-- davis
132
+ |-- JPEGImages_432_240
133
+ |-- test_masks
134
+ |-- davis_caption
135
+ |-- test.json
136
+ |-- train.json
137
+ |-- videovo/raw_video
138
+ |-- 000005000
139
+ |-- 000005000000.0.mp4
140
+ |-- 000005000001.0.mp4
141
+ |-- ...
142
+ |-- 000005001
143
+ |-- ...
144
+ |-- pexels/pexels/raw_video
145
+ |-- 000000000
146
+ |-- 000000000000_852038.mp4
147
+ |-- 000000000001_852057.mp4
148
+ |-- ...
149
+ |-- 000000001
150
+ |-- ...
151
+ |-- video_inpainting
152
+ |-- videovo
153
+ |-- 000005000000/all_masks.npz
154
+ |-- 000005000001/all_masks.npz
155
+ |-- ...
156
+ |-- pexels
157
+ |-- ...
158
+ |-- pexels_videovo_train_dataset.csv
159
+ |-- pexels_videovo_val_dataset.csv
160
+ |-- pexels_videovo_test_dataset.csv
161
+ |-- our_video_inpaint.csv
162
+ |-- our_video_inpaint_long.csv
163
+ |-- our_video_edit.csv
164
+ |-- our_video_edit_long.csv
165
+ |-- pexels.csv
166
+ |-- videovo.csv
167
+
168
+ ```
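The listing above suggests that, for the Videovo subset, a raw clip and its mask archive can both be located from the clip ID. The sketch below encodes that reading. It assumes that the shard folder is the first nine characters of the ID, that every clip file ends in `.0.mp4`, and that the nesting is as shown in the listing; `videovo_paths` is a hypothetical helper, not a function from the repository.

```python
from pathlib import PurePosixPath


def videovo_paths(video_id, data_root="data"):
    """Build the expected raw-video and mask paths for one Videovo clip.

    Assumptions (inferred from the directory listing, not confirmed):
    the shard folder is video_id[:9] and the clip file is "<id>.0.mp4".
    """
    root = PurePosixPath(data_root)
    shard = video_id[:9]
    return {
        "raw_video": root / "videovo" / "raw_video" / shard / f"{video_id}.0.mp4",
        "masks": root / "video_inpainting" / "videovo" / video_id / "all_masks.npz",
    }
```

The Pexels file names carry an extra numeric suffix (for example `000000000000_852038.mp4`), so they cannot be derived from the ID alone and are left out of this sketch.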
169
+
170
+ To download VPBench and place the benchmark in the `data` folder, run:
171
+ ```
172
+ git lfs install
173
+ git clone https://huggingface.co/datasets/TencentARC/VPBench
174
+ mv VPBench data
175
+ cd data
176
+ unzip pexels.zip
177
+ unzip videovo.zip
178
+ unzip davis.zip
179
+ unzip video_inpainting.zip
180
+ ```
181
+
182
+ To download VPData (only the mask and text annotations, due to space limits) and place the dataset in the `data` folder, run:
183
+ ```
184
+ git lfs install
185
+ git clone https://huggingface.co/datasets/TencentARC/VPData
186
+ mv VPData data  # if ./data already exists, move the contents of VPData into it instead
187
+
188
+ # 1. unzip the masks in VPData
189
+ python data_utils/unzip_folder.py --source_dir ./data/videovo_masks --target_dir ./data/video_inpainting/videovo
190
+ python data_utils/unzip_folder.py --source_dir ./data/pexels_masks --target_dir ./data/video_inpainting/pexels
191
+
192
+ # 2. unzip the raw videos in Videovo subset in VPData
193
+ python data_utils/unzip_folder.py --source_dir ./data/videovo_raw_videos --target_dir ./data/videovo/raw_video
194
+ ```
195
+
196
+ Note: *Due to space limits, you need to run the following script to download the raw videos of the Pexels subset in VPData. The format should be consistent with the VPData/VPBench layout above (after you download VPData/VPBench, the script automatically places the raw videos of VPData into the corresponding dataset directories created by VPBench).*
197
+
198
+ ```
199
+ cd data_utils
200
+ python VPData_download.py
201
+ ```
202
+
203
+ </details>
204
+
205
+ <details>
206
+ <summary><b>Checkpoints</b></summary>
207
+
208
+ Checkpoints of VideoPainter can be downloaded from [here](https://huggingface.co/TencentARC/VideoPainter). The ckpt folder contains:
209
+
210
+ - VideoPainter pretrained checkpoints for CogVideoX-5b-I2V
211
+ - VideoPainter IP Adapter pretrained checkpoints for CogVideoX-5b-I2V
212
+ - Pretrained CogVideoX-5b-I2V checkpoint from [HuggingFace](https://huggingface.co/THUDM/CogVideoX-5b-I2V).
213
+
214
+ To download the checkpoints and place them in the `ckpt` folder, run:
215
+ ```
216
+ git lfs install
217
+ git clone https://huggingface.co/TencentARC/VideoPainter
218
+ mv VideoPainter ckpt
219
+ ```
220
+
221
+ You also need to download the base model [CogVideoX-5B-I2V](https://huggingface.co/THUDM/CogVideoX-5b-I2V) by:
222
+ ```
223
+ git lfs install
224
+ cd ckpt
225
+ git clone https://huggingface.co/THUDM/CogVideoX-5b-I2V
226
+ ```
227
+
228
+ [Optional] Download [FLUX.1-Fill-dev](https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev/) for first-frame inpainting:
229
+ ```
230
+ git lfs install
231
+ cd ckpt
232
+ git clone https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev
233
+ mv FLUX.1-Fill-dev flux_inp
234
+ ```
235
+
236
+ [Optional] Download [SAM2](https://huggingface.co/facebook/sam2-hiera-large) for video segmentation in the gradio demo:
237
+ ```
238
+ git lfs install
239
+ cd ckpt
240
+ wget https://huggingface.co/facebook/sam2-hiera-large/resolve/main/sam2_hiera_large.pt
241
+ ```
242
+ You can also choose the segmentation checkpoints of other sizes to balance efficiency and performance, such as [SAM2-Tiny](https://huggingface.co/facebook/sam2-hiera-tiny).
243
+
244
+ The `ckpt` folder should be structured as follows:
245
+
246
+ ```
247
+ |-- ckpt
248
+     |-- VideoPainter/checkpoints
249
+         |-- branch
250
+             |-- config.json
251
+             |-- diffusion_pytorch_model.safetensors
252
+     |-- VideoPainterID/checkpoints
253
+         |-- pytorch_lora_weights.safetensors
254
+     |-- CogVideoX-5b-I2V
255
+         |-- scheduler
256
+         |-- transformer
257
+         |-- vae
258
+         |-- ...
259
+     |-- flux_inp
260
+         |-- scheduler
261
+         |-- transformer
262
+         |-- vae
263
+         |-- ...
264
+     |-- sam2_hiera_large.pt
265
+ ```
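Before launching training or inference, it can help to confirm that the checkpoint folder matches the listing above. This is a minimal sketch under that assumption. The entry lists are copied from the listing, and `missing_checkpoint_entries` is an illustrative helper rather than part of the repository.

```python
from pathlib import Path

# Entries taken from the ckpt listing above, relative to the ckpt folder.
EXPECTED_ENTRIES = (
    "VideoPainter/checkpoints/branch/config.json",
    "VideoPainter/checkpoints/branch/diffusion_pytorch_model.safetensors",
    "VideoPainterID/checkpoints/pytorch_lora_weights.safetensors",
    "CogVideoX-5b-I2V/scheduler",
    "CogVideoX-5b-I2V/transformer",
    "CogVideoX-5b-I2V/vae",
)

# Only needed for first-frame inpainting and the gradio demo.
OPTIONAL_ENTRIES = ("flux_inp", "sam2_hiera_large.pt")


def missing_checkpoint_entries(ckpt_root, entries=EXPECTED_ENTRIES):
    """Return the expected entries that do not exist under `ckpt_root`."""
    root = Path(ckpt_root)
    return [rel for rel in entries if not (root / rel).exists()]
```

For example, `missing_checkpoint_entries("ckpt")` returns an empty list once both downloads have completed. Passing `OPTIONAL_ENTRIES` as the second argument checks the optional models in the same way.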
266
+ </details>
267
+
268
+ ## 🏃🏼 Running Scripts
269
+
270
+ <details>
271
+ <summary><b>Training 🤯</b></summary>
272
+
273
+ You can train the VideoPainter using the script:
274
+
275
+ ```
276
+ # cd train
277
+ # bash VideoPainter.sh
278
+
279
+ export MODEL_PATH="../ckpt/CogVideoX-5b-I2V"
280
+ export CACHE_PATH="$HOME/.cache"
281
+ export DATASET_PATH="../data/videovo/raw_video"
282
+ export PROJECT_NAME="pexels_videovo-inpainting"
283
+ export RUNS_NAME="VideoPainter"
284
+ export OUTPUT_PATH="./${PROJECT_NAME}/${RUNS_NAME}"
285
+ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
286
+ export TOKENIZERS_PARALLELISM=false
287
+ export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
288
+
289
+ accelerate launch --config_file accelerate_config_machine_single_ds.yaml --machine_rank 0 \
290
+ train_cogvideox_inpainting_i2v_video.py \
291
+ --pretrained_model_name_or_path $MODEL_PATH \
292
+ --cache_dir $CACHE_PATH \
293
+ --meta_file_path ../data/pexels_videovo_train_dataset.csv \
294
+ --val_meta_file_path ../data/pexels_videovo_val_dataset.csv \
295
+ --instance_data_root $DATASET_PATH \
296
+ --dataloader_num_workers 1 \
297
+ --num_validation_videos 1 \
298
+ --validation_epochs 1 \
299
+ --seed 42 \
300
+ --mixed_precision bf16 \
301
+ --output_dir $OUTPUT_PATH \
302
+ --height 480 \
303
+ --width 720 \
304
+ --fps 8 \
305
+ --max_num_frames 49 \
306
+ --video_reshape_mode "resize" \
307
+ --skip_frames_start 0 \
308
+ --skip_frames_end 0 \
309
+ --max_text_seq_length 226 \
310
+ --branch_layer_num 2 \
311
+ --train_batch_size 1 \
312
+ --num_train_epochs 10 \
313
+ --checkpointing_steps 1024 \
314
+ --validating_steps 256 \
315
+ --gradient_accumulation_steps 1 \
316
+ --learning_rate 1e-5 \
317
+ --lr_scheduler cosine_with_restarts \
318
+ --lr_warmup_steps 1000 \
319
+ --lr_num_cycles 1 \
320
+ --enable_slicing \
321
+ --enable_tiling \
322
+ --noised_image_dropout 0.05 \
323
+ --gradient_checkpointing \
324
+ --optimizer AdamW \
325
+ --adam_beta1 0.9 \
326
+ --adam_beta2 0.95 \
327
+ --max_grad_norm 1.0 \
328
+ --allow_tf32 \
329
+ --report_to wandb \
330
+ --tracker_name $PROJECT_NAME \
331
+ --runs_name $RUNS_NAME \
332
+ --inpainting_loss_weight 1.0 \
333
+ --mix_train_ratio 0 \
334
+ --first_frame_gt \
335
+ --mask_add \
336
+ --mask_transform_prob 0.3 \
337
+ --p_brush 0.4 \
338
+ --p_rect 0.1 \
339
+ --p_ellipse 0.1 \
340
+ --p_circle 0.1 \
341
+ --p_random_brush 0.3
342
+
343
+ # cd train
344
+ # bash VideoPainterID.sh
345
+ export MODEL_PATH="../ckpt/CogVideoX-5b-I2V"
346
+ export BRANCH_MODEL_PATH="../ckpt/VideoPainter/checkpoints/branch"
347
+ export CACHE_PATH="$HOME/.cache"
348
+ export DATASET_PATH="../data/videovo/raw_video"
349
+ export PROJECT_NAME="pexels_videovo-inpainting"
350
+ export RUNS_NAME="VideoPainterID"
351
+ export OUTPUT_PATH="./${PROJECT_NAME}/${RUNS_NAME}"
352
+ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
353
+ export TOKENIZERS_PARALLELISM=false
354
+ export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
355
+
356
+ accelerate launch --config_file accelerate_config_machine_single_ds_wo_cpu.yaml --machine_rank 0 \
357
+ train_cogvideox_inpainting_i2v_video_resample.py \
358
+ --pretrained_model_name_or_path $MODEL_PATH \
359
+ --cogvideox_branch_name_or_path $BRANCH_MODEL_PATH \
360
+ --cache_dir $CACHE_PATH \
361
+ --meta_file_path ../data/pexels_videovo_train_dataset.csv \
362
+ --val_meta_file_path ../data/pexels_videovo_val_dataset.csv \
363
+ --instance_data_root $DATASET_PATH \
364
+ --dataloader_num_workers 1 \
365
+ --num_validation_videos 1 \
366
+ --validation_epochs 1 \
367
+ --seed 42 \
368
+ --rank 256 \
369
+ --lora_alpha 128 \
370
+ --mixed_precision bf16 \
371
+ --output_dir $OUTPUT_PATH \
372
+ --height 480 \
373
+ --width 720 \
374
+ --fps 8 \
375
+ --max_num_frames 49 \
376
+ --video_reshape_mode "resize" \
377
+ --skip_frames_start 0 \
378
+ --skip_frames_end 0 \
379
+ --max_text_seq_length 226 \
380
+ --branch_layer_num 2 \
381
+ --train_batch_size 1 \
382
+ --num_train_epochs 10 \
383
+ --checkpointing_steps 256 \
384
+ --validating_steps 128 \
385
+ --gradient_accumulation_steps 1 \
386
+ --learning_rate 5e-5 \
387
+ --lr_scheduler cosine_with_restarts \
388
+ --lr_warmup_steps 200 \
389
+ --lr_num_cycles 1 \
390
+ --enable_slicing \
391
+ --enable_tiling \
392
+ --noised_image_dropout 0.05 \
393
+ --gradient_checkpointing \
394
+ --optimizer AdamW \
395
+ --adam_beta1 0.9 \
396
+ --adam_beta2 0.95 \
397
+ --max_grad_norm 1.0 \
398
+ --allow_tf32 \
399
+ --report_to wandb \
400
+ --tracker_name $PROJECT_NAME \
401
+ --runs_name $RUNS_NAME \
402
+ --inpainting_loss_weight 1.0 \
403
+ --mix_train_ratio 0 \
404
+ --first_frame_gt \
405
+ --mask_add \
406
+ --mask_transform_prob 0.3 \
407
+ --p_brush 0.4 \
408
+ --p_rect 0.1 \
409
+ --p_ellipse 0.1 \
410
+ --p_circle 0.1 \
411
+ --p_random_brush 0.3 \
412
+ --id_pool_resample_learnable
413
+ ```
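Two groups of flags in the commands above can be read as simple formulas. The sketch below is one interpretation, not the training script's code. It assumes that `--lr_scheduler cosine_with_restarts` behaves like a linear warmup followed by cosine decay with hard restarts (as the scheduler of that name in diffusers does), and that the five `--p_*` flags are mutually exclusive probabilities for the synthetic mask shape. `total_steps` is a hypothetical value, since the real step count depends on dataset size.

```python
import math
import random


def lr_at_step(step, base_lr=1e-5, warmup_steps=1000, total_steps=10_000, num_cycles=1):
    """Learning rate after `step` updates: linear warmup, then cosine decay
    with hard restarts. `total_steps` is a placeholder assumption."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    cosine = 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0)))
    return base_lr * max(0.0, cosine)


# Values of --p_brush, --p_rect, --p_ellipse, --p_circle, --p_random_brush.
MASK_TYPE_PROBS = {
    "brush": 0.4,
    "rect": 0.1,
    "ellipse": 0.1,
    "circle": 0.1,
    "random_brush": 0.3,
}


def sample_mask_type(rng, probs=MASK_TYPE_PROBS):
    """Draw one mask shape name according to the configured probabilities."""
    names = list(probs)
    weights = [probs[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

With the first command's settings (`--learning_rate 1e-5`, `--lr_warmup_steps 1000`), the rate rises linearly to its peak at step 1000 and then decays. The five mask probabilities sum to 1.0, which is consistent with the mutually exclusive reading.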
414
+ </details>
415
+
416
+
417
+ <details>
418
+ <summary><b>Inference 📜</b></summary>
419
+
420
+ You can run inference for video inpainting or editing with the script:
421
+
422
+ ```
423
+ cd infer
424
+ # video inpainting
425
+ bash inpaint.sh
426
+ # video inpainting with ID resampling
427
+ bash inpaint_id_resample.sh
428
+ # video editing
429
+ bash edit.sh
430
+ ```
431
+
432
+ VideoPainter can also function as a generator of video editing pair data; run inference with the script:
433
+ ```
434
+ bash edit_bench.sh
435
+ ```
436
+
437
+ Since VideoPainter is trained on public Internet videos, it primarily performs well on general scenarios. For high-quality industrial applications (e.g., product exhibitions, virtual try-on), we recommend training the model on your domain-specific data. We welcome and appreciate any contributions of trained models from the community!
438
+ </details>
439
+
440
+ <details>
441
+ <summary><b>Gradio Demo 🖌️</b></summary>
442
+
443
+ You can also run inference through the gradio demo:
444
+
445
+ ```
446
+ # cd app
447
+ CUDA_VISIBLE_DEVICES=0 python app.py \
448
+ --model_path ../ckpt/CogVideoX-5b-I2V \
449
+ --inpainting_branch ../ckpt/VideoPainter/checkpoints/branch \
450
+ --id_adapter ../ckpt/VideoPainterID/checkpoints \
451
+ --img_inpainting_model ../ckpt/flux_inp
452
+ ```
453
+ </details>
454
+
455
+
456
+ <details>
457
+ <summary><b>Evaluation 📏</b></summary>
458
+
459
+ You can evaluate using the script:
460
+
461
+ ```
462
+ cd evaluate
463
+ # video inpainting
464
+ bash eval_inpainting.sh
465
+ # video inpainting with ID resampling
466
+ bash eval_inpainting_id_resample.sh
467
+ # video editing
468
+ bash eval_edit.sh
469
+ # video editing with ID resampling
470
+ bash eval_editing_id_resample.sh
471
+ ```
472
+ </details>
473
+
474
+ ## 🤝🏼 Cite Us
475
+
476
+ ```
477
+ @article{bian2025videopainter,
478
+ title={VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control},
479
+ author={Bian, Yuxuan and Zhang, Zhaoyang and Ju, Xuan and Cao, Mingdeng and Xie, Liangbin and Shan, Ying and Xu, Qiang},
480
+ journal={arXiv preprint arXiv:2503.05639},
481
+ year={2025}
482
+ }
483
+ ```
484
+
485
+
486
+ ## 💖 Acknowledgement
487
+ <span id="acknowledgement"></span>
488
+
489
+ Our code is built on [diffusers](https://github.com/huggingface/diffusers) and [CogVideoX](https://github.com/THUDM/CogVideo). Thanks to all the contributors!
VideoPainter/checkpoints/branch/config.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "_class_name": "CogvideoXBranchModel",
3
+ "_diffusers_version": "0.31.0.dev0",
4
+ "_name_or_path": "/group/40005/yuxuanbian/hf_models/CogVideoX-5b-I2V",
5
+ "activation_fn": "gelu-approximate",
6
+ "attention_bias": true,
7
+ "attention_head_dim": 64,
8
+ "dropout": 0.0,
9
+ "flip_sin_to_cos": true,
10
+ "freq_shift": 0,
11
+ "id_pool_resample_learnable": false,
12
+ "in_channels": 32,
13
+ "max_text_seq_length": 226,
14
+ "norm_elementwise_affine": true,
15
+ "norm_eps": 1e-05,
16
+ "num_attention_heads": 48,
17
+ "num_layers": 2,
18
+ "out_channels": 16,
19
+ "patch_size": 2,
20
+ "sample_frames": 49,
21
+ "sample_height": 60,
22
+ "sample_width": 90,
23
+ "spatial_interpolation_scale": 1.875,
24
+ "temporal_compression_ratio": 4,
25
+ "temporal_interpolation_scale": 1.0,
26
+ "text_embed_dim": 4096,
27
+ "time_embed_dim": 512,
28
+ "timestep_activation_fn": "silu",
29
+ "use_learned_positional_embeddings": true,
30
+ "use_rotary_positional_embeddings": true,
31
+ "wo_text": false
32
+ }
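Several quantities follow from the values in this config. The sketch below derives them under stated assumptions: the hidden width is the number of heads times the head dimension, the latent frame count is `(sample_frames - 1) // temporal_compression_ratio + 1`, patches are spatial only, and the VAE downsamples height and width by 8. The factor of 8 is not in the config and is an assumption, chosen because it reproduces the `--height 480 --width 720` training flags in the README.

```python
# Values copied from the config.json hunk above.
CONFIG = {
    "attention_head_dim": 64,
    "num_attention_heads": 48,
    "patch_size": 2,
    "sample_frames": 49,
    "sample_height": 60,
    "sample_width": 90,
    "temporal_compression_ratio": 4,
}

VAE_SPATIAL_FACTOR = 8  # assumption: not stored in this config


def derived_shapes(cfg=CONFIG, vae_spatial_factor=VAE_SPATIAL_FACTOR):
    """Derive hidden width, latent frame count, token count, and pixel size."""
    hidden_size = cfg["num_attention_heads"] * cfg["attention_head_dim"]
    latent_frames = (cfg["sample_frames"] - 1) // cfg["temporal_compression_ratio"] + 1
    tokens_per_frame = (cfg["sample_height"] // cfg["patch_size"]) * (
        cfg["sample_width"] // cfg["patch_size"]
    )
    return {
        "hidden_size": hidden_size,
        "latent_frames": latent_frames,
        "tokens_per_frame": tokens_per_frame,
        "video_tokens": latent_frames * tokens_per_frame,
        "pixel_height": cfg["sample_height"] * vae_spatial_factor,
        "pixel_width": cfg["sample_width"] * vae_spatial_factor,
    }
```

Under these assumptions a 49-frame clip becomes 13 latent frames of 30 x 45 patches each, and the branch operates at a hidden width of 3072.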
VideoPainter/checkpoints/branch/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5d01728cb0cb605b591f41cbea033db22d5ae72d0b37565957feae71b089be8e
3
+ size 712360464
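The three added lines are a Git LFS pointer: the repository stores this small text file, and the real weights are fetched by their SHA-256 digest. The sketch below parses such a pointer and checks a downloaded file against it. `parse_lfs_pointer` and `stream_matches_pointer` are illustrative names and only the `sha256` algorithm is handled, so treat this as a starting point rather than a complete client.

```python
import hashlib
import io

# Pointer content copied from the hunk above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:5d01728cb0cb605b591f41cbea033db22d5ae72d0b37565957feae71b089be8e
size 712360464
"""


def parse_lfs_pointer(text):
    """Split a pointer file into version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def stream_matches_pointer(stream, pointer, chunk_size=1 << 20):
    """Hash a binary stream in chunks and compare size and digest."""
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported hash algorithm: {pointer['algo']}")
    hasher = hashlib.sha256()
    size = 0
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
        size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)

# Tiny in-memory demonstration with made-up bytes.
demo = b"demo bytes"
demo_pointer = {"algo": "sha256", "digest": hashlib.sha256(demo).hexdigest(), "size": len(demo)}
demo_ok = stream_matches_pointer(io.BytesIO(demo), demo_pointer)
```

For the real checkpoint, open the downloaded file in binary mode and pass the file object as `stream`. Reading in chunks avoids loading the roughly 712 MB file into memory at once.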
VideoPainterID/checkpoints/pytorch_lora_weights.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8651aac99c672b7e9ca497e62449d223640432c7dd26a85e6697c0efd87dad1f
3
+ size 528527480
assets/method.jpg ADDED

Git LFS Details

  • SHA256: 3f52eb3838b2447353603a76be5881389848f471f23a5e4f4d8d346be86bfbbc
  • Pointer size: 131 Bytes
  • Size of remote file: 995 kB
assets/teaser.jpg ADDED

Git LFS Details

  • SHA256: 9374cf8c7765411bc1b7dd00b3ec3fe8dbdcd50fe3f14eadfbe6b3c33029707d
  • Pointer size: 132 Bytes
  • Size of remote file: 2.72 MB
config.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "_class_name": "VideoPainter",
3
+ "_diffusers_version": "0.31.0.dev0",
4
+ "_name_or_path": "hf_models/CogVideoX-5b-I2V",
5
+ "activation_fn": "gelu-approximate",
6
+ "attention_bias": true,
7
+ "attention_head_dim": 64,
8
+ "dropout": 0.0,
9
+ "flip_sin_to_cos": true,
10
+ "freq_shift": 0,
11
+ "id_pool_resample_learnable": false,
12
+ "in_channels": 32,
13
+ "max_text_seq_length": 226,
14
+ "norm_elementwise_affine": true,
15
+ "norm_eps": 1e-05,
16
+ "num_attention_heads": 48,
17
+ "num_layers": 2,
18
+ "out_channels": 16,
19
+ "patch_size": 2,
20
+ "sample_frames": 49,
21
+ "sample_height": 60,
22
+ "sample_width": 90,
23
+ "spatial_interpolation_scale": 1.875,
24
+ "temporal_compression_ratio": 4,
25
+ "temporal_interpolation_scale": 1.0,
26
+ "text_embed_dim": 4096,
27
+ "time_embed_dim": 512,
28
+ "timestep_activation_fn": "silu",
29
+ "use_learned_positional_embeddings": true,
30
+ "use_rotary_positional_embeddings": true,
31
+ "wo_text": false
32
+ }
i3d_rgb_imagenet.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2609088c2e8c868187c9921c50bc225329a9057ed75e76120e0b4a397a2c7538
3
+ size 50883138