Improve model card metadata and documentation

#1 opened by nielsr (HF Staff)

Files changed (1)
  1. README.md +37 -22
README.md CHANGED
@@ -1,49 +1,64 @@
  ---
- license: apache-2.0
  datasets:
  - KlingTeam/MultiCamVideo-Dataset
  - nkp37/OpenVid-1M
- base_model:
- - Wan-AI/Wan2.1-T2V-14B
- pipeline_tag: video-to-video
  ---
 
- # Vista4D: Video Reshooting with 4D Point Clouds (CVPR 2026 Highlight) – Model Checkpoints

  [![Project Page](https://img.shields.io/badge/Project-Page-yellow?logo=data:image/svg%2Bxml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCIgZmlsbD0ibm9uZSIgc3Ryb2tlPSJ5ZWxsb3ciIHN0cm9rZS13aWR0aD0iMiIgc3Ryb2tlLWxpbmVjYXA9InJvdW5kIiBzdHJva2UtbGluZWpvaW49InJvdW5kIj48Y2lyY2xlIGN4PSIxMiIgY3k9IjEyIiByPSIxMCIvPjxsaW5lIHgxPSIyIiB5MT0iMTIiIHgyPSIyMiIgeTI9IjEyIi8+PHBhdGggZD0iTTEyIDJhMTUuMyAxNS4zIDAgMCAxIDQgMTAgMTUuMyAxNS4zIDAgMCAxLTQgMTAgMTUuMyAxNS4zIDAgMCAxLTQtMTAgMTUuMyAxNS4zIDAgMCAxIDQtMTB6Ii8+PC9zdmc+)](https://eyeline-labs.github.io/Vista4D)
- [![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2604.21915)
  [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Vista4D-blue)](https://huggingface.co/Eyeline-Labs/Vista4D)
- [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Eval%20Data-blue)](https://huggingface.co/datasets/Eyeline-Labs/Vista4D-Eval-Data)
-
- [Kuan Heng Lin](https://kuanhenglin.github.io)<sup>1,3&lowast;</sup>, [Zhizheng Liu](https://bosmallear.github.io)<sup>1,4&lowast;</sup>, [Pablo Salamanca](https://pablosalaman.ca)<sup>1,2</sup>, [Yash Kant](https://yashkant.github.io)<sup>1,2</sup>, [Ryan Burgert](https://ryanndagreat.github.io)<sup>1,2,5&lowast;</sup>, [Yuancheng Xu](https://yuancheng-xu.github.io)<sup>1,2</sup>, [Koichi Namekata](https://kmcode1.github.io)<sup>1,2,6&lowast;</sup>, [Yiwei Zhao](https://zhaoyw007.github.io)<sup>2</sup>, [Bolei Zhou](https://boleizhou.github.io)<sup>4</sup>, [Micah Goldblum](https://goldblum.github.io)<sup>3</sup>, [Paul Debevec](https://www.pauldebevec.com)<sup>1,2</sup>, [Ning Yu](https://ningyu1991.github.io)<sup>1,2</sup> <br/>
- <sup>1</sup>Eyeline Labs, <sup>2</sup>Netflix, <sup>3</sup>Columbia University, <sup>4</sup>UCLA, <sup>5</sup>Stony Brook University, <sup>6</sup>University of Oxford<br>
 
- <sup>&lowast;</sup>*Work done during an internship at Eyeline Labs*
 
  <div align="center">
  <video controls autoplay muted style="width: 100%;" src="https://media.githubusercontent.com/media/Eyeline-Labs/Vista4D/website/media/vista4d.mp4"></video>
  </div>
 
- **Vista4D** is a *video reshooting* framework that synthesizes the dynamic scene represented by an input source video from novel camera trajectories and viewpoints. We bridge the distribution shift between training and inference for point-cloud-grounded video reshooting: by training on noisy, reconstructed multiview videos, Vista4D is robust to point cloud artifacts from imprecise 4D reconstruction of real-world videos. Our 4D point cloud with temporally-persistent static points also explicitly preserves scene content and improves camera control. Vista4D generalizes to real-world applications such as dynamic scene expansion (using casual video capture of a scene as a background reference), 4D scene recomposition (point cloud editing), and long video inference with memory.
 
- This is the Hugging Face repository containing our model weights. We provide two Vista4D checkpoints finetuned on [`Wan-AI/Wan2.1-T2V-14B`](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B):
 
  | Checkpoint | Base model | Training resolution | Training steps | Notes |
  |---|---|---|---|---|
  | `384p49_step=30000` | [`Wan2.1-T2V-14B`](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) | 672 &times; 384, 49 frames | 30000 | N/A |
  | `720p49_step=3000` | [`Wan2.1-T2V-14B`](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) | 1280 &times; 720, 49 frames | 3000 | Finetuned from `384p49_step=30000` |
 
- To run Vista4D inference, first download the Wan 2.1 and Vista4D checkpoints to `./checkpoints/`. The Vista4D checkpoints are hosted on [Eyeline-Labs/Vista4D](https://huggingface.co/Eyeline-Labs/Vista4D). Download both the `384p` and `720p` checkpoints into `./checkpoints/vista4d/` with
- ```bash
- hf download Eyeline-Labs/Vista4D --local-dir ./checkpoints/vista4d
- ```
- If you only need one resolution, pass `--include` to grab just that variant with
  ```bash
- hf download Eyeline-Labs/Vista4D --local-dir ./checkpoints/vista4d --include "384p49_step=30000/*"  # or "720p49_step=3000/*"
  ```
- You'll also need the `Wan2.1-T2V-14B` base model. Download it from [Wan-AI/Wan2.1-T2V-14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) into `./checkpoints/wan/Wan2.1-T2V-14B/` with
  ```bash
- hf download Wan-AI/Wan2.1-T2V-14B --local-dir ./checkpoints/wan/Wan2.1-T2V-14B
  ```
 
- Instructions on how to use these weights, more results, and the paper can be found on our [project page](https://eyeline-labs.github.io/Vista4D/) and [GitHub repository](https://github.com/Eyeline-Labs/Vista4D/tree/main).
  ---
+ base_model:
+ - Wan-AI/Wan2.1-T2V-14B
  datasets:
  - KlingTeam/MultiCamVideo-Dataset
  - nkp37/OpenVid-1M
+ license: apache-2.0
+ pipeline_tag: video-to-video
  ---
 
+ # Vista4D: Video Reshooting with 4D Point Clouds (CVPR 2026 Highlight)

  [![Project Page](https://img.shields.io/badge/Project-Page-yellow?logo=data:image/svg%2Bxml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCIgZmlsbD0ibm9uZSIgc3Ryb2tlPSJ5ZWxsb3ciIHN0cm9rZS13aWR0aD0iMiIgc3Ryb2tlLWxpbmVjYXA9InJvdW5kIiBzdHJva2UtbGluZWpvaW49InJvdW5kIj48Y2lyY2xlIGN4PSIxMiIgY3k9IjEyIiByPSIxMCIvPjxsaW5lIHgxPSIyIiB5MT0iMTIiIHgyPSIyMiIgeTI9IjEyIi8+PHBhdGggZD0iTTEyIDJhMTUuMyAxNS4zIDAgMCAxIDQgMTAgMTUuMyAxNS4zIDAgMCAxLTQgMTAgMTUuMyAxNS4zIDAgMCAxLTQtMTAgMTUuMyAxNS4zIDAgMCAxIDQtMTB6Ii8+PC9zdmc+)](https://eyeline-labs.github.io/Vista4D)
+ [![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=red)](https://huggingface.co/papers/2604.21915)
+ [![GitHub](https://img.shields.io/badge/GitHub-Repo-black?logo=github)](https://github.com/Eyeline-Labs/Vista4D)
  [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Vista4D-blue)](https://huggingface.co/Eyeline-Labs/Vista4D)
 
+ **Vista4D** is a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud: given a source video, it re-synthesizes the scene, with the same dynamics, from a different camera trajectory and viewpoint.
 
  <div align="center">
  <video controls autoplay muted style="width: 100%;" src="https://media.githubusercontent.com/media/Eyeline-Labs/Vista4D/website/media/vista4d.mp4"></video>
  </div>
 
+ ## Model Checkpoints

+ This repository provides two Vista4D checkpoints finetuned on [`Wan-AI/Wan2.1-T2V-14B`](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B):

  | Checkpoint | Base model | Training resolution | Training steps | Notes |
  |---|---|---|---|---|
  | `384p49_step=30000` | [`Wan2.1-T2V-14B`](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) | 672 &times; 384, 49 frames | 30000 | N/A |
  | `720p49_step=3000` | [`Wan2.1-T2V-14B`](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) | 1280 &times; 720, 49 frames | 3000 | Finetuned from `384p49_step=30000` |
 
+ ## Usage
+
+ To perform Vista4D inference, you need to download both the Wan 2.1 base model and the Vista4D checkpoints.
+
+ ### Download Weights
+
  ```bash
+ # Download Vista4D checkpoints
+ huggingface-cli download Eyeline-Labs/Vista4D --local-dir ./checkpoints/vista4d
+
+ # Download the Wan2.1-T2V-14B base model
+ huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./checkpoints/wan/Wan2.1-T2V-14B
  ```
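+
+ If you only need one of the two resolutions, pass `--include` to download just that variant. A minimal sketch, shown for the 720p checkpoint (use `"384p49_step=30000/*"` for the 384p one):
+
+ ```bash
+ # Fetch only the 720p Vista4D checkpoint; the pattern matches the
+ # checkpoint folder names listed in the table above.
+ huggingface-cli download Eyeline-Labs/Vista4D --local-dir ./checkpoints/vista4d --include "720p49_step=3000/*"
+ ```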
+
+ ### Inference
+
+ After environment setup and preprocessing as described in the [official repository](https://github.com/Eyeline-Labs/Vista4D), run inference with:
+
  ```bash
+ EXAMPLE=couple-newspaper RESOLUTION=720p bash scripts/inference/example_inference_single.sh
  ```
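+
+ The `EXAMPLE` and `RESOLUTION` environment variables select the input example and the checkpoint. A small sketch, assuming the script accepts both released resolutions for the same example:
+
+ ```bash
+ # Render the same example with both the 384p and 720p checkpoints.
+ for RESOLUTION in 384p 720p; do
+     EXAMPLE=couple-newspaper RESOLUTION=$RESOLUTION bash scripts/inference/example_inference_single.sh
+ done
+ ```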
 
+ ## Citation
+
+ ```bibtex
+ @inproceedings{lin2026vista4d,
+   author    = {Lin, {Kuan Heng} and Liu, Zhizheng and Salamanca, Pablo and Kant, Yash and Burgert, Ryan and Xu, Yuancheng and Namekata, Koichi and Zhao, Yiwei and Zhou, Bolei and Goldblum, Micah and Debevec, Paul and Yu, Ning},
+   booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+   title     = {{Vista4D}: Video Reshooting with 4D Point Clouds},
+   year      = {2026}
+ }
+ ```