---
license: apache-2.0
language:
- en
tags:
- video-generation
- video-editing
- novel-view-synthesis
- camera-control
- diffusion-transformer
- wan2.2
- lora
- video-reshooting
- 4d-reconstruction
pipeline_tag: video-to-video
base_model: Wan-AI/Wan2.2-I2V-A14B
---

# Reshoot-Anything

[![GitHub](https://img.shields.io/badge/GitHub-video--to--video-black?logo=github)](https://github.com/morphicfilms/video-to-video)
[![arXiv](https://img.shields.io/badge/arXiv-2604.21776-red?logo=arxiv)](https://arxiv.org/abs/2604.21776)
[![Website](https://img.shields.io/badge/Website-morphic.com-purple)](https://adithyaiyer1999.github.io/reshoot-anything/)

**Reshoot-Anything** is a self-supervised video reshooting model built on top of [Wan2.2-I2V-A14B](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B). Given a source video and a target camera trajectory (encoded as an anchor video), it generates a high-fidelity reshoot that faithfully follows the new camera path while preserving the original content, complex dynamics, and temporal consistency. The model is trained entirely on in-the-wild monocular videos.

> **Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting**
> Avinash Paliwal, Adithya Iyer, Shivin Yadav, Muhammad Ali Afridi, Midhun Harikumar
> Morphic Inc. · [arXiv:2604.21776](https://arxiv.org/abs/2604.21776)

---

<table>
  <tr>
    <td align="center"><b>Source Video</b></td>
    <td align="center"><b>Reshot Video</b></td>
  </tr>
  <tr>
    <td><img src="assets/woman_og.gif" width="400" alt="Source video"></td>
    <td><img src="assets/woman_01.gif" width="400" alt="Reshot video"></td>
  </tr>
</table>

## Model Files

This repository contains two LoRA checkpoints (rank-512, applied to the attention and feed-forward layers of Wan2.2-I2V-A14B):

| File | Role | Notes |
|------|------|-------|
| `jan06_scaling_80k_ckpt1400.safetensors` | **High-noise expert** | Controls the early denoising steps. Primarily responsible for camera-motion alignment and global scene structure. Trained on ~80k clips with scaling augmentations and a 15% synthetic data mixture. |
| `dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors` | **Low-noise expert** | Controls the late denoising steps. Responsible for texture fidelity and fine detail. Uses standard black-background anchors and no source reconstruction loss. |

Each file is ~9.82 GB.
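
To sanity-check the downloads, the checkpoints can be inspected with `safetensors` before loading them (a minimal sketch; the exact tensor key names depend on the training framework, and the paths assume the layout used in the Quickstart below):

```python
# Minimal sketch: list a few LoRA tensors from each checkpoint and confirm the files load.
# Paths assume the Quickstart download layout; key names are framework-dependent.
from safetensors import safe_open

checkpoints = [
    "./reshoot-anything-weights/jan06_scaling_80k_ckpt1400.safetensors",
    "./reshoot-anything-weights/dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors",
]

for path in checkpoints:
    with safe_open(path, framework="pt", device="cpu") as f:
        keys = list(f.keys())
        print(f"{path}: {len(keys)} tensors")
        for key in keys[:4]:                      # peek at a few tensor names and shapes
            print("  ", key, f.get_slice(key).get_shape())
```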

---

## Quickstart

### 1. Clone the repository

```bash
git clone https://github.com/morphicfilms/video-to-video.git
cd video-to-video
```

Follow the [Wan2.2 installation guide](https://github.com/Wan-Video/Wan2.2) to set up the environment, or run:

```bash
bash setup_env.sh
```

### 2. Download the weights

Download the Wan2.2 I2V base weights:

```bash
huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir ./Wan2.2-I2V-A14B
```

Download the Reshoot-Anything LoRA weights:

```bash
huggingface-cli download morphic/reshoot-anything --local-dir ./reshoot-anything-weights
```
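
Alternatively, the same downloads can be done from Python with `huggingface_hub`:

```python
# Python equivalent of the two CLI downloads above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Wan-AI/Wan2.2-I2V-A14B", local_dir="./Wan2.2-I2V-A14B")
snapshot_download(repo_id="morphic/reshoot-anything", local_dir="./reshoot-anything-weights")
```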

### 3. Prepare your anchor video

At inference time, generate an anchor video by converting your source video to a 4D point cloud, applying the target camera trajectory, and forward-warping the points to produce the geometric anchor. See the repo's `anchor_generation/` scripts for details, or the sketch below.
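
For intuition, the forward-warping step can be sketched for a single frame. This is not the repo's implementation: the depth map, intrinsics `K`, and relative pose `(R, t)` below are hypothetical inputs, there is no z-buffering or hole handling, and the actual pipeline operates on a full 4D point cloud.

```python
# Illustrative single-frame forward warp (NOT the repo's anchor_generation pipeline).
# Assumed inputs: frame (H, W, 3), depth (H, W), intrinsics K (3, 3), and relative
# source-to-target pose R (3, 3), t (3,). Unfilled pixels stay black, loosely matching
# the black-background anchors described in the Model Files table.
import numpy as np

def forward_warp(frame, depth, K, R, t):
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T.astype(np.float64)

    # Unproject pixels into 3D points in the source camera frame.
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)

    # Move the points into the target camera frame and project them back to pixels.
    proj = K @ (R @ pts + t.reshape(3, 1))
    u, v = proj[0] / proj[2], proj[1] / proj[2]

    # Splat source colors into the target view (nearest pixel, no z-buffer).
    warped = np.zeros_like(frame)
    valid = (proj[2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    warped[v[valid].astype(int), u[valid].astype(int)] = frame.reshape(-1, 3)[valid]
    return warped
```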

### 4. Run reshooting

```bash
torchrun --nproc_per_node=8 generate.py \
    --task v2v-A14B \
    --size 1280*720 \
    --frame_num 81 \
    --ckpt_dir ./Wan2.2-I2V-A14B \
    --high_noise_lora_path ./reshoot-anything-weights/jan06_scaling_80k_ckpt1400.safetensors \
    --low_noise_lora_path ./reshoot-anything-weights/dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors \
    --source_video examples/source.mp4 \
    --anchor_video examples/anchor.mp4 \
    --dit_fsdp \
    --t5_fsdp \
    --ulysses_size 8
```

> **Note:** Refer to the [GitHub README](https://github.com/morphicfilms/video-to-video) for the authoritative argument names and single-GPU usage.

---

## How It Works

Reshoot-Anything adapts the **Wan2.2-14B Mixture-of-Experts (MoE)** DiT with two key architectural changes:

**Dual-stream token conditioning.** Both the anchor video `V_a` (geometric guide) and the source video `V_s` (texture reference) are VAE-encoded and temporally concatenated as tokens into the model's main self-attention sequence. This outperforms cross-attention for view synchronization because it lets the model directly route textures across spatial and temporal positions.

**Offset RoPE.** A fixed temporal offset of 50 is added to the positional embeddings of source video tokens, strictly decoupling the source context from the active denoising trajectory.

The model thereby learns **implicit 4D spatiotemporal routing**: when a target frame requires content that is occluded in the corresponding source frame, the model locates and re-projects the missing texture from a different timestep of the source video. A schematic sketch of the token layout follows.
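
Everything in the sketch below is schematic: the shapes, the concatenation order, and the assumption that anchor tokens share the target's temporal index range are illustrative, not taken from the released code. It only shows the two ideas above, concatenating anchor and source latents into one self-attention token sequence and shifting the temporal index of source tokens by the fixed offset of 50 before RoPE.

```python
# Schematic token layout for dual-stream conditioning + Offset RoPE.
# Shapes, concatenation order, and anchor/target index sharing are illustrative assumptions.
import torch

T, H, W, C = 20, 30, 52, 16        # latent frames and latent spatial grid (hypothetical)
OFFSET = 50                        # fixed temporal offset applied to source tokens

noisy_target = torch.randn(T, H, W, C)   # latents being denoised
anchor       = torch.randn(T, H, W, C)   # VAE-encoded anchor video V_a (geometric guide)
source       = torch.randn(T, H, W, C)   # VAE-encoded source video V_s (texture reference)

# Temporal concatenation: one long sequence processed by the DiT's self-attention.
tokens = torch.cat([noisy_target, anchor, source], dim=0).reshape(-1, C)

# Temporal indices fed to RoPE. Source tokens are shifted by OFFSET so the
# reference stream is decoupled from the active denoising trajectory.
frame_ids = torch.cat([
    torch.arange(T),            # noisy target
    torch.arange(T),            # anchor follows the target timeline (assumption)
    torch.arange(T) + OFFSET,   # source, offset by 50
]).repeat_interleave(H * W)

assert tokens.shape[0] == frame_ids.numel()
```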

### Self-Supervised Training Pipeline

Training requires no paired multi-view data. From a single monocular video:
1. Two independent smooth random-walk crop trajectories are sampled → source `V_s` and target `V_t` (see the sketch after this list)
2. `V_s[0]` is forward-warped via [AllTracker](https://arxiv.org/abs/2504.11111) dense flow plus the crop offset → anchor `V_a`
3. The triplet `(V_s, V_a, V_t)` forms the training signal
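
A minimal sketch of step 1, sampling a smooth random-walk crop trajectory from a monocular clip. The window size, step scale, and smoothing are illustrative rather than the training values, and steps 2 and 3 are handled by the repo's training pipeline.

```python
# Sketch of step 1: smooth random-walk crop trajectories from one monocular clip.
# All parameters are illustrative, not the training values.
import numpy as np

def random_walk_crops(frames, crop_h, crop_w, step=4.0, smooth=9, seed=None):
    """frames: (N, H, W, 3) array holding the frames of a monocular clip."""
    rng = np.random.default_rng(seed)
    n, h, w, _ = frames.shape

    # Brownian-like 2D path, then a moving average to keep the motion smooth.
    path = np.cumsum(rng.normal(0.0, step, size=(n, 2)), axis=0)
    kernel = np.ones(smooth) / smooth
    path = np.stack([np.convolve(path[:, i], kernel, mode="same") for i in range(2)], axis=1)

    # Center the path and clamp it so every crop window stays inside the frame.
    path -= path.mean(axis=0)
    ys = np.clip(path[:, 0] + (h - crop_h) / 2, 0, h - crop_h).astype(int)
    xs = np.clip(path[:, 1] + (w - crop_w) / 2, 0, w - crop_w).astype(int)
    return np.stack([frames[i, y:y + crop_h, x:x + crop_w]
                     for i, (y, x) in enumerate(zip(ys, xs))])

# Two independent trajectories over the same clip yield source V_s and target V_t, e.g.
# V_s = random_walk_crops(video, 480, 832, seed=0); V_t = random_walk_crops(video, 480, 832, seed=1)
```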

A **hybrid dataset strategy** augments the monocular pipeline with a 15% mixture of paired synthetic data from [ReCamMaster](https://github.com/KwaiVGI/ReCamMaster), enabling generalization to extreme (120°+) orbital camera trajectories.

---

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | Wan2.2-I2V-A14B (14B MoE) |
| LoRA rank | 512 (attention + FFN) |
| Training steps | 2,000 per expert |
| Batch size | 24 |
| Learning rate | 1e-5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Loss | MSE + 0.1 × L1 source reconstruction |
| Latent frames | 20 |
| Primary data | ~100k clips from 30k monocular videos |
| Synthetic mixture | 15% ReCamMaster paired clips |

**Key augmentations:**
- 3D-aware noise injection into the anchor reference frame (magnitude uniform in [0, 0.5])
- fluorescent pink masked-region backgrounds
- random anchor reference frame selection
- source token auxiliary reconstruction loss
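
Read together, the loss row above and the auxiliary source-token term combine roughly as in this sketch (tensor names are hypothetical; the MSE target is whatever the base Wan2.2 diffusion objective predicts):

```python
# Hedged sketch of the objective: denoising MSE plus a 0.1-weighted L1 reconstruction
# term on the source tokens. Tensor names are hypothetical.
import torch.nn.functional as F

def reshoot_loss(pred, target, pred_source_tokens, source_tokens, aux_weight=0.1):
    denoising_mse = F.mse_loss(pred, target)                         # main diffusion loss
    source_recon_l1 = F.l1_loss(pred_source_tokens, source_tokens)   # auxiliary source reconstruction
    return denoising_mse + aux_weight * source_recon_l1
```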

---

## Evaluation

Results on 100 five-second videos from [OpenSora-MixKit](https://huggingface.co/datasets/opensora/OpenSora-MixKit) (16 fps, 480p):

| Method | CLIP-F ↑ | RotErr ↓ | TransErr ↓ | Mat. Pix ↑ | FVD-V ↓ | CLIP-V ↑ |
|--------|----------|----------|------------|------------|---------|----------|
| ReCamMaster | 98.49 | 11.29 | 19.59 | 1314.00 | 732.52 | 88.91 |
| EX-4D | 98.94 | 3.94 | 4.21 | 2188.98 | 685.63 | 89.77 |
| TrajectoryCrafter (49f) | 98.80 | 2.26 | 3.03 | 1851.80 | 582.56 | 92.40 |
| **Ours** | **99.03** | 2.76 | 4.23 | **2720.83** | **586.24** | **93.16** |
| **Ours (49f)** | **99.01** | **2.61** | **2.73** | **2737.65** | **488.22** | **94.96** |

---

## Citation

```bibtex
@article{paliwal2026reshootanything,
  title={Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting},
  author={Paliwal, Avinash and Iyer, Adithya and Yadav, Shivin and Afridi, Muhammad Ali and Harikumar, Midhun},
  journal={arXiv preprint arXiv:2604.21776},
  year={2026}
}
```

---

## License

Model weights are released under the **Apache 2.0** license, consistent with the Wan2.2 base model.