---
license: apache-2.0
language:
- en
tags:
- video-generation
- video-editing
- novel-view-synthesis
- camera-control
- diffusion-transformer
- wan2.2
- lora
- video-reshooting
- 4d-reconstruction
pipeline_tag: video-to-video
base_model: Wan-AI/Wan2.2-I2V-A14B
---

# Reshoot-Anything

[**Code**](https://github.com/morphicfilms/video-to-video) · [**Paper**](https://arxiv.org/abs/2604.21776) · [**Project Page**](https://adithyaiyer1999.github.io/reshoot-anything/)

**Reshoot-Anything** is a self-supervised video reshooting model built on top of [Wan2.2-I2V-A14B](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B). Given a source video and a target camera trajectory (encoded as an anchor video), it generates a high-fidelity reshoot that faithfully follows the new camera path while preserving the original content, complex dynamics, and temporal consistency. The model is trained entirely on in-the-wild monocular videos.

> **Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting**
> Avinash Paliwal, Adithya Iyer, Shivin Yadav, Muhammad Ali Afridi, Midhun Harikumar
> Morphic Inc. · [arXiv:2604.21776](https://arxiv.org/abs/2604.21776)

---

<table>
<tr>
<td align="center"><b>Source Video</b></td>
<td align="center"><b>Reshot Video</b></td>
</tr>
<tr>
<td><img src="assets/woman_og.gif" width="400" alt="Source video"></td>
<td><img src="assets/woman_01.gif" width="400" alt="Reshot video"></td>
</tr>
</table>

## Model Files

This repository contains two LoRA checkpoints (rank 512, applied to the attention and feed-forward layers of Wan2.2-I2V-A14B):

| File | Role | Notes |
|------|------|-------|
| `jan06_scaling_80k_ckpt1400.safetensors` | **High-noise expert** | Controls the early denoising steps. Primarily responsible for camera-motion alignment and global scene structure. Trained on ~80k clips with scaling augmentations and a 15% synthetic data mixture. |
| `dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors` | **Low-noise expert** | Controls the late denoising steps. Responsible for texture fidelity and fine detail. Uses standard black-background anchors and no source reconstruction loss. |

Each checkpoint is ~9.82 GB.
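
To sanity-check the downloads before inference, the checkpoints can be opened with the `safetensors` library. A minimal sketch (paths assume the Quickstart layout below; the exact tensor names depend on the training code and are not guaranteed):

```python
# Minimal sketch: list tensors in each LoRA checkpoint and confirm the rank.
# Paths assume the Quickstart download locations; key names are illustrative.
from safetensors import safe_open

for path in [
    "./reshoot-anything-weights/jan06_scaling_80k_ckpt1400.safetensors",
    "./reshoot-anything-weights/dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors",
]:
    with safe_open(path, framework="pt") as f:
        keys = sorted(f.keys())
        print(f"{path}: {len(keys)} tensors")
        t = f.get_tensor(keys[0])
        print(f"  {keys[0]}: shape={tuple(t.shape)}, dtype={t.dtype}")  # expect a 512 dim
```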

---

## Quickstart

### 1. Clone the repository

```bash
git clone https://github.com/morphicfilms/video-to-video.git
cd video-to-video
```

Follow the [Wan2.2 installation guide](https://github.com/Wan-Video/Wan2.2) to set up the environment, or run:

```bash
bash setup_env.sh
```

### 2. Download the weights

Download the Wan2.2 I2V base weights:

```bash
huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir ./Wan2.2-I2V-A14B
```

Download the Reshoot-Anything LoRA weights:

```bash
huggingface-cli download morphic/reshoot-anything --local-dir ./reshoot-anything-weights
```

### 3. Prepare your anchor video

At inference time, generate an anchor video by converting your source video to a 4D point cloud, applying the target camera trajectory, and forward-warping the points into the new views. See the repo's `anchor_generation/` scripts for details; the sketch below outlines the flow.
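
A schematic of that flow, where every helper (`estimate_depth`, `unproject`, `render_points`) is a hypothetical stand-in for the actual `anchor_generation/` code:

```python
# Illustrative pseudocode only; the real implementation is in the repo's
# anchor_generation/ scripts. All helper functions here are hypothetical.
import numpy as np

def make_anchor(source_frames, target_cameras):
    anchor_frames = []
    for frame, cam in zip(source_frames, target_cameras):
        depth = estimate_depth(frame)             # monocular depth for this frame
        points, colors = unproject(frame, depth)  # lift pixels to a 3D point cloud
        # Forward-warp: splat the colored points into the target camera view;
        # regions with no geometry are left black, per the anchor convention.
        anchor_frames.append(render_points(points, colors, cam, background="black"))
    return np.stack(anchor_frames)
```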

### 4. Run reshooting

```bash
torchrun --nproc_per_node=8 generate.py \
    --task v2v-A14B \
    --size 1280*720 \
    --frame_num 81 \
    --ckpt_dir ./Wan2.2-I2V-A14B \
    --high_noise_lora_path ./reshoot-anything-weights/jan06_scaling_80k_ckpt1400.safetensors \
    --low_noise_lora_path ./reshoot-anything-weights/dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors \
    --source_video examples/source.mp4 \
    --anchor_video examples/anchor.mp4 \
    --dit_fsdp \
    --t5_fsdp \
    --ulysses_size 8
```

> **Note:** Refer to the [GitHub README](https://github.com/morphicfilms/video-to-video) for the authoritative argument names and single-GPU usage.

---

## How It Works

Reshoot-Anything adapts the **Wan2.2-14B Mixture-of-Experts (MoE)** DiT with two key architectural changes:

**Dual-stream token conditioning.** Both the anchor video `V_a` (geometric guide) and the source video `V_s` (texture reference) are VAE-encoded and temporally concatenated as tokens into the model's main self-attention stream. This outperforms cross-attention for view synchronization because it lets the model directly route textures across spatial and temporal positions.
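
A minimal sketch of this conditioning, assuming a generic patchify interface (tensor names and shapes are ours, not the repo's):

```python
# Sketch: concatenate denoised, anchor, and source token streams along the
# sequence axis so one self-attention context covers all three.
import torch

def build_token_sequence(noisy_latents, anchor_latents, source_latents, patchify):
    x  = patchify(noisy_latents)   # tokens being denoised
    va = patchify(anchor_latents)  # geometric guide V_a
    vs = patchify(source_latents)  # texture reference V_s
    # Shapes: [B, N_i, C] each. Self-attention over the joint sequence lets
    # the model route texture between any spatial/temporal positions directly.
    return torch.cat([x, va, vs], dim=1)
```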

**Offset RoPE.** A fixed temporal offset of 50 is added to the positional embeddings of the source-video tokens, strictly decoupling the source context from the active denoising trajectory.
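
In code the idea reduces to shifting the temporal position index of source tokens; a sketch (the real RoPE plumbing inside the DiT is more involved):

```python
# Sketch: source tokens are embedded at temporal positions t + 50, so they
# never collide with positions on the active denoising trajectory.
import torch

SOURCE_OFFSET = 50  # fixed temporal offset from the paper

def temporal_positions(num_frames: int, is_source: bool) -> torch.Tensor:
    pos = torch.arange(num_frames)
    return pos + SOURCE_OFFSET if is_source else pos

# Denoised/anchor tokens see positions 0..T-1; source tokens see 50..49+T.
```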

The model learns **implicit 4D spatiotemporal routing**: when a target frame requires content that is occluded in the corresponding source frame, the model locates and re-projects the missing texture from a different timestep in the source video.

### Self-Supervised Training Pipeline

Training requires no paired multi-view data. From a single monocular video (sketched in code after the list):

1. Two independent smooth random-walk crop trajectories are sampled → source `V_s` and target `V_t`
2. `V_s[0]` is forward-warped via [AllTracker](https://arxiv.org/abs/2504.11111) dense flow + crop offset → anchor `V_a`
3. The triplet `(V_s, V_a, V_t)` forms the training signal
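
A toy sketch of the triplet construction; `sample_crop_trajectory`, `crop`, `dense_flow`, and `forward_warp` are hypothetical stand-ins for the data pipeline:

```python
# Toy sketch of the self-supervised triplet (step numbers match the list above).
def make_training_triplet(video):
    # 1. Sample two independent smooth random-walk crop trajectories.
    traj_s, traj_t = sample_crop_trajectory(video), sample_crop_trajectory(video)
    V_s, V_t = crop(video, traj_s), crop(video, traj_t)
    # 2. Forward-warp the first source frame with AllTracker-style dense flow
    #    plus the crop offset to get the geometric anchor.
    V_a = forward_warp(V_s[0], dense_flow(video), traj_s, traj_t)
    # 3. Condition on (V_s, V_a); supervise the output against V_t.
    return V_s, V_a, V_t
```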

A **hybrid dataset strategy** augments the monocular pipeline with a 15% mixture of paired synthetic data from [ReCamMaster](https://github.com/KwaiVGI/ReCamMaster), enabling generalization to extreme (120°+) orbital camera trajectories.

---

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | Wan2.2-I2V-A14B (14B MoE) |
| LoRA rank | 512 (attention + FFN) |
| Training steps | 2,000 per expert |
| Batch size | 24 |
| Learning rate | 1e-5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Loss | MSE + 0.1 × L1 source reconstruction |
| Latent frames | 20 |
| Primary data | ~100k clips from 30k monocular videos |
| Synthetic mixture | 15% ReCamMaster paired clips |

**Key augmentations:** 3D-aware noise injection into the anchor reference frame (magnitude sampled uniformly from [0, 0.5]), fluorescent-pink backgrounds for masked regions, random selection of the anchor reference frame, and an auxiliary reconstruction loss on the source tokens.
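
The loss row in the table, written out as a sketch (tensor names are our assumptions):

```python
# Sketch of the training objective: diffusion MSE plus a 0.1-weighted L1
# reconstruction on the source tokens (the auxiliary loss mentioned above).
import torch.nn.functional as F

def training_loss(pred, target, pred_source_tokens, source_tokens):
    mse = F.mse_loss(pred, target)                          # main diffusion term
    l1_src = F.l1_loss(pred_source_tokens, source_tokens)   # auxiliary term
    return mse + 0.1 * l1_src
```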

---

## Evaluation

Results on 100 five-second videos from [OpenSora-MixKit](https://huggingface.co/datasets/opensora/OpenSora-MixKit) (16 fps, 480p):

| Method | CLIP-F ↑ | RotErr ↓ | TransErr ↓ | Mat. Pix ↑ | FVD-V ↓ | CLIP-V ↑ |
|--------|----------|----------|------------|------------|---------|----------|
| ReCamMaster | 98.49 | 11.29 | 19.59 | 1314.00 | 732.52 | 88.91 |
| EX-4D | 98.94 | 3.94 | 4.21 | 2188.98 | 685.63 | 89.77 |
| TrajectoryCrafter (49f) | 98.80 | 2.26 | 3.03 | 1851.80 | 582.56 | 92.40 |
| **Ours** | **99.03** | 2.76 | 4.23 | **2720.83** | **586.24** | **93.16** |
| **Ours (49f)** | **99.01** | **2.61** | **2.73** | **2737.65** | **488.22** | **94.96** |
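
For reference, CLIP-F is commonly computed as the mean CLIP cosine similarity between consecutive frames; a sketch under that assumption (the paper's exact model choice and preprocessing may differ):

```python
# Sketch of CLIP-F as mean consecutive-frame CLIP similarity, scaled to ~100.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_f(frames):  # frames: list of PIL images from one video
    inputs = processor(images=frames, return_tensors="pt")
    emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)          # unit-normalize
    return (emb[:-1] * emb[1:]).sum(dim=-1).mean().item() * 100
```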

---

## Citation

```bibtex
@article{paliwal2026reshootanything,
  title={Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting},
  author={Paliwal, Avinash and Iyer, Adithya and Yadav, Shivin and Afridi, Muhammad Ali and Harikumar, Midhun},
  journal={arXiv preprint arXiv:2604.21776},
  year={2026}
}
```

---

## License

Model weights are released under the **Apache 2.0** license, consistent with the Wan2.2 base model.