---
license: apache-2.0
language:
- en
tags:
- video-generation
- video-editing
- novel-view-synthesis
- camera-control
- diffusion-transformer
- wan2.2
- lora
- video-reshooting
- 4d-reconstruction
pipeline_tag: video-to-video
base_model: Wan-AI/Wan2.2-I2V-A14B
---
# Reshoot-Anything
[Code](https://github.com/morphicfilms/video-to-video) · [Paper](https://arxiv.org/abs/2604.21776) · [Project Page](https://adithyaiyer1999.github.io/reshoot-anything/)
**Reshoot-Anything** is a self-supervised video reshooting model built on top of [Wan2.2-I2V-A14B](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B). Given a source video and a target camera trajectory (encoded as an anchor video), it generates a high-fidelity reshoot that faithfully follows the new camera path while preserving the original content, complex dynamics, and temporal consistency. It is trained entirely on in-the-wild monocular videos.
> **Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting**
> Avinash Paliwal, Adithya Iyer, Shivin Yadav, Muhammad Ali Afridi, Midhun Harikumar
> Morphic Inc. · [arXiv:2604.21776](https://arxiv.org/abs/2604.21776)
---
<table>
<tr>
<td align="center"><b>Source Video</b></td>
<td align="center"><b>Reshot Video</b></td>
</tr>
<tr>
<td><img src="https://github.com/morphicfilms/video-to-video/blob/main/assets/woman_og.gif?raw=true" width="400" alt="Source video"></td>
<td><img src="https://github.com/morphicfilms/video-to-video/blob/main/assets/woman_01.gif?raw=true" width="400" alt="Reshot video"></td>
</tr>
</table>
## Model Files
This repository contains two LoRA checkpoints (rank-512, applied to attention and feed-forward layers of Wan2.2-I2V-A14B):
| File | Role | Notes |
|------|------|-------|
| `jan06_scaling_80k_ckpt1400.safetensors` | **High-noise expert** | Controls early denoising steps. Primarily responsible for camera motion alignment and global scene structure. Trained on ~80k clips with scaling augmentations + 15% synthetic data mixture. |
| `dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors` | **Low-noise expert** | Controls late denoising steps. Responsible for texture fidelity and fine detail. Uses standard black-background anchors, no source reconstruction loss. |
Each file is ~9.82 GB.
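The two experts are active at different stages of the denoising schedule. As a rough illustration only (this is not the actual `generate.py` logic, and the switch point below is a placeholder), the routing between the two checkpoints can be sketched like this:

```python
from safetensors.torch import load_file

# Load both LoRA checkpoints (each ~9.82 GB of rank-512 adapter weights).
high_noise = load_file("./reshoot-anything-weights/jan06_scaling_80k_ckpt1400.safetensors")
low_noise = load_file("./reshoot-anything-weights/dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors")

def expert_for_timestep(t: int, boundary: int = 900) -> dict:
    """Early (high-noise) steps use the high-noise expert, late steps the low-noise one.
    The boundary value here is a placeholder; the real switch point is set by the pipeline."""
    return high_noise if t >= boundary else low_noise
```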
---
## Quickstart
### 1. Clone the repository
```bash
git clone https://github.com/morphicfilms/video-to-video.git
cd video-to-video
```
Follow the [Wan2.2 installation guide](https://github.com/Wan-Video/Wan2.2) to set up the environment, or run:
```bash
bash setup_env.sh
```
### 2. Download the weights
Download the Wan2.2 I2V base weights:
```bash
huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir ./Wan2.2-I2V-A14B
```
Download the Reshoot-Anything LoRA weights:
```bash
huggingface-cli download morphic/reshoot-anything --local-dir ./reshoot-anything-weights
```
### 3. Prepare your anchor video
At inference, generate an anchor video by converting your source video to a 4D point cloud, applying the target camera trajectory, and forward-warping to produce the geometric anchor. See the repo's `anchor_generation/` scripts for details.
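For intuition, a minimal sketch of the forward-warping step is below. It assumes per-frame depth maps and pinhole intrinsics are already available and uses a naive nearest-pixel splat with no occlusion handling; the actual `anchor_generation/` scripts operate on a full 4D point cloud.

```python
import torch

def warp_frame(rgb, depth, K, T_src_to_tgt):
    """Forward-warp one source frame into the target camera to form an anchor frame.

    rgb:          (3, H, W) source colors
    depth:        (H, W)    per-pixel source depth
    K:            (3, 3)    pinhole intrinsics (assumed shared by both cameras)
    T_src_to_tgt: (4, 4)    relative camera pose from the target trajectory
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).float().reshape(3, -1)

    # Unproject to 3D in the source camera, then move into the target camera.
    pts_src = torch.linalg.inv(K) @ pix * depth.reshape(1, -1)
    pts_h = torch.cat([pts_src, torch.ones(1, H * W)], dim=0)
    pts_tgt = (T_src_to_tgt @ pts_h)[:3]

    # Project into the target view and splat colors; unfilled pixels stay black.
    proj = K @ pts_tgt
    uv = (proj[:2] / proj[2].clamp(min=1e-6)).round().long()
    valid = (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H) & (pts_tgt[2] > 0)

    anchor = torch.zeros_like(rgb)
    anchor[:, uv[1, valid], uv[0, valid]] = rgb.reshape(3, -1)[:, valid]
    return anchor
```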
### 4. Run reshooting
```bash
torchrun --nproc_per_node=8 generate.py \
--task v2v-A14B \
--size 1280*720 \
--frame_num 81 \
--ckpt_dir ./Wan2.2-I2V-A14B \
--high_noise_lora_path ./reshoot-anything-weights/jan06_scaling_80k_ckpt1400.safetensors \
--low_noise_lora_path ./reshoot-anything-weights/dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors \
--source_video examples/source.mp4 \
--anchor_video examples/anchor.mp4 \
--dit_fsdp \
--t5_fsdp \
--ulysses_size 8
```
> **Note:** Refer to the [GitHub README](https://github.com/morphicfilms/video-to-video) for the authoritative argument names and single-GPU usage.
---
## How It Works
Reshoot-Anything adapts the **Wan2.2-14B Mixture-of-Experts (MoE)** DiT with two key architectural changes:
**Dual-stream token conditioning:** Both the anchor video `V_a` (geometric guide) and the source video `V_s` (texture reference) are VAE-encoded and temporally concatenated as tokens into the model's main self-attention mechanism. This outperforms cross-attention for view synchronization by letting the model directly route textures across spatial and temporal positions.
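A schematic of this conditioning, with made-up tensor shapes and a hypothetical `patchify` helper (not Wan2.2's actual tokenizer), might look like:

```python
import torch

def build_token_sequence(noisy_latent, anchor_latent, source_latent, patchify):
    """Each latent has shape (B, C, T, H, W); `patchify` maps it to (B, N, D) tokens."""
    x_t = patchify(noisy_latent)   # tokens being denoised (target view)
    a = patchify(anchor_latent)    # geometric guide V_a
    s = patchify(source_latent)    # texture reference V_s
    # Concatenate along the sequence axis so a single self-attention pass
    # can route texture from any source/anchor token to any target token.
    return torch.cat([x_t, a, s], dim=1)
```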
**Offset RoPE:** A fixed temporal offset of 50 is added to the source video tokens' positional embeddings, strictly decoupling the source context from the active denoising trajectory.
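Only the offset of 50 comes from the description above; the index layout in this sketch (a 1D temporal index feeding a rotary embedding) is an illustrative assumption:

```python
import torch

def temporal_indices(num_frames: int, source_offset: int = 50) -> torch.Tensor:
    """Temporal positions for RoPE: the denoising stream keeps its native timeline,
    while source tokens are shifted by a fixed offset so they never overlap
    positions on the active trajectory."""
    main = torch.arange(num_frames)                    # active denoising trajectory
    source = torch.arange(num_frames) + source_offset  # source context, decoupled
    return torch.cat([main, source])
```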
The model learns **implicit 4D spatiotemporal routing**: when a target frame requires content that is occluded in the corresponding source frame, the model locates and re-projects the missing texture from a different timestep in the source video.
### Self-Supervised Training Pipeline
Training requires no paired multi-view data. From a single monocular video:
1. Two independent smooth random-walk crop trajectories are sampled: source `V_s` and target `V_t`
2. `V_s[0]` is forward-warped via [AllTracker](https://arxiv.org/abs/2504.11111) dense flow + crop offset to produce the anchor `V_a`
3. The triplet `(V_s, V_a, V_t)` forms the training signal (a sketch of the crop sampling follows this list)
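A toy sketch of the random-walk crop sampling in step 1; the window sizes, step scale, and smoothing below are made-up values, not the training configuration:

```python
import numpy as np

def sample_crop_trajectory(num_frames, frame_hw, crop_hw, step=8.0, smooth=9, rng=None):
    """Return per-frame (top, left) crop offsets following a smoothed random walk."""
    rng = rng or np.random.default_rng()
    H, W = frame_hw
    ch, cw = crop_hw
    # Cumulative Gaussian steps -> random walk, then a moving average to smooth it.
    walk = np.cumsum(rng.normal(scale=step, size=(num_frames, 2)), axis=0)
    kernel = np.ones(smooth) / smooth
    walk = np.stack([np.convolve(walk[:, i], kernel, mode="same") for i in range(2)], axis=1)
    # Start near the center and clamp so the crop stays inside the frame.
    start = np.array([(H - ch) / 2, (W - cw) / 2])
    offsets = np.clip(start + walk, 0, [H - ch, W - cw])
    return offsets.astype(int)

# Two independent trajectories give the source V_s and target V_t crops.
src_traj = sample_crop_trajectory(81, (1080, 1920), (720, 1280))
tgt_traj = sample_crop_trajectory(81, (1080, 1920), (720, 1280))
```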
A **hybrid dataset strategy** augments the monocular pipeline with a 15% mixture of paired synthetic data from [ReCamMaster](https://github.com/KwaiVGI/ReCamMaster), enabling generalization to extreme (120°+) orbital camera trajectories.
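The mixing rule itself is simple; a hedged per-sample version (pool names and the exact sampling mechanism are illustrative) could be:

```python
import random

def sample_clip(monocular_pool, synthetic_pool, synthetic_ratio=0.15):
    """Draw a training clip: ~15% from the paired ReCamMaster pool,
    the rest from the self-supervised monocular pipeline."""
    pool = synthetic_pool if random.random() < synthetic_ratio else monocular_pool
    return random.choice(pool)
```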
---
## Training Details
| Parameter | Value |
|-----------|-------|
| Base model | Wan2.2-I2V-A14B (14B MoE) |
| LoRA rank | 512 (attention + FFN) |
| Training steps | 2,000 per expert |
| Batch size | 24 |
| Learning rate | 1e-5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Loss | MSE + 0.1 × L1 source reconstruction |
| Latent frames | 20 |
| Primary data | ~100k clips from 30k monocular videos |
| Synthetic mixture | 15% ReCamMaster paired clips |
**Key augmentations:** 3D-aware noise injection into the anchor reference frame (magnitude sampled uniformly in [0, 0.5]), fluorescent pink masked-region backgrounds, random anchor reference frame selection, source token auxiliary reconstruction loss.
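The objective in the table above combines the standard denoising MSE with the 0.1-weighted L1 auxiliary reconstruction of the source tokens; a hedged sketch (tensor names are illustrative):

```python
import torch.nn.functional as F

def reshoot_loss(pred_target, gt_target, pred_source, gt_source, aux_weight=0.1):
    """Denoising MSE on target tokens + 0.1 x L1 source-token reconstruction."""
    denoise = F.mse_loss(pred_target, gt_target)    # main objective
    src_recon = F.l1_loss(pred_source, gt_source)   # auxiliary source reconstruction
    return denoise + aux_weight * src_recon
```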
---
## Evaluation
Results on 100 five-second videos from [OpenSora-Mixkit](https://huggingface.co/datasets/opensora/OpenSora-MixKit) (16fps, 480p):
| Method | CLIP-F ↑ | RotErr ↓ | TransErr ↓ | Mat. Pix ↑ | FVD-V ↓ | CLIP-V ↑ |
|--------|----------|----------|------------|------------|---------|----------|
| ReCamMaster | 98.49 | 11.29 | 19.59 | 1314.00 | 732.52 | 88.91 |
| EX-4D | 98.94 | 3.94 | 4.21 | 2188.98 | 685.63 | 89.77 |
| TrajectoryCrafter (49f) | 98.80 | 2.26 | 3.03 | 1851.80 | 582.56 | 92.40 |
| **Ours** | **99.03** | 2.76 | 4.23 | **2720.83** | **586.24** | **93.16** |
| **Ours (49f)** | **99.01** | **2.61** | **2.73** | **2737.65** | **488.22** | **94.96** |
---
## Citation
```bibtex
@article{paliwal2026reshootanything,
title={Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting},
author={Paliwal, Avinash and Iyer, Adithya and Yadav, Shivin and Afridi, Muhammad Ali and Harikumar, Midhun},
journal={arXiv preprint arXiv:2604.21776},
year={2026}
}
```
---
## License
Model weights are released under the **Apache 2.0** license, consistent with the Wan2.2 base model.