---
license: apache-2.0
language:
- en
tags:
- video-generation
- video-editing
- novel-view-synthesis
- camera-control
- diffusion-transformer
- wan2.2
- lora
- video-reshooting
- 4d-reconstruction
pipeline_tag: video-to-video
base_model: Wan-AI/Wan2.2-I2V-A14B
---

# Reshoot-Anything

[![GitHub](https://img.shields.io/badge/GitHub-video--to--video-black?logo=github)](https://github.com/morphicfilms/video-to-video)
[![arXiv](https://img.shields.io/badge/arXiv-2604.21776-red?logo=arxiv)](https://arxiv.org/abs/2604.21776)
[![Website](https://img.shields.io/badge/Website-morphic.com-purple)](https://adithyaiyer1999.github.io/reshoot-anything/)

**Reshoot-Anything** is a self-supervised video reshooting model built on top of [Wan2.2-I2V-A14B](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B). Given a source video and a target camera trajectory (encoded as an anchor video), it generates a high-fidelity reshoot that faithfully follows the new camera path while preserving the original content, complex dynamics, and temporal consistency. The model is trained entirely on in-the-wild monocular videos.

> **Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting**  
> Avinash Paliwal, Adithya Iyer, Shivin Yadav, Muhammad Ali Afridi, Midhun Harikumar  
> Morphic Inc. · [arXiv:2604.21776](https://arxiv.org/abs/2604.21776)

---

<table>
<tr>
<td align="center"><b>Source Video</b></td>
<td align="center"><b>Reshot Video</b></td>
</tr>
<tr>
<td><img src="https://github.com/morphicfilms/video-to-video/blob/main/assets/woman_og.gif?raw=true" width="400" alt="Source video"></td>
<td><img src="https://github.com/morphicfilms/video-to-video/blob/main/assets/woman_01.gif?raw=true" width="400" alt="Reshot video"></td>
</tr>
</table>


## Model Files

This repository contains two LoRA checkpoints (rank-512, applied to attention and feed-forward layers of Wan2.2-I2V-A14B):

| File | Role | Notes |
|------|------|-------|
| `jan06_scaling_80k_ckpt1400.safetensors` | **High-noise expert** | Controls early denoising steps. Primarily responsible for camera motion alignment and global scene structure. Trained on ~80k clips with scaling augmentations + 15% synthetic data mixture. |
| `dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors` | **Low-noise expert** | Controls late denoising steps. Responsible for texture fidelity and fine detail. Uses standard black-background anchors, no source reconstruction loss. |

Each file is approximately 9.82 GB.

---

## Quickstart

### 1. Clone the repository

```bash
git clone https://github.com/morphicfilms/video-to-video.git
cd video-to-video
```

Follow the [Wan2.2 installation guide](https://github.com/Wan-Video/Wan2.2) to set up the environment, or run:

```bash
bash setup_env.sh
```

### 2. Download the weights

Download the Wan2.2 I2V base weights:

```bash
huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir ./Wan2.2-I2V-A14B
```

Download the Reshoot-Anything LoRA weights:

```bash
huggingface-cli download morphic/reshoot-anything --local-dir ./reshoot-anything-weights
```

### 3. Prepare your anchor video

At inference, generate an anchor video by converting your source video to a 4D point cloud, applying the target camera trajectory, and forward-warping to produce the geometric anchor. See the repo's `anchor_generation/` scripts for details.
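The anchor construction described above can be sketched roughly as follows. This is an illustrative stand-in for the repo's `anchor_generation/` scripts, not the actual implementation: the pinhole intrinsics, the constant depth map, and the nearest-pixel splat are all simplifying assumptions (real anchors use estimated depth/flow, z-buffering, and hole handling).

```python
# Rough sketch (assumptions, not the repo's code): lift a source frame to a 3D
# point cloud with per-pixel depth, apply a new camera pose, and forward-warp
# (splat) the pixels into the new view to form one anchor frame.
import numpy as np

def forward_warp(frame, depth, K, T_new, H, W):
    """Splat `frame` (H, W, 3) into a new view given depth and a 4x4 pose."""
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Unproject pixels to camera-space 3D points using the depth map.
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])
    # Apply the target camera pose, then project back through the intrinsics.
    cam = (T_new @ pts_h)[:3]
    proj = K @ cam
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    out = np.zeros((H, W, 3), dtype=frame.dtype)
    ok = (cam[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out[v[ok], u[ok]] = frame.reshape(-1, 3)[ok]  # nearest-pixel splat; real code
    return out                                    # would z-buffer and fill holes

H, W = 32, 32
frame = np.random.rand(H, W, 3).astype(np.float32)
depth = np.full((H, W), 2.0)                      # toy constant depth
K = np.array([[30.0, 0, W / 2], [0, 30.0, H / 2], [0, 0, 1.0]])
anchor = forward_warp(frame, depth, K, np.eye(4), H, W)  # identity pose: anchor == frame
```

With an identity target pose the warp reproduces the source frame; a non-trivial pose moves pixels and leaves holes where occluded content is missing, which is exactly what the model learns to fill.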

### 4. Run reshooting

```bash
torchrun --nproc_per_node=8 generate.py \
    --task v2v-A14B \
    --size 1280*720 \
    --frame_num 81 \
    --ckpt_dir ./Wan2.2-I2V-A14B \
    --high_noise_lora_path ./reshoot-anything-weights/jan06_scaling_80k_ckpt1400.safetensors \
    --low_noise_lora_path ./reshoot-anything-weights/dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors \
    --source_video examples/source.mp4 \
    --anchor_video examples/anchor.mp4 \
    --dit_fsdp \
    --t5_fsdp \
    --ulysses_size 8
```

> **Note:** Refer to the [GitHub README](https://github.com/morphicfilms/video-to-video) for the authoritative argument names and single-GPU usage.

---

## How It Works

Reshoot-Anything adapts the **Wan2.2-14B Mixture-of-Experts (MoE)** DiT with two key architectural changes:

**Dual-stream token conditioning:** Both the anchor video `V_a` (geometric guide) and the source video `V_s` (texture reference) are VAE-encoded and temporally concatenated as tokens into the model's main self-attention mechanism. This outperforms cross-attention for view synchronization by letting the model directly route textures across spatial and temporal positions.

**Offset RoPE:** A fixed temporal offset of 50 is added to the source video tokens' positional embeddings, strictly decoupling the source context from the active denoising trajectory.
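The offset-RoPE idea can be sketched with a standard 1-D rotary embedding. The offset of 50 and the 20 latent frames come from this card; the embedding dimension and the plain sinusoidal helper are illustrative assumptions, not the model's actual RoPE implementation.

```python
# Illustrative sketch (not the actual model code): temporal rotary embeddings
# with a fixed offset on source-video tokens, so source context occupies
# positions disjoint from the denoising trajectory's positions.
import numpy as np

def rope_1d(positions, dim):
    """Standard 1-D rotary embedding angles for the given integer positions."""
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    ang = np.outer(positions, freqs)                       # (T, dim/2)
    return np.concatenate([np.cos(ang), np.sin(ang)], axis=-1)

T_lat, dim = 20, 64                  # 20 latent frames (as in training); toy dim
denoise_pos = np.arange(T_lat)       # noisy-latent / anchor stream: positions 0..19
source_pos = np.arange(T_lat) + 50   # source stream: shifted by the fixed offset 50

denoise_rope = rope_1d(denoise_pos, dim)
source_rope = rope_1d(source_pos, dim)
# The two streams are then concatenated along the token axis and fed to the
# DiT's self-attention, which can route textures across frames and streams.
```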

The model learns **implicit 4D spatiotemporal routing**: when a target frame requires content that is occluded in the corresponding source frame, the model locates and re-projects the missing texture from a different timestep in the source video.

### Self-Supervised Training Pipeline

Training requires no paired multi-view data. From a single monocular video:
1. Two independent smooth random-walk crop trajectories are sampled → source `V_s` and target `V_t`
2. `V_s[0]` is forward-warped via [AllTracker](https://arxiv.org/abs/2504.11111) dense flow + crop offset → anchor `V_a`
3. The triplet `(V_s, V_a, V_t)` forms the training signal
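Step 1 above can be sketched as follows. The exact smoothing and step distribution are assumptions for illustration (the paper's trajectory sampler is not specified here); this just shows the idea of two independent, smooth crop paths within one monocular video.

```python
# Toy sketch (assumed smoothing, not the paper's sampler): a smooth random-walk
# trajectory of (x, y) crop offsets, clipped to stay inside the frame.
import numpy as np

def smooth_random_walk(n_frames, max_offset, step_std=2.0, kernel=9, seed=0):
    """Cumulative Gaussian steps, box-smoothed and clipped to the valid range."""
    rng = np.random.default_rng(seed)
    walk = np.cumsum(rng.normal(0.0, step_std, size=(n_frames, 2)), axis=0)
    k = np.ones(kernel) / kernel                  # simple box filter for smoothness
    smoothed = np.stack(
        [np.convolve(walk[:, i], k, mode="same") for i in range(2)], axis=1
    )
    return np.clip(smoothed, 0, max_offset).astype(int)

# Two independent trajectories yield the source (V_s) and target (V_t) crops.
traj_s = smooth_random_walk(81, max_offset=256, seed=1)
traj_t = smooth_random_walk(81, max_offset=256, seed=2)
```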

A **hybrid dataset strategy** augments the monocular pipeline with a 15% mixture of paired synthetic data from [ReCamMaster](https://github.com/KwaiVGI/ReCamMaster), enabling generalization to extreme (120°+) orbital camera trajectories.

---

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | Wan2.2-I2V-A14B (14B MoE) |
| LoRA rank | 512 (attention + FFN) |
| Training steps | 2,000 per expert |
| Batch size | 24 |
| Learning rate | 1e-5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Loss | MSE + 0.1 × L1 source reconstruction |
| Latent frames | 20 |
| Primary data | ~100k clips from 30k monocular videos |
| Synthetic mixture | 15% ReCamMaster paired clips |

**Key augmentations:** 3D-aware noise injection into the anchor reference frame (magnitude sampled uniformly from [0, 0.5]), fluorescent pink masked-region backgrounds, random anchor reference frame selection, and an auxiliary source-token reconstruction loss.
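The magnitude schedule of the noise augmentation can be shown concretely. Only the uniformly sampled strength in [0, 0.5] comes from this card; the actual augmentation is 3D-aware (it perturbs the anchor's geometry), whereas this toy version adds plain pixel noise as a stand-in.

```python
# Hedged sketch: per-sample noise strength drawn uniformly from [0, 0.5] and
# applied to a toy anchor reference frame. The real augmentation perturbs
# geometry (3D-aware), not raw pixels as done here.
import numpy as np

rng = np.random.default_rng(0)
anchor_ref = rng.random((64, 64, 3)).astype(np.float32)   # toy anchor reference frame
magnitude = rng.uniform(0.0, 0.5)                         # per-sample strength
noisy_ref = anchor_ref + magnitude * rng.standard_normal(anchor_ref.shape).astype(np.float32)
```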

---

## Evaluation

Results on 100 five-second videos from [OpenSora-Mixkit](https://huggingface.co/datasets/opensora/OpenSora-MixKit) (16fps, 480p):

| Method | CLIP-F ↑ | RotErr ↓ | TransErr ↓ | Mat. Pix ↑ | FVD-V ↓ | CLIP-V ↑ |
|--------|----------|----------|------------|------------|---------|----------|
| ReCamMaster | 98.49 | 11.29 | 19.59 | 1314.00 | 732.52 | 88.91 |
| EX-4D | 98.94 | 3.94 | 4.21 | 2188.98 | 685.63 | 89.77 |
| TrajectoryCrafter (49f) | 98.80 | 2.26 | 3.03 | 1851.80 | 582.56 | 92.40 |
| **Ours** | **99.03** | 2.76 | 4.23 | **2720.83** | 586.24 | **93.16** |
| **Ours (49f)** | **99.01** | **2.61** | **2.73** | **2737.65** | **488.22** | **94.96** |

---

## Citation

```bibtex
@article{paliwal2026reshootanything,
  title={Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting},
  author={Paliwal, Avinash and Iyer, Adithya and Yadav, Shivin and Afridi, Muhammad Ali and Harikumar, Midhun},
  journal={arXiv preprint arXiv:2604.21776},
  year={2026}
}
```

---

## License

Model weights are released under the **Apache 2.0** license, consistent with the Wan2.2 base model.