---
license: apache-2.0
language:
- en
tags:
- video-generation
- video-editing
- novel-view-synthesis
- camera-control
- diffusion-transformer
- wan2.2
- lora
- video-reshooting
- 4d-reconstruction
pipeline_tag: video-to-video
base_model: Wan-AI/Wan2.2-I2V-A14B
---

# Reshoot-Anything

[Code](https://github.com/morphicfilms/video-to-video) · [Paper](https://arxiv.org/abs/2604.21776) · [Project Page](https://adithyaiyer1999.github.io/reshoot-anything/)

**Reshoot-Anything** is a self-supervised video reshooting model built on top of [Wan2.2-I2V-A14B](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B). Given a source video and a target camera trajectory (encoded as an anchor video), it generates a high-fidelity reshoot that faithfully follows the new camera path while preserving the original content, complex dynamics, and temporal consistency. The model is trained entirely on in-the-wild monocular videos.

> **Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting**
> Avinash Paliwal, Adithya Iyer, Shivin Yadav, Muhammad Ali Afridi, Midhun Harikumar
> Morphic Inc. · [arXiv:2604.21776](https://arxiv.org/abs/2604.21776)

---

<table>
<tr>
<td align="center"><b>Source Video</b></td>
<td align="center"><b>Reshot Video</b></td>
</tr>
<tr>
<td><img src="https://github.com/morphicfilms/video-to-video/blob/main/assets/woman_og.gif?raw=true" width="400" alt="Source video"></td>
<td><img src="https://github.com/morphicfilms/video-to-video/blob/main/assets/woman_01.gif?raw=true" width="400" alt="Reshot video"></td>
</tr>
</table>

## Model Files

This repository contains two LoRA checkpoints (rank 512, applied to the attention and feed-forward layers of Wan2.2-I2V-A14B):

| File | Role | Notes |
|------|------|-------|
| `jan06_scaling_80k_ckpt1400.safetensors` | **High-noise expert** | Controls early denoising steps. Primarily responsible for camera motion alignment and global scene structure. Trained on ~80k clips with scaling augmentations + 15% synthetic data mixture. |
| `dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors` | **Low-noise expert** | Controls late denoising steps. Responsible for texture fidelity and fine detail. Uses standard black-background anchors, no source reconstruction loss. |

Each file is ~9.82 GB.
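
The two checkpoints mirror Wan2.2's high-noise/low-noise expert split over the denoising schedule. A minimal routing sketch follows; the boundary value is a hypothetical placeholder, and the authoritative expert-switching and weight-merging logic lives in the repo's `generate.py`:

```python
# Minimal sketch of expert routing across the denoising schedule.
# BOUNDARY is hypothetical; see generate.py for the real switching logic.
from safetensors.torch import load_file

high_noise_lora = load_file("reshoot-anything-weights/jan06_scaling_80k_ckpt1400.safetensors")
low_noise_lora = load_file("reshoot-anything-weights/dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors")

BOUNDARY = 0.9  # hypothetical handoff point on a [0, 1] noise scale

def pick_expert(noise_level: float) -> dict:
    """Early (high-noise) steps set structure; late steps refine texture."""
    return high_noise_lora if noise_level >= BOUNDARY else low_noise_lora
```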

---

## Quickstart

### 1. Clone the repository

```bash
git clone https://github.com/morphicfilms/video-to-video.git
cd video-to-video
```

Follow the [Wan2.2 installation guide](https://github.com/Wan-Video/Wan2.2) to set up the environment, or run:

```bash
bash setup_env.sh
```

### 2. Download the weights

Download the Wan2.2 I2V base weights:

```bash
huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir ./Wan2.2-I2V-A14B
```

Download the Reshoot-Anything LoRA weights:

```bash
huggingface-cli download morphic/reshoot-anything --local-dir ./reshoot-anything-weights
```
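
If you prefer scripting the downloads, the same two repositories can be fetched with `huggingface_hub`, equivalent to the CLI calls above:

```python
# Python equivalent of the two huggingface-cli downloads above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Wan-AI/Wan2.2-I2V-A14B", local_dir="./Wan2.2-I2V-A14B")
snapshot_download(repo_id="morphic/reshoot-anything", local_dir="./reshoot-anything-weights")
```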

### 3. Prepare your anchor video

At inference, generate an anchor video by converting your source video to a 4D point cloud, applying the target camera trajectory, and forward-warping to produce the geometric anchor. See the repo's `anchor_generation/` scripts for details.
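
To make the forward-warping step concrete, here is an illustrative single-frame sketch, not the repo's `anchor_generation/` code: unproject pixels with a depth map, move the camera, and re-project, leaving disoccluded regions black as in the anchors. `K`, `depth`, and the relative pose are assumed inputs, and the naive splat ignores z-ordering:

```python
# Illustrative forward-warp of one frame into a new camera view.
import numpy as np

def forward_warp(frame, depth, K, T_src_to_tgt):
    """frame: (H, W, 3) uint8, depth: (H, W) metric depth,
    K: (3, 3) intrinsics, T_src_to_tgt: (4, 4) relative camera pose."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Unproject every pixel to a camera-space 3D point.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)      # (3, H*W)
    # Move the points into the target camera's frame.
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])
    pts_tgt = (T_src_to_tgt @ pts_h)[:3]
    # Project and splat (naive: no z-buffer); holes stay black.
    proj = K @ pts_tgt
    uu = np.round(proj[0] / proj[2]).astype(int)
    vv = np.round(proj[1] / proj[2]).astype(int)
    ok = (proj[2] > 0) & (uu >= 0) & (uu < W) & (vv >= 0) & (vv < H)
    out = np.zeros_like(frame)
    out[vv[ok], uu[ok]] = frame.reshape(-1, 3)[ok]
    return out
```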

### 4. Run reshooting

```bash
torchrun --nproc_per_node=8 generate.py \
    --task v2v-A14B \
    --size 1280*720 \
    --frame_num 81 \
    --ckpt_dir ./Wan2.2-I2V-A14B \
    --high_noise_lora_path ./reshoot-anything-weights/jan06_scaling_80k_ckpt1400.safetensors \
    --low_noise_lora_path ./reshoot-anything-weights/dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors \
    --source_video examples/source.mp4 \
    --anchor_video examples/anchor.mp4 \
    --dit_fsdp \
    --t5_fsdp \
    --ulysses_size 8
```

> **Note:** Refer to the [GitHub README](https://github.com/morphicfilms/video-to-video) for the authoritative argument names and single-GPU usage.

---

## How It Works

Reshoot-Anything adapts the **Wan2.2-14B Mixture-of-Experts (MoE)** DiT with two key architectural changes:

**Dual-stream token conditioning.** Both the anchor video `V_a` (geometric guide) and the source video `V_s` (texture reference) are VAE-encoded and temporally concatenated as tokens into the model's main self-attention sequence. This outperforms cross-attention for view synchronization because it lets the model directly route textures across spatial and temporal positions, as in the sketch below.
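
A conceptual sketch of that conditioning, with shapes flattened to `(batch, tokens, dim)` for readability; the real DiT operates on patchified 3D video latents, and the concatenation order here is an assumption:

```python
# One self-attention sequence over all three streams, so target tokens can
# attend directly to anchor geometry and source texture.
import torch

def build_token_sequence(z_noisy, z_anchor, z_source):
    # z_*: (batch, tokens, dim) VAE-latent tokens; order is illustrative.
    return torch.cat([z_noisy, z_anchor, z_source], dim=1)
```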

**Offset RoPE.** A fixed temporal offset of 50 is added to the positional embeddings of source video tokens, strictly decoupling the source context from the active denoising trajectory.
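
In position-id terms the offset amounts to the following; frame counts are illustrative, and how this folds into Wan2.2's RoPE implementation is a detail of the repo:

```python
# Source tokens live at temporal positions 50 and above, so they can never
# collide with the positions of the frames currently being denoised.
import torch

SOURCE_TEMPORAL_OFFSET = 50

def temporal_positions(num_denoise_frames: int, num_source_frames: int):
    denoise_pos = torch.arange(num_denoise_frames)
    source_pos = torch.arange(num_source_frames) + SOURCE_TEMPORAL_OFFSET
    return denoise_pos, source_pos
```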

The model learns **implicit 4D spatiotemporal routing**: when a target frame requires content that is occluded in the corresponding source frame, the model locates and re-projects the missing texture from a different timestep of the source video.

### Self-Supervised Training Pipeline

Training requires no paired multi-view data. From a single monocular video:
1. Two independent smooth random-walk crop trajectories are sampled (see the sketch after this list), yielding a source video `V_s` and a target video `V_t`
2. `V_s[0]` is forward-warped via [AllTracker](https://arxiv.org/abs/2504.11111) dense flow plus the crop offset to produce the anchor `V_a`
3. The triplet `(V_s, V_a, V_t)` forms the training signal
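
A toy version of the trajectory sampling in step 1; the step scale and smoothing window are assumptions rather than the paper's values:

```python
# Smoothed random walk of crop origins over a video; two independent calls
# yield the source (V_s) and target (V_t) crop trajectories.
import numpy as np

def random_walk_crops(num_frames, frame_hw, crop_hw, step=8.0, smooth=9, seed=0):
    rng = np.random.default_rng(seed)
    H, W = frame_hw
    h, w = crop_hw
    path = np.cumsum(rng.normal(scale=step, size=(num_frames, 2)), axis=0)
    # Moving-average smoothing keeps the virtual camera motion gentle.
    kernel = np.ones(smooth) / smooth
    path = np.stack([np.convolve(path[:, i], kernel, mode="same") for i in (0, 1)], axis=1)
    center = np.array([(H - h) / 2, (W - w) / 2])
    return np.clip(path + center, 0, [H - h, W - w]).astype(int)  # (num_frames, 2)
```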

A **hybrid dataset strategy** augments the monocular pipeline with a 15% mixture of paired synthetic data from [ReCamMaster](https://github.com/KwaiVGI/ReCamMaster), enabling generalization to extreme (120°+) orbital camera trajectories.

---

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | Wan2.2-I2V-A14B (14B MoE) |
| LoRA rank | 512 (attention + FFN) |
| Training steps | 2,000 per expert |
| Batch size | 24 |
| Learning rate | 1e-5 |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.999) |
| Loss | MSE + 0.1 × L1 source reconstruction |
| Latent frames | 20 |
| Primary data | ~100k clips from 30k monocular videos |
| Synthetic mixture | 15% ReCamMaster paired clips |

**Key augmentations:** 3D-aware noise injection into the anchor reference frame (magnitude uniform in [0, 0.5]), fluorescent pink masked-region backgrounds, random anchor reference frame selection, and an auxiliary reconstruction loss on source tokens.
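
The objective in the table reduces to a weighted sum. A sketch, assuming the auxiliary term is a plain L1 over the model's reconstruction of the source tokens:

```python
# Diffusion MSE on the target plus 0.1-weighted L1 source reconstruction.
import torch.nn.functional as F

def training_loss(pred_target, gt_target, pred_source, gt_source):
    return F.mse_loss(pred_target, gt_target) + 0.1 * F.l1_loss(pred_source, gt_source)
```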

---

## Evaluation

Results on 100 five-second videos from [OpenSora-MixKit](https://huggingface.co/datasets/opensora/OpenSora-MixKit) (16 fps, 480p). Arrows indicate whether higher (↑) or lower (↓) is better:

| Method | CLIP-F ↑ | RotErr ↓ | TransErr ↓ | Mat. Pix ↑ | FVD-V ↓ | CLIP-V ↑ |
|--------|----------|----------|------------|------------|---------|----------|
| ReCamMaster | 98.49 | 11.29 | 19.59 | 1314.00 | 732.52 | 88.91 |
| EX-4D | 98.94 | 3.94 | 4.21 | 2188.98 | 685.63 | 89.77 |
| TrajectoryCrafter (49f) | 98.80 | 2.26 | 3.03 | 1851.80 | 582.56 | 92.40 |
| **Ours** | **99.03** | 2.76 | 4.23 | **2720.83** | **586.24** | **93.16** |
| **Ours (49f)** | **99.01** | **2.61** | **2.73** | **2737.65** | **488.22** | **94.96** |

---

## Citation

```bibtex
@article{paliwal2026reshootanything,
  title={Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting},
  author={Paliwal, Avinash and Iyer, Adithya and Yadav, Shivin and Afridi, Muhammad Ali and Harikumar, Midhun},
  journal={arXiv preprint arXiv:2604.21776},
  year={2026}
}
```

---

## License

Model weights are released under the **Apache 2.0** license, consistent with the Wan2.2 base model.