---
license: apache-2.0
language:
- en
tags:
- video-generation
- video-editing
- novel-view-synthesis
- camera-control
- diffusion-transformer
- wan2.2
- lora
- video-reshooting
- 4d-reconstruction
pipeline_tag: video-to-video
base_model: Wan-AI/Wan2.2-I2V-A14B
---

# Reshoot-Anything

[Code](https://github.com/morphicfilms/video-to-video) · [Paper](https://arxiv.org/abs/2604.21776) · [Project Page](https://adithyaiyer1999.github.io/reshoot-anything/)

**Reshoot-Anything** is a self-supervised video reshooting model built on top of [Wan2.2-I2V-A14B](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B). Given a source video and a target camera trajectory (encoded as an anchor video), it generates a high-fidelity reshoot that faithfully follows the new camera path while preserving the original content, complex dynamics, and temporal consistency. The model is trained entirely on in-the-wild monocular videos.

> **Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting**
> Avinash Paliwal, Adithya Iyer, Shivin Yadav, Muhammad Ali Afridi, Midhun Harikumar
> Morphic Inc. · [arXiv:2604.21776](https://arxiv.org/abs/2604.21776)

---

<table>
<tr>
<td align="center"><b>Source Video</b></td>
<td align="center"><b>Reshot Video</b></td>
</tr>
<tr>
<td><img src="https://github.com/morphicfilms/video-to-video/blob/main/assets/woman_og.gif?raw=true" width="400" alt="Source video"></td>
<td><img src="https://github.com/morphicfilms/video-to-video/blob/main/assets/woman_01.gif?raw=true" width="400" alt="Reshot video"></td>
</tr>
</table>

## Model Files

This repository contains two LoRA checkpoints (rank 512, applied to the attention and feed-forward layers of Wan2.2-I2V-A14B):

| File | Role | Notes |
|------|------|-------|
| `jan06_scaling_80k_ckpt1400.safetensors` | **High-noise expert** | Controls early denoising steps. Primarily responsible for camera motion alignment and global scene structure. Trained on ~80k clips with scaling augmentations + 15% synthetic data mixture. |
| `dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors` | **Low-noise expert** | Controls late denoising steps. Responsible for texture fidelity and fine detail. Uses standard black-background anchors, no source reconstruction loss. |

Each file is ~9.82 GB.
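
The two checkpoints mirror Wan2.2's high-noise/low-noise expert split over the denoising schedule. A minimal routing sketch follows; the boundary value is a hypothetical placeholder, and the authoritative expert-switching and weight-merging logic lives in the repo's `generate.py`:

```python
# Minimal sketch of expert routing across the denoising schedule.
# BOUNDARY is hypothetical; see generate.py for the real switching logic.
from safetensors.torch import load_file

high_noise_lora = load_file("reshoot-anything-weights/jan06_scaling_80k_ckpt1400.safetensors")
low_noise_lora = load_file("reshoot-anything-weights/dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors")

BOUNDARY = 0.9  # hypothetical handoff point on a [0, 1] noise scale

def pick_expert(noise_level: float) -> dict:
    """Early (high-noise) steps set structure; late steps refine texture."""
    return high_noise_lora if noise_level >= BOUNDARY else low_noise_lora
```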

---

## Quickstart

### 1. Clone the repository

```bash
git clone https://github.com/morphicfilms/video-to-video.git
cd video-to-video
```

Follow the [Wan2.2 installation guide](https://github.com/Wan-Video/Wan2.2) to set up the environment, or run:

```bash
bash setup_env.sh
```

### 2. Download the weights

Download the Wan2.2 I2V base weights:

```bash
huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir ./Wan2.2-I2V-A14B
```

Download the Reshoot-Anything LoRA weights:

```bash
huggingface-cli download morphic/reshoot-anything --local-dir ./reshoot-anything-weights
```
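
If you prefer scripting the downloads, the same two repositories can be fetched with `huggingface_hub`, equivalent to the CLI calls above:

```python
# Python equivalent of the two huggingface-cli downloads above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Wan-AI/Wan2.2-I2V-A14B", local_dir="./Wan2.2-I2V-A14B")
snapshot_download(repo_id="morphic/reshoot-anything", local_dir="./reshoot-anything-weights")
```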

### 3. Prepare your anchor video

At inference, generate an anchor video by converting your source video to a 4D point cloud, applying the target camera trajectory, and forward-warping to produce the geometric anchor. See the repo's `anchor_generation/` scripts for details.
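
To make the forward-warping step concrete, here is an illustrative single-frame sketch, not the repo's `anchor_generation/` code: unproject pixels with a depth map, move the camera, and re-project, leaving disoccluded regions black as in the anchors. `K`, `depth`, and the relative pose are assumed inputs, and the naive splat ignores z-ordering:

```python
# Illustrative forward-warp of one frame into a new camera view.
import numpy as np

def forward_warp(frame, depth, K, T_src_to_tgt):
    """frame: (H, W, 3) uint8, depth: (H, W) metric depth,
    K: (3, 3) intrinsics, T_src_to_tgt: (4, 4) relative camera pose."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Unproject every pixel to a camera-space 3D point.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)      # (3, H*W)
    # Move the points into the target camera's frame.
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])
    pts_tgt = (T_src_to_tgt @ pts_h)[:3]
    # Project and splat (naive: no z-buffer); holes stay black.
    proj = K @ pts_tgt
    uu = np.round(proj[0] / proj[2]).astype(int)
    vv = np.round(proj[1] / proj[2]).astype(int)
    ok = (proj[2] > 0) & (uu >= 0) & (uu < W) & (vv >= 0) & (vv < H)
    out = np.zeros_like(frame)
    out[vv[ok], uu[ok]] = frame.reshape(-1, 3)[ok]
    return out
```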

### 4. Run reshooting

```bash
torchrun --nproc_per_node=8 generate.py \
    --task v2v-A14B \
    --size 1280*720 \
    --frame_num 81 \
    --ckpt_dir ./Wan2.2-I2V-A14B \
    --high_noise_lora_path ./reshoot-anything-weights/jan06_scaling_80k_ckpt1400.safetensors \
    --low_noise_lora_path ./reshoot-anything-weights/dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors \
    --source_video examples/source.mp4 \
    --anchor_video examples/anchor.mp4 \
    --dit_fsdp \
    --t5_fsdp \
    --ulysses_size 8
```

> **Note:** Refer to the [GitHub README](https://github.com/morphicfilms/video-to-video) for the authoritative argument names and single-GPU usage.

---

## How It Works

Reshoot-Anything adapts the **Wan2.2-14B Mixture-of-Experts (MoE)** DiT with two key architectural changes:

**Dual-stream token conditioning.** Both the anchor video `V_a` (geometric guide) and the source video `V_s` (texture reference) are VAE-encoded and temporally concatenated as tokens into the model's main self-attention sequence. This outperforms cross-attention for view synchronization because it lets the model directly route textures across spatial and temporal positions, as in the sketch below.
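
A conceptual sketch of that conditioning, with shapes flattened to `(batch, tokens, dim)` for readability; the real DiT operates on patchified 3D video latents, and the concatenation order here is an assumption:

```python
# One self-attention sequence over all three streams, so target tokens can
# attend directly to anchor geometry and source texture.
import torch

def build_token_sequence(z_noisy, z_anchor, z_source):
    # z_*: (batch, tokens, dim) VAE-latent tokens; order is illustrative.
    return torch.cat([z_noisy, z_anchor, z_source], dim=1)
```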

**Offset RoPE.** A fixed temporal offset of 50 is added to the positional embeddings of source video tokens, strictly decoupling the source context from the active denoising trajectory.
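
In position-id terms the offset amounts to the following; frame counts are illustrative, and how this folds into Wan2.2's RoPE implementation is a detail of the repo:

```python
# Source tokens live at temporal positions 50 and above, so they can never
# collide with the positions of the frames currently being denoised.
import torch

SOURCE_TEMPORAL_OFFSET = 50

def temporal_positions(num_denoise_frames: int, num_source_frames: int):
    denoise_pos = torch.arange(num_denoise_frames)
    source_pos = torch.arange(num_source_frames) + SOURCE_TEMPORAL_OFFSET
    return denoise_pos, source_pos
```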

The model learns **implicit 4D spatiotemporal routing**: when a target frame requires content that is occluded in the corresponding source frame, the model locates and re-projects the missing texture from a different timestep of the source video.

### Self-Supervised Training Pipeline

Training requires no paired multi-view data. From a single monocular video:
1. Two independent smooth random-walk crop trajectories are sampled (see the sketch after this list), yielding a source video `V_s` and a target video `V_t`
2. `V_s[0]` is forward-warped via [AllTracker](https://arxiv.org/abs/2504.11111) dense flow plus the crop offset to produce the anchor `V_a`
3. The triplet `(V_s, V_a, V_t)` forms the training signal
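
A toy version of the trajectory sampling in step 1; the step scale and smoothing window are assumptions rather than the paper's values:

```python
# Smoothed random walk of crop origins over a video; two independent calls
# yield the source (V_s) and target (V_t) crop trajectories.
import numpy as np

def random_walk_crops(num_frames, frame_hw, crop_hw, step=8.0, smooth=9, seed=0):
    rng = np.random.default_rng(seed)
    H, W = frame_hw
    h, w = crop_hw
    path = np.cumsum(rng.normal(scale=step, size=(num_frames, 2)), axis=0)
    # Moving-average smoothing keeps the virtual camera motion gentle.
    kernel = np.ones(smooth) / smooth
    path = np.stack([np.convolve(path[:, i], kernel, mode="same") for i in (0, 1)], axis=1)
    center = np.array([(H - h) / 2, (W - w) / 2])
    return np.clip(path + center, 0, [H - h, W - w]).astype(int)  # (num_frames, 2)
```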

A **hybrid dataset strategy** augments the monocular pipeline with a 15% mixture of paired synthetic data from [ReCamMaster](https://github.com/KwaiVGI/ReCamMaster), enabling generalization to extreme (120°+) orbital camera trajectories.

---

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | Wan2.2-I2V-A14B (14B MoE) |
| LoRA rank | 512 (attention + FFN) |
| Training steps | 2,000 per expert |
| Batch size | 24 |
| Learning rate | 1e-5 |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.999) |
| Loss | MSE + 0.1 × L1 source reconstruction |
| Latent frames | 20 |
| Primary data | ~100k clips from 30k monocular videos |
| Synthetic mixture | 15% ReCamMaster paired clips |

**Key augmentations:** 3D-aware noise injection into the anchor reference frame (magnitude uniform in [0, 0.5]), fluorescent pink masked-region backgrounds, random anchor reference frame selection, and an auxiliary reconstruction loss on source tokens.
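
The objective in the table reduces to a weighted sum. A sketch, assuming the auxiliary term is a plain L1 over the model's reconstruction of the source tokens:

```python
# Diffusion MSE on the target plus 0.1-weighted L1 source reconstruction.
import torch.nn.functional as F

def training_loss(pred_target, gt_target, pred_source, gt_source):
    return F.mse_loss(pred_target, gt_target) + 0.1 * F.l1_loss(pred_source, gt_source)
```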

---

## Evaluation

Results on 100 five-second videos from [OpenSora-MixKit](https://huggingface.co/datasets/opensora/OpenSora-MixKit) (16 fps, 480p). Arrows indicate whether higher (↑) or lower (↓) is better:

| Method | CLIP-F ↑ | RotErr ↓ | TransErr ↓ | Mat. Pix ↑ | FVD-V ↓ | CLIP-V ↑ |
|--------|----------|----------|------------|------------|---------|----------|
| ReCamMaster | 98.49 | 11.29 | 19.59 | 1314.00 | 732.52 | 88.91 |
| EX-4D | 98.94 | 3.94 | 4.21 | 2188.98 | 685.63 | 89.77 |
| TrajectoryCrafter (49f) | 98.80 | 2.26 | 3.03 | 1851.80 | 582.56 | 92.40 |
| **Ours** | **99.03** | 2.76 | 4.23 | **2720.83** | **586.24** | **93.16** |
| **Ours (49f)** | **99.01** | **2.61** | **2.73** | **2737.65** | **488.22** | **94.96** |

---

## Citation

```bibtex
@article{paliwal2026reshootanything,
  title={Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting},
  author={Paliwal, Avinash and Iyer, Adithya and Yadav, Shivin and Afridi, Muhammad Ali and Harikumar, Midhun},
  journal={arXiv preprint arXiv:2604.21776},
  year={2026}
}
```

---

## License

Model weights are released under the **Apache 2.0** license, consistent with the Wan2.2 base model.