+---
+license: mit
+datasets:
+- quanhaol/MagicData
+base_model:
+- quanhaol/Wan2.2-TI2V-5B-Turbo
+- Wan-AI/Wan2.2-TI2V-5B
+tags:
+- image-to-video
+- Trajectory-Control
+- Fewstep-video-gen
+---
+<br>
+<a href="https://arxiv.org/pdf/2603.12146"><img src="https://img.shields.io/static/v1?label=Paper&message=2603.12146&color=red&logo=arxiv"></a>
+<a href="https://quanhaol.github.io/flashmotion-site/"><img src="https://img.shields.io/static/v1?label=Project&message=Page&color=green&logo=github-pages"></a>
+<a href="https://huggingface.co/quanhaol/FlashMotion"><img src="https://img.shields.io/badge/🤗_HuggingFace-Model-ffbd45.svg" alt="HuggingFace"></a>
+<a href="https://huggingface.co/datasets/quanhaol/FlashBench"><img src="https://img.shields.io/badge/🤗_HuggingFace-Benchmark-ffbd45.svg" alt="HuggingFace"></a>
+> **FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance**
+> <br>
+> [Quanhao Li](https://github.com/quanhaol)<sup>1</sup>, [Zhen Xing](https://chenhsing.github.io/)<sup>1</sup>, [Rui Wang](https://scholar.google.com/citations?user=116smmsAAAAJ&hl=en)<sup>1</sup>, Haidong Cao<sup>1</sup>, [Qi Dai](https://daiqi1989.github.io/)<sup>2</sup>, Daoguo Dong<sup>1</sup> and [Zuxuan Wu](https://zxwu.azurewebsites.net/)<sup>1</sup>
+>
+> <sup>1</sup> Fudan University; <sup>2</sup> Microsoft Research Asia
+## 💡 Abstract
+Recent advances in trajectory-controllable video generation have achieved remarkable progress. Previous methods mainly use adapter-based architectures for precise motion control along predefined trajectories.
+However, all these methods rely on a multi-step denoising process, leading to substantial time redundancy and computational overhead.
+While existing video distillation methods successfully distill multi-step generators into few-step ones, directly applying them to trajectory-controllable video generation noticeably degrades both video quality and trajectory accuracy.
+To bridge this gap, we introduce **FlashMotion**, a novel training framework designed for few-step trajectory-controllable video generation.
+We first train a trajectory adapter on a multi-step video generator for precise trajectory control.
+Then, we distill the generator into a few-step version to accelerate video generation.
+Finally, we finetune the adapter using a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos.
+For evaluation, we introduce **FlashBench**, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects.
+Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency.
+## 📣 Updates
+- `2026/03/13` 🔥🔥 We released FlashMotion, including its training code, inference code, model weights, and evaluation benchmark.
+- `2026/02` 🔥🔥🔥 FlashMotion has been accepted to CVPR 2026!
+## 📑 Table of Contents
+- [💡 Abstract](#-abstract)
+- [📣 Updates](#-updates)
+- [📑 Table of Contents](#-table-of-contents)
+- [✅ TODO List](#-todo-list)
+- [🐍 Installation](#-installation)
+- [📦 Model Weights](#-model-weights)
+  - [Folder Structure](#folder-structure)
+  - [Download Links](#download-links)
+- [⛽️ Dataset Prepare](#️-dataset-prepare)
+- [🔄 Inference](#-inference)
+  - [Scripts](#scripts)
+- [🏎️ Train](#️-train)
+  - [SlowAdapter Training](#slowadapter-training)
+  - [FastGenerator Training](#fastgenerator-training)
+  - [FastAdapter Training](#fastadapter-training)
+- [🤝 Acknowledgements](#-acknowledgements)
+- [📚 Contact](#-contact)
+## ✅ TODO List
+- [x] Release our inference code and model weights
+- [x] Release our training code
+- [x] Release our evaluation benchmark
+## 🐍 Installation
+```bash
+# Clone this repository.
+git clone https://github.com/quanhaol/FlashMotion
+cd FlashMotion
+# Install requirements
+conda create -n flashmotion python=3.10 -y
+conda activate flashmotion
+pip install -r requirements.txt
+pip install flash-attn --no-build-isolation
+python setup.py develop
+```
+## 📦 Model Weights
+### Folder Structure
+```
+FlashMotion
+└── ckpts
+    ├── FastGenerator
+    │   └── model.pt
+    ├── SlowAdapter
+    │   ├── ResNet
+    │   │   └── model.pt
+    │   └── ControlNet
+    │       └── model.pt
+    └── FastAdapter
+        ├── ResNet
+        │   └── model.pt
+        └── ControlNet
+            └── model.pt
+```
+### Download Links
+Use the following commands to download the model weights:
+```bash
+pip install "huggingface_hub[hf_transfer]"
+HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download quanhaol/FlashMotion --local-dir ckpts
+HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir wan_models/Wan2.2-TI2V-5B
+```
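Once the download finishes, it can help to confirm that `ckpts` matches the folder structure listed above. The following is a minimal sketch, not part of the released code: `check_ckpts` is a hypothetical helper name, and the file list is simply copied from the tree in this README.

```bash
# Hypothetical helper (not shipped with FlashMotion): report every checkpoint
# file from the folder structure above that is missing under the given root.
check_ckpts() {
  local root="${1:-ckpts}"
  local missing=0
  local rel
  for rel in \
    FastGenerator/model.pt \
    SlowAdapter/ResNet/model.pt \
    SlowAdapter/ControlNet/model.pt \
    FastAdapter/ResNet/model.pt \
    FastAdapter/ControlNet/model.pt
  do
    if [ ! -f "$root/$rel" ]; then
      echo "missing: $root/$rel"
      missing=1
    fi
  done
  # Exit status is 0 only when every expected file was found.
  return "$missing"
}
```

For example, `check_ckpts ckpts && echo "all checkpoints found"` prints the confirmation only when all five files are present.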
+## ⛽️ Dataset Prepare
+All three training stages of FlashMotion use [MagicData](https://huggingface.co/datasets/quanhaol/MagicData), an open-source dataset built for trajectory-controllable video generation.
+Please follow [this README](https://huggingface.co/datasets/quanhaol/MagicData) to download and extract the data to a suitable path on your machine.
+The dataset is organized as follows:
+```
+MagicData
+├── videos
+│   ├── videoid_1.mp4
+│   ├── videoid_2.mp4
+│   ├── ...
+├── masks
+│   ├── videoid_1
+│   │   ├── annotated_frame_00000.png
+│   │   ├── annotated_frame_00001.png
+│   │   ├── ...
+│   ├── videoid_2
+│   │   ├── ...
+├── boxs
+│   ├── videoid_1
+│   │   ├── annotated_frame_00000.png
+│   │   ├── annotated_frame_00001.png
+│   │   ├── ...
+│   ├── videoid_2
+│   │   ├── ...
+├── MagicData.csv   # detailed information of each video
+```
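After extraction, a quick consistency check can catch incomplete downloads before training starts. The sketch below assumes, as the tree above suggests, that each `videos/<id>.mp4` has a matching `masks/<id>` and `boxs/<id>` directory; `check_magicdata` is a hypothetical helper name and is not part of the dataset tooling.

```bash
# Hypothetical helper: for every video under <root>/videos, report any
# missing per-video annotation directory under <root>/masks or <root>/boxs.
check_magicdata() {
  local root="$1"
  local bad=0
  local video
  local id
  local sub
  for video in "$root"/videos/*.mp4; do
    # If the glob matched nothing, the pattern itself is returned; skip it.
    [ -e "$video" ] || continue
    id="$(basename "$video" .mp4)"
    for sub in masks boxs; do
      if [ ! -d "$root/$sub/$id" ]; then
        echo "missing: $root/$sub/$id"
        bad=1
      fi
    done
  done
  # Exit status is 0 only when no annotation directory was missing.
  return "$bad"
}
```

Calling `check_magicdata /path/to/MagicData` lists the missing directories, one per line, and returns a non-zero status if any were found.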
+## 🔄 Inference
+Inference requires around 42 GiB of GPU memory with the ResNet FastAdapter and 50 GiB with the ControlNet FastAdapter; both figures were measured on a single NVIDIA A100 GPU.
+⚡️⚡️⚡️ Denoising a video takes only 11 seconds with the ResNet adapter and around 24 seconds with the ControlNet adapter.
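To check whether a GPU is likely to fit one of the adapters, the memory figures above can be compared with the total memory reported by `nvidia-smi`. This is a rough sketch under stated assumptions: `required_mib` and `has_enough_memory` are hypothetical helper names, the thresholds are just the approximate 42 GiB and 50 GiB figures converted to MiB, and total memory is only an upper bound on what is actually free.

```bash
# Hypothetical helpers: map an adapter type to the approximate memory
# figure quoted above, expressed in MiB (1 GiB = 1024 MiB).
required_mib() {
  case "$1" in
    resnet) echo $((42 * 1024)) ;;
    controlnet) echo $((50 * 1024)) ;;
    *) echo "unknown adapter: $1" >&2; return 1 ;;
  esac
}

# Succeeds when the given total memory (in MiB) meets the requirement.
# Usage: has_enough_memory <resnet|controlnet> <total_mib>
has_enough_memory() {
  local need
  need="$(required_mib "$1")" || return 1
  [ "$2" -ge "$need" ]
}
```

On a machine with an NVIDIA driver installed, the total memory of the first GPU in MiB can be read with `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1` and passed as the second argument.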
+### Scripts
+We provide demo scripts for both types of trajectory adapter.
+```bash
+# Demo inference script of each adapter type
+bash running_scripts/inference/i2v_control_fewstep_controlnet.sh
+bash running_scripts/inference/i2v_control_fewstep_resnet.sh
+```
+We also provide a sample input image and trajectory maps in `./assets`.
+Feel free to replace `--prompt`, `--image`, and `--trajectory` with your own prompt, input image, and trajectory maps.
+> **Note**: If you want to build your own trajectory maps, please refer to the box trajectory construction pipeline introduced in [MagicMotion](https://github.com/quanhaol/MagicMotion/tree/main/trajectory_construction#box-trajectory).
+## 🏎️ Train
+We provide scripts for all three training stages of FlashMotion: the SlowAdapter, the FastGenerator, and the FastAdapter.
+### SlowAdapter Training
+In this stage, we first train the SlowAdapter using the mask annotations in MagicData, and then finetune it using bounding boxes as the trajectory-map conditions.
+```bash
+# Demo training script of SlowAdapter
+bash running_scripts/train/stage1_mask.sh
+bash running_scripts/train/stage1_box.sh
+```
+### FastGenerator Training
+In this stage, we distill the Wan2.2-TI2V-5B model into a 4-step image-to-video generation model, named the FastGenerator.
+```bash
+# Demo training script of FastGenerator
+bash running_scripts/train/stage2.sh
+```
+### FastAdapter Training
+In this stage, we train the FastAdapter to fit the FastGenerator and enable few-step trajectory-controllable video generation.
+```bash
+# Demo training script of FastAdapter
+bash running_scripts/train/stage3.sh
+```
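The four demo scripts above can also be chained so that a failure in one stage stops the run before the next one starts. This is only a sketch: `run_stages` is a hypothetical wrapper, and it assumes each script already points at the correct data and checkpoint paths, which in practice may need to be edited between stages.

```bash
# Hypothetical wrapper: run the training scripts in the order described
# above and stop at the first one that exits with a non-zero status.
run_stages() {
  local dir="${1:-running_scripts/train}"
  local stage
  for stage in stage1_mask stage1_box stage2 stage3; do
    echo "==> $stage"
    bash "$dir/$stage.sh" || { echo "failed: $stage" >&2; return 1; }
  done
}
```

Running `run_stages` from the repository root uses the default `running_scripts/train` directory; a different directory can be passed as the first argument.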
+## 🤝 Acknowledgements
+We would like to express our gratitude to the following open-source projects that have been instrumental in the development of our project:
+- [Wan](https://github.com/Wan-Video/Wan2.2): An open-source base video generation model.
+- [Self-Forcing](https://github.com/guandeh17/Self-Forcing) and [CausVid](https://github.com/tianweiy/CausVid): Two frameworks that pioneered the distillation of video generation models.
+- [MagicMotion](https://github.com/quanhaol/MagicMotion): An open-source trajectory-controllable video generation framework.
+- [Wan2.2-TI2V-5B-Turbo](https://github.com/quanhaol/Wan2.2-TI2V-5B-Turbo): An open-source step-distillation framework that distills the Wan2.2-TI2V-5B image-to-video model into 4 steps.
+Special thanks to the contributors of these libraries for their hard work and dedication!
+## 📚 Contact
+If you have any suggestions or find our work helpful, feel free to contact us.
+Email: liqh24@m.fudan.edu.cn
+If you find our work useful, <b>please consider giving a star to this GitHub repository and citing it</b>:
+```bibtex
+@misc{li2026flashmotionfewstepcontrollablevideo,
+      title={FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance},
+      author={Quanhao Li and Zhen Xing and Rui Wang and Haidong Cao and Qi Dai and Daoguo Dong and Zuxuan Wu},
+      year={2026},
+      eprint={2603.12146},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2603.12146},
+}
+```