---
license: apache-2.0
pipeline_tag: image-to-video
tags:
- video-editing
- diffusion
- wan
---
## Overview
NOVA is a pair-free video editing model built on WAN 1.3B Fun InP. It uses sparse keyframe control (e.g., a single edited first frame) to guide dense video synthesis, trained without requiring paired before/after video data.
- Pair-free training via degradation simulation
- Sparse keyframe control: provide one or more edited keyframes
- Optional coarse mask for improved editing accuracy
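Pair-free training means the model never sees real before/after editing pairs; instead, a pseudo "source" video is synthesized from a single clip by degrading it. The exact degradations NOVA uses are described in the paper; the snippet below is only an illustrative sketch, and the specific operations (box blur plus Gaussian noise) and their parameters are assumptions, not the released pipeline:

```python
import numpy as np

def degrade_frame(frame: np.ndarray, rng: np.random.Generator,
                  noise_std: float = 10.0) -> np.ndarray:
    """Simulate a degraded 'source' frame from a clean target frame.

    A crude stand-in for degradation simulation: 3x3 box blur plus
    Gaussian noise. NOVA's actual degradations are not shown here.
    """
    # 3x3 box blur via shifted averages (wrap-around padding for brevity).
    blurred = sum(
        np.roll(frame.astype(np.float32), (dy, dx), axis=(0, 1))
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
    ) / 9.0
    noisy = blurred + rng.normal(0.0, noise_std, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
degraded = degrade_frame(clean, rng)
```

The degraded clip plays the role of the "original" input during training, while the clean clip serves as the editing target.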
The framework consists of a sparse branch, which provides semantic guidance from the user-edited keyframes, and a dense branch, which incorporates motion and texture information from the original video to maintain fidelity and temporal coherence.
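The two-branch idea can be pictured as two per-frame feature streams merged before synthesis. The toy sketch below is purely conceptual: the per-frame convex weighting, the `fuse_branches` helper, and the feature shapes are invented for illustration and are not NOVA's actual architecture:

```python
import numpy as np

def fuse_branches(sparse_feats: np.ndarray, dense_feats: np.ndarray,
                  keyframe_mask: np.ndarray) -> np.ndarray:
    """Toy fusion of sparse (edited-keyframe) and dense (original-video)
    features: frames with a user-provided keyframe lean on the sparse
    branch; the rest lean on dense motion/texture features.

    sparse_feats, dense_feats: (num_frames, dim)
    keyframe_mask: (num_frames,), 1.0 where an edited keyframe exists.
    """
    w = keyframe_mask[:, None]  # broadcast per-frame weight over feature dim
    return w * sparse_feats + (1.0 - w) * dense_feats

T, D = 81, 16
sparse = np.ones((T, D))           # stand-in semantic guidance
dense = np.zeros((T, D))           # stand-in motion/texture features
mask = np.zeros(T); mask[0] = 1.0  # only the first frame is edited
fused = fuse_branches(sparse, dense, mask)
```

In the sketch, only the edited first frame draws on the sparse branch; all remaining frames fall back to dense features from the original video.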
## Usage
For full installation and training instructions, please visit the GitHub repository.
### Inference via CLI

You can run inference with the `infer_nova.py` script. Below is an example for single-GPU inference:
```shell
python infer_nova.py \
  --dataset_path ./example_videos \
  --metadata_file_name metadata.csv \
  --ckpt_path /path/to/checkpoints/stepXXX.ckpt \
  --output_path ./inference_results \
  --text_encoder_path /path/to/models_t5_umt5-xxl-enc-bf16.pth \
  --image_encoder_path /path/to/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth \
  --vae_path /path/to/Wan2.1_VAE.pth \
  --dit_path /path/to/diffusion_pytorch_model.safetensors \
  --num_samples 5 \
  --num_inference_steps 50 \
  --num_frames 81 \
  --height 480 \
  --width 832 \
  --first_only
```
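The script reads sample definitions from the CSV named by `--metadata_file_name` under `--dataset_path`. The snippet below shows one way such a file could be built; the column names (`video_path`, `keyframe_path`, `prompt`) are hypothetical placeholders, so consult the GitHub repository for the actual schema:

```python
import csv
import io

# Hypothetical metadata.csv layout; the real column names are defined
# in the NOVA repository, not here.
rows = [
    {"video_path": "videos/cat.mp4",
     "keyframe_path": "keyframes/cat_edited_frame0.png",
     "prompt": "a cat wearing a red hat"},
]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["video_path", "keyframe_path", "prompt"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```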
## Citation

```bibtex
@article{pan2026nova,
  title={NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing},
  author={Tianlin Pan and Jiayi Dai and Chenpu Yuan and Zhengyao Lv and Binxin Yang and Hubery Yin and Chen Li and Jing Lyu and Caifeng Shan and Chenyang Si},
  journal={arXiv preprint arXiv:2603.02802},
  year={2026}
}
```
