---
pipeline_tag: image-to-video
---

# PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation

[**Paper**](https://huggingface.co/papers/2508.05091) | [**Project Page**](https://jessie459.github.io/PoseGen-Page/) | [**GitHub**](https://github.com/Jessie459/PoseGen)

**PoseGen** is a novel framework that generates temporally coherent, long-duration human videos from a single reference image and a driving video. It uses an in-context LoRA finetuning design to preserve identity fidelity and a segment-interleaved generation strategy to maintain consistency across extended durations.

PoseGen Teaser

## Installation

```bash
conda create -n posegen python=3.10 -y
conda activate posegen
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia -y
pip install diffsynth==1.1.7
pip install wan@git+https://github.com/Wan-Video/Wan2.1
pip install -r requirements.txt
```

## Inference

To run inference, prepare a reference image and a driving video. The process consists of extracting conditions, preparing a prompt, and generating the video chunks.

### 1. Extract Conditions

Extract pose and hand conditions from the driving video:

```bash
python prepare_input_pose.py \
    --pose_path "results/video1/sapiens/pose.pkl" \
    --output_dir "results/video1/inputs" \
    --video_path "examples/video1.mp4"

python prepare_input_hand.py \
    --normal_path "results/video1/sapiens/normal.npy" \
    --seg_path "results/video1/sapiens/seg.npy" \
    --output_dir "results/video1/inputs" \
    --video_path "examples/video1.mp4"
```

### 2. Generate Video Chunks

Generate the anchor base chunk, which stabilizes the background for the rest of the sequence:

```bash
python inference.py \
    --mode anch \
    --image_path "examples/image1.png" \
    --prompt_path "results/image1/prompt.txt" \
    --hand_path "results/video1/inputs/hand.mp4" \
    --pose_path "results/video1/inputs/pose.mp4" \
    --output_dir "results/generated" \
    --seed 42 \
    --anch_chunk_idx 0 \
    -s 0 2 \
    -b 34 40 \
    -p "4*10**9"
```

Refer to the [GitHub README](https://github.com/Jessie459/PoseGen) for full details on generating subsequent base chunks and merging them.

## Acknowledgement

This project is built upon [Sapiens](https://github.com/facebookresearch/sapiens), [Wan2.1](https://github.com/Wan-Video/Wan2.1), and [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio).
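As a side note on chunked generation: the repository ships its own scripts for stitching the generated base chunks into one long video, so the sketch below is only an illustration of the general idea, not PoseGen's actual method. It cross-fades the overlapping frames of consecutive chunks with linear weights; the function name `blend_chunks` and the overlap length are hypothetical choices for this example.

```python
import numpy as np

def blend_chunks(chunks, overlap):
    """Concatenate video chunks, cross-fading linearly over `overlap` frames.

    chunks: list of arrays shaped (num_frames, H, W, C), values in [0, 1].
    """
    out = chunks[0].astype(np.float64)
    for nxt in chunks[1:]:
        nxt = nxt.astype(np.float64)
        # Weights ramp 1 -> 0 for the tail of `out` and 0 -> 1
        # for the head of `nxt` across the overlapping region.
        w = np.linspace(1.0, 0.0, overlap)[:, None, None, None]
        blended = w * out[-overlap:] + (1.0 - w) * nxt[:overlap]
        out = np.concatenate([out[:-overlap], blended, nxt[overlap:]], axis=0)
    return out

# Two 8-frame constant-color "videos" overlapping by 4 frames:
# the merged result has 8 + 8 - 4 = 12 frames.
a = np.zeros((8, 4, 4, 3))
b = np.ones((8, 4, 4, 3))
video = blend_chunks([a, b], overlap=4)
print(video.shape)  # (12, 4, 4, 3)
```

A linear cross-fade like this hides seams between chunks but cannot fix divergent backgrounds, which is why PoseGen instead anchors every chunk to a shared base chunk.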
## Citation

```bibtex
@article{he2025posegen,
  title={PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation},
  author={He, Jingxuan and Su, Busheng and Wong, Finn},
  journal={arXiv preprint arXiv:2508.05091},
  year={2025}
}
```