# PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation

Paper | Project Page | GitHub

PoseGen is a novel framework that generates temporally coherent, long-duration human videos from a single reference image and a driving video. It utilizes an in-context LoRA finetuning design to preserve identity fidelity and a segment-interleaved generation strategy to maintain consistency across extended durations.

PoseGen Teaser

## Installation

```bash
conda create -n posegen python=3.10 -y
conda activate posegen

conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia -y
pip install diffsynth==1.1.7
pip install wan@git+https://github.com/Wan-Video/Wan2.1
pip install -r requirements.txt
```

## Inference

To run inference, prepare a reference image and a driving video. The pipeline has three stages: extracting conditioning signals from the driving video, preparing a text prompt, and generating the video chunk by chunk.

### 1. Extract Conditions

Extract pose and hand conditions from the driving video:

```bash
python prepare_input_pose.py \
    --pose_path "results/video1/sapiens/pose.pkl" \
    --output_dir "results/video1/inputs" \
    --video_path "examples/video1.mp4"

python prepare_input_hand.py \
    --normal_path "results/video1/sapiens/normal.npy" \
    --seg_path "results/video1/sapiens/seg.npy" \
    --output_dir "results/video1/inputs" \
    --video_path "examples/video1.mp4"
```
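If you have several driving videos, the two extraction commands can be built and dispatched from a small script. This is a sketch only: the `extraction_commands` helper is hypothetical, and it assumes your outputs follow the `results/<video-name>/` layout used in the commands above.

```python
from pathlib import Path

def extraction_commands(video_path: str, results_root: str = "results"):
    """Hypothetical helper: build the pose and hand extraction commands
    for one driving video, following the results/<name>/ directory layout."""
    name = Path(video_path).stem
    sapiens = f"{results_root}/{name}/sapiens"
    inputs = f"{results_root}/{name}/inputs"
    pose_cmd = [
        "python", "prepare_input_pose.py",
        "--pose_path", f"{sapiens}/pose.pkl",
        "--output_dir", inputs,
        "--video_path", video_path,
    ]
    hand_cmd = [
        "python", "prepare_input_hand.py",
        "--normal_path", f"{sapiens}/normal.npy",
        "--seg_path", f"{sapiens}/seg.npy",
        "--output_dir", inputs,
        "--video_path", video_path,
    ]
    return pose_cmd, hand_cmd

# Example usage: run both steps for each driving video.
# import subprocess
# for video in ["examples/video1.mp4", "examples/video2.mp4"]:
#     for cmd in extraction_commands(video):
#         subprocess.run(cmd, check=True)
```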

### 2. Generate Video Chunks

Generate the anchor base chunk (which stabilizes the background for the rest of the sequence):

```bash
python inference.py \
    --mode anch \
    --image_path "examples/image1.png" \
    --prompt_path "results/image1/prompt.txt" \
    --hand_path "results/video1/inputs/hand.mp4" \
    --pose_path "results/video1/inputs/pose.mp4" \
    --output_dir "results/generated" \
    --seed 42 \
    --anch_chunk_idx 0 \
    -s 0 2 \
    -b 34 40 \
    -p "4*10**9"
```
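The `-p` flag takes a Python-style arithmetic expression (here `4*10**9`, roughly a 4 GB parameter budget). As a sketch of how such a string can be evaluated without calling `eval` on arbitrary input, here is a minimal arithmetic parser built on the standard-library `ast` module; the `parse_budget` name is an assumption for illustration, not part of the repository.

```python
import ast
import operator

# Binary operators permitted in the arithmetic mini-language.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Pow: operator.pow,
}

def parse_budget(expr: str) -> int:
    """Safely evaluate an integer arithmetic expression like '4*10**9'."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError(f"unsupported expression: {expr!r}")
    return _eval(ast.parse(expr, mode="eval").body)
```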

Refer to the GitHub README for full details on generating subsequent base chunks and merging them.
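To give a feel for how a long sequence is split into chunks, below is a sketch of an overlapping sliding-window split, one common approach to chunked video generation. The chunk length and overlap here are illustrative assumptions, and the paper's segment-interleaved strategy differs in detail; see the GitHub README for the actual scheme.

```python
def plan_chunks(num_frames: int, chunk_len: int = 81, overlap: int = 16):
    """Split the frame range [0, num_frames) into overlapping segments.

    Each chunk shares `overlap` frames with its predecessor so that
    consecutive generations can be blended for temporal consistency.
    """
    chunks, start = [], 0
    stride = chunk_len - overlap
    while start < num_frames:
        end = min(start + chunk_len, num_frames)
        chunks.append((start, end))
        if end == num_frames:
            break
        start += stride
    return chunks
```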

## Acknowledgement

This project is built upon Sapiens, Wan2.1, and DiffSynth-Studio.

## Citation

```bibtex
@article{he2025posegen,
  title={PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation},
  author={He, Jingxuan and Su, Busheng and Wong, Finn},
  journal={arXiv preprint arXiv:2508.05091},
  year={2025}
}
```