Jessie459 and nielsr (HF Staff) committed on
Commit 81aad25 · 1 Parent(s): 7fd9e56

Add model card for PoseGen (#1)


- Add model card for PoseGen (d7311ac7088f7dc3ec31f1fe8c22f927da35ee53)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)
1. README.md +77 -0
README.md ADDED
@@ -0,0 +1,77 @@
---
pipeline_tag: image-to-video
---

# PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation

[**Paper**](https://huggingface.co/papers/2508.05091) | [**Project Page**](https://jessie459.github.io/PoseGen-Page/) | [**GitHub**](https://github.com/Jessie459/PoseGen)

**PoseGen** is a novel framework that generates temporally coherent, long-duration human videos from a single reference image and a driving video. It uses an in-context LoRA finetuning design to preserve identity fidelity and a segment-interleaved generation strategy to maintain consistency across extended durations.

<p align="center">
  <img src="https://github.com/Jessie459/PoseGen/raw/main/assets/teaser.png" alt="PoseGen Teaser">
</p>

## Installation

```bash
conda create -n posegen python=3.10 -y
conda activate posegen

conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia -y
pip install diffsynth==1.1.7
pip install "wan@git+https://github.com/Wan-Video/Wan2.1"
pip install -r requirements.txt
```

## Inference

To run inference, you need to prepare a reference image and a driving video. The process consists of extracting conditions from the driving video, preparing a prompt, and generating the video in chunks.

### 1. Extract Conditions

Extract pose and hand conditions from the driving video:

```bash
python prepare_input_pose.py \
    --pose_path "results/video1/sapiens/pose.pkl" \
    --output_dir "results/video1/inputs" \
    --video_path "examples/video1.mp4"

python prepare_input_hand.py \
    --normal_path "results/video1/sapiens/normal.npy" \
    --seg_path "results/video1/sapiens/seg.npy" \
    --output_dir "results/video1/inputs" \
    --video_path "examples/video1.mp4"
```
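
The condition paths above follow a consistent per-video layout: Sapiens outputs under `results/<name>/sapiens/` and extracted conditions under `results/<name>/inputs`. A small hypothetical helper to derive these paths for any driving video; the layout is inferred from the example commands above, not documented by this card:

```python
from pathlib import Path

def condition_paths(video_name, results_root="results", examples_root="examples"):
    """Derive the paths used by prepare_input_pose.py / prepare_input_hand.py
    for one driving video. Layout inferred from the example commands."""
    root = Path(results_root) / video_name
    return {
        "video_path": str(Path(examples_root) / f"{video_name}.mp4"),
        "pose_path": str(root / "sapiens" / "pose.pkl"),
        "normal_path": str(root / "sapiens" / "normal.npy"),
        "seg_path": str(root / "sapiens" / "seg.npy"),
        "output_dir": str(root / "inputs"),
    }

# Reproduces the paths used in the commands above for "video1".
paths = condition_paths("video1")
print(paths["pose_path"])  # results/video1/sapiens/pose.pkl
```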

### 2. Generate Video Chunks

Generate the anchor base chunk, which stabilizes the background for the rest of the sequence:

```bash
python inference.py \
    --mode anch \
    --image_path "examples/image1.png" \
    --prompt_path "results/image1/prompt.txt" \
    --hand_path "results/video1/inputs/hand.mp4" \
    --pose_path "results/video1/inputs/pose.mp4" \
    --output_dir "results/generated" \
    --seed 42 \
    --anch_chunk_idx 0 \
    -s 0 2 \
    -b 34 40 \
    -p "4*10**9"
```
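
When scripting several chunks, the invocation above can be assembled programmatically. A minimal sketch: `build_inference_cmd` is a hypothetical wrapper, its defaults simply mirror the example, and only the flag names themselves are taken from this card (the semantics of `-s`, `-b`, and `-p` are not documented here):

```python
def build_inference_cmd(mode, image_path, prompt_path, hand_path, pose_path,
                        output_dir, seed=42, anch_chunk_idx=0,
                        s=(0, 2), b=(34, 40), p="4*10**9"):
    """Assemble the inference.py argument list shown above.
    Run the result with subprocess.run(cmd)."""
    return [
        "python", "inference.py",
        "--mode", mode,
        "--image_path", image_path,
        "--prompt_path", prompt_path,
        "--hand_path", hand_path,
        "--pose_path", pose_path,
        "--output_dir", output_dir,
        "--seed", str(seed),
        "--anch_chunk_idx", str(anch_chunk_idx),
        "-s", *map(str, s),
        "-b", *map(str, b),
        "-p", p,
    ]

# Rebuilds the exact anchor-chunk command from the example.
cmd = build_inference_cmd(
    "anch",
    "examples/image1.png",
    "results/image1/prompt.txt",
    "results/video1/inputs/hand.mp4",
    "results/video1/inputs/pose.mp4",
    "results/generated",
)
print(" ".join(cmd))
```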

Refer to the [GitHub README](https://github.com/Jessie459/PoseGen) for full details on generating subsequent base chunks and merging them.
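
The merging step itself is documented in the GitHub README. As a generic fallback, per-chunk videos can be concatenated with ffmpeg's concat demuxer; the sketch below only writes the list file, and the chunk file names are illustrative assumptions, not outputs documented by this card:

```python
from pathlib import Path

def write_concat_list(chunk_paths, list_path="concat.txt"):
    """Write an ffmpeg concat-demuxer list file, one `file '<path>'` line
    per chunk, in playback order. Chunk names here are illustrative."""
    lines = [f"file '{p}'" for p in chunk_paths]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines

chunks = [f"results/generated/chunk_{i:02d}.mp4" for i in range(3)]
lines = write_concat_list(chunks)
print(lines[0])  # file 'results/generated/chunk_00.mp4'
```

The resulting list can then be consumed with `ffmpeg -f concat -safe 0 -i concat.txt -c copy merged.mp4`, which copies streams without re-encoding.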

## Acknowledgement

This project is built upon [Sapiens](https://github.com/facebookresearch/sapiens), [Wan2.1](https://github.com/Wan-Video/Wan2.1), and [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio).

## Citation

```bibtex
@article{he2025posegen,
  title={PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation},
  author={He, Jingxuan and Su, Busheng and Wong, Finn},
  journal={arXiv preprint arXiv:2508.05091},
  year={2025}
}
```