Jessie459 and nielsr (HF Staff) committed on
Commit 81aad25 · 1 Parent(s): 7fd9e56

Add model card for PoseGen (#1)


- Add model card for PoseGen (d7311ac7088f7dc3ec31f1fe8c22f927da35ee53)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)
1. README.md +77 -0
README.md ADDED
@@ -0,0 +1,77 @@
---
pipeline_tag: image-to-video
---

# PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation

[**Paper**](https://huggingface.co/papers/2508.05091) | [**Project Page**](https://jessie459.github.io/PoseGen-Page/) | [**GitHub**](https://github.com/Jessie459/PoseGen)

**PoseGen** is a novel framework that generates temporally coherent, long-duration human videos from a single reference image and a driving video. It uses an in-context LoRA finetuning design to preserve identity fidelity and a segment-interleaved generation strategy to maintain consistency across extended durations.

<p align="center">
  <img src="https://github.com/Jessie459/PoseGen/raw/main/assets/teaser.png" alt="PoseGen Teaser">
</p>

## Installation

```bash
conda create -n posegen python=3.10 -y
conda activate posegen

conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia -y
pip install diffsynth==1.1.7
pip install "wan@git+https://github.com/Wan-Video/Wan2.1"
pip install -r requirements.txt
```

## Inference

To run inference, you need to prepare a reference image and a driving video. The process consists of extracting conditions from the driving video, preparing a prompt, and generating the video in chunks.

### 1. Extract Conditions

Extract pose and hand conditions from the driving video:

```bash
python prepare_input_pose.py \
    --pose_path "results/video1/sapiens/pose.pkl" \
    --output_dir "results/video1/inputs" \
    --video_path "examples/video1.mp4"

python prepare_input_hand.py \
    --normal_path "results/video1/sapiens/normal.npy" \
    --seg_path "results/video1/sapiens/seg.npy" \
    --output_dir "results/video1/inputs" \
    --video_path "examples/video1.mp4"
```
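
The condition paths above follow a consistent per-video layout: Sapiens outputs under `results/<name>/sapiens/` and extracted conditions under `results/<name>/inputs`. A small hypothetical helper to derive these paths for any driving video; the layout is inferred from the example commands above, not documented by this card:

```python
from pathlib import Path

def condition_paths(video_name, results_root="results", examples_root="examples"):
    """Derive the paths used by prepare_input_pose.py / prepare_input_hand.py
    for one driving video. Layout inferred from the example commands."""
    root = Path(results_root) / video_name
    return {
        "video_path": str(Path(examples_root) / f"{video_name}.mp4"),
        "pose_path": str(root / "sapiens" / "pose.pkl"),
        "normal_path": str(root / "sapiens" / "normal.npy"),
        "seg_path": str(root / "sapiens" / "seg.npy"),
        "output_dir": str(root / "inputs"),
    }

# Reproduces the paths used in the commands above for "video1".
paths = condition_paths("video1")
print(paths["pose_path"])  # results/video1/sapiens/pose.pkl
```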

### 2. Generate Video Chunks

Generate the anchor base chunk, which stabilizes the background for the rest of the sequence:

```bash
python inference.py \
    --mode anch \
    --image_path "examples/image1.png" \
    --prompt_path "results/image1/prompt.txt" \
    --hand_path "results/video1/inputs/hand.mp4" \
    --pose_path "results/video1/inputs/pose.mp4" \
    --output_dir "results/generated" \
    --seed 42 \
    --anch_chunk_idx 0 \
    -s 0 2 \
    -b 34 40 \
    -p "4*10**9"
```
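
When scripting several chunks, the invocation above can be assembled programmatically. A minimal sketch: `build_inference_cmd` is a hypothetical wrapper, its defaults simply mirror the example, and only the flag names themselves are taken from this card (the semantics of `-s`, `-b`, and `-p` are not documented here):

```python
def build_inference_cmd(mode, image_path, prompt_path, hand_path, pose_path,
                        output_dir, seed=42, anch_chunk_idx=0,
                        s=(0, 2), b=(34, 40), p="4*10**9"):
    """Assemble the inference.py argument list shown above.
    Run the result with subprocess.run(cmd)."""
    return [
        "python", "inference.py",
        "--mode", mode,
        "--image_path", image_path,
        "--prompt_path", prompt_path,
        "--hand_path", hand_path,
        "--pose_path", pose_path,
        "--output_dir", output_dir,
        "--seed", str(seed),
        "--anch_chunk_idx", str(anch_chunk_idx),
        "-s", *map(str, s),
        "-b", *map(str, b),
        "-p", p,
    ]

# Rebuilds the exact anchor-chunk command from the example.
cmd = build_inference_cmd(
    "anch",
    "examples/image1.png",
    "results/image1/prompt.txt",
    "results/video1/inputs/hand.mp4",
    "results/video1/inputs/pose.mp4",
    "results/generated",
)
print(" ".join(cmd))
```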

Refer to the [GitHub README](https://github.com/Jessie459/PoseGen) for full details on generating subsequent base chunks and merging them.
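
The merging step itself is documented in the GitHub README. As a generic fallback, per-chunk videos can be concatenated with ffmpeg's concat demuxer; the sketch below only writes the list file, and the chunk file names are illustrative assumptions, not outputs documented by this card:

```python
from pathlib import Path

def write_concat_list(chunk_paths, list_path="concat.txt"):
    """Write an ffmpeg concat-demuxer list file, one `file '<path>'` line
    per chunk, in playback order. Chunk names here are illustrative."""
    lines = [f"file '{p}'" for p in chunk_paths]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines

chunks = [f"results/generated/chunk_{i:02d}.mp4" for i in range(3)]
lines = write_concat_list(chunks)
print(lines[0])  # file 'results/generated/chunk_00.mp4'
```

The resulting list can then be consumed with `ffmpeg -f concat -safe 0 -i concat.txt -c copy merged.mp4`, which copies streams without re-encoding.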

## Acknowledgement

This project is built upon [Sapiens](https://github.com/facebookresearch/sapiens), [Wan2.1](https://github.com/Wan-Video/Wan2.1), and [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio).

## Citation

```bibtex
@article{he2025posegen,
  title={PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation},
  author={He, Jingxuan and Su, Busheng and Wong, Finn},
  journal={arXiv preprint arXiv:2508.05091},
  year={2025}
}
```