---
license: other
license_name: sapiens2-license
license_link: https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md
pipeline_tag: image-feature-extraction
library_name: sapiens
tags:
- sapiens
- sapiens2
- vision-transformer
- human-centric
- pretrained-backbone
- feature-extraction
---

# Sapiens2-5B

Sapiens2 is a family of high-resolution vision transformers pretrained on **1 billion human images**, designed for human-centric tasks such as pose estimation, body-part segmentation, surface normals, and pointmaps.

This repository contains the **5B-parameter pretrained backbone**. It produces dense per-patch features suitable for fine-tuning downstream task heads.

- 📄 **Paper:** [OpenReview (ICLR 2026)](https://openreview.net/pdf?id=IVAlYCqdvW)
- 🌐 **Project Page:** [rawalkhirodkar.github.io/sapiens2](https://rawalkhirodkar.github.io/sapiens2)
- 💻 **Code:** [github.com/facebookresearch/sapiens2](https://github.com/facebookresearch/sapiens2)

## Model Details

- **Developed by:** Meta
- **Model type:** Vision Transformer (RoPE, GQA, SwiGLU, RMSNorm, QK-norm)
- **License:** [Sapiens2 License](https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md)
- **Task:** pretrain
- **Format:** safetensors
- **File:** `sapiens2_5b_pretrain.safetensors`

## Quick Start

Install the [Sapiens2 repo](https://github.com/facebookresearch/sapiens2) (`pip install -e .`).

```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from sapiens.backbones.standalone.sapiens2 import Sapiens2

# Build the model and load the pretrained checkpoint
model = Sapiens2(arch="sapiens2_5b", img_size=(1024, 768), patch_size=16).eval().cuda()  # img_size is (H, W)
ckpt_path = hf_hub_download(repo_id="facebook/sapiens2-pretrain-5b", filename="sapiens2_5b_pretrain.safetensors")
model.load_state_dict(load_file(ckpt_path))

# Forward pass on a single image (RGB; ImageNet normalization recommended)
x = torch.randn(1, 3, 1024, 768).cuda()
with torch.no_grad():
    features = model(x)[0]  # dense backbone features: (B, num_tokens, embed_dim)
```

## Model Card

| Field | Value |
|-------|-------|
| Architecture | Sapiens2 ViT (RoPE, GQA, SwiGLU, RMSNorm, QK-norm) |
| Parameters | 5.071 B |
| FLOPs | 15.722 T |
| Embedding dim | 2432 |
| Layers | 56 |
| Attention heads | 32 |
| Pretraining resolution | 1024 × 768 (H × W) |
| Patch size | 16 |
| Pretraining data | 1B human images |

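From the table, a 1024 × 768 input at patch size 16 yields a 64 × 48 patch grid. A quick sanity check of the expected feature shape (this counts patch tokens only; whether the backbone prepends extra class or register tokens is not stated here):

```python
# Expected patch-token count for the 5B backbone (patch tokens only)
H, W, patch_size = 1024, 768, 16
num_patches = (H // patch_size) * (W // patch_size)  # 64 * 48
embed_dim = 2432

print(num_patches)                   # 3072
print((1, num_patches, embed_dim))   # expected dense feature shape (B, num_tokens, embed_dim)
```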
### Sapiens2 Family

| Model | Params | FLOPs | Embed dim | Layers | Heads |
|-------|--------|-------|-----------|--------|-------|
| [Sapiens2-0.1B](https://huggingface.co/facebook/sapiens2-pretrain-0.1b) | 0.114 B | 0.342 T | 768 | 12 | 12 |
| [Sapiens2-0.4B](https://huggingface.co/facebook/sapiens2-pretrain-0.4b) | 0.398 B | 1.260 T | 1024 | 24 | 16 |
| [Sapiens2-0.8B](https://huggingface.co/facebook/sapiens2-pretrain-0.8b) | 0.818 B | 2.592 T | 1280 | 32 | 16 |
| [Sapiens2-1B](https://huggingface.co/facebook/sapiens2-pretrain-1b) | 1.462 B | 4.715 T | 1536 | 40 | 24 |
| **Sapiens2-5B** *(this model)* | 5.071 B | 15.722 T | 2432 | 56 | 32 |

See the [Sapiens2 Collection](https://huggingface.co/collections/facebook/sapiens2) for all variants and downstream task checkpoints (pose, segmentation, normals, pointmaps).

## Intended Use

- Feature extraction for human-centric downstream tasks
- Initialization for fine-tuning task heads (pose, segmentation, normals, pointmaps)
- Research on human-centric vision

## License

Released under the [Sapiens2 License](https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md).

## Citation

```bibtex
@inproceedings{khirodkar2026sapiens2,
  title={Sapiens2},
  author={Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan and Zhaoen, Su and Saito, Shunsuke},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```