---
license: other
license_name: sapiens2-license
license_link: https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md
pipeline_tag: image-feature-extraction
library_name: sapiens
tags:
- sapiens
- sapiens2
- vision-transformer
- human-centric
- pretrained-backbone
- feature-extraction
---

# Sapiens2-5B

Sapiens2 is a family of high-resolution vision transformers pretrained on **1 billion human images**, designed for human-centric tasks such as pose estimation, body-part segmentation, surface normals, and pointmaps.

This repository contains the **5B-parameter pretrained backbone**. It produces dense per-patch features suitable for fine-tuning downstream task heads.

- 📄 **Paper:** [OpenReview (ICLR 2026)](https://openreview.net/pdf?id=IVAlYCqdvW)
- 🌐 **Project Page:** [rawalkhirodkar.github.io/sapiens2](https://rawalkhirodkar.github.io/sapiens2)
- 💻 **Code:** [github.com/facebookresearch/sapiens2](https://github.com/facebookresearch/sapiens2)

## Model Details

- **Developed by:** Meta
- **Model type:** Vision Transformer (RoPE, GQA, SwiGLU, RMSNorm, QK-norm)
- **License:** [Sapiens2 License](https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md)
- **Task:** pretrain
- **Format:** safetensors
- **File:** `sapiens2_5b_pretrain.safetensors`

## Quick Start

Install the [Sapiens2 repo](https://github.com/facebookresearch/sapiens2) (`pip install -e .`).

```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from sapiens.backbones.standalone.sapiens2 import Sapiens2

# Build the model and load the pretrained checkpoint
model = Sapiens2(arch="sapiens2_5b", img_size=(1024, 768), patch_size=16).eval().cuda()  # img_size is (H, W)
ckpt_path = hf_hub_download(repo_id="facebook/sapiens2-pretrain-5b", filename="sapiens2_5b_pretrain.safetensors")
model.load_state_dict(load_file(ckpt_path))

# Forward pass on a single image (RGB; ImageNet normalization recommended)
x = torch.randn(1, 3, 1024, 768).cuda()
with torch.no_grad():
    features = model(x)[0]  # dense backbone features: (B, num_tokens, embed_dim)
```

## Model Card

| Field | Value |
|-------|-------|
| Architecture | Sapiens2 ViT (RoPE, GQA, SwiGLU, RMSNorm, QK-norm) |
| Parameters | 5.071 B |
| FLOPs | 15.722 T |
| Embedding dim | 2432 |
| Layers | 56 |
| Attention heads | 32 |
| Pretraining resolution | 1024 × 768 (H × W) |
| Patch size | 16 |
| Pretraining data | 1B human images |

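From the table, a 1024 × 768 input at patch size 16 yields a 64 × 48 patch grid. A quick sanity check of the expected feature shape (this counts patch tokens only; whether the backbone prepends extra class or register tokens is not stated here):

```python
# Expected patch-token count for the 5B backbone (patch tokens only)
H, W, patch_size = 1024, 768, 16
num_patches = (H // patch_size) * (W // patch_size)  # 64 * 48
embed_dim = 2432

print(num_patches)                   # 3072
print((1, num_patches, embed_dim))   # expected dense feature shape (B, num_tokens, embed_dim)
```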
### Sapiens2 Family

| Model | Params | FLOPs | Embed dim | Layers | Heads |
|-------|--------|-------|-----------|--------|-------|
| [Sapiens2-0.1B](https://huggingface.co/facebook/sapiens2-pretrain-0.1b) | 0.114 B | 0.342 T | 768 | 12 | 12 |
| [Sapiens2-0.4B](https://huggingface.co/facebook/sapiens2-pretrain-0.4b) | 0.398 B | 1.260 T | 1024 | 24 | 16 |
| [Sapiens2-0.8B](https://huggingface.co/facebook/sapiens2-pretrain-0.8b) | 0.818 B | 2.592 T | 1280 | 32 | 16 |
| [Sapiens2-1B](https://huggingface.co/facebook/sapiens2-pretrain-1b) | 1.462 B | 4.715 T | 1536 | 40 | 24 |
| **Sapiens2-5B** *(this model)* | 5.071 B | 15.722 T | 2432 | 56 | 32 |

See the [Sapiens2 Collection](https://huggingface.co/collections/facebook/sapiens2) for all variants and downstream task checkpoints (pose, segmentation, normals, pointmaps).

## Intended Use

- Feature extraction for human-centric downstream tasks
- Initialization for fine-tuning task heads (pose, segmentation, normals, pointmaps)
- Research on human-centric vision

## License

Released under the [Sapiens2 License](https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md).

## Citation

```bibtex
@inproceedings{khirodkar2026sapiens2,
  title={Sapiens2},
  author={Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan and Zhaoen, Su and Saito, Shunsuke},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```