---
license: other
license_name: sapiens2-license
license_link: https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md
pipeline_tag: image-feature-extraction
library_name: sapiens
tags:
- sapiens
- sapiens2
- vision-transformer
- human-centric
- pretrained-backbone
- feature-extraction
---

# Sapiens2-5B

Sapiens2 is a family of high-resolution vision transformers pretrained on **1 billion human images** — designed for human-centric tasks such as pose estimation, body-part segmentation, surface normals, and pointmaps.

This repository contains the **5B parameter pretrained backbone**. It produces dense per-patch features suitable for fine-tuning downstream task heads.

- 📄 **Paper:** [arXiv:2604.21681](https://arxiv.org/pdf/2604.21681)
- 🌐 **Project Page:** [rawalkhirodkar.github.io/sapiens2](https://rawalkhirodkar.github.io/sapiens2)
- 💻 **Code:** [github.com/facebookresearch/sapiens2](https://github.com/facebookresearch/sapiens2)

## Model Details

- **Developed by:** Meta
- **Model type:** Vision Transformer
- **License:** [Sapiens2 License](https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md)
- **Task:** pretrain
- **Format:** safetensors
- **File:** `sapiens2_5b_pretrain.safetensors`

## Quick Start

Install the [Sapiens2 repo](https://github.com/facebookresearch/sapiens2) (`pip install -e .`).

```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from sapiens.backbones.standalone.sapiens2 import Sapiens2

# Build the model and load the pretrained checkpoint
model = Sapiens2(arch="sapiens2_5b", img_size=(1024, 768), patch_size=16).eval().cuda()  # img_size is (H, W)
ckpt_path = hf_hub_download(repo_id="facebook/sapiens2-pretrain-5b", filename="sapiens2_5b_pretrain.safetensors")
model.load_state_dict(load_file(ckpt_path))

# Forward pass on a single image (RGB; ImageNet normalization recommended)
x = torch.randn(1, 3, 1024, 768).cuda()
with torch.no_grad():
    features = model(x)[0]  # dense backbone features: (B, num_tokens, embed_dim)
```

## Model Card

| Field | Value |
|-------|-------|
| Architecture | Sapiens2 ViT (RoPE, GQA, SwiGLU, RMSNorm, QK-norm) |
| Parameters | 5.071 B |
| FLOPs | 15.722 T |
| Embedding dim | 2432 |
| Layers | 56 |
| Attention heads | 32 |
| Pretraining resolution | 1024 × 768 (H × W) |
| Patch size | 16 |
| Pretraining data | 1B human images |

### Sapiens2 Family

| Model | Params | FLOPs | Embed dim | Layers | Heads |
|-------|--------|-------|-----------|--------|-------|
| [Sapiens2-0.1B](https://huggingface.co/facebook/sapiens2-pretrain-0.1b) | 0.114 B | 0.342 T | 768 | 12 | 12 |
| [Sapiens2-0.4B](https://huggingface.co/facebook/sapiens2-pretrain-0.4b) | 0.398 B | 1.260 T | 1024 | 24 | 16 |
| [Sapiens2-0.8B](https://huggingface.co/facebook/sapiens2-pretrain-0.8b) | 0.818 B | 2.592 T | 1280 | 32 | 16 |
| [Sapiens2-1B](https://huggingface.co/facebook/sapiens2-pretrain-1b) | 1.462 B | 4.715 T | 1536 | 40 | 24 |
| [Sapiens2-1B-4K](https://huggingface.co/facebook/sapiens2-pretrain-1b-4k) | 1.607 B | — | 1536 | 40 | 24 |
| **Sapiens2-5B** *(this)* | 5.071 B | 15.722 T | 2432 | 56 | 32 |

See the [Sapiens2 Collection](https://huggingface.co/collections/facebook/sapiens2) for all variants and downstream task checkpoints (pose, segmentation, normals, pointmaps).
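## Example: Feature Extraction on a Real Image

The Quick Start above runs the backbone on a random tensor. The sketch below shows one way to feed it a real photo: resize to the 1024 × 768 pretraining resolution, apply the ImageNet normalization recommended in the Quick Start, and view the patch tokens as a spatial grid (with patch size 16, a 1024 × 768 input gives 64 × 48 = 3072 patches). The file name `person.jpg`, the torchvision preprocessing pipeline, and the assumption that the output sequence contains only patch tokens are illustrative choices, not part of the official repo; the snippet reuses the `model` built in the Quick Start.

```python
import torch
from PIL import Image
from torchvision import transforms

# Hypothetical preprocessing: exact resize to the pretraining resolution plus
# standard ImageNet statistics (the Quick Start recommends ImageNet normalization).
preprocess = transforms.Compose([
    transforms.Resize((1024, 768)),  # (H, W)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("person.jpg").convert("RGB")  # hypothetical input image
x = preprocess(image).unsqueeze(0).cuda()        # (1, 3, 1024, 768)

with torch.no_grad():
    tokens = model(x)[0]  # (1, num_tokens, 2432), as in the Quick Start

# If the backbone returns only patch tokens (assumed here; no class token),
# the sequence can be reshaped into a 64 x 48 feature map for dense heads.
B, N, C = tokens.shape
h, w = 1024 // 16, 768 // 16  # 64, 48
if N == h * w:
    feature_map = tokens.transpose(1, 2).reshape(B, C, h, w)  # (1, 2432, 64, 48)
```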
## Intended Use

- Feature extraction for human-centric downstream tasks
- Initialization for fine-tuning task heads (pose, segmentation, normals, pointmap)
- Research on human-centric vision

## License

Released under the [Sapiens2 License](https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md).

## Citation

```bibtex
@article{khirodkarsapiens2,
  title={Sapiens2},
  author={Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan and Su, Zhaoen and Saito, Shunsuke},
  journal={arXiv preprint arXiv:2604.21681},
  year={2026}
}
```