---
license: other
license_name: sapiens2-license
license_link: https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md
pipeline_tag: image-feature-extraction
library_name: sapiens
tags:
- sapiens
- sapiens2
- vision-transformer
- human-centric
- pretrained-backbone
- feature-extraction
---
# Sapiens2-1B
Sapiens2 is a family of high-resolution vision transformers pretrained on **1 billion human images**, designed for human-centric tasks such as pose estimation, body-part segmentation, surface normals, and pointmaps.
This repository contains the **1B parameter pretrained backbone**. It produces dense per-patch features suitable for fine-tuning downstream task heads.
- 📄 **Paper:** [arXiv:2604.21681](https://arxiv.org/pdf/2604.21681)
- 🌐 **Project Page:** [rawalkhirodkar.github.io/sapiens2](https://rawalkhirodkar.github.io/sapiens2)
- 💻 **Code:** [github.com/facebookresearch/sapiens2](https://github.com/facebookresearch/sapiens2)
## Model Details
- **Developed by:** Meta
- **Model type:** Vision Transformer
- **License:** [Sapiens2 License](https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md)
- **Task:** pretrain
- **Format:** safetensors
- **File:** `sapiens2_1b_pretrain.safetensors`
## Quick Start
Clone and install the [Sapiens2 repo](https://github.com/facebookresearch/sapiens2), then run `pip install -e .` from the repo root.
```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from sapiens.backbones.standalone.sapiens2 import Sapiens2
# Build the model and load the pretrained checkpoint
model = Sapiens2(arch="sapiens2_1b", img_size=(1024, 768), patch_size=16).eval().cuda() # img_size is (H, W)
ckpt_path = hf_hub_download(repo_id="facebook/sapiens2-pretrain-1b", filename="sapiens2_1b_pretrain.safetensors")
model.load_state_dict(load_file(ckpt_path))
# Forward pass on a single image (RGB; ImageNet normalization recommended)
x = torch.randn(1, 3, 1024, 768).cuda()
with torch.no_grad():
    features = model(x)[0]  # dense backbone features: (B, num_tokens, embed_dim)
```
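As a quick sanity check on the feature shape, the token count follows directly from the pretraining resolution and patch size. This is a minimal sketch assuming one token per 16×16 patch and no extra tokens (the backbone may add register or class tokens; check the repo for the exact layout):

```python
# Patch grid for img_size (H, W) = (1024, 768) with patch_size 16.
H, W, P = 1024, 768, 16
grid_h, grid_w = H // P, W // P   # patches along height and width
num_tokens = grid_h * grid_w      # tokens in the dense feature sequence
print(grid_h, grid_w, num_tokens)  # -> 64 48 3072
```

Under that assumption, reshaping `features` from `(B, num_tokens, embed_dim)` to `(B, embed_dim, grid_h, grid_w)` recovers a spatial feature map suitable for dense task heads.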
## Model Specifications
| Field | Value |
|-------|-------|
| Architecture | Sapiens2 ViT (RoPE, GQA, SwiGLU, RMSNorm, QK-norm) |
| Parameters | 1.462 B |
| FLOPs | 4.715 T |
| Embedding dim | 1536 |
| Layers | 40 |
| Attention heads | 24 |
| Pretraining resolution | 1024 Γ— 768 (H Γ— W) |
| Patch size | 16 |
| Pretraining data | 1B human images |
### Sapiens2 Family
| Model | Params | FLOPs | Embed dim | Layers | Heads |
|-------|--------|-------|-----------|--------|-------|
| [Sapiens2-0.1B](https://huggingface.co/facebook/sapiens2-pretrain-0.1b) | 0.114 B | 0.342 T | 768 | 12 | 12 |
| [Sapiens2-0.4B](https://huggingface.co/facebook/sapiens2-pretrain-0.4b) | 0.398 B | 1.260 T | 1024 | 24 | 16 |
| [Sapiens2-0.8B](https://huggingface.co/facebook/sapiens2-pretrain-0.8b) | 0.818 B | 2.592 T | 1280 | 32 | 16 |
| **Sapiens2-1B** *(this model)* | 1.462 B | 4.715 T | 1536 | 40 | 24 |
| [Sapiens2-1B-4K](https://huggingface.co/facebook/sapiens2-pretrain-1b-4k) | 1.607 B | — | 1536 | 40 | 24 |
| [Sapiens2-5B](https://huggingface.co/facebook/sapiens2-pretrain-5b) | 5.071 B | 15.722 T | 2432 | 56 | 32 |
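The embedding dimension and head count in the table determine the per-head width of the query projection (with GQA, the number of key/value heads may be smaller; this only derives the query-head width, computed here from the table values):

```python
# (embed_dim, attention heads) per variant, from the family table above.
family = {
    "sapiens2_0.1b": (768, 12),
    "sapiens2_0.4b": (1024, 16),
    "sapiens2_0.8b": (1280, 16),
    "sapiens2_1b": (1536, 24),
    "sapiens2_5b": (2432, 32),
}
# Query-head width = embed_dim / heads.
head_dims = {name: dim // heads for name, (dim, heads) in family.items()}
print(head_dims["sapiens2_1b"])  # -> 64
```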
See the [Sapiens2 Collection](https://huggingface.co/collections/facebook/sapiens2) for all variants and downstream task checkpoints (pose, segmentation, normals, pointmaps).
## Intended Use
- Feature extraction for human-centric downstream tasks
- Initialization for fine-tuning task heads (pose, segmentation, normals, pointmap)
- Research on human-centric vision
## License
Released under the [Sapiens2 License](https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md).
## Citation
```bibtex
@article{khirodkarsapiens2,
title={Sapiens2},
author={Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan and Su, Zhaoen and Saito, Shunsuke},
journal={arXiv preprint arXiv:2604.21681},
year={2026}
}
```