# Sapiens2-0.1B
Sapiens2 is a family of high-resolution vision transformers pretrained on 1 billion human images, designed for human-centric tasks such as pose estimation, body-part segmentation, surface normals, and pointmaps.
This repository contains the 0.1B parameter pretrained backbone (114M params). It produces dense per-patch features suitable for fine-tuning downstream task heads.
- Paper: OpenReview (ICLR 2026)
- Project Page: rawalkhirodkar.github.io/sapiens2
- Code: github.com/facebookresearch/sapiens2
## Model Details
- Developed by: Meta
- Model type: Vision Transformer (RoPE, GQA, SwiGLU, RMSNorm, QK-norm)
- License: Sapiens2 License
- Task: pretrain
- Format: safetensors
- File: `sapiens2_0.1b_pretrain.safetensors`
## Quick Start
Install the Sapiens2 repo (`pip install -e .`), then:
```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from sapiens.backbones.standalone.sapiens2 import Sapiens2

# Build the model and load the pretrained checkpoint
model = Sapiens2(arch="sapiens2_0.1b", img_size=(1024, 768), patch_size=16).eval().cuda()  # img_size is (H, W)
ckpt_path = hf_hub_download(
    repo_id="facebook/sapiens2-pretrain-0.1b",
    filename="sapiens2_0.1b_pretrain.safetensors",
)
model.load_state_dict(load_file(ckpt_path))

# Forward pass on a single image (RGB; ImageNet normalization recommended)
x = torch.randn(1, 3, 1024, 768).cuda()
with torch.no_grad():
    features = model(x)[0]  # dense backbone features: (B, num_tokens, embed_dim)
```
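Dense heads typically want a spatial feature map rather than a token sequence. A minimal sketch of the reshape, assuming the backbone emits one token per 16×16 patch with no extra class token (the tensor below is a stand-in for the real model output):

```python
import torch

# At 1024x768 input with patch size 16, the patch grid is 64 x 48,
# so num_tokens = 64 * 48 = 3072 (assuming no class token).
B, H, W, P, D = 1, 1024, 768, 16, 768
gh, gw = H // P, W // P                 # patch grid: 64 x 48
features = torch.randn(B, gh * gw, D)   # stand-in for backbone output

# (B, num_tokens, D) -> (B, D, gh, gw) for convolutional task heads
fmap = features.transpose(1, 2).reshape(B, D, gh, gw)
print(fmap.shape)  # torch.Size([1, 768, 64, 48])
```

If the checkpoint you load does prepend extra tokens, slice them off before reshaping.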
## Model Card
| Field | Value |
|---|---|
| Architecture | Sapiens2 ViT (RoPE, GQA, SwiGLU, RMSNorm, QK-norm) |
| Parameters | 0.114 B |
| FLOPs | 0.342 T |
| Embedding dim | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Pretraining resolution | 1024 × 768 (H × W) |
| Patch size | 16 |
| Pretraining data | 1B human images |
## Sapiens2 Family
| Model | Params | FLOPs | Embed dim | Layers | Heads |
|---|---|---|---|---|---|
| Sapiens2-0.1B (this) | 0.114 B | 0.342 T | 768 | 12 | 12 |
| Sapiens2-0.4B | 0.398 B | 1.260 T | 1024 | 24 | 16 |
| Sapiens2-0.8B | 0.818 B | 2.592 T | 1280 | 32 | 16 |
| Sapiens2-1B | 1.462 B | 4.715 T | 1536 | 40 | 24 |
| Sapiens2-5B | 5.071 B | 15.722 T | 2432 | 56 | 32 |
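When choosing a variant under a compute or memory budget, the family table can be queried programmatically. A small illustrative helper (the dictionary values are copied from the table above; the function name is hypothetical, not part of the Sapiens2 API):

```python
# Parameter counts (billions) and widths from the family table above
SAPIENS2_FAMILY = {
    "sapiens2_0.1b": {"params_b": 0.114, "embed_dim": 768,  "layers": 12, "heads": 12},
    "sapiens2_0.4b": {"params_b": 0.398, "embed_dim": 1024, "layers": 24, "heads": 16},
    "sapiens2_0.8b": {"params_b": 0.818, "embed_dim": 1280, "layers": 32, "heads": 16},
    "sapiens2_1b":   {"params_b": 1.462, "embed_dim": 1536, "layers": 40, "heads": 24},
    "sapiens2_5b":   {"params_b": 5.071, "embed_dim": 2432, "layers": 56, "heads": 32},
}

def largest_under(budget_b: float) -> str:
    """Return the largest variant whose parameter count fits the budget."""
    fits = {k: v for k, v in SAPIENS2_FAMILY.items() if v["params_b"] <= budget_b}
    return max(fits, key=lambda k: fits[k]["params_b"])

print(largest_under(1.0))  # sapiens2_0.8b
```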
See the Sapiens2 Collection for all variants and downstream task checkpoints (pose, segmentation, normals, pointmaps).
## Intended Use
- Feature extraction for human-centric downstream tasks
- Initialization for fine-tuning task heads (pose, segmentation, normals, pointmap)
- Research on human-centric vision
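As a sketch of the second use case, a per-token linear probe can be trained on top of frozen backbone features. The head and class count below are illustrative stand-ins, not the official Sapiens2 task heads, and random tensors replace the real backbone output and labels:

```python
import torch
import torch.nn as nn

# Frozen-backbone fine-tuning sketch: only the head gets gradients here.
B, num_tokens, embed_dim = 2, 3072, 768   # shapes match the 0.1B backbone
num_classes = 28                           # e.g. a body-part label set

head = nn.Linear(embed_dim, num_classes)          # per-token linear probe
features = torch.randn(B, num_tokens, embed_dim)  # stand-in backbone output

logits = head(features)                    # (B, num_tokens, num_classes)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, num_classes),
    torch.randint(num_classes, (B * num_tokens,)),  # stand-in labels
)
loss.backward()
print(logits.shape)  # torch.Size([2, 3072, 28])
```

In practice the per-token logits would be reshaped to the patch grid and upsampled to the input resolution for dense prediction.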
## License
Released under the Sapiens2 License.
## Citation
```bibtex
@inproceedings{khirodkar2026sapiens2,
  title={Sapiens2},
  author={Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan and Zhaoen, Su and Saito, Shunsuke},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```