# Sapiens2-0.1B
Sapiens2 is a family of high-resolution vision transformers pretrained on 1 billion human images, designed for human-centric tasks such as pose estimation, body-part segmentation, surface normals, and pointmaps.
This repository contains the 0.1B parameter pretrained backbone (114M params). It produces dense per-patch features suitable for fine-tuning downstream task heads.
- Paper: OpenReview (ICLR 2026)
- Project Page: rawalkhirodkar.github.io/sapiens2
- Code: github.com/facebookresearch/sapiens2
## Model Details
- Developed by: Meta
- Model type: Vision Transformer (RoPE, GQA, SwiGLU, RMSNorm, QK-norm)
- License: Sapiens2 License
- Task: pretrain
- Format: safetensors
- File: `sapiens2_0.1b_pretrain.safetensors`
## Quick Start
Install the Sapiens2 repo (`pip install -e .`), then:
```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from sapiens.backbones.standalone.sapiens2 import Sapiens2

# Build the model and load the pretrained checkpoint
model = Sapiens2(arch="sapiens2_0.1b", img_size=(1024, 768), patch_size=16).eval().cuda()  # img_size is (H, W)
ckpt_path = hf_hub_download(
    repo_id="facebook/sapiens2-pretrain-0.1b",
    filename="sapiens2_0.1b_pretrain.safetensors",
)
model.load_state_dict(load_file(ckpt_path))

# Forward pass on a single image (RGB; ImageNet normalization recommended)
x = torch.randn(1, 3, 1024, 768).cuda()
with torch.no_grad():
    features = model(x)[0]  # dense backbone features: (B, num_tokens, embed_dim)
```
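Dense heads typically want a spatial feature map rather than a token sequence. A minimal sketch of the reshape, assuming the backbone emits one token per 16×16 patch with no extra class token (the tensor below is a stand-in for the real model output):

```python
import torch

# At 1024x768 input with patch size 16, the patch grid is 64 x 48,
# so num_tokens = 64 * 48 = 3072 (assuming no class token).
B, H, W, P, D = 1, 1024, 768, 16, 768
gh, gw = H // P, W // P                 # patch grid: 64 x 48
features = torch.randn(B, gh * gw, D)   # stand-in for backbone output

# (B, num_tokens, D) -> (B, D, gh, gw) for convolutional task heads
fmap = features.transpose(1, 2).reshape(B, D, gh, gw)
print(fmap.shape)  # torch.Size([1, 768, 64, 48])
```

If the checkpoint you load does prepend extra tokens, slice them off before reshaping.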
## Model Card
| Field | Value |
|---|---|
| Architecture | Sapiens2 ViT (RoPE, GQA, SwiGLU, RMSNorm, QK-norm) |
| Parameters | 0.114 B |
| FLOPs | 0.342 T |
| Embedding dim | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Pretraining resolution | 1024 × 768 (H × W) |
| Patch size | 16 |
| Pretraining data | 1B human images |
## Sapiens2 Family
| Model | Params | FLOPs | Embed dim | Layers | Heads |
|---|---|---|---|---|---|
| Sapiens2-0.1B (this) | 0.114 B | 0.342 T | 768 | 12 | 12 |
| Sapiens2-0.4B | 0.398 B | 1.260 T | 1024 | 24 | 16 |
| Sapiens2-0.8B | 0.818 B | 2.592 T | 1280 | 32 | 16 |
| Sapiens2-1B | 1.462 B | 4.715 T | 1536 | 40 | 24 |
| Sapiens2-5B | 5.071 B | 15.722 T | 2432 | 56 | 32 |
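When choosing a variant under a compute or memory budget, the family table can be queried programmatically. A small illustrative helper (the dictionary values are copied from the table above; the function name is hypothetical, not part of the Sapiens2 API):

```python
# Parameter counts (billions) and widths from the family table above
SAPIENS2_FAMILY = {
    "sapiens2_0.1b": {"params_b": 0.114, "embed_dim": 768,  "layers": 12, "heads": 12},
    "sapiens2_0.4b": {"params_b": 0.398, "embed_dim": 1024, "layers": 24, "heads": 16},
    "sapiens2_0.8b": {"params_b": 0.818, "embed_dim": 1280, "layers": 32, "heads": 16},
    "sapiens2_1b":   {"params_b": 1.462, "embed_dim": 1536, "layers": 40, "heads": 24},
    "sapiens2_5b":   {"params_b": 5.071, "embed_dim": 2432, "layers": 56, "heads": 32},
}

def largest_under(budget_b: float) -> str:
    """Return the largest variant whose parameter count fits the budget."""
    fits = {k: v for k, v in SAPIENS2_FAMILY.items() if v["params_b"] <= budget_b}
    return max(fits, key=lambda k: fits[k]["params_b"])

print(largest_under(1.0))  # sapiens2_0.8b
```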
See the Sapiens2 Collection for all variants and downstream task checkpoints (pose, segmentation, normals, pointmaps).
## Intended Use
- Feature extraction for human-centric downstream tasks
- Initialization for fine-tuning task heads (pose, segmentation, normals, pointmap)
- Research on human-centric vision
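As a sketch of the second use case, a per-token linear probe can be trained on top of frozen backbone features. The head and class count below are illustrative stand-ins, not the official Sapiens2 task heads, and random tensors replace the real backbone output and labels:

```python
import torch
import torch.nn as nn

# Frozen-backbone fine-tuning sketch: only the head gets gradients here.
B, num_tokens, embed_dim = 2, 3072, 768   # shapes match the 0.1B backbone
num_classes = 28                           # e.g. a body-part label set

head = nn.Linear(embed_dim, num_classes)          # per-token linear probe
features = torch.randn(B, num_tokens, embed_dim)  # stand-in backbone output

logits = head(features)                    # (B, num_tokens, num_classes)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, num_classes),
    torch.randint(num_classes, (B * num_tokens,)),  # stand-in labels
)
loss.backward()
print(logits.shape)  # torch.Size([2, 3072, 28])
```

In practice the per-token logits would be reshaped to the patch grid and upsampled to the input resolution for dense prediction.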
## License
Released under the Sapiens2 License.
## Citation
```bibtex
@inproceedings{khirodkar2026sapiens2,
  title={Sapiens2},
  author={Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan and Zhaoen, Su and Saito, Shunsuke},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```