---
license: other
license_name: sapiens2-license
license_link: https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md
pipeline_tag: image-feature-extraction
library_name: sapiens
tags:
  - sapiens
  - sapiens2
  - vision-transformer
  - human-centric
  - pretrained-backbone
  - feature-extraction
---

Sapiens2-1B

Sapiens2 is a family of high-resolution vision transformers pretrained on 1 billion human images, designed for human-centric tasks such as pose estimation, body-part segmentation, surface normals, and pointmaps.

This repository contains the 1B parameter pretrained backbone. It produces dense per-patch features suitable for fine-tuning downstream task heads.

Model Details

  • Developed by: Meta
  • Model type: Vision Transformer
  • License: Sapiens2 License
  • Task: pretrain
  • Format: safetensors
  • File: sapiens2_1b_pretrain.safetensors

Quick Start

Install the Sapiens2 repo: clone https://github.com/facebookresearch/sapiens2 and run pip install -e . from the repo root.

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from sapiens.backbones.standalone.sapiens2 import Sapiens2

# Build the model and load the pretrained checkpoint
model = Sapiens2(arch="sapiens2_1b", img_size=(1024, 768), patch_size=16).eval().cuda()  # img_size is (H, W)
ckpt_path = hf_hub_download(repo_id="facebook/sapiens2-pretrain-1b", filename="sapiens2_1b_pretrain.safetensors")
model.load_state_dict(load_file(ckpt_path))

# Forward pass on a single image (RGB; ImageNet normalization recommended)
x = torch.randn(1, 3, 1024, 768).cuda()
with torch.no_grad():
    features = model(x)[0]  # dense backbone features: (B, num_tokens, embed_dim)
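The comment above recommends ImageNet normalization for real images. A minimal preprocessing sketch, assuming the standard ImageNet mean/std (this card does not state the exact values; check the Sapiens2 repo):

```python
import torch

# Assumed ImageNet statistics (check the Sapiens2 repo for the exact values used in pretraining)
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def preprocess(image_uint8: torch.Tensor) -> torch.Tensor:
    """Convert an RGB uint8 tensor of shape (H, W, 3) to a normalized (1, 3, H, W) batch."""
    x = image_uint8.permute(2, 0, 1).unsqueeze(0).float() / 255.0  # (1, 3, H, W) in [0, 1]
    return (x - IMAGENET_MEAN) / IMAGENET_STD

# Example: a dummy 1024x768 RGB image in place of a decoded photo
img = torch.randint(0, 256, (1024, 768, 3), dtype=torch.uint8)
batch = preprocess(img)  # shape (1, 3, 1024, 768), ready for model(batch.cuda())
```

The resulting batch matches the (1, 3, 1024, 768) input shape used in the forward pass above.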

Model Card

| Field | Value |
| --- | --- |
| Architecture | Sapiens2 ViT (RoPE, GQA, SwiGLU, RMSNorm, QK-norm) |
| Parameters | 1.462 B |
| FLOPs | 4.715 T |
| Embedding dim | 1536 |
| Layers | 40 |
| Attention heads | 24 |
| Pretraining resolution | 1024 × 768 (H × W) |
| Patch size | 16 |
| Pretraining data | 1B human images |
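At the pretraining resolution, the number of dense feature tokens follows directly from the patch size. A quick sanity check of the (B, num_tokens, embed_dim) output shape noted in Quick Start (ignoring any class or register tokens the implementation may add):

```python
img_h, img_w = 1024, 768   # pretraining resolution (H, W)
patch = 16                 # patch size
embed_dim = 1536           # embedding dim for Sapiens2-1B

num_tokens = (img_h // patch) * (img_w // patch)  # 64 * 48 patches
print(num_tokens)               # -> 3072 patch tokens per image
print(num_tokens * embed_dim)   # -> 4718592 feature elements per image
```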

Sapiens2 Family

| Model | Params | FLOPs | Embed dim | Layers | Heads |
| --- | --- | --- | --- | --- | --- |
| Sapiens2-0.1B | 0.114 B | 0.342 T | 768 | 12 | 12 |
| Sapiens2-0.4B | 0.398 B | 1.260 T | 1024 | 24 | 16 |
| Sapiens2-0.8B | 0.818 B | 2.592 T | 1280 | 32 | 16 |
| Sapiens2-1B (this) | 1.462 B | 4.715 T | 1536 | 40 | 24 |
| Sapiens2-1B-4K | 1.607 B | — | 1536 | 40 | 24 |
| Sapiens2-5B | 5.071 B | 15.722 T | 2432 | 56 | 32 |

See the Sapiens2 Collection for all variants and downstream task checkpoints (pose, segmentation, normals, pointmaps).

Intended Use

  • Feature extraction for human-centric downstream tasks
  • Initialization for fine-tuning task heads (pose, segmentation, normals, pointmap)
  • Research on human-centric vision
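As an illustration of the first two bullets, a linear probe over mean-pooled backbone features could look like the sketch below. The feature shapes follow the Quick Start output; the head itself is hypothetical and not part of this release:

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Hypothetical linear head over frozen Sapiens2 backbone features."""
    def __init__(self, embed_dim: int = 1536, num_classes: int = 10):
        super().__init__()
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (B, num_tokens, embed_dim) dense backbone output
        pooled = features.mean(dim=1)  # mean-pool patch tokens -> (B, embed_dim)
        return self.head(pooled)       # (B, num_classes)

# Dummy features standing in for model(x)[0] from Quick Start
feats = torch.randn(2, 3072, 1536)
logits = LinearProbe()(feats)  # shape (2, 10)
```

For dense tasks (segmentation, normals, pointmaps), a per-token head would replace the pooling step.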

License

Released under the Sapiens2 License.

Citation

@article{khirodkarsapiens2,
  title={Sapiens2},
  author={Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan and Su, Zhaoen and Saito, Shunsuke},
  journal={arXiv preprint arXiv:2604.21681},
  year={2026}
}