---
license: other
license_name: sapiens2-license
license_link: https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md
library_name: mlx
tags:
- mlx
- sapiens2
- vision
- human-centric
- seg
pipeline_tag: image-to-image
base_model:
- facebook/sapiens2-seg-0.4b
---

# mlx-community/sapiens2-seg-0.4b-4bit

MLX port of [`facebook/sapiens2-seg-0.4b`](https://huggingface.co/facebook/sapiens2-seg-0.4b) at **4-bit affine (group_size=64)** precision, converted with [mlx-vlm](https://github.com/Blaizzy/mlx-vlm).

Sapiens2 is a family of human-centric ViTs pretrained on 1B human images. This repo contains the **seg** head paired with the Sapiens2-0.4b backbone.

## Install

```bash
pip install -U mlx-vlm
```

## Usage: body-part segmentation (29 classes)

```python
from pathlib import Path

from PIL import Image
import numpy as np

from mlx_vlm.utils import load_model
from mlx_vlm.models.sapiens2.processing_sapiens2 import Sapiens2Processor
from mlx_vlm.models.sapiens2.generate import Sapiens2Predictor

model = load_model(Path("mlx-community/sapiens2-seg-0.4b-4bit"))
processor = Sapiens2Processor.from_pretrained("mlx-community/sapiens2-seg-0.4b-4bit")
predictor = Sapiens2Predictor(model, processor)

result = predictor.predict(Image.open("person.jpg"))
# result.mask        (orig_h, orig_w)     int32 class indices
# result.seg_logits  (29, H_out, W_out)   raw logits

print("active classes:", np.unique(result.mask).tolist())
Image.fromarray(result.mask.astype(np.uint8)).save("mask.png")
```

Output: dense 29-class body-part segmentation (DOME 29-class scheme: face, hair, torso, arms/legs split left/right, etc.).

## Convert your own checkpoint

```bash
# 1. Stage a float32 MLX directory from the Facebook checkpoint
python -m mlx_vlm.models.sapiens2.convert \
  --hf-repo facebook/sapiens2-seg-0.4b \
  --out ./sapiens2-seg-0.4b-fp32-mlx \
  --dtype float32

# 2. Quantize + upload via the main mlx_vlm.convert CLI
python -m mlx_vlm.convert \
  --hf-path ./sapiens2-seg-0.4b-fp32-mlx \
  --mlx-path ./sapiens2-seg-0.4b-4bit \
  --quantize --q-bits 4 --q-group-size 64 --q-mode affine \
  --upload-repo mlx-community/sapiens2-seg-0.4b-4bit
```

## Architecture

Sapiens2 backbone: 2-D RoPE ViT (bf16 rope), partial GQA (full MHA in the first/last 8 blocks, KV-half for the middle), SwiGLU FFN, cls + 8 storage tokens. Default input: **1024 × 768 (H × W)**, patch size 16, ImageNet normalization on the [0, 255] scale.

See the [mlx-vlm sapiens2 port](https://github.com/Blaizzy/mlx-vlm/tree/main/mlx_vlm/models/sapiens2) for implementation details.

## License

Weights released under the [Sapiens2 License](https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md); this MLX repackaging inherits that license.

## Citation

```bibtex
@article{khirodkarsapiens2,
  title   = {Sapiens2},
  author  = {Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan and Su, Zhaoen and Saito, Shunsuke},
  journal = {arXiv preprint arXiv:2604.21681},
  year    = {2026}
}
```
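
## Appendix: input geometry sketch

The Architecture section gives the default input size (1024 × 768, H × W), the patch size (16), the extra tokens (cls + 8 storage), and ImageNet normalization on the [0, 255] scale. The sketch below works those numbers through with plain NumPy. It is an illustration only, not the port's actual preprocessing: the helper names (`patch_grid`, `sequence_length`, `normalize`) are hypothetical, the mean/std values are the standard ImageNet statistics multiplied by 255, and it assumes the cls and storage tokens are simply concatenated with the patch tokens.

```python
import numpy as np

# Values stated in the Architecture section.
INPUT_H, INPUT_W = 1024, 768
PATCH = 16
NUM_EXTRA_TOKENS = 1 + 8  # cls + 8 storage tokens

# Standard ImageNet statistics rescaled to the [0, 255] pixel range
# (assumption: the port uses these conventional values).
IMAGENET_MEAN_255 = np.array([0.485, 0.456, 0.406], dtype=np.float32) * 255.0
IMAGENET_STD_255 = np.array([0.229, 0.224, 0.225], dtype=np.float32) * 255.0


def patch_grid(height=INPUT_H, width=INPUT_W, patch=PATCH):
    """Return (rows, cols) of the patch grid for a given input size."""
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    return height // patch, width // patch


def sequence_length(height=INPUT_H, width=INPUT_W, patch=PATCH):
    """Patch tokens plus cls/storage tokens (assumes simple concatenation)."""
    rows, cols = patch_grid(height, width, patch)
    return rows * cols + NUM_EXTRA_TOKENS


def normalize(image_uint8):
    """Normalize an (H, W, 3) RGB uint8 array without dividing by 255 first."""
    pixels = image_uint8.astype(np.float32)
    return (pixels - IMAGENET_MEAN_255) / IMAGENET_STD_255


print(patch_grid())       # (64, 48)
print(sequence_length())  # 3081
```

Under these assumptions the default input yields a 64 × 48 grid, i.e. 3072 patch tokens plus 9 extra tokens. In practice `Sapiens2Processor` handles resizing and normalization, so this is only useful for sanity-checking shapes.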