metadata
license: other
license_name: sapiens2-license
license_link: https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md
library_name: mlx
tags:
  - mlx
  - sapiens2
  - vision
  - human-centric
  - seg
pipeline_tag: image-to-image
base_model:
  - facebook/sapiens2-seg-0.4b

mlx-community/sapiens2-seg-0.4b-4bit

MLX port of facebook/sapiens2-seg-0.4b, quantized to 4 bits with affine quantization (group_size=64) and converted with mlx-vlm.
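To make "4-bit affine, group_size=64" concrete, here is a minimal NumPy sketch of group-wise affine quantization. It illustrates the general scheme only and is not MLX's implementation: the function names are hypothetical, the codes are left unpacked as uint8, and the bits-per-weight figure assumes one 16-bit scale and one 16-bit bias per group.

```python
import numpy as np


def quantize_affine(w, bits=4, group_size=64):
    """Quantize a flat float array group by group.

    Returns integer codes plus one scale and one bias (the group minimum)
    per group. Hypothetical helper, not part of mlx or mlx-vlm.
    """
    groups = w.reshape(-1, group_size).astype(np.float32)
    levels = 2**bits - 1  # 15 steps between the group min and max for 4 bits
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scales = (w_max - w_min) / levels
    scales = np.where(scales == 0, 1.0, scales)  # constant groups: avoid 0/0
    codes = np.clip(np.round((groups - w_min) / scales), 0, levels)
    return codes.astype(np.uint8), scales, w_min


def dequantize_affine(codes, scales, biases):
    """Reconstruct approximate weights: w_hat = code * scale + bias."""
    return codes.astype(np.float32) * scales + biases


rng = np.random.default_rng(0)
w = rng.normal(size=4 * 64).astype(np.float32)  # four groups of 64 weights

codes, scales, biases = quantize_affine(w)
w_hat = dequantize_affine(codes, scales, biases).reshape(-1)

# Rounding to the nearest level bounds the error by half a step per group.
max_err = float(np.abs(w - w_hat).max())

# Storage cost under the stated assumption (16-bit scale + 16-bit bias per group).
bits_per_weight = 4 + (16 + 16) / 64
print(f"max abs error: {max_err:.4f}, bits per weight: {bits_per_weight}")
```

Smaller groups track the local weight range more closely but spend more bits on scales and biases; group_size=64 is the setting used for this checkpoint.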

Sapiens2 is a family of human-centric ViTs pretrained on 1B human images. This repo contains the segmentation (seg) head paired with the Sapiens2-0.4b backbone.

Install

pip install -U mlx-vlm

Usage: body-part segmentation (29 classes)

from pathlib import Path
from PIL import Image
import numpy as np
from mlx_vlm.utils import load_model
from mlx_vlm.models.sapiens2.processing_sapiens2 import Sapiens2Processor
from mlx_vlm.models.sapiens2.generate import Sapiens2Predictor

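# Path(...) is a filesystem path. If load_model does not resolve Hub repo IDs
# in your mlx-vlm version, download the repo first (for example with
# huggingface_hub.snapshot_download) and pass that local directory instead.
model = load_model(Path("mlx-community/sapiens2-seg-0.4b-4bit"))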
processor = Sapiens2Processor.from_pretrained("mlx-community/sapiens2-seg-0.4b-4bit")
predictor = Sapiens2Predictor(model, processor)

result = predictor.predict(Image.open("person.jpg"))
# result.mask        (orig_h, orig_w) int32 class indices
# result.seg_logits  (29, H_out, W_out) raw logits

print("active classes:", np.unique(result.mask).tolist())
# Class indices 0..28 fit in uint8, but viewed as grayscale the saved PNG looks nearly black.
Image.fromarray(result.mask.astype(np.uint8)).save("mask.png")

Output: a dense 29-class body-part segmentation following the DOME 29-class scheme (face, hair, torso, arms and legs split left/right, etc.).
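A mask stored as raw class indices is hard to inspect by eye, so a common follow-up is to map each class to a color and count pixels per class. The sketch below runs on a small synthetic logits array standing in for `result.seg_logits`. The palette is random, treating index 0 as background is an assumption, and `make_palette` and `colorize` are hypothetical helpers rather than mlx-vlm functions. Taking the argmax of the logits yields a mask at the logits' resolution, which may differ from the original-resolution `result.mask`.

```python
import numpy as np

NUM_CLASSES = 29


def make_palette(num_classes, seed=0):
    """Random RGB color per class; index 0 is set to black (assumed background)."""
    rng = np.random.default_rng(seed)
    palette = rng.integers(64, 256, size=(num_classes, 3)).astype(np.uint8)
    palette[0] = 0
    return palette


def colorize(mask, palette):
    """Map an (H, W) array of class indices to an (H, W, 3) uint8 RGB image."""
    return palette[mask]


# Synthetic stand-in for result.seg_logits with shape (29, H_out, W_out).
logits = np.zeros((NUM_CLASSES, 4, 6), dtype=np.float32)
logits[0] = 1.0               # class 0 wins everywhere by default
logits[3, 1:3, 2:5] = 2.0     # class 3 wins inside a 2 x 3 patch

mask = logits.argmax(axis=0)  # (H_out, W_out) class indices
color = colorize(mask, make_palette(NUM_CLASSES))

# Pixel count per class, including classes that never appear.
counts = np.bincount(mask.ravel(), minlength=NUM_CLASSES)
print({int(c): int(n) for c, n in enumerate(counts) if n > 0})
```

With a real prediction, pass `result.mask` to `colorize` directly and write the result with `Image.fromarray(color).save("mask_color.png")`.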

Convert your own checkpoint

# 1. Stage a float32 MLX directory from the Facebook checkpoint
python -m mlx_vlm.models.sapiens2.convert \
    --hf-repo facebook/sapiens2-seg-0.4b \
    --out ./sapiens2-seg-0.4b-fp32-mlx \
    --dtype float32

# 2. Quantize + upload via the main mlx_vlm.convert CLI
python -m mlx_vlm.convert \
    --hf-path  ./sapiens2-seg-0.4b-fp32-mlx \
    --mlx-path ./sapiens2-seg-0.4b-4bit \
    --quantize --q-bits 4 --q-group-size 64 --q-mode affine \
    --upload-repo mlx-community/sapiens2-seg-0.4b-4bit

Architecture

Sapiens2 backbone: 2-D RoPE ViT (bf16 rope), partial GQA (full MHA in the first and last 8 blocks, halved KV heads in the middle blocks), SwiGLU FFN, and a cls token plus 8 storage tokens. Default input: 1024 × 768 (H × W), patch size 16, ImageNet normalization on the [0, 255] scale.
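The figures above translate into a token count and a preprocessing step that can be sketched in a few lines. This is an illustration under stated assumptions, not the processor's code. The mean and std are the standard ImageNet statistics scaled by 255, the channel-first output layout is assumed, and `kv_heads_per_block` uses a hypothetical depth and head count and reads "KV-half" as half as many KV heads as query heads. `Sapiens2Processor` handles the real preprocessing.

```python
import numpy as np

H, W, PATCH = 1024, 768, 16

# Patch grid and sequence length: patches + 1 cls token + 8 storage tokens.
grid_h, grid_w = H // PATCH, W // PATCH  # 64 x 48
num_patches = grid_h * grid_w            # 3072
num_tokens = num_patches + 1 + 8         # 3081

# Standard ImageNet statistics expressed on the [0, 255] scale.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32) * 255.0
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32) * 255.0


def normalize(image_uint8):
    """(H, W, 3) uint8 RGB -> (3, H, W) float32 (channel-first is an assumption)."""
    x = (image_uint8.astype(np.float32) - MEAN) / STD
    return np.transpose(x, (2, 0, 1))


def kv_heads_per_block(num_blocks, num_heads, edge=8):
    """Full KV heads in the first/last `edge` blocks, half in between.

    num_blocks and num_heads are hypothetical; check the model config.
    """
    return [
        num_heads if (i < edge or i >= num_blocks - edge) else num_heads // 2
        for i in range(num_blocks)
    ]


x = normalize(np.full((H, W, 3), 128, dtype=np.uint8))
plan = kv_heads_per_block(num_blocks=24, num_heads=16)  # illustrative sizes only
print(num_tokens, x.shape, plan)
```

Keeping the statistics on the [0, 255] scale means pixel values are normalized directly, without first dividing the image by 255.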

See the mlx-vlm sapiens2 port for implementation details.

License

Weights released under the Sapiens2 License; this MLX repackaging inherits that license.

Citation

@article{khirodkarsapiens2,
  title  = {Sapiens2},
  author = {Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan
            and Su, Zhaoen and Saito, Shunsuke},
  journal= {arXiv preprint arXiv:2604.21681},
  year   = {2026}
}