---
license: other
license_name: sapiens2-license
license_link: https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md
library_name: mlx
tags:
  - mlx
  - sapiens2
  - vision
  - human-centric
  - seg
pipeline_tag: image-to-image
base_model:
  - facebook/sapiens2-seg-0.4b
---

# mlx-community/sapiens2-seg-0.4b-4bit

MLX port of [`facebook/sapiens2-seg-0.4b`](https://huggingface.co/facebook/sapiens2-seg-0.4b) at **4-bit affine (group_size=64)** precision, converted with [mlx-vlm](https://github.com/Blaizzy/mlx-vlm).

Sapiens2 is a family of human-centric ViTs pretrained on 1B human images. This
repo contains the **seg** (body-part segmentation) head paired with the
Sapiens2-0.4b backbone.

## Install

```bash
pip install -U mlx-vlm
```

## Usage — body-part segmentation (29 classes)

```python
from pathlib import Path
from PIL import Image
import numpy as np
from mlx_vlm.utils import load_model
from mlx_vlm.models.sapiens2.processing_sapiens2 import Sapiens2Processor
from mlx_vlm.models.sapiens2.generate import Sapiens2Predictor

model = load_model(Path("mlx-community/sapiens2-seg-0.4b-4bit"))
processor = Sapiens2Processor.from_pretrained("mlx-community/sapiens2-seg-0.4b-4bit")
predictor = Sapiens2Predictor(model, processor)

result = predictor.predict(Image.open("person.jpg"))
# result.mask        (orig_h, orig_w) int32 class indices
# result.seg_logits  (29, H_out, W_out) raw logits

print("active classes:", np.unique(result.mask).tolist())
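# Saves raw class indices (0-28) as pixel values, so the PNG looks nearly
# black in an image viewer; it is meant as a label map, not a visualization.
Image.fromarray(result.mask.astype(np.uint8)).save("mask.png")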
```

Output: a dense 29-class body-part segmentation map (DOME 29-class scheme:
face, hair, torso, arms and legs split left/right, etc.).
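
Because the mask stores small integer class indices, it is easier to inspect after mapping each index to a color. The following is a minimal sketch, not part of mlx-vlm: `make_palette` and `colorize_mask` are hypothetical helpers, the HSV palette is arbitrary, and treating index 0 as background (drawn black) is an assumption this card does not confirm.

```python
import colorsys
from typing import Optional

import numpy as np

NUM_CLASSES = 29  # class count stated in this card


def make_palette(num_classes: int = NUM_CLASSES) -> np.ndarray:
    """Return a (num_classes, 3) uint8 RGB palette.

    Index 0 stays black (assumed background); the remaining indices get
    evenly spaced hues. The colors are arbitrary, not an official scheme.
    """
    palette = np.zeros((num_classes, 3), dtype=np.uint8)
    for idx in range(1, num_classes):
        hue = (idx - 1) / (num_classes - 1)
        r, g, b = colorsys.hsv_to_rgb(hue, 0.85, 1.0)
        palette[idx] = (round(r * 255), round(g * 255), round(b * 255))
    return palette


def colorize_mask(mask, palette: Optional[np.ndarray] = None) -> np.ndarray:
    """Map an (H, W) integer class-index mask to an (H, W, 3) uint8 image."""
    if palette is None:
        palette = make_palette()
    mask = np.asarray(mask)
    if mask.min() < 0 or mask.max() >= len(palette):
        raise ValueError("mask contains class indices outside the palette")
    # Integer-array indexing looks up one RGB triple per pixel.
    return palette[mask]


# Tiny synthetic mask standing in for result.mask
demo_mask = np.array([[0, 1], [28, 1]], dtype=np.int32)
demo_rgb = colorize_mask(demo_mask)
```

With a real prediction, `Image.fromarray(colorize_mask(result.mask)).save("mask_color.png")` should write a viewable image, assuming `result.mask` converts to a NumPy integer array as the usage example suggests.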

## Convert your own checkpoint

```bash
# 1. Stage a float32 MLX directory from the Facebook checkpoint
python -m mlx_vlm.models.sapiens2.convert \
    --hf-repo facebook/sapiens2-seg-0.4b \
    --out ./sapiens2-seg-0.4b-fp32-mlx \
    --dtype float32

# 2. Quantize + upload via the main mlx_vlm.convert CLI
python -m mlx_vlm.convert \
    --hf-path  ./sapiens2-seg-0.4b-fp32-mlx \
    --mlx-path ./sapiens2-seg-0.4b-4bit \
    --quantize --q-bits 4 --q-group-size 64 --q-mode affine \
    --upload-repo mlx-community/sapiens2-seg-0.4b-4bit
```
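
To give a feel for what `--q-bits 4 --q-group-size 64 --q-mode affine` implies, here is a NumPy sketch of generic group-wise affine quantization: each group of 64 weights is stored as 4-bit integers plus one scale and one bias. This illustrates the general technique only; it is not MLX's kernel, and `affine_quantize` / `affine_dequantize` are hypothetical helper names.

```python
import numpy as np


def affine_quantize(w: np.ndarray, bits: int = 4, group_size: int = 64):
    """Quantize weights group by group; returns (codes, scale, bias)."""
    levels = 2**bits - 1  # 15 for 4-bit codes
    groups = w.reshape(-1, group_size).astype(np.float32)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    codes = np.clip(np.round((groups - w_min) / scale), 0, levels)
    return codes.astype(np.uint8), scale, w_min


def affine_dequantize(codes, scale, bias, shape):
    """Reconstruct approximate weights: w ~ codes * scale + bias."""
    return (codes.astype(np.float32) * scale + bias).reshape(shape)


rng = np.random.default_rng(0)
weights = rng.standard_normal((8, 128)).astype(np.float32)

codes, scale, bias = affine_quantize(weights)
restored = affine_dequantize(codes, scale, bias, weights.shape)

# Rounding to the nearest code bounds the error by half a step per group.
group_err = np.abs(weights.reshape(-1, 64) - restored.reshape(-1, 64))
within_bound = bool(np.all(group_err <= scale / 2 + 1e-5))
```

If each group's scale and bias were stored in 16 bits (an assumption about the storage format), the cost would be 4 + 32/64 = 4.5 bits per weight.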

## Architecture

Sapiens2 backbone: a ViT with 2-D RoPE (bf16 RoPE), partial GQA (full MHA in
the first and last 8 blocks, KV heads halved in the middle blocks), SwiGLU
FFN, and a cls token plus 8 storage tokens. Default input: **1024 × 768
(H × W)**, patch size 16, with ImageNet normalization applied on the
[0, 255] scale.
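
The input geometry above can be worked through numerically. The sketch below assumes the standard ImageNet mean and standard deviation multiplied by 255, and assumes the cls and storage tokens are concatenated with the patch tokens; neither detail is confirmed here, and `Sapiens2Processor` remains the authoritative preprocessing path.

```python
import numpy as np

# Values stated in this card
INPUT_H, INPUT_W, PATCH = 1024, 768, 16
NUM_EXTRA_TOKENS = 1 + 8  # cls + 8 storage tokens

# Standard ImageNet statistics scaled from [0, 1] to [0, 255] (assumed)
IMAGENET_MEAN = np.array([123.675, 116.28, 103.53], dtype=np.float32)
IMAGENET_STD = np.array([58.395, 57.12, 57.375], dtype=np.float32)


def normalize(image_rgb: np.ndarray) -> np.ndarray:
    """Normalize an (H, W, 3) RGB array whose values lie in [0, 255]."""
    return (image_rgb.astype(np.float32) - IMAGENET_MEAN) / IMAGENET_STD


grid_h, grid_w = INPUT_H // PATCH, INPUT_W // PATCH
num_patches = grid_h * grid_w
seq_len = num_patches + NUM_EXTRA_TOKENS

print(grid_h, grid_w, num_patches, seq_len)  # 64 48 3072 3081
```

Under these assumptions each image becomes a 64 × 48 grid of patches, so the backbone attends over roughly 3k tokens per image.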

See the [mlx-vlm sapiens2 port](https://github.com/Blaizzy/mlx-vlm/tree/main/mlx_vlm/models/sapiens2) for implementation details.

## License

Weights released under the [Sapiens2 License](https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md); this MLX repackaging inherits that license.

## Citation

```bibtex
@article{khirodkarsapiens2,
  title  = {Sapiens2},
  author = {Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan
            and Su, Zhaoen and Saito, Shunsuke},
  journal= {arXiv preprint arXiv:2604.21681},
  year   = {2026}
}
```