---
license: other
license_name: sapiens2-license
license_link: https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md
library_name: mlx
tags:
- mlx
- sapiens2
- vision
- human-centric
- seg
pipeline_tag: image-to-image
base_model:
- facebook/sapiens2-seg-0.4b
---
# mlx-community/sapiens2-seg-0.4b-4bit
MLX port of [`facebook/sapiens2-seg-0.4b`](https://huggingface.co/facebook/sapiens2-seg-0.4b) at **4-bit affine (group_size=64)** precision, converted with [mlx-vlm](https://github.com/Blaizzy/mlx-vlm).
Sapiens2 is a family of human-centric vision transformers (ViTs) pretrained on 1B human images. This
repo contains the **seg** (body-part segmentation) head paired with the Sapiens2-0.4b backbone.
## Install
```bash
pip install -U mlx-vlm
```
## Usage — body-part segmentation (29 classes)
```python
from pathlib import Path
from PIL import Image
import numpy as np
from mlx_vlm.utils import load_model
from mlx_vlm.models.sapiens2.processing_sapiens2 import Sapiens2Processor
from mlx_vlm.models.sapiens2.generate import Sapiens2Predictor
# Load the MLX weights and the matching image processor
model = load_model(Path("mlx-community/sapiens2-seg-0.4b-4bit"))
processor = Sapiens2Processor.from_pretrained("mlx-community/sapiens2-seg-0.4b-4bit")
predictor = Sapiens2Predictor(model, processor)

# Run inference on a single image
result = predictor.predict(Image.open("person.jpg"))
# result.mask        (orig_h, orig_w)     int32 class indices
# result.seg_logits  (29, H_out, W_out)   raw logits
print("active classes:", np.unique(result.mask).tolist())

# Save the raw class indices as an 8-bit PNG (values 0-28, so it looks almost black)
Image.fromarray(result.mask.astype(np.uint8)).save("mask.png")
```
Output: a dense 29-class body-part segmentation map following the DOME 29-class
scheme (face, hair, torso, left/right arms and legs, etc.).
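Because the saved mask holds raw class indices, it is hard to inspect by eye. The sketch below shows one way to colorize such a mask and to measure how much of the image each class covers. It uses only NumPy and a synthetic mask; `make_palette`, `colorize`, and `class_fractions` are hypothetical helpers (not part of mlx-vlm), the palette is random rather than an official color scheme, and treating class 0 as background is an assumption.

```python
import numpy as np

NUM_CLASSES = 29  # class count from this model card; class names are not assumed


def make_palette(num_classes: int = NUM_CLASSES, seed: int = 0) -> np.ndarray:
    """Return a (num_classes, 3) uint8 palette of reproducible random colors."""
    rng = np.random.default_rng(seed)
    palette = rng.integers(0, 256, size=(num_classes, 3), dtype=np.uint8)
    palette[0] = 0  # assumption: class 0 is background, drawn in black
    return palette


def colorize(mask: np.ndarray, palette: np.ndarray) -> np.ndarray:
    """Map an (H, W) integer mask to an (H, W, 3) uint8 RGB image."""
    return palette[np.asarray(mask, dtype=np.int64)]


def class_fractions(mask: np.ndarray, num_classes: int = NUM_CLASSES) -> np.ndarray:
    """Return the fraction of pixels assigned to each class index."""
    counts = np.bincount(np.asarray(mask).ravel(), minlength=num_classes)
    return counts / counts.sum()


# Synthetic stand-in for result.mask: a 4 x 6 image with a block of class 3
demo_mask = np.zeros((4, 6), dtype=np.int32)
demo_mask[1:3, 2:5] = 3

rgb = colorize(demo_mask, make_palette())
fractions = class_fractions(demo_mask)
print(rgb.shape, fractions[3])
```

With a real prediction, passing `np.asarray(result.mask)` instead of `demo_mask` should work as long as the mask converts to an integer NumPy array; the resulting `rgb` array can be saved with `Image.fromarray(rgb)`.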
## Convert your own checkpoint
```bash
# 1. Stage a float32 MLX directory from the Facebook checkpoint
python -m mlx_vlm.models.sapiens2.convert \
--hf-repo facebook/sapiens2-seg-0.4b \
--out ./sapiens2-seg-0.4b-fp32-mlx \
--dtype float32
# 2. Quantize + upload via the main mlx_vlm.convert CLI
python -m mlx_vlm.convert \
--hf-path ./sapiens2-seg-0.4b-fp32-mlx \
--mlx-path ./sapiens2-seg-0.4b-4bit \
--quantize --q-bits 4 --q-group-size 64 --q-mode affine \
--upload-repo mlx-community/sapiens2-seg-0.4b-4bit
```
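To give a feel for what `--q-bits 4 --q-group-size 64 --q-mode affine` means, here is a minimal NumPy sketch of group-wise affine quantization: each run of 64 weights shares one scale and one bias, and every weight is stored as a 4-bit code. This illustrates the general technique only; it is not MLX's actual kernel, `quantize_affine` and `dequantize_affine` are hypothetical names, and the storage estimate assumes 16-bit scales and biases.

```python
import numpy as np


def quantize_affine(w: np.ndarray, bits: int = 4, group_size: int = 64):
    """Quantize a 2-D float array in groups of `group_size` along the last axis."""
    rows, cols = w.shape
    assert cols % group_size == 0, "last axis must be divisible by group_size"
    groups = w.reshape(rows, cols // group_size, group_size).astype(np.float32)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    levels = 2**bits - 1                       # 15 steps for 4-bit codes
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)   # avoid dividing by zero on constant groups
    q = np.clip(np.round((groups - w_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, w_min                     # w_min plays the role of the bias


def dequantize_affine(q, scale, bias, shape):
    """Reconstruct approximate weights as q * scale + bias."""
    return (q.astype(np.float32) * scale + bias).reshape(shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((8, 128)).astype(np.float32)

q, scale, bias = quantize_affine(w)
w_hat = dequantize_affine(q, scale, bias, w.shape)
max_err = float(np.abs(w - w_hat).max())

# Assumed storage cost: 4-bit codes plus one 16-bit scale and bias per 64 weights
bits_per_weight = 4 + 2 * 16 / 64
print(max_err, bits_per_weight)
```

The rounding error of each weight is bounded by half of its group's scale, which is why smaller groups tend to give better fidelity at the cost of more per-group metadata.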
## Architecture
Sapiens2 backbone: a ViT with 2-D RoPE (RoPE in bf16), partial GQA (full MHA in the
first and last 8 blocks, halved KV heads in the middle blocks), SwiGLU FFN, and a cls
token plus 8 storage tokens. Default input: **1024 × 768 (H × W)** with patch size 16,
normalized with the ImageNet mean and standard deviation expressed on the [0, 255] scale.
See the [mlx-vlm sapiens2 port](https://github.com/Blaizzy/mlx-vlm/tree/main/mlx_vlm/models/sapiens2) for implementation details.
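The default input geometry and normalization above can be worked through numerically. The sketch below derives the patch grid and a token count from the stated input size and patch size, and applies ImageNet normalization on the [0, 255] scale. It assumes the cls and storage tokens are simply added to the patch tokens in one sequence, uses the standard ImageNet mean/std values, and works in (H, W, 3) layout; `token_count` and `normalize` are hypothetical helpers, and the actual processor in mlx-vlm may differ in layout and resizing.

```python
import numpy as np

INPUT_H, INPUT_W, PATCH = 1024, 768, 16  # defaults stated in this model card
NUM_EXTRA_TOKENS = 1 + 8                 # cls token + 8 storage tokens

# Standard ImageNet statistics, scaled from the [0, 1] range to [0, 255]
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32) * 255.0
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32) * 255.0


def token_count(h: int = INPUT_H, w: int = INPUT_W, patch: int = PATCH):
    """Return (grid_h, grid_w, total tokens), assuming one token per patch."""
    grid_h, grid_w = h // patch, w // patch
    return grid_h, grid_w, grid_h * grid_w + NUM_EXTRA_TOKENS


def normalize(image_uint8: np.ndarray) -> np.ndarray:
    """Normalize an (H, W, 3) uint8 RGB image without first dividing by 255."""
    return (image_uint8.astype(np.float32) - IMAGENET_MEAN) / IMAGENET_STD


grid_h, grid_w, n_tokens = token_count()
dummy = np.full((INPUT_H, INPUT_W, 3), 128, dtype=np.uint8)  # flat gray test image
normalized = normalize(dummy)
print(grid_h, grid_w, n_tokens, normalized.shape)
```

Under these assumptions the patch grid is 64 × 48, giving 3072 patch tokens plus 9 extra tokens.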
## License
Weights released under the [Sapiens2 License](https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md); this MLX repackaging inherits that license.
## Citation
```bibtex
@article{khirodkarsapiens2,
  title   = {Sapiens2},
  author  = {Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan
             and Su, Zhaoen and Saito, Shunsuke},
  journal = {arXiv preprint arXiv:2604.21681},
  year    = {2026}
}
```