Image Feature Extraction
Transformers
ONNX
sapiens
sapiens2
vision-transformer
human-centric
feature-extraction
onnxruntime-web
Instructions to use barakplasma/sapiens2-onnx with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use barakplasma/sapiens2-onnx with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("image-feature-extraction", model="barakplasma/sapiens2-onnx")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("barakplasma/sapiens2-onnx", dtype="auto") - sapiens
How to use barakplasma/sapiens2-onnx with sapiens:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- sapiens2
How to use barakplasma/sapiens2-onnx with sapiens2:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- Notebooks
- Google Colab
- Kaggle
File size: 8,242 Bytes
95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 1b4a282 95c2945 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 | ---
license: other
license_name: sapiens2-license
license_link: https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md
pipeline_tag: image-feature-extraction
library_name: transformers
base_model: facebook/sapiens2-pretrain-0.1b
tags:
- sapiens
- sapiens2
- vision-transformer
- human-centric
- feature-extraction
- onnx
- onnxruntime-web
---
# Sapiens2-0.1B β ONNX Export
ONNX export of [facebook/sapiens2-pretrain-0.1b](https://huggingface.co/facebook/sapiens2-pretrain-0.1b), a vision transformer pretrained on **1 billion human images**, packaged for browser inference via `onnxruntime-web`.
| File | Size | Use |
|---|---|---|
| `sapiens2_0.1b_int8.onnx` | 116 MB | Browser (recommended) |
| `sapiens2_0.1b_fp32.onnx` | 458 MB | Server-side / higher precision |
| `example_embeddings.js` | β | Drop-in browser ES module |
**Output:** a `(batch, 768)` float32 vector per image (CLS token).
---
## What are embeddings?
The model encodes an image into a 768-dimensional vector that captures human-centric semantics β pose, body shape, clothing, and identity. Two images with similar people in similar poses will have embeddings close together in this space. Common uses:
- **Similarity search** β find the most similar person/pose in a collection
- **Clustering** β group images by pose, clothing, or activity
- **Classification** β train a lightweight head on top of frozen embeddings
- **Retrieval** β image β nearest-neighbor lookup in a vector database
---
## Browser quick start
```bash
npm install onnxruntime-web
```
```js
import * as ort from "onnxruntime-web";
// Point WASM binaries at the CDN build
ort.env.wasm.wasmPaths = "https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/";
const MODEL_URL =
"https://huggingface.co/barakplasma/sapiens2-onnx/resolve/main/sapiens2_0.1b_int8.onnx";
const H = 1024, W = 768;
const MEAN = [0.485, 0.456, 0.406];
const STD = [0.229, 0.224, 0.225];
// Load once; reuse for all images. ~1-2 s cold start.
export async function loadModel() {
return ort.InferenceSession.create(MODEL_URL, {
executionProviders: ["webgpu", "wasm"], // WebGPU ~1-3 s/img, WASM ~20-60 s/img
graphOptimizationLevel: "all",
});
}
// Accepts any <img>, <canvas>, ImageBitmap, or VideoFrame
function imageToTensor(source) {
const canvas = document.createElement("canvas");
canvas.width = W;
canvas.height = H;
canvas.getContext("2d").drawImage(source, 0, 0, W, H);
const { data } = canvas.getContext("2d").getImageData(0, 0, W, H); // RGBA uint8
const t = new Float32Array(3 * H * W);
for (let i = 0; i < H * W; i++) {
t[i] = (data[i * 4] / 255 - MEAN[0]) / STD[0]; // R
t[H * W + i] = (data[i * 4 + 1] / 255 - MEAN[1]) / STD[1]; // G
t[2 * H * W + i] = (data[i * 4 + 2] / 255 - MEAN[2]) / STD[2]; // B
}
return new ort.Tensor("float32", t, [1, 3, H, W]);
}
// Returns a Float32Array of length 768
export async function embed(session, imageSource) {
const { embedding } = await session.run({ pixel_values: imageToTensor(imageSource) });
return embedding.data;
}
// Cosine similarity: 1 = identical direction, 0 = orthogonal, -1 = opposite
export function cosineSimilarity(a, b) {
let dot = 0, normA = 0, normB = 0;
for (let i = 0; i < a.length; i++) {
dot += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```
---
## Caching in IndexedDB
The INT8 model is 116 MB. After the first load, store it in IndexedDB so repeat
visits skip the download entirely:
```js
const DB_NAME = "sapiens2-onnx";
const STORE = "models";
async function openDB() {
return new Promise((resolve, reject) => {
const req = indexedDB.open(DB_NAME, 1);
req.onupgradeneeded = () => req.result.createObjectStore(STORE);
req.onsuccess = () => resolve(req.result);
req.onerror = () => reject(req.error);
});
}
export async function loadModelCached(url = MODEL_URL) {
const db = await openDB();
const cached = await new Promise(res => {
const req = db.transaction(STORE).objectStore(STORE).get(url);
req.onsuccess = () => res(req.result ?? null);
req.onerror = () => res(null);
});
const buf = cached ?? await fetch(url).then(r => r.arrayBuffer()).then(buf => {
db.transaction(STORE, "readwrite").objectStore(STORE).put(buf, url);
return buf;
});
return ort.InferenceSession.create(buf, {
executionProviders: ["webgpu", "wasm"],
graphOptimizationLevel: "all",
});
}
```
---
## Full worked example
See [`example_embeddings.js`](./example_embeddings.js) β a self-contained ES module
you can drop into any browser project. It exports `loadModelCached`, `embed`,
`cosineSimilarity`, `l2Normalize`, and `findMostSimilar`.
Usage example (assumes an `<input type="file">` and two `<img>` elements):
```html
<input type="file" id="fileA" accept="image/*">
<input type="file" id="fileB" accept="image/*">
<img id="imgA"> <img id="imgB">
<p id="result"></p>
<script type="module">
import { loadModelCached, embed, cosineSimilarity } from "./example_embeddings.js";
const session = await loadModelCached();
async function onFileChange(inputId, imgId) {
const file = document.getElementById(inputId).files[0];
const img = document.getElementById(imgId);
img.src = URL.createObjectURL(file);
await img.decode();
return embed(session, img);
}
let embA, embB;
document.getElementById("fileA").onchange = async () => {
embA = await onFileChange("fileA", "imgA");
if (embA && embB) showSimilarity();
};
document.getElementById("fileB").onchange = async () => {
embB = await onFileChange("fileB", "imgB");
if (embA && embB) showSimilarity();
};
function showSimilarity() {
const score = cosineSimilarity(embA, embB);
document.getElementById("result").textContent =
`Similarity: ${score.toFixed(4)}`;
}
</script>
```
---
## Preprocessing spec
Input must be resized to exactly **1024 Γ 768 (H Γ W)** and normalized with
ImageNet statistics before passing to the model:
```
mean = [0.485, 0.456, 0.406] # per channel, RGB order
std = [0.229, 0.224, 0.225]
value = (pixel_uint8 / 255 β mean) / std
layout = NCHW float32 β shape (batch, 3, 1024, 768)
```
---
## Browser requirements
| | Minimum | Recommended |
|---|---|---|
| **Browser** | Chrome/Edge 113+ | Chrome 120+ |
| **Execution provider** | WASM | WebGPU |
| **Free RAM** | 4 GB | 8 GB |
| **INT8 latency** | ~20β60 s (WASM) | ~1β3 s (WebGPU) |
WebGPU is available on Chrome/Edge 113+ desktop. Mobile is not viable at this resolution.
---
## Model details
| | |
|---|---|
| **Base model** | [facebook/sapiens2-pretrain-0.1b](https://huggingface.co/facebook/sapiens2-pretrain-0.1b) |
| **Architecture** | Vision Transformer (RoPE, GQA, SwiGLU, RMSNorm, QK-norm) |
| **Parameters** | 0.114 B |
| **FLOPs** | 0.342 T |
| **Embedding dim** | 768 |
| **Layers / heads** | 12 / 12 |
| **Input size** | 1024 Γ 768 (H Γ W) |
| **Patch size** | 16 px β 3,072 patch tokens |
| **Output** | CLS token: `(batch, 768)` float32 |
| **Pretraining data** | 1 billion curated human images |
| **ONNX opset** | 18 |
| **Quantization** | `quantize_dynamic`, QInt8 weights |
### Sapiens2 family
| Model | Params | Embed dim | Layers |
|---|---|---|---|
| **Sapiens2-0.1B** *(this)* | 0.114 B | 768 | 12 |
| [Sapiens2-0.4B](https://huggingface.co/facebook/sapiens2-pretrain-0.4b) | 0.398 B | 1024 | 24 |
| [Sapiens2-0.8B](https://huggingface.co/facebook/sapiens2-pretrain-0.8b) | 0.818 B | 1280 | 32 |
| [Sapiens2-1B](https://huggingface.co/facebook/sapiens2-pretrain-1b) | 1.462 B | 1536 | 40 |
| [Sapiens2-5B](https://huggingface.co/facebook/sapiens2-pretrain-5b) | 5.071 B | 2432 | 56 |
Only 0.1B is practical for browser inference. Larger variants require server-side deployment.
---
## License
Released under the [Sapiens2 License](https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md).
## Citation
```bibtex
@article{khirodkarsapiens2,
title = {Sapiens2},
author = {Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan and Su, Zhaoen and Saito, Shunsuke},
journal = {arXiv preprint arXiv:2604.21681},
year = {2026}
}
```
|