---
license: other
license_name: sapiens2-license
license_link: https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md
pipeline_tag: image-feature-extraction
library_name: transformers
base_model: facebook/sapiens2-pretrain-0.1b
tags:
  - sapiens
  - sapiens2
  - vision-transformer
  - human-centric
  - feature-extraction
  - onnx
  - onnxruntime-web
---

# Sapiens2-0.1B: ONNX Export

ONNX export of [facebook/sapiens2-pretrain-0.1b](https://huggingface.co/facebook/sapiens2-pretrain-0.1b), a vision transformer pretrained on **1 billion human images**, packaged for browser inference via `onnxruntime-web`.

| File | Size | Use |
|---|---|---|
| `sapiens2_0.1b_int8.onnx` | 116 MB | Browser (recommended) |
| `sapiens2_0.1b_fp32.onnx` | 458 MB | Server-side / higher precision |
| `example_embeddings.js` | - | Drop-in browser ES module |

**Output:** a `(batch, 768)` float32 tensor, one 768-dimensional CLS-token embedding per image.

---

## What are embeddings?

The model encodes an image into a 768-dimensional vector that captures human-centric semantics: pose, body shape, clothing, and identity. Two images with similar people in similar poses will have embeddings close together in this space. Common uses:

- **Similarity search**: find the most similar person/pose in a collection
- **Clustering**: group images by pose, clothing, or activity
- **Classification**: train a lightweight head on top of frozen embeddings
- **Retrieval**: image → nearest-neighbor lookup in a vector database
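
As a rough illustration of the similarity-search use case, the sketch below ranks a small collection against a query vector by cosine similarity. The names `cosine` and `topK` and the `{ id, embedding }` item shape are assumptions made for this example, not part of the model or of `example_embeddings.js`; toy 3-dimensional vectors stand in for real 768-dimensional embeddings.

```js
// Cosine similarity between two equal-length numeric arrays.
function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot   += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper: return the k items most similar to the query.
// `items` is an assumed shape: [{ id, embedding }].
function topK(query, items, k = 5) {
  return items
    .map(({ id, embedding }) => ({ id, score: cosine(query, embedding) }))
    .sort((x, y) => y.score - x.score) // highest similarity first
    .slice(0, k);
}

// Toy vectors; real embeddings would come from the model.
const items = [
  { id: "a", embedding: Float32Array.from([1, 0, 0]) },
  { id: "b", embedding: Float32Array.from([0, 1, 0]) },
  { id: "c", embedding: Float32Array.from([0.9, 0.1, 0]) },
];
const results = topK(Float32Array.from([1, 0, 0]), items, 2);
console.log(results.map(r => r.id)); // ids of the two closest items
```

A linear scan like this is usually adequate for a few thousand images held in memory; larger collections are where a vector database becomes worthwhile.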

---

## Browser quick start

```bash
npm install onnxruntime-web
```

```js
import * as ort from "onnxruntime-web";

// Point WASM binaries at the CDN build
ort.env.wasm.wasmPaths = "https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/";

const MODEL_URL =
  "https://huggingface.co/barakplasma/sapiens2-onnx/resolve/main/sapiens2_0.1b_int8.onnx";

const H = 1024, W = 768;
const MEAN = [0.485, 0.456, 0.406];
const STD  = [0.229, 0.224, 0.225];

// Load once; reuse for all images. ~1-2 s cold start.
export async function loadModel() {
  return ort.InferenceSession.create(MODEL_URL, {
    executionProviders: ["webgpu", "wasm"], // WebGPU ~1-3 s/img, WASM ~20-60 s/img
    graphOptimizationLevel: "all",
  });
}

// Accepts any <img>, <canvas>, ImageBitmap, or VideoFrame
function imageToTensor(source) {
  const canvas = document.createElement("canvas");
  canvas.width = W;
  canvas.height = H;
  const ctx = canvas.getContext("2d");
  ctx.drawImage(source, 0, 0, W, H); // stretches to W x H; aspect ratio is not preserved
  const { data } = ctx.getImageData(0, 0, W, H); // RGBA uint8

  const t = new Float32Array(3 * H * W);
  for (let i = 0; i < H * W; i++) {
    t[i]             = (data[i * 4]     / 255 - MEAN[0]) / STD[0]; // R
    t[H * W + i]     = (data[i * 4 + 1] / 255 - MEAN[1]) / STD[1]; // G
    t[2 * H * W + i] = (data[i * 4 + 2] / 255 - MEAN[2]) / STD[2]; // B
  }
  return new ort.Tensor("float32", t, [1, 3, H, W]);
}

// Returns a Float32Array of length 768
export async function embed(session, imageSource) {
  const { embedding } = await session.run({ pixel_values: imageToTensor(imageSource) });
  return embedding.data;
}

// Cosine similarity: 1 = identical direction, 0 = orthogonal, -1 = opposite
export function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot   += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

---

## Caching in IndexedDB

The INT8 model is 116 MB. After the first load, store it in IndexedDB so repeat
visits skip the download entirely:

```js
const DB_NAME = "sapiens2-onnx";
const STORE   = "models";

async function openDB() {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open(DB_NAME, 1);
    req.onupgradeneeded = () => req.result.createObjectStore(STORE);
    req.onsuccess = () => resolve(req.result);
    req.onerror   = () => reject(req.error);
  });
}

export async function loadModelCached(url = MODEL_URL) {
  const db = await openDB();

  const cached = await new Promise(res => {
    const req = db.transaction(STORE).objectStore(STORE).get(url);
    req.onsuccess = () => res(req.result ?? null);
    req.onerror   = () => res(null);
  });

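  const buf = cached ?? await fetch(url).then(r => {
    // Fail loudly so an error page is never cached as if it were the model
    if (!r.ok) throw new Error(`Model download failed: HTTP ${r.status}`);
    return r.arrayBuffer();
  }).then(buf => {
    db.transaction(STORE, "readwrite").objectStore(STORE).put(buf, url);
    return buf;
  });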

  return ort.InferenceSession.create(buf, {
    executionProviders: ["webgpu", "wasm"],
    graphOptimizationLevel: "all",
  });
}
```
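
Before caching, it may be worth checking that the origin has enough storage quota left. The sketch below is one way to do that, assuming the `{ quota, usage }` object returned by the browser's `navigator.storage.estimate()`. The helper name `hasRoomFor` and the `MODEL_BYTES` constant are made up for this example, and the byte count is only an approximation of the 116 MB figure.

```js
// Approximate INT8 model size in bytes (116 MB, rounded; an assumption for this sketch).
const MODEL_BYTES = 116 * 1e6;

// Hypothetical helper: does the origin have room to cache `bytes` more data?
// `estimate` has the shape returned by navigator.storage.estimate():
// { quota, usage }, both in bytes.
function hasRoomFor(estimate, bytes = MODEL_BYTES) {
  const { quota = 0, usage = 0 } = estimate;
  return quota - usage >= bytes;
}

// In a browser:
//   const ok = hasRoomFor(await navigator.storage.estimate());
console.log(hasRoomFor({ quota: 2e9, usage: 1e8 }));   // true
console.log(hasRoomFor({ quota: 1.5e8, usage: 1e8 })); // false
```

If the check fails, falling back to the uncached `loadModel()` path keeps the page working at the cost of a repeat download.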

---

## Full worked example

See [`example_embeddings.js`](./example_embeddings.js), a self-contained ES module
you can drop into any browser project. It exports `loadModelCached`, `embed`,
`cosineSimilarity`, `l2Normalize`, and `findMostSimilar`.

Usage example (assumes two `<input type="file">` elements and two `<img>` elements):

```html
<input type="file" id="fileA" accept="image/*">
<input type="file" id="fileB" accept="image/*">
<img id="imgA"> <img id="imgB">
<p id="result"></p>

<script type="module">
import { loadModelCached, embed, cosineSimilarity } from "./example_embeddings.js";

const session = await loadModelCached();

async function onFileChange(inputId, imgId) {
  const file = document.getElementById(inputId).files[0];
  if (!file) return undefined; // selection was cancelled; nothing to embed
  const img  = document.getElementById(imgId);
  img.src = URL.createObjectURL(file);
  await img.decode();
  return embed(session, img);
}

let embA, embB;
document.getElementById("fileA").onchange = async () => {
  embA = await onFileChange("fileA", "imgA");
  if (embA && embB) showSimilarity();
};
document.getElementById("fileB").onchange = async () => {
  embB = await onFileChange("fileB", "imgB");
  if (embA && embB) showSimilarity();
};

function showSimilarity() {
  const score = cosineSimilarity(embA, embB);
  document.getElementById("result").textContent =
    `Similarity: ${score.toFixed(4)}`;
}
</script>
```
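
The module's `l2Normalize` implementation is not reproduced in this README. As a minimal sketch of what L2 normalization does, and why it helps, the code below scales vectors to unit length, after which a plain dot product equals cosine similarity. The function bodies here are illustrative assumptions and may differ from the module's own code.

```js
// Illustrative only: scale a vector to unit L2 length.
function l2Normalize(v) {
  let sumSq = 0;
  for (let i = 0; i < v.length; i++) sumSq += v[i] * v[i];
  const norm = Math.sqrt(sumSq) || 1; // leave an all-zero vector unchanged
  const out = new Float32Array(v.length);
  for (let i = 0; i < v.length; i++) out[i] = v[i] / norm;
  return out;
}

// Plain dot product of two equal-length arrays.
function dot(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

const a = l2Normalize(Float32Array.from([3, 4])); // roughly [0.6, 0.8]
const b = l2Normalize(Float32Array.from([4, 3])); // roughly [0.8, 0.6]
console.log(dot(a, b)); // close to 0.96, the cosine similarity of [3, 4] and [4, 3]
```

Normalizing once at indexing time means each later comparison needs only the dot product, which saves two square roots per pair.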

---

## Preprocessing spec

Input must be resized to exactly **1024 × 768 (H × W)** and normalized with
ImageNet statistics before passing to the model:

```
mean   = [0.485, 0.456, 0.406]   # per channel, RGB order
std    = [0.229, 0.224, 0.225]
value  = (pixel_uint8 / 255 - mean) / std
layout = NCHW float32, shape (batch, 3, 1024, 768)
```
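
To make the spec concrete, here is a small worked sketch of the per-value normalization and of where a given channel, row, and column lands in the flat NCHW buffer for a single image. The helper names `normalize` and `nchwOffset` are assumptions for this example.

```js
const H = 1024, W = 768;
const MEAN = [0.485, 0.456, 0.406];
const STD  = [0.229, 0.224, 0.225];

// Normalize one 8-bit channel value; c is the channel index (0 = R, 1 = G, 2 = B).
function normalize(pixelUint8, c) {
  return (pixelUint8 / 255 - MEAN[c]) / STD[c];
}

// Flat offset of channel c, row y, column x in a single-image NCHW buffer.
function nchwOffset(c, y, x) {
  return c * H * W + y * W + x;
}

console.log(normalize(0, 0));     // darkest red value: -0.485 / 0.229
console.log(normalize(255, 0));   // brightest red value: 0.515 / 0.229
console.log(nchwOffset(1, 0, 0)); // 786432: green values start after all H * W red values
```

The offset formula matches the `t[i]`, `t[H * W + i]`, and `t[2 * H * W + i]` writes in the quick start, where `i = y * W + x`.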

---

## Browser requirements

| | Minimum | Recommended |
|---|---|---|
| **Browser** | Chrome/Edge 113+ | Chrome 120+ |
| **Execution provider** | WASM | WebGPU |
| **Free RAM** | 4 GB | 8 GB |
| **INT8 latency** | ~20–60 s (WASM) | ~1–3 s (WebGPU) |

WebGPU is available on Chrome/Edge 113+ desktop. Mobile is not viable at this resolution.
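
Because the latency gap between the two providers is so large, a page may want to detect WebGPU up front and warn users who will fall back to WASM. The sketch below checks for `navigator.gpu`, the WebGPU entry point. The function name `pickExecutionProviders` is an assumption, and it takes a navigator-like object as a parameter so it can be exercised outside a browser.

```js
// Hypothetical helper: choose execution providers from a navigator-like object.
// A missing `gpu` property means the browser does not expose WebGPU.
function pickExecutionProviders(nav) {
  return nav?.gpu ? ["webgpu", "wasm"] : ["wasm"];
}

// In a browser: pickExecutionProviders(navigator)
console.log(pickExecutionProviders({ gpu: {} })); // [ 'webgpu', 'wasm' ]
console.log(pickExecutionProviders({}));          // [ 'wasm' ]
```

The presence of `navigator.gpu` does not guarantee a usable adapter (`navigator.gpu.requestAdapter()` can resolve to `null`), so keeping `"wasm"` in the list as a fallback is still advisable.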

---

## Model details

| | |
|---|---|
| **Base model** | [facebook/sapiens2-pretrain-0.1b](https://huggingface.co/facebook/sapiens2-pretrain-0.1b) |
| **Architecture** | Vision Transformer (RoPE, GQA, SwiGLU, RMSNorm, QK-norm) |
| **Parameters** | 0.114 B |
| **FLOPs** | 0.342 T |
| **Embedding dim** | 768 |
| **Layers / heads** | 12 / 12 |
| **Input size** | 1024 × 768 (H × W) |
| **Patch size** | 16 px → 3,072 patch tokens |
| **Output** | CLS token: `(batch, 768)` float32 |
| **Pretraining data** | 1 billion curated human images |
| **ONNX opset** | 18 |
| **Quantization** | `quantize_dynamic`, QInt8 weights |
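
The patch-token count in the table follows directly from the input and patch sizes. The short calculation below reproduces it and also works out the size of one float32 input tensor; the variable names are chosen for this example only.

```js
const H = 1024, W = 768, PATCH = 16, CHANNELS = 3;

const patchRows   = H / PATCH;             // 64
const patchCols   = W / PATCH;             // 48
const patchTokens = patchRows * patchCols; // 3072

// float32 uses 4 bytes per value
const inputBytes = CHANNELS * H * W * 4;   // 9437184 bytes, i.e. 9 MiB per image

console.log({ patchRows, patchCols, patchTokens, inputMiB: inputBytes / 2 ** 20 });
```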

### Sapiens2 family

| Model | Params | Embed dim | Layers |
|---|---|---|---|
| **Sapiens2-0.1B** *(this)* | 0.114 B | 768 | 12 |
| [Sapiens2-0.4B](https://huggingface.co/facebook/sapiens2-pretrain-0.4b) | 0.398 B | 1024 | 24 |
| [Sapiens2-0.8B](https://huggingface.co/facebook/sapiens2-pretrain-0.8b) | 0.818 B | 1280 | 32 |
| [Sapiens2-1B](https://huggingface.co/facebook/sapiens2-pretrain-1b) | 1.462 B | 1536 | 40 |
| [Sapiens2-5B](https://huggingface.co/facebook/sapiens2-pretrain-5b) | 5.071 B | 2432 | 56 |

Only 0.1B is practical for browser inference. Larger variants require server-side deployment.

---

## License

Released under the [Sapiens2 License](https://github.com/facebookresearch/sapiens2/blob/main/LICENSE.md).

## Citation

```bibtex
@article{khirodkarsapiens2,
  title   = {Sapiens2},
  author  = {Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan and Su, Zhaoen and Saito, Shunsuke},
  journal = {arXiv preprint arXiv:2604.21681},
  year    = {2026}
}
```