flyingbertman committed on
Commit
e5c56fc
·
verified ·
1 Parent(s): be5d373

Update README.md

  1. README.md +67 -0
README.md CHANGED
@@ -1,3 +1,70 @@
---
license: apache-2.0
library_name: onnx
tags:
- segmentation
- sam
- mobile-sam
- onnx
base_model:
- facebook/sam-vit-base
- ChaoningZhang/MobileSAM
pipeline_tag: image-segmentation
---

# Segment Anything: Mobile Encoder + SAM Mask Decoder Bundle (ONNX)

ONNX bundle pairing the [MobileSAM](https://github.com/ChaoningZhang/MobileSAM) ViT-T image encoder with Meta's original SAM mask decoder. Together they form a complete Segment Anything pipeline at **~60 MB total**, roughly 40× smaller than the original SAM ViT-H setup (~2.4 GB), with comparable accuracy on common segmentation tasks.

These files were not converted locally: they are the upstream-published ONNX checkpoints, bundled here for distribution stability.

Credit: Meta (SAM) and Kyung Hee University / Chaoning Zhang et al. (MobileSAM).

## What this repo contains

| File | Size | Role |
|---|---|---|
| `mobile_sam_image_encoder.onnx` | ~27 MB | ViT-T (Tiny) image encoder; replaces SAM's ViT-H |
| `sam_mask_decoder_single.onnx` | ~16 MB | Mask decoder, single-mask output mode |
| `sam_mask_decoder_multi.onnx` | ~16 MB | Mask decoder, multi-mask output mode (returns top-3 masks per prompt) |
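
Before wiring these files into a pipeline, it can help to confirm the tensor names and shapes each model actually declares. The sketch below uses ONNX Runtime's `get_inputs()` / `get_outputs()` session methods; the helper name `describe_io` is hypothetical and not part of this repo.

```python
def describe_io(session):
    """Return (name, shape, type) tuples for a session's inputs and outputs.

    `session` is expected to behave like an onnxruntime.InferenceSession,
    i.e. to expose get_inputs() and get_outputs().
    """
    inputs = [(arg.name, arg.shape, arg.type) for arg in session.get_inputs()]
    outputs = [(arg.name, arg.shape, arg.type) for arg in session.get_outputs()]
    return {"inputs": inputs, "outputs": outputs}


# Typical use (requires onnxruntime and the model file on disk):
#   import onnxruntime as ort
#   session = ort.InferenceSession("mobile_sam_image_encoder.onnx")
#   for name, shape, dtype in describe_io(session)["inputs"]:
#       print(name, shape, dtype)
```

Running this against all three files is a quick way to check the input names used elsewhere in this README against what the checkpoints really expose.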

The encoder runs once per image to produce a dense image embedding (a 256-channel 64×64 feature map, shape `[1, 256, 64, 64]`). The mask decoder then runs cheaply per prompt (a click, box, or mask hint) against that embedding to produce segmentation masks. This split is what makes interactive segmentation tractable: only the decoder runs in the per-prompt loop.
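
The encode-once, decode-many split can be sketched as a small cache around the encoder. This is a minimal illustration, not code from this repo: `EmbeddingCache`, `segment_prompts`, `encode_fn`, and `decode_fn` are hypothetical names, and the two callables stand in for whatever wraps the encoder and decoder sessions.

```python
class EmbeddingCache:
    """Runs the (expensive) encoder once per image and reuses the result."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn   # hypothetical callable: image -> embedding
        self._cache = {}              # image_id -> embedding
        self.encoder_calls = 0        # counts how often the encoder really ran

    def get(self, image_id, image):
        # Only encode on a cache miss; later prompts reuse the stored embedding.
        if image_id not in self._cache:
            self._cache[image_id] = self._encode_fn(image)
            self.encoder_calls += 1
        return self._cache[image_id]


def segment_prompts(cache, decode_fn, image_id, image, prompts):
    """Decode one result per prompt against a single cached embedding."""
    embedding = cache.get(image_id, image)
    return [decode_fn(embedding, prompt) for prompt in prompts]
```

In an interactive tool, each new click would call `segment_prompts` with the same `image_id`, so only the decoder cost is paid per click.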

## How to use

```python
import onnxruntime as ort
import numpy as np

# 1. Encode the image once.
# `preprocessed_image` must already be in the layout and dtype the encoder expects.
encoder = ort.InferenceSession("mobile_sam_image_encoder.onnx")
img_embedding = encoder.run(None, {"input_image": preprocessed_image})[0]
# img_embedding shape: [1, 256, 64, 64]

# 2. Decode masks per prompt (a single point click here).
decoder = ort.InferenceSession("sam_mask_decoder_single.onnx")
point_coords = np.array([[[500, 375]]], dtype=np.float32)  # one click, (x, y)
point_labels = np.array([[1]], dtype=np.float32)           # 1 = positive

# `img_h` and `img_w` are the original image height and width in pixels.
masks, iou_predictions, low_res_masks = decoder.run(None, {
    "image_embeddings": img_embedding,
    "point_coords": point_coords,
    "point_labels": point_labels,
    "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),
    "has_mask_input": np.array([0], dtype=np.float32),
    "orig_im_size": np.array([img_h, img_w], dtype=np.float32),
})
```
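
The click in the example above is given as raw pixel coordinates. In Meta's reference ONNX export of the SAM decoder, prompt coordinates are expected in the frame of the resized image (longest side scaled to 1024), and a padding point at (0, 0) with label -1 is appended when no box prompt is supplied. Whether the files bundled here follow the same convention is an assumption worth verifying; the helper names below are hypothetical.

```python
def resized_shape(orig_h, orig_w, long_side=1024):
    """Height and width after scaling the longest side to `long_side`."""
    scale = long_side / max(orig_h, orig_w)
    return int(orig_h * scale + 0.5), int(orig_w * scale + 0.5)


def to_model_coords(points_xy, orig_h, orig_w, long_side=1024):
    """Map (x, y) pixel clicks from the original image to the resized frame."""
    new_h, new_w = resized_shape(orig_h, orig_w, long_side)
    return [(x * new_w / orig_w, y * new_h / orig_h) for x, y in points_xy]


def with_padding_point(points_xy, labels):
    """Append the (0, 0) point with label -1 used when no box prompt is given."""
    return list(points_xy) + [(0.0, 0.0)], list(labels) + [-1]
```

The resulting lists can be turned into the decoder inputs with `np.array(points, dtype=np.float32)[None, :, :]` and `np.array(labels, dtype=np.float32)[None, :]`.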

Use `sam_mask_decoder_single.onnx` when you want the single best mask for a prompt (the typical interactive UX). Use `sam_mask_decoder_multi.onnx` when you want the top-3 candidate masks, which is useful when the prompt is ambiguous (e.g., a click on a person's shirt could yield a mask of the shirt, the torso, or the whole person).
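
With the multi-mask decoder, one common way to choose among candidates is to take the mask with the highest predicted IoU. The sketch below assumes the decoder returns one IoU score per candidate and that mask values are logits thresholded at 0.0, as in Meta's reference SAM code; both are assumptions to check against these files, and the function names are hypothetical.

```python
def pick_best_mask(iou_scores):
    """Return the index of the candidate with the highest predicted IoU."""
    return max(range(len(iou_scores)), key=lambda i: iou_scores[i])


def binarize(mask_logits, threshold=0.0):
    """Turn a 2-D grid of mask logits into booleans (True = inside the mask)."""
    return [[value > threshold for value in row] for row in mask_logits]
```

With NumPy outputs, `pick_best_mask(iou_predictions[0])` selects the candidate index, and `masks[0, index] > 0.0` is the array equivalent of `binarize`.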

## Why MobileSAM over original SAM

- **Roughly 60× fewer encoder parameters**: ViT-T (~10M params) vs SAM ViT-H (~636M params).
- **~5× faster inference** on CPU; near-real-time even without a GPU.
- **Quality is comparable** on common segmentation benchmarks: MobileSAM was distilled from SAM ViT-H specifically to preserve quality.
- Choose original SAM ViT-H/L only when maximum accuracy on edge cases matters more than speed or size.

## License

**Apache-2.0**, the same as both upstream projects. A `LICENSE` file is included.