# SAM3-LiteText

## Overview
SAM3-LiteText was proposed in *SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation* by Chengxi Zeng, Yuxuan Jiang, Ge Gao, Shuai Wang, Duolikun Danier, Bin Zhu, Stevan Rudinac, David Bull, and Fan Zhang.
SAM3-LiteText is a lightweight variant of SAM3 that replaces the heavy SAM3 text encoder (353M parameters) with a compact MobileCLIP-based text encoder optimized through knowledge distillation. The SAM3 ViT-H image encoder is kept intact. This reduces text encoder parameters by up to 88% while maintaining segmentation performance comparable to the original model.
The abstract from the paper is the following:
*Vision-language segmentation models such as SAM3 enable flexible, prompt-driven visual grounding, but inherit large, general-purpose text encoders originally designed for open-ended language understanding. In practice, segmentation prompts are short, structured, and semantically constrained, leading to substantial over-provisioning in text encoder capacity and persistent computational and memory overhead. In this paper, we perform a large-scale anatomical analysis of text prompting in vision-language segmentation, covering 404,796 real prompts across multiple benchmarks. Our analysis reveals severe redundancy: most context windows are underutilized, vocabulary usage is highly sparse, and text embeddings lie on a low-dimensional manifold despite high-dimensional representations. Motivated by these findings, we propose SAM3-LiteText, a lightweight text encoding framework that replaces the original SAM3 text encoder with a compact MobileCLIP student that is optimized by knowledge distillation. Extensive experiments on image and video segmentation benchmarks show that SAM3-LiteText reduces text encoder parameters by up to 88%, substantially reducing static memory footprint, while maintaining segmentation performance comparable to the original model.*
The text encoder architecture is based on MobileCLIP and comes in three variants:
| Variant | Text Encoder | Text Params | Reduction |
|---|---|---|---|
| SAM3-LiteText-S0-16 | MobileCLIP-S0 | 42.54M | ~88% |
| SAM3-LiteText-S1-16 | MobileCLIP-S1 | 63.53M | ~82% |
| SAM3-LiteText-L-16 | MobileCLIP2-L | 123.80M | ~65% |
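The reduction percentages in the table follow directly from comparing each variant's text encoder size against the 353M-parameter SAM3 text encoder. A minimal sketch of that arithmetic (the 353M baseline is taken from the overview above):

```python
# Sanity-check the parameter-reduction figures in the table above,
# assuming the original SAM3 text encoder has ~353M parameters.
SAM3_TEXT_PARAMS = 353.0  # millions of parameters

variants = {
    "SAM3-LiteText-S0-16": 42.54,
    "SAM3-LiteText-S1-16": 63.53,
    "SAM3-LiteText-L-16": 123.80,
}

for name, params in variants.items():
    reduction = 100 * (1 - params / SAM3_TEXT_PARAMS)
    print(f"{name}: {params:.2f}M text params, ~{reduction:.0f}% reduction")
```

This reproduces the ~88%, ~82%, and ~65% figures reported in the table.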
## Usage
SAM3-LiteText is a drop-in replacement for SAM3 with a lightweight text encoder. It uses the same processor and supports the same prompting interface. Refer to the SAM3 documentation for detailed usage examples including text prompts, box prompts, batched inference, and more.
```python
from io import BytesIO

import httpx
from PIL import Image

from transformers import AutoModel, AutoProcessor

# Load the SAM3-LiteText model and its (shared SAM3) processor
model = AutoModel.from_pretrained("Simon7108528/sam3-litetext-s0", device_map="auto")
processor = AutoProcessor.from_pretrained("Simon7108528/sam3-litetext-s0")

# Fetch an example image from the COCO validation set
image_url = "http://images.cocodataset.org/val2017/000000077595.jpg"
image = Image.open(BytesIO(httpx.get(image_url).content)).convert("RGB")

# Segment all instances matching the text prompt "ear"
inputs = processor(images=image, text="ear", return_tensors="pt").to(model.device)
outputs = model(**inputs)

results = processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,
    mask_threshold=0.5,
    target_sizes=inputs.get("original_sizes").tolist(),
)[0]
print(f"Found {len(results['masks'])} objects")
```
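To inspect the predictions visually, the returned masks can be converted to images. A minimal sketch, assuming each entry in `results['masks']` is a boolean per-pixel mask (tensor or array), as in the SAM3 post-processing API; `masks_to_images` is a hypothetical helper, not part of the library:

```python
import numpy as np
from PIL import Image


def masks_to_images(masks):
    """Convert boolean instance masks to grayscale PIL images (255 = object).

    `masks` is assumed to be an iterable of 2D boolean tensors/arrays, as
    returned by post_process_instance_segmentation in `results['masks']`.
    """
    images = []
    for mask in masks:
        arr = np.asarray(mask).astype(np.uint8) * 255  # True -> 255, False -> 0
        images.append(Image.fromarray(arr, mode="L"))
    return images
```

Each returned image can then be saved or overlaid on the input for inspection, e.g. `masks_to_images(results["masks"])[0].save("mask0.png")`.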