Mirror agentic-intelligence-lab/elephant-embeddings-v1-multimodal-small from ModelScope


	---
	license: apache-2.0
	library_name: transformers
	pipeline_tag: sentence-similarity
	language:
	- en
	tags:
	- agentic-intelligence-lab
	- elephant
	- embeddings
	- multimodal
	- retrieval
	- rag
	- agents
	- routing
	- image-text
	- audio-text
	- matryoshka
	- 2dmse
	model-index:
	- name: elephant-embeddings-v1-multimodal-small
	  results:
	  - task:
	      type: image-text-retrieval
	    dataset:
	      name: COCO
	      type: coco
	    metrics:
	    - name: Image-to-Text R@1
	      type: recall_at_1
	      value: 41.88
	    - name: Image-to-Text R@5
	      type: recall_at_5
	      value: 71.64
	    - name: Image-to-Text R@10
	      type: recall_at_10
	      value: 82.16
	  - task:
	      type: audio-text-retrieval
	    dataset:
	      name: LibriSpeech
	      type: librispeech
	    metrics:
	    - name: Audio-to-Text R@1
	      type: recall_at_1
	      value: 36.38
	    - name: Audio-to-Text R@5
	      type: recall_at_5
	      value: 68.22
	    - name: Audio-to-Text R@10
	      type: recall_at_10
	      value: 79.52
	---

	# Elephant Embeddings V1 Multimodal Small

	`elephant-embeddings-v1-multimodal-small` is the compact multimodal embedding model in the Agentic Intelligence Lab Elephant Embeddings V1 family.

	This ModelScope release is maintained by `agentic-intelligence-lab` to make Elephant embedding models easier to download and deploy in mainland China. It mirrors the upstream Hugging Face model `llm-semantic-router/multi-modal-embed-small` and renames it under a consistent Elephant model namespace.

	## Positioning

	This model is a lightweight multimodal embedding model for text, image, and audio retrieval. It is designed for deployments that need a shared multimodal semantic space but prefer a smaller and cheaper model than the large tri-encoder release.

	It is best suited for retrieval, routing, and similarity workloads rather than generative chat, captioning, or instruction following.

	## Model at a glance

	| Item | Value |
	| --- | --- |
	| Family | Elephant Embeddings V1 |
	| Maintainer | Agentic Intelligence Lab |
	| Model type | Multimodal embedding model |
	| Modalities | Text, image, audio |
	| Text encoder | `sentence-transformers/all-MiniLM-L6-v2` |
	| Image encoder | `google/siglip-base-patch16-512` |
	| Audio encoder | `openai/whisper-tiny` |
	| Fusion | 2-layer Transformer attention |
	| Embedding dimension | 384 |
	| Matryoshka dimensions | 384, 256, 128, 64, 32 |
	| Image resolution | 512×512 |
	| Audio input | Up to 30s, 16kHz |
	| Upstream source | `llm-semantic-router/multi-modal-embed-small` |
	| License | Apache 2.0 |

	## Why it fits agentic workloads

	Small multimodal embeddings are useful when an agent runtime needs frequent low-cost similarity checks over mixed content.

	Key advantages:

	- Shared multimodal space: compare text, screenshots/images, and short audio clips in one vector space.
	- Compact embedding size: 384-dimensional vectors are cheaper to store and search.
	- Dimension-adaptive retrieval: truncate vectors to 256d, 128d, 64d, or 32d for lower-cost indexes.
	- Practical modality encoders: combines lightweight text and audio encoders with a SigLIP image tower.
	- ONNX assets included: provides additional deployment artifacts for selected runtime paths.
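To make the storage trade-off concrete, the sketch below estimates the raw vector payload of an index at each Matryoshka dimension. It assumes float32 storage (4 bytes per value), a hypothetical corpus of one million vectors, and no index overhead; `index_size_mb` is an illustrative helper, not part of the model's API.

```python
# Back-of-the-envelope index size for each Matryoshka dimension.
# Assumptions (not from the model card): float32 storage, 1,000,000 vectors,
# and no index overhead such as graph links or metadata.
MATRYOSHKA_DIMS = (384, 256, 128, 64, 32)
BYTES_PER_FLOAT32 = 4


def index_size_mb(num_vectors: int, dim: int) -> float:
    """Return the raw vector payload in megabytes (1 MB = 10**6 bytes)."""
    return num_vectors * dim * BYTES_PER_FLOAT32 / 1_000_000


for dim in MATRYOSHKA_DIMS:
    print(f"{dim:>3}d: {index_size_mb(1_000_000, dim):8.1f} MB")
```

Under these assumptions the full 384d index takes about 1.5 GB and the 32d index about 128 MB, a 12x reduction. Retrieval quality at each truncated dimension should be measured on your own data before committing to it.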

	## Recommended use cases

	| Scenario | Example |
	| --- | --- |
	| Lightweight multimodal retrieval | Search captions, screenshots, and voice snippets together |
	| Agent route matching | Match user text or UI screenshots to tools and workflows |
	| Edge or cost-sensitive indexing | Use 384d or truncated vectors for lower storage cost |
	| Prototype multimodal memory | Build a small unified memory index before moving to the large model |
	| Image/audio semantic search | Retrieve text labels or notes from image/audio queries |
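Agent route matching usually reduces to a nearest-neighbor lookup over a handful of route embeddings. The following is a minimal sketch of that idea using toy 3-d vectors in place of real 384-d embeddings. The route names, the `match_route` helper, and the 0.5 threshold are all hypothetical and would need tuning on real traffic.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_route(query_vec, route_vecs, threshold=0.5):
    """Return (route_name, score) for the best route, or (None, score) if below threshold."""
    best_name, best_score = None, -1.0
    for name, vec in route_vecs.items():
        score = cosine(query_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Toy 3-d vectors stand in for real 384-d embeddings (hypothetical routes).
routes = {
    "billing_refund": [0.9, 0.1, 0.0],
    "login_support": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.1]
route, score = match_route(query, routes)
print(route, round(score, 3))
```

In a real deployment the route vectors would be embeddings of route descriptions and the query vector would come from `encode_text` or `encode_image`. Returning `None` below the threshold gives the agent a way to fall back to a default workflow instead of forcing a weak match.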

	## Quick start on ModelScope

	```bash
	pip install modelscope torch transformers pillow safetensors
	```

	```python
	import os

	import torch
	import torch.nn as nn
	import torch.nn.functional as F
	from modelscope import snapshot_download
	from transformers import (
	    AutoModel,
	    AutoTokenizer,
	    SiglipModel,
	    SiglipProcessor,
	    WhisperFeatureExtractor,
	    WhisperModel,
	)


	class MultiModalEmbedder(nn.Module):
	    def __init__(self):
	        super().__init__()
	        # Text tower: MiniLM with 384-d hidden states.
	        self.text_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
	        self.text_encoder = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

	        # Image tower: SigLIP vision model (768-d) projected down to 384-d.
	        self.image_processor = SiglipProcessor.from_pretrained("google/siglip-base-patch16-512")
	        self.image_encoder = SiglipModel.from_pretrained("google/siglip-base-patch16-512").vision_model
	        self.image_proj = nn.Linear(768, 384)

	        # Audio tower: Whisper-tiny encoder with 384-d hidden states.
	        self.audio_processor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
	        self.audio_encoder = WhisperModel.from_pretrained("openai/whisper-tiny").encoder

	    def encode_text(self, texts):
	        if isinstance(texts, str):
	            texts = [texts]
	        inputs = self.text_tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
	        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
	        outputs = self.text_encoder(**inputs)
	        # Mean-pool token states, then L2-normalize.
	        embeddings = outputs.last_hidden_state.mean(dim=1)
	        return F.normalize(embeddings, p=2, dim=-1)

	    def encode_image(self, images):
	        inputs = self.image_processor(images=images, return_tensors="pt")
	        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
	        outputs = self.image_encoder(**inputs)
	        embeddings = self.image_proj(outputs.pooler_output)
	        return F.normalize(embeddings, p=2, dim=-1)

	    def encode_audio(self, waveform):
	        # Expects a mono waveform sampled at 16 kHz.
	        if isinstance(waveform, torch.Tensor):
	            waveform = waveform.squeeze().cpu().numpy()
	        inputs = self.audio_processor(waveform, sampling_rate=16000, return_tensors="pt")
	        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
	        outputs = self.audio_encoder(**inputs)
	        embeddings = outputs.last_hidden_state.mean(dim=1)
	        return F.normalize(embeddings, p=2, dim=-1)


	repo_id = "agentic-intelligence-lab/elephant-embeddings-v1-multimodal-small"
	local_dir = snapshot_download(repo_id)

	model = MultiModalEmbedder()
	state_dict = torch.load(os.path.join(local_dir, "model.pt"), map_location="cpu", weights_only=False)

	# Copy each component's weights, stripping the checkpoint's key prefix.
	model.text_encoder.load_state_dict({
	    key.replace("text_encoder.encoder.", ""): value
	    for key, value in state_dict.items()
	    if key.startswith("text_encoder.encoder.")
	})
	model.image_encoder.load_state_dict({
	    key.replace("image_encoder.vision_encoder.", ""): value
	    for key, value in state_dict.items()
	    if key.startswith("image_encoder.vision_encoder.")
	})
	model.image_proj.load_state_dict({
	    key.replace("image_encoder.projection.", ""): value
	    for key, value in state_dict.items()
	    if key.startswith("image_encoder.projection.")
	})
	model.audio_encoder.load_state_dict({
	    key.replace("audio_encoder.encoder.", ""): value
	    for key, value in state_dict.items()
	    if key.startswith("audio_encoder.encoder.")
	})

	model.eval()

	texts = ["A refund request", "A screenshot of a login failure"]
	with torch.no_grad():  # inference only, no gradients needed
	    text_embeddings = model.encode_text(texts)
	print(text_embeddings.shape)  # [2, 384]
	```

	## Matryoshka truncation

	```python
	full_emb = model.encode_text("A billing support request") # [1, 384]

	emb_256 = F.normalize(full_emb[:, :256], p=2, dim=-1)
	emb_128 = F.normalize(full_emb[:, :128], p=2, dim=-1)
	emb_64 = F.normalize(full_emb[:, :64], p=2, dim=-1)
	```
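Truncation needs the renormalization step because a prefix of a unit vector generally has a norm below 1, so a plain dot product would no longer equal cosine similarity. For vectors that have already been exported from the model, for example in an indexing service without PyTorch, the same operation can be sketched in plain Python. `truncate_embedding` is a hypothetical helper, and the uniform vector below is a placeholder for a real model output.

```python
import math

SUPPORTED_DIMS = (384, 256, 128, 64, 32)  # Matryoshka dimensions listed in this card


def truncate_embedding(vec, dim):
    """Keep the first `dim` values and L2-renormalize the result."""
    if dim not in SUPPORTED_DIMS:
        raise ValueError(f"dim must be one of {SUPPORTED_DIMS}, got {dim}")
    if len(vec) < dim:
        raise ValueError(f"vector has {len(vec)} values, need at least {dim}")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]


# Placeholder unit vector standing in for a real 384-d model output.
full = [1.0 / math.sqrt(384)] * 384
small = truncate_embedding(full, 32)
print(len(small), round(sum(x * x for x in small), 6))
```

Queries and indexed documents must be truncated to the same dimension, since vectors of different lengths cannot be compared.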

	## Evaluation snapshot

	| Metric | Score |
	| --- | ---: |
	| COCO image-to-text R@1 | 41.88% |
	| COCO image-to-text R@5 | 71.64% |
	| COCO image-to-text R@10 | 82.16% |
	| LibriSpeech audio-to-text R@1 | 36.38% |
	| LibriSpeech audio-to-text R@5 | 68.22% |
	| LibriSpeech audio-to-text R@10 | 79.52% |
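For readers unfamiliar with the R@K columns, the sketch below shows the usual definition of recall at K: the fraction of queries whose correct candidate appears among the K highest-scoring results. This card does not describe the evaluation protocol, so the code is an illustration of the metric under the assumption of exactly one correct candidate per query, not a reproduction of the reported numbers. The scores are toy values.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose correct candidate appears in the top-k results.

    similarity[i][j] is the score between query i and candidate j;
    ground_truth[i] is the index of the single correct candidate for query i.
    """
    hits = 0
    for scores, target in zip(similarity, ground_truth):
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores: 3 queries (rows) against 4 candidates (columns).
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # correct candidate 0 is ranked first
    [0.4, 0.2, 0.8, 0.5],  # correct candidate 3 is ranked second
    [0.7, 0.6, 0.1, 0.3],  # correct candidate 2 is ranked last
]
ground_truth = [0, 3, 2]

for k in (1, 2, 4):
    print(f"R@{k} = {recall_at_k(similarity, ground_truth, k):.2%}")
```

Datasets with several valid candidates per query, as is common for image captions, need a set-membership check in place of the single `target` index.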

	## Files

	| File | Description |
	| --- | --- |
	| `model.pt` | PyTorch checkpoint |
	| `model.safetensors` | SafeTensors checkpoint |
	| `config.json` | Model component configuration |
	| `onnx/` | ONNX deployment assets |
	| `README.md` | This model card |
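The SafeTensors checkpoint can be read without unpickling, which avoids the `weights_only=False` load used for `model.pt`. The sketch below is one possible way to split such a checkpoint into per-component state dicts. It assumes `model.safetensors` uses the same key prefixes as `model.pt` (not confirmed by this card), and `split_by_component` and `load_safetensors_components` are hypothetical helpers.

```python
# Component prefixes taken from the quick-start loading code in this card.
COMPONENT_PREFIXES = {
    "text_encoder": "text_encoder.encoder.",
    "image_encoder": "image_encoder.vision_encoder.",
    "image_proj": "image_encoder.projection.",
    "audio_encoder": "audio_encoder.encoder.",
}


def split_by_component(state_dict):
    """Group checkpoint entries by component and strip the component prefix."""
    return {
        name: {
            key[len(prefix):]: value
            for key, value in state_dict.items()
            if key.startswith(prefix)
        }
        for name, prefix in COMPONENT_PREFIXES.items()
    }


def load_safetensors_components(path):
    """Read a SafeTensors file and split it by component.

    Assumption: the file uses the same key names as `model.pt`.
    """
    # Imported lazily so the helpers above work without safetensors installed.
    from safetensors.torch import load_file

    return split_by_component(load_file(path))


# Toy checkpoint with placeholder strings instead of tensors.
toy = {
    "text_encoder.encoder.embeddings.word_embeddings.weight": "w0",
    "image_encoder.projection.weight": "w1",
    "audio_encoder.encoder.conv1.weight": "w2",
}
parts = split_by_component(toy)
print({name: sorted(keys) for name, keys in parts.items()})
```

With a model built as in the quick start, each group could then be passed to the matching module, for example `model.image_proj.load_state_dict(parts["image_proj"])`.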

	## Lineage

	This ModelScope package is published by `agentic-intelligence-lab` as part of the Elephant model release line. It mirrors the upstream Hugging Face model `llm-semantic-router/multi-modal-embed-small`. The model artifacts are unchanged; only the repository name and the model card presentation differ.

	## Limitations

	- English is the primary supported language for this compact release.
	- Image inputs are designed around 512×512 preprocessing.
	- Audio inputs are intended for short clips up to about 30 seconds at 16kHz.
	- The model is optimized for retrieval, routing, and similarity, not generation or captioning.
	- For higher-quality multimodal retrieval, use `elephant-embeddings-v1-multimodal-large`.
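For recordings longer than the 30-second limit, one common workaround is to split the waveform into fixed windows and embed each window separately. The sketch below shows only the splitting step. It assumes mono audio already resampled to 16 kHz and uses non-overlapping windows; `chunk_waveform` is a hypothetical helper, and how to combine the per-window embeddings is not specified by this card.

```python
SAMPLE_RATE = 16_000                     # Hz, from the model card
MAX_SECONDS = 30                         # clip length limit from the model card
MAX_SAMPLES = SAMPLE_RATE * MAX_SECONDS  # 480,000 samples per window


def chunk_waveform(samples, max_samples=MAX_SAMPLES):
    """Split a 1-D sequence of samples into windows of at most max_samples."""
    if max_samples <= 0:
        raise ValueError("max_samples must be positive")
    return [samples[i:i + max_samples] for i in range(0, len(samples), max_samples)]


# 70 seconds of silence as a stand-in for a real mono recording.
waveform = [0.0] * (70 * SAMPLE_RATE)
chunks = chunk_waveform(waveform)
print([len(c) / SAMPLE_RATE for c in chunks])  # [30.0, 30.0, 10.0]
```

Each window could then be passed to `encode_audio` from the quick start. Very short trailing windows may carry little signal, so dropping or merging them is a reasonable variation.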

	## Citation

	```bibtex
	@misc{elephant-embeddings-v1-multimodal-small,
	title={Elephant Embeddings V1 Multimodal Small},
	author={Agentic Intelligence Lab},
	year={2026},
	url={https://modelscope.cn/models/agentic-intelligence-lab/elephant-embeddings-v1-multimodal-small}
	}
	```

	## License

	Apache 2.0