---
license: apache-2.0
library_name: transformers
pipeline_tag: sentence-similarity
language:
- en
tags:
- agentic-intelligence-lab
- elephant
- embeddings
- multimodal
- retrieval
- rag
- agents
- routing
- image-text
- audio-text
- matryoshka
- 2dmse
model-index:
- name: elephant-embeddings-v1-multimodal-small
  results:
  - task:
      type: image-text-retrieval
    dataset:
      name: COCO
      type: coco
    metrics:
    - name: Image-to-Text R@1
      type: recall_at_1
      value: 41.88
    - name: Image-to-Text R@5
      type: recall_at_5
      value: 71.64
    - name: Image-to-Text R@10
      type: recall_at_10
      value: 82.16
  - task:
      type: audio-text-retrieval
    dataset:
      name: LibriSpeech
      type: librispeech
    metrics:
    - name: Audio-to-Text R@1
      type: recall_at_1
      value: 36.38
    - name: Audio-to-Text R@5
      type: recall_at_5
      value: 68.22
    - name: Audio-to-Text R@10
      type: recall_at_10
      value: 79.52
---

# Elephant Embeddings V1 Multimodal Small

`elephant-embeddings-v1-multimodal-small` is the compact multimodal embedding model in the **Agentic Intelligence Lab Elephant Embeddings V1** family.

This ModelScope release is maintained by `agentic-intelligence-lab` to make Elephant embedding models easier to download and deploy in mainland China. It mirrors and renames the upstream HuggingFace model `llm-semantic-router/multi-modal-embed-small` under a consistent Elephant model namespace.

## Positioning

This model is a lightweight multimodal embedding model for text, image, and audio retrieval. It is designed for deployments that need a shared multimodal semantic space but prefer a smaller and cheaper model than the large tri-encoder release.

It is best suited for retrieval, routing, and similarity workloads rather than generative chat, captioning, or instruction following.
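As a rough illustration of what a retrieval or routing workload does with embeddings, the sketch below ranks candidates by cosine similarity to a query vector. The route names and the toy 3-dimensional vectors are hypothetical stand-ins for real model outputs, and `cosine` and `rank_candidates` are illustrative helpers, not part of this release.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors; real embeddings would come from the model.
routes = {
    "billing_tool": [0.9, 0.1, 0.0],
    "login_help": [0.1, 0.9, 0.1],
    "audio_notes": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for an embedded user message

ranking = rank_candidates(query, routes)
best_route = ranking[0][0]
print(best_route)
```

In practice the candidate vectors would be computed once and cached, and only the query would be embedded per request.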
## Model at a glance

| Item | Value |
| --- | --- |
| Family | Elephant Embeddings V1 |
| Maintainer | Agentic Intelligence Lab |
| Model type | Multimodal embedding model |
| Modalities | Text, image, audio |
| Text encoder | `sentence-transformers/all-MiniLM-L6-v2` |
| Image encoder | `google/siglip-base-patch16-512` |
| Audio encoder | `openai/whisper-tiny` |
| Fusion | 2-layer Transformer attention |
| Embedding dimension | 384 |
| Matryoshka dimensions | 384, 256, 128, 64, 32 |
| Image resolution | 512×512 |
| Audio input | Up to 30s, 16kHz |
| Upstream source | `llm-semantic-router/multi-modal-embed-small` |
| License | Apache 2.0 |

## Why it fits agentic workloads

Small multimodal embeddings are useful when an agent runtime needs frequent low-cost similarity checks over mixed content.

Key advantages:

- **Shared multimodal space**: compare text, screenshots/images, and short audio clips in one vector space.
- **Compact embedding size**: 384-dimensional vectors are cheaper to store and search.
- **Dimension-adaptive retrieval**: truncate vectors to 256d, 128d, 64d, or 32d for lower-cost indexes.
- **Practical modality encoders**: combines lightweight text and audio encoders with a SigLIP image tower.
- **ONNX assets included**: provides additional deployment artifacts for selected runtime paths.
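To make the storage argument concrete, the sketch below estimates the raw size of a flat index at each Matryoshka dimension. It assumes uncompressed float32 values (4 bytes each) and a hypothetical corpus of one million vectors; real indexes add overhead for IDs and graph or quantization structures.

```python
BYTES_PER_FLOAT32 = 4
MATRYOSHKA_DIMS = [384, 256, 128, 64, 32]  # dimensions listed in the table above


def index_size_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of num_vectors dense vectors of width dim, excluding index overhead."""
    return num_vectors * dim * bytes_per_value


num_vectors = 1_000_000  # hypothetical corpus size
sizes = {dim: index_size_bytes(num_vectors, dim) for dim in MATRYOSHKA_DIMS}

for dim, size in sizes.items():
    print(f"{dim:>3}d: {size / 1e6:,.0f} MB")
```

Under these assumptions the full 384d index is 12 times the size of the 32d one, which is the trade-off to weigh against the retrieval quality lost by truncating.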
## Recommended use cases

| Scenario | Example |
| --- | --- |
| Lightweight multimodal retrieval | Search captions, screenshots, and voice snippets together |
| Agent route matching | Match user text or UI screenshots to tools and workflows |
| Edge or cost-sensitive indexing | Use 384d or truncated vectors for lower storage cost |
| Prototype multimodal memory | Build a small unified memory index before moving to the large model |
| Image/audio semantic search | Retrieve text labels or notes from image/audio queries |

## Quick start on ModelScope

```bash
pip install modelscope torch transformers pillow safetensors
```

```python
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
from modelscope import snapshot_download
from transformers import (
    AutoModel,
    AutoTokenizer,
    SiglipModel,
    SiglipProcessor,
    WhisperFeatureExtractor,
    WhisperModel,
)


class MultiModalEmbedder(nn.Module):
    def __init__(self):
        super().__init__()
        # Text tower: MiniLM, 384-d hidden states.
        self.text_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        self.text_encoder = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        # Image tower: SigLIP vision model (768-d) projected down to 384-d.
        self.image_processor = SiglipProcessor.from_pretrained("google/siglip-base-patch16-512")
        self.image_encoder = SiglipModel.from_pretrained("google/siglip-base-patch16-512").vision_model
        self.image_proj = nn.Linear(768, 384)
        # Audio tower: Whisper-tiny encoder, 384-d hidden states.
        self.audio_processor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
        self.audio_encoder = WhisperModel.from_pretrained("openai/whisper-tiny").encoder

    def encode_text(self, texts):
        if isinstance(texts, str):
            texts = [texts]
        inputs = self.text_tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.text_encoder(**inputs)
        # Mean over all token positions, then L2-normalize.
        embeddings = outputs.last_hidden_state.mean(dim=1)
        return F.normalize(embeddings, p=2, dim=-1)

    def encode_image(self, images):
        inputs = self.image_processor(images=images, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.image_encoder(**inputs)
        embeddings = self.image_proj(outputs.pooler_output)
        return F.normalize(embeddings, p=2, dim=-1)

    def encode_audio(self, waveform):
        # Expects a mono waveform sampled at 16 kHz.
        if isinstance(waveform, torch.Tensor):
            waveform = waveform.squeeze().cpu().numpy()
        inputs = self.audio_processor(waveform, sampling_rate=16000, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.audio_encoder(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1)
        return F.normalize(embeddings, p=2, dim=-1)


repo_id = "agentic-intelligence-lab/elephant-embeddings-v1-multimodal-small"
local_dir = snapshot_download(repo_id)

model = MultiModalEmbedder()
# weights_only=False unpickles arbitrary objects; only load checkpoints you trust.
state_dict = torch.load(os.path.join(local_dir, "model.pt"), map_location="cpu", weights_only=False)

# Copy each sub-module's weights by stripping its checkpoint key prefix.
model.text_encoder.load_state_dict({
    key.replace("text_encoder.encoder.", ""): value
    for key, value in state_dict.items()
    if key.startswith("text_encoder.encoder.")
})
model.image_encoder.load_state_dict({
    key.replace("image_encoder.vision_encoder.", ""): value
    for key, value in state_dict.items()
    if key.startswith("image_encoder.vision_encoder.")
})
model.image_proj.load_state_dict({
    key.replace("image_encoder.projection.", ""): value
    for key, value in state_dict.items()
    if key.startswith("image_encoder.projection.")
})
model.audio_encoder.load_state_dict({
    key.replace("audio_encoder.encoder.", ""): value
    for key, value in state_dict.items()
    if key.startswith("audio_encoder.encoder.")
})
model.eval()

texts = ["A refund request", "A screenshot of a login failure"]
with torch.no_grad():  # inference only, no gradients needed
    text_embeddings = model.encode_text(texts)
print(text_embeddings.shape)  # [2, 384]
```

## Matryoshka truncation

```python
full_emb = model.encode_text("A billing support request")  # [1, 384]

# Keep the leading dimensions, then re-normalize to unit length.
emb_256 = F.normalize(full_emb[:, :256], p=2, dim=-1)
emb_128 = F.normalize(full_emb[:, :128], p=2, dim=-1)
emb_64 = F.normalize(full_emb[:, :64], p=2, dim=-1)
```

## Evaluation snapshot

| Metric | Score |
| --- | ---: |
| COCO image-to-text R@1 | 41.88% |
| COCO image-to-text R@5 | 71.64% |
| COCO image-to-text R@10 | 82.16% |
| LibriSpeech audio-to-text R@1 | 36.38% |
| LibriSpeech audio-to-text R@5 | 68.22% |
| LibriSpeech audio-to-text R@10 | 79.52% |

## Files

| File | Description |
| --- | --- |
| `model.pt` | PyTorch checkpoint |
| `model.safetensors` | SafeTensors checkpoint |
| `config.json` | Model component configuration |
| `onnx/` | ONNX deployment assets |
| `README.md` | This model card |

## Lineage

This ModelScope package is published by `agentic-intelligence-lab` as part of the Elephant model release line. It mirrors the upstream HuggingFace model `llm-semantic-router/multi-modal-embed-small` and keeps the model artifacts unchanged except for the repository naming and model card presentation.

## Limitations

- English is the primary supported language for this compact release.
- Image inputs are designed around 512×512 preprocessing.
- Audio inputs are intended for short clips up to about 30 seconds at 16kHz.
- The model is optimized for retrieval, routing, and similarity, not generation or captioning.
- For higher-quality multimodal retrieval, use `elephant-embeddings-v1-multimodal-large`.

## Citation

```bibtex
@misc{elephant-embeddings-v1-multimodal-small,
  title={Elephant Embeddings V1 Multimodal Small},
  author={Agentic Intelligence Lab},
  year={2026},
  url={https://modelscope.cn/models/agentic-intelligence-lab/elephant-embeddings-v1-multimodal-small}
}
```

## License

Apache 2.0