---
license: apache-2.0
library_name: transformers
pipeline_tag: sentence-similarity
language:
- en
tags:
- agentic-intelligence-lab
- elephant
- embeddings
- multimodal
- retrieval
- rag
- agents
- routing
- image-text
- audio-text
- matryoshka
- 2dmse
model-index:
- name: elephant-embeddings-v1-multimodal-small
  results:
  - task:
      type: image-text-retrieval
    dataset:
      name: COCO
      type: coco
    metrics:
    - name: Image-to-Text R@1
      type: recall_at_1
      value: 41.88
    - name: Image-to-Text R@5
      type: recall_at_5
      value: 71.64
    - name: Image-to-Text R@10
      type: recall_at_10
      value: 82.16
  - task:
      type: audio-text-retrieval
    dataset:
      name: LibriSpeech
      type: librispeech
    metrics:
    - name: Audio-to-Text R@1
      type: recall_at_1
      value: 36.38
    - name: Audio-to-Text R@5
      type: recall_at_5
      value: 68.22
    - name: Audio-to-Text R@10
      type: recall_at_10
      value: 79.52
---

# Elephant Embeddings V1 Multimodal Small

`elephant-embeddings-v1-multimodal-small` is the compact multimodal embedding model in the **Agentic Intelligence Lab Elephant Embeddings V1** family.

This ModelScope release is maintained by `agentic-intelligence-lab` to make Elephant embedding models easier to download and deploy in mainland China. It mirrors the upstream Hugging Face model `llm-semantic-router/multi-modal-embed-small` and republishes it under the Elephant model namespace.

## Positioning

This is a lightweight embedding model for text, image, and audio retrieval. It targets deployments that need a shared multimodal semantic space but want a smaller, cheaper model than the large tri-encoder release.

It is best suited to retrieval, routing, and similarity workloads, not generative chat, captioning, or instruction following.

## Model at a glance

| Item | Value |
| --- | --- |
| Family | Elephant Embeddings V1 |
| Maintainer | Agentic Intelligence Lab |
| Model type | Multimodal embedding model |
| Modalities | Text, image, audio |
| Text encoder | `sentence-transformers/all-MiniLM-L6-v2` |
| Image encoder | `google/siglip-base-patch16-512` |
| Audio encoder | `openai/whisper-tiny` |
| Fusion | 2-layer Transformer attention |
| Embedding dimension | 384 |
| Matryoshka dimensions | 384, 256, 128, 64, 32 |
| Image resolution | 512×512 |
| Audio input | Up to 30 s, 16 kHz |
| Upstream source | `llm-semantic-router/multi-modal-embed-small` |
| License | Apache 2.0 |

## Why it fits agentic workloads

Small multimodal embeddings are useful when an agent runtime needs frequent, low-cost similarity checks over mixed content.

Key advantages:

- **Shared multimodal space**: compare text, screenshots or other images, and short audio clips in one vector space.
- **Compact embedding size**: 384-dimensional vectors are cheap to store and search.
- **Dimension-adaptive retrieval**: truncate vectors to 256, 128, 64, or 32 dimensions for lower-cost indexes.
- **Practical modality encoders**: combines lightweight text and audio encoders with a SigLIP image tower.
- **ONNX assets included**: ships ONNX deployment artifacts for selected runtime paths.

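To make the storage argument concrete, the sketch below estimates the raw size of a flat vector index at each Matryoshka dimension. It assumes float32 values (4 bytes each) and ignores any index overhead; `index_size_mib` is a hypothetical helper, not part of the model or of any library.

```python
def index_size_mib(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw size in MiB of `num_vectors` dense vectors (assumption: flat storage, no overhead)."""
    return num_vectors * dim * bytes_per_value / (1024 ** 2)


# Compare the Matryoshka dimensions listed in this card for one million vectors.
for dim in (384, 256, 128, 64, 32):
    print(f"{dim:>3}d: {index_size_mib(1_000_000, dim):8.1f} MiB")
```

Because size scales linearly with dimension, a 32-d index needs one twelfth of the memory of the full 384-d index, at some cost in retrieval quality.
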
## Recommended use cases

| Scenario | Example |
| --- | --- |
| Lightweight multimodal retrieval | Search captions, screenshots, and voice snippets together |
| Agent route matching | Match user text or UI screenshots to tools and workflows |
| Edge or cost-sensitive indexing | Use 384d or truncated vectors for lower storage cost |
| Prototype multimodal memory | Build a small unified memory index before moving to the large model |
| Image/audio semantic search | Retrieve text labels or notes from image/audio queries |

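Agent route matching can be sketched as a nearest-neighbor lookup over route embeddings. The example below is a minimal illustration, assuming each tool or workflow has a precomputed embedding. `pick_route`, the route names, the 3-d toy vectors, and the `min_score` threshold are all hypothetical, and plain Python lists stand in for model outputs so the snippet runs without downloading weights.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_route(query_emb, route_embs, min_score=0.3):
    """Return (route_name, score) for the best match, or (None, score) below `min_score`."""
    name, score = max(
        ((name, cosine(query_emb, emb)) for name, emb in route_embs.items()),
        key=lambda pair: pair[1],
    )
    return (name if score >= min_score else None), score


# Hypothetical routes; in practice each vector would come from an encoder call.
routes = {
    "refund_tool": [1.0, 0.0, 0.0],
    "login_help": [0.0, 1.0, 0.0],
}
print(pick_route([0.9, 0.1, 0.0], routes))  # query close to "refund_tool"
print(pick_route([0.0, 0.0, 1.0], routes))  # query that matches no route well
```

The threshold lets the agent fall back to a default path when no route is a convincing match; a suitable value depends on the data and would need tuning.
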
## Quick start on ModelScope

```bash
pip install modelscope torch transformers pillow safetensors
```

```python
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
from modelscope import snapshot_download
from transformers import (
    AutoModel,
    AutoTokenizer,
    SiglipModel,
    SiglipProcessor,
    WhisperFeatureExtractor,
    WhisperModel,
)


class MultiModalEmbedder(nn.Module):
    def __init__(self):
        super().__init__()
        # Text tower: MiniLM, hidden size 384.
        self.text_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        self.text_encoder = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        # Image tower: SigLIP vision model, hidden size 768, projected down to 384.
        self.image_processor = SiglipProcessor.from_pretrained("google/siglip-base-patch16-512")
        self.image_encoder = SiglipModel.from_pretrained("google/siglip-base-patch16-512").vision_model
        self.image_proj = nn.Linear(768, 384)
        # Audio tower: Whisper-tiny encoder, hidden size 384.
        self.audio_processor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
        self.audio_encoder = WhisperModel.from_pretrained("openai/whisper-tiny").encoder

    def encode_text(self, texts):
        if isinstance(texts, str):
            texts = [texts]
        inputs = self.text_tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.text_encoder(**inputs)
        # Mean-pool token states into one vector per input, then L2-normalize.
        embeddings = outputs.last_hidden_state.mean(dim=1)
        return F.normalize(embeddings, p=2, dim=-1)

    def encode_image(self, images):
        inputs = self.image_processor(images=images, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.image_encoder(**inputs)
        embeddings = self.image_proj(outputs.pooler_output)
        return F.normalize(embeddings, p=2, dim=-1)

    def encode_audio(self, waveform):
        # Expects a mono waveform sampled at 16 kHz.
        if isinstance(waveform, torch.Tensor):
            waveform = waveform.squeeze().cpu().numpy()
        inputs = self.audio_processor(waveform, sampling_rate=16000, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.audio_encoder(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1)
        return F.normalize(embeddings, p=2, dim=-1)


repo_id = "agentic-intelligence-lab/elephant-embeddings-v1-multimodal-small"
local_dir = snapshot_download(repo_id)

model = MultiModalEmbedder()
# weights_only=False unpickles arbitrary Python objects; load only checkpoints you trust.
state_dict = torch.load(os.path.join(local_dir, "model.pt"), map_location="cpu", weights_only=False)


def sub_state_dict(prefix):
    """Select checkpoint entries under `prefix` and strip the prefix from their keys."""
    return {key[len(prefix):]: value for key, value in state_dict.items() if key.startswith(prefix)}


model.text_encoder.load_state_dict(sub_state_dict("text_encoder.encoder."))
model.image_encoder.load_state_dict(sub_state_dict("image_encoder.vision_encoder."))
model.image_proj.load_state_dict(sub_state_dict("image_encoder.projection."))
model.audio_encoder.load_state_dict(sub_state_dict("audio_encoder.encoder."))
model.eval()

texts = ["A refund request", "A screenshot of a login failure"]
with torch.no_grad():  # inference only, no gradients needed
    text_embeddings = model.encode_text(texts)
print(text_embeddings.shape)  # torch.Size([2, 384])
```

## Matryoshka truncation

Keep the leading dimensions of the full embedding, then re-normalize so that dot products remain cosine similarities.

```python
full_emb = model.encode_text("A billing support request")  # [1, 384]
emb_256 = F.normalize(full_emb[:, :256], p=2, dim=-1)
emb_128 = F.normalize(full_emb[:, :128], p=2, dim=-1)
emb_64 = F.normalize(full_emb[:, :64], p=2, dim=-1)
emb_32 = F.normalize(full_emb[:, :32], p=2, dim=-1)
```

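The same truncation can be wrapped in a small helper that rejects unsupported sizes. This is a sketch in plain Python, with lists standing in for tensors so it runs without the model; `truncate_embedding` is a hypothetical name, and the allowed sizes are the Matryoshka dimensions listed in this card.

```python
import math

MATRYOSHKA_DIMS = (384, 256, 128, 64, 32)  # dimensions listed in this model card


def truncate_embedding(vec, dim):
    """Keep the first `dim` values of `vec` and rescale the result to unit L2 norm."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    if len(vec) < dim:
        raise ValueError(f"vector has {len(vec)} values, cannot truncate to {dim}")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    # An all-zero prefix cannot be normalized; return it unchanged.
    return head if norm == 0.0 else [x / norm for x in head]


full = [1.0] * 384  # toy stand-in for a real 384-d embedding
short = truncate_embedding(full, 32)
print(len(short), round(sum(x * x for x in short), 6))
```

Validating the size up front avoids silently building an index at a dimension the model was not trained to support.
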
## Evaluation snapshot

| Metric | Score |
| --- | ---: |
| COCO image-to-text R@1 | 41.88% |
| COCO image-to-text R@5 | 71.64% |
| COCO image-to-text R@10 | 82.16% |
| LibriSpeech audio-to-text R@1 | 36.38% |
| LibriSpeech audio-to-text R@5 | 68.22% |
| LibriSpeech audio-to-text R@10 | 79.52% |

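For readers unfamiliar with the metric, R@K is the percentage of queries whose correct match appears among the top K retrieved candidates. The sketch below computes it from a similarity matrix under a simplifying assumption: query `i` has exactly one correct candidate, at index `i`. The evaluation protocol behind the scores above is not described in this card, so treat the hypothetical `recall_at_k` as an illustration of the metric rather than a reproduction of those numbers.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose correct candidate ranks in the top `k`.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending score, cut to the top k.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return 100.0 * hits / len(similarity)


# Toy scores for 3 queries against 3 candidates.
sim = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.3, 0.7],
]
print(recall_at_k(sim, 1), recall_at_k(sim, 2))
```

In the toy matrix, the second query ranks its correct candidate second, so it counts as a miss at K=1 and a hit at K=2.
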
## Files

| File | Description |
| --- | --- |
| `model.pt` | PyTorch checkpoint |
| `model.safetensors` | SafeTensors checkpoint |
| `config.json` | Model component configuration |
| `onnx/` | ONNX deployment assets |
| `README.md` | This model card |

## Lineage

This ModelScope package is published by `agentic-intelligence-lab` as part of the Elephant model release line. It mirrors the upstream Hugging Face model `llm-semantic-router/multi-modal-embed-small`. The model artifacts are unchanged; only the repository name and the model card differ.

## Limitations

- English is the primary supported language for this compact release.
- Image inputs are preprocessed to 512×512.
- Audio inputs are intended to be short clips, up to about 30 seconds at 16 kHz.
- The model is optimized for retrieval, routing, and similarity, not generation or captioning.
- For higher-quality multimodal retrieval, use `elephant-embeddings-v1-multimodal-large`.

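One way to handle recordings longer than the 30-second limit is to split them into consecutive clips before encoding. The helper below is a sketch under that assumption; `split_into_clips` is a hypothetical name, and how the per-clip embeddings are combined afterwards (for example, by averaging) is an application choice this card does not specify.

```python
SAMPLE_RATE = 16_000  # Hz, the audio input rate stated in this card
MAX_SECONDS = 30      # clip length limit stated in this card


def split_into_clips(waveform, sample_rate=SAMPLE_RATE, max_seconds=MAX_SECONDS):
    """Split a mono waveform into consecutive clips of at most `max_seconds` each.

    Works on any sliceable 1-D sequence, such as a list or a 1-D array.
    """
    if sample_rate != SAMPLE_RATE:
        raise ValueError(f"expected {SAMPLE_RATE} Hz audio, got {sample_rate} Hz; resample first")
    max_len = sample_rate * max_seconds
    return [waveform[start:start + max_len] for start in range(0, len(waveform), max_len)]


# 70 seconds of silence as a toy input: two full clips plus a 10-second remainder.
clips = split_into_clips([0.0] * (70 * SAMPLE_RATE))
print([len(clip) / SAMPLE_RATE for clip in clips])
```

Each clip can then be encoded separately, one call per clip.
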
## Citation

```bibtex
@misc{elephant-embeddings-v1-multimodal-small,
  title={Elephant Embeddings V1 Multimodal Small},
  author={Agentic Intelligence Lab},
  year={2026},
  url={https://modelscope.cn/models/agentic-intelligence-lab/elephant-embeddings-v1-multimodal-small}
}
```

## License

Apache 2.0