---
license: apache-2.0
library_name: transformers
pipeline_tag: sentence-similarity
language:
  - en
tags:
  - agentic-intelligence-lab
  - elephant
  - embeddings
  - multimodal
  - retrieval
  - rag
  - agents
  - routing
  - image-text
  - audio-text
  - matryoshka
  - 2dmse
model-index:
  - name: elephant-embeddings-v1-multimodal-small
    results:
      - task:
          type: image-text-retrieval
        dataset:
          name: COCO
          type: coco
        metrics:
          - name: Image-to-Text R@1
            type: recall_at_1
            value: 41.88
          - name: Image-to-Text R@5
            type: recall_at_5
            value: 71.64
          - name: Image-to-Text R@10
            type: recall_at_10
            value: 82.16
      - task:
          type: audio-text-retrieval
        dataset:
          name: LibriSpeech
          type: librispeech
        metrics:
          - name: Audio-to-Text R@1
            type: recall_at_1
            value: 36.38
          - name: Audio-to-Text R@5
            type: recall_at_5
            value: 68.22
          - name: Audio-to-Text R@10
            type: recall_at_10
            value: 79.52
---

# Elephant Embeddings V1 Multimodal Small

`elephant-embeddings-v1-multimodal-small` is the compact multimodal embedding model in the **Agentic Intelligence Lab Elephant Embeddings V1** family.

This ModelScope release is maintained by `agentic-intelligence-lab` to make Elephant embedding models easier to download and deploy in mainland China. It mirrors the upstream Hugging Face model `llm-semantic-router/multi-modal-embed-small` and republishes it under a consistent Elephant model namespace.

## Positioning

This is a lightweight embedding model for text, image, and audio retrieval. It targets deployments that need a shared multimodal semantic space but want a smaller, cheaper model than the large tri-encoder release.

It is best suited for retrieval, routing, and similarity workloads rather than generative chat, captioning, or instruction following.

## Model at a glance

| Item | Value |
| --- | --- |
| Family | Elephant Embeddings V1 |
| Maintainer | Agentic Intelligence Lab |
| Model type | Multimodal embedding model |
| Modalities | Text, image, audio |
| Text encoder | `sentence-transformers/all-MiniLM-L6-v2` |
| Image encoder | `google/siglip-base-patch16-512` |
| Audio encoder | `openai/whisper-tiny` |
| Fusion | 2-layer Transformer attention |
| Embedding dimension | 384 |
| Matryoshka dimensions | 384, 256, 128, 64, 32 |
| Image resolution | 512×512 |
| Audio input | Up to 30 s at 16 kHz |
| Upstream source | `llm-semantic-router/multi-modal-embed-small` |
| License | Apache 2.0 |

## Why it fits agentic workloads

Small multimodal embeddings are useful when an agent runtime needs frequent low-cost similarity checks over mixed content.

Key advantages:

- **Shared multimodal space**: compare text, screenshots/images, and short audio clips in one vector space.
- **Compact embedding size**: a 384-dimensional float32 vector takes 1,536 bytes, which keeps storage and search costs low.
- **Dimension-adaptive retrieval**: truncate vectors to 256d, 128d, 64d, or 32d and re-normalize them to build lower-cost indexes.
- **Practical modality encoders**: combines lightweight text and audio encoders with a SigLIP image tower.
- **ONNX assets included**: provides additional deployment artifacts for selected runtime paths.
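
To make the storage trade-off concrete, the sketch below estimates raw index size for each Matryoshka dimension. It assumes float32 storage (4 bytes per component) and a hypothetical corpus of one million vectors; index overhead and quantization are ignored, so real numbers will differ.

```python
# Rough index-size estimate for each Matryoshka dimension.
# Assumptions (not from the model card): float32 storage, 1M vectors,
# no index overhead and no quantization.
BYTES_PER_FLOAT32 = 4
MATRYOSHKA_DIMS = [384, 256, 128, 64, 32]


def index_size_bytes(num_vectors: int, dim: int) -> int:
    """Raw bytes needed to store num_vectors float32 vectors of size dim."""
    return num_vectors * dim * BYTES_PER_FLOAT32


num_vectors = 1_000_000  # hypothetical corpus size
sizes = {dim: index_size_bytes(num_vectors, dim) for dim in MATRYOSHKA_DIMS}
for dim, size in sizes.items():
    print(f"{dim:>3}d: {size / 1e6:.0f} MB")
```

Under these assumptions the full 384d index needs about 1.5 GB, while the 32d index needs about 128 MB.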

## Recommended use cases

| Scenario | Example |
| --- | --- |
| Lightweight multimodal retrieval | Search captions, screenshots, and voice snippets together |
| Agent route matching | Match user text or UI screenshots to tools and workflows |
| Edge or cost-sensitive indexing | Use 384d or truncated vectors for lower storage cost |
| Prototype multimodal memory | Build a small unified memory index before moving to the large model |
| Image/audio semantic search | Retrieve text labels or notes from image/audio queries |
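
As an illustration of the route-matching scenario, the following sketch picks the best route for a query embedding by cosine similarity. The route names, the toy 4-dimensional vectors, and the `0.5` threshold are all hypothetical; in practice the vectors would come from the model's encoders and the threshold would be tuned on real traffic.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_route(query_emb, route_embs, threshold=0.5):
    """Return (route_name, score) for the closest route.

    route_name is None when the best score falls below the threshold,
    which lets the caller fall back to a default handler.
    """
    best_name, best_score = None, -1.0
    for name, emb in route_embs.items():
        score = cosine(query_emb, emb)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Toy 4-d vectors stand in for real 384-d embeddings (hypothetical values).
routes = {
    "refund_tool": [0.9, 0.1, 0.0, 0.1],
    "login_help": [0.0, 0.8, 0.5, 0.1],
}
query = [0.8, 0.2, 0.1, 0.0]

route, score = match_route(query, routes)
print(route, round(score, 3))
```

Because the encoders return L2-normalized embeddings, the cosine computation reduces to a dot product when real model outputs are used.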

## Quick start on ModelScope

```bash
pip install modelscope torch transformers pillow safetensors sentencepiece
```

```python
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
from modelscope import snapshot_download
from transformers import (
    AutoModel,
    AutoTokenizer,
    SiglipModel,
    SiglipProcessor,
    WhisperFeatureExtractor,
    WhisperModel,
)


class MultiModalEmbedder(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        self.text_encoder = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

        self.image_processor = SiglipProcessor.from_pretrained("google/siglip-base-patch16-512")
        self.image_encoder = SiglipModel.from_pretrained("google/siglip-base-patch16-512").vision_model
        self.image_proj = nn.Linear(768, 384)

        self.audio_processor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
        self.audio_encoder = WhisperModel.from_pretrained("openai/whisper-tiny").encoder

    def encode_text(self, texts):
        if isinstance(texts, str):
            texts = [texts]
        inputs = self.text_tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.text_encoder(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1)
        return F.normalize(embeddings, p=2, dim=-1)

    def encode_image(self, images):
        inputs = self.image_processor(images=images, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.image_encoder(**inputs)
        embeddings = self.image_proj(outputs.pooler_output)
        return F.normalize(embeddings, p=2, dim=-1)

    def encode_audio(self, waveform):
        if isinstance(waveform, torch.Tensor):
            waveform = waveform.squeeze().cpu().numpy()
        inputs = self.audio_processor(waveform, sampling_rate=16000, return_tensors="pt")
        inputs = {key: value.to(next(self.parameters()).device) for key, value in inputs.items()}
        outputs = self.audio_encoder(**inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1)
        return F.normalize(embeddings, p=2, dim=-1)


repo_id = "agentic-intelligence-lab/elephant-embeddings-v1-multimodal-small"
local_dir = snapshot_download(repo_id)

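model = MultiModalEmbedder()
# The checkpoint is a flat state dict; each sub-module's weights sit under its own key prefix.
# weights_only=False unpickles arbitrary objects, so only load checkpoints you trust.
state_dict = torch.load(os.path.join(local_dir, "model.pt"), map_location="cpu", weights_only=False)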

model.text_encoder.load_state_dict({
    key.replace("text_encoder.encoder.", ""): value
    for key, value in state_dict.items()
    if key.startswith("text_encoder.encoder.")
})
model.image_encoder.load_state_dict({
    key.replace("image_encoder.vision_encoder.", ""): value
    for key, value in state_dict.items()
    if key.startswith("image_encoder.vision_encoder.")
})
model.image_proj.load_state_dict({
    key.replace("image_encoder.projection.", ""): value
    for key, value in state_dict.items()
    if key.startswith("image_encoder.projection.")
})
model.audio_encoder.load_state_dict({
    key.replace("audio_encoder.encoder.", ""): value
    for key, value in state_dict.items()
    if key.startswith("audio_encoder.encoder.")
})

model.eval()

texts = ["A refund request", "A screenshot of a login failure"]
with torch.no_grad():  # inference only, so skip gradient tracking
    text_embeddings = model.encode_text(texts)
print(text_embeddings.shape)  # torch.Size([2, 384])
```
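
The four `load_state_dict` calls above repeat the same filter-and-rename pattern. A small helper, sketched here with the hypothetical name `extract_submodule_state`, can factor it out. It slices off the leading prefix instead of calling `str.replace`, so only the start of each key is altered. The toy dictionary uses placeholder integers instead of tensors.

```python
def extract_submodule_state(state_dict, prefix):
    """Return the entries stored under `prefix`, with the prefix removed from each key."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Toy checkpoint with placeholder values instead of tensors.
toy_state = {
    "text_encoder.encoder.embeddings.weight": 1,
    "image_encoder.projection.weight": 2,
    "image_encoder.projection.bias": 3,
}

proj_state = extract_submodule_state(toy_state, "image_encoder.projection.")
print(proj_state)  # {'weight': 2, 'bias': 3}
```

With the real checkpoint, a call such as `model.image_proj.load_state_dict(extract_submodule_state(state_dict, "image_encoder.projection."))` would be equivalent to the corresponding block in the quick start.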

## Matryoshka truncation

```python
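with torch.no_grad():
    full_emb = model.encode_text("A billing support request")  # [1, 384]

# Keep the leading dimensions, then re-normalize so cosine similarity stays valid.
emb_256 = F.normalize(full_emb[:, :256], p=2, dim=-1)
emb_128 = F.normalize(full_emb[:, :128], p=2, dim=-1)
emb_64 = F.normalize(full_emb[:, :64], p=2, dim=-1)
emb_32 = F.normalize(full_emb[:, :32], p=2, dim=-1)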
```
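
One common way to exploit truncated vectors, which this model card does not itself prescribe, is two-stage retrieval: shortlist candidates with a cheap low-dimensional comparison, then rerank the shortlist with the full vectors. The sketch below uses plain Python lists and toy 4-dimensional vectors; the function names and parameters are illustrative.

```python
import math


def normalize(vec):
    """L2-normalize a vector; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query, corpus, coarse_dim, shortlist_size, top_k):
    """Shortlist with truncated vectors, then rerank with full vectors.

    Returns the indices of the top_k corpus items, best first.
    """
    # Stage 1: compare only the leading coarse_dim components.
    q_coarse = normalize(query[:coarse_dim])
    shortlist = sorted(
        range(len(corpus)),
        key=lambda i: dot(q_coarse, normalize(corpus[i][:coarse_dim])),
        reverse=True,
    )[:shortlist_size]

    # Stage 2: rescore the shortlist with the full-dimensional vectors.
    q_full = normalize(query)
    return sorted(
        shortlist,
        key=lambda i: dot(q_full, normalize(corpus[i])),
        reverse=True,
    )[:top_k]


# Toy 4-d vectors stand in for 384-d embeddings (hypothetical values).
corpus = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.4, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.7],
]
query = [1.0, 0.0, 0.1, 0.0]

result = two_stage_search(query, corpus, coarse_dim=2, shortlist_size=3, top_k=2)
print(result)  # [0, 1]
```

In the toy data, item 3 looks as good as item 0 in the first two dimensions, and the full-vector rerank is what pushes it below item 1.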

## Evaluation snapshot

| Metric | Score |
| --- | ---: |
| COCO image-to-text R@1 | 41.88% |
| COCO image-to-text R@5 | 71.64% |
| COCO image-to-text R@10 | 82.16% |
| LibriSpeech audio-to-text R@1 | 36.38% |
| LibriSpeech audio-to-text R@5 | 68.22% |
| LibriSpeech audio-to-text R@10 | 79.52% |
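
For readers unfamiliar with the metric, the sketch below computes Recall@K in its usual retrieval sense: the fraction of queries for which at least one relevant item appears in the top K results. The model card does not describe the exact evaluation protocol, so this is a generic definition with made-up item ids, not a reproduction of the reported numbers.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries whose top-k results contain at least one relevant item.

    ranked_ids: one ranked list of item ids per query (best first).
    relevant_ids: one set of relevant item ids per query.
    """
    hits = 0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        if any(item in relevant for item in ranked[:k]):
            hits += 1
    return hits / len(ranked_ids)


# Three toy queries with hypothetical caption ids.
ranked = [
    ["c1", "c7", "c3"],  # relevant item ranked first
    ["c9", "c2", "c4"],  # relevant item ranked second
    ["c5", "c6", "c8"],  # relevant item not retrieved
]
relevant = [{"c1"}, {"c2"}, {"c0"}]

r1 = recall_at_k(ranked, relevant, k=1)
r3 = recall_at_k(ranked, relevant, k=3)
print(f"R@1={r1:.2%}  R@3={r3:.2%}")
```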

## Files

| File | Description |
| --- | --- |
| `model.pt` | PyTorch checkpoint |
| `model.safetensors` | SafeTensors checkpoint |
| `config.json` | Model component configuration |
| `onnx/` | ONNX deployment assets |
| `README.md` | This model card |

## Lineage

This ModelScope package is published by `agentic-intelligence-lab` as part of the Elephant model release line. It mirrors the upstream Hugging Face model `llm-semantic-router/multi-modal-embed-small`. The model artifacts are unchanged; only the repository name and the model card presentation differ.

## Limitations

- English is the primary supported language for this compact release.
- Image inputs are designed around 512×512 preprocessing.
- Audio inputs are intended for short clips of up to about 30 seconds at 16 kHz.
- The model is optimized for retrieval, routing, and similarity, not generation or captioning.
- For higher-quality multimodal retrieval, use `elephant-embeddings-v1-multimodal-large`.
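
Given the 30-second audio limit, longer recordings need to be split or truncated before encoding. The sketch below shows one possible way to cut a 16 kHz waveform into consecutive windows of at most 30 seconds. The helper name `split_waveform` is hypothetical, and how to combine the per-window embeddings afterwards (for example by indexing each window separately) is left open, since the model card does not specify it.

```python
SAMPLE_RATE = 16_000  # Hz, as expected by the audio encoder
MAX_SECONDS = 30
MAX_SAMPLES = SAMPLE_RATE * MAX_SECONDS  # 480,000 samples per window


def split_waveform(waveform, max_samples=MAX_SAMPLES):
    """Split a 1-D sequence of samples into consecutive windows of at most max_samples."""
    return [
        waveform[start:start + max_samples]
        for start in range(0, len(waveform), max_samples)
    ]


# A silent 70-second clip as a stand-in for real audio.
clip = [0.0] * (70 * SAMPLE_RATE)
windows = split_waveform(clip)
print([len(w) / SAMPLE_RATE for w in windows])  # [30.0, 30.0, 10.0]
```

Each window can then be passed to `encode_audio` on its own.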

## Citation

```bibtex
@misc{elephant-embeddings-v1-multimodal-small,
  title={Elephant Embeddings V1 Multimodal Small},
  author={Agentic Intelligence Lab},
  year={2026},
  url={https://modelscope.cn/models/agentic-intelligence-lab/elephant-embeddings-v1-multimodal-small}
}
```

## License

Apache 2.0