---
license: apache-2.0
library_name: pytorch
pipeline_tag: sentence-similarity
language:
- multilingual
tags:
- agentic-intelligence-lab
- elephant
- embeddings
- multimodal
- retrieval
- rag
- agents
- routing
- image-text
- audio-text
- tri-encoder
- pytorch
model-index:
- name: elephant-embeddings-v1-multimodal-large
  results:
  - task:
      type: sentence-similarity
    dataset:
      name: Internal cached validation set
      type: cached_retrieval_validation
    metrics:
    - name: Eval loss
      type: eval_loss
      value: 0.389702
    - name: Eval top1
      type: eval_top1
      value: 0.861707
---

# Elephant Embeddings V1 Multimodal Large

`elephant-embeddings-v1-multimodal-large` is the large multimodal embedding model in the **Agentic Intelligence Lab Elephant Embeddings V1** family.

This ModelScope release is maintained by `agentic-intelligence-lab` to make Elephant embedding models easier to download and deploy in mainland China. It mirrors and renames the upstream HuggingFace model `llm-semantic-router/multi-modal-embed-large` under a consistent Elephant model namespace.


## Positioning

This model is a production-oriented multimodal embedding model for semantic routing, retrieval, and cross-modal matching across text, image, and audio.

It is not a generative chat or captioning model. Instead, it maps different modalities into one shared embedding space so agent systems can compare requests, screenshots, documents, and audio records with the same retrieval interface.


## Model at a glance

| Item | Value |
| --- | --- |
| Family | Elephant Embeddings V1 |
| Maintainer | Agentic Intelligence Lab |
| Model type | Multimodal embedding model |
| Modalities | Text, image, audio |
| Architecture | Custom PyTorch tri-encoder |
| Text encoder | `llm-semantic-router/mmbert-embed-32k-2d-matryoshka` |
| Image encoder | `google/siglip2-so400m-patch14-384` |
| Audio encoder | `openai/whisper-medium` |
| Embedding dimension | 768 |
| Max text length | 32,768 tokens |
| Objective | Cached multiple negatives ranking loss |
| Upstream source | `llm-semantic-router/multi-modal-embed-large` |
| License | Apache 2.0 |


## Why it fits agentic workloads

Agentic products increasingly need to retrieve and route over mixed inputs: user text, screenshots, UI states, documents, voice notes, support calls, and multimodal memory. This model is designed for that operating pattern.

Key advantages:

- **Shared semantic space**: text, images, and audio can be compared with cosine similarity.
- **Routing-grade representation**: optimized for retrieval, matching, and routing rather than generation.
- **Strong modality towers**: uses dedicated text, image, and audio encoders instead of forcing all modalities through a single monolithic checkpoint.
- **Long-context text path**: supports long tool descriptions, traces, and knowledge chunks through the text encoder.
- **Production packaging**: includes the custom source package needed to construct and run the tri-encoder.

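
To make the shared-space point concrete, here is a minimal routing sketch. It assumes each route has a precomputed embedding and picks the route with the highest cosine similarity to the query. The three-dimensional vectors, the route names, the `min_score` threshold, and the `pick_route` helper are all hypothetical stand-ins; real embeddings would come from the model and have 768 dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_route(query_vec, route_vecs, min_score=0.5):
    """Return (route_name, score) for the best match, or (None, score) below min_score."""
    best_name, best_score = None, -1.0
    for name, vec in route_vecs.items():
        score = cosine(query_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_score:
        return None, best_score
    return best_name, best_score


# Hypothetical toy vectors standing in for 768-d model outputs.
routes = {
    "billing": [0.9, 0.1, 0.0],
    "tech_support": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.1]  # e.g. the embedding of a voice note or a screenshot

route, score = pick_route(query, routes)
print(route, round(score, 3))
```

Because the query and the routes live in one space, the same function could serve a text, image, or audio query. The threshold gives the agent a way to fall back to a default workflow when nothing matches well; a suitable value would need to be tuned on real data.
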

## Recommended use cases

| Scenario | Example |
| --- | --- |
| Multimodal RAG | Retrieve text notes using an image or audio query |
| Agent routing | Route screenshots, user text, or voice requests to the right tool or workflow |
| Memory search | Search mixed text/image/audio memory stores in one vector space |
| Support and operations | Match tickets, screenshots, logs, and recorded calls semantically |
| Offline indexing | Build high-quality 768d multimodal indexes |

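
The memory-search and offline-indexing scenarios can be sketched as a small in-memory index. This is an illustration under stated assumptions, not part of the released package: `MemoryIndex`, the item IDs, and the toy vectors are hypothetical, and a production system would more likely store the 768-dimensional embeddings in a vector database.

```python
import heapq
import math


def normalize(vec):
    """Scale a non-zero vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


class MemoryIndex:
    """Tiny in-memory index over embeddings from any modality."""

    def __init__(self):
        self.entries = []  # list of (item_id, modality, unit_vector)

    def add(self, item_id, modality, vec):
        # Normalize once at indexing time so search only needs dot products.
        self.entries.append((item_id, modality, normalize(vec)))

    def search(self, query_vec, k=2, modality=None):
        """Return the top-k (score, item_id, modality), optionally filtered by modality."""
        q = normalize(query_vec)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), item_id, mod)
            for item_id, mod, vec in self.entries
            if modality is None or mod == modality
        ]
        return heapq.nlargest(k, scored)


# Hypothetical toy vectors standing in for 768-d model outputs.
index = MemoryIndex()
index.add("note-1", "text", [0.9, 0.1, 0.0])
index.add("screenshot-7", "image", [0.7, 0.3, 0.1])
index.add("call-3", "audio", [0.0, 0.2, 0.9])

hits = index.search([1.0, 0.0, 0.0], k=2)
print([item_id for _, item_id, _ in hits])
```

Keeping the modality alongside each vector lets a caller search the whole store or restrict results to one modality without maintaining separate indexes.
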

## Quick start on ModelScope

```bash
pip install modelscope torch sentence-transformers transformers accelerate safetensors pillow librosa soundfile
```


```python
import json
import os
import sys

import torch
import torch.nn.functional as F
from modelscope import snapshot_download

# Download the repository snapshot (weights, config, and the custom source package).
repo_id = "agentic-intelligence-lab/elephant-embeddings-v1-multimodal-large"
local_dir = snapshot_download(repo_id)

# Make the packaged `hf_st_mm` code importable.
sys.path.insert(0, os.path.join(local_dir, "src"))

from hf_st_mm.data import PairItem
from hf_st_mm.model import MultiModalSentenceEmbedder

with open(os.path.join(local_dir, "config.json"), "r", encoding="utf-8") as handle:
    cfg = json.load(handle)

# Build the tri-encoder from the exported configuration.
model = MultiModalSentenceEmbedder(
    text_encoder_name=cfg["model"]["text_encoder_name"],
    image_encoder_name=cfg["model"]["image_encoder_name"],
    audio_encoder_name=cfg["model"]["audio_encoder_name"],
    embedding_dim=int(cfg["model"]["embedding_dim"]),
    max_text_length=int(cfg["model"]["max_text_length"]),
)

# Load the exported weights on CPU and switch to inference mode.
state_dict = torch.load(os.path.join(local_dir, "model.pt"), map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

# Encode one item per modality into the shared embedding space.
items = [
    PairItem(modality="text", value="route this request to the billing workflow"),
    PairItem(modality="image", value="/path/to/screenshot.png"),
    PairItem(modality="audio", value="/path/to/call.wav"),
]

with torch.no_grad():
    embeddings = model.encode_items(items)

print(embeddings.shape)  # [3, 768]

# Cross-modal similarity: a text query against an audio candidate.
query = PairItem(modality="text", value="refund request for a wrong charge")
candidate = PairItem(modality="audio", value="/path/to/refund_call.wav")

with torch.no_grad():
    embs = model.encode_items([query, candidate])

similarity = F.cosine_similarity(embs[0:1], embs[1:2]).item()
print(f"similarity={similarity:.4f}")
```


## Evaluation snapshot

| Metric | Value |
| --- | ---: |
| Eval loss | 0.389702 |
| Eval top1 | 0.861707 |

The validation metrics come from the tri-encoder cached retrieval validation path used during export. They are intended as a release sanity snapshot rather than a public leaderboard claim.

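
For readers unfamiliar with these two metrics, the sketch below shows one common way to compute them from a query-by-candidate similarity matrix. It is an assumption about the general technique, not the exact evaluation code behind the numbers above: the toy matrix, the `scale` value, and both function names are hypothetical. The "cached" variant of the loss is assumed here to differ only in how gradients are computed for large batches, not in the loss value itself.

```python
import math


def top1_accuracy(sim):
    """Fraction of queries whose highest-scoring candidate is the matching one (index i)."""
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=lambda j: row[j])
        if best == i:
            hits += 1
    return hits / len(sim)


def ranking_loss(sim, scale=20.0):
    """Mean cross-entropy over rows, treating sim[i][i] as the positive pair."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - logits[i]
    return total / len(sim)


# Hypothetical 3x3 cosine-similarity matrix: row = query, column = candidate.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.6, 0.7, 0.5],  # the true match (column 2) is outranked here
]

print(top1_accuracy(sim))
print(ranking_loss(sim))
```

Every other candidate in a row acts as a negative for that query, so both numbers depend on how many candidates are scored together and are only comparable under the same validation setup.
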

## Files

| File | Description |
| --- | --- |
| `model.pt` | Exported PyTorch weights |
| `config.json` | Tri-encoder and training/export configuration |
| `src/hf_st_mm/` | Python package used to construct and run the model |
| `README.md` | This model card |

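
The quick start reads five keys from the `model` section of `config.json`. A small pre-flight check, sketched below, can surface a missing key before any encoder weights are loaded. The `missing_model_keys` helper and the sample fragment are hypothetical; the real file may contain additional training and export fields.

```python
import json

# Keys the quick-start code reads from cfg["model"].
REQUIRED_MODEL_KEYS = (
    "text_encoder_name",
    "image_encoder_name",
    "audio_encoder_name",
    "embedding_dim",
    "max_text_length",
)


def missing_model_keys(cfg):
    """Return the required keys absent from cfg["model"], in a stable order."""
    model_cfg = cfg.get("model", {})
    return [key for key in REQUIRED_MODEL_KEYS if key not in model_cfg]


# Hypothetical config fragment; values mirror the "Model at a glance" table.
sample = json.loads("""
{
  "model": {
    "text_encoder_name": "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    "image_encoder_name": "google/siglip2-so400m-patch14-384",
    "audio_encoder_name": "openai/whisper-medium",
    "embedding_dim": 768
  }
}
""")

print(missing_model_keys(sample))
```
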

## Lineage

This ModelScope package is published by `agentic-intelligence-lab` as part of the Elephant model release line. It mirrors the upstream HuggingFace model `llm-semantic-router/multi-modal-embed-large` and keeps the model artifacts unchanged except for the repository naming and model card presentation.


## Limitations

- This is a custom PyTorch tri-encoder export, not a standard Transformers auto-class checkpoint.
- Inference relies on the packaged `hf_st_mm` source code.
- Image and audio inputs are expected as local file paths in the simple inference path.
- The model is optimized for retrieval, routing, and similarity, not generation or captioning.
- Reported validation metrics come from an internal cached retrieval validation set.

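
Because the simple inference path expects local file paths, in-memory inputs such as uploads or API responses need to be written to disk first. One possible workaround, sketched with the standard library only, is a temporary-file context manager. `as_temp_file` is a hypothetical helper and not part of `hf_st_mm`.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def as_temp_file(data: bytes, suffix: str):
    """Write bytes to a temporary file, yield its path, and delete it afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        os.remove(path)


# Hypothetical usage: `png_bytes` would come from an upload or a screenshot API.
png_bytes = b"\x89PNG\r\n\x1a\n"  # placeholder header only, not a valid image
with as_temp_file(png_bytes, suffix=".png") as path:
    # In real use: PairItem(modality="image", value=path), then model.encode_items([...])
    print(os.path.exists(path))
```

The file is removed when the `with` block exits, so encoding has to happen inside the block.
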

## Citation

```bibtex
@misc{elephant-embeddings-v1-multimodal-large,
  title={Elephant Embeddings V1 Multimodal Large},
  author={Agentic Intelligence Lab},
  year={2026},
  url={https://modelscope.cn/models/agentic-intelligence-lab/elephant-embeddings-v1-multimodal-large}
}
```


## License

Apache 2.0