Xunzhuo

Mirror agentic-intelligence-lab/elephant-embeddings-v1-multimodal-large from ModelScope

17bbccd verified 8 days ago


	---
	license: apache-2.0
	library_name: pytorch
	pipeline_tag: sentence-similarity
	language:
	- multilingual
	tags:
	- agentic-intelligence-lab
	- elephant
	- embeddings
	- multimodal
	- retrieval
	- rag
	- agents
	- routing
	- image-text
	- audio-text
	- tri-encoder
	- pytorch
	model-index:
	- name: elephant-embeddings-v1-multimodal-large
	  results:
	  - task:
	      type: sentence-similarity
	    dataset:
	      name: Internal cached validation set
	      type: cached_retrieval_validation
	    metrics:
	    - name: Eval loss
	      type: eval_loss
	      value: 0.389702
	    - name: Eval top1
	      type: eval_top1
	      value: 0.861707
	---

	# Elephant Embeddings V1 Multimodal Large

	`elephant-embeddings-v1-multimodal-large` is the large multimodal embedding model in the Agentic Intelligence Lab Elephant Embeddings V1 family.

	This ModelScope release is maintained by `agentic-intelligence-lab` to make Elephant embedding models easier to download and deploy in mainland China. It mirrors and renames the upstream HuggingFace model `llm-semantic-router/multi-modal-embed-large` under a consistent Elephant model namespace.

	## Positioning

	This model is a production-oriented multimodal embedding model for semantic routing, retrieval, and cross-modal matching across text, image, and audio.

	It is not a generative chat or captioning model. Instead, it maps different modalities into one shared embedding space so agent systems can compare requests, screenshots, documents, and audio records with the same retrieval interface.

	## Model at a glance

	| Item | Value |
	| --- | --- |
	| Family | Elephant Embeddings V1 |
	| Maintainer | Agentic Intelligence Lab |
	| Model type | Multimodal embedding model |
	| Modalities | Text, image, audio |
	| Architecture | Custom PyTorch tri-encoder |
	| Text encoder | `llm-semantic-router/mmbert-embed-32k-2d-matryoshka` |
	| Image encoder | `google/siglip2-so400m-patch14-384` |
	| Audio encoder | `openai/whisper-medium` |
	| Embedding dimension | 768 |
	| Max text length | 32,768 tokens |
	| Objective | Cached multiple negatives ranking loss |
	| Upstream source | `llm-semantic-router/multi-modal-embed-large` |
	| License | Apache 2.0 |
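
The Objective row lists a cached multiple negatives ranking loss. The sketch below shows only the plain in-batch-negatives form of that loss, as an illustration rather than the training code used for this model. The function name, the `scale` value of 20, and the random inputs are all assumptions. In sentence-transformers, the cached variant computes the same objective with gradient caching so that large batches fit in memory; that part is omitted here.

```python
import torch
import torch.nn.functional as F


def multiple_negatives_ranking_loss(
    anchors: torch.Tensor, positives: torch.Tensor, scale: float = 20.0
) -> torch.Tensor:
    """In-batch-negatives ranking loss over cosine similarities.

    anchors[i] and positives[i] form a matched pair; positives[j] with j != i
    act as negatives for anchors[i].
    """
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = scale * (a @ p.T)  # [batch, batch] scaled cosine scores
    targets = torch.arange(a.size(0), device=a.device)  # true pairs sit on the diagonal
    return F.cross_entropy(logits, targets)


torch.manual_seed(0)
anchors = torch.randn(4, 768)
# Near-identical positives should give a loss close to zero.
aligned = multiple_negatives_ranking_loss(anchors, anchors + 0.01 * torch.randn(4, 768))
# Unrelated positives should give a much larger loss.
unrelated = multiple_negatives_ranking_loss(anchors, torch.randn(4, 768))
print(f"aligned={aligned.item():.4f} unrelated={unrelated.item():.4f}")
```

Because the targets are the diagonal of the score matrix, a larger batch supplies more negatives per anchor, which is the usual motivation for a memory-efficient cached variant.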

	## Why it fits agentic workloads

	Agentic products increasingly need to retrieve and route over mixed inputs: user text, screenshots, UI states, documents, voice notes, support calls, and multimodal memory. This model is designed for that operating pattern.

	Key advantages:

	- Shared semantic space: text, images, and audio can be compared with cosine similarity.
	- Routing-grade representation: optimized for retrieval, matching, and routing rather than generation.
	- Strong modality towers: uses dedicated text, image, and audio encoders instead of forcing all modalities through a single monolithic checkpoint.
	- Long-context text path: the text encoder accepts up to 32,768 tokens, so long tool descriptions, traces, and knowledge chunks can be embedded without aggressive chunking.
	- Production packaging: includes the custom source package needed to construct and run the tri-encoder.
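
To make the shared-space point concrete, here is a minimal sketch of an all-pairs cosine similarity matrix between two sets of embeddings. The random tensors are stand-ins for real embeddings, and the helper name `cosine_similarity_matrix` is hypothetical, not part of the packaged code.

```python
import torch
import torch.nn.functional as F


def cosine_similarity_matrix(queries: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
    """Return a [num_queries, num_candidates] matrix of cosine similarities."""
    q = F.normalize(queries, dim=-1)
    c = F.normalize(candidates, dim=-1)
    return q @ c.T


# Stand-ins for embeddings of two text items and three audio items.
torch.manual_seed(0)
text_embs = torch.randn(2, 768)
audio_embs = torch.randn(3, 768)

sims = cosine_similarity_matrix(text_embs, audio_embs)
print(sims.shape)  # torch.Size([2, 3])
```

Normalizing once and using a matrix product avoids a Python loop over pairs, which matters when many candidates are scored per query.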

	## Recommended use cases

	| Scenario | Example |
	| --- | --- |
	| Multimodal RAG | Retrieve text notes using an image or audio query |
	| Agent routing | Route screenshots, user text, or voice requests to the right tool or workflow |
	| Memory search | Search mixed text/image/audio memory stores in one vector space |
	| Support and operations | Match tickets, screenshots, logs, and recorded calls semantically |
	| Offline indexing | Build high-quality 768-dimensional multimodal indexes |
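
As one way to picture the agent-routing row, the sketch below picks the route whose embedding is closest to a query embedding and declines to route when nothing is similar enough. The route names, the toy 8-dimensional vectors, the `route` helper, and the `min_similarity` threshold of 0.3 are illustrative assumptions; a real threshold would need tuning on held-out traffic.

```python
from typing import Optional

import torch
import torch.nn.functional as F


def route(
    query_emb: torch.Tensor,
    route_embs: torch.Tensor,
    route_names: list[str],
    min_similarity: float = 0.3,
) -> Optional[str]:
    """Return the best-matching route name, or None if nothing clears the threshold."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), route_embs, dim=-1)
    best = int(torch.argmax(sims))
    return route_names[best] if sims[best].item() >= min_similarity else None


# Toy route embeddings: three orthogonal directions in an 8-dimensional space.
route_names = ["billing", "shipping", "account"]
route_embs = torch.eye(3, 8)

near_shipping = route_embs[1] + 0.1 * torch.ones(8)  # close to the "shipping" direction
unrelated = torch.zeros(8)
unrelated[5] = 1.0  # orthogonal to every route

print(route(near_shipping, route_embs, route_names))  # shipping
print(route(unrelated, route_embs, route_names))      # None
```

Returning `None` for low-similarity queries gives the caller a hook for a fallback workflow instead of forcing a poor match.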

	## Quick start on ModelScope

	```bash
	pip install modelscope torch sentence-transformers transformers accelerate safetensors pillow librosa soundfile
	```

	```python
	import json
	import os
	import sys

	import torch
	import torch.nn.functional as F
	from modelscope import snapshot_download

	# Download the repository snapshot (weights, config, and bundled source).
	repo_id = "agentic-intelligence-lab/elephant-embeddings-v1-multimodal-large"
	local_dir = snapshot_download(repo_id)

	# Make the packaged `hf_st_mm` source importable.
	sys.path.insert(0, os.path.join(local_dir, "src"))

	from hf_st_mm.data import PairItem
	from hf_st_mm.model import MultiModalSentenceEmbedder

	with open(os.path.join(local_dir, "config.json"), "r", encoding="utf-8") as handle:
	    cfg = json.load(handle)

	# Build the tri-encoder from the exported configuration.
	model = MultiModalSentenceEmbedder(
	    text_encoder_name=cfg["model"]["text_encoder_name"],
	    image_encoder_name=cfg["model"]["image_encoder_name"],
	    audio_encoder_name=cfg["model"]["audio_encoder_name"],
	    embedding_dim=int(cfg["model"]["embedding_dim"]),
	    max_text_length=int(cfg["model"]["max_text_length"]),
	)

	# Load the exported weights on CPU and switch to inference mode.
	state_dict = torch.load(os.path.join(local_dir, "model.pt"), map_location="cpu")
	model.load_state_dict(state_dict)
	model.eval()

	# Encode one item per modality; image and audio values are local file paths.
	items = [
	    PairItem(modality="text", value="route this request to the billing workflow"),
	    PairItem(modality="image", value="/path/to/screenshot.png"),
	    PairItem(modality="audio", value="/path/to/call.wav"),
	]

	with torch.no_grad():
	    embeddings = model.encode_items(items)

	print(embeddings.shape)  # [3, 768]

	# Cross-modal similarity between a text query and an audio candidate.
	query = PairItem(modality="text", value="refund request for a wrong charge")
	candidate = PairItem(modality="audio", value="/path/to/refund_call.wav")

	with torch.no_grad():
	    embs = model.encode_items([query, candidate])

	similarity = F.cosine_similarity(embs[0:1], embs[1:2]).item()
	print(f"similarity={similarity:.4f}")
	```

	## Evaluation snapshot

	| Metric | Value |
	| --- | ---: |
	| Eval loss | 0.389702 |
	| Eval top1 | 0.861707 |

	The validation metrics come from the tri-encoder cached retrieval validation path used during export. They are intended as a release sanity snapshot rather than a public leaderboard claim.
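
The card does not define how "Eval top1" is computed. One common reading, assumed here, is top-1 retrieval accuracy over paired embeddings: the fraction of queries whose own paired candidate receives the highest cosine similarity. The sketch below illustrates that reading on toy data; `top1_accuracy` is a hypothetical helper, not the evaluation code used for the release.

```python
import torch
import torch.nn.functional as F


def top1_accuracy(queries: torch.Tensor, candidates: torch.Tensor) -> float:
    """Fraction of queries whose paired candidate (same row index) ranks first."""
    sims = F.normalize(queries, dim=-1) @ F.normalize(candidates, dim=-1).T
    predicted = sims.argmax(dim=1)
    expected = torch.arange(queries.size(0), device=queries.device)
    return (predicted == expected).float().mean().item()


# Toy data: four one-hot candidates, with the last two queries swapped so that
# exactly two of the four queries retrieve their own pair.
candidates = torch.eye(4)
queries = candidates[[0, 1, 3, 2]]

print(top1_accuracy(queries, candidates))  # 0.5
```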

	## Files

	| File | Description |
	| --- | --- |
	| `model.pt` | Exported PyTorch weights |
	| `config.json` | Tri-encoder and training/export configuration |
	| `src/hf_st_mm/` | Python package used to construct and run the model |
	| `README.md` | This model card |
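
The quick start reads five keys from the `model` section of `config.json`. A small check like the following can fail fast when a download is incomplete or the config layout differs. The example config is assembled from the "Model at a glance" table and is an assumption; the packaged file may contain further sections and fields, and `missing_model_keys` is a hypothetical helper.

```python
import json

# Keys that the quick start reads from cfg["model"].
REQUIRED_MODEL_KEYS = (
    "text_encoder_name",
    "image_encoder_name",
    "audio_encoder_name",
    "embedding_dim",
    "max_text_length",
)


def missing_model_keys(cfg: dict) -> list[str]:
    """List the required keys that are absent from cfg["model"]."""
    model_cfg = cfg.get("model", {})
    return [key for key in REQUIRED_MODEL_KEYS if key not in model_cfg]


# Hypothetical config built from the values in the model card's summary table.
example = json.loads("""
{
  "model": {
    "text_encoder_name": "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    "image_encoder_name": "google/siglip2-so400m-patch14-384",
    "audio_encoder_name": "openai/whisper-medium",
    "embedding_dim": 768,
    "max_text_length": 32768
  }
}
""")

print(missing_model_keys(example))  # []
```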

	## Lineage

	This ModelScope package is published by `agentic-intelligence-lab` as part of the Elephant model release line. It mirrors the upstream HuggingFace model `llm-semantic-router/multi-modal-embed-large` and keeps the model artifacts unchanged except for the repository naming and model card presentation.
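
Readers who want to confirm for themselves that an artifact matches its upstream counterpart could compare checksums of the two downloads. This is a minimal sketch using only the standard library. The helper names and the directory variables in the closing comment are hypothetical, and the card does not publish reference digests.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_artifact(path_a: Path, path_b: Path) -> bool:
    """True when both files have identical SHA-256 digests."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)


# Example call, assuming both snapshots are already on disk:
# same_artifact(Path(mirror_dir) / "model.pt", Path(upstream_dir) / "model.pt")
```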

	## Limitations

	- This is a custom PyTorch tri-encoder export, not a standard Transformers auto-class checkpoint.
	- Inference relies on the packaged `hf_st_mm` source code.
	- Image and audio inputs are expected as local file paths in the simple inference path.
	- The model is optimized for retrieval, routing, and similarity, not generation or captioning.
	- Reported validation metrics come from an internal cached retrieval validation set.
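
Since the simple inference path expects image and audio inputs as local file paths, media that arrives as in-memory bytes (for example from an upload) first needs to be written to disk. The sketch below shows one way to do that with the standard library; `bytes_to_temp_path` is a hypothetical helper and the byte string is a placeholder, not a valid image.

```python
import os
import tempfile


def bytes_to_temp_path(data: bytes, suffix: str) -> str:
    """Write in-memory media bytes to a temporary file and return its path.

    The caller is responsible for deleting the file when done.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as handle:
        handle.write(data)
        return handle.name


path = bytes_to_temp_path(b"placeholder image bytes", suffix=".png")
try:
    # The path could now be passed as PairItem(modality="image", value=path).
    print(os.path.exists(path))  # True
finally:
    os.remove(path)
```

Keeping the original file extension in `suffix` is a cautious choice, since decoders often infer the format from it.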

	## Citation

	```bibtex
	@misc{elephant-embeddings-v1-multimodal-large,
	  title={Elephant Embeddings V1 Multimodal Large},
	  author={Agentic Intelligence Lab},
	  year={2026},
	  url={https://modelscope.cn/models/agentic-intelligence-lab/elephant-embeddings-v1-multimodal-large}
	}
	```

	## License

	Apache 2.0