| license: apache-2.0 | |
| library_name: pytorch | |
| pipeline_tag: sentence-similarity | |
| tags: | |
| - sentence-transformers | |
| - multimodal | |
| - embeddings | |
| - retrieval | |
| - image-text | |
| - audio-text | |
| - text-image-audio | |
| - tri-encoder | |
| - semantic-router | |
| - pytorch | |
| model-index: | |
| - name: multi-modal-embed-large | |
|   results: | |
|   - task: | |
|       type: sentence-similarity | |
|     dataset: | |
|       name: Internal cached validation set | |
|       type: cached_retrieval_validation | |
|     metrics: | |
|     - name: Eval loss | |
|       type: eval_loss | |
|       value: 0.389702 | |
|     - name: Eval top1 | |
|       type: eval_top1 | |
|       value: 0.861707 | |
| # multi-modal-embed-large | |
| `multi-modal-embed-large` is the large production multimodal embedding model from the [llm-semantic-router](https://huggingface.co/llm-semantic-router) project. | |
| It is designed for routing, retrieval, and cross-modal matching across text, image, and audio rather than for generative chat. The model uses a tri-encoder architecture with separate text, image, and audio towers projected into one shared embedding space. | |
| ## Purpose | |
| This release provides a large multimodal embedding model for production systems whose inputs may arrive as text, images (including screenshots), or audio. It is built for semantic routing, multimodal retrieval, and cross-modal similarity. | |
| ## What Is In This Repository | |
| This repository contains the minimum artifacts needed to load and run the exported model: | |
| - `model.pt`: trained weights for the final exported model | |
| - `config.json`: model configuration and encoder names | |
| - `src/hf_st_mm/...`: the Python source package used to construct and run the tri-encoder | |
| - `README.md`: this model card, including usage examples and validation summary | |
| This is not a generic Hugging Face Transformers checkpoint with a built-in auto-class loader. It is a packaged custom PyTorch model export. | |
| ## Advantages And Innovation | |
| Most multimodal models are optimized for generation, captioning, or chat. This model is optimized for embeddings and operational use. | |
| What is different here. The model is built to: | |
| - map text, image, and audio into one shared semantic space | |
| - support routing and retrieval instead of text generation | |
| - preserve a strong multilingual text backbone | |
| - use a dedicated encoder for each modality instead of forcing every modality into one monolithic checkpoint | |
| - support production training and evaluation on cached shard datasets | |
| ## Model Overview | |
| This release packages the large routing-grade tri-encoder trained in PyTorch with the server training stack from this project. | |
| Architecture: | |
| - text encoder: `llm-semantic-router/mmbert-embed-32k-2d-matryoshka` | |
| - image encoder: `google/siglip2-so400m-patch14-384` | |
| - audio encoder: `openai/whisper-medium` | |
| - shared embedding dimension: `768` | |
| - max text length: `32768` | |
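To make the "separate towers, one shared space" idea concrete, here is a minimal sketch of per-modality projection heads. It is not the packaged `hf_st_mm` implementation: the class name `ToyTriEncoderHead`, the use of a single linear layer per modality, the L2 normalisation, and the tower widths are all assumptions chosen for illustration. Only the shared dimension of `768` comes from the list above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyTriEncoderHead(nn.Module):
    """Hypothetical projection heads; NOT the packaged hf_st_mm code."""

    def __init__(self, tower_dims: dict, embedding_dim: int = 768):
        super().__init__()
        # One linear projection per modality, all ending in the same dimension.
        self.proj = nn.ModuleDict(
            {name: nn.Linear(dim, embedding_dim) for name, dim in tower_dims.items()}
        )

    def forward(self, modality: str, pooled: torch.Tensor) -> torch.Tensor:
        # `pooled` stands in for the pooled output of that modality's tower.
        projected = self.proj[modality](pooled)
        # Unit-normalise so a dot product equals cosine similarity.
        return F.normalize(projected, dim=-1)


# Illustrative tower widths only; real values come from each encoder's config.
head = ToyTriEncoderHead({"text": 768, "image": 1152, "audio": 1024})
text_emb = head("text", torch.randn(2, 768))
image_emb = head("image", torch.randn(2, 1152))
audio_emb = head("audio", torch.randn(2, 1024))
print(text_emb.shape, image_emb.shape, audio_emb.shape)
```

Because every modality ends in the same dimension, any pair of outputs can be compared directly, which is what makes cross-modal retrieval possible.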
| Training characteristics: | |
| - objective: cached multiple negatives ranking loss | |
| - training stack: PyTorch + Accelerate | |
| - target hardware: AMD MI300X | |
| - data pipeline: cached tensor shards with sequential shard loading and worker-local prefetch | |
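For readers unfamiliar with the objective named above, the following sketch shows a plain multiple negatives ranking loss with in-batch negatives. It is an illustration under assumptions, not the project's trainer: the function name and the `scale=20.0` temperature are hypothetical, and the "cached" part (computing the same loss in memory-bounded chunks) is left out entirely.

```python
import torch
import torch.nn.functional as F


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """In-batch negatives: row i of `positives` is the only match for anchor i."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    # (batch, batch) matrix of scaled cosine similarities.
    logits = scale * (a @ p.T)
    # The correct "class" for anchor i is column i; all other columns are negatives.
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)


torch.manual_seed(0)
anchors = torch.randn(4, 768)
# Positives that nearly match their anchors should give a loss close to zero.
aligned = multiple_negatives_ranking_loss(anchors, anchors + 0.01 * torch.randn(4, 768))
# Unrelated positives should give a much higher loss.
unrelated = multiple_negatives_ranking_loss(anchors, torch.randn(4, 768))
print(f"aligned={aligned.item():.4f} unrelated={unrelated.item():.4f}")
```

In a cross-modal setting the anchors and positives would come from different towers, for example a text query paired with its matching image or audio clip.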
| ## How To Use It | |
| ### Installation | |
| ```bash | |
| pip install torch sentence-transformers transformers accelerate safetensors pillow librosa soundfile huggingface_hub | |
| ``` | |
| ### Python Usage | |
| The simplest way to use the model is to download the repository snapshot, load the packaged source code, and then encode one or more modality-tagged items. | |
| ```python | |
| import json | |
| import os | |
| import sys | |
| import torch | |
| import torch.nn.functional as F | |
| from huggingface_hub import snapshot_download | |
| # Download the repository snapshot and make the packaged source importable. | |
| repo_id = "llm-semantic-router/multi-modal-embed-large" | |
| local_dir = snapshot_download(repo_id=repo_id) | |
| sys.path.insert(0, os.path.join(local_dir, "src")) | |
| from hf_st_mm.data import PairItem | |
| from hf_st_mm.model import MultiModalSentenceEmbedder | |
| # Read the encoder names and dimensions from the packaged config. | |
| with open(os.path.join(local_dir, "config.json"), "r", encoding="utf-8") as handle: | |
|     cfg = json.load(handle) | |
| model = MultiModalSentenceEmbedder( | |
|     text_encoder_name=cfg["model"]["text_encoder_name"], | |
|     image_encoder_name=cfg["model"]["image_encoder_name"], | |
|     audio_encoder_name=cfg["model"]["audio_encoder_name"], | |
|     embedding_dim=int(cfg["model"]["embedding_dim"]), | |
|     max_text_length=int(cfg["model"]["max_text_length"]), | |
| ) | |
| # Load the exported weights on CPU and switch to inference mode. | |
| state_dict = torch.load(os.path.join(local_dir, "model.pt"), map_location="cpu") | |
| model.load_state_dict(state_dict) | |
| model.eval() | |
| # Encode one item per modality; image and audio values are file paths. | |
| items = [ | |
|     PairItem(modality="text", value="route this request to the billing team"), | |
|     PairItem(modality="image", value="/path/to/screenshot.png"), | |
|     PairItem(modality="audio", value="/path/to/call.wav"), | |
| ] | |
| with torch.no_grad(): | |
|     embeddings = model.encode_items(items) | |
| print(embeddings.shape)  # [3, 768] | |
| # Cross-modal similarity: compare a text query against an audio candidate. | |
| query = PairItem(modality="text", value="refund request for wrong charge") | |
| candidate = PairItem(modality="audio", value="/path/to/refund_call.wav") | |
| with torch.no_grad(): | |
|     embs = model.encode_items([query, candidate]) | |
| similarity = F.cosine_similarity(embs[0:1], embs[1:2]).item() | |
| print(f"similarity={similarity:.4f}") | |
| ``` | |
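Since the model targets semantic routing, a small sketch of how embeddings might drive a routing decision may help. Everything here is an assumption for illustration: the `route` helper, the route names, and the `threshold=0.5` cutoff are hypothetical, and toy 3-dimensional vectors stand in for rows of the `model.encode_items(...)` output.

```python
import torch
import torch.nn.functional as F


def route(query_emb, route_embs, route_names, threshold=0.5):
    """Return (best_route, score); best_route is None when below `threshold`."""
    # Cosine similarity between the query and every route embedding.
    scores = F.cosine_similarity(query_emb.unsqueeze(0), route_embs, dim=-1)
    best = int(scores.argmax())
    score = float(scores[best])
    return (route_names[best] if score >= threshold else None), score


# Toy vectors; in practice each row would be an embedding of a route description.
route_names = ["billing", "tech_support"]
route_embs = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

matched = route(torch.tensor([0.9, 0.1, 0.0]), route_embs, route_names)
unmatched = route(torch.tensor([0.0, 0.0, 1.0]), route_embs, route_names)
print(matched)    # close to the "billing" route
print(unmatched)  # no route clears the threshold
```

The threshold gives the router a way to decline when nothing matches well; a suitable value would need to be tuned on real traffic.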
| ## Validation Snapshot | |
| At upload time, the final export was evaluated with the repository's tri-encoder evaluator. | |
| - `eval_loss`: `0.389702` | |
| - `eval_top1`: `0.861707` | |
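The evaluator's exact definition of `eval_top1` is not shown in this card. A common reading of top-1 retrieval accuracy is sketched below, assuming paired data where query `i` should retrieve candidate `i`; the function name and the toy vectors are hypothetical.

```python
import torch
import torch.nn.functional as F


def top1_accuracy(query_embs, candidate_embs):
    """Fraction of queries whose most similar candidate shares their row index."""
    q = F.normalize(query_embs, dim=-1)
    c = F.normalize(candidate_embs, dim=-1)
    # For each query, pick the candidate with the highest cosine similarity.
    predicted = (q @ c.T).argmax(dim=1)
    target = torch.arange(q.size(0))
    return (predicted == target).float().mean().item()


# Queries 0 and 1 retrieve their own candidate; query 2 retrieves candidate 0.
queries = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
candidates = torch.tensor([[0.9, 0.1, 0.1], [0.1, 0.9, 0.0], [0.5, 0.5, 0.0]])
accuracy = top1_accuracy(queries, candidates)
print(f"top1={accuracy:.4f}")  # top1=0.6667
```

Under this reading, a value of `0.861707` would mean that roughly 86% of validation queries ranked their paired item first.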
| ## Practical Notes | |
| - Text inputs can be provided as raw strings or tokenized features. | |
| - Image and audio inputs can be provided as file paths. | |
| - Cached tensor payloads are supported by the training stack, but the simplest inference path is to use file paths or raw text. | |
| - This release is intended for production retrieval and routing use cases rather than for instruction-following or caption generation. | |
| ## Limitations | |
| - This is a custom tri-encoder export, not a standard Transformers auto-class package. | |
| - Inference currently relies on the packaged `hf_st_mm` source code. | |
| - The validation metrics reported here come from the repository's cached retrieval validation path, not from a public benchmark leaderboard. | |
| ## Training Code | |
| Training and evaluation code live in the server training project that produced this checkpoint. | |
| - trainer: `scripts/train_st_multimodal.py` | |
| - evaluator: `scripts/evaluate_tri_encoder.py` | |
| - model: `src/hf_st_mm/model.py` | |