---
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- multimodal
- embeddings
- retrieval
- image-text
- audio-text
- text-image-audio
- tri-encoder
- semantic-router
- pytorch
model-index:
- name: multi-modal-embed-large
  results:
  - task:
      type: sentence-similarity
    dataset:
      name: Internal cached validation set
      type: cached_retrieval_validation
    metrics:
    - name: Eval loss
      type: eval_loss
      value: 0.389702
    - name: Eval top1
      type: eval_top1
      value: 0.861707
---

# multi-modal-embed-large

`multi-modal-embed-large` is the large production multimodal embedding model from the [llm-semantic-router](https://huggingface.co/llm-semantic-router) project.

It is designed for routing, retrieval, and cross-modal matching across text, image, and audio rather than for generative chat. The model uses a tri-encoder architecture with separate text, image, and audio towers projected into one shared embedding space.

## Purpose

This release provides a large multimodal embedding model for production systems whose inputs may arrive as text, images (including screenshots), or audio. It is built for semantic routing, multimodal retrieval, and cross-modal similarity.

## What Is In This Repository

This repository contains the minimum artifacts needed to load and run the exported model:

- `model.pt`: trained weights for the final exported model
- `config.json`: model configuration and encoder names
- `src/hf_st_mm/...`: the Python source package used to construct and run the tri-encoder
- `README.md`: this model card, including usage examples and validation summary

This is not a generic Hugging Face Transformers checkpoint with a built-in auto-class loader. It is a packaged custom PyTorch model export.

## Advantages And Innovation

Many multimodal models are optimized for generation, captioning, or chat. This model is instead optimized for producing embeddings and for operational use in routing and retrieval.

Design goals that set it apart:

- map text, image, and audio into one shared semantic space
- support routing and retrieval instead of text generation
- preserve a strong multilingual text backbone
- use stronger modality-specific encoders instead of forcing every modality into one monolithic checkpoint
- support production training and evaluation on cached shard datasets

## Model Overview

This release packages the large routing-grade tri-encoder trained in PyTorch with the server training stack from this project.

Architecture:

- text encoder: `llm-semantic-router/mmbert-embed-32k-2d-matryoshka`
- image encoder: `google/siglip2-so400m-patch14-384`
- audio encoder: `openai/whisper-medium`
- shared embedding dimension: `768`
- max text length: `32768`

Training characteristics:

- objective: cached multiple negatives ranking loss
- training stack: PyTorch + Accelerate
- target hardware: AMD MI300X
- data pipeline: cached tensor shards with sequential shard loading and worker-local prefetch
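
The objective listed above, multiple negatives ranking loss, treats every other item in a batch as a negative for a given anchor. The following is a minimal NumPy sketch of the plain (non-cached) form, not this project's trainer: the function names and the `scale` value of 20 are assumptions made for illustration. In Sentence Transformers, `CachedMultipleNegativesRankingLoss` computes the same kind of objective while processing the batch in smaller chunks to bound memory; whether this project uses that class is not stated here.

```python
import numpy as np


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """In-batch contrastive loss: anchors[i] should match positives[i].

    Both inputs are (batch, dim) arrays of L2-normalized embeddings.
    `scale` is an assumed multiplier, not a value taken from this repository.
    """
    logits = scale * (anchors @ positives.T)               # (batch, batch) cosine scores
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct positive for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rng = np.random.default_rng(0)
anchors = l2_normalize(rng.normal(size=(4, 8)))
aligned = multiple_negatives_ranking_loss(anchors, anchors)         # correct pairing
shuffled = multiple_negatives_ranking_loss(anchors, anchors[::-1])  # wrong pairing
print(aligned < shuffled)
```

Correctly paired batches yield a lower loss than mispaired ones, which is the signal that pulls matching text, image, and audio embeddings together during training.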

## How To Use It

### Using Sentence Transformers

Install Sentence Transformers with the audio and image extras:

```bash
pip install "sentence_transformers[image,audio]"
```

Then load the model directly. Modality is inferred automatically from the input (plain strings -> `text`, image paths/URLs/PIL images -> `image`, audio paths/URLs/NumPy arrays -> `audio`):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("llm-semantic-router/multi-modal-embed-large", trust_remote_code=True)

text_embeddings = model.encode(
    [
        "Martin Luther King Jr. delivering his I have a dream speech",
        "two cats sleeping side by side on a pink couch",
    ]
)
image_embeddings = model.encode(
    [
        "http://images.cocodataset.org/val2017/000000039769.jpg",  # two cats on a pink couch
        "http://images.cocodataset.org/val2017/000000000139.jpg",  # distractor
    ]
)
audio_embeddings = model.encode(
    [
        "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac",            # MLK speech
        "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/i-know-kung-fu.mp3",  # distractor
    ]
)

print(text_embeddings.shape, image_embeddings.shape, audio_embeddings.shape)
# (2, 768) (2, 768) (2, 768)

# Each row is a text query, each column a media candidate. When a matching candidate
# exists, it should receive the highest score in its row.
print(model.similarity(text_embeddings, image_embeddings))
# tensor([[0.0704, 0.0121],   # MLK text:  neither image matches
#         [0.5532, 0.3070]])  # cats text: the cats photo wins

print(model.similarity(text_embeddings, audio_embeddings))
# tensor([[ 0.2186,  0.1428],   # MLK text:  the MLK audio wins
#         [-0.0625,  0.0667]])  # cats text: neither audio matches
```
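
The score matrices above show that a row's maximum is only meaningful when some candidate actually matches. One way to handle this in a router is to accept the top candidate only if it clears a threshold. The sketch below illustrates the idea on the scores printed above; the `best_match` helper and the `0.15` cut-off are hypothetical and are not part of this repository.

```python
from typing import Optional, Sequence


def best_match(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the highest score, or None if it is below threshold."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] >= threshold else None


# Scores copied from the example output above (rows: MLK text, cats text).
text_to_image = [[0.0704, 0.0121], [0.5532, 0.3070]]
text_to_audio = [[0.2186, 0.1428], [-0.0625, 0.0667]]

THRESHOLD = 0.15  # hypothetical cut-off; tune it on your own validation data

image_matches = [best_match(row, THRESHOLD) for row in text_to_image]
audio_matches = [best_match(row, THRESHOLD) for row in text_to_audio]
print(image_matches)  # [None, 0]
print(audio_matches)  # [0, None]
```

Score ranges can differ between modality pairs, so a separate threshold per pair may work better than a single global value.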

Each modality routes through the matching sub-module pipeline:

- `text` -> `Transformer(mmbert) -> Pooling(mean) -> Normalize`
- `image` -> `SiglipVisionTransformer -> Pooling(mean) -> Dense(1152, 768) -> Normalize`
- `audio` -> `WhisperEncoderTransformer -> Pooling(mean) -> Dense(1024, 768) -> Normalize`
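
The image and audio pipelines above share one pattern: mean-pool the encoder's token states, project them to the shared 768 dimensions with a dense layer, then L2-normalize. The NumPy sketch below walks through those three steps for the image case. It uses random stand-in weights rather than the trained ones; the function name, the token count of 16, and the presence of a bias term are assumptions.

```python
import numpy as np


def pool_project_normalize(hidden_states, weight, bias):
    """Mean-pool (tokens, in_dim) states, project to out_dim, then L2-normalize."""
    pooled = hidden_states.mean(axis=0)    # (in_dim,)
    projected = weight @ pooled + bias     # (out_dim,)
    return projected / np.linalg.norm(projected)


rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(16, 1152))    # 16 tokens is an arbitrary choice
weight = rng.normal(size=(768, 1152)) * 0.02   # random stand-in for Dense(1152, 768)
bias = np.zeros(768)                           # assumed bias term

embedding = pool_project_normalize(hidden_states, weight, bias)
print(embedding.shape, round(float(np.linalg.norm(embedding)), 6))
```

Because every pipeline ends in a normalization step, the dot product of two embeddings equals their cosine similarity.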


### Using the packaged `hf_st_mm` source code

The original packaged inference path remains available alongside the Sentence Transformers integration. Install the dependencies:

```bash
pip install torch sentence-transformers transformers accelerate safetensors pillow librosa soundfile huggingface_hub
```

Then download the repository snapshot, load the packaged source code, and encode modality-tagged items:

```python
import json
import os
import sys

import torch
from huggingface_hub import snapshot_download

repo_id = "llm-semantic-router/multi-modal-embed-large"
local_dir = snapshot_download(repo_id=repo_id)

sys.path.insert(0, os.path.join(local_dir, "src"))

from hf_st_mm.data import PairItem
from hf_st_mm.model import MultiModalSentenceEmbedder

with open(os.path.join(local_dir, "config.json"), "r", encoding="utf-8") as handle:
    cfg = json.load(handle)

model = MultiModalSentenceEmbedder(
    text_encoder_name=cfg["model"]["text_encoder_name"],
    image_encoder_name=cfg["model"]["image_encoder_name"],
    audio_encoder_name=cfg["model"]["audio_encoder_name"],
    embedding_dim=int(cfg["model"]["embedding_dim"]),
    max_text_length=int(cfg["model"]["max_text_length"]),
)
state_dict = torch.load(os.path.join(local_dir, "model.pt"), map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

items = [
    PairItem(modality="text", value="route this request to the billing team"),
    PairItem(modality="image", value="/path/to/screenshot.png"),
    PairItem(modality="audio", value="/path/to/call.wav"),
]

with torch.no_grad():
    embeddings = model.encode_items(items)

print(embeddings.shape)  # expected shape: [3, 768]

import torch.nn.functional as F

query = PairItem(modality="text", value="refund request for wrong charge")
candidate = PairItem(modality="audio", value="/path/to/refund_call.wav")

with torch.no_grad():
    embs = model.encode_items([query, candidate])

similarity = F.cosine_similarity(embs[0:1], embs[1:2]).item()
print(f"similarity={similarity:.4f}")
```

## Validation Snapshot

At upload time, the final export was evaluated with the repository's tri-encoder evaluator.

- `eval_loss`: `0.389702`
- `eval_top1`: `0.861707`
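
`eval_top1` is presumably top-1 retrieval accuracy: the fraction of queries whose highest-scoring candidate is the item they were paired with. The evaluator's exact definition is not given in this card, so the sketch below is an assumption in which query `i` is paired with candidate `i`; the function name is hypothetical.

```python
import numpy as np


def top1_accuracy(query_embs, candidate_embs):
    """Fraction of queries whose best-scoring candidate shares their row index."""
    scores = query_embs @ candidate_embs.T    # (n_queries, n_candidates)
    predicted = scores.argmax(axis=1)
    return float((predicted == np.arange(len(query_embs))).mean())


# Toy unit vectors: queries 0 and 1 retrieve their own candidate, query 2 does not.
queries = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
accuracy = top1_accuracy(queries, candidates)
print(f"top1={accuracy:.4f}")
```

Top-1 accuracy depends on how many candidates each query competes against, so values are only comparable across runs that use the same candidate pool size.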

## Practical Notes

- Text inputs can be provided as raw strings or tokenized features.
- Image and audio inputs can be provided as file paths; the Sentence Transformers path also accepts URLs, PIL images, and NumPy audio arrays.
- Cached tensor payloads are supported by the training stack, but the simplest inference path is to use file paths or raw text.
- This release is intended for production retrieval and routing use cases rather than for instruction-following or caption generation.

## Limitations

- This is a custom tri-encoder export, not a standard Transformers auto-class package.
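- Inference relies on custom code shipped in this repository: the Sentence Transformers example loads the model with `trust_remote_code=True`, and the original path imports the packaged `hf_st_mm` source code.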
- The validation metrics reported here come from the repository's cached retrieval validation path, not from a public benchmark leaderboard.

## Training Code

Training and evaluation code live in the server training project that produced this checkpoint.

- trainer: `scripts/train_st_multimodal.py`
- evaluator: `scripts/evaluate_tri_encoder.py`
- model: `src/hf_st_mm/model.py`