Tom Aarsen

Integrate with upcoming Sentence Transformers v5.5.0

	---
	license: apache-2.0
	library_name: sentence-transformers
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- multimodal
	- embeddings
	- retrieval
	- image-text
	- audio-text
	- text-image-audio
	- tri-encoder
	- semantic-router
	- pytorch
	model-index:
	- name: multi-modal-embed-large
	  results:
	  - task:
	      type: sentence-similarity
	    dataset:
	      name: Internal cached validation set
	      type: cached_retrieval_validation
	    metrics:
	    - name: Eval loss
	      type: eval_loss
	      value: 0.389702
	    - name: Eval top1
	      type: eval_top1
	      value: 0.861707
	---

	# multi-modal-embed-large

	`multi-modal-embed-large` is the large production multimodal embedding model from the [llm-semantic-router](https://huggingface.co/llm-semantic-router) project.

	It is designed for routing, retrieval, and cross-modal matching across text, image, and audio rather than for generative chat. The model uses a tri-encoder architecture with separate text, image, and audio towers projected into one shared embedding space.

	## Purpose

	This release provides a large multimodal embedding model for production systems whose inputs may arrive as text, images (including screenshots), or audio. It is built for semantic routing, multimodal retrieval, and cross-modal similarity.

	## What Is In This Repository

	This repository contains the minimum artifacts needed to load and run the exported model:

	- `model.pt`: trained weights for the final exported model
	- `config.json`: model configuration and encoder names
	- `src/hf_st_mm/...`: the Python source package used to construct and run the tri-encoder
	- `README.md`: this model card, including usage examples and validation summary

	This is not a generic Hugging Face Transformers checkpoint with a built-in auto-class loader. It is a packaged custom PyTorch model export.

	## Advantages And Innovation

	Most multimodal models are optimized for generation, captioning, or chat. This model is optimized for embeddings and operational use.

	What is different here:

	- map text, image, and audio into one shared semantic space
	- support routing and retrieval instead of text generation
	- preserve a strong multilingual text backbone
	- use stronger modality-specific encoders instead of forcing every modality into one monolithic checkpoint
	- support production training and evaluation on cached shard datasets

	## Model Overview

	This release packages the large routing-grade tri-encoder trained in PyTorch with the server training stack from this project.

	Architecture:

	- text encoder: `llm-semantic-router/mmbert-embed-32k-2d-matryoshka`
	- image encoder: `google/siglip2-so400m-patch14-384`
	- audio encoder: `openai/whisper-medium`
	- shared embedding dimension: `768`
	- max text length: `32768`

	Training characteristics:

	- objective: cached multiple negatives ranking loss
	- training stack: PyTorch + Accelerate
	- target hardware: AMD MI300X
	- data pipeline: cached tensor shards with sequential shard loading and worker-local prefetch
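
The ranking objective named above can be illustrated with a small sketch. It assumes the term refers to the usual in-batch softmax formulation, where each anchor's paired item is the positive and every other item in the batch is a negative. The caching part of the objective, the `scale` value, and the toy vectors are assumptions for illustration and are not taken from the training code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i].

    Every other positive in the batch acts as a negative for anchors[i].
    `scale` multiplies the cosine scores before the softmax (assumed value).
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the row of logits.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two text anchors and their paired media embeddings.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [aligned[1], aligned[0]]

loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_swapped = multiple_negatives_ranking_loss(anchors, swapped)
print(f"aligned={loss_aligned:.4f} swapped={loss_swapped:.4f}")
```

The loss is near zero when each anchor sits closest to its own pair and grows when the pairs are swapped, which is the behavior that pulls matching text, image, and audio embeddings together.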

	## How To Use It

	### Using Sentence Transformers

	Install Sentence Transformers with the audio and image extras:

	```bash
	pip install "sentence_transformers[image,audio]"
	```

	Then load the model directly. Modality is inferred automatically from the input (plain strings -> `text`, image paths/URLs/PIL images -> `image`, audio paths/URLs/NumPy arrays -> `audio`):

	```python
	from sentence_transformers import SentenceTransformer

	model = SentenceTransformer("llm-semantic-router/multi-modal-embed-large", trust_remote_code=True)

	text_embeddings = model.encode(
	    [
	        "Martin Luther King Jr. delivering his I have a dream speech",
	        "two cats sleeping side by side on a pink couch",
	    ]
	)
	image_embeddings = model.encode(
	    [
	        "http://images.cocodataset.org/val2017/000000039769.jpg",  # two cats on a pink couch
	        "http://images.cocodataset.org/val2017/000000000139.jpg",  # distractor
	    ]
	)
	audio_embeddings = model.encode(
	    [
	        "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac",  # MLK speech
	        "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/i-know-kung-fu.mp3",  # distractor
	    ]
	)

	print(text_embeddings.shape, image_embeddings.shape, audio_embeddings.shape)
	# (2, 768) (2, 768) (2, 768)

	# Each row is a text query and each column a media candidate. A true match scores clearly
	# higher than the rest of its row; rows without a matching candidate stay low throughout.
	print(model.similarity(text_embeddings, image_embeddings))
	# tensor([[0.0704, 0.0121],    # MLK text: neither image matches
	#         [0.5532, 0.3070]])   # cats text: the cats photo wins

	print(model.similarity(text_embeddings, audio_embeddings))
	# tensor([[ 0.2186, 0.1428],   # MLK text: the MLK audio wins
	#         [-0.0625, 0.0667]])  # cats text: neither audio matches
	```
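
For routing, a similarity matrix like the ones printed above usually has to be turned into a decision per query, including a "no match" outcome. The following is a minimal sketch of one way to do that with a score floor. The `route` helper, the label names, and the `min_score` value of `0.15` are hypothetical; a real threshold would need calibration on representative data.

```python
def route(scores, labels, min_score=0.15):
    """Return the best label per query row, or None when no score clears min_score."""
    decisions = []
    for row in scores:
        best = max(range(len(row)), key=row.__getitem__)
        decisions.append(labels[best] if row[best] >= min_score else None)
    return decisions


# Similarity values copied from the printed output above.
text_to_image = [[0.0704, 0.0121], [0.5532, 0.3070]]
text_to_audio = [[0.2186, 0.1428], [-0.0625, 0.0667]]

# Label names are illustrative placeholders for the two candidates per modality.
image_routes = route(text_to_image, ["cats_photo", "distractor_photo"])
audio_routes = route(text_to_audio, ["mlk_audio", "distractor_audio"])
print(image_routes)
print(audio_routes)
```

With this floor, the MLK text finds no image and the cats text finds no audio, which agrees with the comments on the printed scores.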

	Each modality routes through the matching sub-module pipeline:

	- `text` -> `Transformer(mmbert) -> Pooling(mean) -> Normalize`
	- `image` -> `SiglipVisionTransformer -> Pooling(mean) -> Dense(1152, 768) -> Normalize`
	- `audio` -> `WhisperEncoderTransformer -> Pooling(mean) -> Dense(1024, 768) -> Normalize`
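
The image and audio pipelines listed above share the same tail: mean pooling, a dense projection into the shared space, and L2 normalization. A rough sketch of that tail in plain Python follows. The toy dimensions, the random weights, and the helper names are assumptions for illustration; the real modules operate on tensors with the sizes shown in the list.

```python
import math
import random


def mean_pool(token_vectors):
    """Average a sequence of token (or patch/frame) vectors into one vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def dense(vector, weight, bias):
    """Linear projection; weight has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weight, bias)]


def l2_normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Toy sizes stand in for the real ones (for example 1152 -> 768 for the image tower).
rng = random.Random(0)
in_dim, out_dim, num_patches = 6, 4, 5
patches = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(num_patches)]
weight = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]
bias = [0.0] * out_dim

embedding = l2_normalize(dense(mean_pool(patches), weight, bias))
print(len(embedding), sum(x * x for x in embedding))
```

Because every tower ends in the same normalization step, embeddings from different modalities have unit length and can be compared directly with a dot product.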


	### Using the packaged `hf_st_mm` source code

	The original packaged inference path remains available alongside the Sentence Transformers integration. Install the dependencies:

	```bash
	pip install torch sentence-transformers transformers accelerate safetensors pillow librosa soundfile huggingface_hub
	```

	Then download the repository snapshot, load the packaged source code, and encode modality-tagged items:

	```python
	import json
	import os
	import sys

	import torch
	import torch.nn.functional as F
	from huggingface_hub import snapshot_download

	repo_id = "llm-semantic-router/multi-modal-embed-large"
	local_dir = snapshot_download(repo_id=repo_id)

	# Make the packaged source importable before importing from it.
	sys.path.insert(0, os.path.join(local_dir, "src"))

	from hf_st_mm.data import PairItem
	from hf_st_mm.model import MultiModalSentenceEmbedder

	with open(os.path.join(local_dir, "config.json"), "r", encoding="utf-8") as handle:
	    cfg = json.load(handle)

	# Rebuild the tri-encoder from the config, then load the exported weights.
	model = MultiModalSentenceEmbedder(
	    text_encoder_name=cfg["model"]["text_encoder_name"],
	    image_encoder_name=cfg["model"]["image_encoder_name"],
	    audio_encoder_name=cfg["model"]["audio_encoder_name"],
	    embedding_dim=int(cfg["model"]["embedding_dim"]),
	    max_text_length=int(cfg["model"]["max_text_length"]),
	)
	state_dict = torch.load(os.path.join(local_dir, "model.pt"), map_location="cpu")
	model.load_state_dict(state_dict)
	model.eval()

	# Encode one item per modality into the shared embedding space.
	items = [
	    PairItem(modality="text", value="route this request to the billing team"),
	    PairItem(modality="image", value="/path/to/screenshot.png"),
	    PairItem(modality="audio", value="/path/to/call.wav"),
	]

	with torch.no_grad():
	    embeddings = model.encode_items(items)

	print(embeddings.shape)  # [3, 768]

	# Score a text query against an audio candidate.
	query = PairItem(modality="text", value="refund request for wrong charge")
	candidate = PairItem(modality="audio", value="/path/to/refund_call.wav")

	with torch.no_grad():
	    embs = model.encode_items([query, candidate])

	similarity = F.cosine_similarity(embs[0:1], embs[1:2]).item()
	print(f"similarity={similarity:.4f}")
	```
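
The loading code above reads five keys from the `model` section of `config.json`. A small check like the following can fail early when one of them is absent. This is a sketch: `missing_model_keys` is a hypothetical helper, and the example config is assembled from the Architecture list in this card rather than copied from the repository, whose actual file may contain more fields.

```python
# Keys read by the loading code above; the real config.json may hold more.
REQUIRED_MODEL_KEYS = (
    "text_encoder_name",
    "image_encoder_name",
    "audio_encoder_name",
    "embedding_dim",
    "max_text_length",
)


def missing_model_keys(cfg):
    """Return the required keys absent from cfg["model"], in declaration order."""
    model_cfg = cfg.get("model", {})
    return [key for key in REQUIRED_MODEL_KEYS if key not in model_cfg]


# Illustrative config assembled from the Architecture list, not copied from the repo.
example_cfg = {
    "model": {
        "text_encoder_name": "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
        "image_encoder_name": "google/siglip2-so400m-patch14-384",
        "audio_encoder_name": "openai/whisper-medium",
        "embedding_dim": 768,
        "max_text_length": 32768,
    }
}

missing = missing_model_keys(example_cfg)
incomplete = missing_model_keys({"model": {"embedding_dim": 768}})
print(missing, incomplete)
```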

	## Validation Snapshot

	At upload time, the final export was evaluated with the repository's tri-encoder evaluator.

	- `eval_loss`: `0.389702`
	- `eval_top1`: `0.861707`
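
This card does not show how the evaluator defines `eval_top1`. A common reading, assumed here, is retrieval top-1 accuracy: the fraction of queries whose highest-scoring candidate is the true pair. The sketch below computes that on a toy matrix in which candidate `i` is the true match for query `i`; the function name and the numbers are illustrative only.

```python
def top1_accuracy(similarity):
    """Fraction of rows whose highest-scoring column is the diagonal entry."""
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(similarity)


# Toy 3x3 query-by-candidate matrix; candidate i is the true match for query i.
similarity = [
    [0.62, 0.10, 0.05],
    [0.20, 0.48, 0.31],
    [0.07, 0.55, 0.40],  # miss: candidate 1 outranks the true match
]
accuracy = top1_accuracy(similarity)
print(f"top1={accuracy:.3f}")
```

Under this reading, a value of `0.861707` would mean roughly 86% of validation queries ranked their true pair first.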

	## Practical Notes

	- Text inputs can be provided as raw strings or tokenized features.
	- Image and audio inputs can be provided as file paths; the Sentence Transformers path also accepts URLs, PIL images, and NumPy audio arrays.
	- Cached tensor payloads are supported by the training stack, but the simplest inference path is to use file paths or raw text.
	- This release is intended for production retrieval and routing use cases rather than for instruction-following or caption generation.

	## Limitations

	- This is a custom tri-encoder export, not a standard Transformers auto-class package.
	- The packaged inference path relies on the `hf_st_mm` source code, and the Sentence Transformers path loads the model with `trust_remote_code=True`.
	- The validation metrics reported here come from the repository's cached retrieval validation path, not from a public benchmark leaderboard.

	## Training Code

	Training and evaluation code live in the server training project that produced this checkpoint.

	- trainer: `scripts/train_st_multimodal.py`
	- evaluator: `scripts/evaluate_tri_encoder.py`
	- model: `src/hf_st_mm/model.py`