aegis-embed

aegis-embed is a multilingual long-context embedding model purpose-built for agent-native retrieval, memory, and decision workflows.

It is designed for systems where embeddings sit on the semantic hot path rather than at the edge of the stack: memory lookup, knowledge retrieval, tool matching, task routing, long-horizon recall, clustering, and multilingual indexing. Its value is not just a benchmark score, but a practical operating profile that fits real agent runtimes: 32K context, 2D Matryoshka adaptability across dimensions and layers, 307M-class deployability, and strong latency-quality efficiency under repeated inference.

In short, aegis-embed is built for teams that want one embedding space to support fast routing, scalable retrieval, and high-confidence semantic matching without paying the operational cost of a much larger model.

Why it fits agentic workloads

Agentic systems do not call embeddings once. They call them everywhere: before retrieval, during routing, when matching tools, when searching memory, and while compressing or re-ranking state. That means a useful agent embedding model must be more than accurate — it must also be flexible under tight runtime budgets.

aegis-embed is designed around that reality.

1. One model, many budget tiers

This model supports Matryoshka embeddings, which means you can encode once at full size and truncate to smaller dimensions with limited quality loss.

That is especially useful for agent systems because different stages of the stack often need different budgets:

  • 64d for very cheap candidate generation, broad routing, or huge memory banks
  • 256d for balanced retrieval over large corpora
  • 768d for highest-quality retrieval, offline indexing, or final-stage matching

Instead of managing separate embedding models for each tier, you can keep one semantic space and choose the dimensional budget that matches the task.

2. 2D Matryoshka gives runtime flexibility, not just storage savings

The model is trained with 2D Matryoshka behavior:

  • dimension reduction for smaller vectors and lower storage / bandwidth cost
  • layer reduction for lower-latency inference paths in custom runtimes

This matters for agents because the same system often mixes:

  • latency-sensitive routing decisions
  • high-volume memory scans
  • higher-quality retrieval for final evidence gathering

A single model that can serve multiple latency / quality profiles is much easier to operate than a stack of unrelated specialized encoders.
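To make the layer-reduction idea concrete, here is a toy torch sketch (illustrative only, not the actual model's API): with 2D Matryoshka training, the hidden state after the first k encoder layers is itself a usable representation, so a low-latency path can simply stop early and normalize what it has.

```python
import torch
import torch.nn as nn

# Toy stand-in for a 22-layer encoder stack. The real model is loaded via
# Sentence Transformers; this only demonstrates the early-exit principle.
torch.manual_seed(0)
hidden = 768
layers = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(22))

def encode(x: torch.Tensor, depth: int) -> torch.Tensor:
    """Run only the first `depth` layers, then L2-normalize the output."""
    for layer in layers[:depth]:
        x = torch.tanh(layer(x))
    return nn.functional.normalize(x, p=2, dim=-1)

x = torch.randn(4, hidden)
fast = encode(x, depth=6)    # low-latency tier
full = encode(x, depth=22)   # full-quality tier
print(fast.shape, full.shape)
```

Both tiers emit vectors in the same shape, so downstream similarity code does not need to know which latency profile produced them.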

3. Long context helps when agent state is not naturally short

Many agent workloads are not short isolated queries. They involve:

  • tool descriptions
  • execution traces
  • long notes
  • merged memory summaries
  • multi-hop research snippets
  • large document chunks

With 32,768 tokens of context length, aegis-embed can represent larger semantic units before you are forced into aggressive chunking. That helps preserve cross-section meaning in long documents and richer memory entries.
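A quick back-of-envelope comparison shows why this matters. Assuming a hypothetical 60,000-token document and ignoring chunk overlap, a conventional 512-token encoder window forces far more chunks than a 32,768-token window:

```python
import math

# Chunk counts for one long document under two context windows
# (hypothetical document length; overlap ignored for simplicity).
doc_tokens = 60_000
chunks = {w: math.ceil(doc_tokens / w) for w in (512, 32_768)}
print(chunks)  # {512: 118, 32768: 2}
```

Fewer chunks means fewer vectors to store and fewer places where cross-section meaning is cut in half.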

4. Small enough to be operationally practical

At roughly 307M parameters, this model sits in a useful middle ground:

  • substantially lighter than large embedding models in the 600M+ or multi-billion range
  • still expressive enough for multilingual retrieval and similarity work
  • easier to host in systems where embedding is part of a hot path rather than an occasional offline batch

For agentic platforms, that usually means better economics and simpler scaling.

5. One embedding space across the stack

Agent systems are easier to operate when routing, retrieval, memory search, and semantic matching all live in the same vector space.

aegis-embed is well suited to that pattern:

  • 64d can serve broad routing and large-memory scanning
  • 256d can cover the main retrieval tier
  • 768d can stay reserved for the highest-fidelity matching paths

That means one model can cover multiple semantic stages without forcing the system to juggle incompatible encoders, duplicated indexes, or divergent retrieval behavior.
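The tiered pattern above can be sketched with numpy. This is a minimal illustration with random stand-in vectors (not real embeddings): encode once at 768d, store the full vectors, scan cheaply at 64d, then rerank only the survivors at full fidelity. The bank, sizes, and noise level are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical memory bank: 10k items encoded once at 768d, L2-normalized.
bank = rng.standard_normal((10_000, 768)).astype(np.float32)
bank /= np.linalg.norm(bank, axis=1, keepdims=True)

# A query that is a slightly noisy copy of item 42.
query = bank[42] + 0.01 * rng.standard_normal(768).astype(np.float32)
query /= np.linalg.norm(query)

def truncate(v, d):
    """Matryoshka truncation: keep the leading d dims, then re-normalize."""
    t = v[..., :d]
    return t / np.linalg.norm(t, axis=-1, keepdims=True)

# Tier 1: cheap 64d scan to fetch a candidate set.
candidates = np.argsort(truncate(bank, 64) @ truncate(query, 64))[-100:]

# Tier 2: rerank only the candidates at full 768d fidelity.
best = candidates[np.argmax(bank[candidates] @ query)]
print(best)
```

The 64d scan touches every vector but at a twelfth of the bandwidth; the 768d dot products touch only 100 vectors, so the expensive tier stays cheap.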

Model at a glance

| Feature | Value |
|---|---|
| Parameters | 307M |
| Architecture | ModernBERT encoder with YaRN scaling |
| Hidden size | 768 |
| Layers | 22 |
| Context length | 32,768 tokens |
| Pooling | Mean pooling |
| Similarity | Cosine |
| Languages | Multilingual |
| Matryoshka dimensions | 768, 512, 256, 128, 64 |
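Mean pooling and cosine similarity, as listed above, are easy to reproduce by hand. The toy tensors below stand in for the encoder's token-level output; the mechanics (mask-weighted average, then normalized dot product) are what matter.

```python
import torch

torch.manual_seed(0)
token_states = torch.randn(2, 5, 768)                  # (batch, tokens, hidden)
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]]).unsqueeze(-1)   # 0 marks padding

# Mean pooling: average token states where the attention mask is 1.
summed = (token_states * mask).sum(dim=1)
counts = mask.sum(dim=1).clamp(min=1)
sentence_emb = summed / counts                         # (batch, hidden)

# Cosine similarity: normalize, then take dot products.
sentence_emb = torch.nn.functional.normalize(sentence_emb, p=2, dim=1)
cosine = sentence_emb @ sentence_emb.T
print(cosine.shape)  # torch.Size([2, 2])
```

In practice Sentence Transformers performs both steps internally; the sketch is only to show what the "Pooling" and "Similarity" rows mean.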

Headline results

| Metric | Score |
|---|---|
| MTEB mean (24 tasks) | 61.4 |
| STS Benchmark | 80.5 |
| Dimension retention | 99% @ 256d, 98% @ 64d |
| Layer speedup | 3.3× @ 6L, 5.8× @ 3L |
| Latency vs BGE-M3 | 1.6–3.1× faster on longer sequences / larger batches |

These numbers make the model particularly attractive for systems that must balance quality, latency, vector size, and deployment simplicity instead of optimizing only for leaderboard peak score.

Usage

Basic usage with Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/path/to/aegis-embed")

texts = [
    "Find tool descriptions related to browser automation.",
    "检索和用户历史偏好相关的记忆。",  # "Retrieve memories related to the user's historical preferences."
    "Retrieve notes about deployment failures in staging.",
]

embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 768)
```

Matryoshka truncation for smaller vectors

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/path/to/aegis-embed")

texts = [
    "Find tool descriptions related to browser automation.",
    "Retrieve notes about deployment failures in staging.",
]
embeddings = model.encode(texts, convert_to_tensor=True)

# Balanced retrieval tier: keep the leading 256 dims, then re-normalize
embeddings_256d = F.normalize(embeddings[:, :256], p=2, dim=1)

# Ultra-cheap routing / large memory-bank tier
embeddings_64d = F.normalize(embeddings[:, :64], p=2, dim=1)
```

Long-context encoding

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/path/to/aegis-embed")
model.max_seq_length = 8192  # can be increased up to 32768

long_note = "..."
embedding = model.encode(long_note)
```

Why Matryoshka matters for agents

A common agent stack has several retrieval-like stages:

  1. broad candidate fetch over a very large store
  2. narrower semantic lookup over a smaller candidate set
  3. high-confidence final matching before action or answer synthesis

Matryoshka lets one model support all three stages:

| Stage | Suggested dim | Why |
|---|---|---|
| Broad routing / candidate generation | 64d | Maximize speed and minimize storage |
| Main retrieval | 256d | Strong balance of quality and cost |
| Final matching / offline indexing | 768d | Best semantic fidelity |

That is often a better operational story than mixing several incompatible embedding models across the same pipeline.
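The storage side of that tradeoff is easy to quantify. As a back-of-envelope sketch (assuming float32 storage at 4 bytes per dimension and a hypothetical 10-million-vector store):

```python
# Index sizing per tier, assuming float32 (4 bytes/dim) and 10M vectors.
n_vectors = 10_000_000
bytes_per_dim = 4

sizes_gib = {d: n_vectors * d * bytes_per_dim / 2**30 for d in (64, 256, 768)}
for d, gib in sizes_gib.items():
    print(f"{d:>3}d: {gib:5.1f} GiB")
# 64d:   2.4 GiB
# 256d:  9.5 GiB
# 768d: 28.6 GiB
```

A 64d routing tier over the full store plus a 768d tier over a small hot set is often an order of magnitude cheaper than storing everything at 768d.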

Evaluation details

MTEB benchmark (24 tasks)

| Category | Score |
|---|---|
| STS (7 tasks) | 79.3 |
| Classification (6 tasks) | 62.4 |
| Pair classification (2 tasks) | 76.2 |
| Reranking (2 tasks) | 64.4 |
| Clustering (4 tasks) | 36.9 |
| Retrieval (3 tasks) | 38.2 |
| Overall mean | 61.4 |

STS benchmark comparison

| Model | Parameters | STS score |
|---|---|---|
| Qwen3-Embed-0.6B | 600M | 76.17 |
| aegis-embed | 307M | 80.5 |
| Qwen3-Embed-8B | 8B | 81.08 |

2D Matryoshka quality matrix (STS)

| Layers | 768d | 256d | 64d |
|---|---|---|---|
| 22L | 80.5 | 79.9 | 78.5 |
| 11L | 53.7 | 48.0 | 44.4 |
| 6L | 45.2 | 45.2 | 43.5 |
| 3L | 44.0 | 44.1 | 41.8 |

Long-context retrieval (4K tokens)

| Metric | Score |
|---|---|
| R@1 | 68.8% |
| R@10 | 81.2% |
| MRR | 71.9% |

Throughput (AMD MI300X)

| Layers | Throughput | Speedup |
|---|---|---|
| 22L | 477/s | 1.0× |
| 11L | 916/s | 1.9× |
| 6L | 1573/s | 3.3× |
| 3L | 2761/s | 5.8× |

Training

Data

Trained on BAAI/bge-m3-data with multilingual triplets across diverse domains.

Configuration

  • Base model: llm-semantic-router/mmbert-32k-yarn
  • Loss: Matryoshka2dLoss (combines adaptive layer loss and Matryoshka loss)
  • Matryoshka dimensions: [768, 512, 256, 128, 64]
  • Max sequence length: 32768
  • Batch size: 16 (effective 32 with gradient accumulation)
  • Learning rate: 2e-5
  • Hardware: AMD Instinct MI300X

Recommended use cases

aegis-embed is especially well suited for:

  • Agent memory retrieval across long, mixed-format notes or histories
  • Tool and skill selection where descriptions need semantic matching
  • Knowledge-base retrieval for assistants and RAG systems
  • Multilingual search across mixed-language corpora
  • Large memory banks that benefit from 64d / 256d vector tiers
  • Long-document semantic indexing where short-context encoders lose structure

Model lineage and packaging

aegis-embed is derived from llm-semantic-router/mmbert-embed-32k-2d-matryoshka and distributed here as a lean Sentence Transformers / PyTorch package.

This build intentionally omits bundled ONNX artifacts so the model remains smaller and easier to move, mirror, cache, and deploy in environments that primarily rely on native Transformers runtimes.

Limitations

  • Full-quality mode is still the best default for important retrieval decisions; aggressive layer reduction trades away quality.
  • Although the model supports up to 32K tokens, very long inputs still increase compute and memory cost.
  • The model is optimized for retrieval and semantic similarity; some downstream tasks may benefit from task-specific fine-tuning.
  • If your deployment stack requires ONNX out of the box, you will need to export that separately.

Citation

If you use this model, please cite the upstream work it is derived from:

```bibtex
@misc{mmbert-embed-2d-matryoshka,
  title={mmBERT-Embed: Multilingual Embedding Model with 2D Matryoshka Training},
  author={vLLM Semantic Router Team},
  year={2025},
  url={https://huggingface.co/llm-semantic-router/mmbert-embed-32k-2d-matryoshka}
}
```

License

Apache 2.0
