# aegis-embed
aegis-embed is a multilingual long-context embedding model purpose-built for agent-native retrieval, memory, and decision workflows.
It is designed for systems where embeddings sit on the semantic hot path rather than at the edge of the stack: memory lookup, knowledge retrieval, tool matching, task routing, long-horizon recall, clustering, and multilingual indexing. Its value is not just a benchmark score, but a practical operating profile that fits real agent runtimes: 32K context, 2D Matryoshka adaptability across dimensions and layers, 307M-class deployability, and strong latency-quality efficiency under repeated inference.
In short, aegis-embed is built for teams that want one embedding space to support fast routing, scalable retrieval, and high-confidence semantic matching without paying the operational cost of a much larger model.
## Why it fits agentic workloads
Agentic systems do not call embeddings once. They call them everywhere: before retrieval, during routing, when matching tools, when searching memory, and while compressing or re-ranking state. That means a useful agent embedding model must be more than accurate — it must also be flexible under tight runtime budgets.
aegis-embed is designed around that reality.
### 1. One model, many budget tiers
This model supports Matryoshka embeddings, which means you can encode once at full size and truncate to smaller dimensions with limited quality loss.
That is especially useful for agent systems because different stages of the stack often need different budgets:
- 64d for very cheap candidate generation, broad routing, or huge memory banks
- 256d for balanced retrieval over large corpora
- 768d for highest-quality retrieval, offline indexing, or final-stage matching
Instead of managing separate embedding models for each tier, you can keep one semantic space and choose the dimensional budget that matches the task.
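As a concrete sketch of that pattern, the snippet below serves several budget tiers from a single encode pass. Random vectors stand in for real model output, and `tier` is an illustrative helper, not part of the model's API:

```python
import numpy as np

def tier(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and L2-renormalize."""
    cut = embeddings[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

# Stand-in for one full-size encode pass over three texts: model.encode -> (3, 768).
full = np.random.default_rng(0).normal(size=(3, 768))

routing_vecs = tier(full, 64)     # cheap routing / huge memory banks
retrieval_vecs = tier(full, 256)  # balanced retrieval tier
final_vecs = tier(full, 768)      # highest-fidelity matching
```

Because every tier is a prefix of the same vector, all three stay in one semantic space; only the precision of the comparison changes.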
### 2. 2D Matryoshka gives runtime flexibility, not just storage savings
The model is trained with 2D Matryoshka behavior:
- dimension reduction for smaller vectors and lower storage / bandwidth cost
- layer reduction for lower-latency inference paths in custom runtimes
This matters for agents because the same system often mixes:
- latency-sensitive routing decisions
- high-volume memory scans
- higher-quality retrieval for final evidence gathering
A single model that can serve multiple latency / quality profiles is much easier to operate than a stack of unrelated specialized encoders.
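In a custom runtime, layer reduction means stopping the forward pass after the first k of the 22 transformer layers and mean-pooling that layer's hidden states. Below is a minimal sketch of the masked mean pooling involved, with random arrays standing in for a truncated forward pass; `masked_mean_pool` is an illustrative helper, not part of the model's API:

```python
import numpy as np

def masked_mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool token vectors, ignoring padding positions.

    hidden_states: (batch, seq_len, hidden) output of whichever layer you stop at.
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid divide-by-zero on empty rows
    return summed / counts

# Stand-in for an early-exit forward pass (e.g. after 6 of 22 layers).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 768))
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)  # (2, 768) sentence embeddings
```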
### 3. Long context helps when agent state is not naturally short
Many agent workloads are not short isolated queries. They involve:
- tool descriptions
- execution traces
- long notes
- merged memory summaries
- multi-hop research snippets
- large document chunks
With 32,768 tokens of context length, aegis-embed can represent larger semantic units before you are forced into aggressive chunking. That helps preserve cross-section meaning in long documents and richer memory entries.
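A rough sketch of the resulting chunking policy: encode whole when the input fits the window, and fall back to overlapping chunks only when it does not. A whitespace split stands in for the real tokenizer count, and `chunk_if_needed` is an illustrative helper:

```python
def chunk_if_needed(text: str, max_tokens: int = 32_768, overlap: int = 256) -> list[str]:
    """Return the text whole if it fits the context window, else overlapping chunks.

    Whitespace tokens approximate the real tokenizer count here.
    """
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return [text]  # fits in the 32K window: no chunking needed
    step = max_tokens - overlap
    return [" ".join(tokens[i:i + max_tokens]) for i in range(0, len(tokens), step)]

short_doc = "a short memory entry"
long_doc = " ".join(f"tok{i}" for i in range(70_000))

whole = chunk_if_needed(short_doc)   # stays a single semantic unit
chunks = chunk_if_needed(long_doc)   # split only past the window
```

With a 512-token encoder the same 70K-token document would shatter into well over a hundred fragments; here it needs only a handful.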
### 4. Small enough to be operationally practical
At roughly 307M parameters, this model sits in a useful middle ground:
- substantially lighter than large embedding models in the 600M+ or multi-billion range
- still expressive enough for multilingual retrieval and similarity work
- easier to host in systems where embedding is part of a hot path rather than an occasional offline batch
For agentic platforms, that usually means better economics and simpler scaling.
### 5. One embedding space across the stack
Agent systems are easier to operate when routing, retrieval, memory search, and semantic matching all live in the same vector space.
aegis-embed is well suited to that pattern:
- 64d can serve broad routing and large-memory scanning
- 256d can cover the main retrieval tier
- 768d can stay reserved for the highest-fidelity matching paths
That means one model can cover multiple semantic stages without forcing the system to juggle incompatible encoders, duplicated indexes, or divergent retrieval behavior.
## Model at a glance
| Feature | Value |
|---|---|
| Parameters | 307M |
| Architecture | ModernBERT encoder with YaRN scaling |
| Hidden Size | 768 |
| Layers | 22 |
| Context Length | 32,768 tokens |
| Pooling | Mean pooling |
| Similarity | Cosine |
| Languages | Multilingual |
| Matryoshka Dimensions | 768, 512, 256, 128, 64 |
## Headline results
| Metric | Score |
|---|---|
| MTEB Mean (24 tasks) | 61.4 |
| STS Benchmark | 80.5 |
| Dimension Retention | 99% @ 256d, 98% @ 64d |
| Layer Speedup | 3.3× @ 6L, 5.8× @ 3L |
| Latency vs BGE-M3 | 1.6-3.1× faster on longer sequences / larger batches |
These numbers make the model particularly attractive for systems that must balance quality, latency, vector size, and deployment simplicity instead of optimizing only for leaderboard peak score.
## Usage
### Basic usage with Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/path/to/aegis-embed")

texts = [
    "Find tool descriptions related to browser automation.",
    "检索和用户历史偏好相关的记忆。",  # "Retrieve memories related to the user's historical preferences."
    "Retrieve notes about deployment failures in staging.",
]

embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 768)
```
### Matryoshka truncation for smaller vectors

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/path/to/aegis-embed")

texts = [
    "Find tool descriptions related to browser automation.",
    "Retrieve notes about deployment failures in staging.",
]
embeddings = model.encode(texts, convert_to_tensor=True)

# Balanced retrieval tier
embeddings_256d = F.normalize(embeddings[:, :256], p=2, dim=1)

# Ultra-cheap routing / large memory-bank tier
embeddings_64d = F.normalize(embeddings[:, :64], p=2, dim=1)
```
### Long-context encoding

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/path/to/aegis-embed")
model.max_seq_length = 8192  # can be increased up to 32768

long_note = "..."
embedding = model.encode(long_note)
```
## Why Matryoshka matters for agents
A common agent stack has several retrieval-like stages:
- broad candidate fetch over a very large store
- narrower semantic lookup over a smaller candidate set
- high-confidence final matching before action or answer synthesis
Matryoshka lets one model support all three stages:
| Stage | Suggested Dim | Why |
|---|---|---|
| Broad routing / candidate generation | 64d | Maximize speed and minimize storage |
| Main retrieval | 256d | Strong balance of quality and cost |
| Final matching / offline indexing | 768d | Best semantic fidelity |
That is often a better operational story than mixing several incompatible embedding models across the same pipeline.
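A minimal simulation of that staged pattern is shown below, with random unit vectors standing in for real embeddings; in practice the 64d and 768d views come from truncating one encode pass, as in the usage examples:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# One full-size index; the cheap tier is just a prefix of the same vectors.
corpus_768 = normalize(rng.normal(size=(10_000, 768)))
corpus_64 = normalize(corpus_768[:, :64])

query_768 = normalize(rng.normal(size=(768,)))
query_64 = normalize(query_768[:64])

# Stage 1: cheap 64d scan pulls a broad candidate set.
candidates = np.argsort(corpus_64 @ query_64)[-100:]

# Stage 2: 768d rescoring of only those candidates picks the final hits.
rescored = corpus_768[candidates] @ query_768
top5 = candidates[np.argsort(rescored)[-5:][::-1]]
```

The expensive 768d dot products touch only 100 of the 10,000 stored vectors, which is the economic point of the tiered design.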
## Evaluation details
### MTEB benchmark (24 tasks)
| Category | Score |
|---|---|
| STS (7 tasks) | 79.3 |
| Classification (6) | 62.4 |
| Pair Classification (2) | 76.2 |
| Reranking (2) | 64.4 |
| Clustering (4) | 36.9 |
| Retrieval (3) | 38.2 |
| Overall Mean | 61.4 |
### STS benchmark comparison
| Model | Parameters | STS Score |
|---|---|---|
| Qwen3-Embed-0.6B | 600M | 76.17 |
| aegis-embed | 307M | 80.5 |
| Qwen3-Embed-8B | 8B | 81.08 |
### 2D Matryoshka quality matrix (STS)
| Layers | 768d | 256d | 64d |
|---|---|---|---|
| 22L | 80.5 | 79.9 | 78.5 |
| 11L | 53.7 | 48.0 | 44.4 |
| 6L | 45.2 | 45.2 | 43.5 |
| 3L | 44.0 | 44.1 | 41.8 |
### Long-context retrieval (4K tokens)
| Metric | Score |
|---|---|
| R@1 | 68.8% |
| R@10 | 81.2% |
| MRR | 71.9% |
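For reference, recall@k and MRR of the kind reported above can be computed from per-query result rankings like this (toy rankings shown; `recall_at_k` and `mrr` are illustrative helpers, not part of the evaluation harness):

```python
def recall_at_k(ranked_ids: list[list[int]], gold_ids: list[int], k: int) -> float:
    """Fraction of queries whose gold document appears in the top-k results."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)

def mrr(ranked_ids: list[list[int]], gold_ids: list[int]) -> float:
    """Mean reciprocal rank of the gold document (0 when it is missing)."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_ids)

# Toy example: 4 queries whose gold doc ranks 1st, 1st, 3rd, and missing.
ranked = [[7, 2, 9], [4, 1, 8], [5, 6, 3], [0, 2, 9]]
gold = [7, 4, 3, 1]
```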
### Throughput (AMD MI300X)
| Layers | Throughput | Speedup |
|---|---|---|
| 22L | 477/s | 1.0× |
| 11L | 916/s | 1.9× |
| 6L | 1573/s | 3.3× |
| 3L | 2761/s | 5.8× |
## Training

### Data
Trained on BAAI/bge-m3-data with multilingual triplets across diverse domains.
### Configuration

- Base model: `llm-semantic-router/mmbert-32k-yarn`
- Loss: `Matryoshka2dLoss` (combines adaptive layer loss and Matryoshka loss)
- Matryoshka dimensions: `[768, 512, 256, 128, 64]`
- Max sequence length: `32768`
- Batch size: `16` (effective `32` with gradient accumulation)
- Learning rate: `2e-5`
- Hardware: AMD Instinct MI300X
## Recommended use cases
aegis-embed is especially well suited for:
- Agent memory retrieval across long, mixed-format notes or histories
- Tool and skill selection where descriptions need semantic matching
- Knowledge-base retrieval for assistants and RAG systems
- Multilingual search across mixed-language corpora
- Large memory banks that benefit from 64d / 256d vector tiers
- Long-document semantic indexing where short-context encoders lose structure
## Model lineage and packaging
aegis-embed is derived from llm-semantic-router/mmbert-embed-32k-2d-matryoshka and distributed here as a lean Sentence Transformers / PyTorch package.
This build intentionally omits bundled ONNX artifacts so the model remains smaller and easier to move, mirror, cache, and deploy in environments that primarily rely on native Transformers runtimes.
## Limitations
- Full-quality mode is still the best default for important retrieval decisions; aggressive layer reduction trades away quality.
- Although the model supports up to 32K tokens, very long inputs still increase compute and memory cost.
- The model is optimized for retrieval and semantic similarity; some downstream tasks may benefit from task-specific fine-tuning.
- If your deployment stack requires ONNX out of the box, you will need to export that separately.
## Citation

If you use this model, please cite the upstream work it is derived from:

```bibtex
@misc{mmbert-embed-2d-matryoshka,
  title={mmBERT-Embed: Multilingual Embedding Model with 2D Matryoshka Training},
  author={vLLM Semantic Router Team},
  year={2025},
  url={https://huggingface.co/llm-semantic-router/mmbert-embed-32k-2d-matryoshka}
}
```
## License
Apache 2.0
Base model: `jhu-clsp/mmBERT-base`