---
license: apache-2.0
language:
  - en
  - zh
base_model:
  - Qwen/Qwen3-Reranker-4B
tags:
  - reranker
  - memory
  - agent
  - cross-encoder
  - sentence-transformers
  - qwen3
library_name: transformers
pipeline_tag: text-classification
---

# MemReranker-4B

## Introduction

**MemReranker** is a reasoning-aware reranking model family (0.6B / 4B) purpose-built for **agent memory retrieval**. It is fine-tuned from [Qwen3-Reranker-4B](https://huggingface.co/Qwen/Qwen3-Reranker-4B) through multi-stage LLM knowledge distillation.

In agent memory systems, the reranking model is the critical bridge between user queries and long-term memory. Most systems adopt the two-stage "retrieve-then-rerank" paradigm, but generic rerankers rely on semantic similarity matching and lack genuine reasoning ability, so they often surface results that are semantically related to the query yet missing the key information needed to answer it.

MemReranker addresses three specific problems in memory scenarios:

1. **Score Miscalibration** — Relevance scores from generic models are poorly calibrated, making threshold-based filtering difficult.
2. **Complex Query Degradation** — Ranking degrades when facing temporal constraints, causal reasoning, and other complex queries.
3. **Context Disambiguation** — The model cannot leverage dialogue context for semantic disambiguation.

📄 **Paper:** [MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval](https://arxiv.org/abs/2605.06132)

## Model Family

| Model | Base Model | Parameters | HuggingFace |
|:------|:-----------|:-----------|:------------|
| **MemReranker-4B** | Qwen3-Reranker-4B | 4B | [IAAR-Shanghai/MemReranker-4B](https://huggingface.co/IAAR-Shanghai/MemReranker-4B) |

> 💡 **No GPU? Use our hosted API directly!** Both models are available via the [Memos Rerank API](https://memos-docs.openmem.net/cn/api_docs/core/rerank) — no deployment needed. See the [Memos API](#memos-api) section below.

## Training Methodology

MemReranker employs a two-stage training paradigm:

- **Stage 1: BCE Pointwise Distillation** — Multi-teacher pairwise comparisons generate calibrated soft labels via a five-level scoring system based on Elo/Bradley-Terry, establishing well-distributed scores.
- **Stage 2: InfoNCE Contrastive Fine-tuning** — Enhances hard-sample discrimination through contrastive learning.

Training data combines general corpora with memory-specific multi-turn dialogue data covering temporal constraints, causal reasoning, and coreference resolution.
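
To make the two stages concrete, here is a minimal PyTorch loss sketch. The tensor shapes, the temperature of 0.05, and the number of negatives are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

# Stage 1: pointwise BCE distillation. `logits` are the reranker's relevance
# logits for a batch of (query, doc) pairs; `soft_labels` are teacher-derived
# calibrated scores in [0, 1]. (Shapes here are illustrative.)
logits = torch.randn(8)
soft_labels = torch.rand(8)
bce_loss = F.binary_cross_entropy_with_logits(logits, soft_labels)

# Stage 2: InfoNCE contrastive fine-tuning. For each query, one positive
# document competes against hard negatives; the temperature 0.05 and the
# negative count are placeholders.
pos_logit = torch.randn(4, 1)    # logit of the positive doc, per query
neg_logits = torch.randn(4, 7)   # logits of 7 hard negatives, per query
all_logits = torch.cat([pos_logit, neg_logits], dim=1) / 0.05
targets = torch.zeros(4, dtype=torch.long)  # the positive sits at index 0
infonce_loss = F.cross_entropy(all_logits, targets)
```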

## Benchmark Results

### LOCOMO Memory Retrieval Benchmark

| Model | MAP | MRR | NDCG@1 | NDCG@3 | NDCG@10 | NDCG | R@3 | R@5 | R@20 | F1 |
|:------|:---:|:---:|:------:|:------:|:-------:|:----:|:---:|:---:|:----:|:--:|
| BGE-v2-m3 | 0.671 | 0.699 | 0.607 | 0.672 | 0.714 | 0.736 | 0.716 | 0.768 | 0.863 | 0.504 |
| Qwen3-Reranker-0.6B | 0.643 | 0.673 | 0.576 | 0.638 | 0.689 | 0.714 | 0.681 | 0.748 | 0.857 | 0.472 |
| Qwen3-Reranker-4B | 0.689 | 0.716 | 0.623 | 0.691 | 0.732 | 0.750 | 0.735 | 0.796 | 0.873 | 0.522 |
| Qwen3-Reranker-8B | 0.721 | 0.748 | 0.666 | 0.724 | 0.759 | 0.775 | 0.763 | 0.813 | 0.880 | 0.552 |
| GPT-4o-mini | 0.715 | 0.742 | 0.657 | 0.719 | 0.753 | 0.770 | 0.760 | 0.812 | 0.868 | 0.544 |
| Gemini-3-Flash | 0.777 | 0.797 | 0.737 | 0.778 | 0.807 | 0.816 | 0.805 | 0.847 | 0.892 | 0.622 |
| **MemReranker-0.6B** | 0.715 | 0.738 | 0.650 | 0.717 | 0.754 | 0.770 | 0.758 | 0.809 | 0.885 | 0.555 |
| **MemReranker-4B** | **0.737** | **0.760** | **0.679** | **0.739** | **0.773** | **0.786** | **0.777** | **0.824** | **0.889** | **0.577** |

> MemReranker-0.6B matches GPT-4o-mini and the open-source 4B/8B rerankers on key metrics. MemReranker-4B achieves **0.737 MAP**, with several metrics on par with Gemini-3-Flash, while keeping inference latency (~200 ms) at only 10–20% of that of the larger models.

## Quickstart

### Memos API

The easiest way to use MemReranker — no GPU or local deployment required. Both `memos-reranker-0.6b` and `memos-reranker-4b` are available through the [Memos Rerank API](https://memos-docs.openmem.net/cn/api_docs/core/rerank).

```python
import os
import requests
import json

os.environ["MEMOS_API_KEY"] = "YOUR_API_KEY"         # Get from https://memos-dashboard.openmem.net/apikeys/
os.environ["MEMOS_BASE_URL"] = "https://memos.memtensor.cn/api/openmem/v1"

url = f"{os.environ['MEMOS_BASE_URL']}/rerank"

payload = {
    "model": "memos-reranker-4b",   # or "memos-reranker-0.6b"
    "query": "用户有什么兴趣爱好",
    "documents": [
        "用户喜欢打羽毛球",
        "用户在杭州做后端开发",
        "用户偏好简洁的回复风格",
        "用户比较喜欢酱香型白酒",
        "用户下周三要去北京出差"
    ],
    "top_n": 3
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Token {os.environ['MEMOS_API_KEY']}"
}

response = requests.post(url, headers=headers, data=json.dumps(payload))
print(response.json())
```

📖 Full API documentation: [https://memos-docs.openmem.net/cn/api_docs/core/rerank](https://memos-docs.openmem.net/cn/api_docs/core/rerank)

### Sentence Transformers (Recommended)

```python
import torch
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "IAAR-Shanghai/MemReranker-4B",
    model_kwargs={"torch_dtype": torch.bfloat16, "device_map": "cuda"},
)

query = "What is the capital of China?"
documents = [
    "The capital of China is Beijing.",
    "Beijing has been the capital of various Chinese dynasties.",
    "Python is a popular programming language.",
]

# Get relevance scores
scores = model.predict([(query, doc) for doc in documents])
print(scores)
# tensor([2.969, 1.328, -3.141])

# Get 0-1 probability via sigmoid
probs = model.predict(
    [(query, doc) for doc in documents],
    activation_fn=torch.nn.Sigmoid(),
)

# Or get a sorted ranking directly
results = model.rank(query, documents)
print(results)
```

The default prompt is `"query"`, which injects the instruction *"Given a web search query, retrieve relevant passages that answer the query"*. A custom instruction can be passed via `prompts={"my_task": "..."}` and `default_prompt_name="my_task"` on `CrossEncoder(...)`.
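
For example, a memory-specific instruction can replace the default, following the constructor arguments described above (the instruction text here is illustrative):

```python
import torch
from sentence_transformers import CrossEncoder

# Hypothetical memory-retrieval instruction; any name works as the prompt key.
instruction = (
    "Given a user query, retrieve memory fragments that contain "
    "the information needed to answer it"
)

model = CrossEncoder(
    "IAAR-Shanghai/MemReranker-4B",
    model_kwargs={"torch_dtype": torch.bfloat16, "device_map": "cuda"},
    prompts={"memory": instruction},
    default_prompt_name="memory",
)

results = model.rank(
    "What are the user's hobbies?",
    [
        "The user enjoys playing badminton.",
        "The user works as a backend developer in Hangzhou.",
    ],
)
```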

### Transformers

> Requires `transformers>=4.51.0`

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def format_instruction(instruction, query, doc):
    if instruction is None:
        instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    output = "<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}".format(
        instruction=instruction, query=query, doc=doc
    )
    return output

def process_inputs(pairs):
    # Tokenize the raw pairs, leaving room for the fixed chat prefix/suffix,
    # then wrap each sequence with those template tokens and left-pad.
    inputs = tokenizer(
        pairs, padding=False, truncation='longest_first',
        return_attention_mask=False,
        max_length=max_length - len(prefix_tokens) - len(suffix_tokens)
    )
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    inputs = tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_length)
    for key in inputs:
        inputs[key] = inputs[key].to(model.device)
    return inputs

def compute_logits(inputs):
    # Take the logits at the final token position and turn the "yes"/"no"
    # pair into a relevance probability via softmax over the two tokens.
    batch_scores = model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores

tokenizer = AutoTokenizer.from_pretrained(
    "IAAR-Shanghai/MemReranker-4B", padding_side='left'
)
model = AutoModelForCausalLM.from_pretrained(
    "IAAR-Shanghai/MemReranker-4B"
).eval()

# We recommend enabling flash_attention_2 for better acceleration and memory saving:
# model = AutoModelForCausalLM.from_pretrained(
#     "IAAR-Shanghai/MemReranker-4B",
#     torch_dtype=torch.float16,
#     attn_implementation="flash_attention_2"
# ).cuda().eval()

token_false_id = tokenizer.convert_tokens_to_ids("no")
token_true_id = tokenizer.convert_tokens_to_ids("yes")
max_length = 8192

prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
suffix = '<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n'
prefix_tokens = tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)

task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = ["What is the capital of China?", "Explain gravity"]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

pairs = [format_instruction(task, query, doc) for query, doc in zip(queries, documents)]
inputs = process_inputs(pairs)
scores = compute_logits(inputs)
print("scores:", scores)
```

### vLLM

> Requires `vllm>=0.8.5`

```python
import math
import torch
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.inputs.data import TokensPrompt

def format_instruction(instruction, query, doc):
    text = [
        {"role": "system", "content": 'Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".'},
        {"role": "user", "content": f"<Instruct>: {instruction}\n\n<Query>: {query}\n\n<Document>: {doc}"}
    ]
    return text

def process_inputs(pairs, instruction, max_length, suffix_tokens):
    messages = [format_instruction(instruction, query, doc) for query, doc in pairs]
    messages = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=False, enable_thinking=False
    )
    messages = [ele[:max_length] + suffix_tokens for ele in messages]
    messages = [TokensPrompt(prompt_token_ids=ele) for ele in messages]
    return messages

def compute_logits(model, messages, sampling_params, true_token, false_token):
    outputs = model.generate(messages, sampling_params, use_tqdm=False)
    scores = []
    for output in outputs:
        # Read the "yes"/"no" logprobs at the single generated position,
        # falling back to -10 when a token missed the returned top-k.
        final_logits = output.outputs[0].logprobs[-1]
        true_logit = final_logits[true_token].logprob if true_token in final_logits else -10
        false_logit = final_logits[false_token].logprob if false_token in final_logits else -10
        true_score = math.exp(true_logit)
        false_score = math.exp(false_logit)
        score = true_score / (true_score + false_score)
        scores.append(score)
    return scores

number_of_gpu = torch.cuda.device_count()
tokenizer = AutoTokenizer.from_pretrained("IAAR-Shanghai/MemReranker-4B")
model = LLM(
    model="IAAR-Shanghai/MemReranker-4B",
    tensor_parallel_size=number_of_gpu,
    max_model_len=10000,
    enable_prefix_caching=True,
    gpu_memory_utilization=0.8,
)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

suffix = '<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n'
max_length = 8192
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
true_token = tokenizer("yes", add_special_tokens=False).input_ids[0]
false_token = tokenizer("no", add_special_tokens=False).input_ids[0]
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=1,
    logprobs=20,
    allowed_token_ids=[true_token, false_token],
)

task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = ["What is the capital of China?", "Explain gravity"]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other.",
]

pairs = list(zip(queries, documents))
inputs = process_inputs(pairs, task, max_length - len(suffix_tokens), suffix_tokens)
scores = compute_logits(model, inputs, sampling_params, true_token, false_token)
print("scores:", scores)
```

### vLLM OpenAI-Compatible Server

You can also deploy MemReranker-4B as an OpenAI-compatible API server using vLLM:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model IAAR-Shanghai/MemReranker-4B \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --served-model-name MemReranker-4B \
  --host 0.0.0.0 \
  --port 8089 \
  --chat-template /path/to/qwen3_reranker.jinja \
  --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"], "classifier_from_token": ["no", "yes"], "is_original_qwen3_reranker": true}'
```
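
Once the server is up, it can be queried over HTTP. A minimal client sketch, assuming your vLLM version exposes the Jina/Cohere-style `/rerank` endpoint for sequence-classification models (check the docs for your version):

```python
import requests

response = requests.post(
    "http://localhost:8089/rerank",
    json={
        "model": "MemReranker-4B",
        "query": "What is the capital of China?",
        "documents": [
            "The capital of China is Beijing.",
            "Python is a popular programming language.",
        ],
    },
)
print(response.json())  # per-document relevance scores
```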

## How It Works

MemReranker uses a **yes/no logit-score** reranking head inherited from Qwen3-Reranker. Given a query-document pair, the model outputs logits at the final token position. The score is computed as:

$$\text{score} = \frac{e^{\log p(\text{yes})}}{e^{\log p(\text{yes})} + e^{\log p(\text{no})}}$$

where `yes` (token id: 9693) and `no` (token id: 2152) are the two classification tokens.
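
As a quick numeric illustration (with made-up logprobs, not model outputs):

```python
import math

logprob_yes, logprob_no = -0.2, -2.3   # hypothetical last-token logprobs
p_yes, p_no = math.exp(logprob_yes), math.exp(logprob_no)
score = p_yes / (p_yes + p_no)
print(round(score, 3))                 # 0.891
```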

## Use Cases

- **Agent Memory Retrieval** — Rerank retrieved memory fragments for LLM agents with long-term memory.
- **Multi-turn Dialogue Search** — Handle temporal, causal, and coreference queries in conversational history.
- **RAG Pipelines** — Serve as a second-stage reranker in Retrieve-then-Rerank architectures.
- **General Text Retrieval** — Also effective as a general-purpose reranker for web search and document retrieval tasks.

## Citation

```bibtex
@article{li2026memreranker,
  title={MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval},
  author={Li, Chunyu and Kang, Jingyi and Chen, Ding and Zhang, Mengyuan and Shen, Jiajun and Tang, Bo and Zhou, Xuanhe and Xiong, Feiyu and Li, Zhiyu},
  journal={arXiv preprint arXiv:2605.06132},
  year={2026}
}
```

## Acknowledgements

MemReranker is built upon [Qwen3-Reranker](https://huggingface.co/Qwen/Qwen3-Reranker-4B) by the Qwen Team. We thank the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) community for the CrossEncoder integration support.
