---
license: apache-2.0
language:
- en
- zh
base_model:
- Qwen/Qwen3-Reranker-4B
tags:
- reranker
- memory
- agent
- cross-encoder
- sentence-transformers
- qwen3
library_name: transformers
pipeline_tag: text-classification
---

# MemReranker-4B

## Introduction

**MemReranker** is a reasoning-aware reranking model family (0.6B / 4B) purpose-built for **agent memory retrieval**. It is fine-tuned from [Qwen3-Reranker-4B](https://huggingface.co/Qwen/Qwen3-Reranker-4B) through multi-stage LLM knowledge distillation.

In agent memory systems, the reranking model is the critical bridge between user queries and long-term memory. Most systems adopt the two-stage "retrieve-then-rerank" paradigm, but generic reranking models rely on semantic similarity matching and lack genuine reasoning capability, so recalled results are often semantically relevant yet missing the key information needed to answer the question.

MemReranker addresses three specific problems in memory scenarios:

1. **Score Miscalibration** — Relevance scores from generic models are poorly calibrated, making threshold-based filtering difficult.
2. **Complex Query Degradation** — Ranking quality degrades on queries involving temporal constraints, causal reasoning, and other complex structure.
3. **Context Disambiguation** — Generic rerankers cannot leverage dialogue context for semantic disambiguation.

📄 **Paper:** [MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval](https://arxiv.org/abs/2605.06132)

## Model Family

| Model | Base Model | Parameters | HuggingFace |
|:------|:-----------|:-----------|:------------|
| **MemReranker-4B** | Qwen3-Reranker-4B | 4B | [IAAR-Shanghai/MemReranker-4B](https://huggingface.co/IAAR-Shanghai/MemReranker-4B) |

> 💡 **No GPU? Use our hosted API directly!** Both models are available via the [Memos Rerank API](https://memos-docs.openmem.net/cn/api_docs/core/rerank) — no deployment needed. See the [Memos API](#memos-api) section below.

## Training Methodology

MemReranker employs a two-stage training paradigm:

- **Stage 1: BCE Pointwise Distillation** — Multiple teacher models perform pairwise comparisons, which a five-level Elo/Bradley-Terry scoring scheme aggregates into calibrated soft labels with a well-spread score distribution.
- **Stage 2: InfoNCE Contrastive Fine-tuning** — Contrastive fine-tuning against hard negatives sharpens the model's discrimination on difficult samples.

Training data combines general corpora with memory-specific multi-turn dialogue data covering temporal constraints, causal reasoning, and coreference resolution.

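To make the two stages concrete, here is a minimal sketch of the objectives; the function names, tensor shapes, and temperature are illustrative, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def bce_pointwise_distillation(student_logits, teacher_soft_labels):
    # Stage 1 (sketch): regress the student's relevance logit onto calibrated
    # teacher soft labels in [0, 1] with binary cross-entropy.
    return F.binary_cross_entropy_with_logits(student_logits, teacher_soft_labels)

def infonce_contrastive(pos_scores, neg_scores, temperature=0.05):
    # Stage 2 (sketch): push each query's positive document above its hard
    # negatives. pos_scores: (batch,), neg_scores: (batch, num_negatives).
    logits = torch.cat([pos_scores.unsqueeze(1), neg_scores], dim=1) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, targets)
```
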
## Benchmark Results

### LOCOMO Memory Retrieval Benchmark

| Model | MAP | MRR | NDCG@1 | NDCG@3 | NDCG@10 | NDCG | R@3 | R@5 | R@20 | F1 |
|:------|:---:|:---:|:------:|:------:|:-------:|:----:|:---:|:---:|:----:|:--:|
| BGE-v2-m3 | 0.671 | 0.699 | 0.607 | 0.672 | 0.714 | 0.736 | 0.716 | 0.768 | 0.863 | 0.504 |
| Qwen3-Reranker-0.6B | 0.643 | 0.673 | 0.576 | 0.638 | 0.689 | 0.714 | 0.681 | 0.748 | 0.857 | 0.472 |
| Qwen3-Reranker-4B | 0.689 | 0.716 | 0.623 | 0.691 | 0.732 | 0.750 | 0.735 | 0.796 | 0.873 | 0.522 |
| Qwen3-Reranker-8B | 0.721 | 0.748 | 0.666 | 0.724 | 0.759 | 0.775 | 0.763 | 0.813 | 0.880 | 0.552 |
| GPT-4o-mini | 0.715 | 0.742 | 0.657 | 0.719 | 0.753 | 0.770 | 0.760 | 0.812 | 0.868 | 0.544 |
| Gemini-3-Flash | 0.777 | 0.797 | 0.737 | 0.778 | 0.807 | 0.816 | 0.805 | 0.847 | 0.892 | 0.622 |
| **MemReranker-0.6B** | 0.715 | 0.738 | 0.650 | 0.717 | 0.754 | 0.770 | 0.758 | 0.809 | 0.885 | 0.555 |
| **MemReranker-4B** | **0.737** | **0.760** | **0.679** | **0.739** | **0.773** | **0.786** | **0.777** | **0.824** | **0.889** | **0.577** |

> MemReranker-0.6B matches GPT-4o-mini and open-source 4B/8B rerankers on key metrics. MemReranker-4B achieves **0.737 MAP**, with several metrics on par with Gemini-3-Flash, while its inference latency (~200 ms) is only 10–20% of that of the large models.

## Quickstart

### Memos API

The easiest way to use MemReranker — no GPU or local deployment required. Both `memos-reranker-0.6b` and `memos-reranker-4b` are available through the [Memos Rerank API](https://memos-docs.openmem.net/cn/api_docs/core/rerank).

```python
import json
import os

import requests

os.environ["MEMOS_API_KEY"] = "YOUR_API_KEY"  # Get one from https://memos-dashboard.openmem.net/apikeys/
os.environ["MEMOS_BASE_URL"] = "https://memos.memtensor.cn/api/openmem/v1"

url = f"{os.environ['MEMOS_BASE_URL']}/rerank"

payload = {
    "model": "memos-reranker-4b",  # or "memos-reranker-0.6b"
    "query": "用户有什么兴趣爱好",  # "What are the user's hobbies?"
    "documents": [
        "用户喜欢打羽毛球",  # "The user likes playing badminton"
        "用户在杭州做后端开发",  # "The user does backend development in Hangzhou"
        "用户偏好简洁的回复风格",  # "The user prefers a concise reply style"
        "用户比较喜欢酱香型白酒",  # "The user is fond of sauce-aroma baijiu"
        "用户下周三要去北京出差",  # "The user travels to Beijing on business next Wednesday"
    ],
    "top_n": 3,
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Token {os.environ['MEMOS_API_KEY']}",
}

response = requests.post(url, headers=headers, data=json.dumps(payload))
print(response.json())
```

📖 Full API documentation: [https://memos-docs.openmem.net/cn/api_docs/core/rerank](https://memos-docs.openmem.net/cn/api_docs/core/rerank)

### Sentence Transformers (Recommended)

```python
import torch
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "IAAR-Shanghai/MemReranker-4B",
    model_kwargs={"torch_dtype": torch.bfloat16, "device_map": "cuda"},
)

query = "What is the capital of China?"
documents = [
    "The capital of China is Beijing.",
    "Beijing has been the capital of various Chinese dynasties.",
    "Python is a popular programming language.",
]

# Raw relevance scores (unbounded logits)
scores = model.predict([(query, doc) for doc in documents])
print(scores)
# [2.969, 1.328, -3.141]

# 0-1 probabilities via sigmoid
probs = model.predict(
    [(query, doc) for doc in documents],
    activation_fn=torch.nn.Sigmoid(),
)

# Or get a sorted ranking directly
results = model.rank(query, documents)
print(results)
```

The default prompt is `"query"`, which injects the instruction *"Given a web search query, retrieve relevant passages that answer the query"*. A custom instruction can be passed via `prompts={"my_task": "..."}` and `default_prompt_name="my_task"` on `CrossEncoder(...)`.

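For example (the `"memory_search"` task name and instruction text below are illustrative, not prompts shipped with the model):

```python
from sentence_transformers import CrossEncoder

# Hypothetical custom instruction for memory retrieval.
model = CrossEncoder(
    "IAAR-Shanghai/MemReranker-4B",
    prompts={"memory_search": "Given a user query, retrieve memory entries containing the information needed to answer it"},
    default_prompt_name="memory_search",
)

scores = model.predict([
    ("What are the user's hobbies?", "The user likes playing badminton."),
])
print(scores)
```
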
### Transformers

> Requires `transformers>=4.51.0`

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def format_instruction(instruction, query, doc):
    if instruction is None:
        instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    output = "<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}".format(
        instruction=instruction, query=query, doc=doc
    )
    return output

def process_inputs(pairs):
    # Tokenize without padding, leaving room for the chat-template prefix/suffix.
    inputs = tokenizer(
        pairs, padding=False, truncation='longest_first',
        return_attention_mask=False,
        max_length=max_length - len(prefix_tokens) - len(suffix_tokens)
    )
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    inputs = tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_length)
    for key in inputs:
        inputs[key] = inputs[key].to(model.device)
    return inputs

@torch.no_grad()
def compute_logits(inputs):
    # Relevance score = softmax over the "yes"/"no" logits at the final position.
    batch_scores = model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores

tokenizer = AutoTokenizer.from_pretrained(
    "IAAR-Shanghai/MemReranker-4B", padding_side='left'
)
model = AutoModelForCausalLM.from_pretrained(
    "IAAR-Shanghai/MemReranker-4B"
).eval()

# We recommend enabling flash_attention_2 for better acceleration and memory saving:
# model = AutoModelForCausalLM.from_pretrained(
#     "IAAR-Shanghai/MemReranker-4B",
#     torch_dtype=torch.float16,
#     attn_implementation="flash_attention_2"
# ).cuda().eval()

token_false_id = tokenizer.convert_tokens_to_ids("no")
token_true_id = tokenizer.convert_tokens_to_ids("yes")
max_length = 8192

prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
suffix = '<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n'
prefix_tokens = tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)

task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = ["What is the capital of China?", "Explain gravity"]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

pairs = [format_instruction(task, query, doc) for query, doc in zip(queries, documents)]
inputs = process_inputs(pairs)
scores = compute_logits(inputs)
print("scores:", scores)
```

### vLLM

> Requires `vllm>=0.8.5`

```python
import math

import torch
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.inputs.data import TokensPrompt

def format_instruction(instruction, query, doc):
    text = [
        {"role": "system", "content": 'Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".'},
        {"role": "user", "content": f"<Instruct>: {instruction}\n\n<Query>: {query}\n\n<Document>: {doc}"}
    ]
    return text

def process_inputs(pairs, instruction, max_length, suffix_tokens):
    messages = [format_instruction(instruction, query, doc) for query, doc in pairs]
    messages = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=False, enable_thinking=False
    )
    # Truncate, then append the empty-<think> assistant suffix to each prompt.
    messages = [ele[:max_length] + suffix_tokens for ele in messages]
    messages = [TokensPrompt(prompt_token_ids=ele) for ele in messages]
    return messages

def compute_logits(model, messages, sampling_params, true_token, false_token):
    outputs = model.generate(messages, sampling_params, use_tqdm=False)
    scores = []
    for output in outputs:
        final_logits = output.outputs[0].logprobs[-1]
        # Fall back to a large negative logprob if a token is missing from the top-k.
        true_logit = final_logits[true_token].logprob if true_token in final_logits else -10
        false_logit = final_logits[false_token].logprob if false_token in final_logits else -10
        true_score = math.exp(true_logit)
        false_score = math.exp(false_logit)
        score = true_score / (true_score + false_score)
        scores.append(score)
    return scores

number_of_gpu = torch.cuda.device_count()
tokenizer = AutoTokenizer.from_pretrained("IAAR-Shanghai/MemReranker-4B")
model = LLM(
    model="IAAR-Shanghai/MemReranker-4B",
    tensor_parallel_size=number_of_gpu,
    max_model_len=10000,
    enable_prefix_caching=True,
    gpu_memory_utilization=0.8,
)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

suffix = '<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n'
max_length = 8192
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
true_token = tokenizer("yes", add_special_tokens=False).input_ids[0]
false_token = tokenizer("no", add_special_tokens=False).input_ids[0]
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=1,
    logprobs=20,
    allowed_token_ids=[true_token, false_token],
)

task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = ["What is the capital of China?", "Explain gravity"]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other.",
]

pairs = list(zip(queries, documents))
inputs = process_inputs(pairs, task, max_length - len(suffix_tokens), suffix_tokens)
scores = compute_logits(model, inputs, sampling_params, true_token, false_token)
print("scores:", scores)
```

### vLLM OpenAI-Compatible Server

You can also deploy MemReranker-4B as an OpenAI-compatible API server using vLLM:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model IAAR-Shanghai/MemReranker-4B \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --served-model-name MemReranker-4B \
    --host 0.0.0.0 \
    --port 8089 \
    --chat-template /path/to/qwen3_reranker.jinja \
    --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"], "classifier_from_token": ["no", "yes"], "is_original_qwen3_reranker": true}'
```
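
Once the server is running, query-document pairs can be scored over HTTP. The sketch below assumes your vLLM release exposes the Jina-style `/v1/rerank` endpoint for sequence-classification models; check the documentation of your vLLM version before relying on the exact path and response shape:

```python
import requests

# Hedged sketch: endpoint path and response schema vary across vLLM versions.
resp = requests.post(
    "http://localhost:8089/v1/rerank",
    json={
        "model": "MemReranker-4B",  # must match --served-model-name
        "query": "What is the capital of China?",
        "documents": [
            "The capital of China is Beijing.",
            "Python is a popular programming language.",
        ],
    },
)
print(resp.json())
```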

## How It Works

MemReranker uses a **yes/no logit-score** reranking head inherited from Qwen3-Reranker. Given a query-document pair, the model produces logits at the final token position, and the relevance score is the softmax over the logits of the two classification tokens:

$$\text{score} = \frac{e^{z_\text{yes}}}{e^{z_\text{yes}} + e^{z_\text{no}}}$$

where $z_\text{yes}$ and $z_\text{no}$ are the final-position logits of the tokens `yes` (token id: 9693) and `no` (token id: 2152).

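For instance, with illustrative logits `z_yes = 2.0` and `z_no = -1.0` (made-up values, not real model outputs), the score comes out near 0.953:

```python
import math

# Illustrative logits only; real values come from the model's final token position.
z_yes, z_no = 2.0, -1.0
score = math.exp(z_yes) / (math.exp(z_yes) + math.exp(z_no))
print(round(score, 3))  # 0.953
```
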
## Use Cases

- **Agent Memory Retrieval** — Rerank retrieved memory fragments for LLM agents with long-term memory.
- **Multi-turn Dialogue Search** — Handle temporal, causal, and coreference queries in conversational history.
- **RAG Pipelines** — Serve as a second-stage reranker in retrieve-then-rerank architectures (see the sketch after this list).
- **General Text Retrieval** — Also effective as a general-purpose reranker for web search and document retrieval tasks.

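A sketch of the RAG case, where `first_stage_hits` stands in for whatever your first-stage retriever (BM25, dense embeddings, etc.) returns:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("IAAR-Shanghai/MemReranker-4B")

query = "What are the user's hobbies?"
# Stand-in for first-stage retrieval output.
first_stage_hits = [
    "The user likes playing badminton.",
    "The user does backend development in Hangzhou.",
    "The user prefers a concise reply style.",
]

# Keep the top-2 reranked passages as context for the generator.
ranked = model.rank(query, first_stage_hits, top_k=2)
context = [first_stage_hits[hit["corpus_id"]] for hit in ranked]
print(context)
```
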
## Citation

```bibtex
@article{li2026memreranker,
  title={MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval},
  author={Li, Chunyu and Kang, Jingyi and Chen, Ding and Zhang, Mengyuan and Shen, Jiajun and Tang, Bo and Zhou, Xuanhe and Xiong, Feiyu and Li, Zhiyu},
  journal={arXiv preprint arXiv:2605.06132},
  year={2026}
}
```

## Acknowledgements

MemReranker is built upon [Qwen3-Reranker](https://huggingface.co/Qwen/Qwen3-Reranker-4B) by the Qwen Team. We thank the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) community for the CrossEncoder integration support.