| --- |
| license: apache-2.0 |
| base_model: prajjwal1/bert-tiny |
| library_name: transformers |
| pipeline_tag: text-classification |
| tags: |
| - lycheemem |
| - memory |
| - reranking |
| - evidence-retrieval |
| - bert-tiny |
| --- |
| |
| # LycheeMem BERT-Tiny Memory Reranker v0 |
|
|
| This repository provides the optional v0 transformer reranker checkpoint for |
| LycheeMem semantic memory search. The model scores `(query, memory candidate)` |
| pairs and is used as a conservative reranker over a wider memory candidate pool. |
|
|
The reranker is default-off in LycheeMem. It changes memory search only when the
user installs the optional rerank dependencies, downloads this checkpoint, and
explicitly enables the transformer rerank hook.
|
|
| ## Model |
|
|
| ```text |
| name: LycheeMem/reranker |
| base_model: prajjwal1/bert-tiny |
| task: memory evidence reranking |
| architecture: AutoModelForSequenceClassification |
| runtime: local checkpoint, default-off LycheeMem hook |
| version: v0.1.0 |
| ``` |
|
|
| ## Intended Use |
|
|
| Use this checkpoint with LycheeMem's experimental transformer reranker hook: |
|
|
```bash
pip install "lycheemem[rerank]"

export EXPERIMENTAL_TRANSFORMER_RERANK=true
export TRANSFORMER_RERANK_MODEL_PATH=/path/to/lycheemem-reranker-v0
export TRANSFORMER_RERANK_MAX_REPLACEMENTS=1
export TRANSFORMER_RERANK_MERGE_MARGIN=0.3
export TRANSFORMER_RERANK_WIDE_TOP_K=50
```
|
|
| If dependencies or the local checkpoint are missing, LycheeMem falls back to |
| baseline memory search. |
|
|
| ## Training Data |
|
|
| The checkpoint was trained on LoCoMo-derived memory evidence reranking bundles. |
| Each training example pairs a user question with candidate memory texts and |
| evidence IDs derived from the LoCoMo benchmark. |
|
|
| The source repository does not include LoCoMo data, generated caches, or training |
| outputs. Reproduction notes are maintained in the LycheeMem source repository. |
|
|
| ## Metrics |
|
|
| All metrics below measure evidence retrieval/reranking, not final LLM answer |
| quality. The primary metric is whether at least one gold evidence item appears |
| in the returned top-10 candidates (`hit@10`). |
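Concretely, `hit@10` can be computed as below. The helper names are
illustrative, not part of LycheeMem's evaluation harness.

```python
# Illustrative hit@10: a question counts as a hit when at least one gold
# evidence ID appears among the top 10 returned candidate IDs.
def hit_at_k(ranked_ids: list[str], gold_ids: set[str], k: int = 10) -> bool:
    return any(cid in gold_ids for cid in ranked_ids[:k])

def hit_rate(examples: list[tuple[list[str], set[str]]], k: int = 10) -> float:
    hits = sum(hit_at_k(ranked, gold, k) for ranked, gold in examples)
    return hits / len(examples)

# e.g. 130 hits over 200 QA pairs gives 0.650, matching the v0 row below.
```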
|
|
| ### LoCoMo Evidence Retrieval |
|
|
| ```text |
| System memory backend, 200 QA: |
| baseline: 124/200 = 0.620 |
| v0: 130/200 = 0.650 |
| added/lost/net: +7/-1/+6 |
| |
| System LanceDB backend, 200 QA: |
| baseline: 124/200 = 0.620 |
| v0: 131/200 = 0.655 |
| added/lost/net: +8/-1/+7 |
| |
| Full-memory cache, 5 seeds: |
| held added/lost/net: +115/-7/+108 |
| added/lost ratio: 16.43 |
| |
| Split checks: |
| interleave held: 466/765 -> 495/765, net +29 |
| prefix held: 473/766 -> 501/766, net +28 |
| conversation-heldout held: 476/772 -> 504/772, net +28 |
| ``` |
|
|
| ### Candidate Context Probe |
|
|
| Same checkpoint, different candidate text construction: |
|
|
| ```text |
| single-turn v0: 998/1531 = 0.651862, net +67 |
| context-candidate v0: 1013/1531 = 0.661659, net +82 |
| ``` |
|
|
| ### Zero-Shot Evidence Selection |
|
|
| ```text |
| LongMemEval-S cleaned: |
| baseline: 469/500 = 0.938 |
| wide: 500/500 = 1.000 |
| v0: 484/500 = 0.968 |
| added/lost/net: +16/-1/+15 |
| |
| MSC-MemFuse-MC10 turn-level: |
| baseline: 142/299 = 0.475 |
| wide: 279/299 = 0.933 |
| v0: 152/299 = 0.508 |
| added/lost/net: +10/-0/+10 |
| |
| HotpotQA distractor sentence-level: |
| baseline: 6957/7405 = 0.9395 |
| wide: 7405/7405 = 1.0000 |
| v0: 7076/7405 = 0.9556 |
| added/lost/net: +141/-22/+119 |
| ``` |
|
|
| These zero-shot fixtures are intended to check whether the LoCoMo-trained v0 |
| checkpoint transfers as an evidence selector. LongMemEval-S and MSC-MemFuse are |
| memory/dialogue-style settings. HotpotQA is a wiki multi-hop supporting-sentence |
| setting, so it is a useful but less direct transfer check. |
|
|
| ## Limitations |
|
|
| - The checkpoint is trained on LoCoMo-derived evidence bundles and may not |
| generalize to every private memory corpus. |
| - It assumes relevant evidence is already present in the wide candidate pool. |
| - It is not an RL policy and does not learn online by itself. |
| - The MSC-MemFuse fixture uses answer-string matching to infer evidence turns; |
| this is a conservative heuristic, not original human evidence annotation. |
- HotpotQA transfer is positive but shows more lost cases than the memory-style
  fixtures, so behavior on dense wiki distractor pools should be monitored.
| - The strongest current accuracy bottleneck appears to be candidate |
| representation, especially single-turn evidence-boundary cases. |
| - The hook should remain default-off until a user or deployment explicitly opts |
| in and monitors diagnostics. |
|
|
| ## Runtime Behavior |
|
|
| LycheeMem's transformer reranker uses this checkpoint only after baseline memory |
| search has produced a wider candidate pool. The current v0 policy is |
| conservative: |
|
|
| ```text |
| wide_top_k: 50 |
| max_replacements: 1 |
| merge_margin: 0.3 |
| runtime: local checkpoint only |
| default behavior: disabled |
| ``` |
|
|
| In plain terms: baseline search retrieves memories first. The reranker only gets |
| a narrow chance to replace one item in the final top-k when a better evidence |
| candidate is already present in the wider candidate pool. |
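A minimal sketch of this policy follows, assuming a `score` callable that wraps
the pair-scoring call; the exact merge rule in LycheeMem may differ.

```python
# Illustrative conservative merge: at most max_replacements wide-pool
# candidates may displace the weakest baseline results, and only when the
# reranker scores the challenger above the incumbent by merge_margin.
def conservative_merge(
    baseline_top_k: list[str],
    wide_pool: list[str],
    score,                      # callable: candidate text -> reranker score
    max_replacements: int = 1,
    merge_margin: float = 0.3,
) -> list[str]:
    merged = list(baseline_top_k)
    challengers = sorted(
        (c for c in wide_pool if c not in merged), key=score, reverse=True
    )
    for challenger in challengers[:max_replacements]:
        weakest = min(merged, key=score)
        # Replace only on a clear margin; otherwise keep the baseline result.
        if score(challenger) >= score(weakest) + merge_margin:
            merged[merged.index(weakest)] = challenger
    return merged
```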
|
|
| ## Files |
|
|
Expected checkpoint directory contents:
|
|
| ```text |
| config.json |
| model.safetensors |
| run_meta.json |
| special_tokens_map.json |
| tokenizer_config.json |
| vocab.txt |
| ``` |
|
|
| SHA256 checksums for the v0.1.0 checkpoint artifact: |
|
|
| ```text |
| ed54572648824881775812e8b2b0af9be1b720ebdbdf2d1b7c0d976c4ca14c8a config.json |
| 0a328c53b55cbd49aeec0a44e6b9e2d02d09539e6784d93fc515ba815261fca0 model.safetensors |
| 7841bca86e19c72c1cd0f4834efb5c413975ad01ffc5c7020328f4cc62b70536 run_meta.json |
| b6d346be366a7d1d48332dbc9fdf3bf8960b5d879522b7799ddba59e76237ee3 special_tokens_map.json |
| e711904cac23112776b678356ccf702cf934babaa01125f698ac43bf9ad38e73 tokenizer_config.json |
| 07eced375cec144d27c900241f3e339478dec958f92fddbc551f295c992038a3 vocab.txt |
| ``` |
|
|
| ## Citation and Scope |
|
|
| This checkpoint is part of LycheeMem's optional memory retrieval research path. |
| It is not an RL policy and does not learn online by itself. Online feedback and |
| personalization are handled by separate experimental components. |
|
|