Hindsight Memory Reranker
A fine-tuned cross-encoder reranking model optimized for ranking documents in Hindsight-formatted agent memory systems.
Model Description
This model is a fine-tuned version of cross-encoder/ms-marco-MiniLM-L12-v2, specifically trained to rerank memory documents formatted according to the Hindsight memory architecture.
The model is trained with Quantization-Aware Training (QAT) and exported to ONNX format with INT8 dynamic activation and INT4 weight quantization for efficient inference.
Key Features
- Date-Range Aware: Distinguishes between ongoing facts (
[End: ongoing]), completed events ([End: <date>]), and timeless facts (no date prefix) - Hindsight-Optimized: Trained on documents with temporal anchors, fact types, and contextual metadata
- Quantized: INT8/INT4 quantization for efficient CPU deployment
- ONNX Export: Ready for production deployment without PyTorch dependency
Usage
With ONNX Runtime
import onnxruntime as ort
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_truncation(max_length=512)
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
session = ort.InferenceSession("model.onnx")
def score(query: str, documents: list[str]) -> list[float]:
pairs = [(query, doc) for doc in documents]
encodings = tokenizer.encode_batch(pairs)
input_ids = np.array([e.ids for e in encodings], dtype=np.int64)
attention_mask = np.array([e.attention_mask for e in encodings], dtype=np.int64)
token_type_ids = np.array([e.type_ids for e in encodings], dtype=np.int64)
logits = session.run(None, {
"input_ids": input_ids,
"attention_mask": attention_mask,
"token_type_ids": token_type_ids,
})[0]
return logits.flatten().tolist()
Document Formatting
Documents must be formatted with date ranges before ranking:
[Start: January 01, 2024 (2024-01-01)] [End: ongoing] [World] Ruby Martinez is the Department Head of the Engineering Department at TechCo Global.
Format cases:
| Scenario | Format |
|---|---|
| Ongoing fact (started, still true) | [Start: Month DD, YYYY (YYYY-MM-DD)] [End: ongoing] [Type] Text |
| Completed fact (started and ended) | [Start: Month DD, YYYY (YYYY-MM-DD)] [End: Month DD, YYYY (YYYY-MM-DD)] [Type] Text |
| Timeless fact (no temporal anchor) | [Type] Text |
| End-only (rare) | [End: Month DD, YYYY (YYYY-MM-DD)] [Type] Text |
Where:
- Type:
World(enduring fact),Event(bounded occurrence), orObservation(synthesized insight) - Context (optional): Category prefix before the text (e.g.,
Architecture Decision: ...)
Training Details
Training Data
315 hand-crafted triplets covering diverse domains: software engineering, data engineering, DevOps, personal life, health, finance, hobbies, work culture, and consumer products. All names and organizations are fictional.
Each triplet contains:
query: Natural language questionpositive: The memory unit that correctly answers the querynegative: A plausible but incorrect memory unit (hard negative β topically related but wrong)
Both positive and negative include occurred_start and occurred_end fields to train date-range awareness.
| Split | Samples |
|---|---|
| Train | 315 |
| Eval | 70 |
| Test | 60 |
Example:
{
"query": "What framework does the Memex API use?",
"positive": {
"text": "The Memex application remains a Python-based system utilizing the FastAPI framework. | When: As of March 20, 2026 | Involving: The development team | Following the abandonment of the Rust migration spike...",
"occurred_start": "2026-03-20T00:00:00",
"occurred_end": null,
"type": "world",
"context": ""
},
"negative": {
"text": "A Rust migration of the Memex server was initiated as a proof-of-concept spike but was ultimately abandoned and never completed. | When: Before March 20, 2026 | Involving: The development team...",
"occurred_start": "2026-02-01T00:00:00",
"occurred_end": "2026-03-20T00:00:00",
"type": "event",
"context": ""
}
}
Training Configuration
| Parameter | Value |
|---|---|
| Base Model | cross-encoder/ms-marco-MiniLM-L12-v2 |
| Epochs | 10 |
| Batch Size | 8 |
| Learning Rate | 6e-6 |
| Loss Function | Margin Ranking Loss (margin=1.0) |
| Max Sequence Length | 512 |
| Quantization | INT8 dynamic activation, INT4 weights (QAT via torchao) |
Loss Function
Uses Margin Ranking Loss to ensure positive documents receive higher relevance scores than negative documents:
loss = max(0, margin - score_positive + score_negative)
Evaluation Results
v2 β Date range format (2026-03-22)
| Metric | Baseline (Eval) | Trained (Eval) | Baseline (Test) | Trained (Test) |
|---|---|---|---|---|
| Accuracy@1 | 0.8857 | 0.9571 | 0.8667 | 0.9333 |
| Accuracy@3 | 1.0000 | 1.0000 | 0.9833 | 0.9833 |
| Accuracy@5 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Accuracy@10 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| MRR@10 | 0.9429 | 0.9786 | 0.9208 | 0.9597 |
Note: The v2 baseline scores are higher than v1 because the v1 dataset contained low-quality examples (single-word triplets, trivially short texts) that confused even a strong baseline. The v2 data was fully rewritten with natural memory unit text. The relative improvement (baseline β trained) is the meaningful signal.
v1 β Single date format (previous)
| Metric | Baseline (Eval) | Trained (Eval) | Baseline (Test) | Trained (Test) |
|---|---|---|---|---|
| Accuracy@1 | 0.6286 | 0.6714 | 0.5167 | 0.7167 |
| Accuracy@3 | 0.8143 | 0.8714 | 0.8000 | 0.8167 |
| Accuracy@5 | 0.8571 | 0.9000 | 0.8500 | 0.8833 |
| MRR@10 | 0.7290 | 0.7728 | 0.6681 | 0.7788 |
Model Architecture
- Type: Cross-Encoder for sequence classification
- Base: MiniLM-L12 (12 transformer layers)
- Output: Single relevance score (logits)
- Quantization: INT8 dynamic activation + INT4 weight (via torchao)
Changelog
v2 (2026-03-22)
- Breaking: Input format changed from
[Date: ...]to[Start: ...] [End: ...]date ranges - Added
occurred_endsupport for distinguishing ongoing vs completed facts - Timeless facts (no dates) now have no date prefix instead of a forced date
- Training data fully rewritten: 315 hand-crafted triplets with diverse domains
- Removed low-quality single-word examples from v1 dataset
v1 (initial)
- Single
[Date: ...]timestamp format - 300 training triplets (mixed quality)
Limitations
- Training data is synthetic; the model may not generalize to all real-world retrieval patterns
- Optimized for Memex memory unit format; may not perform well on arbitrary document formats
- English language only
- Small eval/test corpus (140/120 docs) β absolute accuracy numbers may not reflect production performance at larger scale
License
MIT License
- Downloads last month
- 261
Model tree for JasperHG90/ms-marco-minilm-l12-hindsight-reranker
Base model
microsoft/MiniLM-L12-H384-uncased