---
license: apache-2.0
base_model: Qwen/Qwen2.5-0.5B-Instruct
library_name: peft
tags:
- memory-encoder
- lora
- structured-extraction
- lycheemem
language:
- en
pipeline_tag: text-generation
---
# Encoder v0: Memory Encoder for LycheeMem
A LoRA adapter on top of **Qwen2.5-0.5B-Instruct** that converts conversation turns into structured `MemoryRecord` JSON (typed, atomic, with entities / temporal / evidence span / source_role). Trained by distilling DeepSeek V4 Flash and selecting high-quality candidates with a 4-dim verifier.
Designed as a drop-in encoder for [LycheeMem](https://github.com/LycheeMem/lycheemem)'s write-side memory pipeline, with a **hard JSON schema guarantee via constrained decoding** (outlines + Pydantic).
## Highlights
- **8.7 MB LoRA adapter** on a 0.5B base; runs locally on a single RTX 4060 Ti 8GB with zero API cost
- **+125% weighted_score** over the runtime Qwen2.5-7B baseline on a 519-sample held-out set
- **100% JSON schema compliance** with constrained decoding (vs 74% for runtime baseline, 96-98% for SOTA prompt-only)
- **4× faster** than the runtime baseline (3.4s vs 20s p50 latency)
- On LongMemEval-style task dialogs, **outperforms even Qwen2.5-72B and the V4 Flash teacher** on weighted_score (3.749 vs 3.666 / 3.700)
## Evaluation
Evaluated on 519 held-out conversation segments (LongMemEval-S + MSC-MemFuse-MC10, English personal dialogs). The **weighted_score** is a 4-dim LLM-as-judge metric (V4 Flash) over `atomicity / self_containedness / entity_coverage / evidence_alignment`, weighted 0.25 / 0.30 / 0.20 / 0.25 on a 0-5 scale; failed samples score 0.
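For concreteness, a minimal sketch of how such a weighted score can be aggregated. This is illustrative only; the function name and data layout are assumptions, and the official evaluation code lives in the LycheeMem repo.
```python
# Illustrative sketch of the weighted_score aggregation described above.
# Failed samples (e.g. invalid JSON) are assumed to contribute 0 to every dimension.
WEIGHTS = {
    "atomicity": 0.25,
    "self_containedness": 0.30,
    "entity_coverage": 0.20,
    "evidence_alignment": 0.25,
}

def weighted_score(per_sample_scores: list[dict | None]) -> float:
    """Average the 4-dim judge scores (0-5 each) over all samples; None marks a failure."""
    total = 0.0
    for scores in per_sample_scores:
        if scores is None:
            continue  # failed sample counts as 0
        total += sum(w * scores[dim] for dim, w in WEIGHTS.items())
    return total / len(per_sample_scores)
```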
### 7-Model Leaderboard
| rank | model | size | weighted_score | schema_ok | latency p50 |
|---|---|---|---|---|---|
| 1 | DeepSeek-V3 | 671B (MoE) | 4.057 | 96.9% | 44s |
| 2 | Qwen2.5-72B-Instruct | 72B | 3.951 | 98.8% | 33s |
| 3 | DeepSeek V4 Flash (teacher) | n/a | 3.833 | 95.8% | 14s |
| **4** | **encoder_v0 (this model)** | **0.5B + LoRA** | **3.775** | **100.0%** | **3.4s** |
| 5 | Qwen3-32B | 32B | 3.476 | 97.7% | 67s |
| 6 | Qwen2.5-14B-Instruct | 14B | 1.946 | 80.5% | 19s |
| 7 | Qwen2.5-7B-Instruct (runtime baseline) | 7B | 1.679 | 74.0% | 20s |
### 4-Dim Quality Breakdown
| model | atomicity | self_cont | entity_cov | evidence |
|---|---|---|---|---|
| DeepSeek-V3 | 4.61 | 4.90 | 4.27 | 3.60 |
| Qwen2.5-72B | 4.89 | 4.85 | 4.14 | 3.54 |
| V4 Flash (teacher) | 4.48 | 4.88 | 4.21 | 3.94 |
| **encoder_v0** | **4.53** | **4.51** | **2.93** ⚠️ | **3.30** |
| Qwen3-32B | 4.38 | 4.74 | 4.13 | 3.18 |
| Qwen2.5-7B | 4.20 | 4.47 | 3.27 | 2.98 |
`entity_coverage` is the model's main known weakness (about 1.2-1.3 points below the SOTA models); a fix is planned for v2.
### Per-Source Breakdown
| model | LongMemEval (weighted_score) | MSC (weighted_score) |
|---|---|---|
| DeepSeek-V3 | 3.871 | 4.357 |
| Qwen2.5-72B | 3.666 | 4.408 |
| V4 Flash (teacher) | 3.700 | 4.047 |
| **encoder_v0** | **3.749** | **3.817** |
| Qwen2.5-7B (baseline) | 1.330 | 2.241 |
On task-oriented dialogs (LongMemEval), encoder_v0 actually **surpasses both Qwen2.5-72B and the V4 Flash teacher**.
## Training
```text
Pipeline:
Stage 1: 5000 conversation segments from LongMemEval-S + MSC-MemFuse-MC10
Stage 2a: V4 Flash distillation → 4769 candidate record sets
Stage 2b: Rule + V4 Flash verifier (4-dim ≥ 4.0) → 2590 pseudo-gold
Stage 2c: +394 synthetic advice-class samples (gold = empty records)
Stage 3: LoRA SFT on Qwen2.5-0.5B-Instruct
rank=16, alpha=32, dropout=0.05
target_modules = q_proj, k_proj, v_proj, o_proj
3 epochs, batch=1*accum16, lr=2e-4, bf16
28.5 min on RTX 4060 Ti 8GB
Trainable params: 2.16M / 496M = 0.44%
Final eval loss: 0.293
```
Total training cost: ~¥24 (API for distillation + verifier) + 28 min local GPU.
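For reference, a minimal `peft` configuration matching the Stage 3 hyperparameters above might look like the sketch below. The actual training script, dataset preparation, and SFT loop live in the LycheeMem repo and are not reproduced here.
```python
# Sketch only: LoRA config matching the reported hyperparameters
# (rank=16, alpha=32, dropout=0.05, attention projections only).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.bfloat16
)
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # expect roughly 2.16M trainable / 496M total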
## Intended Use
**Primary use**: Drop-in write-side encoder for LycheeMem (or similar long-term memory systems) that takes a conversation segment and outputs `MemoryRecord` JSON suitable for storage and downstream retrieval.
**Input format**:
```python
{
"previous_turns": [{"role": "user", "content": "..."}, ...], # optional
"current_turns": [{"role": "user", "content": "..."}, ...], # required
"session_date": "2026-05-12" # optional, ISO or freeform
}
```
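For illustration, a hypothetical helper (not part of the released code; the exact formatting used in LycheeMem may differ) that renders this input dict into the tagged user content shown in the inference example below:
```python
# Hypothetical helper: turn the input dict above into the
# <PREVIOUS_TURNS>/<CURRENT_TURNS> user content. The placement of the
# session date is an assumption, not confirmed by the model card.
def build_user_content(sample: dict) -> str:
    def render(turns):
        return "\n".join(f'{t["role"]}: {t["content"]}' for t in turns)

    prev = sample.get("previous_turns") or []
    prev_text = render(prev) if prev else "(no previous turns)"
    curr_text = render(sample["current_turns"])
    parts = [
        "<PREVIOUS_TURNS>", prev_text, "</PREVIOUS_TURNS>",
        "<CURRENT_TURNS>", curr_text, "</CURRENT_TURNS>",
    ]
    if sample.get("session_date"):
        parts.append(f'Session date: {sample["session_date"]}')
    return "\n".join(parts)
```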
**Output format** (strict JSON, guaranteed by constrained decoding):
```json
{
"records": [
{
"memory_type": "fact|preference|event|constraint|procedure|failure_pattern|tool_affordance",
"semantic_text": "User plans to visit Beijing on 2026-05-20 to meet Li Hua.",
"entities": ["Beijing", "Li Hua"],
"temporal": {"t_ref": "2026-05-12", "t_valid_from": "2026-05-20", "t_valid_to": ""},
"tags": ["travel", "meeting"],
"evidence_turns": [0],
"source_role": "user"
}
]
}
```
## How to Use
### Install dependencies
```bash
pip install transformers peft outlines pydantic torch
```
### Inference (with constrained decoding, recommended)
```python
import torch
import outlines
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from pydantic import BaseModel
from typing import Literal
# 1. Load base + LoRA adapter
BASE = "Qwen/Qwen2.5-0.5B-Instruct"
ADAPTER = "fuhao23/encoder_v0"
tok = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
BASE, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model_hf = PeftModel.from_pretrained(base, ADAPTER).eval()
# 2. Define output schema (must match the schema used in training)
class Temporal(BaseModel):
t_ref: str = ""
t_valid_from: str = ""
t_valid_to: str = ""
class MemoryRecord(BaseModel):
memory_type: Literal["fact", "preference", "event", "constraint",
"procedure", "failure_pattern", "tool_affordance"]
semantic_text: str
entities: list[str]
temporal: Temporal
tags: list[str]
evidence_turns: list[int]
source_role: Literal["user", "assistant", "both", ""]
class MemoryRecordList(BaseModel):
records: list[MemoryRecord]
model = outlines.from_transformers(model_hf, tok)
generator = outlines.Generator(model, MemoryRecordList)
# 3. The system prompt this adapter was trained on (use COMPACT_ENCODING_SYSTEM
# from LycheeMem: src/memory/semantic/prompts.py:13-85). Must use as-is.
SYSTEM_PROMPT = """You are a memory extractor for a personal AI assistant's
long-term memory system. ... (full prompt in LycheeMem repo)"""
# 4. Build user content + encode
user_content = """\
<PREVIOUS_TURNS>
(no previous turns)
</PREVIOUS_TURNS>
<CURRENT_TURNS>
user: I want to try out my new slow cooker from Bed Bath & Beyond.
assistant: Congratulations! Slow cookers are great for ...
user: Thanks for the cleaning tips.
</CURRENT_TURNS>"""
prompt = tok.apply_chat_template(
[{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": user_content}],
tokenize=False, add_generation_prompt=True,
)
output = generator(prompt, max_new_tokens=1024)
print(output)
# Output: strict JSON of {"records": [...]}
```
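Depending on your `outlines` version, `output` is typically a JSON string; you can optionally validate it back into the Pydantic models defined above before writing to your memory store:
```python
# Optional post-processing: validate the generated JSON against the schema
# defined above (skip if your outlines version already returns parsed objects).
parsed = MemoryRecordList.model_validate_json(output)
for rec in parsed.records:
    print(rec.memory_type, rec.semantic_text)
```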
### Inference (without constrained decoding, not recommended)
The model can be used without `outlines`, but **schema compliance drops from 100% to ~64%**, because the base Qwen2.5-0.5B tends to regress to conversation-continuation mode on assistant-advice-heavy inputs. Always use constrained decoding in production.
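If you do skip `outlines`, a plain `generate` call looks roughly like the sketch below (reusing `tok`, `model_hf`, and `prompt` from the constrained example); expect to add your own JSON parsing and retry handling.
```python
# Unconstrained generation sketch: per the numbers above, roughly 1 in 3 outputs
# may fail to parse as valid MemoryRecord JSON, so a production path should
# retry or re-prompt on failure.
import json

inputs = tok(prompt, return_tensors="pt").to(model_hf.device)
with torch.no_grad():
    out_ids = model_hf.generate(**inputs, max_new_tokens=1024, do_sample=False)
text = tok.decode(out_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
try:
    records = json.loads(text)
except json.JSONDecodeError:
    records = {"records": []}  # fallback on malformed output
print(records)
```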
## Limitations
This is a v0 research release. **Read carefully before deployment**:
1. **LLM-as-judge bias in evaluation**. The `weighted_score` is computed with V4 Flash as judge, the same model family as the teacher. Comparisons against models stronger than V4 Flash (Qwen2.5-72B, DeepSeek-V3) may suffer ceiling effects; the precise ranking among the top four is not fully reliable.
2. **No human ground truth**. No human annotator has labeled records as good or bad, so judge agreement with humans is unverified. Recommended next step: 50-sample human annotation plus Cohen's kappa.
3. **No downstream retrieval evaluation**. The original training plan included an `evidence retrieval hit@10` benchmark on LongMemEval; this is not yet completed. The current metrics measure **encoder output quality in isolation**, not the end-to-end impact on memory retrieval accuracy.
4. **Narrow evaluation distribution**. The 519-sample held-out set is entirely English personal dialog (LongMemEval + MSC). Chinese, technical, code, and long-context dialogs are not evaluated; performance may degrade on out-of-distribution deployments.
5. **Entity coverage weakness**. The `entity_coverage` score is 2.93 vs 4.1-4.3 for SOTA models; the encoder under-extracts named entities. A fix with entity-rich training data is planned for v2.
6. **Constrained decoding is required for the headline 100% schema_ok**. Without `outlines`, schema compliance drops to ~64%.
7. **Not yet integrated into LycheeMem runtime**. No real-traffic data; quality on actual user dialogs vs the eval set is untested.
## Method Background
The pipeline and evaluation methodology are documented in detail in the [LycheeMem repository](https://github.com/LycheeMem/lycheemem):
- `docs/encoder_v0.md` – full evaluation report with case studies
- `docs/encoder_eval_framework.md` – evaluation framework
- `examples/encoder_v0_try.py` – interactive try-it tool
Inspired by [MemReranker](https://arxiv.org/abs/2605.06132)'s small-model distillation methodology for memory systems.
## Citation
```bibtex
@misc{lycheemem_encoder_v0,
title = {Encoder v0: A Distilled Memory Encoder for Long-Term Conversation Memory},
author = {LycheeMem},
year = {2026},
url = {https://huggingface.co/fuhao23/encoder_v0}
}
```
Base model:
```bibtex
@misc{qwen2.5,
title = {Qwen2.5: A Party of Foundation Models},
author = {Qwen Team},
year = {2024}
}
```
## License
Apache 2.0 (matches base Qwen2.5-0.5B-Instruct license).