---
license: apache-2.0
base_model: Qwen/Qwen2.5-0.5B-Instruct
library_name: peft
tags:
  - memory-encoder
  - lora
  - structured-extraction
  - lycheemem
language:
  - en
pipeline_tag: text-generation
---

# Encoder v0: Memory Encoder for LycheeMem

A LoRA adapter on top of **Qwen2.5-0.5B-Instruct** that converts conversation turns into structured `MemoryRecord` JSON (typed, atomic, with entities / temporal / evidence span / source_role). Trained by distilling DeepSeek V4 Flash and selecting high-quality candidates via a 4-dim verifier.

Designed as a drop-in encoder for [LycheeMem](https://github.com/LycheeMem/lycheemem)'s write-side memory pipeline, with a **hard JSON schema guarantee via constrained decoding** (outlines + Pydantic).

## Highlights

- **8.7 MB LoRA adapter** on a 0.5B base; runs locally on a single RTX 4060 Ti 8GB at zero API cost
- **+125% weighted_score** over the runtime Qwen2.5-7B baseline on a 519-sample held-out set
- **100% JSON schema compliance** with constrained decoding (vs 74% for runtime baseline, 96-98% for SOTA prompt-only)
- **~6× faster** than the runtime baseline (3.4s vs 20s p50 latency)
- On LongMemEval-style task dialogs, **outperforms even Qwen2.5-72B and the V4 Flash teacher** on weighted_score (3.749 vs 3.666 / 3.700)

## Evaluation

Evaluated on 519 held-out conversation segments (LongMemEval-S + MSC-MemFuse-MC10, English personal dialogs). The **weighted_score** is a 4-dim LLM-as-judge metric (judged by V4 Flash) over `atomicity / self_containedness / entity_coverage / evidence_alignment`, weighted 0.25 / 0.30 / 0.20 / 0.25 on a 0-5 scale, with failed generations scored 0.
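
For concreteness, a minimal sketch of how a per-sample score aggregates under this scheme (illustrative names only; the actual scoring script lives in the LycheeMem repo, and per-sample scores are presumably averaged over the 519 samples to produce the leaderboard column below):

```python
# Sketch of the weighted_score aggregation described above (illustrative names).
WEIGHTS = {
    "atomicity": 0.25,
    "self_containedness": 0.30,
    "entity_coverage": 0.20,
    "evidence_alignment": 0.25,
}

def weighted_score(judge_scores: dict[str, float] | None) -> float:
    """judge_scores holds per-dimension 0-5 scores from the V4 Flash judge;
    None marks a failed sample (e.g. schema violation), which scores 0."""
    if judge_scores is None:
        return 0.0
    return sum(WEIGHTS[dim] * judge_scores[dim] for dim in WEIGHTS)
```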

### 7-Model Leaderboard

| rank | model | size | weighted_score | schema_ok | latency p50 |
|---|---|---|---|---|---|
| 1 | DeepSeek-V3 | 671B (MoE) | 4.057 | 96.9% | 44s |
| 2 | Qwen2.5-72B-Instruct | 72B | 3.951 | 98.8% | 33s |
| 3 | DeepSeek V4 Flash (teacher) | n/a | 3.833 | 95.8% | 14s |
| **4** | **encoder_v0 (this model)** | **0.5B + LoRA** | **3.775** | **100.0%** | **3.4s** |
| 5 | Qwen3-32B | 32B | 3.476 | 97.7% | 67s |
| 6 | Qwen2.5-14B-Instruct | 14B | 1.946 | 80.5% | 19s |
| 7 | Qwen2.5-7B-Instruct (runtime baseline) | 7B | 1.679 | 74.0% | 20s |

### 4-Dim Quality Breakdown

| model | atomicity | self_cont | entity_cov | evidence |
|---|---|---|---|---|
| DeepSeek-V3 | 4.61 | 4.90 | 4.27 | 3.60 |
| Qwen2.5-72B | 4.89 | 4.85 | 4.14 | 3.54 |
| V4 Flash (teacher) | 4.48 | 4.88 | 4.21 | 3.94 |
| **encoder_v0** | **4.53** | **4.51** | **2.93** ⚠️ | **3.30** |
| Qwen3-32B | 4.38 | 4.74 | 4.13 | 3.18 |
| Qwen2.5-7B | 4.20 | 4.47 | 3.27 | 2.98 |

`entity_coverage` is the model's main known weakness (1.0-1.3 points below SOTA); a fix is planned for v2.

### Per-Source Breakdown

| model | LongMemEval | MSC |
|---|---|---|
| DeepSeek-V3 | 3.871 | 4.357 |
| Qwen2.5-72B | 3.666 | 4.408 |
| V4 Flash (teacher) | 3.700 | 4.047 |
| **encoder_v0** | **3.749** | **3.817** |
| Qwen2.5-7B (baseline) | 1.330 | 2.241 |

On task-oriented dialogs (LongMemEval), encoder_v0 actually **surpasses both Qwen2.5-72B and the V4 Flash teacher**.

## Training

```text
Pipeline:
  Stage 1: 5000 conversation segments from LongMemEval-S + MSC-MemFuse-MC10
  Stage 2a: V4 Flash distillation → 4769 candidate record sets
  Stage 2b: Rule + V4 Flash verifier (4-dim ≥ 4.0) → 2590 pseudo-gold
  Stage 2c: +394 synthetic advice-class samples (gold = empty records)
  Stage 3:  LoRA SFT on Qwen2.5-0.5B-Instruct
            rank=16, alpha=32, dropout=0.05
            target_modules = q_proj, k_proj, v_proj, o_proj
            3 epochs, batch=1*accum16, lr=2e-4, bf16
            28.5 min on RTX 4060 Ti 8GB

Trainable params: 2.16M / 496M = 0.44%
Final eval loss: 0.293
```

Total training cost: ~¥24 in API calls (distillation + verifier) plus ~28 minutes of local GPU time.
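
In `peft` / `transformers` terms, the Stage 3 recipe corresponds roughly to the configuration below (a sketch reconstructed from the hyperparameters above; the actual training script lives in the LycheeMem repo and may differ in details such as `output_dir` or scheduler settings):

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings from the Stage 3 recipe above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Optimization settings: effective batch size 1 x 16 accumulation, bf16.
training_args = TrainingArguments(
    output_dir="encoder_v0",  # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    bf16=True,
)
```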

## Intended Use

**Primary use**: Drop-in write-side encoder for LycheeMem (or similar long-term memory systems) that takes a conversation segment and outputs `MemoryRecord` JSON suitable for storage and downstream retrieval.

**Input format**:

```python
{
  "previous_turns": [{"role": "user", "content": "..."}, ...],  # optional
  "current_turns": [{"role": "user", "content": "..."}, ...],   # required
  "session_date": "2026-05-12"  # optional, ISO or freeform
}
```
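
The exact serialization into the model prompt is defined by LycheeMem; the sketch below is a plausible reconstruction based on the tagged format used in the usage example further down (`build_user_content` is a hypothetical helper, and the placement of `session_date` is an assumption):

```python
def build_user_content(segment: dict) -> str:
    """Hypothetical helper: render an input segment in the <PREVIOUS_TURNS> /
    <CURRENT_TURNS> tagged format shown in the usage example below."""
    prev = segment.get("previous_turns") or []
    prev_text = "\n".join(f"{t['role']}: {t['content']}" for t in prev) or "(no previous turns)"
    cur_text = "\n".join(f"{t['role']}: {t['content']}" for t in segment["current_turns"])
    content = (f"<PREVIOUS_TURNS>\n{prev_text}\n</PREVIOUS_TURNS>\n\n"
               f"<CURRENT_TURNS>\n{cur_text}\n</CURRENT_TURNS>")
    if segment.get("session_date"):
        # Assumption: the session date is prepended; check LycheeMem's prompt builder.
        content = f"session_date: {segment['session_date']}\n\n" + content
    return content
```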

**Output format** (strict JSON, guaranteed by constrained decoding):

```json
{
  "records": [
    {
      "memory_type": "fact|preference|event|constraint|procedure|failure_pattern|tool_affordance",
      "semantic_text": "User plans to visit Beijing on 2026-05-20 to meet Li Hua.",
      "entities": ["Beijing", "Li Hua"],
      "temporal": {"t_ref": "2026-05-12", "t_valid_from": "2026-05-20", "t_valid_to": ""},
      "tags": ["travel", "meeting"],
      "evidence_turns": [0],
      "source_role": "user"
    }
  ]
}
```

## How to Use

### Install dependencies

```bash
pip install transformers peft outlines pydantic torch
```

### Inference (with constrained decoding, recommended)

```python
import torch
import outlines
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from pydantic import BaseModel
from typing import Literal

# 1. Load base + LoRA adapter
BASE = "Qwen/Qwen2.5-0.5B-Instruct"
ADAPTER = "fuhao23/encoder_v0"

tok = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model_hf = PeftModel.from_pretrained(base, ADAPTER).eval()

# 2. Define output schema (must match the schema used in training)
class Temporal(BaseModel):
    t_ref: str = ""
    t_valid_from: str = ""
    t_valid_to: str = ""

class MemoryRecord(BaseModel):
    memory_type: Literal["fact", "preference", "event", "constraint",
                         "procedure", "failure_pattern", "tool_affordance"]
    semantic_text: str
    entities: list[str]
    temporal: Temporal
    tags: list[str]
    evidence_turns: list[int]
    source_role: Literal["user", "assistant", "both", ""]

class MemoryRecordList(BaseModel):
    records: list[MemoryRecord]

model = outlines.from_transformers(model_hf, tok)
generator = outlines.Generator(model, MemoryRecordList)

# 3. The system prompt this adapter was trained on (use COMPACT_ENCODING_SYSTEM
#    from LycheeMem: src/memory/semantic/prompts.py:13-85). Must use as-is.
SYSTEM_PROMPT = """You are a memory extractor for a personal AI assistant's
long-term memory system. ... (full prompt in LycheeMem repo)"""

# 4. Build user content + encode
user_content = """\
<PREVIOUS_TURNS>
(no previous turns)
</PREVIOUS_TURNS>

<CURRENT_TURNS>
user: I want to try out my new slow cooker from Bed Bath & Beyond.
assistant: Congratulations! Slow cookers are great for ...
user: Thanks for the cleaning tips.
</CURRENT_TURNS>"""

prompt = tok.apply_chat_template(
    [{"role": "system", "content": SYSTEM_PROMPT},
     {"role": "user", "content": user_content}],
    tokenize=False, add_generation_prompt=True,
)
output = generator(prompt, max_new_tokens=1024)
print(output)
# Output: strict JSON of {"records": [...]}
```
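
In recent `outlines` versions the generator returns the raw JSON string, so it can be parsed back through the same Pydantic schema for typed access:

```python
# Parse the constrained output into typed objects (Pydantic v2 API).
records = MemoryRecordList.model_validate_json(output)
for rec in records.records:
    print(rec.memory_type, "|", rec.semantic_text)
```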

### Inference (without constrained decoding, not recommended)

The model *can* be used without `outlines`, but **schema compliance drops from 100% to ~64%**: the 0.5B base tends to regress into conversation-continuation mode on assistant-advice-heavy inputs. Always use constrained decoding in production.
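
If you do run unconstrained, validate every output and be prepared to drop or retry failures. A minimal sketch, reusing `tok`, `model_hf`, `prompt`, and `MemoryRecordList` from the example above:

```python
# Unconstrained generation (not recommended): validate manually, handle failures.
inputs = tok(prompt, return_tensors="pt").to(model_hf.device)
out_ids = model_hf.generate(**inputs, max_new_tokens=1024, do_sample=False)
text = tok.decode(out_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

try:
    records = MemoryRecordList.model_validate_json(text)
except Exception:
    records = None  # schema violation: drop, or retry with constrained decoding
```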

## Limitations

This is a v0 research release. **Read carefully before deployment**:

1. **LLM-as-judge bias in evaluation**. The `weighted_score` is computed with V4 Flash as the judge, the same model family as the teacher. Comparisons against models stronger than V4 Flash (Qwen2.5-72B, DeepSeek-V3) may suffer ceiling effects; the precise ordering of ranks 1-4 is not fully reliable.

2. **No human ground truth**. No human annotator has labeled records as good or bad, so the judge's agreement with humans is unverified. Recommended next step: a 50-sample human annotation with Cohen's kappa.

3. **No downstream retrieval evaluation**. The original training plan included an `evidence retrieval hit@10` benchmark on LongMemEval; this is not yet completed. The current metrics measure **encoder output quality in isolation**, not the end-to-end impact on memory retrieval accuracy.

4. **Narrow evaluation distribution**. The 519-sample held-out set consists entirely of English personal dialogs (LongMemEval + MSC). Chinese, technical, code, and long-context dialogs are not evaluated; out-of-distribution deployment may degrade.

5. **Entity coverage weakness**. The `entity_coverage` score is 2.93 vs 4.1-4.3 for SOTA models: the encoder under-extracts named entities. A fix with entity-rich training data is planned for v2.

6. **Constrained decoding is required for the headline 100% schema_ok**. Without `outlines`, schema compliance drops to ~64%.

7. **Not yet integrated into the LycheeMem runtime**. There is no real-traffic data yet; quality on actual user dialogs, as opposed to the eval set, is untested.

## Method Background

The pipeline and evaluation methodology are documented in detail in the [LycheeMem repository](https://github.com/LycheeMem/lycheemem):

- `docs/encoder_v0.md`: full evaluation report with case studies
- `docs/encoder_eval_framework.md`: evaluation framework
- `examples/encoder_v0_try.py`: interactive try-it tool

Inspired by [MemReranker](https://arxiv.org/abs/2605.06132)'s small-model distillation methodology for memory systems.

## Citation

```bibtex
@misc{lycheemem_encoder_v0,
  title  = {Encoder v0: A Distilled Memory Encoder for Long-Term Conversation Memory},
  author = {LycheeMem},
  year   = {2026},
  url    = {https://huggingface.co/fuhao23/encoder_v0}
}
```

Base model:

```bibtex
@misc{qwen2.5,
  title  = {Qwen2.5: A Party of Foundation Models},
  author = {Qwen Team},
  year   = {2024}
}
```

## License

Apache 2.0 (matching the base Qwen2.5-0.5B-Instruct license).