# Eval Framework Output Format

## Output Directory Structure

After a run completes, `--output-dir` contains 5 files:
```text
output-dir/
├── pipeline_sessions.jsonl   # Stage 1 checkpoint — pipeline intermediate results (session level)
├── pipeline_qa.jsonl         # Stage 1 checkpoint — pipeline intermediate results (QA level)
├── session_records.jsonl     # Final results: session pipeline data + eval judgments
├── qa_records.jsonl          # Final results: QA pipeline data + eval judgments
└── aggregate_metrics.json    # Final results: baseline-level aggregate metrics
```
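A quick way to confirm a run finished cleanly is to check that all five files are present. A minimal sketch (the helper function is my own, not part of the framework):

```python
from pathlib import Path

# The five files a completed run is expected to leave under --output-dir.
EXPECTED = [
    "pipeline_sessions.jsonl",
    "pipeline_qa.jsonl",
    "session_records.jsonl",
    "qa_records.jsonl",
    "aggregate_metrics.json",
]

def missing_outputs(output_dir):
    """Return the expected files that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED if not (root / name).exists()]

# e.g. missing_outputs("results/FU") -> [] for a complete run
```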
## File Details

### 1. session_records.jsonl

One session per line, containing the pipeline's raw data plus the eval judgments:
```json
{
  "sample_id": "vab_minecraft_...",
  "sample_uuid": "uuid-...",
  "session_id": "S01",
  "memory_snapshot": [
    {
      "memory_id": "3",
      "text": "user: OBSERVATION: ...\nassistant: THOUGHT: ...",
      "session_id": "S01",
      "status": "active",
      "source": "FUMemory",
      "raw_backend_id": "3",
      "raw_backend_type": "linear",
      "metadata": {}
    }
  ],
  "memory_delta": [
    {
      "session_id": "S01",
      "op": "add",
      "text": "user: OBSERVATION: ...",
      "linked_previous": [],
      "raw_backend_id": "3",
      "metadata": {"baseline": "FUMemory"}
    }
  ],
  "gold_state": {
    "session_id": "S01",
    "cumulative_gold_memories": [...],
    "session_new_memories": [...],
    "session_update_memories": [...],
    "session_interference_memories": []
  },
  "eval": {
    "session_id": "S01",
    "recall": 0.8,
    "covered_count": 4,
    "num_gold": 5,
    "update_recall": 1.0,
    "update_covered_count": 2,
    "update_total": 2,
    "recall_reasoning": "4 of 5 gold points are covered...",
    "correctness_rate": 0.75,
    "num_memories": 8,
    "num_correct": 6,
    "num_hallucination": 1,
    "num_irrelevant": 1,
    "correctness_reasoning": "...",
    "correctness_records": [
      {"id": 1, "label": "correct"},
      {"id": 2, "label": "hallucination"}
    ],
    "update_score": 1.0,
    "update_num_updated": 2,
    "update_num_both": 0,
    "update_num_outdated": 0,
    "update_total_items": 2,
    "update_records": [
      {"memory_id": "mp_S08_3", "label": "updated", "reasoning": "..."}
    ],
    "interference_score": null,
    "interference_num_rejected": 0,
    "interference_num_memorized": 0,
    "interference_total_items": 0,
    "interference_records": []
  }
}
```
`eval` field reference:

| Field | Meaning |
|---|---|
| `recall` | Fraction of this session's gold points covered by the delta (0-1) |
| `update_recall` | Coverage of the update-type gold points |
| `correctness_rate` | Fraction of correct memories in the delta |
| `num_hallucination` | Number of hallucinated memories in the delta |
| `num_irrelevant` | Number of irrelevant memories in the delta |
| `update_score` | Update-handling score (updated=1.0, both=0.5, outdated=0.0) |
| `interference_score` | Interference-rejection score (rejected=1.0, memorized=0.0) |
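The `update_score` is a weighted average over the per-memory labels in `update_records`, using the weights from the table above (updated=1.0, both=0.5, outdated=0.0). A minimal sketch of that aggregation, assuming the record shape shown in the JSON example (the helper name is my own):

```python
# Weights per update label, as listed in the field reference above.
UPDATE_WEIGHTS = {"updated": 1.0, "both": 0.5, "outdated": 0.0}

def update_score(update_records):
    """Mean label weight over update_records, or None when there is nothing to score."""
    if not update_records:
        return None  # mirrors interference_score being null with zero items
    total = sum(UPDATE_WEIGHTS[r["label"]] for r in update_records)
    return total / len(update_records)

records = [
    {"memory_id": "mp_S08_3", "label": "updated"},
    {"memory_id": "mp_S08_4", "label": "both"},
]
print(update_score(records))  # 0.75
```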
### 2. qa_records.jsonl

One QA question per line, containing the retrieval results, the model's answer, and the judgments:
```json
{
  "sample_id": "vab_minecraft_...",
  "sample_uuid": "uuid-...",
  "checkpoint_id": "probe_e980c238",
  "question": "What was in the agent's inventory at step 1?",
  "gold_answer": "At step 1, the agent's inventory was empty.",
  "gold_evidence_memory_ids": ["mp_S04_1"],
  "gold_evidence_contents": ["The agent started with empty inventory"],
  "question_type": "factual_recall",
  "question_type_abbrev": "FR",
  "difficulty": "easy",
  "retrieval": {
    "query": "What was in the agent's inventory at step 1?",
    "top_k": 5,
    "items": [
      {
        "rank": 0,
        "memory_id": "memgallery:string_bundle",
        "text": "user: OBSERVATION: Your Inventory: ...",
        "score": 1.0,
        "raw_backend_id": null
      }
    ],
    "raw_trace": {"baseline": "FUMemory"}
  },
  "generated_answer": "The agent's inventory was empty at step 1.",
  "cited_memories": ["user: OBSERVATION: Inventory: nothing"],
  "eval": {
    "answer_label": "Correct",
    "answer_reasoning": "The response matches the reference answer...",
    "answer_is_valid": true,
    "evidence_hit_rate": 1.0,
    "evidence_covered_count": 1,
    "num_evidence": 1,
    "evidence_reasoning": "The cited memory covers the gold evidence...",
    "num_cited_memories": 1
  }
}
```
`eval` field reference:

| Field | Meaning |
|---|---|
| `answer_label` | Correct / Hallucination / Omission |
| `answer_is_valid` | Whether the judgment succeeded (i.e. no LLM error) |
| `evidence_hit_rate` | Fraction of the gold evidence covered by the cited memories (0-1) |
| `evidence_covered_count` | Number of gold evidence items covered |
| `num_cited_memories` | Number of memories the model cited in its answer |
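These per-question fields can be aggregated along any of the record's attributes, e.g. mean `evidence_hit_rate` per difficulty level. A sketch under the record shape shown above (the helper name and results path are illustrative):

```python
import json
from collections import defaultdict

def hit_rate_by_difficulty(records):
    """Mean evidence_hit_rate per difficulty level."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["difficulty"]].append(rec["eval"]["evidence_hit_rate"])
    return {d: sum(v) / len(v) for d, v in buckets.items()}

# Usage against a run's qa_records.jsonl (path is illustrative):
# with open("results/FUMemory/qa_records.jsonl") as f:
#     print(hit_rate_by_difficulty(json.loads(line) for line in f))
```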
### 3. aggregate_metrics.json

Baseline-level aggregate metrics across 6 dimensions:
```json
{
  "baseline_id": "FUMemory",
  "memory_recall": {
    "avg_recall": 0.72,
    "avg_update_recall": 0.65,
    "num_sessions_with_recall": 110,
    "num_sessions_with_update": 85,
    "total_covered": 320,
    "total_gold": 445
  },
  "memory_correctness": {
    "avg_correctness": 0.81,
    "avg_hallucination": 0.08,
    "avg_irrelevant": 0.11,
    "num_sessions": 110,
    "total_memories": 1200,
    "total_correct": 972,
    "total_hallucination": 96,
    "total_irrelevant": 132
  },
  "update_handling": {
    "score": 0.65,
    "num_updated": 52,
    "num_both": 18,
    "num_outdated": 15,
    "num_total": 85
  },
  "interference_rejection": {
    "score": 0.0,
    "num_rejected": 0,
    "num_memorized": 0,
    "num_total": 0
  },
  "question_answering": {
    "correct_ratio": 0.58,
    "hallucination_ratio": 0.22,
    "omission_ratio": 0.20,
    "num_total": 990,
    "num_valid": 990
  },
  "evidence_coverage": {
    "hit_rate": 0.43,
    "num_covered": 425,
    "num_total": 990
  }
}
```
The 6 dimensions:

| Dimension | Aggregation | Core metrics | Direction |
|---|---|---|---|
| Memory Recall | averaged per session | avg_recall | ↑ |
| Memory Correctness | averaged per session | avg_correctness, avg_hallucination | ↑, ↓ |
| Update Handling | pooled across sessions | score | ↑ |
| Interference Rejection | pooled across sessions | score | ↑ |
| Question Answering | pooled across questions | correct_ratio, hallucination_ratio | ↑, ↓ |
| Evidence Coverage | pooled across questions | hit_rate | ↑ |
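Because each baseline writes one aggregate_metrics.json, comparing runs is a matter of reading each file and printing the headline metric per dimension. A sketch, assuming each baseline wrote to its own `--output-dir` (the helper name and directory layout are illustrative):

```python
import json
from pathlib import Path

def headline(agg):
    """One-line summary of the headline metrics from an aggregate_metrics.json dict."""
    return (
        f"{agg['baseline_id']}: "
        f"recall={agg['memory_recall']['avg_recall']:.2f}, "
        f"QA correct={agg['question_answering']['correct_ratio']:.2f}, "
        f"evidence hit={agg['evidence_coverage']['hit_rate']:.2f}"
    )

# Usage: one row per baseline run (directory names are illustrative).
for run_dir in ["results/FUMemory"]:
    path = Path(run_dir) / "aggregate_metrics.json"
    if path.exists():
        print(headline(json.loads(path.read_text())))
```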
### 4. pipeline_sessions.jsonl / pipeline_qa.jsonl

Stage 1 checkpoint files. Their structure matches session_records.jsonl / qa_records.jsonl, minus the `eval` field.

Purpose: `--eval-only` mode skips the pipeline, resumes from these checkpoints, and reruns only the eval stage. A typical scenario:
```bash
# First full run
python -m eval_framework.cli --dataset ... --baseline FUMemory --output-dir results/FU

# Re-judge with a different judge model (without rerunning the pipeline)
OPENAI_MODEL=gpt-4o-mini python -m eval_framework.cli \
    --dataset ... --baseline FUMemory --output-dir results/FU --eval-only
```
## Result Analysis Example

```python
import json

# Load the aggregate metrics
with open("results/FUMemory/aggregate_metrics.json") as f:
    agg = json.load(f)
print(f"Recall: {agg['memory_recall']['avg_recall']:.2%}")
print(f"QA Correct: {agg['question_answering']['correct_ratio']:.2%}")

# Accuracy broken down by QA type
qa_by_type = {}
with open("results/FUMemory/qa_records.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        qt = rec["question_type_abbrev"]
        label = rec["eval"]["answer_label"]
        qa_by_type.setdefault(qt, []).append(label)

for qt, labels in sorted(qa_by_type.items()):
    correct = sum(1 for l in labels if l == "Correct")
    print(f"  {qt}: {correct}/{len(labels)} = {correct/len(labels):.0%}")
```