
# Eval Framework Output Format

## Output Directory Structure

After a run completes, `--output-dir` contains 5 files:

```
output-dir/
├── pipeline_sessions.jsonl     # Stage 1 checkpoint: intermediate pipeline results (session level)
├── pipeline_qa.jsonl           # Stage 1 checkpoint: intermediate pipeline results (QA level)
├── session_records.jsonl       # Final results: session pipeline data + eval judgments
├── qa_records.jsonl            # Final results: QA pipeline data + eval judgments
└── aggregate_metrics.json      # Final results: baseline-level aggregate metrics
```
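A quick sanity check that a run finished writing all five files can be sketched like this (`missing_outputs` is a hypothetical helper for illustration, not part of the framework):

```python
from pathlib import Path

# The five files from the directory tree above.
EXPECTED_FILES = [
    "pipeline_sessions.jsonl",
    "pipeline_qa.jsonl",
    "session_records.jsonl",
    "qa_records.jsonl",
    "aggregate_metrics.json",
]

def missing_outputs(output_dir: str) -> list[str]:
    """Return the expected output files that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_FILES if not (root / name).exists()]
```

An empty return value means all outputs are present.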

## File Details

### 1. session_records.jsonl

One session per line, containing the raw pipeline data and the eval judgment results:

```json
{
  "sample_id": "vab_minecraft_...",
  "sample_uuid": "uuid-...",
  "session_id": "S01",
  "memory_snapshot": [
    {
      "memory_id": "3",
      "text": "user: OBSERVATION: ...\nassistant: THOUGHT: ...",
      "session_id": "S01",
      "status": "active",
      "source": "FUMemory",
      "raw_backend_id": "3",
      "raw_backend_type": "linear",
      "metadata": {}
    }
  ],
  "memory_delta": [
    {
      "session_id": "S01",
      "op": "add",
      "text": "user: OBSERVATION: ...",
      "linked_previous": [],
      "raw_backend_id": "3",
      "metadata": {"baseline": "FUMemory"}
    }
  ],
  "gold_state": {
    "session_id": "S01",
    "cumulative_gold_memories": [...],
    "session_new_memories": [...],
    "session_update_memories": [...],
    "session_interference_memories": []
  },
  "eval": {
    "session_id": "S01",
    "recall": 0.8,
    "covered_count": 4,
    "num_gold": 5,
    "update_recall": 1.0,
    "update_covered_count": 2,
    "update_total": 2,
    "recall_reasoning": "4 of 5 gold points are covered...",
    "correctness_rate": 0.75,
    "num_memories": 8,
    "num_correct": 6,
    "num_hallucination": 1,
    "num_irrelevant": 1,
    "correctness_reasoning": "...",
    "correctness_records": [
      {"id": 1, "label": "correct"},
      {"id": 2, "label": "hallucination"}
    ],
    "update_score": 1.0,
    "update_num_updated": 2,
    "update_num_both": 0,
    "update_num_outdated": 0,
    "update_total_items": 2,
    "update_records": [
      {"memory_id": "mp_S08_3", "label": "updated", "reasoning": "..."}
    ],
    "interference_score": null,
    "interference_num_rejected": 0,
    "interference_num_memorized": 0,
    "interference_total_items": 0,
    "interference_records": []
  }
}
```

Field descriptions for `eval`:

| Field | Meaning |
|---|---|
| `recall` | Fraction of this session's gold points covered by the delta (0-1) |
| `update_recall` | Coverage rate of update-type gold points |
| `correctness_rate` | Fraction of memories in the delta that are correct |
| `num_hallucination` | Number of hallucinated memories in the delta |
| `num_irrelevant` | Number of irrelevant memories in the delta |
| `update_score` | Update-handling score (updated=1.0, both=0.5, outdated=0.0) |
| `interference_score` | Interference-rejection score (rejected=1.0, memorized=0.0) |
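As a concrete reading of the `update_score` rule above, a minimal scorer over `update_records` labels might look like this (a sketch; `score_updates` is a hypothetical helper, not the framework's API):

```python
# Label weights from the update_score rule: updated=1.0, both=0.5, outdated=0.0
UPDATE_WEIGHTS = {"updated": 1.0, "both": 0.5, "outdated": 0.0}

def score_updates(update_records):
    """Mean label weight over update records; None when there is nothing to score."""
    if not update_records:
        return None  # mirrors the null scores seen when total_items is 0
    total = sum(UPDATE_WEIGHTS[r["label"]] for r in update_records)
    return total / len(update_records)

records = [
    {"memory_id": "m1", "label": "updated"},
    {"memory_id": "m2", "label": "both"},
]
print(score_updates(records))  # (1.0 + 0.5) / 2 = 0.75
```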

### 2. qa_records.jsonl

One QA question per line, containing the retrieval results, the model's answer, and the judgment:

```json
{
  "sample_id": "vab_minecraft_...",
  "sample_uuid": "uuid-...",
  "checkpoint_id": "probe_e980c238",
  "question": "What was in the agent's inventory at step 1?",
  "gold_answer": "At step 1, the agent's inventory was empty.",
  "gold_evidence_memory_ids": ["mp_S04_1"],
  "gold_evidence_contents": ["The agent started with empty inventory"],
  "question_type": "factual_recall",
  "question_type_abbrev": "FR",
  "difficulty": "easy",
  "retrieval": {
    "query": "What was in the agent's inventory at step 1?",
    "top_k": 5,
    "items": [
      {
        "rank": 0,
        "memory_id": "memgallery:string_bundle",
        "text": "user: OBSERVATION: Your Inventory: ...",
        "score": 1.0,
        "raw_backend_id": null
      }
    ],
    "raw_trace": {"baseline": "FUMemory"}
  },
  "generated_answer": "The agent's inventory was empty at step 1.",
  "cited_memories": ["user: OBSERVATION: Inventory: nothing"],
  "eval": {
    "answer_label": "Correct",
    "answer_reasoning": "The response matches the reference answer...",
    "answer_is_valid": true,
    "evidence_hit_rate": 1.0,
    "evidence_covered_count": 1,
    "num_evidence": 1,
    "evidence_reasoning": "The cited memory covers the gold evidence...",
    "num_cited_memories": 1
  }
}
```

Field descriptions for `eval`:

| Field | Meaning |
|---|---|
| `answer_label` | Correct / Hallucination / Omission |
| `answer_is_valid` | Whether judging succeeded (i.e. no LLM error) |
| `evidence_hit_rate` | How much of the gold evidence the cited memories cover (0-1) |
| `evidence_covered_count` | Number of gold evidence items covered |
| `num_cited_memories` | Number of memories the model cited in its answer |
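Given the fields above, the answer labels can be tallied into ratios in a few lines (a sketch; the helper name is an assumption, and invalid judgments are excluded, matching how `num_valid` is reported separately):

```python
import json
from collections import Counter

def qa_label_ratios(path):
    """Ratio of Correct / Hallucination / Omission labels over valid judgments."""
    counts, num_valid = Counter(), 0
    with open(path) as f:
        for line in f:
            ev = json.loads(line)["eval"]
            if not ev["answer_is_valid"]:
                continue  # skip records where the judge itself failed
            num_valid += 1
            counts[ev["answer_label"]] += 1
    return {label: n / num_valid for label, n in counts.items()} if num_valid else {}
```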

### 3. aggregate_metrics.json

Baseline-level summary metrics across 6 dimensions:

```json
{
  "baseline_id": "FUMemory",
  "memory_recall": {
    "avg_recall": 0.72,
    "avg_update_recall": 0.65,
    "num_sessions_with_recall": 110,
    "num_sessions_with_update": 85,
    "total_covered": 320,
    "total_gold": 445
  },
  "memory_correctness": {
    "avg_correctness": 0.81,
    "avg_hallucination": 0.08,
    "avg_irrelevant": 0.11,
    "num_sessions": 110,
    "total_memories": 1200,
    "total_correct": 972,
    "total_hallucination": 96,
    "total_irrelevant": 132
  },
  "update_handling": {
    "score": 0.65,
    "num_updated": 52,
    "num_both": 18,
    "num_outdated": 15,
    "num_total": 85
  },
  "interference_rejection": {
    "score": 0.0,
    "num_rejected": 0,
    "num_memorized": 0,
    "num_total": 0
  },
  "question_answering": {
    "correct_ratio": 0.58,
    "hallucination_ratio": 0.22,
    "omission_ratio": 0.20,
    "num_total": 990,
    "num_valid": 990
  },
  "evidence_coverage": {
    "hit_rate": 0.43,
    "num_covered": 425,
    "num_total": 990
  }
}
```

The 6 dimensions:

| Dimension | Aggregation | Core metrics | Direction |
|---|---|---|---|
| Memory Recall | Averaged per session | `avg_recall` | ↑ |
| Memory Correctness | Averaged per session | `avg_correctness`, `avg_hallucination` | ↑, ↓ |
| Update Handling | Pooled across sessions | `score` | ↑ |
| Interference Rejection | Pooled across sessions | `score` | ↑ |
| Question Answering | Pooled across questions | `correct_ratio`, `hallucination_ratio` | ↑, ↓ |
| Evidence Coverage | Pooled across questions | `hit_rate` | ↑ |
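The two aggregation modes differ in what they weight: per-session averaging counts every session equally, while pooling counts every item equally. A small sketch with made-up numbers, using recall (`avg_recall` vs. `total_covered / total_gold` in `memory_recall`):

```python
# Hypothetical per-session eval results (covered_count out of num_gold)
sessions = [
    {"covered_count": 4, "num_gold": 5},   # session recall 0.8
    {"covered_count": 1, "num_gold": 10},  # session recall 0.1
]

# Averaged per session: each session counts once
# (0.8 + 0.1) / 2 -> 0.45
avg_recall = sum(s["covered_count"] / s["num_gold"] for s in sessions) / len(sessions)

# Pooled: each gold point counts once, i.e. total_covered / total_gold
# 5 / 15 -> 0.33..., dominated by the session with more gold points
pooled_recall = sum(s["covered_count"] for s in sessions) / sum(s["num_gold"] for s in sessions)
```

Both views can be recomputed from the output: `memory_recall` reports `avg_recall` alongside `total_covered` and `total_gold`.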

### 4. pipeline_sessions.jsonl / pipeline_qa.jsonl

Checkpoint files from Stage 1. Their structure matches session_records.jsonl / qa_records.jsonl, minus the `eval` field.

Purpose: `--eval-only` mode skips the pipeline, restores state directly from these checkpoints, and reruns only the eval stage. Typical scenario:

```bash
# First full run
python -m eval_framework.cli --dataset ... --baseline FUMemory --output-dir results/FU

# Re-judge with a different judge model (without rerunning the pipeline)
OPENAI_MODEL=gpt-4o-mini python -m eval_framework.cli \
    --dataset ... --baseline FUMemory --output-dir results/FU --eval-only
```

## Example: Analyzing the Results

```python
import json

# Load the aggregate metrics
with open("results/FUMemory/aggregate_metrics.json") as f:
    agg = json.load(f)
print(f"Recall: {agg['memory_recall']['avg_recall']:.2%}")
print(f"QA Correct: {agg['question_answering']['correct_ratio']:.2%}")

# Break down QA accuracy by question type
qa_by_type = {}
with open("results/FUMemory/qa_records.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        qt = rec["question_type_abbrev"]
        label = rec["eval"]["answer_label"]
        qa_by_type.setdefault(qt, []).append(label)

for qt, labels in sorted(qa_by_type.items()):
    correct = sum(1 for lbl in labels if lbl == "Correct")
    print(f"  {qt}: {correct}/{len(labels)} = {correct/len(labels):.0%}")
```