
# Eval Framework Output Format

## Output Directory Structure

After a run completes, `--output-dir` contains 5 files:

```
output-dir/
├── pipeline_sessions.jsonl     # Stage 1 checkpoint: intermediate pipeline results (session level)
├── pipeline_qa.jsonl           # Stage 1 checkpoint: intermediate pipeline results (QA level)
├── session_records.jsonl       # Final results: session pipeline data + eval judgments
├── qa_records.jsonl            # Final results: QA pipeline data + eval judgments
└── aggregate_metrics.json      # Final results: baseline-level aggregate metrics
```
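A quick sanity check that a run finished writing all five files can be sketched like this (`missing_outputs` is a hypothetical helper for illustration, not part of the framework):

```python
from pathlib import Path

# The five files from the directory tree above.
EXPECTED_FILES = [
    "pipeline_sessions.jsonl",
    "pipeline_qa.jsonl",
    "session_records.jsonl",
    "qa_records.jsonl",
    "aggregate_metrics.json",
]

def missing_outputs(output_dir: str) -> list[str]:
    """Return the expected output files that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_FILES if not (root / name).exists()]
```

An empty return value means all outputs are present.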

## File Details

### 1. session_records.jsonl

One session per line, containing the raw pipeline data and the eval judgment results:

```json
{
  "sample_id": "vab_minecraft_...",
  "sample_uuid": "uuid-...",
  "session_id": "S01",
  "memory_snapshot": [
    {
      "memory_id": "3",
      "text": "user: OBSERVATION: ...\nassistant: THOUGHT: ...",
      "session_id": "S01",
      "status": "active",
      "source": "FUMemory",
      "raw_backend_id": "3",
      "raw_backend_type": "linear",
      "metadata": {}
    }
  ],
  "memory_delta": [
    {
      "session_id": "S01",
      "op": "add",
      "text": "user: OBSERVATION: ...",
      "linked_previous": [],
      "raw_backend_id": "3",
      "metadata": {"baseline": "FUMemory"}
    }
  ],
  "gold_state": {
    "session_id": "S01",
    "cumulative_gold_memories": [...],
    "session_new_memories": [...],
    "session_update_memories": [...],
    "session_interference_memories": []
  },
  "eval": {
    "session_id": "S01",
    "recall": 0.8,
    "covered_count": 4,
    "num_gold": 5,
    "update_recall": 1.0,
    "update_covered_count": 2,
    "update_total": 2,
    "recall_reasoning": "4 of 5 gold points are covered...",
    "correctness_rate": 0.75,
    "num_memories": 8,
    "num_correct": 6,
    "num_hallucination": 1,
    "num_irrelevant": 1,
    "correctness_reasoning": "...",
    "correctness_records": [
      {"id": 1, "label": "correct"},
      {"id": 2, "label": "hallucination"}
    ],
    "update_score": 1.0,
    "update_num_updated": 2,
    "update_num_both": 0,
    "update_num_outdated": 0,
    "update_total_items": 2,
    "update_records": [
      {"memory_id": "mp_S08_3", "label": "updated", "reasoning": "..."}
    ],
    "interference_score": null,
    "interference_num_rejected": 0,
    "interference_num_memorized": 0,
    "interference_total_items": 0,
    "interference_records": []
  }
}
```

Field descriptions for `eval`:

| Field | Meaning |
|---|---|
| `recall` | Fraction of this session's gold points covered by the delta (0-1) |
| `update_recall` | Coverage rate of update-type gold points |
| `correctness_rate` | Fraction of memories in the delta that are correct |
| `num_hallucination` | Number of hallucinated memories in the delta |
| `num_irrelevant` | Number of irrelevant memories in the delta |
| `update_score` | Update-handling score (updated=1.0, both=0.5, outdated=0.0) |
| `interference_score` | Interference-rejection score (rejected=1.0, memorized=0.0) |
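As a concrete reading of the `update_score` rule above, a minimal scorer over `update_records` labels might look like this (a sketch; `score_updates` is a hypothetical helper, not the framework's API):

```python
# Label weights from the update_score rule: updated=1.0, both=0.5, outdated=0.0
UPDATE_WEIGHTS = {"updated": 1.0, "both": 0.5, "outdated": 0.0}

def score_updates(update_records):
    """Mean label weight over update records; None when there is nothing to score."""
    if not update_records:
        return None  # mirrors the null scores seen when total_items is 0
    total = sum(UPDATE_WEIGHTS[r["label"]] for r in update_records)
    return total / len(update_records)

records = [
    {"memory_id": "m1", "label": "updated"},
    {"memory_id": "m2", "label": "both"},
]
print(score_updates(records))  # (1.0 + 0.5) / 2 = 0.75
```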

### 2. qa_records.jsonl

One QA question per line, containing the retrieval results, the model's answer, and the judgment:

```json
{
  "sample_id": "vab_minecraft_...",
  "sample_uuid": "uuid-...",
  "checkpoint_id": "probe_e980c238",
  "question": "What was in the agent's inventory at step 1?",
  "gold_answer": "At step 1, the agent's inventory was empty.",
  "gold_evidence_memory_ids": ["mp_S04_1"],
  "gold_evidence_contents": ["The agent started with empty inventory"],
  "question_type": "factual_recall",
  "question_type_abbrev": "FR",
  "difficulty": "easy",
  "retrieval": {
    "query": "What was in the agent's inventory at step 1?",
    "top_k": 5,
    "items": [
      {
        "rank": 0,
        "memory_id": "memgallery:string_bundle",
        "text": "user: OBSERVATION: Your Inventory: ...",
        "score": 1.0,
        "raw_backend_id": null
      }
    ],
    "raw_trace": {"baseline": "FUMemory"}
  },
  "generated_answer": "The agent's inventory was empty at step 1.",
  "cited_memories": ["user: OBSERVATION: Inventory: nothing"],
  "eval": {
    "answer_label": "Correct",
    "answer_reasoning": "The response matches the reference answer...",
    "answer_is_valid": true,
    "evidence_hit_rate": 1.0,
    "evidence_covered_count": 1,
    "num_evidence": 1,
    "evidence_reasoning": "The cited memory covers the gold evidence...",
    "num_cited_memories": 1
  }
}
```

Field descriptions for `eval`:

| Field | Meaning |
|---|---|
| `answer_label` | Correct / Hallucination / Omission |
| `answer_is_valid` | Whether judging succeeded (i.e. no LLM error) |
| `evidence_hit_rate` | How much of the gold evidence the cited memories cover (0-1) |
| `evidence_covered_count` | Number of gold evidence items covered |
| `num_cited_memories` | Number of memories the model cited in its answer |
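Given the fields above, the answer labels can be tallied into ratios in a few lines (a sketch; the helper name is an assumption, and invalid judgments are excluded, matching how `num_valid` is reported separately):

```python
import json
from collections import Counter

def qa_label_ratios(path):
    """Ratio of Correct / Hallucination / Omission labels over valid judgments."""
    counts, num_valid = Counter(), 0
    with open(path) as f:
        for line in f:
            ev = json.loads(line)["eval"]
            if not ev["answer_is_valid"]:
                continue  # skip records where the judge itself failed
            num_valid += 1
            counts[ev["answer_label"]] += 1
    return {label: n / num_valid for label, n in counts.items()} if num_valid else {}
```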

### 3. aggregate_metrics.json

Baseline-level summary metrics across 6 dimensions:

```json
{
  "baseline_id": "FUMemory",
  "memory_recall": {
    "avg_recall": 0.72,
    "avg_update_recall": 0.65,
    "num_sessions_with_recall": 110,
    "num_sessions_with_update": 85,
    "total_covered": 320,
    "total_gold": 445
  },
  "memory_correctness": {
    "avg_correctness": 0.81,
    "avg_hallucination": 0.08,
    "avg_irrelevant": 0.11,
    "num_sessions": 110,
    "total_memories": 1200,
    "total_correct": 972,
    "total_hallucination": 96,
    "total_irrelevant": 132
  },
  "update_handling": {
    "score": 0.65,
    "num_updated": 52,
    "num_both": 18,
    "num_outdated": 15,
    "num_total": 85
  },
  "interference_rejection": {
    "score": 0.0,
    "num_rejected": 0,
    "num_memorized": 0,
    "num_total": 0
  },
  "question_answering": {
    "correct_ratio": 0.58,
    "hallucination_ratio": 0.22,
    "omission_ratio": 0.20,
    "num_total": 990,
    "num_valid": 990
  },
  "evidence_coverage": {
    "hit_rate": 0.43,
    "num_covered": 425,
    "num_total": 990
  }
}
```

The 6 dimensions:

| Dimension | Aggregation | Core metrics | Direction |
|---|---|---|---|
| Memory Recall | Averaged per session | `avg_recall` | ↑ |
| Memory Correctness | Averaged per session | `avg_correctness`, `avg_hallucination` | ↑, ↓ |
| Update Handling | Pooled across sessions | `score` | ↑ |
| Interference Rejection | Pooled across sessions | `score` | ↑ |
| Question Answering | Pooled across questions | `correct_ratio`, `hallucination_ratio` | ↑, ↓ |
| Evidence Coverage | Pooled across questions | `hit_rate` | ↑ |
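The two aggregation modes differ in what they weight: per-session averaging counts every session equally, while pooling counts every item equally. A small sketch with made-up numbers, using recall (`avg_recall` vs. `total_covered / total_gold` in `memory_recall`):

```python
# Hypothetical per-session eval results (covered_count out of num_gold)
sessions = [
    {"covered_count": 4, "num_gold": 5},   # session recall 0.8
    {"covered_count": 1, "num_gold": 10},  # session recall 0.1
]

# Averaged per session: each session counts once
# (0.8 + 0.1) / 2 -> 0.45
avg_recall = sum(s["covered_count"] / s["num_gold"] for s in sessions) / len(sessions)

# Pooled: each gold point counts once, i.e. total_covered / total_gold
# 5 / 15 -> 0.33..., dominated by the session with more gold points
pooled_recall = sum(s["covered_count"] for s in sessions) / sum(s["num_gold"] for s in sessions)
```

Both views can be recomputed from the output: `memory_recall` reports `avg_recall` alongside `total_covered` and `total_gold`.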

### 4. pipeline_sessions.jsonl / pipeline_qa.jsonl

Checkpoint files from Stage 1. Their structure matches session_records.jsonl / qa_records.jsonl, minus the `eval` field.

Purpose: `--eval-only` mode skips the pipeline, restores state directly from these checkpoints, and reruns only the eval stage. Typical scenario:

```bash
# First full run
python -m eval_framework.cli --dataset ... --baseline FUMemory --output-dir results/FU

# Re-judge with a different judge model (without rerunning the pipeline)
OPENAI_MODEL=gpt-4o-mini python -m eval_framework.cli \
    --dataset ... --baseline FUMemory --output-dir results/FU --eval-only
```

## Example: Analyzing the Results

```python
import json

# Load the aggregate metrics
with open("results/FUMemory/aggregate_metrics.json") as f:
    agg = json.load(f)
print(f"Recall: {agg['memory_recall']['avg_recall']:.2%}")
print(f"QA Correct: {agg['question_answering']['correct_ratio']:.2%}")

# Break down QA accuracy by question type
qa_by_type = {}
with open("results/FUMemory/qa_records.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        qt = rec["question_type_abbrev"]
        label = rec["eval"]["answer_label"]
        qa_by_type.setdefault(qt, []).append(label)

for qt, labels in sorted(qa_by_type.items()):
    correct = sum(1 for lbl in labels if lbl == "Correct")
    print(f"  {qt}: {correct}/{len(labels)} = {correct/len(labels):.0%}")
```