# Eval Framework Output Format

## Output Directory Structure

After a run completes, `--output-dir` contains 5 files:

```
output-dir/
├── pipeline_sessions.jsonl   # Stage 1 checkpoint: intermediate pipeline results (session level)
├── pipeline_qa.jsonl         # Stage 1 checkpoint: intermediate pipeline results (QA level)
├── session_records.jsonl     # Final result: session pipeline data + eval judgments
├── qa_records.jsonl          # Final result: QA pipeline data + eval judgments
└── aggregate_metrics.json    # Final result: baseline-level aggregate metrics
```

## File Details

### 1. `session_records.jsonl`

One session per line, containing the raw pipeline data and the `eval` judgment results:

```json
{
  "sample_id": "vab_minecraft_...",
  "sample_uuid": "uuid-...",
  "session_id": "S01",
  "memory_snapshot": [
    {
      "memory_id": "3",
      "text": "user: OBSERVATION: ...\nassistant: THOUGHT: ...",
      "session_id": "S01",
      "status": "active",
      "source": "FUMemory",
      "raw_backend_id": "3",
      "raw_backend_type": "linear",
      "metadata": {}
    }
  ],
  "memory_delta": [
    {
      "session_id": "S01",
      "op": "add",
      "text": "user: OBSERVATION: ...",
      "linked_previous": [],
      "raw_backend_id": "3",
      "metadata": {"baseline": "FUMemory"}
    }
  ],
  "gold_state": {
    "session_id": "S01",
    "cumulative_gold_memories": [...],
    "session_new_memories": [...],
    "session_update_memories": [...],
    "session_interference_memories": []
  },
  "eval": {
    "session_id": "S01",
    "recall": 0.8,
    "covered_count": 4,
    "num_gold": 5,
    "update_recall": 1.0,
    "update_covered_count": 2,
    "update_total": 2,
    "recall_reasoning": "4 of 5 gold points are covered...",
    "correctness_rate": 0.75,
    "num_memories": 8,
    "num_correct": 6,
    "num_hallucination": 1,
    "num_irrelevant": 1,
    "correctness_reasoning": "...",
    "correctness_records": [
      {"id": 1, "label": "correct"},
      {"id": 2, "label": "hallucination"}
    ],
    "update_score": 1.0,
    "update_num_updated": 2,
    "update_num_both": 0,
    "update_num_outdated": 0,
    "update_total_items": 2,
    "update_records": [
      {"memory_id": "mp_S08_3", "label": "updated", "reasoning": "..."}
    ],
    "interference_score": null,
    "interference_num_rejected": 0,
    "interference_num_memorized": 0,
    "interference_total_items": 0,
    "interference_records": []
  }
}
```

**`eval` fields:**

| Field | Meaning |
|------|------|
| `recall` | Fraction of this session's gold points covered by the delta (0-1) |
| `update_recall` | Coverage fraction for gold points of the update type |
| `correctness_rate` | Fraction of memories in the delta that are correct |
| `num_hallucination` | Number of hallucinated memories in the delta |
| `num_irrelevant` | Number of irrelevant memories in the delta |
| `update_score` | Update-handling score (updated=1.0, both=0.5, outdated=0.0) |
| `interference_score` | Interference-rejection score (rejected=1.0, memorized=0.0) |

### 2. `qa_records.jsonl`

One QA question per line, containing the retrieval results, the model's answer, and the judgment:

```json
{
  "sample_id": "vab_minecraft_...",
  "sample_uuid": "uuid-...",
  "checkpoint_id": "probe_e980c238",
  "question": "What was in the agent's inventory at step 1?",
  "gold_answer": "At step 1, the agent's inventory was empty.",
  "gold_evidence_memory_ids": ["mp_S04_1"],
  "gold_evidence_contents": ["The agent started with empty inventory"],
  "question_type": "factual_recall",
  "question_type_abbrev": "FR",
  "difficulty": "easy",
  "retrieval": {
    "query": "What was in the agent's inventory at step 1?",
    "top_k": 5,
    "items": [
      {
        "rank": 0,
        "memory_id": "memgallery:string_bundle",
        "text": "user: OBSERVATION: Your Inventory: ...",
        "score": 1.0,
        "raw_backend_id": null
      }
    ],
    "raw_trace": {"baseline": "FUMemory"}
  },
  "generated_answer": "The agent's inventory was empty at step 1.",
  "cited_memories": ["user: OBSERVATION: Inventory: nothing"],
  "eval": {
    "answer_label": "Correct",
    "answer_reasoning": "The response matches the reference answer...",
    "answer_is_valid": true,
    "evidence_hit_rate": 1.0,
    "evidence_covered_count": 1,
    "num_evidence": 1,
    "evidence_reasoning": "The cited memory covers the gold evidence...",
    "num_cited_memories": 1
  }
}
```

**`eval` fields:**

| Field | Meaning |
|------|------|
| `answer_label` | `Correct` / `Hallucination` / `Omission` |
| `answer_is_valid` | Whether the judgment succeeded (i.e. the judge LLM did not error out) |
| `evidence_hit_rate` | Fraction of the gold evidence covered by the cited memories (0-1) |
| `evidence_covered_count` | Number of gold evidence items covered |
| `num_cited_memories` | Number of memories the model cited in its answer |

### 3. `aggregate_metrics.json`

Baseline-level aggregate metrics across 6 dimensions:

```json
{
  "baseline_id": "FUMemory",
  "memory_recall": {
    "avg_recall": 0.72,
    "avg_update_recall": 0.65,
    "num_sessions_with_recall": 110,
    "num_sessions_with_update": 85,
    "total_covered": 320,
    "total_gold": 445
  },
  "memory_correctness": {
    "avg_correctness": 0.81,
    "avg_hallucination": 0.08,
    "avg_irrelevant": 0.11,
    "num_sessions": 110,
    "total_memories": 1200,
    "total_correct": 972,
    "total_hallucination": 96,
    "total_irrelevant": 132
  },
  "update_handling": {
    "score": 0.65,
    "num_updated": 52,
    "num_both": 18,
    "num_outdated": 15,
    "num_total": 85
  },
  "interference_rejection": {
    "score": 0.0,
    "num_rejected": 0,
    "num_memorized": 0,
    "num_total": 0
  },
  "question_answering": {
    "correct_ratio": 0.58,
    "hallucination_ratio": 0.22,
    "omission_ratio": 0.20,
    "num_total": 990,
    "num_valid": 990
  },
  "evidence_coverage": {
    "hit_rate": 0.43,
    "num_covered": 425,
    "num_total": 990
  }
}
```

**The 6 dimensions:**

| Dimension | Aggregation | Key metrics | Direction |
|------|---------|---------|------|
| Memory Recall | Averaged per session | `avg_recall` | ↑ |
| Memory Correctness | Averaged per session | `avg_correctness`, `avg_hallucination` | ↑, ↓ |
| Update Handling | Pooled across sessions | `score` | ↑ |
| Interference Rejection | Pooled across sessions | `score` | ↑ |
| Question Answering | Pooled across questions | `correct_ratio`, `hallucination_ratio` | ↑, ↓ |
| Evidence Coverage | Pooled across questions | `hit_rate` | ↑ |

### 4. `pipeline_sessions.jsonl` / `pipeline_qa.jsonl`

These are the Stage 1 checkpoint files. Their structure is the same as `session_records.jsonl` / `qa_records.jsonl`, except that they **do not contain the `eval` field**.

Purpose: in `--eval-only` mode the framework skips the pipeline, restores directly from the checkpoints, and re-runs only the eval stage. A typical scenario:

```bash
# First full run
python -m eval_framework.cli --dataset ... --baseline FUMemory --output-dir results/FU

# Re-evaluate with a different judge model (without re-running the pipeline)
OPENAI_MODEL=gpt-4o-mini python -m eval_framework.cli \
    --dataset ... --baseline FUMemory --output-dir results/FU --eval-only
```

## Result Analysis Example

```python
import json

# Load the aggregate metrics
with open("results/FU/aggregate_metrics.json", encoding="utf-8") as f:
    agg = json.load(f)
print(f"Recall: {agg['memory_recall']['avg_recall']:.2%}")
print(f"QA Correct: {agg['question_answering']['correct_ratio']:.2%}")

# Break down QA accuracy by question type
qa_by_type = {}
with open("results/FU/qa_records.jsonl", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        rec = json.loads(line)
        qt = rec["question_type_abbrev"]
        label = rec["eval"]["answer_label"]
        qa_by_type.setdefault(qt, []).append(label)

for qt, labels in sorted(qa_by_type.items()):
    correct = labels.count("Correct")
    print(f"  {qt}: {correct}/{len(labels)} = {correct/len(labels):.0%}")
```
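
The dimensions table distinguishes metrics that are averaged per session from metrics that are pooled across sessions. The sketch below illustrates that difference using the `eval` field names shown for `session_records.jsonl`. It is not the framework's own aggregation code: the helper names (`load_jsonl`, `session_averaged_recall`, `pooled_update_score`) are hypothetical, and skipping sessions whose `recall` is null is an assumption.

```python
import json


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def session_averaged_recall(records):
    """Per-session averaging: the mean of each session's `recall`.

    Assumption: sessions whose recall is null are left out of the mean.
    """
    values = [r["eval"]["recall"] for r in records
              if r["eval"].get("recall") is not None]
    return sum(values) / len(values) if values else None


def pooled_update_score(records):
    """Pooling: sum the update counts over all sessions first, then score
    them once with updated=1.0, both=0.5, outdated=0.0."""
    updated = sum(r["eval"]["update_num_updated"] for r in records)
    both = sum(r["eval"]["update_num_both"] for r in records)
    total = sum(r["eval"]["update_total_items"] for r in records)
    return (updated + 0.5 * both) / total if total else None


# Two made-up session records carrying only the fields used above.
demo_records = [
    {"eval": {"recall": 1.0, "update_num_updated": 1,
              "update_num_both": 0, "update_total_items": 1}},
    {"eval": {"recall": 0.5, "update_num_updated": 0,
              "update_num_both": 2, "update_total_items": 4}},
]

print("session-averaged recall:", session_averaged_recall(demo_records))
print("pooled update score:", pooled_update_score(demo_records))
```

On the two made-up records, pooling weights the second session more heavily because it has four update items: the pooled score is 0.4, whereas averaging the two per-session update scores (1.0 and 0.25) would give 0.625. To run the same helpers on real output, pass `load_jsonl(...)` the path of a `session_records.jsonl` file.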