diff --git "a/data/chunks/2603.10600_semantic.json" "b/data/chunks/2603.10600_semantic.json" new file mode 100644--- /dev/null +++ "b/data/chunks/2603.10600_semantic.json" @@ -0,0 +1,1014 @@ +[ + { + "chunk_id": "c9385325-a4d9-4dea-b6c8-1ce04556cace", + "text": "Gaodan Fang, Vatche Isahagian, K. R. Jayaram, Ritesh Kumar, Vinod Muthusamy, Punleuk Oum, Gegi Thomas∗\nAgents and Automation Lab, IBM Research\nUSA\nAbstract\nLLM-powered agents face a persistent challenge: learning from their execution experiences to improve future performance. While agents can successfully complete many tasks, they often repeat inefficient patterns, fail to recover from similar errors, and miss opportunities to apply successful strategies from past executions. We present a novel framework for automatically extracting actionable learnings from agent execution trajectories and utilizing them to improve future performance through contextual memory retrieval. Our approach comprises four components: (1) a Trajectory Intelligence Extractor that performs semantic analysis of agent reasoning patterns, (2) a Decision Attribution Analyzer that identifies which decisions and reasoning steps led to failures, recoveries, or inefficiencies, (3) a Contextual Learning Generator that produces three types of guidance—strategy tips from successful patterns, recovery tips from failure handling, and optimization tips from inefficient but successful executions—and (4) an Adaptive Memory Retrieval System that injects relevant learnings into agent prompts based on multi-dimensional similarity. Unlike existing memory systems that store generic conversational facts, our framework understands execution patterns, extracts structured learnings with provenance, and retrieves guidance tailored to specific task contexts. Evaluation on\ndone at IBM. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn\n1 Introduction\nLarge Language Model (LLM) powered agents have enabled increasingly sophisticated automation of tasks ranging from web navigation to API orchestration. These agents operate by iteratively reasoning about tasks, selecting actions, executing them, and observing results. However, a fundamental limitation persists: agents have amnesia because most LLMs are stateless. Agents lack systematic mechanisms to learn from their execution experiences [4, 17]. An agent that struggles with a particular API authentication flow today will struggle with the same flow tomorrow unless its prompts are manually updated. An agent that discovers an efficient strategy for a task cannot automatically apply that strategy to similar future tasks. An agent that successfully recovers from an error provides no benefit to future executions that encounter similar errors.\nConsider a simple e-commerce task: adding items to a shopping cart and completing checkout. An agent might successfully complete this task but do so inefficiently—for instance, by calling amazon_remove_from_cart(item_id) in a loop to empty the cart when a single amazon_empty_cart() call would suffice.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. 
Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 1, + "total_chunks": 44, + "char_count": 2816, + "word_count": 379, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "26271868-4136-4bff-a701-4db4443bbb95", + "text": "the AppWorld benchmark demonstrates consistent improvements, with up to 14.3 percentage point gains in scenario goal completion on held-out tasks and particularly strong benefits on complex tasks (28.5 pp scenario goal improvement, a 149% relative increase).\nCCS Concepts\n• Computing methodologies → Information extraction; Multi-agent systems; Knowledge representation and reasoning; • Information systems → Enterprise applications; Information retrieval.\nKeywords\nagentic memory, self evolving agents\nACM Reference Format:\nGaodan Fang, Vatche Isahagian, K. R. Jayaram, Ritesh Kumar, Vinod Muthusamy, Punleuk Oum, Gegi Thomas. 2026.\nIn another execution, the agent might fail entirely because it attempts checkout without first adding a payment method, then successfully recover by recognizing the error and adding payment information. In yet another execution, the agent might execute the task cleanly from the start by systematically verifying prerequisites before each operation.\nEach of these trajectories contains valuable learnings (for future executions), but of different types. The inefficient success suggests an optimization tip: when emptying a cart with multiple items, use the bulk operation rather than iterating through individual removals. The failure-then-recovery suggests a recovery tip: when checkout fails due to missing payment method, verify payment information is configured before retrying. The clean success suggests a strategy tip: before initiating checkout operations, systematically verify all prerequisites including cart contents, shipping address,", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 2, + "total_chunks": 44, + "char_count": 1598, + "word_count": 211, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e0bf6ad7-58c8-4d88-bda7-b06bb2258c1b", + "text": "and payment method availability.\nTrajectory-Informed Memory Generation for Self-Improving Agent Systems. In Technical Report describing research\n∗Author names listed alphabetically.\nPermission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).\nTechnical Report, Yorktown Heights, NY\n© 2026 Copyright held by the owner/author(s).\nhttps://doi.org/10.1145/nnnnnnn.nnnnnnn\nTechnical Report, Published Feb 2026, Yorktown Heights, NY Fang et al.\nCurrent approaches to agent improvement are inadequate for capturing these diverse learning opportunities. Rule-based systems require developers to manually anticipate patterns and encode them as decision rules, making them brittle and unable to adapt to unforeseen situations. Prompt engineering improves common patterns through iteratively refined instructions and examples, but this guidance is generic rather than specific to actual deployment experiences, and there is no mechanism for automatic improvement based on observed outcomes. Generic memory systems [2, 15] store", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 3, + "total_chunks": 44, + "char_count": 1353, + "word_count": 189, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a9d0f7d9-46fd-4cc8-adcd-a3ebe7614c6c", + "text": "facts extracted from conversations in vector databases for later retrieval, but these systems lack several critical capabilities: they have no understanding of agent execution patterns and reasoning flows, they cannot perform causal analysis to identify which decisions led to failures or inefficiencies, they lack structured learning extraction with categories like strategy, recovery, and optimization, and they provide no provenance tracking from learnings back to source trajectories. Recent work has begun extracting reusable knowledge from agent trajectories—including workflows from successful executions [6, 13], procedural instructions [5], reasoning strategies [9], and evolving context playbooks [16]—but these approaches typically learn only from successful trajectories, lack explicit causal attribution of failures, or produce monolithic documents rather than structured, retrievable memory entries. Empirical studies further demonstrate that naive experience accumulation leads to error propagation and misaligned replay [14], underscoring the need for quality-aware memory curation.\n2 Problem Statement\n2.1 The Agent Learning Challenge\nLLM-powered agents execute tasks by iteratively reasoning, selecting actions, and observing outcomes. Each execution trajectory—the complete sequence of thoughts, actions, and results from initial request to final outcome—contains patterns that could inform future executions [11]. However, extracting actionable learnings from these trajectories is non-trivial for several reasons.\nFirst, valuable patterns exist across diverse outcome categories. Not all learning opportunities arise from failures. An agent that successfully completes a task may have employed an elegant strategy worth replicating, discovered an efficient API usage pattern, or executed a thorough validation sequence that prevented errors. Conversely, an agent that ultimately succeeds may have done so inefficiently—taking unnecessary steps, making redundant API calls, or using granular operations where bulk operations exist. And agents that encounter failures may successfully recover, with the recovery pattern itself being valuable to capture.\nWe present a framework that addresses these limitations through", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 4, + "total_chunks": 44, + "char_count": 2232, + "word_count": 282, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3002c2d7-bf07-4fcb-bae5-6949da8240d3", + "text": "trajectory-informed memory generation and retrieval.\nA comprehensive learning system must extract insights from clean successes, inefficient successes, failure-then-recovery sequences, and complete failures.\nSecond, causality is often non-obvious from raw logs. When an agent fails at step 15 of an execution, the problematic decision may have occurred at step 3.\nOur key insight is that agent execution histories—trajectories—contain rich semantic information about not just what happened, but why agents made decisions, how they reasoned about tasks, which strategies succeeded, which patterns proved inefficient, and where decision chains led to failures and recoveries. By analyzing these trajecto-", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 5, + "total_chunks": 44, + "char_count": 698, + "word_count": 90, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3d0ab9e7-739b-4187-b1a0-92477beb52de", + "text": "ries with semantic understanding, we can automatically extract actionable learnings across multiple categories, attribute failures and inefficiencies to specific decisions and reasoning steps, generate context-aware guidance, and retrieve relevant learnings based on multiple contextual dimensions.\nWhen an agent successfully recovers from an error, identifying which specific reasoning led to the recovery requires semantic understanding of the agent's thoughts, not just observation of the final outcome. When an agent completes a task inefficiently, determining which alternative approach would be more efficient requires understanding both what the agent did and what other options were available.\nThird, learnings must be contextually retrieved.\nOur contributions are as follows:", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 6, + "total_chunks": 44, + "char_count": 781, + "word_count": 100, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "70240ea8-95cf-479a-a386-836db4439cb2", + "text": "• We introduce trajectory intelligence extraction that moves beyond raw logging to semantic understanding of agent reasoning patterns, including analytical thoughts, planning patterns, validation behaviors, reflection patterns, and self-correction sequences.\n• We present automated decision attribution that distinguishes immediate causes, proximate causes, and root causes of failures, while also identifying which decisions led to successful recoveries and which execution patterns prove inefficient despite succeeding.\n• We develop contextual learning generation that produces three distinct types of guidance: strategy tips encoding successful patterns from clean executions, recovery tips capturing failure handling and error correction approaches, and optimization tips identifying efficiency improvements from successful but suboptimal executions.\n• We design adaptive memory retrieval that combines semantic similarity with metadata filtering and priority-based ranking to ensure agents receive the most relevant guidance for their specific context, including task type, domain, and execution patterns.\n• We demonstrate the framework's effectiveness on the AppWorld benchmark, showing consistent improvements across all difficulty levels, with particularly strong gains on complex tasks where learned experience is most valuable.\nAn optimization tip about using bulk cart operations is relevant when the agent is performing cart management but irrelevant for email composition tasks. A recovery tip about handling authentication failures is critical for tasks involving authenticated APIs but unnecessary for read-only operations. The retrieval system must match learnings to contexts based on multiple dimensions: task type, domain, semantic similarity to current request, and the specific execution patterns involved. The importance of precise retrieval is amplified by empirical evidence that agents closely follow retrieved experiences [14], making mismatched retrieval a direct source of degraded performance.\nFourth, learnings must be actionable and specific. Generic advice like \"be careful with API calls\" provides little value. Effective learnings specify concrete validation checks, particular API usage patterns, specific error recovery sequences, or explicit prerequisite verification steps. They must be formulated in terms the agent can directly apply: \"Before initiating checkout, verify payment method is configured by calling get_payment_methods() and checking for non-empty results\" is actionable; \"make sure payment works\" is not.\nFifth, learnings must be traceable to their source. Each learning must maintain provenance—a link back to the specific trajectory and outcome from which it was derived [3]. This enables validation of whether learnings are effective (do similar failures still occur", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. 
Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 7, + "total_chunks": 44, + "char_count": 2835, + "word_count": 383, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "11dee733-beb0-49a2-9e5b-450875fdb5c7", + "text": "after the learning is deployed?), investigation of why certain guidance was generated, and auditing of the learning system's decisions. Without provenance, it is impossible to debug incorrect guidance, assess learning quality over time, or build trust in the system's recommendations.\nidentify and classify distinct reasoning modes—analytical thoughts (examining data or constraints), planning thoughts (formulating action sequences), validation thoughts (checking prerequisites or intermediate results), reflection thoughts (evaluating past actions), and self-correction sequences (recognizing and recovering from errors)—to understand how agents reasoned about tasks and where their reasoning succeeded or failed. This structured understanding of reasoning flows is what enables the extraction of meaningful learnings from trajectories rather than surface-level pattern matching on actions alone.\n2.2 Learning Requirements\nFor agents that reason and act iteratively (e.g., ReAct-style, plan-and-execute), the learning system must satisfy several requirements.\nStrategy extraction from successful patterns: When an agent executes a task cleanly—without errors, unnecessary steps, or recovery sequences—its approach often embodies effective strategies. The system must identify these patterns: Did the agent verify prerequisites before attempting operations? Did it systematically explore available APIs before selecting one? Did it validate intermediate results before proceeding to dependent steps? These successful patterns should be encoded as strategy tips that guide future executions toward similarly effective approaches.\nRecovery extraction from failure handling: When an agent encounters an error but successfully recovers, the recovery sequence is valuable. The system must identify what went wrong, what the agent recognized about the failure, how it adjusted its approach, and what specific actions led to successful recovery. For\n2.3 Limitations of Existing Approaches\nExisting approaches to agent improvement fail to address these challenges comprehensively.\nRule-based systems encode decision rules based on anticipated patterns, but they cannot adapt to unforeseen situations and require constant manual maintenance as new patterns emerge. They also cannot automatically extract rules from observed execution trajectories—each rule must be manually crafted by developers who may not have visibility into actual deployment patterns.\nPrompt engineering improves agent performance through iteratively refined guidance and examples, but this guidance is generic rather than specific to actual deployment experiences.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. 
Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 8, + "total_chunks": 44, + "char_count": 2768, + "word_count": 363, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0b39824d-b370-48e7-b978-693450f55610", + "text": "If an agent repeatedly fails at a particular API authentication flow, prompt engineering might eventually capture this pattern, but only after manual observation and prompt modification. There is no mechanism for automatic improvement based on observed outcomes, and no systematic way to capture the full range of learning opportunities from successes, failures, and recoveries.\nGeneric memory systems represent a more sophisticated approach but still fall short. Systems like Mem0 [2] and Letta [10] store facts extracted from conversations in vector databases for later retrieval.\nexample, if an agent attempts checkout without payment configured, receives an error, recognizes the missing payment method, adds payment information, and successfully retries, this entire sequence should be encoded as a recovery tip including the failure pattern, recognition signals, and correction steps.\nOptimization extraction from inefficient successes: When an agent successfully completes a task but does so suboptimally, the system must identify the inefficiency and determine the more efficient alternative. This requires understanding not just what the agent did, but what other options were available. For example, if", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. 
Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 9, + "total_chunks": 44, + "char_count": 1216, + "word_count": 174, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "13d6e50e-1613-4360-a6fd-5fd84d476fcb", + "text": "an agent removes items from a cart one-by-one in a loop when a bulk empty_cart() operation exists, the system must recognize this pattern, identify the more efficient alternative, and encode an optimization tip specifying when and how to use the bulk operation.\nStep-level decision attribution: When failures or inefficiencies occur, the system must identify which specific reasoning steps and decisions led to the outcome.\nHowever, these systems lack several critical capabilities for agent learning. They have no understanding of agent execution patterns—they treat all memories uniformly rather than distinguishing between strategy patterns, recovery sequences, and optimization opportunities. They cannot perform causal analysis to identify which decisions led to failures or inefficiencies—they store outcomes but not the decision chains that produced them. They", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 10, + "total_chunks": 44, + "char_count": 867, + "word_count": 122, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3994d66d-7f17-43d4-bf8f-9aaafef56a9f", + "text": "lack structured learning extraction with categories, priorities, and actionable steps—memories are typically free-form text without the structure needed for agent guidance. They provide no provenance tracking from learnings back to source trajectories, making it impossible to validate whether learnings are effective or to investigate why certain guidance was generated [17].\nThis requires semantic analysis of the agent's thoughts, not just observation of actions. If an agent fails because it assumed an API was available without verifying, the attribution must identify the assumption step, explain why it was problematic, and specify what verification should have occurred.\nThought pattern recognition: Agents often exhibit meta-cognitive behaviors that indicate their reasoning quality. An agent that explicitly validates prerequisites is demonstrating a positive pattern. An agent that recognizes its own errors and self-corrects is exhibiting reflection. An agent that makes assumptions without verification is exhibiting a negative pattern. The system must identify these cognitive patterns semantically—recognizing that \"I should verify all APIs are available\" exhibits a validation pattern even without using the word \"validate\"—and use them to guide learning extraction.\nSemantic reasoning analysis: Beyond recognizing individual thought patterns, the system must move beyond raw execution logs to understand the full structure of agent reasoning. The system must\nReinforcement learning approaches learn from reward signals but have their own limitations for this problem. They require extensive training data to learn effective policies, which may not be available when failures are rare but consequential. They are computationally expensive to train and update, making them impractical for continuously evolving agent systems. They provide limited interpretability regarding why certain decisions improve outcomes—the learned policy is often a black box. For scenarios where understanding the reasoning behind improvements is valuable (such as debugging or auditing agent behavior), RL approaches provide insufficient transparency. Additionally, RL approaches struggle with", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 11, + "total_chunks": 44, + "char_count": 2262, + "word_count": 310, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2711bbc9-6d73-4134-ac37-f2eb7ce0456a", + "text": "the multi-category learning problem—they optimize for overall reward but do not naturally distinguish between strategy patterns, recovery sequences, and optimization opportunities.\nwhat traditional logging provides: why agents made particular decisions, how they validated their reasoning, where they exhibited self-corrective behavior, and what patterns characterized successful versus unsuccessful executions.\n3 Approach\nAs illustrated in Figure 1, we propose a framework that transforms raw agent execution trajectories into actionable, contextually retrieved guidance for future invocations. The framework operates as a three-phase pipeline:\n(1) Phase 1: Trajectory Analysis and Tips Extraction. Given an agent's execution trajectory for a completed task, the system analyzes the reasoning trace to identify causal decision chains—why outcomes occurred—and extracts structured tips capturing effective strategies, recovery patterns, and optimization opportunities. Tips are extracted at two complementary granularities: task-level tips that capture holistic end-to-end patterns, and subtask-level tips that decompose trajectories into reusable logical phases (authentication, data retrieval, processing, etc.) for cross-task transfer.\n(2) Phase 2: Tip Storage and Management. Extracted tips are generalized, clustered, and consolidated before storage. Subtask descriptions are abstracted to remove entity-specific details, enabling semantic clustering of tips from different tasks that share common subtask patterns. An LLM-based merging process consolidates redundant or overlapping tips within each cluster, producing a curated memory of non-redundant, high-quality guidance. Tips are stored with dual representations—vector embeddings for semantic search and structured metadata for filtering.\n(3) Phase 3: Runtime Retrieval. When an agent is invoked for a new task, the system retrieves relevant tips from memory and injects them into the agent's prompt as guidelines before reasoning begins. Two retrieval strategies are supported: cosine similarity retrieval (fast, no LLM call) and LLM-guided selection (richer reasoning about task context at the cost of an additional LLM invocation).\nThese phases form a self-reinforcing cycle: as more trajectories are processed, the memory system accumulates increasingly comprehensive and refined guidance.\nThe component receives raw agent trajectories containing sequential steps with agent invocations, prompts or contexts, agent responses including thoughts and reflections, actions taken and their results, and optionally, evaluation reports or ground-truth outcome assessments. Each trajectory represents a complete task execution from initial user request through final outcome. Crucially, ground-truth outcome labels are not required: when they are available—for instance, from a benchmark evaluation harness—the system uses them directly to classify the trajectory; when they are absent, the system infers outcome from the agent's own self-reflective signals identified in subsequent stages.\nThe first processing stage parses agent responses to identify and categorize reasoning into four types based on cognitive function: Analytical thoughts where the agent analyzes the situation and assesses constraints; Planning thoughts where the agent decides what actions to take and in what sequence; Validation thoughts where the agent checks assumptions or verifies preconditions; and Reflection thoughts where the agent reconsiders its approach, often triggered by unexpected results. Beyond categorization, the extractor identifies status indicators, execution summaries, and error recognition statements, enabling downstream components to understand the reasoning process that led to actions.\nThe second stage uses an LLM to identify cognitive patterns within extracted thoughts through semantic understanding rather than keyword matching. The system recognizes: Validation patterns—any expression of checking or verifying assumptions, even without validation-related keywords (e.g., \"I need to ensure all required APIs are included\" exhibits validation behavior); Reflection patterns—reconsideration of previous decisions, often after errors; Self-correction patterns—proactively identifying and fixing errors before external signals; Error recognition patterns—noticing problems that may affect task completion; API discovery patterns—systematic exploration of available APIs; and Efficiency awareness patterns—considering whether more efficient alternatives exist. This semantic approach generalizes across linguistic", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. 
Jayaram", "Ritesh Kumar", "Vinod Muthusamy", "Punleuk Oum", "Gegi Thomas"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10600v1", "chunk_index": 12, "total_chunks": 44, "char_count": 4600, "word_count": 589, "chunking_strategy": "semantic" }, { "chunk_id": "e99ce193-5537-4530-bdb7-45fb12e1eecf", "text": "Agents that receive this guidance produce higher-quality trajectories that may reveal subtler patterns for further learning.\nvariations, unlike rule-based keyword matching.\nThe third stage determines the trajectory outcome. When ground-", "paper_id": "2603.10600", "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", "authors": ["Gaodan Fang", "Vatche Isahagian", "K. R. Jayaram", "Ritesh Kumar", "Vinod Muthusamy", "Punleuk Oum", "Gegi Thomas"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10600v1", "chunk_index": 13, "total_chunks": 44, "char_count": 234, "word_count": 29, "chunking_strategy": "semantic" }, { "chunk_id": "f1b8e3a3-1782-4d1d-841a-a81b343192bf", "text": "The following subsections detail each phase.\n3.1 Phase 1: Trajectory Analysis and Tips Extraction\nThis phase analyzes completed agent trajectories to extract structured, actionable tips. It comprises three stages: trajectory intelligence extraction, decision attribution analysis, and tip generation. A key design dimension of the tip generation stage is the granularity at which tips are extracted—either at the level of entire task trajectories (task-level) or at the level of individual logical subtasks within a trajectory (subtask-level). We explore both granularities and compare their effectiveness in Section 4.\n3.1.1 Trajectory Intelligence Extractor. The Trajectory Intelligence Extractor transforms raw agent execution data into a structured intermediate representation that captures semantic meaning beyond\ntruth evaluation reports are present, the stage interprets them with semantic understanding: a report stating \"API response returned 400 Bad Request\" is converted into \"Checkout API failed because required payment method parameter was not provided,\" and for each outcome indicator, the module determines what the test validates, why it failed (if applicable), the impact on task completion, and overall quality assessment. When ground-truth labels are absent, the stage instead synthesizes outcome from the self-reflective signals extracted in stages 1 and 2—reflection thoughts, self-correction patterns, and error recognition patterns—to infer whether the agent succeeded, failed, or recovered. In both cases, the result is an outcome classification used by downstream components.\nA fourth stage specifically analyzes successful executions, distinguishing: Clean successes—task completed without errors or unnecessary steps, with patterns that are candidates for strategy tips; Inefficient successes—task completed but suboptimally (e.g.,\nTrajectory-Informed Memory Generation for Self-Improving Agent Systems. Technical Report, Published Feb 2026, Yorktown Heights, NY.\nFigure 1: Overview of our approach. (Extraction: Trajectory Intelligence Extractor, Decision Attribution Analyzer, Contextual Learning Generator, Subtask-level Decomposition. Storage & Mgmt: Description Generalization, Semantic Clustering, Tip Merging and Consolidation, Dual-Indexed Store. Retrieval & Usage: cosine similarity or top-k selection, LLM-guided selection, Priority Weighted Ranking, Prompt integration. Inputs: Agent Trajectory, Agent Task Description; output: tips.)", "paper_id": "2603.10600", "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", "authors": ["Gaodan Fang", "Vatche Isahagian", "K. R. 
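The third stage's fallback path, when no ground-truth evaluation report exists, infers the outcome from the self-reflective signals of stages 1 and 2. The rule set below is a hypothetical stand-in for the paper's LLM-based synthesis; the signal names are assumptions.

```python
# Toy sketch: synthesize an outcome classification (success / failure /
# recovery) from stage-1/2 signal counts. The real system uses an LLM.

def synthesize_outcome(signals):
    """signals: dict of pattern counts plus a 'task_completed' flag."""
    saw_error = signals.get("error_recognition", 0) > 0
    corrected = (signals.get("self_correction", 0) > 0
                 or signals.get("reflection", 0) > 0)
    completed = signals.get("task_completed", False)
    if completed and saw_error and corrected:
        return "recovery"   # error signals followed by successful completion
    if completed:
        return "success"
    return "failure"
```

Downstream components would consume this label exactly like the ground-truth-derived one.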
Jayaram", "Ritesh Kumar", "Vinod Muthusamy", "Punleuk Oum", "Gegi Thomas"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10600v1", "chunk_index": 14, "total_chunks": 44, "char_count": 2439, "word_count": 312, "chunking_strategy": "semantic" }, { "chunk_id": "55158338-5cee-4250-92cb-59ce19836a64", "text": "repeated operations that could be batched), yielding candidates for optimization tips; and Recovery sequences—successful error handling within otherwise successful executions, yielding candidates for recovery tips.\nThe output is a structured intermediate representation enriched with extracted thoughts, identified patterns with confidence scores, evaluation intelligence, success patterns, and metadata including trajectory identifier, task intent, step count, and overall outcome classification.\n3.1.2 Decision Attribution Analyzer. The Decision Attribution Analyzer performs automated causal analysis to identify which decisions and reasoning steps led to observed outcomes. It analyzes all outcome types—not just failures. The first stage scans the intermediate representation for outcome indicators across four categories: Failure indicators—failed evaluations, error messages, task incompletion signals; Recovery indicators—failure followed by successful completion, error recognition followed by corrective actions; Inefficiency indicators—repeated operations that could be batched, unnecessary intermediate steps, granular operations where bulk alternatives exist; and Success patterns—clean completion, systematic prerequisite verification, efficient API usage. For each detected outcome, contextual information is extracted as the starting point for causal analysis. Importantly, the outcome location is typically not the cause location.\nThe causal analysis module uses an LLM to trace backwards through the agent's reasoning steps to identify which decisions led to the observed outcome. For failures, the analysis distinguishes: the immediate cause (what directly triggered the failure), the proximate cause (recent decisions that enabled it), the root cause (the underlying issue that originated the chain), and contributing fac-\nwhat made execution suboptimal, what more efficient alternative exists, why the alternative is better, and whether the agent was aware of the inefficiency. For success patterns, it identifies what strategies contributed to clean success, why they were effective, and what made the approach particularly good.\nThe final stage generates specific prevention or improvement steps for each attributed decision point. These steps must be actionable—the agent can actually perform them; specific—concrete actions rather than vague advice; causal—directly addressing the root cause; and preventive or improving—stopping similar failures from occurring or specifying more efficient approaches.\n3.1.3 Contextual Learning Generator. The Contextual Learning Generator converts decision analyses into reusable memory entries that are actionable, contextually rich, and properly categorized. The key innovation is generating three distinct tip types based on trajectory outcomes.\nStrategy tips encode effective patterns from clean successful executions—what worked well and should be replicated. Example:\nContent: \"When performing checkout operations, systematically verify all prerequisites (cart contents, shipping address, payment method) before initiating the checkout sequence.\"\nCategory: strategy\nSteps:\n1. Call get_cart_items() to verify cart is not empty\n2. Call get_shipping_address() to verify address is configured\n3. Call get_payment_methods() to verify payment method exists\n4.", "paper_id": "2603.10600", "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", "authors": ["Gaodan Fang", "Vatche Isahagian", "K. R. 
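One of the inefficiency indicators named above, "repeated operations that could be batched," can be illustrated with the paper's own cart example. The counting heuristic and the bulk-alternative table are assumptions for illustration, not the system's implementation (which is LLM-driven).

```python
# Sketch of an outcome-indicator scan for batchable repeated operations.
# API names mirror the paper's e-commerce example; the heuristic is illustrative.

from collections import Counter

BULK_ALTERNATIVES = {"amazon_remove_from_cart": "amazon_empty_cart"}

def find_inefficiencies(api_calls):
    """api_calls: list of call strings such as 'amazon_remove_from_cart(3)'."""
    counts = Counter(call.split("(")[0] for call in api_calls)
    findings = []
    for api, n in counts.items():
        if n > 1 and api in BULK_ALTERNATIVES:
            findings.append(
                f"{api} called {n} times; consider {BULK_ALTERNATIVES[api]}()")
    return findings
```

Each finding would then seed the backward causal analysis described above.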
Jayaram", "Ritesh Kumar", "Vinod Muthusamy", "Punleuk Oum", "Gegi Thomas"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10600v1", "chunk_index": 15, "total_chunks": 44, "char_count": 3332, "word_count": 430, "chunking_strategy": "semantic" }, { "chunk_id": "b99488d6-9b5d-43f5-9455-d2ff858f4b51", "text": "Only proceed with checkout if all prerequisites are satisfied\nTrigger: \"When task involves checkout, purchase, or payment operations\"\ntors. For recoveries, it identifies what enabled the failure, how the agent recognized the problem, what corrective action was taken, and why the correction succeeded. For inefficiencies, it identifies", "paper_id": "2603.10600", "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", "authors": ["Gaodan Fang", "Vatche Isahagian", "K. R. Jayaram", "Ritesh Kumar", "Vinod Muthusamy", "Punleuk Oum", "Gegi Thomas"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10600v1", "chunk_index": 16, "total_chunks": 44, "char_count": 406, "word_count": 58, "chunking_strategy": "semantic" }, { "chunk_id": "c3461736-5e40-4c2b-ab15-33355ded4ac9", "text": "Recovery tips encode both the failure pattern and the recovery pattern from failure-then-recovery sequences. Example:\nContent: \"When checkout fails with 'payment method required' error, verify payment configuration and add payment method if missing before retrying.\"\nCategory: recovery\nSteps:\n1. Recognize error message indicating missing payment method\n2. Call get_payment_methods() to check current configuration\n3. If empty, call add_payment_method() with appropriate details\n4.\n3.1.4 Task-Level vs. Subtask-Level Extraction. The tip generation stage can operate at two granularities. Task-level extraction treats an entire trajectory as a unit, producing holistic tips that capture end-to-end execution patterns. Subtask-level extraction first decomposes the trajectory into logical subtasks and then extracts focused tips for each subtask independently.\nThe two approaches offer different tradeoffs. Task-level tips are straightforward to extract and capture overarching strategies spanning the full task. However, their reusability is limited by task specificity: a tip extracted from \"Name the artist most recommended to me on Spotify\" may not transfer to \"Move my go-to-sleep phone alarm to 20 minutes later,\" even though both share common subtasks such as authentication and paginated data retrieval. Task-level tips also bundle concerns from distinct execution phases, reducing", "paper_id": "2603.10600", "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", "authors": ["Gaodan Fang", "Vatche Isahagian", "K. R. Jayaram", "Ritesh Kumar", "Vinod Muthusamy", "Punleuk Oum", "Gegi Thomas"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10600v1", "chunk_index": 17, "total_chunks": 44, "char_count": 1390, "word_count": 190, "chunking_strategy": "semantic" }, { "chunk_id": "f11c3838-fc22-40d8-b254-b1b87de061dc", "text": "Retry the checkout operation\nTrigger: \"When checkout or payment operations fail\"\nretrieval precision.\nSubtask-level extraction addresses these limitations by scoping each tip to a single logical phase.\nNegative Example: \"Do not simply retry without addressing 
the missing payment method.\"\nOptimization tips identify efficiency improvements from successful but suboptimal executions. Example:\nContent: \"When emptying a shopping cart with multiple items, use empty_cart() instead of iterating remove_from_cart(item_id) for each item.\"\nCategory: optimization\nSteps:\n1.\nMany tasks share common subtasks that generalize across contexts:\n• Authentication subtasks follow a common pattern across apps (Spotify, Phone, Venmo): retrieve credentials from a supervisor, login, and store the access token.\n• Data retrieval subtasks share pagination patterns: issue paginated API calls, aggregate results, and store them for downstream processing.\n• Data processing subtasks involve domain-independent op-", "paper_id": "2603.10600", "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", "authors": ["Gaodan Fang", "Vatche Isahagian", "K. R. Jayaram", "Ritesh Kumar", "Vinod Muthusamy", "Punleuk Oum", "Gegi Thomas"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10600v1", "chunk_index": 18, "total_chunks": 44, "char_count": 994, "word_count": 131, "chunking_strategy": "semantic" }, { "chunk_id": "c7d7e686-ea71-4df0-8300-c5dda0f33064", "text": "Check if cart has multiple items to remove\n2. Instead of looping remove_from_cart(), call empty_cart() once\n3. Verify cart is empty with get_cart_items()\nTrigger: \"When task requires removing all items from cart\"\nNegative Example: \"Do not use for i in items: remove_from_cart(i) when emptying the entire cart.\"\nThe system analyzes trajectories to determine contextual dimensions for both generation and retrieval: the application context (which domain the task involves), the task category (type of operation within the domain), and the complexity level. Tip content is generated using specialized prompts for each category, incorporating the relevant execution patterns, and each prompt includes guidelines for generating actionable, specific, generalizable tips.\nEach generated memory entry contains: a unique identifier, tip category (strategy, recovery, optimization), actionable content, explanatory purpose, concrete implementation steps, trigger condition, optional negative example, application context (or null for generic tips), task category (or null for generic tips), priority level (critical/high/medium/low based on outcome severity), source trajectory ID, and source outcome description.\nThe system also generates both domain-specific and generic tips from the same trajectory, maximizing precision and coverage. From a failure involving missing payment APIs in e-commerce checkout, the system generates a domain-specific tip (\"For e-commerce tasks involving checkout, verify payment method is configured before initiating checkout\") and a generic tip (\"When initiating operations that have prerequisites, systematically verify all prerequisites before beginning\").\nerations: counting, filtering, aggregation, and transformation of retrieved data.\n• Task completion subtasks are near-universal: reporting results and marking tasks complete.\nBy extracting tips at this granularity, we enable cross-task transfer (authentication tips from Spotify tasks help with Phone app tasks), better matching (a task about updating alarms retrieves tips from a \"retrieve all alarms\" subtask even if the original task was about deleting alarms), and compositional learning (new complex tasks leverage tips from multiple simpler subtasks).\nTwo-Phase Extraction Pipeline. The subtask-level extraction operates as a two-phase LLM-based pipeline.\nPhase A: Trajectory Segmentation. An LLM analyzes the full agent trajectory and segments it into logical subtasks. For each subtask, the model produces a generalized description (deliberately generic, e.g., \"Authenticate with Spotify\" rather than \"Login as user@gmail.com\"), the set of applications involved, the step range in the original trajectory (maintaining traceability), and the subtask's purpose. The segmentation prompt instructs the model to identify natural boundaries between distinct logical phases—transitions from authentication to data retrieval, from data retrieval to processing, and so on.\nFor example, a trajectory for \"Name the artist most recommended to me on Spotify\" might be segmented into: (1) discover relevant APIs and their specifications, (2) authenticate with Spotify, (3) retrieve recommended songs via paginated requests, and (4) analyze recommendations to determine the most recommended artist.\nPhase B: Per-Subtask Tips Extraction. An LLM independently", "paper_id": "2603.10600", "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", "authors": ["Gaodan Fang", "Vatche Isahagian", "K. R. Jayaram", "Ritesh Kumar", "Vinod Muthusamy", "Punleuk Oum", "Gegi Thomas"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10600v1", "chunk_index": 19, "total_chunks": 44, "char_count": 3342, "word_count": 452, "chunking_strategy": "semantic" }, { "chunk_id": "19a91586-dc98-4773-b5e4-923ae202d819", "text": "This dual-level generalization ensures high precision\nextracts 2–4 actionable tips for each subtask. 
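The memory-entry fields enumerated above can be sketched as a record type. The schema below is a hypothetical rendering for illustration; the paper lists the fields but does not specify a concrete data structure or these exact attribute names.

```python
# Hypothetical schema for a generated memory entry, following the
# field list in the text (identifier, category, content, purpose,
# steps, trigger, optional negative example, contexts, priority,
# provenance). Names and types are assumptions.

from dataclasses import dataclass
from typing import Optional

@dataclass
class MemoryEntry:
    tip_id: str
    category: str                # "strategy" | "recovery" | "optimization"
    content: str
    purpose: str
    steps: list
    trigger: str
    negative_example: Optional[str] = None
    application_context: Optional[str] = None  # None = generic tip
    task_category: Optional[str] = None        # None = generic tip
    priority: str = "medium"     # critical / high / medium / low
    source_trajectory_id: str = ""
    source_outcome: str = ""

tip = MemoryEntry(
    tip_id="tip-001",
    category="optimization",
    content="When emptying a cart with multiple items, use empty_cart() "
            "instead of iterating remove_from_cart(item_id).",
    purpose="Replace repeated granular operations with a bulk alternative",
    steps=["Check cart has multiple items", "Call empty_cart() once",
           "Verify with get_cart_items()"],
    trigger="Task requires removing all items from cart",
    application_context="amazon",
    priority="high",
)
```

A generic variant of the same tip would carry `application_context=None`.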
By scoping each extraction call to a single subtask, the prompts remain focused and the tips avoid conflating concerns from different execution phases.\nwhen context matches domain-specific tips and broad coverage through generic tips that apply even in novel domains.", "paper_id": "2603.10600", "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", "authors": ["Gaodan Fang", "Vatche Isahagian", "K. R. Jayaram", "Ritesh Kumar", "Vinod Muthusamy", "Punleuk Oum", "Gegi Thomas"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10600v1", "chunk_index": 20, "total_chunks": 44, "char_count": 498, "word_count": 68, "chunking_strategy": "semantic" }, { "chunk_id": "b41f6e35-9a00-4069-a47e-93c2f0139a45", "text": "tips are constrained to be concrete (specific API patterns rather than vague advice), generalizable (avoiding task-specific details such as particular email addresses, song names, or payment amounts), and actionable (directly applicable by an agent encountering a similar subtask). Optionally, different models can be used for Phase A and Phase B—a more capable model for segmentation and a lighter model for per-subtask extraction—to balance cost and quality.\nExample output for the \"Authenticate with Spotify\" subtask:\nTips:\n1. \"Always retrieve account credentials from supervisor.show_account_passwords() before attempting authentication\"\n2. \"Immediately store and validate access tokens after login to ensure successful subsequent API calls\"\n3. \"Filter credentials by app name to select the correct password for the target service\"\nSubtask-level and task-level tips are complementary rather than competing. Task-level tips capture holistic patterns about end-to-end execution strategy (e.g., \"verify all prerequisites before checkout\"), while subtask-level tips capture focused patterns about specific execution phases (e.g., \"use paginated retrieval when fetching large result sets\"). Both levels are stored in the same memory system and can be retrieved together during Phase 3.\n3.2 Phase 2: Tip Storage and Management\nAs tips accumulate from many trajectories across diverse tasks, the memory system must address redundancy, inconsistency, and scalability. Two trajectories involving e-commerce checkout may independently produce tips about verifying payment methods; dozens of trajectories across different apps will produce authentication-related tips with overlapping guidance. Without consolidation, the memory grows linearly with the number of processed trajectories, retrieval quality degrades as near-duplicate tips compete for limited prompt space, and contradictory tips from different trajectories may confuse the agent.\nPhase 2 addresses these challenges through a pipeline of subtask description generalization, semantic clustering, and LLM-based tip consolidation.\n3.2.1 Subtask Description Generalization.\nThese transformations are applied using an LLM with a prompt that instructs it to produce maximally abstract descriptions while preserving the core operation. The generalized descriptions serve as the basis for clustering: tips whose generalized subtask descriptions are semantically similar are likely to contain overlapping or complementary guidance.\n3.2.2 Semantic Clustering. The system clusters tips by computing cosine similarity between the vector embeddings of their generalized subtask descriptions, then applying hierarchical agglomerative clustering with a similarity threshold. Two generalized descriptions such as \"Retrieve service account credentials\" and \"Authenticate with external service\" may describe distinct subtasks despite surface-level relatedness, while \"Retrieve service account credentials\" and \"Obtain application login credentials\" describe the same operation. Hierarchical clustering with an appropriate threshold (empirically, ∼0.85 on generalized descriptions) groups truly equivalent subtask descriptions while keeping distinct operations separate.\nWithin each cluster, all associated tips are collected regardless of their source trajectory, task context, or extraction granularity. A cluster for \"Retrieve service account credentials\" might contain tips from Spotify authentication trajectories, Venmo login trajectories, and Phone app credential retrieval—all reflecting the same underlying subtask pattern observed across different tasks.\n3.2.3 Tip Consolidation and Merging. Within each cluster, an LLM-based consolidation process merges redundant tips, resolves conflicts, and produces a curated set of non-overlapping guidance. The consolidation operates in three steps:\nDeduplication. Tips with near-identical content are identified and merged. \"Always call show_account_passwords() before login\" and \"Retrieve credentials using the supervisor password API", "paper_id": "2603.10600", "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", "authors": ["Gaodan Fang", "Vatche Isahagian", "K. R. 
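The clustering step described in 3.2.2 (cosine similarity over embeddings of generalized descriptions, grouped at roughly a 0.85 threshold) can be sketched as follows. The greedy single-link pass is a simplification of full hierarchical agglomerative clustering, and the toy 3-dimensional vectors stand in for a real embedding model.

```python
# Sketch of threshold-based clustering of generalized subtask
# descriptions. Toy embeddings are hand-made so that the two
# credential-retrieval descriptions group together.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cluster(embeddings, threshold=0.85):
    """Greedy single-link grouping: join an item to the first cluster
    containing a member within the similarity threshold."""
    clusters = []  # each cluster is a list of item indices
    for i, e in enumerate(embeddings):
        for c in clusters:
            if any(cosine(e, embeddings[j]) >= threshold for j in c):
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

descs = ["Retrieve service account credentials",
         "Obtain application login credentials",
         "Issue paginated data-retrieval API calls"]
embs = [[0.9, 0.1, 0.0], [0.88, 0.15, 0.02], [0.05, 0.2, 0.95]]
grouped = [[descs[i] for i in c] for c in cluster(embs)]
```

With these vectors the first two descriptions fall in one cluster while the pagination subtask stays separate, mirroring the behavior the text attributes to the ~0.85 threshold.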
Jayaram", "Ritesh Kumar", "Vinod Muthusamy", "Punleuk Oum", "Gegi Thomas"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10600v1", "chunk_index": 21, "total_chunks": 44, "char_count": 4346, "word_count": 561, "chunking_strategy": "semantic" }, { "chunk_id": "6a324e8b-4476-433e-806d-3cd2c54fd7ce", "text": "Subtask descriptions produced by Phase 1 contain varying levels of specificity that hinder clustering. \"Retrieve Spotify password for john.doe@email.com using supervisor API,\" \"Get Venmo login credentials for user alice_smith,\" and \"Fetch Phone app password from supervisor\" all describe the same abstract operation: retrieving service credentials. To enable meaningful clustering, the system generalizes subtask descriptions through three transformations:\n• Entity abstraction: Replaces specific user names, email addresses, app names, item IDs, and other entity references with generic placeholders. \"Retrieve Spotify password for john.doe@email.com\" becomes \"Retrieve service account credentials.\"\n• Action normalization: Maps semantically equivalent verbs and phrases to canonical forms. \"Get,\" \"fetch,\" \"retrieve,\" and \"obtain\" are normalized to a single canonical verb. \"Log in,\" \"sign in,\" and \"authenticate\" are similarly unified.\nbefore authentication\" convey the same guidance; the consolidation produces a single canonical tip that captures the shared insight.\nConflict resolution. When tips from different trajectories offer contradictory guidance (e.g., one tip recommends retrying failed authentication immediately while another recommends re-retrieving credentials first), the system uses outcome metadata—tip category, priority level, and source trajectory success/failure status—to determine which guidance is more reliable. Tips derived from successful trajectories take precedence over those from failed ones, and recovery tips that encode proven correction patterns take precedence over speculative prevention strategies.\nSynthesis. Complementary tips that address different aspects of the same subtask are synthesized into coherent, comprehensive guidance. If one tip covers credential retrieval and another covers token validation after login, the consolidated output combines both into a single tip with ordered steps covering the full authentication workflow.\nThe consolidation also produces a canonical cluster description—a single generalized subtask description that represents the cluster for retrieval purposes. This description is re-embedded and stored alongside the consolidated tips, replacing the individual per-trajectory descriptions.\nthe threshold, preventing prompt bloat when many stored tasks are moderately similar.\nIn practice, these two mechanisms are combined: the system retrieves all tips with similarity ≥ 𝜏, then selects the top 𝑘 by similarity score.", "paper_id": "2603.10600", "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", "authors": ["Gaodan Fang", "Vatche Isahagian", "K. R. Jayaram", "Ritesh Kumar", "Vinod Muthusamy", "Punleuk Oum", "Gegi Thomas"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10600v1", "chunk_index": 22, "total_chunks": 44, "char_count": 2574, "word_count": 336, "chunking_strategy": "semantic" }, { "chunk_id": "e0b96541-1c97-47c0-9894-5828b29cefa7", "text": "Typical values are 𝜏 ∈ [0.5, 0.7] and 𝑘 ∈ [5, 10].\n3.2.4 Storage Representation. Each consolidated memory entry is stored with two complementary representations. The vector embedding is a dense vector computed from the tip content and purpose using a text embedding model. This captures semantic meaning, enabling similarity search across different terminology—for instance, a tip about \"renewing a subscription\" can match a task description mentioning \"extending my membership,\" and a tip about \"scheduling a recurring event\" can match \"set up a weekly meeting.\" The structured metadata consists of filterable attributes: tip category (strategy, recovery, optimization), priority level, application context, task category, source trajectory IDs (plural, since consolidated tips may derive from multiple trajectories), and creation timestamp.\nTips are indexed by their canonical cluster description for subtask-level tips, and by the original task description for task-level tips, creating natural groupings that enable retrieval at both granularities.\n3.3 Phase 3: Runtime Retrieval\nWhen an agent is invoked to execute a new task with description 𝑑, the system retrieves relevant tips from memory and injects them into the agent's prompt as guidelines before reasoning begins. The retrieval strategy directly affects whether the agent receives relevant, actionable guidance or is distracted by irrelevant tips. We consider two strategies with different cost-accuracy tradeoffs.\n3.3.1 Cosine Similarity Retrieval.\n3.3.2 LLM-Guided Selection. A more expressive approach uses an LLM at retrieval time to analyze the task description 𝑑, detect the application context and task category, and reason about which types of guidance are most relevant. The LLM constructs a structured retrieval query that combines:\n• Metadata filters: The LLM identifies that a task about \"Complete my pending Venmo payment requests\" involves the Venmo application and payment operations, and constrains retrieval to tips from the payment domain (or generic tips with null application context).\n• Category awareness: Based on the task description, the LLM may determine that recovery tips are particularly relevant (e.g., the task mentions retrying a failed payment) or that strategy tips should be prioritized (e.g., the task involves a multi-step workflow).\nLLM-guided selection is more expensive (requiring an additional LLM call per task) but can reason about nuances that pure embedding similarity misses. For instance, an LLM can recognize that \"Delete all my read emails older than 30 days\" and \"Clean up my inbox by removing old messages\" are the same task even when their embeddings diverge, and it can infer that a task involving \"checkout\" implies payment-related tips are relevant even if \"payment\" is never mentioned in the task description.\nCosine similarity retrieval is simple, fast, and requires no LLM calls at runtime—making it suitable for latency-", "paper_id": "2603.10600", "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", "authors": ["Gaodan Fang", "Vatche Isahagian", "K. R. Jayaram", "Ritesh Kumar", "Vinod Muthusamy", "Punleuk Oum", "Gegi Thomas"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10600v1", "chunk_index": 23, "total_chunks": 44, "char_count": 2958, "word_count": 440, "chunking_strategy": "semantic" }, { "chunk_id": "97ab7e49-397e-44f7-a65e-71b67fbf6af5", "text": "sensitive or cost-constrained deployments. LLM-guided selection provides richer reasoning about task context at the cost of an additional LLM invocation. We evaluate both strategies empirically in Section 4.\nThe most straightforward approach embeds the incoming task description 𝑑 and computes cosine similarity against the embeddings of stored task (and subtask) descriptions. Tips associated with the most similar stored descriptions are retrieved and injected into the prompt. 
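The two cosine-retrieval mechanisms described for Phase 3 (a similarity threshold 𝜏, then top-𝑘 selection) combine as a simple filter-and-rank. A minimal sketch, assuming precomputed similarity scores in place of a real vector store:

```python
# Sketch of cosine-similarity retrieval: keep tips with similarity >= tau,
# then take the k highest-scoring ones. Typical values per the text:
# tau in [0.5, 0.7], k in [5, 10]. The (score, tip) pairs are a stand-in
# for a vector-database query result.

def retrieve(scored_tips, tau=0.6, k=5):
    """scored_tips: list of (cosine_similarity, tip) pairs."""
    eligible = [(s, t) for s, t in scored_tips if s >= tau]
    eligible.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in eligible[:k]]

tips = [(0.91, "verify checkout prerequisites"),
        (0.72, "use paginated retrieval"),
        (0.40, "book-a-flight tip (unrelated)")]
selected = retrieve(tips, tau=0.6, k=2)
```

The threshold discards the unrelated tip, and top-𝑘 bounds how much prompt space the guidance can consume.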
This strategy requires no LLM calls at retrieval time and is fast and inexpensive—a\n3.3.3 Prompt Integration.", "paper_id": "2603.10600", "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", "authors": ["Gaodan Fang", "Vatche Isahagian", "K. R. Jayaram", "Ritesh Kumar", "Vinod Muthusamy", "Punleuk Oum", "Gegi Thomas"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10600v1", "chunk_index": 24, "total_chunks": 44, "char_count": 594, "word_count": 85, "chunking_strategy": "semantic" }, { "chunk_id": "3dbd0712-61ed-47d1-bfad-53f5d27f8eea", "text": "pure vector database lookup.\nTwo complementary mechanisms control which tips are selected:\n• Similarity threshold 𝜏: Only tips whose source description has cosine similarity ≥ 𝜏 with 𝑑 are eligible. A high threshold (e.g., 𝜏 ≥ 0.85) ensures retrieved tips are closely related to the current task, but risks excluding tips from tasks that are semantically equivalent yet phrased differently. For example, \"I want an Amazon Prime membership\" and \"Sign me up for Amazon Prime\" describe the same task but may have cosine similarity below 0.85 due to lexical differences. A low threshold (e.g., 𝜏 ≤ 0.6) casts a wider net, but risks pulling in tips from unrelated tasks—tips from \"Book a flight to New York\" are unlikely to help an agent executing \"Update my calendar for next week,\" yet both involve scheduling-adjacent language that could produce moderate similarity scores.\n• Top-𝑘 selection: After filtering by threshold, the system selects the 𝑘 highest-scoring tips. This bounds the number of tips injected into the prompt regardless of how many pass\nRegardless of retrieval strategy, the selected tips are injected into the agent's prompt as a \"guidelines\" section positioned after the task context but before the standard agent instructions. Each tip is formatted to be quickly scannable and actionable, highlighting priority level, category, actionable content, purpose, implementation steps, and trigger condition. For example:\n[PRIORITY: HIGH] Recovery Tip:\nWhen a login attempt fails with \"invalid credentials,\" verify you are using the correct app-specific password by re-calling supervisor.show_account_passwords() and filtering by the target app name.\nApply when: Authentication fails on any app after an initial login attempt.\nSteps:\n1. Re-retrieve credentials from supervisor\n2. Filter for the specific app name (exact match)\n3. Retry login with the correct credentials", "paper_id": "2603.10600", "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", "authors": ["Gaodan Fang", "Vatche Isahagian", "K. R. Jayaram", "Ritesh Kumar", "Vinod Muthusamy", "Punleuk Oum", "Gegi Thomas"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10600v1", "chunk_index": 25, "total_chunks": 44, "char_count": 2003, "word_count": 292, "chunking_strategy": "semantic" }, { "chunk_id": "4ea58e86-dc51-4cb5-8ddc-f0e59db77fdb", "text": "determines the task is complete or encounters an unrecoverable failure. Both the agent and the tip extraction pipeline use GPT-4.1.\nThis formatting enables agents to quickly identify the type of guidance, prioritize critical tips, and understand both what to do and why. 
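The prompt-integration step can be sketched as a small rendering template in the spirit of the [PRIORITY: HIGH] example shown in the text; the exact template and function name are assumptions.

```python
# Sketch: render a retrieved tip as a scannable "guidelines" block for
# injection into the agent's prompt. The layout mirrors the example in
# the text but is not the system's actual template.

def format_tip(priority, category, content, trigger, steps):
    lines = [f"[PRIORITY: {priority.upper()}] {category.title()} Tip:",
             content,
             f"Apply when: {trigger}",
             "Steps:"]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, 1)]
    return "\n".join(lines)

block = format_tip(
    "high", "recovery",
    "Verify the app-specific password before retrying a failed login.",
    "Authentication fails on any app after an initial login attempt.",
    ["Re-retrieve credentials from supervisor",
     "Filter for the specific app name (exact match)",
     "Retry login with the correct credentials"],
)
```

Concatenating such blocks yields the guidelines section placed between the task context and the standard agent instructions.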
The prompt integration creates a feedback loop: agents receiving relevant tips avoid failure patterns, execute more efficiently, and apply successful strategies, producing higher-quality trajectories that reinforce the memory system's value.\n4 Evaluation\nWe evaluate our trajectory-informed memory generation framework on the AppWorld benchmark, a comprehensive evaluation suite for LLM agents that perform complex tasks across multiple applications. Our evaluation examines two dimensions: (1) the effect of tip extraction granularity (task-level vs. subtask-level tips), and (2) the effect of retrieval strategy (cosine similarity vs. LLM-guided selection). The evaluation demonstrates that agents equipped with learned memory from past executions substantially outperform agents without memory, with particularly strong improvements on challenging tasks.\n4.1 Experimental Setup\n4.1.1 Benchmark Description. AppWorld is a benchmark designed to evaluate LLM agents on realistic task completion across diverse application domains. The benchmark contains tasks spanning e-commerce, email, calendar, file management, and other common application scenarios.\nThe base agent (without memory) receives only the task instruction and standard prompting that includes its role description, available APIs, and general guidelines for task execution. The memory-enhanced agent additionally receives retrieved tips from the memory system, injected into the prompt before the agent begins reasoning.\n4.1.3 Tip Extraction Configurations. We evaluate two tip extraction granularities:\nTask-level tips are extracted from entire trajectories as described in Section 3.1.3. Each trajectory produces a holistic set of strategy, recovery, and optimization tips that capture end-to-end execution patterns. Task-level tips are well-suited for capturing overarching strategies (e.g., \"verify all prerequisites before checkout\") but may bundle unrelated concerns from different execution phases.\nSubtask-level tips are extracted using the two-phase pipeline described in Section 3.1.4. Trajectories are first segmented into logical subtasks (authentication, data retrieval, data processing, etc.), and tips are then extracted independently for each subtask. Subtask-level tips are more focused and reusable across tasks that share common subtasks.\nBoth tip types were generated from agent executions on the", "paper_id": "2603.10600", "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", "authors": ["Gaodan Fang", "Vatche Isahagian", "K. R. Jayaram", "Ritesh Kumar", "Vinod Muthusamy", "Punleuk Oum", "Gegi Thomas"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10600v1", "chunk_index": 26, "total_chunks": 44, "char_count": 2651, "word_count": 351, "chunking_strategy": "semantic" }, { "chunk_id": "5b5c2626-7d42-45f6-8f54-ec5e7575a6a1", "text": "Each task consists of a natural language instruction that the agent must execute by interacting with APIs provided for various applications.\nAppWorld training and development partitions, processed through", "paper_id": "2603.10600", "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", "authors": ["Gaodan Fang", "Vatche Isahagian", "K. R. Jayaram", "Ritesh Kumar", "Vinod Muthusamy", "Punleuk Oum", "Gegi Thomas"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10600v1", "chunk_index": 27, "total_chunks": 44, "char_count": 216, "word_count": 28, "chunking_strategy": "semantic" }, { "chunk_id": "7d65fd8e-81b4-497b-bcba-96ce65d2358b", "text": "The benchmark includes two key evaluation metrics:\nTask Goal Completion (TGC) measures the percentage of individual tasks where the agent passes all programmatic unit tests, which verify correct API usage, database state changes, and expected end states. Each task is a complex, multi-step, app-based challenge that typically requires multiple API calls across an average of 1.8 apps and 9.5 APIs. A task is successful only if all unit tests pass.\nScenario Goal Completion (SGC) measures the percentage of scenarios where the agent correctly completes all task variants (typically three) associated with a given scenario, testing for consistency across related tasks. A scenario is only counted as successful\n4.1.4 Retrieval Strategy Configurations. We evaluate two retrieval strategies for selecting which tips to inject into the agent's prompt at runtime:\nCosine similarity retrieval embeds the task instruction using a text embedding model and retrieves the top-𝑘 tips whose vector embeddings have the highest cosine similarity to the query embedding. This is a standard retrieval approach that requires no LLM calls at retrieval time and is fast and inexpensive.\nLLM-guided selection uses an LLM to analyze the task instruction, detect the application context and task category, and construct a retrieval query that combines semantic similarity with metadata filtering and priority-weighted ranking.", "paper_id": "2603.10600", "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", "authors": ["Gaodan Fang", "Vatche Isahagian", "K. R. 
Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 28,
    "total_chunks": 44,
    "char_count": 1397,
    "word_count": 200,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7dcdedf8-00c9-49f7-aee8-bb922da42d47",
    "text": "This approach is more expressive—it can reason about which tip categories are most relevant and ensure critical tips surface—but requires an additional LLM call at retrieval time.\nFor both strategies, the top 5 tips are retrieved and injected into the agent's prompt before reasoning begins.\nif every variant passes, making this a stricter metric than TGC.\nTasks in AppWorld are categorized by difficulty level:\n• Difficulty 1 (Easy): Simple tasks requiring basic API interactions, typically single-domain with straightforward execution sequences\n• Difficulty 2 (Medium): Moderate complexity tasks that may span multiple domains or require conditional logic and error handling\n• Difficulty 3 (Hard): Complex multi-step tasks requiring careful planning, prerequisite management, cross-domain coordination, and robust error recovery, often involving 50+ lines of equivalent code and up to 26 APIs\n4.1.2 Agent Configuration. We evaluate using a single-agent configuration implementing a simplified ReAct-style reasoning and action loop. The agent iteratively reasons about the current task state, selects actions to take, executes those actions via API calls, and observes the results. The agent continues this loop until it\n4.1.5 Evaluation Protocol. We evaluated configurations on three partitions of AppWorld: (1) the test-normal partition, which contains held-out tasks not seen during memory generation, measuring the agent's ability to generalize learned patterns to novel tasks; (2) the train partition, from which tips were generated, measuring how effectively tips improve performance when the same task is encountered again; and (3) the dev partition, also used during tip generation, providing a complementary view.\nEach task was executed independently with a maximum of 30 reasoning-action steps. Task and scenario goal completion were assessed using AppWorld's automated evaluation framework, which",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 29,
    "total_chunks": 44,
    "char_count": 1984,
    "word_count": 285,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b4e6c670-2d4c-4ab1-a3bb-63bd67688a8b",
    "text": "verifies that all explicit requirements (task goals) and implicit requirements (scenario goals) are satisfied by examining the final state of all involved applications.\nconfigurations to examine the effect of threshold and top-𝑘 selection.\n4.2 Held-Out Results (Test-Normal)\nThe test-normal partition contains tasks not seen during memory generation, providing the most rigorous evaluation of the memory system's ability to generalize learned patterns to novel tasks. We present results for multiple configurations.\n4.2.1 Subtask-Level Tips with LLM-Guided Selection. Tables 1 and 2 present results for subtask-level tips with LLM-guided selection—the best-performing configuration for scenario goal completion.\nTable 1: Subtask Tips + LLM Selection: Test-Normal\nType Task Goal Scenario Goal\nAggregate 73.2 64.3\nDifficulty 1 91.2 89.5\nDifficulty 2 70.8 56.2\nDifficulty 3 58.7 47.6\nTable 2: Baseline Agent (No Memory): Test-Normal\nType Task Goal Scenario Goal\nAggregate 69.6 50.0\nDifficulty 1 89.5 79.0\nDifficulty 2 66.7 56.2\nDifficulty 3 54.0 19.1\nTable 3: Task-Level Tips + Cosine (𝜏≥0.5, top-3): Test-Normal\nType Task Goal Scenario Goal\nAggregate 66.7 48.2\nDifficulty 1 86.0 68.4\nDifficulty 2 70.8 56.2\nDifficulty 3 46.0 23.8\nTable 4: Task-Level Tips + Cosine (𝜏≥0.6): Test-Normal\nType Task Goal Scenario Goal\nAggregate 72.0 62.5\nDifficulty 1 91.2 84.2\nDifficulty 2 72.9 68.8\nDifficulty 3 54.0 38.1\nTable 5: Task-Level Tips + Cosine (𝜏≥0.5): Test-Normal\nType Task Goal Scenario Goal\nAggregate 70.2 57.1\nDifficulty 1 91.2 84.2\nDifficulty 2 64.6 43.8\nDifficulty 3 55.6 42.9\nThe memory-enhanced agent achieves 73.2% TGC compared to 69.6% for the baseline (+3.6 pp) and 64.3% SGC compared to 50.0% (+14.3 pp). The larger SGC improvement suggests that the memory system not only helps agents complete individual tasks correctly but substantially improves consistency across task variants within scenarios.\nThe three cosine similarity configurations reveal important interactions between threshold, top-𝑘 selection, and task complexity.\nTop-𝑘 restriction hurts performance. The most restrictive configuration (𝜏≥0.5, top-3) performs below the baseline at the aggregate level (66.7% TGC, 48.2% SGC), a drop of −2.9 pp and −1.8 pp respectively. The top-3 restriction limits the agent to tips from only three matched task descriptions, which may exclude relevant guidance. This is especially damaging for complex tasks:",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. 
Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 30,
    "total_chunks": 44,
    "char_count": 2412,
    "word_count": 353,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "382b10d0-0ded-4d22-a72d-74ddac22aeb1",
    "text": "Difficulty 3 drops to 46.0% TGC (−8.0 pp from baseline).\nThreshold 𝜏 = 0.6 is the sweet spot. The configuration with 𝜏≥0.6 (no top-𝑘 restriction) achieves the strongest overall results among cosine similarity configurations: 72.0% TGC (+2.4 pp) and 62.5% SGC (+12.5 pp). This threshold strikes an effective balance: tight enough to exclude tips from unrelated tasks, yet loose enough to capture semantically equivalent task descriptions that differ lexically (e.g., \"I want an Amazon Prime membership\" and \"Sign me up for Amazon Prime\"). The Difficulty 3 SGC improvement is striking: 19.1% → 38.1% (+19.0 pp), a 99% relative increase.\nLower threshold includes noise.\nSince SGC requires all variants to pass, it is sensitive to sporadic failures—exactly the brittleness that learned tips help mitigate.\nThe benefits scale with task complexity. Difficulty 1 tasks show improvements of +1.7 pp TGC and +10.5 pp SGC, with the baseline already achieving high TGC. Difficulty 2 tasks show +4.1 pp TGC with no SGC change, benefiting from learned patterns around error handling and prerequisite verification. Difficulty 3 tasks show the most dramatic improvements: +4.7 pp on TGC and a remarkable +28.5 pp on SGC (19.1% → 47.6%), a 149% relative increase. These complex tasks require sophisticated planning and robust",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 31,
    "total_chunks": 44,
    "char_count": 1305,
    "word_count": 200,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "25f0ae16-fd9f-408c-a4f6-cbfb0172bd74",
    "text": "error recovery—areas where the memory system provides the most guidance.\n4.2.2 Task-Level Tips with Cosine Similarity Retrieval. We next evaluate task-level tips with cosine similarity retrieval. Task-level tips extract holistic insights from entire trajectories rather than decomposing them into subtasks. At retrieval time, the incoming task description is embedded and compared against stored task description embeddings; tips from descriptions exceeding a similarity threshold 𝜏 are retrieved. We evaluate three retrieval parameter\nDropping to 𝜏≥0.5 (no top-𝑘) yields 70.2% TGC (+0.6 pp) and 57.1% SGC (+7.1 pp)—better than the baseline but weaker than 𝜏≥0.6 on both metrics. The lower threshold admits tips from marginally related tasks, diluting the signal. Interestingly, Difficulty 3 TGC is slightly higher with 𝜏≥0.5 (55.6%) than with 𝜏≥0.6 (54.0%), suggesting that for the most complex tasks, casting a wider net occasionally surfaces useful tips from loosely related tasks. However, the reverse pattern holds for Difficulty 2 (64.6% vs. 72.9%), where the noise from irrelevant tips is more damaging.",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 32,
    "total_chunks": 44,
    "char_count": 1111,
    "word_count": 161,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "787bf578-fb05-4498-9781-1fb86467bacf",
    "text": "for this by reasoning about the overall task context and ensuring consistent tip selection across variants.\nAll configurations substantially outperform the baseline, confirming that the memory system provides genuine value regardless of the specific configuration chosen. The best configuration depends on the deployment objective: subtask-level tips with LLM-guided selection for the best overall performance, subtask-level tips with cosine similarity for the highest individual task accuracy at lower retrieval cost, or task-level tips with cosine similarity for a strong balance without LLM retrieval overhead.\n4.3 Source Partition Results (Train and Dev)\nThe train and dev partitions were used during tip generation: tips were extracted from agent trajectories on these tasks. Results on these partitions measure a distinct scenario from test-normal: what happens when the agent encounters the same or structurally identical tasks again, augmented with tips derived from its own prior executions? This setting evaluates the memory system's ability to enable self-improvement on recurring tasks, complementing the generalization evaluation on test-normal.\nTables 8–11 present results for subtask-level tips with LLM-guided selection on the source partitions.\nAs expected, improvements on the source partitions are larger than on test-normal: +4.4 pp TGC / +10.0 pp SGC on train, and +12.3 pp TGC / +26.3 pp SGC on dev. Tips are most contextually relevant when the agent encounters tasks structurally similar to those from which the tips were derived, so these larger gains are expected.\nTwo partition-specific patterns are worth noting. On train Difficulty 1 tasks where the baseline already achieves 100%, the memory-enhanced agent scores slightly lower (94.4% TGC, 83.3% SGC), suggesting that for simple tasks where the agent already performs optimally, injecting additional tips can introduce minor interference. On dev, the Difficulty 3 baseline already achieves 100% TGC and 100% SGC, so the aggregate dev gains (+12.3 pp TGC, +26.3 pp SGC) are driven entirely by Difficulty 1 and 2 improvements.\n4.2.3 Subtask-Level Tips with Cosine Similarity Retrieval. To isolate the effect of the retrieval strategy from the effect of tip granularity, we also evaluate subtask-level tips with cosine similarity retrieval (𝜏≥0.6, no top-𝑘)—the same retrieval parameters as the best task-level cosine configuration, but with subtask-level tips instead.\nTable 6: Subtask Tips + Cosine (𝜏≥0.6): Test-Normal\nType Task Goal Scenario Goal\nAggregate 73.8 57.1\nDifficulty 1 91.2 73.7\nDifficulty 2 72.9 56.2\nDifficulty 3 58.7 42.9\nThis configuration achieves 73.8% TGC (+4.2 pp over baseline)—the highest TGC of any configuration—and 57.1% SGC (+7.1 pp). Comparing with subtask-level tips with LLM-guided selection (Table 1) isolates the effect of the retrieval strategy while holding tip granularity constant: TGC is slightly higher with cosine retrieval (73.8% vs. 73.2%), but SGC drops substantially (57.1% vs. 64.3%, a 7.2 pp gap). This divergence is most pronounced on Difficulty 3, where SGC drops from 47.6% to 42.9%. The LLM-guided selection's ability to reason about task context and prioritize tip categories appears critical for cross-variant consistency, even though simple cosine retrieval suffices (and marginally excels) for individual task completion.\n4.2.4 Configuration Comparison. Table 7 compares all configurations on the held-out test-normal partition, using 𝜏≥0.6 (no top-𝑘) for both cosine similarity configurations.\nThe three configurations reveal a clear separation between what drives task goal completion versus scenario goal completion.\nTip granularity drives TGC. Subtask-level tips outperform task-level tips on TGC regardless of retrieval strategy: 73.8% (cosine) and 73.2% (LLM-guided) versus 72.0% (task-level cosine). The finer",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 33,
    "total_chunks": 44,
    "char_count": 3984,
    "word_count": 574,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "6fcabf84-0f2a-42d0-b357-0a6958bb35f9",
    "text": "grained decomposition into reusable subtask patterns provides more targeted guidance for completing individual tasks, particularly for Difficulty 3 tasks where subtask-level tips yield 58.7% TGC versus 54.0% for task-level (+4.7 pp).\nRetrieval strategy drives SGC. LLM-guided selection dramatically improves scenario goal completion compared to cosine similarity at the same tip granularity: 64.3% versus 57.1% for subtask-level tips (+7.2 pp). This gap is consistent across difficulty levels, with Difficulty 1 showing the largest difference (89.5% vs. 73.7%, +15.8 pp).\nIn both cases, the overall gains on the tasks that benefit from memory substantially outweigh any ceiling or interference effects.\n4.4 Cross-Configuration Summary\nTable 12 summarizes the aggregate improvements for subtask-level tips with LLM-guided selection across all three partitions.\nSeveral observations emerge. First, the memory system improves performance on all three partitions, confirming that the benefits",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 34,
    "total_chunks": 44,
    "char_count": 993,
    "word_count": 137,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "401e8083-ae89-497f-9041-7f99c60d6055",
    "text": "are not limited to tasks that generated the tips. The test-normal gains (+3.6 TGC, +14.3 SGC) demonstrate genuine generalization to unseen tasks. Second, the source partitions show larger TGC improvements, as expected—tips are most contextually relevant when the agent re-encounters tasks from which the tips were derived. Interestingly, the test-normal SGC gain (+14.3 pp) exceeds the train SGC gain (+10.0 pp), suggesting that the subtask-level decomposition and LLM-guided retrieval generalize particularly well for improving cross-variant consistency. Third, the SGC improvements consistently exceed the TGC improvements across all partitions, indicating that the memory system is particularly effective at improving consistency across task variants.\nThe LLM's ability to reason about task context, prioritize tip categories, and apply metadata filters produces more consistent guidance across task variants within a scenario, reducing the sporadic failures that SGC penalizes.\nInteraction effect. Interestingly, task-level tips with cosine similarity achieve higher SGC (62.5%) than subtask-level tips with cosine similarity (57.1%), despite lower TGC. Task-level tips encode holistic end-to-end strategies that promote uniform execution patterns across related task variants, while subtask-level tips—though more precise for individual task completion—may retrieve different subsets of subtask tips for different variants of the same scenario, introducing behavioral variance. LLM-guided selection compensates",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 35,
    "total_chunks": 44,
    "char_count": 1521,
    "word_count": 202,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "4f2e5e73-f795-47ce-9efe-7fed3bce0287",
    "text": "Table 7: Configuration Comparison on Test-Normal (Aggregate)\nTip Granularity Retrieval Strategy TGC Δ TGC SGC Δ SGC\nBaseline (no memory) 69.6 — 50.0 —\nSubtask-level LLM-guided selection 73.2 +3.6 64.3 +14.3\nSubtask-level Cosine sim. (𝜏≥0.6) 73.8 +4.2 57.1 +7.1\nTask-level Cosine sim. (𝜏≥0.6) 72.0 +2.4 62.5 +12.5\nTable 8: Subtask Tips + LLM Selection: Train\nType Task Goal Scenario Goal\nAggregate 91.1 83.3\nDifficulty 1 94.4 83.3\nDifficulty 2 88.9 83.3\nDifficulty 3 88.9 83.3\nTable 9: Baseline Agent (No Memory): Train\nType Task Goal Scenario Goal\nAggregate 86.7 73.3\nDifficulty 1 100.0 100.0\nDifficulty 2 77.8 58.3\nDifficulty 3 77.8 50.0\n5.1 Memory Taxonomies and Architectures\nTwo recent surveys provide comprehensive taxonomies of memory in LLM-based agents. Zhang et al. [17] organize the design space along three dimensions—memory sources (agent-environment interactions, internal reasoning, user feedback), memory forms (natural language, embeddings, databases, structured knowledge), and memory operations (read, write, reflect, manage)—and identify key limitations of existing work: overly simplistic representations, unsophisticated operations for deciding what to remember or forget, and fragmented evaluation. Du et al. [4] take a complementary operations-centric view, defining six atomic memory operations: consolidation, updating, indexing, forgetting, retrieval, and compression. In their vocabulary, our tip extraction constitutes a form of consolidation (converting raw trajectories into abstract tips), tip refinement is updating, and selective retention is forgetting. Both surveys note that most existing systems store raw or lightly processed text, lacking the structured abstraction and quality-aware curation that effective agent memory requires. Our framework directly addresses these identified gaps.\nTable 10: Subtask Tips + LLM Selection: Dev",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. 
Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 36,
    "total_chunks": 44,
    "char_count": 1940,
    "word_count": 270,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3829f22a-fe01-4b22-a52a-b31e90bcfa40",
    "text": "Type Task Goal Scenario Goal\nAggregate 89.5 73.7\nDifficulty 1 90.0 80.0\nDifficulty 2 87.5 62.5\nDifficulty 3 100.0 100.0\nTable 11: Baseline Agent (No Memory): Dev\nType Task Goal Scenario Goal\nAggregate 77.2 47.4\nDifficulty 1 80.0 60.0\nDifficulty 2 70.8 25.0\nDifficulty 3 100.0 100.0\n5.2 Semantic Memory Systems\nThe most widely deployed agent memory systems operate at the semantic level, storing factual knowledge extracted from interactions. Mem0 [2] extracts and consolidates factual snippets—user preferences, entities, relationships—from conversations into a vector store, achieving strong latency and token efficiency for conversational personalization. A-MEM [15] introduces a self-organizing memory architecture inspired by the Zettelkasten method, where each memory is stored as a structured note with contextual descriptions, keywords, and explicit links to related memories, creating an emergent knowledge network. While both systems are well-engineered for their purposes, they fundamentally store declarative knowledge (what is known) rather than procedural or experiential knowledge (what to do and what was learned from doing it).",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. 
Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 37,
    "total_chunks": 44,
    "char_count": 1148,
    "word_count": 164,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "01a6528d-dea4-436d-8a5f-4120cff81fa1",
    "text": "They have no mechanism for analyzing execution trajectories, performing causal attribution of failures, or generating categorized behavioral guidance. Our framework addresses this gap by extracting structured, actionable tips from execution experience rather than conversational facts.\ntips and strategy tips encode prerequisite verification and error handling patterns that reduce behavioral variance, enabling the agent to reliably complete all variants rather than succeeding on some and failing on others.\n5 Related Work\nOur work sits at the intersection of agent memory systems, trajectory-based learning, and self-improving agents. We organize related work along three axes: memory architectures for LLM agents, systems that learn from execution trajectories, and approaches to agent self-improvement through experience.\n5.3 Learning from Execution Trajectories\nA growing body of work addresses how agents can learn from their past execution traces, which is most directly related to our contribution.\nWorkflow and procedure extraction. Agent Workflow Memory (AWM) [13] extracts reusable multi-step workflows from successful agent trajectories in web navigation, achieving 24.6% and 51.1% relative improvements on Mind2Web and WebArena respectively. AWM demonstrates a compelling \"snowball effect\" where simple workflows compose into more complex ones.\nTable 12: Summary of Aggregate Improvements: Subtask Tips + LLM Selection\nPartition Task Goal (Baseline) Task Goal (+Memory) Scenario Goal (Baseline) Scenario Goal (+Memory)\nTest-Normal 69.6 73.2 (+3.6) 50.0 64.3 (+14.3)\nTrain 86.7 91.1 (+4.4) 73.3 83.3 (+10.0)\nDev 77.2 89.5 (+12.3) 47.4 73.7 (+26.3)\nselective deletion yields a 10% absolute performance gain over naive memory growth. These findings directly motivate our structured",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 38,
    "total_chunks": 44,
    "char_count": 1927,
    "word_count": 264,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "306e733d-21c4-4cca-a21d-291e77bf7bbf",
    "text": "approach: by extracting abstract tips with explicit applicability conditions rather than storing raw trajectories, and by categorizing tips with metadata for precise contextual matching, our framework mitigates both failure modes.\n6 Conclusions\nWe presented a framework for automatically extracting actionable learnings from LLM-agent execution trajectories and storing them as structured memory tips that improve future agent performance. Our four-component pipeline—trajectory intelligence extraction, decision attribution analysis, contextual learning generation, and adaptive memory retrieval—captures the full spectrum of learning opportunities across failures, recoveries, inefficient successes, and clean successes. Evaluation on the AppWorld benchmark demonstrates consistent improvements, with up to 14.3 percentage point gains in scenario goal completion on held-out tasks, and particularly strong benefits on complex, multi-step tasks (28.5 pp SGC improvement, a 149% relative increase). The framework naturally extends\nHowever, AWM only learns from successful trajectories—it has no mechanism for extracting lessons from failures, recoveries, or inefficient executions. Mem𝑝 [5] treats procedural memory as a first-class optimization object, systematically exploring strategies for building memory from trajectories, retrieving relevant procedures, and updating entries over time. While Mem𝑝 addresses the full memory lifecycle, it focuses on procedural instructions (\"how to do X\") rather than the diagnostic behavioral insights (\"what went wrong and why\") that our tip categories capture. AgentRR [6] borrows the record-and-replay paradigm from software engineering, recording complete agent interaction traces and summarizing them into structured experiences for future replay. Like AWM, it primarily learns from successful executions.\nReasoning and strategy extraction. ReasoningBank [9] is among the closest works to ours, distilling generalizable reasoning strategies from an agent's self-judged successful and failed experiences. It shares our insight that agents should learn from both successes and failures.",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. 
Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 39,
    "total_chunks": 44,
    "char_count": 2135,
    "word_count": 276,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "fbfb0611-da63-47b0-b555-44c9ff71f25a",
    "text": "The key distinction is in abstraction level: ReasoningBank focuses on meta-cognitive reasoning strategies, while our tips focus on concrete behavioral guidance derived from specific execution patterns. The two approaches are complementary.\nContext engineering and self-improvement. ACE (Agentic Context Engineering) [16] treats an agent's context as an evolving \"playbook\" that accumulates and refines strategies through a generate-reflect-curate cycle, achieving a 10.6 percentage point improvement on AppWorld. Our framework differs from ACE in several respects: we produce structured memory entries with typed categories (strategy, recovery, optimization), rich metadata, and selective retrieval rather than an evolving text document included\nto multi-agent systems with cross-agent attribution and agent-role-aware guidance, which we leave to future work. We also plan to evaluate the framework with additional state-of-the-art and open-source models—such as Qwen [12] and GPT-OSS [1]—to assess how tip quality and retrieval effectiveness vary across model families. The techniques described in this paper are being applied to IBM's Configurable Generalist Agent (CUGA) [7, 8] platform for building and deploying enterprise agentic systems, where trajectory-informed memory enables agents to continuously improve from operational experience.\nReferences\n[1] . 2025.",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. 
Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 40,
    "total_chunks": 44,
    "char_count": 1369,
    "word_count": 183,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ee0eabd2-682f-441b-a36c-4c30913efdac",
    "text": "in full; we perform explicit causal attribution tracing outcomes to specific decisions; and we maintain provenance tracking from tips to source trajectories.\nExperience replay with learned retrieval.\nTODO: Add GPT-OSS reference. Placeholder — please replace with the correct GPT-OSS citation.\n[2] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. 2025. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv preprint arXiv:2504.19413 (2025).",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 41,
    "total_chunks": 44,
    "char_count": 494,
    "word_count": 66,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "585e6976-1fab-404b-8172-dccc170b0073",
    "text": "Memento [18] introduces a memory-augmented MDP formalization where a learned neural policy selects which stored trajectories to retrieve for a given task. However, Memento stores raw trajectories without abstracting them into transferable insights—the consolidation from trajectory to actionable lesson is left to the LLM's in-context reasoning.\n5.4 Empirical Foundations\nXiong et al. [14] provide critical empirical grounding for trajectory-based memory systems, identifying the experience-following property and two failure modes: error propagation and misaligned experience replay. They find that combining selective addition with\n[3] Chad DeChant. 2025. Episodic Memory in AI Agents Poses Risks That Should Be Studied and Mitigated. arXiv preprint arXiv:2501.11739 (2025).\n[4] Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sébastien Montella, Mirella Lapata, Kam-Fai Wong, and Jeff Z. Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions. arXiv preprint.\n[5] Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. 2025. Mem𝑝: Exploring Agent Procedural Memory. arXiv preprint arXiv:2508.06433 (2025).\n[6] Erhu Feng, Wenbo Zhou, Zibin Liu, Le Chen, Yunpeng Dong, Cheng Zhang, Yisheng Zhao, Dong Du, Zhichao Hua, Yubin Xia, and Haibo Chen. 2025. Get Experience from Practice: LLM Agents with Record & Replay. arXiv preprint arXiv:2505.17716 (2025).\n[7] IBM. 2025. CUGA: Configurable Generalist Agent. https://github.com/cugaproject/cuga-agent.\n[8] Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov, Ido Levy, Offer Akrabi, Aviad Sela, Asaf Adi, and Nir Mashkif. 2025. Towards Enterprise-Ready Computer Using",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 42,
    "total_chunks": 44,
    "char_count": 1709,
    "word_count": 239,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "53d3c5c1-ee4c-432c-80b6-26e444c6139a",
    "text": "Technical Report, Published Feb 2026, Yorktown Heights, NY Fang et al.\nGeneralist Agent. arXiv preprint arXiv:2503.01861 (2025).\n[9] Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister. 2025. ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory. arXiv preprint arXiv:2509.25140 (2025).\n[10] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E.\nAgents: An Empirical Study of Experience-Following Behavior. arXiv preprint arXiv:2505.16067 (2025).\n[15] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. 2025. A-MEM: Agentic Memory for LLM Agents. arXiv preprint arXiv:2502.12110 (2025).\n[16] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 43,
    "total_chunks": 44,
    "char_count": 980,
    "word_count": 138,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "cee74117-fc84-4456-b652-b5a6974c225e",
    "text": "MemGPT: Towards LLMs as Operating Systems. arXiv preprint arXiv:2310.08560 (2023).\n[11] Mathis Pink, Qinyuan Wu, Vy Ai Vo, Javier Turek, Jianing Mu, Alexander Huth, and Mariya Toneva. 2025. Position: Episodic Memory is the Missing Piece for Long-Term LLM Agents. arXiv preprint arXiv:2502.06975 (2025).\n[12] Qwen Team. 2025. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115 (2025).\n[13] Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. 2024. Agent Workflow Memory. arXiv preprint arXiv:2409.07429 (2024).\n[14] Zidi Xiong, Yuping Lin, Wenya Xie, Pengfei He, Jiliang Tang, Himabindu Lakkaraju, and Zhen Xiang. 2025. How Memory Management Impacts LLM\nThakker, James Zou, and Kunle Olukotun. 2025. Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. arXiv preprint arXiv:2510.04618 (2025).\n[17] Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. 2025. A Survey on the Memory Mechanism of Large Language Model based Agents. ACM Transactions on Information Systems (TOIS) (2025). doi:10.1145/3748302 arXiv:2404.13501.\n[18] Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, and Jun Wang. 2025. Memento: Fine-tuning LLM Agents without Fine-tuning LLMs. arXiv preprint arXiv:2508.16153 (2025).",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 44,
    "total_chunks": 44,
    "char_count": 1365,
    "word_count": 196,
    "chunking_strategy": "semantic"
  }
]