diff --git a/data/chunks/2603.10528_semantic.json b/data/chunks/2603.10528_semantic.json
new file mode 100644
index 0000000000000000000000000000000000000000..c15194a0d5b6911610b124287cb919bd236cd9c3
--- /dev/null
+++ b/data/chunks/2603.10528_semantic.json
@@ -0,0 +1,326 @@
[
  {
    "chunk_id": "30bd0843-e564-47a1-bc8a-7d39f99c9119",
    "text": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery
Islam Guven, Mehmet Parlak
ICTEAM, Université catholique de Louvain, Ottignies-Louvain-la-Neuve, Belgium
islam.guven@uclouvain.be",
    "paper_id": "2603.10528",
    "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery",
    "authors": [
      "Islam Guven",
      "Mehmet Parlak"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10528v1",
    "chunk_index": 0,
    "total_chunks": 18,
    "char_count": 228,
    "word_count": 23,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f381634b-cfeb-494d-939b-054fc8270e28",
    "text": "Abstract—Unmanned aerial vehicles (UAVs) are increasingly used to support time-critical medical supply delivery, providing rapid and flexible logistics during emergencies and resource shortages. However, effective deployment of UAV fleets requires coordination mechanisms capable of prioritizing medical requests, allocating limited aerial resources, and adapting delivery schedules under uncertain operational conditions. This paper presents a multi-agent reinforcement learning (MARL) framework for coordinating UAV fleets in stochastic medical delivery scenarios where requests vary in urgency, location, and delivery deadlines. The problem is formulated as a partially observable Markov decision process (POMDP) in which UAV agents maintain awareness of medical delivery demands while having limited visibility of other agents due to communication and localization constraints. The proposed framework employs Proximal Policy Optimization (PPO) as the primary learning algorithm and evaluates several variants, including asynchronous extensions, classical actor–critic methods, and architectural modifications, to analyze scalability and performance trade-offs. The model is evaluated using real-world geographic data from selected clinics and hospitals extracted from the OpenStreetMap dataset. The framework provides a decision-support layer that prioritizes medical tasks, reallocates UAV resources in real time, and assists healthcare personnel in managing urgent logistics. Experimental results show that classical PPO achieves superior coordination performance compared to asynchronous and sequential learning strategies, highlighting the potential of reinforcement learning for adaptive and scalable UAV-assisted healthcare logistics.
Index Terms—Multi-agent reinforcement learning (MARL), UAV coordination, swarm, autonomous drone delivery, medical supply delivery, healthcare logistics, dynamic task allocation, proximal policy optimization, reward shaping, time-critical delivery, stochastic logistics, drone delivery systems.
I. INTRODUCTION
Unmanned aerial vehicles (UAVs) are increasingly utilized in autonomous navigation, mission-critical data collection, and real-time environmental monitoring applications such as precision agriculture [1]. Beyond sensing and monitoring tasks, UAVs are also emerging as a promising solution for time-critical logistics, particularly in the distribution of medical supplies from central depots to healthcare facilities.
natural disasters, or supply chain disruptions, when ground transportation infrastructure may be compromised and rapid response is essential for patient outcomes. Although UAVs provide rapid and flexible delivery capabilities independent of road networks, efficient coordination of multiple UAVs under dynamic operational constraints remains an open research problem, particularly in time-critical healthcare logistics.
Reliable medical supply chains are essential for maintaining effective healthcare services, particularly where rapid and flexible delivery of critical resources is required. Intelligent decision-support systems are therefore needed to assist medical practitioners in allocating limited resources and coordinating logistics operations efficiently [2], [3]. Beyond routing optimization, healthcare logistics requires integrated frameworks that support personnel allocation, automated sensing for inventory management, and adaptive supply chain control capable of responding to changing patient needs and clinical urgency.
While prior work in drone logistics has explored last-mile planning, facility siting, and fleet coordination [2]–[5], a key research gap remains in developing learning-based systems that jointly address clinical priority, strict delivery deadlines, payload constraints, and stochastic task arrivals under limited communication and information availability. Traditional optimization methods such as mixed-integer programming, metaheuristics, and genetic algorithms are effective for static UAV routing but often fail to adapt efficiently to dynamic medical supply requests with heterogeneous urgency levels. Each new task typically requires costly re-optimization, limiting their scalability for real-time healthcare logistics [2]. Prior multi-UAV systems, including our own previous work [6], leverage evolutionary approaches effectively in fixed-task settings but suffer from computational inefficiency when applied to highly dynamic, time-sensitive environments.
Recent advances in UAV routing and multi-agent reinforcement learning (MARL) demonstrate strong potential for scalable, adaptive decision-making. Wang et al. [7] introduced",
    "paper_id": "2603.10528",
    "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery",
    "authors": [
      "Islam Guven",
      "Mehmet Parlak"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10528v1",
    "chunk_index": 1,
    "total_chunks": 18,
    "char_count": 4668,
    "word_count": 559,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b7e31eab-38ad-43b1-b4e5-c3f74449900a",
    "text": "Such operations require the coordination of multiple delivery vehicles under strict time constraints and payload limitations. The challenge becomes even more critical during epidemic outbreaks,
the C-SPPO framework, a centralized reinforcement learning model that minimizes flight conflicts and delivery delays in large-scale logistics routing. Cui et al. [8] proposed a GCN-based policy network that improves multi-UAV task allocation efficiency under distance constraints, while Qiu et al. [9] developed a distributed cooperative UAV search and rescue framework robust to communication limitations. Gabler and Wollherr [10] emphasized decentralized actor–critic structures to enhance scalability and real-world deployability, and Kong and Sousa [11] demonstrated how UAVs can simultaneously perform package delivery and wireless coverage through deep Q-learning-based trajectory control. Recent surveys [5] have highlighted ongoing challenges in integrating autonomy, coordination, and real-world uncertainty.
MARL offers significant potential for adaptive decision-making in such contexts, yet designing a general framework for heterogeneous tasks that are dynamically assigned has not been investigated yet. This paper addresses the gap between existing MARL approaches and practical medical supply applications by presenting a unified learning-based framework for adaptive healthcare logistics optimization. We introduce a multi-agent reinforcement learning model that operates under stochastic demand, partial fleet observability, and strict delivery deadlines for real-world medical UAV operations.
To examine the trade-off between throughput and coordination performance, we also evaluated two distributed actor–learner architectures, Asynchronous PPO (APPO) [12] and IMPALA [13], and a classical actor–critic method (A2C) [14]. These frameworks are designed for large-scale environments where many agents collect experience in parallel.
TABLE I: Summary of Environment Parameters
Symbol | Description | Value
N | Number of UAVs | 5–20
G | Grid dimensions | 30 × 30
c | Cell size (m) | 400 m
v | UAV speed (m/s) | 50 m/s
Pmax | Max payload | 5 units
Rcomm | Comm. range | 400 m
Tmax | Max episode steps | 200
λ | Task arrival rate | 0.1–0.3
Kmax | Max active tasks | 10
p(t) | UAV payload level | 0–Pmax
xi(t) | UAV position | Grid cell
τ | Delivery task | —
u | Urgency class | {crit., urg., std.}
dcrit | Critical deadline | 10 steps
durg | Urgent deadline | 20 steps
dstd | Standard deadline | 50 steps
H | Number of hospitals | 4
D | Number of depots | 2
I0 | Initial inventory | 10 units
Td | Pickup/delivery time | 5 s
ρ | Consumption rate | 0.1
where each cell corresponds to a vertex v ∈ V, and each UAV can move to one of its four neighboring cells at each time step. The main components of the system are:
• Depots D ⊆ V: nodes where UAVs collect supplies and refuel.
• Clinics H ⊆ V: nodes where delivery requests originate.
• UAV fleet U = {1, 2, . . . , N}: each UAV has a maximum payload capacity Pmax and a discrete position xi(t) ∈ V.
• Delivery tasks T(t): dynamically appearing requests requiring pickup from a depot and delivery to a clinic before a deadline.
The contributions of this paper are as follows.
• A partially observable MDP formulation for multi-UAV medical delivery with full task visibility but partial fleet position awareness, modeling depot resupply, stochastic task arrivals, and clinical urgency.
• A reward shaping framework with proximity guidance, distance-reduction bonuses, and urgency-based weighting that accelerates learning with minimal computational overhead.
• Experimental analysis with various MARL methods for observing the effect of network architecture, policy-update mechanism, and data collection in dynamic delivery missions.
Case Study: Brussels Operational Region
Figure 1 illustrates the case study region centered on the Brussels Capital Area. The area is modeled as a 12 km × 12 km grid, divided into 30 × 30 cells, each representing a",
    "paper_id": "2603.10528",
    "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery",
    "authors": [
      "Islam Guven",
      "Mehmet Parlak"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10528v1",
    "chunk_index": 2,
    "total_chunks": 18,
    "char_count": 3925,
    "word_count": 589,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "feb74d68-7e7f-47d4-94bf-d8cb0b354e17",
    "text": "The paper is organized as follows. Section II presents the problem formulation. Section III details the MARL framework with observation design and reward structure. Section IV presents experimental results.
This study models the coordination of multiple UAVs for real-time delivery of medical supplies in an urban environment. The system includes the operational characteristics of healthcare logistics: dynamic demand, time-critical deliveries, and limited UAV resources.
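The four-neighbour grid movement used throughout the system model can be sketched as follows; this is our own minimal illustration (the function and constant names are assumptions, not the authors' code), with the 30-cell grid dimension taken from Table I:

```python
# Minimal sketch (not the authors' code) of the grid navigation model:
# a 30x30 cell map where a UAV moves to one of its four neighbouring
# cells per step, restricted to the map border.
GRID = 30  # G: grid dimensions from Table I

def neighbours(cell):
    # 4-connected neighbours of (x, y), kept inside the grid
    x, y = cell
    deltas = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    return [(x + dx, y + dy) for dx, dy in deltas
            if 0 <= x + dx < GRID and 0 <= y + dy < GRID]
```

A corner cell has two reachable neighbours, an interior cell four, which is why border cells constrain routing more tightly than the open grid.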
The framework provides the mathematical model for the reinforcement learning formulation presented in Section III.",
    "paper_id": "2603.10528",
    "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery",
    "authors": [
      "Islam Guven",
      "Mehmet Parlak"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10528v1",
    "chunk_index": 3,
    "total_chunks": 18,
    "char_count": 587,
    "word_count": 79,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f4c8a263-4449-4057-9157-d805fd6c98da",
    "text": "Environment Representation
Table I summarizes the environment parameters. The environment is represented as a grid-based graph G = (V, E). Green markers indicate depots (hospitals with storage and refilling capacity), while red markers represent clinics that request supplies. Each delivery or pick-up is accounted for with extra flight time Td due to altitude changes.
Fig. 1: Operational region centered around the Brussels Capital Region. The grid defines UAV navigation cells with depots (green) and clinics (red).",
    "paper_id": "2603.10528",
    "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery",
    "authors": [
      "Islam Guven",
      "Mehmet Parlak"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10528v1",
    "chunk_index": 4,
    "total_chunks": 18,
    "char_count": 516,
    "word_count": 76,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "677d6d44-8b7b-47f7-aedb-8b6c673fdcd3",
    "text": "Communications assume a disc model. When two UAVs are within a distance Rcomm, they can communicate and share their information with each other. For battery considerations, a flight time limit is given from take-off until landing.
C. Medical Dynamic Task Model
Each delivery request is represented as a medical task τ
defined by a pickup location psource, a target hospital ptarget, an urgency class u, a creation time tcreated, and a deadline tdeadline:
τ = (psource, ptarget, u, tcreated, tdeadline).
For reference, we assumed the batteries have a capacity of 0.5 kWh and each movement and delivery costs 0.8 Wh of energy in our test scenario. A more detailed analysis of battery and altitude considerations will be part of future work.
New tasks arrive stochastically with probability λ at each time step, reflecting the irregular and unpredictable
E.",
    "paper_id": "2603.10528",
    "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery",
    "authors": [
      "Islam Guven",
      "Mehmet Parlak"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10528v1",
    "chunk_index": 5,
    "total_chunks": 18,
    "char_count": 847,
    "word_count": 134,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b096f96c-0b89-42c0-8a37-076564267c6c",
    "text": "Objective Function
The system aims to maximize the overall delivery performance across all UAVs and tasks.
nature of clinical demand. The urgency level u ∈ {critical, urgent, standard} determines the feasible delivery window. The deadline is assigned as",
    "paper_id": "2603.10528",
    "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery",
    "authors": [
      "Islam Guven",
      "Mehmet Parlak"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10528v1",
    "chunk_index": 6,
    "total_chunks": 18,
    "char_count": 253,
    "word_count": 38,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0ad35504-f622-4fee-acef-d79c1f97731e",
    "text": "tdeadline = tcreated + ∆(u),
where ∆(u) is the allowed time for the task and ∆crit < ∆urg < ∆std. Critical tasks correspond to life-saving items such as blood units or emergency medication, whereas standard tasks represent routine consumables with more flexible timing.
Hospitals maintain initial inventories I0 that decrease at each step according to a consumption rate ρ, which models continuous clinical use of supplies:
Ih(t + 1) = max(0, Ih(t) − ρ(1 + |Ph(t)|/10)),
where Ph(t) denotes the set of patients waiting at hospital h. As inventories decline, hospitals generate new tasks with urgency linked to patient conditions and required treatment types.
The total objective balances three key components:
J = Rdeliveries + Rurgency − Cdelays − Cinefficiency.
Here:
• Rdeliveries: rewards for successful and timely deliveries.
• Rurgency: additional bonuses for critical and urgent tasks.
• Cdelays: penalties for late or expired deliveries.
• Cinefficiency: penalties for unnecessary movement or idling.
This formulation encourages UAVs to complete high-priority deliveries first while maintaining efficiency in motion and resource usage.
At each time step, the system must decide how UAVs should move and which task to pick up. These sequential
and uncertain decisions form a dynamic control problem. Patients enter a waiting list upon arrival, each inheriting",
    "paper_id": "2603.10528",
    "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery",
    "authors": [
      "Islam Guven",
      "Mehmet Parlak"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10528v1",
    "chunk_index": 7,
    "total_chunks": 18,
    "char_count": 1360,
    "word_count": 208,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b44d980d-1d47-4954-8ec6-391d23bef69f",
    "text": "a personal deadline du. If treatment cannot begin before this deadline, the patient is considered deceased, and a mortality penalty is applied to the learning agent.
Each task requires UAVs to pick up a supply package at a depot and deliver it to the designated hospital. Depots broadcast available tasks to all UAVs; agents are tasked with choosing actions based on urgency, distance, and remaining time to deadline.
To solve this, the UAV coordination process is represented as a Markov decision process (MDP), where the state captures UAV positions, payload levels, and current task information. Section III reformulates this system model within the reinforcement learning framework, defining the corresponding states, actions, transition dynamics, and reward structure used for multi-agent training.
UAV Operations and Constraints
Each UAV follows a periodic process:
1) Travel to a depot and collect available supplies.
2) Pick up an assigned delivery task if available.
3) Transport the package to the corresponding clinic.
4) Refill payload at a depot if capacity is low.
III. REINFORCEMENT LEARNING FORMULATION
A. Markov Decision Process Specification
The delivery task is formulated as a partially observable Markov decision process:",
    "paper_id": "2603.10528",
    "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery",
    "authors": [
      "Islam Guven",
      "Mehmet Parlak"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10528v1",
    "chunk_index": 8,
    "total_chunks": 18,
    "char_count": 1242,
    "word_count": 187,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "325dd8aa-65fa-41b2-8846-727ed55f5a50",
    "text": "M = ⟨N, S, {Ai}, P, R, {Ωi}, O, γ⟩,
where:
• N is the set of UAVs,
• S is the global state (all UAV positions, all task states, time),
• Ai = {up, down, left, right, stay} is agent i's action space,
• P is the state transition function,
• R : S × AN → R is the reward function,
• Ωi is agent i's observation space,
• O : S × i → Ωi is the observation function,
• γ = 0.99 is the discount factor.
The UAV's state at time t includes its grid position xi(t), remaining payload pi(t), and any assigned task τi(t). Movements are limited to one adjacent cell per step.
Payload evolves according to:
pi(t + 1) = Pmax if refilling at a depot; pi(t) − 1 if picking up a task; pi(t) otherwise.
A UAV can carry at most one active task at a time and must deliver it before its deadline. UAVs fly at a constant altitude.
B. Observation Space
Each UAV agent receives a compact observation vector at each time step. This vector provides the information required to make navigation and task-allocation decisions while coordinating with other UAVs in the environment.
TABLE II: Reward structure for UAV medical delivery tasks.
Category | Description | Value
Clinical outcomes (sparse rewards):
Delivery completion | Successful delivery | +50.0
Critical delivery bonus | Critical task completed | +20.0
Urgent delivery bonus | Urgent task completed | +10.0
Early delivery bonus | Remaining time before deadline | +5.0 × ratio
Deadline violation | Late or missed delivery | −15.0
Task discovery and progress (dense rewards):
Task proximity | Near pending task | +0.2 × proximity
Pickup success | Task picked up at depot | +5.0
Urgent task claim | Claim of urgent/critical task | +3.0
Distance reduction | Moved closer to target | +0.3 × distance gain
Progress movement | Step toward assigned target | +0.5
Resource management and penalties:
Refill at depot | Refill when payload is low | +2.0
Depot visit (low) | Visit depot when half-empty | +1.0
Movement cost | Per movement step | −0.001
Idle penalty | Idle away from depot | −0.01
Mortality penalty | Expired critical task | −20.0
The observation includes six main components:
• Positional information: normalized positions of UAVs based on last communication time.
• Own state: information about the UAV's internal status, including its current payload level (normalized between 0 and 1) and whether it is carrying a delivery task.
• Closest pending task: information about the nearest unassigned delivery request, such as the relative position of the pickup depot and destination hospital, the task's urgency level, and the remaining time before its deadline. This helps the UAV decide which pending task to prioritize based on distance and urgency.
• Current carried task: details about the task currently being delivered, including the relative position of the delivery target and the time remaining until the deadline. If the UAV is not carrying any task, this part of the observation remains zero.
• Closest depot: the relative position of the nearest depot, allowing the UAV to plan refilling or resupply actions when its payload is low.
• Closest hospital: the relative position of the nearest hospital or clinic, which supports decisions related to future deliveries or route planning.
C. Action Space
Each UAV executes one of five discrete actions per timestep:
Ai = {up, down, left, right, stay}.
Actions move the UAV one grid cell or keep it stationary. The environment handles task claiming, pickup, delivery, and refilling automatically when position and state conditions are satisfied. We used a discrete action space instead of a continuous model in order to focus on the long-term planning aspect of our model.
• Global context: general information about the environment, including the total number of active tasks, the",
    "paper_id": "2603.10528",
    "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery",
    "authors": [
      "Islam Guven",
      "Mehmet Parlak"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10528v1",
    "chunk_index": 9,
    "total_chunks": 18,
    "char_count": 3631,
    "word_count": 578,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "df505482-decc-45b0-bd6a-69d00b775e9c",
    "text": "proportion of pending tasks, and the normalized simulation time. This provides situational awareness and helps balance workload distribution across the UAV fleet.
Together, these components allow each agent to maintain awareness of its own operational status, nearby opportunities for delivery, and the overall mission context.
Furthermore, with cell movements being synchronized, UAVs only exchange information when all UAVs reach their waypoints, which decreases communication complexity and allows for synchronized updates.
C. Reward Shaping for Medical Delivery
E. Training Algorithms
We implemented a family of policy gradient algorithms using Ray RLlib [15] to study both architectural and systems-level design choices.
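A configuration in the spirit of this setup might look like the following hedged sketch. The environment id and shared policy id are hypothetical, while gamma, the learning rate, and the MLP layer sizes echo values stated in the experimental section of the paper (RLlib 2.x API):

```python
# Hypothetical RLlib configuration fragment -- not the authors' code.
# 'uav_medical_delivery' and 'shared_ppo' are made-up identifiers.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment('uav_medical_delivery')   # assumed registered env id
    .training(gamma=0.99, lr=3e-4,
              model={'fcnet_hiddens': [512, 256, 128]})
    .multi_agent(
        policies={'shared_ppo'},           # one shared policy network...
        policy_mapping_fn=lambda aid, *a, **kw: 'shared_ppo',  # ...for every UAV
    )
)
# algo = config.build(); algo.train()
```

The single shared policy id is what makes the mapping centralized at training time while each UAV still acts on its own local observation.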
The synchronous on-policy baseline is Proximal Policy Optimization (PPO), instantiated with a three-layer multilayer perceptron (MLP) policy. Two architectural variants of PPO were considered:
• PPO Large FCNet: a deeper fully connected network with hidden layers of size [512, 512, 256] sharing parameters between actor and critic. This variant tests whether additional capacity improves coordination under the same on-policy update rule.
• PPO LSTM: a recurrent policy based on an LSTM-containing RLModule with stacked dense layers and a 256-unit LSTM cell. This configuration models temporal dependencies such as task queues and approaching deadlines.
The learning process combines sparse clinical rewards with dense shaping rewards to guide UAV agents toward efficient and meaningful behaviors. Table II summarizes the reward components used during training. This reward design encourages UAV agents to prioritize critical medical deliveries while maintaining efficient resource use and coordinated movement. The largest rewards are given upon successful and timely deliveries. Urgency-based bonuses are also given to prioritize critical and urgent tasks.
Dense shaping rewards provide continuous feedback even when deliveries are not yet completed. They help agents learn useful intermediate behaviors such as moving toward pending tasks, reducing distance to delivery targets, and refilling supplies before depletion.",
    "paper_id": "2603.10528",
    "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery",
    "authors": [
      "Islam Guven",
      "Mehmet Parlak"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10528v1",
    "chunk_index": 10,
    "total_chunks": 18,
    "char_count": 2138,
    "word_count": 295,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d0901646-f467-405d-bf67-46b75756a772",
    "text": "Small penalties for unnecessary movement and idling discourage inefficient actions. Finally, the mortality penalty models the severe consequence of missing critical deliveries, reinforcing the importance of meeting medical deadlines within the learning process.
To provide a classical on-policy baseline, we also include Advantage Actor–Critic (A2C), representing a low-complexity alternative to PPO. Finally, we evaluate two distributed actor–learner architectures:
• APPO: Asynchronous PPO with V-trace corrections and centralized learners. Experience is collected in parallel from multiple workers and consumed off-policy.
• IMPALA: an importance-weighted actor–learner architecture optimized for high-throughput sampling with V-trace targets.
All methods share the same observation and action spaces, reward structure (Table II), and discount factor γ = 0.99. Hyperparameters for each algorithm (batch sizes, entropy regularization, clipping coefficients, and gradient clipping) use RLlib defaults.
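A toy illustration of how a few of the Table II terms combine into a per-step scalar; this is our own simplification (only the completion, urgency, deadline, and distance-shaping terms), not the full reward:

```python
# Toy reward combining a subset of Table II terms (our simplification).
URGENCY_BONUS = {'critical': 20.0, 'urgent': 10.0, 'standard': 0.0}

def step_reward(delivered, urgency, closer_by, late):
    r = -0.001                                 # movement cost per step
    if delivered:
        r += 50.0 + URGENCY_BONUS[urgency]     # completion + urgency bonus
    if late:
        r -= 15.0                              # deadline violation
    r += 0.3 * closer_by                       # distance-reduction shaping
    return r
```

The sparse terms dominate the magnitude, while the small dense terms keep the gradient informative between deliveries, which is the shaping trade-off the text describes.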
Multi-agent training uses a centralized policy mapping in which each UAV runs its own copy of the policy network, which is evaluated in a decentralized setting.",
    "paper_id": "2603.10528",
    "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery",
    "authors": [
      "Islam Guven",
      "Mehmet Parlak"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10528v1",
    "chunk_index": 11,
    "total_chunks": 18,
    "char_count": 1174,
    "word_count": 155,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f788a940-69f3-4899-86e1-4c76aa1b5c00",
    "text": "RESULTS AND DISCUSSION
Fig. 2: Training Times for each algorithm
A. Experimental Setup
The experimental evaluation was conducted using a 30 × 30 grid representing a 12 km × 12 km urban area centered on Brussels, with each cell spanning 400 meters. The infrastructure consisted of 2 supply depots and 6 clinic locations. UAV parameters include a maximum payload of 5 units, a movement speed of 50 meters per second, and a communication range of 400 meters. Medical supply requests arrived stochastically with rate 0.2 per timestep, categorized into three urgency levels: critical (15%, ∆crit = 5 steps), urgent (35%, ∆urg = 10 steps), and standard (50%, ∆std = 20 steps).
C. Learning Performance
We first examine learning dynamics for a fixed fleet size. Fig. 3 shows the evolution of episode returns for 10 UAVs over 2,000,000 training steps. PPO exhibits clear convergence, improving from an initial average return of approximately −600 to around −200 as training progresses. In contrast, APPO and IMPALA remain close to their initial performance and fail to achieve meaningful learning progress in this setting.
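The stochastic request mix just described can be mimicked with a short stdlib sketch; this is our own illustration, but the probabilities and deadline windows are the stated 15/35/50 mix with 5/10/20-step deadlines:

```python
# Sketch of the stochastic request generator from the experimental setup:
# urgency drawn with probabilities 0.15/0.35/0.50, deadline = now + window.
import random

WINDOW = {'critical': 5, 'urgent': 10, 'standard': 20}  # steps

def sample_task(t_now, rng):
    u = rng.choices(list(WINDOW), weights=[0.15, 0.35, 0.50])[0]
    return u, t_now + WINDOW[u]    # (urgency class, absolute deadline)
```

Seeding the generator makes episodes reproducible while keeping arrivals nondeterministic across seeds, which matches the varying-mission-load behaviour described in the evaluation.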
This behavior indicates that in our setting, simple\n200 steps of movement until they have to return to their actor–critic updates (A2C) and off-policy V-trace corrections\ninitial positions.", + "paper_id": "2603.10528", + "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery", + "authors": [ + "Islam Guven", + "Mehmet Parlak" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10528v1", + "chunk_index": 12, + "total_chunks": 18, + "char_count": 1272, + "word_count": 201, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "02b822e1-7045-4337-bf6a-78affb0626cd", + "text": "Each mission ends when at least 15 tasks (APPO, IMPALA) are insufficient to stabilize learning under\nare completed and all tasks that have been assigned must the combined challenges of strict delivery deadlines, stochastic\nbe completed. Since the arrival of tasks are nondeterministic, task arrivals, and cooperative multi-agent assignments.\nsome missions can have significantly more tasks and more The clipped policy updates in PPO, together with carefully\nurgency, which represents the real-life environment in terms shaped rewards that penalize missed deadlines and inefficient\nof a health crisis. routing, enable more stable policy improvement and encourage\nexploration of coordinated delivery strategies. As training We use Ray RLlib [15] with four algorithms (A2C, PPO,\nprogresses, the learned policies gradually reduce missionAPPO, IMPALA) using 3-layer MLP architecture (512-256-\ncompletion time while increasing the fraction of successfully128 units), Adam optimizer with learning rate 0.0003, and\ndelivered medical supplies. This behavior suggests that PPOdiscount factor 0.99. Each configuration trained for 2,000,000\neffectively balances exploration and exploitation in environ-steps using 8 parallel workers with 2 environments each on\nour 32 core Intel Xeon Gold 6444Y CPU. 
Evaluation consists\nof 1000 episodes per configuration with action selection across\nfleet sizes of 4, 8, 12, and 16 UAVs.", + "paper_id": "2603.10528", + "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery", + "authors": [ + "Islam Guven", + "Mehmet Parlak" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10528v1", + "chunk_index": 13, + "total_chunks": 18, + "char_count": 1409, + "word_count": 200, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1e7e0a12-e500-468d-a27a-1464bfb22900", + "text": "Computational Analysis Fig. 2 shows training times in seconds for each algorithm. Asynchronous models are trained around 900 seconds regardless of number of agents whereas classical models have\nincreasing training times from 350 to 1200 seconds. Evaluation times for a single episode range from 0.5 to\n1.2 seconds depending on episode length for all algorithms. LSTM-based PPO models take 3 seconds per episode due to\nmore complex model structure. Fast training times allow the existing model to be quickly\ntrained on new scenarios whereas fast evaluation requires\nminimal computational capacity that can be used on most UAV\nFig. 3: Episode return mean.processors. Fig. 4: Performance comparison across different configurations of PPO for varying fleet sizes. (a) Average mission time across\nUAV counts. (b) Success rate for different RL algorithms. ments where agents must jointly allocate limited resources Experimental results demonstrate that synchronous onunder time constraints. policy learning using PPO consistently achieves reliable coFrom a systems perspective, these results highlight the ordination across different fleet sizes. 
importance of stable on-policy learning mechanisms for coordinating UAV fleets in time-critical healthcare logistics environments. In particular, PPO converges from an initial average return of approximately −600 to around −200 during training and achieves a 100% task completion rate during evaluation. The findings further indicate that policy stability",
    "paper_id": "2603.10528",
    "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery",
    "authors": ["Islam Guven", "Mehmet Parlak"],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10528v1",
    "chunk_index": 14,
    "total_chunks": 18,
    "char_count": 1500,
    "word_count": 215,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "68f956eb-649e-497f-b711-cd8a98eaf0b7",
    "text": "plays a crucial role in multi-agent scheduling problems where delivery deadlines, spatial constraints, and agent cooperation must be handled simultaneously. Consequently, the proposed MARL framework provides a promising approach for enabling reliable and scalable UAV-assisted medical logistics in dynamic operational settings.\nD. Mission Performance\nFigure 4 shows how algorithms scale with fleet size. The PPO success rate is 100% for all fleet sizes, with mission time decreasing from 1400 s to 800 s. We observe that effective workload distribution and coordination benefit from additional agents, as reflected in the decline of average mission times. Furthermore, we can observe that 15 UAVs provide nearly identical results to 20 UAVs. Mission duration decreases significantly as fleet size increases, dropping from approximately 1400 s with smaller fleets to about 800 s with larger fleets due to improved workload distribution among agents. In contrast, asynchronous approaches such as APPO and IMPALA fail to achieve meaningful convergence in this environment, highlighting the importance of stable on-policy updates and reward shaping for cooperative UAV coordination under strict deadlines. Computational analysis further shows that training can be completed within practical time scales, with asynchronous models requiring roughly 900 s and classical models ranging between 350 s and 1200 s depending on fleet size. Evaluation time remains low (0.5–1.2 s per episode), indicating that the learned policies can be executed on resource-constrained UAV",
    "paper_id": "2603.10528",
    "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery",
    "authors": ["Islam Guven", "Mehmet Parlak"],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10528v1",
    "chunk_index": 15,
    "total_chunks": 18,
    "char_count": 1557,
    "word_count": 224,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e98d27c5-e5c6-4e17-81c4-f3e1a314b84e",
    "text": "platforms in real time. These findings demonstrate that PPO-based coordination provides a practical and scalable approach for UAV-assisted medical logistics during emergencies. The large architecture variant of PPO (PPO Large FCNet) follows a similar trend and closely tracks standard PPO in terms of mission-level performance, with minor improvements. LSTM provided poorer results, whereas using a larger fully connected network on standard PPO improved the performance. This indicates that the benefits from sequential actions are minimal and the mission requires more adaptive decision-making.\nACKNOWLEDGMENT\nThis work was supported by the Brains for Brussels research and innovation funding program of the Région de Bruxelles-Capitale–Innoviris under Grant RBC/BFB 2023-BFB-2.\nCONCLUSION\nThis paper presented a MARL framework for time-critical UAV medical supply delivery in urban environments with\nREFERENCES\n[1] I. 
Parlak, \"Blockchain, AI and IoT Empowered Swarm Drones for Precision Agriculture Applications,\" in 2022 IEEE 1st Global Emerging Technology Blockchain Forum: Blockchain and Beyond (iGETblockchain), 2022, pp. 1–6.\nstochastic task arrivals and heterogeneous urgency levels. Using a realistic grid representation of a 12 km×12 km city environment with multiple depots and clinics, we evaluated several reinforcement learning algorithms for coordinating UAV fleets under payload, communication, and deadline constraints.\n[2] V. Garg et al., \"Drones in last-mile delivery: A systematic review on efficiency, accessibility and sustainability,\" Transportation Research Part D: Transport and Environment, 2023.\n[3] Sushma et al., \"Spatial drone path planning: A systematic review of parameters and algorithms,\" Journal of Transport Geography, 2025.\n[4] J. Gao, \"A survey on vehicle–drone cooperative delivery operations optimization: Models, methods, and future research directions,\" Swarm and Evolutionary Computation, vol. 92, p. 101780, 2025.\n[5] Z. Ning et al., \"A survey on multi-agent reinforcement learning and its applications,\" Neurocomputing, 2024.\n[6] İ. Yanmaz, \"Multi-objective path planning for multi-UAV connectivity and area coverage,\" Ad Hoc Networks, vol. 160, p. 103520, 2024.\n[7] F. Zhong, \"C-SPPO: A deep reinforcement learning framework for large-scale dynamic logistics UAV routing problem,\" Chinese Journal of Aeronautics, vol. 38, no. 5, p. 103229, May 2025.\n[8] Y. Zhao, \"Design of Multi-UAV Task Allocation Algorithm Based on Deep Reinforcement Learning,\" in 2025 6th International Conference on Computer Engineering and Application (ICCEA), Apr. 2025, pp. 440–443, ISSN: 2159-1288.\n[9] H. ",
    "paper_id": "2603.10528",
    "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery",
    "authors": ["Islam Guven", "Mehmet Parlak"],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10528v1",
    "chunk_index": 16,
    "total_chunks": 18,
    "char_count": 2624,
    "word_count": 362,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "db47b3d3-6bcf-4f05-a38b-0de0cfdffdbe",
    "text": "Jin, \"Distributed cooperative search method for multi-UAV with unstable communications,\" Applied Soft Computing, vol. 148, p. 110592, 2023.\n[10] V. Wollherr, \"Decentralized multi-agent reinforcement learning based on best-response policies,\" Front Robot AI, vol. 11, p. 1229026, Apr. 2024.\n[11] J. Sousa, \"Piggybacking on UAV Package Delivery Systems to Simultaneously Provide Wireless Coverage: A Deep Reinforcement Learning-Based Trajectory Design,\" in IEEE INFOCOM 2024 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), May 2024, pp. 1–6, ISSN: 2833-0587.\n[12] A. Han, \"Accelerating convergence in distributed reinforcement learning via Asynchronous PPO,\" in 2025 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), 2025, pp. 1059–1064.\n[13] L. Espeholt et al., \"IMPALA: Scalable distributed Deep-RL with importance weighted actor-learner architectures,\" 2018.\n[14] I. Grondman et al., \"A survey of actor-critic reinforcement learning: Standard and natural policy gradients,\" IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 6, pp. 1291–1307, 2012.\n[15] E. Liang et al., \"RLlib: Abstractions for distributed reinforcement learning,\" 2018. [Online]. 
Available: https://arxiv.org/abs/1712.09381",
    "paper_id": "2603.10528",
    "title": "UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery",
    "authors": ["Islam Guven", "Mehmet Parlak"],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10528v1",
    "chunk_index": 17,
    "total_chunks": 18,
    "char_count": 1294,
    "word_count": 163,
    "chunking_strategy": "semantic"
  }
]
\ No newline at end of file
diff --git a/data/chunks/2603.10535_semantic.json b/data/chunks/2603.10535_semantic.json
new file mode 100644
index 0000000000000000000000000000000000000000..71ed4f6201d61b74ba57dcec4a43289e54ec5547
--- /dev/null
+++ b/data/chunks/2603.10535_semantic.json
@@ -0,0 +1,2000 @@
[
  {
    "chunk_id": "b0d86285-bf5d-4c67-9c95-6b41d475b0ae",
    "text": "Tackling Length Inflation Without Trade-offs:\nGroup Relative Reward Rescaling for Reinforcement Learning",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": ["Zichao Li", "Jie Lou", "Fangchen Dong", "Zhiyuan Fan", "Mengjie Ren", "Hongyu Lin", "Xianpei Han", "Debing Zhang", "Le Sun", "Yaojie Lu", "Xing Yu"],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 0,
    "total_chunks": 74,
    "char_count": 104,
    "word_count": 12,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "4600ed8c-6ed4-4e11-b4df-f4f1ec145fde",
    "text": "Zichao Li 1 2 Jie Lou 3 Fangchen Dong 3 Zhiyuan Fan 3 Mengjie Ren 1 2 Hongyu Lin 1 Xianpei Han 1 Debing Zhang 3 Le Sun 1 Yaojie Lu 1 Xing Yu 3\nAbstract\n[Figure 1 panel: AIME-25: Score vs. Length (7B)]\nReinforcement learning significantly enhances LLM capabilities but suffers from a critical issue: length inflation, where models adopt verbosity or inefficient reasoning to maximize rewards. 
Prior approaches struggle to address this challenge in a general and lossless manner, primarily because additive penalties introduce a compensatory effect that creates optimization shortcuts, while heuristic gating strategies lack generality beyond binary feedback. To bridge this gap, we present Group Relative Reward Rescaling (GR3), which reframes length control as a multiplicative rescaling paradigm, effectively establishing a generalized, continuous, and reward-dependent gating mechanism. To further ensure lossless optimization, we incorporate group-relative regularization and advantage-aware calibration, which dynamically adapt length budgets to instance difficulty and preserve the advantage signal of high-quality trajectories. Empirically, across both RLHF and RLVR settings, GR3 maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.\nFigure 1. Comparison of GR3 with open-source efficient reasoning models, all trained on DeepSeek-R1-Distill-7B. GR3 pioneers a new paradigm that sustains stable performance gains under RL while simultaneously mitigating the length inflation issue. [In-figure annotation: GR3 breaks the performance–length trade-off; compared models include AdaptThink, Laser-DE, DLER, LC-R1, and R1-Distill.]\n1. Introduction",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": ["Zichao Li", "Jie Lou", "Fangchen Dong", "Zhiyuan Fan", "Mengjie Ren", "Hongyu Lin", "Xianpei Han", "Debing Zhang", "Le Sun", "Yaojie Lu", "Xing Yu"],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 1,
    "total_chunks": 74,
    "char_count": 2377,
    "word_count": 336,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "bb96416b-430d-4e7a-841c-eff68e43c3a3",
    "text": "Reinforcement learning (RL) (Bai et al., 2022; Guo et al., 2025) has become the engine of post-training for Large Language Models (LLMs) (Achiam et al., 2023; Team et al., 2023). Yet this engine exhibits a persistent flaw, which we term length inflation: a tendency for RL-trained models to produce unnecessarily lengthy trajectories, inflating inference costs without proportional gains in quality. This phenomenon arises across major RL paradigms. In RL from human feedback (RLHF) (Ouyang et al., 2022), models exploit reward-model biases toward verbosity, leading to reward hacking (Gao et al., 2023). In RL with verifiable rewards (RLVR) (Shao et al., 2024), length inflation instead stems from reasoning inefficiency (Sui et al., 2025), where models generate unnecessarily long chains of thought to marginally improve the likelihood of a correct solution. Prior work has sought to mitigate length inflation. One line of research trains reward models that are invariant to response length (Chen et al., 2024a; Liu et al., 2024). While effective in RLHF, this strategy does not extend to RLVR, where rewards are derived from ground-truth verifiers rather than learned proxies that can be debiased. A more general direction instead introduces explicit length penalties into the reward (Luo et al., 2025a; Liu et al., 2025c; Yi et al., 2025). However, most existing methods rely on coarse regularization, which leads to suboptimal optimization dynamics.\n1Chinese Information Processing Laboratory, Institute of",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": ["Zichao Li", "Jie Lou", "Fangchen Dong", "Zhiyuan Fan", "Mengjie Ren", "Hongyu Lin", "Xianpei Han", "Debing Zhang", "Le Sun", "Yaojie Lu", "Xing Yu"],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 2,
    "total_chunks": 74,
    "char_count": 969,
    "word_count": 147,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f5f7eebe-8a85-42c0-9ed0-9de82c2c316d",
    "text": "Software, Chinese Academy of Sciences. 2University of Chinese Academy of Sciences. 3Xiaohongshu Inc. Correspondence to: Jie Lou, Yaojie Lu. Preprint.\nA common design adopts additive shaping (Yu et al., 2025), augmenting the task reward with an additive length term (e.g., R′ = R − λℓ). This introduces decoupled incentives, creating a length-driven component that",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": ["Zichao Li", "Jie Lou", "Fangchen Dong", "Zhiyuan Fan", "Mengjie Ren", "Hongyu Lin", "Xianpei Han", "Debing Zhang", "Le Sun", "Yaojie Lu", "Xing Yu"],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 3,
    "total_chunks": 74,
    "char_count": 527,
    "word_count": 70,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "57fac061-7da9-494a-becc-f11fdbe2ed22",
    "text": "[Figure 2 panels: RLVR (Math), RLVR (Code), RLHF (Chat); curves: GR3 (Ours) vs. GRPO; y-axes: task reward and response length; x-axis: training step.]\nTraining dynamics of GR3, which retains GRPO's reward gains without loss while significantly reducing average tokens. The base models used for the two settings are DeepSeek-R1-Distill-1.5B and Qwen3-8B (without thinking mode), respectively.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": ["Zichao Li", "Jie Lou", "Fangchen Dong", "Zhiyuan Fan", "Mengjie Ren", "Hongyu Lin", "Xianpei Han", "Debing Zhang", "Le Sun", "Yaojie Lu", "Xing Yu"],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 4,
    "total_chunks": 74,
    "char_count": 513,
    "word_count": 86,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d72fb843-d592-48fb-a869-2009216f8803",
    "text": "on AIME-25) while simultaneously improving accuracy (e.g., +8 points), demonstrating that verbosity is not a prerequisite for intelligence. Furthermore, in RLHF settings, GR3 exhibits an adaptive length dynamic: it permits moderate growth when computation is beneficial but automatically curtails generation length as the policy matures (Figure 2). This mechanism effectively mitigates reward hacking via verbosity without sacrificing capability. We will release our code and model checkpoints to support future research.\nmakes extreme brevity an attractive shortcut independent of task success. To better align penalties with outcomes, some works propose heuristic gating (Cheng et al., 2025; Arora & Zanette, 2025), applying penalties only when R = 1. However, such designs are inherently limited to binary feedback and do not extend naturally to continuous-reward settings like RLHF. Moreover, many approaches rely on coarse control mechanisms, such as static truncation thresholds or uncalibrated penalty strengths (Liu et al., 2025b; Cheng et al., 2025), resulting in an inherent efficiency–performance trade-off, as illustrated in Figure 1. In summary, our contributions are threefold: 
• We propose GR3, a framework for lossless length control that substitutes additive penalties with multiplicative reward rescaling. This design eliminates compensatory optimization shortcuts and provides a unified mechanism for both binary and continuous rewards.\n• We develop an optimization-preserving strategy that integrates group-relative regularization with advantage-aware calibration, adapting constraints to on-policy statistics while preserving learning signals.\n• Across mathematical reasoning, code generation, and RLHF alignment tasks, GR3 yields concise generations while matching standard GRPO performance, shifting the efficiency–performance Pareto frontier.\nThese observations lead to a central question: Can we tackle length inflation in a general manner without compromising the capability gains of RL? In this work, we present Group Relative Reward Rescaling (GR3), a principled framework for lossless efficiency optimization. Rather than using additive penalties, GR3 regularizes length through multiplicative rescaling, which acts as a generalized gating mechanism and removes the compensatory shortcuts inherent to additive schemes. To further ensure lossless optimization, we introduce two fine-grained mechanisms. Specifically, we employ group-relative regularization, which normalizes length against on-policy statistics rather than rigid thresholds, thereby dynamically adapting the length budget to the inherent difficulty of each prompt. Complementing this, we introduce advantage-aware calibration to explicitly control the penalty strength.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": ["Zichao Li", "Jie Lou", "Fangchen Dong", "Zhiyuan Fan", "Mengjie Ren", "Hongyu Lin", "Xianpei Han", "Debing Zhang", "Le Sun", "Yaojie Lu", "Xing Yu"],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 5,
    "total_chunks": 74,
    "char_count": 2772,
    "word_count": 370,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0a6ef26a-0dff-480a-a5ca-717bcd9db203",
    "text": "This ensures that length regularization does not overturn the advantage signal of representative high-quality trajectories, thereby safeguarding stable optimization toward capability improvements. Empirically, GR3 resolves the efficiency–performance trade-\n2. Preliminary\n2.1. Group Relative Policy Optimization\nLLM generation can be formulated as a token-level Markov Decision Process (Puterman, 1990).",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": ["Zichao Li", "Jie Lou", "Fangchen Dong", "Zhiyuan Fan", "Mengjie Ren", "Hongyu Lin", "Xianpei Han", "Debing Zhang", "Le Sun", "Yaojie Lu", "Xing Yu"],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 6,
    "total_chunks": 74,
    "char_count": 403,
    "word_count": 48,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "05bbd512-5029-4e3d-9a87-64737cf969b5",
    "text": "off inherent in prior methods. Given a prompt x ∼ 
D, an autoregressive policy πθ generates a response y = (y1, . . . , yℓ) of length ℓ := |y| by sampling tokens from the policy. A scalar reward R(x, y) is defined over complete responses, and reinforcement learning aims to maximize the expected reward: max_θ E_{x∼D, y∼πθ(·|x)}[R(x, y)]. (1)\nAs shown in Figure 1, our approach significantly reduces token usage (e.g., over 40% on AIME-25).\nTable 1. Comparison of length-aware reward shaping methods (Liu et al., 2025c; Arora & Zanette, 2025; Aggarwal & Welleck, 2025; Team et al., 2025; Yu et al., 2025; Cheng et al., 2025). ℓ(i) is the response length of sample i. ℓmin, ℓmax, ¯ℓ, and σℓ are computed within each group. α, λ, ℓT and ℓC are fixed hyperparameters. s(·) denotes the sigmoid function.\nAdditive shaping, R̂(+) = R + λ·S:\nL1-Exact: S = −|ℓ(i) − ℓT|.\nDAPO: S = 0 if ℓ(i) ≤ ℓT − ℓC; S = (ℓT − ℓC − ℓ(i))/ℓC if ℓT − ℓC < ℓ(i) ≤ ℓT; S = −1 if ℓ(i) > ℓT.\nKimi-k1.5: S = 0.5 − (ℓ(i) − ℓmin)/(ℓmax − ℓmin) if R = 1; S = min(0.5 − (ℓ(i) − ℓmin)/(ℓmax − ℓmin), 0) if R = 0.\nTruncation: S = −I(R = 1) · I(ℓ(i) > ℓT).\nEfficiently: S = −I(R = 1) · s((ℓ(i) − ¯ℓ)/σℓ).\nLC-R1: S = I(R = 1) · (1 − ℓ(i)/ℓmax).\nMultiplicative shaping, R̂(×) = R · S:\nGR3 (Ours): S = 1/(1 + α · ℓ(i)/¯ℓ).\nWith the emergence of reasoning models such as DeepSeek-R1 (Guo et al., 2025), group-style RL has become prevalent for LLM post-training. Among them, Group Relative Policy Optimization (GRPO) (Shao et al., 2024) is widely adopted due to its scalability and its elimination of the need for a separate value model. For each prompt x, GRPO samples a group of G responses {y(i)}_{i=1..G} from an old policy πθold(· | x) and evaluates each by R(x, y(i)). It then constructs a group-relative advantage via within-group normalization: Â(i) = (R(x, y(i)) − µR)/σR, where µR := (1/G) Σ_j R(x, y(j)) and σR := std({R(x, y(j))}_{j=1..G}). (2)\nThe policy is optimized using a PPO-style clipped objective over the group-normalized advantages: J_GRPO(θ) = E_{x∼D, {y(i)}}[(1/G) Σ_i (1/|y(i)|) Σ_t min(r_{i,t}(θ)·Â(i), clip(r_{i,t}(θ), 1 − ε, 1 + ε)·Â(i)) − β·D_KL(πθ ∥ πref)], (3)\nwhere the importance sampling ratio is defined as r_{i,t}(θ) = πθ(y(i)_t | x, y(i)_{<t}) / πθold(y(i)_t | x, y(i)_{<t}). (4)\nA common strategy for mitigating length inflation in reinforcement learning is to explicitly regularize response length through reward shaping. From a unified perspective (Liu et al., 2025c), most existing approaches can be instantiated as additive shaping: R̂(+) = R + λ·S, λ > 0. (5)\nRS = RµS + R(S − µS) = µRµS + µS(R − µR) + R(S − µS), and subtract E[RS] = µRµS + σRS. Eq. (14) is Eq. (2) applied to R̂(×).\nRemark 3.3 (Why multiplicative shaping is reward-aware under group normalization). Under additive shaping, Proposition 3.1 shows that the length deviation (S − µS) is injected into the centered shaped reward with a fixed coefficient λ. This creates a compensatory degree of freedom: the policy can improve the shaped reward by manipulating S even when R provides little learning signal. In contrast, Proposition 3.2 yields the decomposition RS − E[RS] = R(S − µS) + µS(R − µR) − σRS.\nThe penalty takes the form S = 1/(1 + α · ℓ(i)/¯ℓ), α > 0, (15) where ℓ(i) is the response length and ¯ℓ is the group mean. This penalty decreases smoothly with length, while normalizing against ¯ℓ avoids arbitrary global thresholds and adapts to the model's current generation behavior.\nAs shown in Table 2, we include the fixed-threshold truncation method (Hou et al., 2025) as a minimalist baseline. We find that threshold-based truncation imposes a uniform maximum response length even on difficult benchmarks, which degrades reasoning performance on challenging problems. We also compare against other group-relative methods and find that certain shaping strategies (Arora & Zanette, 2025) introduce biases that favor shallow reasoning on easier benchmarks (see Appendix B for analysis). We evaluated another group-relative method, Kimi-k1.5 (Team et al., 2025), but it exhibited training collapse; therefore, we omit its results. We attribute this failure to the additive shaping paradigm without gating, as discussed in Section 3.1.\n3.3. Advantage-Aware Calibration\n[Figure 4: impact of α on reward gap and CSR.]\nWithin the framework of group-relative policy optimization, the length penalty term S acts as a powerful shaper of the advantage landscape. The interaction between the penalty strength and group normalization is non-trivial: slight shifts in S can noticeably redirect the optimization trajectory. 
In practice, an unconstrained or overly strong penalty may",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": ["Zichao Li", "Jie Lou", "Fangchen Dong", "Zhiyuan Fan", "Mengjie Ren", "Hongyu Lin", "Xianpei Han", "Debing Zhang", "Le Sun", "Yaojie Lu", "Xing Yu"],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 12,
    "total_chunks": 74,
    "char_count": 3203,
    "word_count": 486,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "740c258a-cd12-49c2-917b-630cffe1e902",
    "text": "penalize high-quality responses so heavily that it creates a contradictory signal where the model is discouraged from generating its best responses.\nFigure 4. Sensitivity of α: reward gap relative to the standard GRPO baseline versus α (log scale). Marker color indicates the average CSR measured during actual training, while the triangle marker denotes the value of α selected during the calibration phase.",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": ["Zichao Li", "Jie Lou", "Fangchen Dong", "Zhiyuan Fan", "Mengjie Ren", "Hongyu Lin", "Xianpei Han", "Debing Zhang", "Le Sun", "Yaojie Lu", "Xing Yu"],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 13,
    "total_chunks": 74,
    "char_count": 398,
    "word_count": 60,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "fda800d0-53f6-4751-add0-36781e23b308",
    "text": "A natural but overly strict objective is to require that all high-quality trajectories retain positive advantages. However, this\ngap relative to the GRPO baseline is already positive, indicating preserved task capability. 
Further reducing the penalty strength (i.e., decreasing α) yields no consistent performance gains, and instead results in fluctuations consistent with training variance.\nbecomes intractable under high reward density, where most responses in a group achieve the maximum reward Rmax (e.g., 15 out of 16 are correct). Due to the zero-sum structure of group normalization, correct but above-average-length responses may inevitably receive negative advantages. We provide a formal analysis of this limitation in Appendix C.1.\nAverage-Case Advantage Preservation. Instead of protecting the longest outlier trajectory, we aim to preserve the advantage of a representative high-quality response. Specifically, we consider a hypothetical response that achieves the group-wise maximum reward Rmax with the group-average length ¯ℓ, and require its advantage to remain non-negative. Let µR̂ denote the mean regularized reward in the group. This yields the condition: Rmax/(1 + α · ¯ℓ/¯ℓ) ≥ µR̂ =⇒ Rmax/(1 + α) ≥ µR̂. (16)\nThis ensures that the penalty α does not overturn the advantage of a typical high-quality response. In the limiting case where all trajectories in a group achieve Rmax, the average-case constraint can still fail to hold.\n4. Setup\nEfficient Reasoning for RLVR. Following prior work, we adopt DeepSeek-R1-Distill-1.5B and DeepSeek-R1-Distill-7B (Guo et al., 2025) as the base models. For mathematical reasoning, we use the DeepScaleR-Preview-Dataset (Luo et al., 2025b) as training data, and include open-sourced checkpoints of existing efficient-reasoning methods as baselines, including LC-R1 (Cheng et al., 2025), Laser (Liu et al., 2025c), AdaptThink (Zhang et al., 2025), and DLER (Liu et al., 2025b). To demonstrate the generality of our approach, we further extend to code generation, using the prompts from DeepDistill (Tian et al., 2025).\nMitigating Length Bias in RLHF. As for the RLHF set-",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": ["Zichao Li", "Jie Lou", "Fangchen Dong", "Zhiyuan Fan", "Mengjie Ren", "Hongyu Lin", "Xianpei Han", "Debing Zhang", "Le Sun", "Yaojie Lu", "Xing Yu"],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 14,
    "total_chunks": 74,
    "char_count": 2122,
    "word_count": 314,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "72f24743-137d-457a-98c3-d946c715cd88",
    "text": "ting, we use the non-reasoning versions of Qwen3-4B and Qwen3-8B (Yang et al., 2025) as base models. We construct RL prompts from arena-human-preference-140k2, and employ Skywork-Reward-V2-Llama-3.1-8B (Liu et al., 2025a) as the reward model. To improve training stability, we apply a reference-based sigmoid shaping (Fu et al., 2025) scheme: R(x, y(i)) = s(Rorigin(x, y(i)) − Rorigin(x, yref)), (17) where Rorigin(·) denotes the raw reward model score. Detailed experimental settings are provided in Appendix D.\nWe therefore filter out such groups online (see Appendix C.2). In practice, Eq. (16) is not enforced as a hard constraint at every update due to on-policy sampling stochasticity. Instead, we treat it as a calibration criterion for selecting the penalty coefficient α. We run a short calibration phase at the beginning of GRPO training and measure the Constraint Satisfaction Rate (CSR) over candidate α values. We then select the largest α whose CSR remains consistently high (e.g., ≥ 99.9%), ensuring high-probability constraint satisfaction while maintaining strong length regularization. Empirically, the α selected by this protocol maintains a near-perfect CSR throughout training (see Figure 4).",
    "paper_id": "2603.10535",
    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
    "authors": ["Zichao Li", "Jie Lou", "Fangchen Dong", "Zhiyuan Fan", "Mengjie Ren", "Hongyu Lin", "Xianpei Han", "Debing Zhang", "Le Sun", "Yaojie Lu", "Xing Yu"],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
    "chunk_index": 15,
    "total_chunks": 74,
    "char_count": 1211,
    "word_count": 178,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "6eab1802-a2ef-456c-af3e-06bc92cabab5",
    "text": "This point effectively marks a practical boundary: the reward\n2https://huggingface.co/datasets/lmarena-ai/arena-human-preference-140k\nTable 3. Mathematical reasoning performance for 7B models. 
Length-oriented methods reduce tokens but may sacrifice accuracy, while\nGR3 consistently achieves comparable accuracy with significantly fewer tokens, establishing a markedly better Pareto frontier.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 16, + "total_chunks": 74, + "char_count": 488, + "word_count": 53, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5abf481f-bdd6-4722-b05a-2f216878f34d", + "text": "AIME24 AIME25 AMC23 MATH500 Model Avg@32↑ #Token↓ Avg@32↑ #Token↓ Avg@16↑ #Token↓ Avg@4↑ #Token↓ Initial model\nDeepSeek–R1–Distill–7B 52.4 13213 39.4 14032 89.8 6385 92.1 3994 Length-oriented RL\nLCR1–7B 47.9 7548 36.3 7960 85.8 2963 89.1 1546\nLaser–DE–L4096–7B 51.7 4931 36.7 5048 88.1 2427 92.4 1580\nAdaptThink–7B 51.5 11070 39.3 11678 88.1 4280 90.6 2011\nDLER–R1–7B 49.5 3272 35.6 3288 91.4 2255 93.2 1650 Performance-oriented RL\nGRPO 57.1 11079 44.7 12540 90.3 7256 92.1 5006\nGR3 (ours) 60.1 7923 46.9 8582 93.0 3090 94.0 1764 Performance on code generation tasks. 
GR3 achieves competitive scores with fewer tokens across model sizes.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 17, + "total_chunks": 74, + "char_count": 589, + "word_count": 92, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5768bc59-ba13-40d9-b48e-33269b4ee920", + "text": "Table 5. Performance on chat tasks. GRPO improves performance but incurs substantial length bias, while GR3 achieves stronger alignment gains while preserving the length of the initial policy.\nLiveCodeBench v6 MultiPL-E\nModel Score↑ #Token↓ Score↑ #Token↓\nDeepSeek–R1–Distill–1.5B\nInitial 17.7 12665 45.1 6181\nGRPO 23.4 11830 51.5 6589\nGR3 (ours) 24.9 8538 52.2 2414\nDeepSeek–R1–Distill–7B\nInitial 37.7 11496 69.7 4121\nGRPO 42.4 10956 71.1 4794\nGR3 (ours) 41.6 7504 70.9 2127\nArena–Hard–Auto Alpaca–Eval\nModel Score↑ #Token↓ Score↑ #Token↓\nQwen3–4B\nInitial 66.6 1139 40.1 737\nGRPO 85.8 2374 33.9 1993\nGR3 (ours) 85.9 1377 44.1 859\nQwen3–8B\nInitial 77.2 1171 50.2 778\nGRPO 90.6 2343 53.5 1670\nGR3 (ours) 92.8 1178 55.8 765\nMain Results\nGR3 improves it to 46.9 with much fewer tokens (14,032 →\n4.2.1.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", +
"chunk_index": 18, + "total_chunks": 74, + "char_count": 846, + "word_count": 131, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ca416b91-4342-474c-868a-55187ca6b46c", + "text": "EFFICIENT REASONING FOR RLVR\nThe experimental results for the 7B model are presented in Table 3, while those for the 1.5B model are provided in Appendix E.1. Notably, GR3 improves reasoning performance while reducing generation length, indicating a genuine efficiency gain rather than a trade-off.\nWithin the performance-oriented regime, compared with standard GRPO, GR3 leads to substantially shorter generations while maintaining or even improving performance. For instance, at the 7B scale on AIME24, GR3 reduces the average length from 13,213 to 7,923 tokens while improving Avg@32 from 52.4 to 60.1, whereas GRPO only reaches 57.1.\n8,582). This suggests that GR3 encourages more efficient reasoning trajectories rather than merely truncating reasoning, yielding consistent gains across scales.\nWe further validate GR3 on code generation benchmarks, as summarized in Table 4. Consistent with our findings in mathematical reasoning, GR3 achieves substantial efficiency gains while preserving task performance.\n4.2.2. MITIGATING LENGTH BIAS IN RLHF\nResults on alignment benchmarks are shown in Table 5. Compared with the initial models, RLHF training yields substantial improvements in chat quality. However, we", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 19, + "total_chunks": 74, + "char_count": 1219, + "word_count": 177, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4efe0f9b-3c5e-4011-a9a9-b0fa156c9df8", + "text": "In contrast to existing length-oriented baselines, GR3 does not over-compress the reasoning length at the cost of accuracy; instead, it prioritizes preserving performance while removing redundant reasoning. For example, on AIME25 (7B), none of the length-oriented baselines surpasses the initial checkpoint performance (39.4), whereas\nobserve that standard GRPO suffers from severe reward hacking under length bias, where the model can artificially increase reward by generating unnecessarily long responses, resulting in explosive length inflation (e.g., on Qwen3-8B, the average response length on Arena–Hard–Auto increases from 1,171 to 2,343 tokens). 
In contrast, GR3 attains com-", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 20, + "total_chunks": 74, + "char_count": 690, + "word_count": 96, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7860070f-8a04-463a-acaf-330871f4d404", + "text": "parable or even stronger alignment gains while keeping response length almost unchanged, effectively decoupling performance improvement from verbosity. For\nward on the most important reasoning steps. By discouraging unnecessary verbosity, GR3 compresses reasoning traces while preserving decisive steps.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 21, + "total_chunks": 74, + "char_count": 410, + "word_count": 51, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "85921cf4-9991-43dd-b01a-dad146e40385", + "text": "instance, on Qwen3-8B, GR3 improves the Arena–Hard–Auto score from 77.2 to 92.8, while the token cost only increases marginally (1,171 → 1,178).\nWe further visualize the training dynamics in the RLHF setting in Figure 2. Under GRPO, response length grows monotonically and uncontrollably throughout training. In contrast, GR3 exhibits a clear \"increase-then-decrease\" pattern: the model initially expands its reasoning to secure alignment improvements, and subsequently compresses redundant generations once performance stabilizes. This dynamic behavior aligns with our design intuition: GR3 prioritizes achieving reliable alignment gains first, and then progressively improves response efficiency by suppressing length-based exploitation.\nThis increases the signal density of reward with respect to tokens, allowing optimization to focus more strongly on causally important reasoning patterns rather than distributing gradients over verbose but weakly relevant tokens. Qualitative rollout examples are provided in Appendix F.\n4.3. Analysis and Discussion\n4.3.1. ABLATION ON PENALTY STRENGTH α\nWe study the effect of the penalty coefficient α by sweeping its value while keeping all other settings fixed. Detailed results and analysis are provided in Appendix E.2; here we summarize the key findings.\nWhen α is too large (e.g., 1.0), GR3 degenerates toward a naive length regularizer: responses become much shorter, but performance gains over the base model are limited, as optimization is dominated by compression rather than capability improvement. As α decreases, response length grows smoothly, while task performance first improves and then plateaus. This trend is consistent with the analysis in Section 3.3: once the advantage of representative high-quality\n5. Related Work\nDespite remarkable progress, reinforcement learning (Kaelbling et al., 1996; Bai et al., 2022; Guo et al., 2025) suffers from high inference costs and growing generation lengths, a bottleneck we term length inflation.\nOne line of work studies efficient reasoning (Feng et al., 2025; Sui et al., 2025), aiming to improve the accuracy–cost trade-off of long chain-of-thought models. Early approaches rely on prompting or supervised fine-tuning to encourage shorter reasoning traces (Ma et al., 2025a; Xia et al., 2025; Ma et al., 2025b). More recent methods apply RL to directly optimize efficiency via length-aware objectives (Arora & Zanette, 2025; Liu et al., 2025c;b; Yu et al., 2025). While effective at reducing token usage, such approaches can degrade performance or introduce brittle optimization dynamics due to poorly calibrated penalties or shortcut solutions (Cheng et al., 2025).\nAnother line of work attributes length inflation in RLHF to reward hacking and length bias (Skalse et al., 2022; Gao et al., 2023; Singhal et al., 2023). Because reward models may implicitly favor longer responses, verbosity can arise from exploiting reward artifacts rather than true capability gains (Shen et al., 2023).", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 22, + "total_chunks": 74, + "char_count": 2999, + "word_count": 438, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "75b4078b-1c06-425e-b28f-b458a5525ec1", + "text": "Prior efforts mitigate this by improving reward modeling and calibration (Chen et al., 2024a; Wang et al., 2025) or by applying post-hoc reward correction (Huang et al., 2024), though many of these solutions are tailored to specific training settings.\ntrajectories is preserved, further reducing the penalty mainly relaxes length control without yielding stronger learning signals. The chosen value α = 0.33 lies near this transition region, achieving substantial length reduction while retaining most performance gains.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 23, + "total_chunks": 74, + "char_count": 520, + "word_count": 75, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "54affd3a-e167-4fe3-b5d0-3546b7fd89c3", + "text": "4.3.2. WHY DOES GR3 OUTPERFORM GRPO?\nWe observe a counterintuitive phenomenon: in many settings, GR3 not only shortens responses but also achieves stronger downstream performance than standard GRPO, while maintaining a positive reward gap relative to the GRPO baseline. We attribute this to a difference in how the optimization signal is structured.\nUnder unconstrained RL such as GRPO, policies often drift toward over-extended reasoning trajectories. Although these trajectories may eventually reach correct answers, they tend to contain many low-contributing tokens. From an optimization perspective, this spreads the learning signal thinly across long responses, reducing the effective impact of re-\nOur method is most closely related to RL-based efficient reasoning, and remains effective under reward hacking driven by length bias. We propose GR3, a general framework for length regularization that preserves performance while improving the performance–cost Pareto frontier.\n6. Conclusion\nIn this work, we identify length inflation as a fundamental inefficiency in RL-trained LLMs, where models tend toward unnecessary verbosity or overthinking. We propose Group Relative Reward Rescaling (GR3), a general framework for lossless length control that regulates reasoning length through a multiplicative, group-relative formulation with advantage-aware calibration. Across both RLVR and RLHF settings, GR3 consistently shifts the per-\nC.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 24, + "total_chunks": 74, + "char_count": 1546, + "word_count": 211, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "27ac888b-f460-4947-bec7-6d8e0f7e2d3b", + "text": "formance–cost Pareto frontier outward, reducing token usage while preserving or even improving model capability. These results show that verbosity is not a prerequisite for intelligence, and position GR3 as a practical and general paradigm for training efficient, high-performing LLMs.\nImpact Statement\nThis paper presents a method to mitigate length inflation in Large Language Models trained via Reinforcement Learning. The primary positive impact of this work lies in promoting computational efficiency and environmental sustainability (\"Green AI\"). By significantly reducing token generation, e.g., saving over 40% of tokens in reasoning tasks without compromising performance, GR3 directly contributes to lowering the financial costs, inference latency, and energy consumption associated with deploying large-scale reasoning models.\nFurthermore, this work addresses the alignment challenge of reward hacking, where models exploit verbosity to maximize rewards rather than improving genuine capability. By decoupling performance gains from unnecessary length, we facilitate the development of more concise, interpretable, and user-aligned AI systems.\nMultipl-e: A scalable and extensible approach to benchmarking neural code generation. arXiv preprint arXiv:2208.08227, 2022.\nChen, L., Zhu, C., Soselia, D., Chen, J., Zhou, T., Goldstein, T., Huang, H., Shoeybi, M., and Catanzaro, B. Odin: Disentangled reward mitigates hacking in rlhf. arXiv\nChen, X., Xu, J., Liang, T., He, Z., Pang, J., Yu, D., Song, L., Liu, Q., Zhou, M., Zhang, Z., et al. Do not think that much for 2+3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187, 2024b.\nCheng, Z., Chen, D., Fu, M., and Zhou, T. Optimizing length compression in large reasoning models. arXiv preprint arXiv:2506.14755, 2025.\nDubois, Y., Galambosi, B., Liang, P., and Hashimoto, T. B. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.\nEisenstein, J., Nagpal, C., Agarwal, A., Beirami, A., D'Amour, A., Dvijotham, D., Fisch, A., Heller, K., Pfohl, S., Ramachandran, D., et al. Helping or herding? reward model ensembles mitigate but do not eliminate reward", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 25, + "total_chunks": 74, + "char_count": 2196, + "word_count": 311, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "616357b7-b31d-4107-a246-f5079d6f043a", + "text": "hacking. arXiv preprint arXiv:2312.09244, 2023.\nNevertheless, practitioners should monitor for potential over-truncation in safety-critical domains where exhaustive reasoning traces are essential for verification.\nReferences\nAchiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.\nAggarwal, P. and Welleck, S. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697, 2025.\nAmodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in ai safety, 2016.\nArora, D. and Zanette, A. Training language models to reason efficiently. arXiv preprint arXiv:2502.04463, 2025.\nFeng, S., Fang, G., Ma, X., and Wang, X. Efficient reasoning models: A survey. arXiv preprint arXiv:2504.10903, 2025.\nFu, J., Zhao, X., Yao, C., Wang, H., Han, Q., and Xiao, Y. Reward shaping to mitigate reward hacking in rlhf. arXiv preprint arXiv:2502.18770, 2025.\nGao, L., Schulman, J., and Hilton, J. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp. 10835–10866.\nGuo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.\nHou, B., Zhang, Y., Ji, J., Liu, Y., Qian, K., Andreas, Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning. arXiv preprint arXiv:2504.01296, 2025.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 26, + "total_chunks": 74, + "char_count": 1611, + "word_count": 230, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "602689f3-2c0b-4a39-b5b6-2b73c24695fe", + "text": "Huang, Z., Qiu, Z., Wang, Z., Ponti, E. Post-hoc reward calibration: A case study on length bias. arXiv preprint arXiv:2409.17407, 2024.\nBai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.\nJain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.\nCassano, F., Gouwar, J., Nguyen, D., Nguyen, S., Phipps-Costin, L., Pinckney, D., Yee, M.-H., Zi, Y., Anderson,\ninforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.\nAmerican invitational mathematics examination - amc. In American Invitational Mathematics Examination - AMC 2023, 2023.\nMAA. American invitational mathematics examination -\nLi, T., Chiang, W.-L., Frick, E., Dunlap, L., Wu, T., Zhu, B., Gonzalez, J.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 27, + "total_chunks": 74, + "char_count": 1195, + "word_count": 173, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b0dc449e-a766-4d14-b3ff-92223c7f0ad9", + "text": "From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939, 2024.\naime. In American Invitational Mathematics Examination - AIME 2024, 2024.\nAmerican invitational mathematics examination - aime. In American Invitational Mathematics Examination - AIME 2025, 2025.\nLightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Let's verify step by step. arXiv preprint\nOuyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.\nLiu, C. Y., Zeng, L., Xiao, Y., He, J., Liu, J., Wang, C., Yan, R., Shen, W., Zhang, F., Xu, J., Liu, Y., and Zhou, Y. Skywork-reward-v2: Scaling preference data curation via human-ai synergy. arXiv preprint arXiv:2507.01352, 2025a.\nPuterman, M. Markov decision processes. Handbooks in operations research and management science, 2:331–434, 1990.\nLiu, S.-Y., Dong, X., Lu, X., Diao, S., Liu, M., Chen, M.-H., Yin, H., Wang, Y.-C. F., Cheng, K.-T., Choi, Y., et al. Dler: Doing length penalty right-incentivizing more intelligence per token via reinforcement learning. arXiv preprint arXiv:2510.15110, 2025b.\nShao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.\nShen, W., Zheng, R., Zhan, W., Zhao, J., Dou, S., Gui,\nLiu, T., Xiong, W., Ren, J., Chen, L., Wu, J., Joshi, R., Gao, Y., Shen, J., Qin, Z., Yu, T., et al.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 28, + "total_chunks": 74, + "char_count": 1735, + "word_count": 265, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ed06fa49-6a52-41fe-a5d1-dad82cf73596", + "text": "Rrm: Robust reward model training mitigates reward hacking. arXiv preprint arXiv:2409.13156, 2024.\nT., Zhang, Q., and Huang, X.-J. Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 2859–2873, 2023.\nLiu, W., Zhou, R., Deng, Y., Huang, Y., Liu, J., Deng, Y., Zhang, Y., and He, J. Learn to reason efficiently with adaptive length-based reward shaping. arXiv preprint arXiv:2505.15612, 2025c.\nSheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024.\nLuo, H., Shen, L., He, H., Wang, Y., Liu, S., Li, W., Tan, N., Cao, X., and Tao, D. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. arXiv preprint arXiv:2501.12570, 2025a.\nSinghal, P., Goyal, T., Xu, J., and Durrett, G. A long way to go: Investigating length correlations in rlhf. arXiv\nSkalse, J., Howe, N., Krasheninnikov, D., and Krueger, D.\nLuo, M., Tan, S., Wong, J., Shi, X., Tang, W., Roongta, M., Cai, C., Luo, J., Zhang, T., Li, E., Popa, R. A., and Stoica,", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 29, + "total_chunks": 74, + "char_count": 1184, + "word_count": 188, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "07ba8202-4d08-44bd-9eb7-8293ad0ea187", + "text": "Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, 2025b.\nDefining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471, 2022.\nSui, Y., Chuang, Y.-N., Wang, G., Zhang, J., Zhang, T., Yuan, J., Liu, H., Wen, A., Zhong, S., Zou, N., et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419, 2025.\nMa, W., He, J., Snell, C., Griggs, T., Min, S., and Zaharia, M. Reasoning models can be effective without thinking.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 30, + "total_chunks": 74, + "char_count": 670, + "word_count": 85, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b7a89cce-2c17-482a-93ec-09077c72b768", + "text": "Ma, X., Wan, G., Yu, R., Fang, G., and Wang, X. Cot-valve: Length-compressible chain-of-thought tuning. arXiv preprint arXiv:2502.09601, 2025b.\nTeam, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 31, + "total_chunks": 74, + "char_count": 366, + "word_count": 54, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e7a72344-4d72-4096-b3a1-581406891c00", + "text": "Team, K., Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., Li, C., Xiao, C., Du, C., Liao, C., et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint\nTian, X., Zhao, S., Wang, H., Chen, S., Peng, Y., Ji, Y., Zhao, H., and Li, X. Deepdistill: Enhancing llm reasoning capabilities via large-scale difficulty-graded data training.\nWang, C., Zhao, Z., Jiang, Y., Chen, Z., Zhu, C., Chen, Y., Liu, J., Zhang, L., Fan, X., Ma, H., et al. Beyond reward hacking: Causal rewards for large language model alignment. arXiv preprint arXiv:2501.09620, 2025.\nT., Wang, W., Li, Y., and Li, W.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 32, + "total_chunks": 74, + "char_count": 702, + "word_count": 115, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d1fbedd8-bee5-4980-82af-72e50422867c", + "text": "Tokenskip: Controllable chain-of-thought compression in llms.\nYang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.\nYi, J., Wang, J., and Li, S. Shorterbetter: Guiding reasoning models to find optimal inference length for efficient reasoning. arXiv preprint arXiv:2504.21370, 2025.\nYu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint\nZhang, J., Lin, N., Hou, L., Feng, L., and Li, J. Adaptthink: Reasoning models can learn when to think. arXiv preprint\nConnection to Heuristic Gating Mechanisms\nIn Section 3.1, we motivated multiplicative shaping primarily through the lens of removing the compensatory optimization shortcut inherent in additive shaping. In this section, we provide an alternative perspective by analyzing the relationship between our approach and heuristic gating mechanisms (Arora & Zanette, 2025; Cheng et al., 2025). 
We demonstrate that\nmultiplicative shaping can be viewed as a principled generalization of heuristic gating: it mathematically reduces to gating\nin binary reward settings while providing a robust, \"soft\" gating mechanism in continuous reward landscapes where hard\nindicators fail. Equivalence in Binary Reward Settings.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 33, + "total_chunks": 74, + "char_count": 1500, + "word_count": 221, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ac422931-f855-402f-af3e-5c002bea19c2", + "text": "Heuristic gating is a common enhancement to additive shaping in efficient reasoning (RLVR), which prevents models from optimizing length at the expense of accuracy. It typically employs an indicator function I(R = 1) to apply length penalties only when the response is correct.\nLet P denote a generic length-based penalty term (e.g., a negative function of length). Standard gated additive shaping modifies the shaping term S in Eq. 5 to be conditional on task success:\nGated Additive: R̂(+,g) = R + λ · S_gate, where S_gate = I(R = 1) · P.\nHere, I(R = 1) acts as a hard gate, and R ∈ {0, 1} is the binary task outcome. Now, consider our multiplicative shaping defined in Eq. 7:\nMultiplicative: R̂(×) = R · S_mult.\nTo facilitate comparison, we can decompose the scaling factor S_mult into a baseline and a deviation term. We rewrite S_mult = 1 + (S_mult − 1), where the deviation corresponds to the implicit penalty applied by the scaling mechanism:\nλP := S_mult − 1 =⇒ S_mult = 1 + λP. 
We analyze the behavior in the two binary states:\n• Case R = 0 (Failure): R̂(+,g) = 0 + λ · (0 · P) = 0 and R̂(×) = 0 · (1 + λP) = 0. Both methods deactivate the penalty, preventing premature termination on hard instances.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 34, + "total_chunks": 74, + "char_count": 1189, + "word_count": 218, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ceb63bd8-c698-4dc3-a2fa-2484accedcf4", + "text": "Both methods apply the full penalty to incentivize efficiency among correct solutions. Thus, in the strict binary reward setting typical of RLVR, multiplicative shaping is mathematically equivalent to heuristic gating. It inherits the desirable property of protecting the policy from being penalized on incorrect reasoning paths.\nGeneralization to Continuous Rewards. The limitation of heuristic gating becomes apparent when transitioning to continuous reward settings, such as RLHF (where rewards are typically given by a reward model) or reasoning tasks with partial credit.\nIn these scenarios, the hard indicator I(R = 1) is ill-defined. Naively replacing it with a threshold I(R > τ) introduces hyperparameters and optimization discontinuities. Conversely, removing the gate entirely (reverting to pure additive shaping) reintroduces the trade-off discussed in Proposition 3.1, where the model can improve R̂(+) by shortening length even if R degrades slightly. 
Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning Multiplicative shaping solves this by acting as a soft gating mechanism. As derived in Proposition 3.2, the group-normalized\nadvantage under multiplicative shaping contains the following term governing the length signal: A(ˆR(×)) ∝R · (S −µS) + . . . This decomposition demonstrates that the impact of length variation (S −µS) on the advantage is explicitly scaled by the\ntask reward R. This creates a dynamic reweighting of the learning signal: • Low Quality (R ≈0): The length signal is suppressed (R · (S −µS) ≈0).", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 36, + "total_chunks": 74, + "char_count": 1588, + "word_count": 234, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2df738f7-be23-4c4b-8db0-ba4685527a34", + "text": "The advantage is dominated by the need\nto improve task correctness, and the policy receives little to no signal regarding length. This mimics an inactive gate,\npreventing the model from collapsing to short but incorrect responses. 
• High Quality (R ≈1): The length signal is fully active (R · (S −µS) ≈S −µS).", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 37, + "total_chunks": 74, + "char_count": 309, + "word_count": 53, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "77b8c185-0d72-421b-b49c-a1519ffc9123", + "text": "The advantage significantly favors\nshorter trajectories within the successful group. This mimics an active gate, effectively prioritizing efficiency once\ncapability is secured. This property effectively interpolates between \"no penalty\" and \"full penalty\" based on the response quality. Consequently,\nGR3 allows us to apply strong length regularization in RLHF without the risk of the model collapsing to short, low-quality\nresponses, as evidenced by the dynamics in Figure 2. Empirical Observation. 
We further conduct an analytical experiment, illustrated in Figure 5.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 38, + "total_chunks": 74, + "char_count": 569, + "word_count": 79, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "dc7ece6b-8380-4f30-8d25-87b78a6bca53", + "text": "Specifically, we convert\nour multiplicative shaping term into a penalty form used in gated additive shaping by defining λP := Smult −1. We then\nintroduce different thresholds I(R > τ) to extend gated additive shaping to continuous-reward RLHF settings. We observe\nthat, due to optimization discontinuities, all choices of τ lead to performance that falls short of standard GRPO. 
At the same\ntime, generation length is reduced more aggressively, dropping below the typical level of the base policy model.",
+    "paper_id": "2603.10535",
+    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
+    "authors": [
+      "Zichao Li",
+      "Jie Lou",
+      "Fangchen Dong",
+      "Zhiyuan Fan",
+      "Mengjie Ren",
+      "Hongyu Lin",
+      "Xianpei Han",
+      "Debing Zhang",
+      "Le Sun",
+      "Yaojie Lu",
+      "Xing Yu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
+    "chunk_index": 39,
+    "total_chunks": 74,
+    "char_count": 503,
+    "word_count": 79,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "8143d704-8f78-4931-8525-a0a7192887d6",
+    "text": "[Figure: panels (a) Task Reward and (b) Response Length over training steps, comparing Standard GRPO, Multiplicative, and Gated Additive shaping with thresholds τ = 0.5, 0.75, 0.9.] Comparison of multiplicative and gated additive shaping during RLHF training. Multiplicative shaping matches standard\nGRPO in task reward while achieving controlled length reduction. In contrast, gated additive variants with different reward thresholds\nunderperform and produce overly short responses, reflecting optimization instability introduced by hard gating. Analysis of the Difficulty Over-Adaptation Phenomenon In Section 3.2, we discussed how group-relative length regularization adapts the length budget to on-policy statistics. While\nthis removes the rigidity of global thresholds, we observe an unintended side effect in certain shaping strategies (e.g.,\nEfficiently (Arora & Zanette, 2025)): the policy can become over-adaptive to perceived task difficulty, a phenomenon we\nterm difficulty over-adaptation. 
Concretely, the model tends to aggressively compress reasoning on easy prompts while\nfailing to effectively restrain excessive length on hard ones, as reflected in Table 2 and illustrated by the examples in Table 6. In other words, the regularizer distorts how reasoning effort is allocated across difficulty levels. Example output from the Efficiently (Arora & Zanette, 2025) 1.5B baseline at step 800, showing excessive reasoning truncation\nand omission of critical intermediate steps, resulting in an incoherent chain of thought.",
+    "paper_id": "2603.10535",
+    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
+    "authors": [
+      "Zichao Li",
+      "Jie Lou",
+      "Fangchen Dong",
+      "Zhiyuan Fan",
+      "Mengjie Ren",
+      "Hongyu Lin",
+      "Xianpei Han",
+      "Debing Zhang",
+      "Le Sun",
+      "Yaojie Lu",
+      "Xing Yu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
+    "chunk_index": 40,
+    "total_chunks": 74,
+    "char_count": 1716,
+    "word_count": 246,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "83f3db2c-6743-4bec-aa58-aaef5196f5dd",
+    "text": "Prompt\nA 90◦ rotation around the origin in the counter-clockwise direction is applied to 7 + 2i. What is the resulting complex number? Let's think step by step and output the final answer within \\boxed{}. 
\n7cos90 -2i sin90=0-2i\n// Extremely short reasoning performed\n // Fails to present the correct answer.",
+    "paper_id": "2603.10535",
+    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
+    "authors": [
+      "Zichao Li",
+      "Jie Lou",
+      "Fangchen Dong",
+      "Zhiyuan Fan",
+      "Mengjie Ren",
+      "Hongyu Lin",
+      "Xianpei Han",
+      "Debing Zhang",
+      "Le Sun",
+      "Yaojie Lu",
+      "Xing Yu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
+    "chunk_index": 41,
+    "total_chunks": 74,
+    "char_count": 321,
+    "word_count": 51,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "bb9cb551-824b-45a8-a99a-5cabad0665d6",
+    "text": "To understand the mechanism behind this skew, we analyze the sensitivity of the reward function to changes in length. Consider the Efficiently (Arora & Zanette, 2025) formulation:\nˆR(x, y(i)) = R(x, y(i)) − λ · I(R(x, y(i)) = 1) · s((ℓ(i) − ¯ℓ) / σℓ).\nIn this formulation, the input to the sigmoid function s(·) is amplified by the factor 1/σℓ. This means that the marginal\npenalty associated with increasing the response length is inversely proportional to the statistical dispersion (σℓ) of the group.",
+    "paper_id": "2603.10535",
+    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
+    "authors": [
+      "Zichao Li",
+      "Jie Lou",
+      "Fangchen Dong",
+      "Zhiyuan Fan",
+      "Mengjie Ren",
+      "Hongyu Lin",
+      "Xianpei Han",
+      "Debing Zhang",
+      "Le Sun",
+      "Yaojie Lu",
+      "Xing Yu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
+    "chunk_index": 42,
+    "total_chunks": 74,
+    "char_count": 494,
+    "word_count": 82,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "0ea6b0bb-9a2a-4e46-b8ef-411747a10893",
+    "text": "This dependency creates instability across difficulty regimes. 
On easier prompts, the policy is typically confident and\nconverges to consistent responses, causing the length standard deviation to collapse (i.e., σℓ → 0). Consequently, the\nscaling factor 1/σℓ becomes extremely large. In this low-variance regime, even a deviation of a single token is treated as\na massive statistical outlier, triggering a severe drop in reward. This hypersensitivity forces the model to over-compress\nsimple responses to avoid harsh penalties. Conversely, on harder prompts, the policy often explores diverse reasoning paths,\nresulting in a larger σℓ. This attenuates the penalty signal, allowing longer generations to persist with relatively little cost. In contrast, GR3 normalizes the penalty based on the characteristic scale (mean length ¯ℓ) rather than dispersion:",
+    "paper_id": "2603.10535",
+    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
+    "authors": [
+      "Zichao Li",
+      "Jie Lou",
+      "Fangchen Dong",
+      "Zhiyuan Fan",
+      "Mengjie Ren",
+      "Hongyu Lin",
+      "Xianpei Han",
+      "Debing Zhang",
+      "Le Sun",
+      "Yaojie Lu",
+      "Xing Yu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
+    "chunk_index": 43,
+    "total_chunks": 74,
+    "char_count": 851,
+    "word_count": 120,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "0b5226cc-3c94-4058-8b58-a79f5630190e",
+    "text": "ˆR(x, y(i)) = R(x, y(i)) / (1 + α · ℓ(i)/¯ℓ).\nHere, the sensitivity of the penalty depends on the ratio of ℓ(i) relative to ¯ℓ. While the mean length ¯ℓ is naturally smaller for easier\ntasks (appropriately making the budget tighter), it represents the physical scale of the response and does not collapse to\nnear-zero values as the model becomes confident. 
By normalizing based on scale rather than variance, GR3 provides a\nstable regularization signal that remains robust to the model's convergence state, effectively mitigating the imbalance in\ncompression pressure across difficulty levels.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 44, + "total_chunks": 74, + "char_count": 583, + "word_count": 93, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e534884f-19b7-44ec-9e46-7677214ab9e8", + "text": "The Dilemma of High Reward Density In this section, we analyze a fundamental structural tension that arises when length regularization is combined with groupnormalized advantages. We show that under high reward density, a seemingly desirable strict requirement—ensuring that\nall highest-reward trajectories retain positive advantage—is in general mathematically infeasible. We then show that even\na relaxed average-case criterion can degenerate in the limiting case where all sampled trajectories achieve Rmax, as a\nconsequence of the convexity of multiplicative length rescaling. These observations motivate the relaxed, calibration-based\nconstraint and the online filtering strategy adopted in Section 3.3. Impossibility of the Strict Advantage-Preservation Objective A Strict but Natural Objective. A natural objective for length-aware reinforcement learning is to preserve the optimization\nsignal of the best solutions. 
Concretely, consider the following strict condition: all trajectories achieving the maximum\ntask reward Rmax within a sampled group should receive positive advantages after length regularization. These trajectories represent the highest-quality responses, and assigning them negative advantages risks discouraging correct\nreasoning behaviors.",
+    "paper_id": "2603.10535",
+    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
+    "authors": [
+      "Zichao Li",
+      "Jie Lou",
+      "Fangchen Dong",
+      "Zhiyuan Fan",
+      "Mengjie Ren",
+      "Hongyu Lin",
+      "Xianpei Han",
+      "Debing Zhang",
+      "Le Sun",
+      "Yaojie Lu",
+      "Xing Yu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
+    "chunk_index": 45,
+    "total_chunks": 74,
+    "char_count": 1365,
+    "word_count": 175,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "352a2c47-a1a8-4ffc-b965-31ea54a1f5aa",
+    "text": "Under GRPO-style normalization, the strict objective is therefore equivalent to requiring ˆR(x, y(j)) > µRˆ, ∀j ∈ H, where H := {j : R(x, y(j)) = Rmax} denotes the set of highest-reward trajectories, ˆR(x, y(j)) is the regularized reward\nand µRˆ is the mean regularized reward within the group. Impracticality Under High Reward Density. We now show that under high reward density, the strict objective of\npreserving positive advantages for all highest-reward trajectories is mathematically infeasible, regardless of how small the\nlength penalty coefficient α is. Under GR3, the regularized reward is\nˆR(x, y(j)) = R(x, y(j)) / (1 + α · ℓ(j)/¯ℓ).\nFor all j ∈ H we have R(x, y(j)) = Rmax, so variation in ˆR depends only on length. Since the regularization term is\nmonotonically decreasing in ℓ(j), the longest trajectory in H, with length ℓmax, attains the smallest regularized reward\nˆRmin = Rmax / (1 + α · ℓmax/¯ℓ).",
+    "paper_id": "2603.10535",
+    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
+    "authors": [
+      "Zichao Li",
+      "Jie Lou",
+      "Fangchen Dong",
+      "Zhiyuan Fan",
+      "Mengjie Ren",
+      "Hongyu Lin",
+      "Xianpei Han",
+      "Debing Zhang",
+      "Le Sun",
+      "Yaojie Lu",
+      "Xing Yu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
+    "chunk_index": 46,
+    "total_chunks": 74,
+    "char_count": 891,
+    "word_count": 150,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "1938491d-9ae5-45b7-85b0-957ff87470e3",
+    "text": "When reward density is high, the group mean µRˆ is dominated by trajectories in H and is therefore approximately their\naverage. Since the minimum of a set cannot exceed its mean, we must have ˆRmin ≤ µRˆ. Hence, at least one highest-reward trajectory receives a non-positive advantage. This shows that, under high reward density, it is impossible for all\nhighest-reward trajectories to maintain positive advantages after group normalization; the conflict is structural rather than a\nconsequence of hyperparameter choice. This impossibility persists even in the limit α → 0. Using a first-order expansion,\nˆR(x, y(j)) ≈ Rmax · (1 − α · ℓ(j)/¯ℓ),\nand denoting by\n¯ℓH = (1/|H|) · Σ_{j∈H} ℓ(j)\nthe mean length among highest-reward trajectories, we obtain\nˆR(x, y(j)) − µRˆ ≈ −Rmax · α · (ℓ(j) − ¯ℓH)/¯ℓ.\nHence\nˆA(j) ∝ −(ℓ(j) − ¯ℓH),",
+    "paper_id": "2603.10535",
+    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
+    "authors": [
+      "Zichao Li",
+      "Jie Lou",
+      "Fangchen Dong",
+      "Zhiyuan Fan",
+      "Mengjie Ren",
+      "Hongyu Lin",
+      "Xianpei Han",
+      "Debing Zhang",
+      "Le Sun",
+      "Yaojie Lu",
+      "Xing Yu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
+    "chunk_index": 47,
+    "total_chunks": 74,
+    "char_count": 802,
+    "word_count": 133,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "2f3deed1-2e04-48f4-a4bb-eaf846c6713e",
+    "text": "so any highest-reward trajectory longer than the mean length of H must receive a negative advantage, no matter how small α\nis. Group normalization enforces a zero-mean constraint that inevitably induces sign flips among equally correct trajectories\nwith different lengths. Empirical Observation. Figure 6 empirically illustrates this phenomenon. The horizontal axis shows the proportion\nof trajectories in a group achieving Rmax (reward density), and the vertical axis shows the fraction of highest-reward\ntrajectories that satisfy the strict condition Ai > 0. Even with a very small penalty strength (Figure 6(a), α = 0.05), the\nsatisfaction rate declines as reward density increases. With a larger α (Figure 6(b), α = 5.0), the decline becomes much\nmore pronounced. These results confirm that violations of the strict condition arise from an inherent structural conflict rather\nthan poor hyperparameter tuning. This impossibility result directly motivates the relaxed calibration strategy in Section 3.3. 
Instead of attempting to protect\nthe longest highest-reward trajectory—which is generally infeasible—we adopt an average-case criterion that ensures a\ntypical high-quality trajectory, whose length is near the group mean, remains above the group-average regularized reward. This leads naturally to the practical constraint in Eq. (16).",
+    "paper_id": "2603.10535",
+    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
+    "authors": [
+      "Zichao Li",
+      "Jie Lou",
+      "Fangchen Dong",
+      "Zhiyuan Fan",
+      "Mengjie Ren",
+      "Hongyu Lin",
+      "Xianpei Han",
+      "Debing Zhang",
+      "Le Sun",
+      "Yaojie Lu",
+      "Xing Yu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
+    "chunk_index": 48,
+    "total_chunks": 74,
+    "char_count": 1446,
+    "word_count": 205,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "dd5c03d6-798b-4d96-97b2-ec6d556895e4",
+    "text": "[Figure 6: strict-constraint satisfaction rate of the naive method (y-axis) versus Rmax score ratio (x-axis), for panels (a) α = 0.05 and (b) α = 5.00.] Effect of reward density on the strict advantage-preservation objective. 
As reward density increases, the strict constraint\nsatisfaction rate decreases, even for a small penalty coefficient α = 0.05, and more noticeably for a larger penalty α = 5.0.",
+    "paper_id": "2603.10535",
+    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
+    "authors": [
+      "Zichao Li",
+      "Jie Lou",
+      "Fangchen Dong",
+      "Zhiyuan Fan",
+      "Mengjie Ren",
+      "Hongyu Lin",
+      "Xianpei Han",
+      "Debing Zhang",
+      "Le Sun",
+      "Yaojie Lu",
+      "Xing Yu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
+    "chunk_index": 49,
+    "total_chunks": 74,
+    "char_count": 477,
+    "word_count": 87,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "c451ffcd-b573-4365-8ca6-78ead05374f4",
+    "text": "Degeneracy of the Average-Case Criterion in the All-Rmax Limit Recall the average-case calibration criterion (Eq. (16)), which requires that a representative high-quality trajectory (reward\nRmax and average length ¯ℓ) retains non-negative advantage:\nRmax / (1 + α) ≥ µRˆ,\nwhere µRˆ denotes the within-group mean of the regularized rewards. As noted in Section 3.3, this constraint can fail in the\nlimiting case where all trajectories in the group achieve Rmax, motivating online filtering. All-Rmax case reduces the condition to a Jensen inequality. Assume a sampled group satisfies R(x, y(i)) = Rmax for\nall i ∈ {1, . . . , G}. Under GR3, the regularized reward is\nˆR(i) = Rmax / (1 + α · ℓ(i)/¯ℓ).\nDefine the normalized length ratio zi := ℓ(i)/¯ℓ, which obeys (1/G) · Σ_{i=1}^G zi = 1, and let\nf(z) := 1 / (1 + αz), z ≥ 0.\nThen µRˆ = Rmax · (1/G) · Σ_{i=1}^G f(zi), and Eq. 
(16) becomes\nf(1) ≥ (1/G) · Σ_{i=1}^G f(zi).",
+    "paper_id": "2603.10535",
+    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
+    "authors": [
+      "Zichao Li",
+      "Jie Lou",
+      "Fangchen Dong",
+      "Zhiyuan Fan",
+      "Mengjie Ren",
+      "Hongyu Lin",
+      "Xianpei Han",
+      "Debing Zhang",
+      "Le Sun",
+      "Yaojie Lu",
+      "Xing Yu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
+    "chunk_index": 50,
+    "total_chunks": 74,
+    "char_count": 832,
+    "word_count": 145,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "79d22f55-719d-42bf-97ac-580e5d204c82",
+    "text": "Convexity flips the inequality. A direct calculation gives\nf′′(z) = 2α² / (1 + αz)³ > 0,\nso f is convex on [0, ∞). By Jensen's inequality,\nf((1/G) · Σ_{i=1}^G zi) ≤ (1/G) · Σ_{i=1}^G f(zi).\nUsing (1/G) · Σ_{i=1}^G zi = 1, we obtain\nf(1) ≤ (1/G) · Σ_{i=1}^G f(zi),\nwhich is the reverse of the desired condition. Moreover, the inequality is strict whenever the lengths are not all identical\n(i.e., when zi is not constant). Therefore, in an all-Rmax group, the average-case constraint in Eq. (16) can only hold in the\ndegenerate case ℓ(1) = · · · = ℓ(G); otherwise it must fail. 
This explains why we exclude such groups via online filtering in\npractice.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 51, + "total_chunks": 74, + "char_count": 708, + "word_count": 130, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "316cf32c-db7a-4581-907b-56bcefdd206b", + "text": "Detailed Experimental Settings We evaluate GR3 in representative post-training regimes spanning RLVR and RLHF. In the RLVR setting, we study whether\nthe model can reduce unnecessary long CoT reasoning while maintaining task performance improvement. As for the\nRLHF setting, we examine whether our method mitigates reward hacking (Amodei et al., 2016) and produces well-aligned\nresponses with natural response lengths. In this section, we provide the detailed hyperparameters and configurations used in our experiments. All experiments are\nconducted using veRL (Sheng et al., 2024) as the training framework, based on a standard GRPO advantage estimator. The\ndetailed implementation settings for the experiments are shown in Table 7. Moreover, we evaluate mathematical reasoning on AIME-24 (MAA, 2024), AIME-25 (MAA, 2025), AMC-23 (MAA,\n2023), and MATH500 (Lightman et al., 2023); code generation on LiveCodeBench v6 (Jain et al., 2024) and MultiPL-E\n(Cassano et al., 2022); and conversational ability on Arena-Hard-Auto (Li et al., 2024) and AlpacaEval (Dubois et al., 2024). Following standard practice, we report the length-controlled (LC) win rate for AlpacaEval. 
For Arena-Hard-Auto, we do not\napply length control and instead use scores derived directly from the original pairwise judge.",
+    "paper_id": "2603.10535",
+    "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning",
+    "authors": [
+      "Zichao Li",
+      "Jie Lou",
+      "Fangchen Dong",
+      "Zhiyuan Fan",
+      "Mengjie Ren",
+      "Hongyu Lin",
+      "Xianpei Han",
+      "Debing Zhang",
+      "Le Sun",
+      "Yaojie Lu",
+      "Xing Yu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10535v1",
+    "chunk_index": 52,
+    "total_chunks": 74,
+    "char_count": 1292,
+    "word_count": 188,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "b8ab87b4-067a-4dc4-bdae-4b8bf0b58ef1",
+    "text": "Hyper-parameters used in our experiments. Hyperparameter Math Code Chat\nGR3 Coefficient α 0.33 0.33 0.00133\nKL Coefficient β 0.0 0.0 0.001\nBatch Size 128 128 128\nMini Batch Size 128 128 128\nPrompt Length 1024 1024 1024\nRollout Length 16384 16384 4096\nRollout Temperature 1.0 1.0 1.0\nRollout Group Size 16 8 8\nOptimizer AdamW AdamW AdamW\nWeight Decay 0.01 0.01 0.01\nLearning Rate 2e-6 2e-6 1e-6\nScheduler Type Constant Constant Constant\nTraining Step 1000 800 400\nEvaluation Length 32768 32768 8192\nEvaluation Temperature 0.6 0.6 0.0 Additional Experimental Results Mathematical Reasoning Results on 1.5B Models This section presents experimental results on mathematical reasoning tasks, using DeepSeek–R1–Distill–1.5B as the base\nmodel. As shown in Table 8, the findings are consistent with the experiments of DeepSeek–R1–Distill–7B: GR3 not only\nsignificantly enhances the model's reasoning capabilities but also substantially reduces the length of the generated output in\nterms of tokens, demonstrating good performance across different scales. Mathematical reasoning performance for 1.5B models. 
AIME24 AIME25 AMC23 MATH500 Model Avg@32↑ #Token↓ Avg@32↑ #Token↓ Avg@16↑ #Token↓ Avg@4↑ #Token↓ Initial model\nDeepSeek–R1–Distill–1.5B 30.0 16531 23.6 15799 70.8 9351 83.9 5399 Length-oriented RL\nLCR1–1.5B 23.5 9071 20.6 8275 67.8 4170 81.9 2520\nLaser–DE–L4096–1.5B 30.1 5770 24.1 5008 73.4 3110 84.6 1931\nAdaptThink–1.5B 34.2 9204 24.7 9234 63.3 2859 80.9 1649\nDLER–R1–1.5B 34.3 3839 27.8 3153 82.1 2419 87.2 1783 Performance-oriented RL\nGRPO 39.6 13054 31.4 12985 81.9 9917 85.6 7138\nGR3 (ours) 45.2 8381 32.8 8137 81.6 4153 89.3 2214 Ablation Results on Penalty Strength α We analyze the sensitivity of GR3 to the length penalty coefficient α by sweeping its value under identical training settings.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 53, + "total_chunks": 74, + "char_count": 1907, + "word_count": 281, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d4a2d2f3-978c-463a-a340-dbceb6518872", + "text": "Detailed results on 1.5B models are shown in Table 9. When α is large (e.g., α = 1.0), the multiplicative rescaling term heavily penalizes long trajectories, largely independent of\ntheir reward level. In this regime, GR3 behaves similarly to conventional length-penalized RL, where optimization is driven\nprimarily by shortening responses rather than improving solution quality. Although token usage is significantly reduced,\nthe performance gains over the initial model become noticeably smaller, indicating that overly aggressive regularization\nsuppresses useful long-form reasoning. 
As α decreases, response length increases in a gradual and well-behaved manner, while task performance first improves\nand then saturates. In particular, moving from α = 0.33 to smaller values (e.g., 0.2 and 0.1) yields longer generations but\nonly marginal or inconsistent accuracy improvements. This empirical trend aligns closely with the analysis in Section 3.3.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 54, + "total_chunks": 74, + "char_count": 950, + "word_count": 135, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e4e43716-9611-4263-b92d-384736d55140", + "text": "Once the penalty is weak enough that the advantage of representative high-quality trajectories is preserved, further reducing\nα mainly relaxes the length constraint without introducing stronger optimization signals. In other words, the training has\nalready crossed the advantage-preservation boundary, after which additional reasoning length no longer translates into\nmeaningful performance gains. Overall, α controls distinct behavioral regimes, and GR3 provides a principled way to select it near the advantage-preserving\ntransition between length control and capability gains. Qualitative Analysis of Rollout Trajectories To better understand the behavioral differences induced by the two training objectives, we present representative rollout\nexamples from a GR3-trained model and a GRPO-trained baseline on the same reasoning prompt. 
Tables 10 and 11 show\nthe generation traces.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 55, + "total_chunks": 74, + "char_count": 883, + "word_count": 119, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ba9e7a57-5212-406e-8d56-f20770de069c", + "text": "GR3: concise reasoning with preserved structure. As illustrated in Table 10, the GR3-trained model produces a\nreasoning trajectory that is both structured and economical. The solution follows a clear progression: (i) restating the task,\n(ii) identifying the periodicity of directions under full rotations, (iii) reducing the angle to a remainder modulo 360◦, and\n(iv) mapping the residual rotation to a compass direction. 
Each step contributes directly to advancing the solution, and", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 56, + "total_chunks": 74, + "char_count": 483, + "word_count": 71, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "28753c1f-c45d-40c9-9200-0f2fea6de83c", + "text": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning Hyperparameter sweep of the length penalty coefficient α in GR3 on 1.5B models. AIME24 AIME25 AMC23 MATH500 Model Avg@32↑ #Token↓ Avg@32↑ #Token↓ Avg@16↑ #Token↓ Avg@4↑ #Token↓ Initial model\nDeepSeek–R1–Distill–1.5B 30.0 16531 23.6 15799 70.8 9351 83.9 5399 Our method: varying α\nGR3 (α = 1.00) 34.9 6316 27.4 5927 75.2 1754 83.0 848\nGR3 (α = 0.50) 40.5 7923 29.6 7399 82.3 3245 87.1 1705\nGR3 (α = 0.33) 45.2 8381 32.8 8137 81.6 4153 89.3 2214\nGR3 (α = 0.20) 45.1 10174 32.4 10250 83.0 4887 88.9 2838\nGR3 (α = 0.10) 43.2 10759 31.7 10701 83.0 6235 88.2 3419 intermediate checks serve to confirm rather than re-derive earlier results.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 57, + "total_chunks": 74, + "char_count": 738, + "word_count": 123, + 
"chunking_strategy": "semantic" + }, + { + "chunk_id": "849b10dd-645c-47cc-82c5-9b20a87aeb22", + "text": "Importantly, the trajectory terminates decisively with a correctly formatted boxed answer. The reasoning chain is neither\nartificially shortened nor overly verbose; instead, redundant re-computation and self-doubt loops are largely absent. This\nreflects a policy that has learned to allocate tokens primarily to causally relevant steps.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 58, + "total_chunks": 74, + "char_count": 336, + "word_count": 45, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6c41d1ec-ccab-4238-953f-2e951cbae6ac", + "text": "GRPO: verbose loops and diluted signal. In contrast, the GRPO-trained baseline (Table 11) exhibits substantially\ndifferent behavior. Although it repeatedly identifies correct intermediate facts—such as the 360◦periodicity and the\n2250 mod 360 = 90◦reduction—it frequently re-derives them, questions previously established conclusions, and oscillates\nbetween equivalent formulations (e.g., full rotations vs. fractional rotations). The trajectory contains multiple self-corrections\nthat do not introduce new information. This pattern results in long reasoning traces where many tokens are only weakly related to forward progress. From an\noptimization perspective, such trajectories distribute the reward signal across a large number of low-impact tokens, reducing\nthe effective learning pressure on decisive reasoning steps. 
Moreover, despite eventually circling back to the correct direction,\nthe model fails to present a clean final answer in the required boxed format, leaving the solution inconclusive. Implications for optimization dynamics.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 59, + "total_chunks": 74, + "char_count": 1045, + "word_count": 137, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ac532034-5b2f-491d-8d56-fced6a485612", + "text": "These qualitative differences align with the mechanism discussed in Section 4.3.2. By down-weighting the advantages of unnecessarily long trajectories, GR3 implicitly favors reasoning paths with\nhigher informational density per token. This encourages the policy to preserve essential logical structure while avoiding\nnon-robust reasoning patterns, such as repetitive or self-doubting loops. 
As a result, the reward signal becomes more\nconcentrated on tokens that correspond to meaningful reasoning transitions, rather than being diluted across verbose yet\nlow-contributing segments.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 60, + "total_chunks": 74, + "char_count": 582, + "word_count": 76, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d9a9337d-2a61-4a24-b54c-c9809c4e52f4", + "text": "Zichao Li led the project. Jie Lou and Yaojie Lu supervised the project. Fangchen Dong contributed to the experimental implementation. Zhiyuan Fan contributed by first sketching the prototype of the method during early discussions. Mengjie Ren contributed to the experiments on code generation tasks. Hongyu Lin, Xianpei Han, Debing Zhang, Le Sun, and Xing Yu provided supervision and contributed to technical discussions\nthroughout the project. 
Example output from our GR3-trained 1.5B model, demonstrating robust, clearly structured step-by-step reasoning and a correct\nfinal answer in boxed format.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 61, + "total_chunks": 74, + "char_count": 706, + "word_count": 98, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9d955e87-a0b8-40d8-b535-705132cb4370", + "text": "Prompt\nA figure skater is facing north when she begins to spin to her right. She spins 2250 degrees. Which direction (north, south, east or\nwest) is she facing when she finishes her spin?", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 62, + "total_chunks": 74, + "char_count": 187, + "word_count": 34, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "94e28563-76f7-4dd9-a972-b83d2c14931e", + "text": "Let's think step by step and output the final answer within \\boxed{}. Example Output from our GR3-trained 1.5B Model\n\nOkay, so I need to figure out which direction a figure skater is facing after spinning 2250 degrees to her right. 
She starts facing\nnorth. Hmm, let's break this down step by step.\n// Restates the problem and sets up a step-by-step plan.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 63, + "total_chunks": 74, + "char_count": 361, + "word_count": 63, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "cbbb1c4d-f9ba-4ca2-8f04-c168333f000b", + "text": "First, I know that a full rotation is 360 degrees, so every time she spins 360 degrees, she'll end up facing the same direction she\nstarted. So, to find out where she ends up after 2250 degrees, I should probably find the equivalent angle within a single rotation.\n// Recalls the key concept: directions repeat every 360°. Let me calculate how many full rotations 2250 degrees is. I'll divide 2250 by 360. Let's see, 2250 ÷ 360. Let me compute that:\n360 × 6 is 2160, and 2250 - 2160 is 90. So, 2250 degrees is equivalent to 6 full rotations plus an additional 90 degrees.\n// Performs division to reduce the rotation to full turns plus a remainder. So, she spins 6 full circles, which brings her back to the starting direction, facing north. Then she spins an additional 90 degrees to\nher right. Now, I need to figure out where she ends up after spinning 90 degrees to the right from north.\n// Uses the result: full turns cancel out, leaving only the 90° remainder. Let's recall the directions: when facing north, a right turn would be towards east. So, if she spins 90 degrees to the right, starting\nfrom north, she would face east.\n// Maps a 90° right turn from north to the compass direction east. 
Wait, let me double-check that. If she's facing north and spins 90 degrees to the right, her new direction is east.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 64, + "total_chunks": 74, + "char_count": 1315, + "word_count": 241, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8146bb01-7851-47a1-a8c3-0afbcd365345", + "text": "Yeah, that makes\nsense. So after 2250 degrees, she ends up facing east.\n// Verifies the directional reasoning and confirms the conclusion. So the answer should be east.\n**Final Answer**\neast", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 65, + "total_chunks": 74, + "char_count": 190, + "word_count": 31, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8acad438-5967-48c5-91d3-35b64b9e225a", + "text": "\nThe figure skater starts facing north and spins 2250 degrees to her right. First, we determine the equivalent angle within a single rotation by dividing 2250 by 360. This gives us 6 full rotations plus an\nadditional 90 degrees (since 2250 - 6*360 = 90). After completing 6 full rotations, the skater is back facing north. She then spins an additional 90 degrees to her right from north. 
When facing north and spinning 90 degrees to the right, the skater will face east. // Formats the final answer as required. Example output from the GRPO-trained 1.5B baseline, exhibiting redundant reasoning loops, repeated self-doubt, and failure to\nproduce a conclusive final answer.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 66, + "total_chunks": 74, + "char_count": 785, + "word_count": 124, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c12d8465-1f0b-403b-b229-baefcbbade83", + "text": "Prompt\nA figure skater is facing north when she begins to spin to her right. She spins 2250 degrees. Which direction (north, south, east or\nwest) is she facing when she finishes her spin?", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 67, + "total_chunks": 74, + "char_count": 187, + "word_count": 34, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8b6465b9-aacb-46f3-9256-4cfb550aa9bb", + "text": "Let's think step by step and output the final answer within \\boxed{}. 
Example Output from the GRPO-trained 1.5B Model\n\nOkay, so I have this problem here about a figure skater spinning. Let me try to figure it out step by step.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 68, + "total_chunks": 74, + "char_count": 233, + "word_count": 42, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8e1b32e6-ce15-41a4-8a22-ab0aca67e352", + "text": "Hmm, she's facing north\nand starts spinning to her right. She spins 2250 degrees, and we need to find out which direction she's facing after that. The options\nare north, south, east, or west. Alright, first, I know that when you spin, your direction changes based on how many degrees you turn. But since she's starting\nfacing north and spinning to her right, I need to figure out how that affects her direction after a certain number of degrees. When you spin to your right, does that mean you're turning clockwise or counterclockwise? I think in the context of\nfacing north, spinning to your right would be clockwise because if you're facing north and turn right, you're going towards east,\nthen south, then west, and back to north. So, right is clockwise. 
But maybe I should clarify.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 69, + "total_chunks": 74, + "char_count": 785, + "word_count": 137, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "88d5760e-0f4c-4211-859b-80ea1648aeab", + "text": "If you're facing north and you spin to your right, you're turning clockwise. So, each 90-degree turn\nwould change your direction. So, 90 degrees to the right from north would be east, another 90 degrees would be south, another 90\ndegrees would be west, and another 90 degrees back to north. That makes sense.\n// Repeatedly re-derives the meaning of \"spin to her right\". Now, she spins 2250 degrees.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 70, + "total_chunks": 74, + "char_count": 398, + "word_count": 69, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "23580d33-3163-4d0d-81cb-a2b610ba8f18", + "text": "Let me think about how to break this down. Since a full circle is 360 degrees,\nmaybe I can find out how many full rotations she makes and then the remaining degrees. So, if I divide 2250 by 360, that will tell me how many full spins she does and what the remainder is. 
Let me do that: 2250 ÷ 360. First, 360 times 6 is 2160, right? Because 360*6 = 2160. Then, 2250 - 2160 is 90. So, 2250 degrees is equal to 6 full rotations\n(which bring her back to the starting point) plus an additional 90 degrees. So, effectively, she only needs to consider the 90-degree\nturn after the full rotations.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 71, + "total_chunks": 74, + "char_count": 589, + "word_count": 114, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fdcc7288-0c9a-421a-9be1-4bb20b001f81", + "text": "Wait, is it 6 full rotations or 6.25? Wait, no, 2250 divided by 360 is 6.25, because 360*6=2160, and 2250-\n2160=90, so it's 6 and 90/360, which is 6.25 full rotations. So, 6.25 rotations is the same as 6 full rotations plus 0.25 of a rotation. But since a full rotation brings her back to the starting direction, the 6 full rotations don't change her facing. So, the key is the 0.25\nof a rotation, which is 90 degrees. She starts spinning to her right, which is clockwise. So, spinning 90 degrees clockwise from north would make\nher face east. So, after 2250 degrees, which is equivalent to 6 full rotations plus 90 degrees, she would have effectively turned 90\ndegrees clockwise from her starting direction. Wait, the problem says she starts spinning to her right. So, does that mean she is spinning clockwise? Because if\nshe's facing north and spins to her right, that's a clockwise spin. So, each 90-degree turn clockwise would change her direction. So, if she spins 90 degrees clockwise, she would face east. 
180 degrees clockwise would face her west, and 270 degrees clockwise\nwould face her south. But in this case, she only spins 90 degrees, which is 2250 degrees, which is 6 full rotations plus 90 degrees. Wait, but 2250 degrees is a lot.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 72, + "total_chunks": 74, + "char_count": 1247, + "word_count": 221, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "14d33b56-be57-4026-ad12-600c1d3ee590", + "text": "Let me make sure I'm calculating this correctly. So, 2250 divided by 360 is equal to... let's see, 360\ntimes 6 is 2160, as I had before. 2250 minus 2160 is 90. So, 2250 degrees is 6 full circles plus 90 degrees. So, she's effectively\nturned 90 degrees.\n// Unnecessarily re-checks the same arithmetic multiple times. But since she's spinning to her right, which is clockwise, a 90-degree turn would make her face east. So, she starts facing north,\nspins 90 degrees clockwise, so she faces east. Wait, maybe I'm mixing up the direction of rotation. If she's spinning to her right, that's a clockwise rotation,\nbut in terms of her facing direction, starting north, spinning 90 degrees clockwise would make her face east. 
Wait, but let me think again.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 73, + "total_chunks": 74, + "char_count": 747, + "word_count": 130, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "30cfacd3-e259-454c-be33-c63dffa4d90e", + "text": "When you spin to your right, you're turning clockwise. So, if you're facing north, turning right\n(clockwise) 90 degrees would make you face east. So, 90 degrees clockwise from north is east. So, 2250 degrees is 6 full rotations\nplus 90 degrees. So, after the full rotations, she's back to north, and then she turns 90 degrees clockwise, so she's facing east.\n// Enters a self-doubt loop, questioning previously correct directional reasoning.\n. . . If you add 2250 degrees, but since 360 is a full circle, 2250 degrees is equivalent to 90 degrees. So, starting from\nnorth (0 degrees), adding 90 degrees clockwise is east. So, yes, she's facing east. But wait, hold on, is it 90 degrees or 270 degrees? Wait, no, because she's spinning to her right, which is clockwise, so 90 degrees\nclockwise from north is east. Wait, maybe I should draw a diagram in my mind.\n. . 
.\n// Fails to present the final answer, leaving the solution inconclusive.", + "paper_id": "2603.10535", + "title": "Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning", + "authors": [ + "Zichao Li", + "Jie Lou", + "Fangchen Dong", + "Zhiyuan Fan", + "Mengjie Ren", + "Hongyu Lin", + "Xianpei Han", + "Debing Zhang", + "Le Sun", + "Yaojie Lu", + "Xing Yu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10535v1", + "chunk_index": 74, + "total_chunks": 74, + "char_count": 938, + "word_count": 165, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10541_semantic.json b/data/chunks/2603.10541_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..9fea7b26c14053cf29a4342d62ef3434915612a1 --- /dev/null +++ b/data/chunks/2603.10541_semantic.json @@ -0,0 +1,1635 @@ +[ + { + "chunk_id": "f19f7300-e94c-4dc1-8414-af70607c037f", + "text": "Prompting with the human-touch:\nevaluating model-sensitivity of foundation models for musculoskeletal CT segmentation Caroline Magga,b,c, Maaike A. ter Weeb,c, Johannes G.G. Streekstrab, Leendert Blankevoortc, Clara\nI. Sáncheza, Hoel Kervadeca", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. 
Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 0, + "total_chunks": 71, + "char_count": 243, + "word_count": 29, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1c1f12ad-2a88-454a-aeb9-64232bfc67dc", + "text": "a Quantitative Healthcare Analysis (QurAI) Group, University of Amsterdam, Science Park 900, Amsterdam, 1098 XH, The Netherlands\nb Department Biomedical Engineering and Physics, Amsterdam UMC, Meibergdreef 9, Amsterdam, 1105 AZ, The Netherlands\nc Department Orthopaedics, Amsterdam UMC, Meibergdreef 9, Amsterdam, 1105 AZ, The Netherlands Promptable Foundation Models (FMs), initially introduced for natural image segmentation, have also revolutionized medical image segmentation. The increasing number of models, along with evaluations varying in datasets, metrics,\nand compared models, makes direct performance comparison between models difficult and complicates the selection\nof the most suitable model for specific clinical tasks. 
In our study, 11 promptable FMs are tested using non-iterative 2D and 3D prompting strategies on a private and public dataset focusing on bone and implant segmentation in four anatomical regions (wrist, shoulder, hip and lower leg) using human prompts collected through a dedicated observer study. Pareto-optimal models are identified and further analyzed. Our findings are: 1) The segmentation performance\nvaries a lot between FMs and prompting strategies; 2) The Pareto-optimal models in 2D are SAM and SAM2.1, in\n3D nnInteractive and Med-SAM2; 3) Localization accuracy and rater consistency vary with anatomical structures,\nwith higher consistency for simple structures (wrist bones) and lower consistency for complex structures (pelvis, tibia,\nimplants); 4) The segmentation performance drops using human prompts, suggesting that performance reported on \"ideal\"\nprompts extracted from reference labels might overestimate the performance in a human-driven setting; 5) All models were sensitive to prompt variations. While two models demonstrated intra-rater robustness, it did not scale to inter-rater\nsettings.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 1, + "total_chunks": 71, + "char_count": 1840, + "word_count": 225, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7af427ba-7853-4a44-9fff-58c00ad42777", + "text": "We conclude that the selection of the most optimal FM for a human-driven setting remains challenging, with\neven high-performing FMs being sensitive to variations in human input prompts. 
Our code base for prompt extraction\nand model inference is available: https://github.com/CarolineMagg/segmentation-FM-benchmark/ Keywords: foundation models, medical image segmentation, validation, MSK segmentation, CT segmentation Introduction", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 2, + "total_chunks": 71, + "char_count": 442, + "word_count": 53, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "50fa48c0-81bc-47ba-948e-51e09fb59d43", + "text": "Foundation models (FMs) for medical image segmentation have gained significant attention as a promising paradigm for developing promptable methods, which allow users to guide segmentation through simple interactions such as selected points, bounding boxes or scribbles (i.e. prompts). Inspired by the Segment Anything Model (SAM) [1], a growing number of medical variants tried to transfer the benefits of broad generalization and promptable inference to the medical domain. These models aim to reduce annotation burden, accelerate data curation, and enhance clinical usability by human-guided interactions and refinements.\nDespite extensive interest and many evaluation efforts, most studies rely on synthetic or algorithmically generated prompts based on reference segmentation masks. While these represent \"ideal\" prompts, they fail to account for the inherent variability of human annotations. This results in a limited understanding of real-world prompting behavior, where prompts are not \"perfect\" but still correct, provided by humans with varying levels of expertise and experience.
Consequently, a significant gap exists in analyzing how human-generated prompt variability impacts the final segmentation performance. To address this gap, we conducted an observer study to collect and analyze human prompts. This allowed us to quantify intra- and inter-rater variability and, more importantly, to evaluate model sensitivity to input prompt variations. In addition to addressing the limited understanding of real-world prompt behavior, this study incorporates specific experimental design choices to overcome three critical challenges in the current evaluation landscape:\nChallenge 1 – Scalability of multi-rater evaluations. The growing number of promptable FMs makes exhaustive benchmarking across all available architectures computationally demanding, especially when accounting for multiple human-annotator prompt sets. Furthermore, analyzing the sensitivity of under-performing models offers limited\nSolution: We implemented a two-stage evaluation strategy. First, 11 models were compared using standardized \"ideal\" prompts to identify the Pareto-optimal models with the least model parameters, offering the best tradeoffs between segmentation performance and parameter efficiency. Second, analysis with human prompts was focused exclusively on these top-performing models, ensuring that our sensitivity evaluation targeted the most relevant candidates.\npoints, masks) using three components: an image encoder, a prompt encoder, and a lightweight mask decoder that fuses image and prompt embeddings into a binary mask. Applied to 3D medical scans, SAM operates slice-by-slice and requires a prompt for each slice. SAM2 [6] extends SAM to video by replacing the image encoder backbone and adding a memory attention module that merges im\nChallenge 2 – Benchmarking fairness and data contamination. 
While public datasets drive progress in the field, age embeddings, prompt encodings, and predicted masks\nmany medical FMs are trained on publicly available into a joint representation. Stored in a first-in, first-out\ndatasets, making it difficult to compare models fairly when (FIFO) memory bank, this representation is queried when\nthe same data cannot be reused for testing.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 3, + "total_chunks": 71, + "char_count": 3295, + "word_count": 457, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "44693ca5-6f82-4ff2-90c9-4e1dbcb66669", + "text": "As a result, segmenting neighboring video frames. Treating CT slices\nensuring fair, unbiased evaluation often depends on private as video frames allows SAM2 to propagate information\ndatasets, as these provide the necessary independence. At across slices, enabling volumetric segmentation without\nthe same time, private datasets hinder reproducibility of prompting each slice individually. Recently, SAM3 [7]\na study, creating an inherent tension between the need for was released as concept-driven foundation model that uniunbiased assessment and the desire for open, community- fies image, video, and volumetric segmentation and obverifiable research. Solution: We utilize a hybrid data ject tracking by using short noun phrases or image examstrategy. 
ples (i.e., concept) instead of geometric prompts as used in SAM and SAM2.\nBy combining private, task-specific data for independent assessment with public data, our study maintains a balance between independent validation and scientific reproducibility.\nChallenge 3 – Disconnect between benchmark dataset diversity and clinical requirements. While recent large-scale benchmarks aim to showcase the generalization of FMs across diverse datasets [2], they often lack the depth required to validate performance on specialized clinical tasks. In clinical practice, the integration of a model depends on its performance for specific tasks, such as wrist bone segmentation for osteoarthritis assessment [3], tibia and implant segmentation for loosening quantification [4], or shoulder joint analysis for humeral head positioning [5]. Evaluating models across heterogeneous tasks can dilute\nthe focus on these task-specific requirements. Solution: Rather than distributing efforts across many modalities\nMedical Foundation Models. A wide range of geometric prompt-based interactive FMs have been proposed for medical image segmentation: Spanning from fine-tuned 2D SAM variants (Med-SAM [8], SAM-Med2D [9], ScribblePrompt [10], MedicoSAM [11]) and fine-tuned 3D SAM2-based models (Medical-SAM2 [12], Med-SAM2 [13]) to SAM-based extensions to 3D (SAM-Med3D [14], SegVol [15]), as well as non-SAM CNN-based methods like Vista3D [16] and nnInteractive [17]. For a comprehensive overview of medical foundation models, their variations and applications, we refer the reader to the dedicated literature [18, 19, 20, 21, 22, 23].\nIndependent evaluation studies for Promptable Foundation Models. 
Since the release of SAM, multiple evaluation\nand heterogeneous tasks, we performed a targeted, task- studies [24, 25, 26, 2, 27, 28, 29, 30, 31, 32] have shown\nfocused investigation on musculoskeletal (MSK) CT scans that the performance of SAM-based models varies widely\nfor bone and implant segmentation. across datasets and tasks - generally favoring large, wellInterested in FM performance in human-driven settings, defined structures while struggling with small, irregular,\nour work contributes an extensive evaluation of FMs in or low-contrast ones. Most of these studies assessed only\nbone and implant segmentation that moves beyond ideal- a limited set of models and prompts.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 4, + "total_chunks": 71, + "char_count": 3122, + "word_count": 442, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "493f9f6e-e529-4bc0-8898-47a99f9a9350", + "text": "Ali et al. [21] comized simulations. By (1) integrating the human-in-the-loop plement these findings by examining SAM, MedSAM, and\nvariability, (2) using both public and private data, and (3) SAM-Med2D in fine-tuning scenarios, showing that finefocusing on clinically relevant MSK tasks, we provide a tuning and prompt optimization improves performance for\nmore realistic assessment of FM performance. We make Automated Breast Ultrasound (ABUS) tumor dataset and\nour code base for prompt extraction and model inference a pregnant pelvis MRI dataset. Magg et al. [33] evalupublicly available1. 
ated bone CT segmentation employing four 2D SAM-based\nmodels (SAM, SAM2, Med-SAM, SAM-Med2D) and 32\nprompting strategies, finding that bounding box and com-\n2. Related Work\nbination of bounding box with center point yielded the\nbest performance across models. Noh et al. [20] provide a\nSegment Anything Model. The Segment Anything Model\nbroader comparison of seven foundation models for med-(SAM) [1] enables image segmentation from sparse or\nical image segmentation (SAM, Med-SAM, SAM-Med2D,dense prompts (bounding boxes, positive and negative\nUniverSeg, SAM-Med3D, SegVol, and SAT-Pro), evaluating visual, text, and reference prompts across diverse\n1https://github.com/CarolineMagg/segmentation-FM- datasets. The RadioActive benchmark [34] focuses its evalbenchmark/ uation on 3D interactive segmentation, testing seven mod- els (SAM, SAM2, Med-SAM, SAM-Med2D, SAM-Med3D, 2D slices. Based on our dataset characteristics, up to 5\nSegVol, and ScribblePrompt) on CT and MRI data un- components were considered for reference prompt extracder an iterative refinement workflow. Its findings indicate tion (referred to as NP prompts). Thus, for 2D prompting\nthat SAM2 outperforms all assessed 2D and 3D medical strategy, the default settings are: bounding box( ), cenfoundation models, and that bounding box prompts are ter point( ) or their combination( ), extracted for up\ngenerally superior to point-based ones. All named stud- to 5 components of the object of interest (Table 1).\nies in this section relied on synthetic and algorithmically\ngenerated prompts, based on an available reference label. Prompting Strategies in 3D. Models (except SegVol [15])\nrely on pseudo-3D boxes defined by two coordinates representing a box in a 2D slice.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. 
Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 5,
    "total_chunks": 71,
    "char_count": 2330,
    "word_count": 337,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5d39ab05-fa24-40e3-95e1-f08cc0d8b341",
    "text": "Similarly, a 3D point can be represented as a 2D coordinate with a slice number. Thus, the main extension of the 2D framework to 3D is initial\n3. Methodology\n3.1.",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 6,
    "total_chunks": 71,
    "char_count": 162,
    "word_count": 30,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0e86b9b4-dd26-4508-a0df-35d3e9ac055e",
    "text": "Promptable Foundation Models\nslice selection (Figure 1). Within this 3D context, a single prompt is defined as 2D coordinates localized within a single slice. Multiple 3D prompts are either several coordinates within one slice or individual coordinates distributed across multiple slices.\nEleven foundation models that, to our knowledge, were available as of July 30, 2025 – while supporting training- and adaption-free open-set medical image segmentation using sparse geometric prompts (i.e., bounding boxes and",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 7,
    "total_chunks": 71,
    "char_count": 512,
    "word_count": 73,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "477395b0-6687-4799-905a-744f27e3a588",
    "text": "points) – were included in our study (see Appendix B for implementation details). The models were divided into four categories per prompt type based on prediction dimensionality (2D vs. 3D) and training data domain (medical vs. natural images) (see Table 1 for a model overview):\n• 2D FM trained on natural images: SAM & SAM2.1 2D\n• 2D FM trained on medical images: Med-SAM, SAM-Med2D, ScribblePrompt, MedicoSAM 2D\n• 3D FMs trained on natural images: SAM2.1 3D\n• 3D FMs trained on medical images: SAM-Med3D, SegVol, MedicoSAM 3D, Vista3D, nnInteractive, Med-SAM2\nTwo sources of prompts were used: 1.) Automatically extracted prompts generated based on the reference mask, i.e., following previous work [33], called reference prompts for short; 2.) Human-generated prompts created by participants of the observer study following annotation guidelines aligned with the automatic extraction procedure, called human prompts for short. To ensure consistency, both prompts were derived on the same selected slices.\n(a) Bounding Box (b) Center Point (c) Slice Selection\nFigure 1: Prompting strategies in 2D consist of prompt primitives, i.e., bounding box (a) and/or center point (b), and component selections, i.e., including prompts from either the largest component (blue prompt) or all components (white and blue prompts). The 3D prompting strategies extend this concept with slice selection (c).\nPrompting Strategies in 2D. A 2D prompting strategy can be constructed with a primitive and a component selection criterion [33] (Figure 1). Primitives are the building blocks of a prompt and, in our work, the bounding box (referred to as bbox or box) and center point (referred to as center or point) are chosen, due to their demonstrated strong performance [33] and the ability to compare reference and human prompts. Following [33], the bounding box is defined as the tightest box enclosing the object, and the center point is defined as the pixel furthest away from the object boundary based on the Euclidean distance transform. This definition was used for the automatic prompt extraction and in the annotation guidelines for the observer study. The component selection determines how many components of an anatomical structure are considered for the extraction of prompt primitives.\nIn this work, as we utilized specific human-annotated slices, our strategy was limited to either a single selected slice or a combination of all selected slices (NS slices). Additionally, we investigated a prompting variant that incorporates the top and bottom slices of an object. By default, the same prompt primitives – bounding box( ), center point( ), or their combination( ) – extracted from the largest component in a single slice were used, which represented the common configuration supported by all 3D models (Table 1).\n3.2. Dataset\nSince medical FMs are trained on publicly available datasets, including bone CT segmentation ([35, 36, 37]), an independent dataset is essential to fairly compare performance across models. A private dataset ensures a task-specific and independent evaluation, while public datasets enable the broader research community to study reproducibility. To address both needs, we compiled a CT test dataset consisting of private CT scans from the department of Orthopaedic Surgery and Sports Medicine of the Amsterdam UMC, approved by the local Medical Ethics Committee (2025.0447), and selected CT samples",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 8,
    "total_chunks": 71,
    "char_count": 3434,
    "word_count": 532,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "6ffec0f9-b2b8-46d4-afe5-48147a2d958d",
    "text": "of the TotalSegmentator test set [35]. Unfortunately, not all FMs included in our study specify their exact test dataset splits of the TotalSegmentator dataset. Although anatomical structures may form a single 3D object, they can appear as multiple disconnected regions in individual\nTable 1: Overview of promptable FMs: Model backbone architecture, prediction dimensionality (2D vs. 3D), training data domain (Medical vs. Natural) and the supported prompting strategies. The prompting strategies are: single (1) or multiple (NP) boxes, points, and their combinations, for single (1) or multiple (NS) slices, with or without volumetric limitations (for 3D predictions). Boxed settings are our default settings, as they are possible across different models. (✓)* denotes that authors explicitly stated that the test set of [35] was excluded from training.",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. 
Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 9,
    "total_chunks": 71,
    "char_count": 854,
    "word_count": 126,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "35820f0a-744d-4674-9d8b-f3f185131256",
    "text": "NP prompts denotes that multiple prompts (in our work, up to 5 prompts) per initial slice were used. NS slices denotes that multiple initial slices (in our work, all selected slices) were used.\nModel | Backbone | Dim. | Data Domain | [35] | Box (1/NP) | Point (1/NP) | Pt+Box (1/NP) | Slice (1/NS) | Vol. Limits\nSam [1] | ViT | 2D | N | ✗ | ✓/✓ | ✓/✓ | ✓/✓ | - | -\nSam2 2D [6] | Hiera | 2D | N | ✗ | ✓/✓ | ✓/✓ | ✓/✓ | - | -\nMed-Sam [8] | Sam | 2D | M | ✓ | ✓/✓ | ✗/✗ | ✗/✗ | - | -\nSam-Med2D [9] | Sam | 2D | M | ✓ | ✓/✓ | ✓/✓ | ✓/✓ | - | -\nScribblePrompt-U [10] | UNet | 2D | M | (✓)* | ✓/✓ | ✓/✓ | ✓/✓ | - | -\nScribblePrompt-Sam [10] | Sam | 2D | M | (✓)* | ✓/✓ | ✓/✓ | ✓/✗ | - | -\nMedicoSam 2D [11] | Sam | 2D | M | ✓ | ✓/✓ | ✓/✓ | ✓/✓ | - | -\nSam2 3D [6] | Hiera | 3D | N | ✗ | ✓/✗ | ✓/✓ | ✓/✗ | ✓/✓ | ✓\nSam-Med3D [14] | 3D ViT | 3D | M | (✓)* | ✗/✗ | ✓/✓ | ✗/✗ | ✓/✗ | ✗\nSegVol [15] | 3D ViT | 3D | M | ✓ | ✗/✗ | ✓/✓ | ✗/✗ | ✓/✓ | ✗\nMedicoSam 3D [11] | Sam | 3D | M | ✓ | ✓/✗ | ✓/✓ | ✓/✗ | ✓/✗ | ✗\nVista3D [16] | SegResNet | 3D | M | ✓ | ✗/✗ | ✓/✓ | ✗/✗ | ✓/✓ | ✗\nnnInteractive [17] | CNN | 3D | M | ✓ | ✓/✓ | ✓/✓ | ✓/✗ | ✓/✓ | ✗\nMed-Sam2 [13] | Sam2 | 3D | M | ✓ | ✓/✗ | ✗/✗ | ✗/✗ | ✓/✓ | ✓\nIn total, our final dataset contains four skeletal regions, 49 CT scans and 18 class labels (Figure 2). A subset of axial slices, the primary scanning direction, was selected from the full CT volumes to limit the annotation workload for participants in the observer study. The slice selection was performed once using random sampling for each class label (i.e., anatomical object), with constraints applied to ensure adequate data coverage, diversity, and comparability across data subsets (see Appendix A). The selection was kept consistent across all experiments and served as the initial slices for model prompting (i.e., with perfect and human prompting). In total, 404 axial CT slices have been selected, i.e., 132 for Wrist, 96 for Lower Leg, 88 for Shoulder and 88 for Hip.\nto written guidelines, participants had access to a video showing the usage of the annotation platform. The annotation interface supported zooming and window/level adjustments, with default window settings tailored to each anatomical subset, and scrolling through the volume in all three planes (i.e., axial, sagittal, coronal), with the selected slice displayed as the default axial view. To enable prompt-specific time tracking on grand-challenge.org, the placement of bounding boxes and center points was performed independently. Thus, each sample was annotated twice: once per prompt type. Before the main study, participants completed a training phase in which they annotated selected slices from one sample per data subset (i.e., per anatomical region; 18–34 slices in total) and received written feedback on their annotations.",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 10,
    "total_chunks": 71,
    "char_count": 2553,
    "word_count": 478,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "12f5962e-ad50-4703-8468-eb0612aa65ee",
    "text": "When deviations from the protocol were identified, participants were asked to correct their annotations and were provided with an additional round of feedback. This iterative process was repeated until the participant demonstrated a consistent understanding of the annotation protocol.\n3.3. Observer Study\nAn observer study was conducted on the platform grand-challenge.org with 20 medical students from the Faculty of Medicine, University of Amsterdam. After the training",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 11,
    "total_chunks": 71,
    "char_count": 467,
    "word_count": 65,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "6ab6de42-37ce-4935-88cc-fbcaf002c67d",
    "text": "phase, all subsequent annotations were taken as provided, without additional correction or exclusion. Thus, no special handling of protocol deviations was applied. The main study was conducted in the fixed annotation order: Wrist (180 slices), Lower Leg (180 slices), Shoulder (120 slices), and Hip (120 slices).\nParticipants provided informed consent prior to participation. Participants were instructed to place bounding boxes and center points on each bone structure visible in a given CT slice (with the exception of vertebrae and rib bones, to reduce annotation effort), following predefined annotation guide-",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 12,
    "total_chunks": 71,
    "char_count": 607,
    "word_count": 87,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5082490a-ef7c-426f-aa23-a7dedf8aeecc",
    "text": "lines. These guidelines included precise definitions and multiple visual examples from different scans of the study dataset to ensure consistent interpretation.\nFigure 2: Dataset Overview: Our dataset consists of four subsets, i.e., Wrist, Lower Leg, Shoulder, and Hip [35]. A subset of 404 axial slices was extracted based on constraints ensuring data coverage, diversity and comparability.\nParticipants were randomly assigned to one of two groups to counterbalance ordering effects: one group always began with bounding box placement, the other with point placement. To assess intra-rater variability, each project included duplicate slices together with the original samples. Participants were fully blinded to the duplication, meaning they were not informed that certain slices appeared twice nor in which order they occurred. The duplicated sample counts per subset were as follows: Wrist: 120 original + 60 duplicates, Lower Leg: 90 + 90 duplicates, Shoulder and Hip: 80 + 40 duplicates (for de-",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. 
Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 13,
    "total_chunks": 71,
    "char_count": 1002,
    "word_count": 149,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "91897b11-7fd8-432a-9697-e4df6312183f",
    "text": "tails see Appendix A).\n3.4. Evaluation design\nFirst, we quantified accuracy and consistency of the human prompts. Second, we compared the segmentation performance of the FMs prompted with perfect prompts, to make a model selection of the Pareto-optimal models. Then, we evaluated the segmentation accuracy and consistency of these FMs prompted with the human prompts and the performance difference between both prompt sources. Finally, the models' sensitivity to input prompt variations was determined.\n3.4.1. Human prompt analysis\nTo reduce annotation complexity, observer study prompts were categorized with broad categories (i.e., bone and implant) rather than the specific class labels required for prompting. Thus, a matching process assigned a class label to each observer study prompt by aligning the observer study prompts with their reference prompts (i.e., automatically extracted from the reference mask). Human center points were compared to reference points on a per-label, per-component basis. For each connected component in the reference mask, the Hungarian algorithm2 (linear sum assignment) with Euclidean distance as cost metric was used to ensure an optimal one-to-one mapping. This approach minimizes total distances while allowing unmatched points, i.e., cases where the structures\ndifference in width and height (∆w, ∆h, |∆w|, |∆h|) was computed with respect to the corresponding reference boxes. Annotation consistency was evaluated at intra-rater and inter-rater levels using the same metrics as described above. Intra-rater consistency was assessed by comparing repeated annotations from the same annotator, while inter-rater consistency was assessed by pairwise comparison of annotations from two different annotators for the same object.\nDistances were calculated in both pixel coordinates and physical units (mm) based on the spacing of the reference masks (Figure 2). The metrics for human prompt analysis were summarized as medians with interquartile ranges (IQR) to avoid scaling on outliers.\nFor all annotators, annotation time was recorded per sample. Due to platform functionality, annotation times at the level of individual annotations were not available. Therefore, the annotation time per annotation was estimated by averaging the total time spent per image over the number of annotations within each sample.\n3.4.2. Segmentation analysis\nSegmentation performance was assessed by comparing masks generated from either reference or human prompts against the reference masks, which were obtained by manual segmentation, following [33].",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 14,
    "total_chunks": 71,
    "char_count": 2569,
    "word_count": 367,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "db98be13-6d90-4bf3-952b-4e576bb0fc29",
    "text": "For human prompts, segmentation consistency was determined using an intra-rater approach, where masks of the same sample generated from an annotator's first prompt set were compared to those from their second set. Finally, the performance gap between reference- and human-prompted results was quantified by pairwise difference analysis per sample.\nFor 2D models, the evaluation was performed on the predictions of the selected slices in a 2D manner (slice-wise). For 3D models, the generated volumetric predictions based on the selected slices as initial input were evaluated in a 3D manner (volumetric).\nFollowing previous work [33] and MetricsReloaded [38], the Dice Similarity Coefficient (DSC), the Normalized Surface Dice (NSD) (threshold set to the largest spatial resolution of 1.5mm), and the 95%-percentile Hausdorff distance (HD95) were used as metrics for segmentation performance analysis, with the implementation of the DisTorch framework [39].\nwere either not labeled in the reference (e.g., the fibula in the Lower Leg or clavicle in the Shoulder dataset) or overlooked by the annotator. Human bounding boxes were compared to the reference boxes on a per-label, per-object basis. Because a single bounding box may enclose multiple components, we matched boxes for each object (instead of component) using the Hungarian algorithm with Intersection over Union (IoU) as cost metric.\nMatched pairs were counted as true positives (TPs) and unmatched reference prompts as false negatives (FNs). For completeness, unmatched human prompts were categorized as false positives (FPs), and if both reference and human prompts were absent for a connected component or object, it was considered a true negative (TN).\nDetection performance was measured by Recall (TP/(TP + FN)). For all TPs, localization error was",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 15,
    "total_chunks": 71,
    "char_count": 1816,
    "word_count": 273,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e01597a7-36fd-40cb-8fca-d485531b48e1",
    "text": "quantified for center points, i.e., human center points and center points derived from human bounding boxes, by calculating the Euclidean distance and the signed/absolute x and y coordinate offset (∆x, ∆y, |∆x|, |∆y|) to the corresponding reference points. For bounding boxes, the Intersection over Union (IoU) and the signed/absolute\nIn line with common practice, summarized values of DSC, NSD and HD95 are reported as mean and standard deviation (std).\n3.4.3. Pareto front\nA model i with performance vector mi = (mi1, mi2, . . . , mid) is Pareto-optimal (non-dominated) if no other model j dominates it. Model j dominates model i (denoted mj ≻ mi) if:\nmjk ⪰k mik ∀k ∈ 1, . . . , d and ∃k′ : mjk′ ≻k′ mik′.\nHere, ⪰k denotes the comparison direction for metric k (i.e., superiority: ≥ if higher is better, ≤ if lower is better). In other words, no other model performs at least as well across all performance metrics and strictly better in at\n2See SciPy documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html\nsignificance. This recursive process pinpointed the threshold with only a fraction of the exhaustive calculations.\n3.4.6.",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. 
Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 16,
    "total_chunks": 71,
    "char_count": 1187,
    "word_count": 185,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "29464be6-d40d-46e1-ab86-2fd697d61a09",
    "text": "Statistical significance tests\nThe Wilcoxon signed-rank test was used for all pairwise comparisons, including the evaluation of performance differences between reference- and human-prompted segmentations, as well as the comparison of intra- versus inter-rater consistency. To account for multiple comparisons (n) within each test group, a Bonferroni correction was applied to the initial significance level (α = 0.05). In the reported results, asterisks (∗) denote statistical significance (p < α/n), while a lack of ∗ indicates non-significant results.\nleast one of them. Then, the Pareto front is defined as the set of all Pareto-optimal models:\nP = { i | ∄j : mj ≻ mi }.\nIn our work, a model lies on the Pareto front if no other model achieved higher DSC, higher NSD, and lower HD95 simultaneously, with at least one of these comparisons being strictly unequal.\n3.4.4.",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 17,
    "total_chunks": 71,
    "char_count": 870,
    "word_count": 138,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "53cac053-1500-4015-a833-df3de9427e05",
    "text": "Model selection\nWithin each category – defined by prediction dimensionality (2D vs. 3D), training data domain (medical vs. natural), and prompting strategy (bounding box, center point, or combination) – the Pareto-optimal models with the smallest number of parameters were identified. The selection prioritized smaller models that demonstrate strong performance within their category, ensuring computational efficiency. These models were chosen for further analysis with human prompting.\n3.4.5. Model sensitivity to input prompt variations\nFollowing the analysis of prompt variability (intra-rater and inter-rater) and segmentation consistency, the relationship between these two factors was analyzed to assess model sensitivity to input prompt variability.\n4. Results\nFirst, the human prompt accuracy and consistency were analyzed. Then, the segmentation performance was evaluated using reference prompts, including a model selection of the Pareto-optimal models. These models were further tested with the human prompts to determine segmentation performance and consistency, followed by an analysis of the performance differences of the two prompt sources. Finally, models' sensitivity to prompt variability was examined with intra- and inter-rater prompt variability and segmentation consistency.\n4.1. Human prompt variation",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 18,
    "total_chunks": 71,
    "char_count": 1329,
    "word_count": 178,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "80ce39f9-bc8b-4f75-95f4-ddab0015651f",
    "text": "Spearman's rank correlation coefficient (ρ) was calculated between the prompt variability (Euclidean distance or IoU) and the corresponding segmentation consistency (DSC, NSD, HD95). A low correlation coefficient indicates low sensitivity (i.e., increased robustness) to prompt variability, as it suggests the output masks remain similar regardless of variations in the input prompt. This analysis was first performed for the intra-rater prompt variability and segmentation consistency.\nCenter points. The median Euclidean distance between the human and the reference center points was 1.50mm (IQR: 0.7-3.0mm) (Table 2). The median intra-rater Euclidean distance, calculated from samples that the annotators processed twice, was 0.98mm (IQR: 0.5-1.9mm) and the median inter-rater Euclidean distance was 1.37mm (IQR: 0.7-2.6mm) (Table 3).\nBounding Boxes. The median IoU between the human and",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 19,
    "total_chunks": 71,
    "char_count": 892,
    "word_count": 124,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "1428f7ed-a1d8-4acc-a142-abfd023b09c9",
    "text": "reference bounding boxes was 90.56% (IQR: 83.4-94.5%) (Table 2). The median intra-rater IoU on samples seen twice by the annotators was 93.04% (IQR: 88.5-96.1%), and the median inter-rater IoU was 90.11% (IQR: 84.2-94.0%) (Table 3).\nIntra- vs. Inter-rater annotation consistency.\nIn case the models showed a lack of significant correlation for this setting, the analysis was repeated for an inter-rater setting on the same sample set to determine the transition between statistically non-significant and statistically significant correlation. To optimize the computational overhead of exhaustive pairwise comparisons (n = 190 per sample",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 20,
    "total_chunks": 71,
    "char_count": 642,
    "word_count": 99,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "166154ac-173f-4be8-810a-878a57f72ddf",
    "text": "per model), an iterative search was implemented. First, the annotator pool was sorted by mean Euclidean distance (i.e., prompt variation from one rater to all others), with the lowest-variability rater serving as lower bound and the highest-variability rater as upper bound. If statistically non-significant correlation was shown at the lower bound, the upper bound was tested. If statistically significant model sensitivity was detected at the upper bound, the pool median was tested. Then, the search proceeded by splitting the remaining intervals in halves: testing the lower partition to identify the statistical significance threshold, and the upper partition to verify statistical non\nFor center point and bounding box, comparing the intra- and inter-rater consistency revealed a statistically significant difference for all four datasets (p-values < 0.05/4 = 0.0125), with intra-rater annotations demonstrating higher consistency compared to inter-rater annotations (Table 3).\nDataset-specific performance. For both human prompts, there were considerable differences across data subsets and class labels in terms of localization errors and intra-rater consistency (Figure 3). The annotations for the datasets Lower Leg and Hip showed high localization errors and low intra-rater consistency, mostly due to the class tibia\nand then center point (Table 4, Figure 5a, Table D.11). The overall best model was SAM2.1 T with combination prompting (91.83% DSC, 98.38% NSD, 0.71mm HD95).\nTable 2: Annotation performance for human bounding box and center point variations, reported as median (IQR).",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 21,
    "total_chunks": 71,
    "char_count": 1600,
    "word_count": 234,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c01f5aaf-7a73-466d-a350-7d19e9ec377a",
    "text": "Metrics (Bounding Box / Center Point)\nAnnotation accuracy in % ↑\nIoU: 90.56 (83.4-94.5) / -\nLocalization error in mm ↓\nEucl. distance: 0.49 (0.0-1.4) / 1.50 (0.7-3.0)\n|∆x|: 0.00 (0.0-0.9) / 0.98 (0.3-1.9)\n|∆y|: 0.00 (0.0-0.9) / 0.97 (0.3-1.9)\n|∆w|: 1.30 (0.3-2.9) / -\n|∆h|: 1.46 (0.3-2.9) / -\n∆x: 0.00 (0.0-0.0) / 0.33 (-0.3-1.5)\n∆y: 0.00 (0.0-0.3) / 0.33 (0.0-1.5)\n∆w: 0.83 (0.0-2.0) / -\n∆h: 0.98 (0.0-2.5) / -\nDetection performance in % ↑\nRecall: 96.1 / 95.5\n3D models. The bounding box and combined prompting strategies achieved higher performance than center-point prompts (Table 4, Figure 5b, Table D.11). The overall best model was Med-SAM2 with bounding box prompting (79.56% DSC, 80.25% NSD, 13.49mm HD95).\nSelected Models. Considering prediction dimensionality (2D vs. 3D), training data domain (medical vs. natural), and prompting strategy (bounding box, center point, combination), the smallest Pareto-optimal models for prompting with reference prompts were collected in Table 4. Focusing only on dimensionality, ignoring the training data domain, the Pareto-optimal models with the fewest parameters per prompt type were: SAM2.1 B+ (2D bounding box), SAM B (2D center point), SAM2.1 T (2D center point), Med-SAM2 (3D bounding box), nnInteractive (3D center point, 3D combination). These models were marked with gray cells in Table 4 and large symbols in Figure 5.",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 22,
    "total_chunks": 71,
    "char_count": 1328,
    "word_count": 191,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "40d5fc0e-840b-4308-a3c6-9b0fa60114ae",
    "text": "Table 3: Intra-rater and inter-rater annotation consistency for human bounding box and center point, reported as median (IQR). 
Metrics Bounding Box (intra inter) Point (intra inter)\nAgreement in % ↑\nIoU 93.04(88.5-96.1) 90.11(84.2-94.0) - -\nVariability in mm ↓\nEucl. distance 0.49(0.0-1.4) 0.73(0.3-1.5) 0.98(0.5-1.9) 1.37(0.7-2.6)\n|∆x| 0.33(0.0-0.9) 0.33(0.0-1.0) 0.49(0.0-1.5) 0.83(0.0-1.6)\n|∆y| 0.33(0.0-0.8) 0.33(0.0-1.0) 0.49(0.0-1.5) 0.83(0.0-1.6)\n|∆h| 0.65(0.0-1.5) 0.98(0.3-2.6) - -\n|∆w| 0.65(0.0-1.5) 0.98(0.3-2.6) - -\nSystematic differences in mm ↓\n∆x 0.00(-0.3-0.3) 0.00(-0.3-0.3) 0.00(-0.5-0.5) 0.00(-0.3-0.3)\n∆y 0.00(-0.3-0.0) 0.00(-0.3-0.3) 0.00(-0.5-0.5) 0.00(-0.8-0.8)\n∆w 0.00(-0.7-0.7) 0.00(-1.0-1.0) - -\n∆h 0.00(-0.7-0.7) 0.00(-1.0-1.0) - -\nComparing SAM2 with SAM2.1 and investigating variations of the 3D prompting strategies for automated extracted prompts showed the following trends:\n• There was no statistically significant difference between SAM2 and SAM2.1, except for the tiny (T) models (Appendix E.1).\n• Limiting the propagation in SAM2.1 and Med-SAM2 prevented over-segmentation at the top and bottom of an object, which improved performance (Appendix E.2).\n• Medical FMs benefit from multiple initial slices more than SAM2.1 models (Appendix E.3). With multiple initial slices, nnInteractive exceeded the performance of Med-SAM2, which was the Pareto-optimal model for the default settings (i.e., with a single initial slice).\n• There was only a marginal difference (mostly statistically non-significant) between using a single or multiple prompts for 3D prompting (Appendix E.4).\nbone for center points (Figure C.8) and tibia implant and hip for bounding box (Figure C.10), while annotations in the dataset Wrist showed overall the lowest localization errors and high intra-rater consistency. Detailed results and visualizations per data subset and class labels are available in the Appendix C.\nAnnotation Time. The average annotation time was 4.22 sec for a center point and 11.37 sec for a bounding box. The annotators required between 5 and 14 hours to complete the project (excluding training phase), with a median of 8 hours and 48 min (IQR: 7 hours to 11 hours and 18 minutes) (Figure 4).\n4.3. Segmentation performance with human prompts\n2D models. SAM and SAM2.1 maintained their superior performance compared to medical FMs, mirroring the trends seen with reference prompts (Table 5). The overall best model was again SAM2.1 T with combination prompting (89.65% DSC, 97.71% NSD, 1.06mm HD95).\n3D models. All 3D medical FMs consistently outperformed SAM2.1 for all prompt types (Table 5). The overall best model was Med-SAM2 with bounding box prompting (77.05% DSC, 79.47% NSD, 14.36mm HD95).\nSegmentation consistency. Intra-rater consistency is high for all FMs (Table 5).\nSegmentation performance with reference prompts\n2D models. For reference prompts, the combined prompting",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 23,
    "total_chunks": 71,
    "char_count": 2914,
    "word_count": 418,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8df550bd-d97c-405d-8392-17c745f35090",
    "text": "strategy worked the best, followed by the bounding box. Notably, consistency was most pronounced in the top-performing models.",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 24,
    "total_chunks": 71,
    "char_count": 129,
    "word_count": 19,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e30e6000-a3be-49b9-be7a-65c0f86532c0",
    "text": "Figure 3: Spatial distribution of prompt placement per annotator per dataset. Each point corresponds to one annotator. It represents the mean deviations (mm) in the x- (∆x) and y-directions (∆y) of the center point (i.e., human (a) or extracted from the bounding box (b)), relative to the reference prompt at the origin (0,0). The same-colored (more transparent) ellipse (a) and rectangles (b) represent each annotator's intra-rater consistency ((a): (∆x, ∆y), (b): (∆w, ∆h)). Wrist shows the least localization errors and highest consistency, while Lower Leg and Hip show high localization errors and low intra- and inter-rater consistency.",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 25,
    "total_chunks": 71,
    "char_count": 641,
    "word_count": 97,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "179d7e9a-eb93-477d-a0ec-4670748c6eab",
    "text": "Model sensitivity to input prompt variations. For all models and prompt types, the correlation coefficient showed decreasing intra-rater segmentation consistency with increasing prompt variability (Table 6, Figure 6). All 2D models and box-prompted 3D models showed statistically significant correlation (p-values < 0.05/(13 × 3) = 0.0013) for intra-rater variability. 
Only nnInteractive combination-prompted and SAM2.1 T point-prompted were robust for intra-rater variability. These two models were analyzed for the inter-rater annotation variability and segmentation consistency in an iterative search pattern based on a sorted annotator pool (Table C.10), to identify the threshold for model sensitivity. The results showed that SAM2.1 T point-prompted was sensitive to the lowest inter-rater variability and nnInteractive combination-prompted was sensitive for the sixth lowest inter-rater variability (Table 7) (p-values < 0.05/(20 × 3) = 0.00083).\nFigure 4: Accumulated annotation time per annotator for all projects.\n4.4. Performance gap between reference- and human-prompted results\n2D models exhibited a statistically significant decline in performance when transitioning to human prompts (2.07% DSC, 0.87% NSD, −0.25mm HD95; p-values < 0.05/6 = 0.0083), while 3D models showed a smaller but still statistically significant performance drop compared to their reference-prompted counterparts (1.06% DSC, 0.47% NSD, −0.39mm HD95; with p-values < 0.05/6 = 0.0083) (Table F.16).\n5. Discussion\nOur study quantified intra- and inter-rater variability in human prompts and analyzed their impact on segmentation consistency of Pareto-optimal FMs for MSK CT application, across four anatomical regions. The main findings in analyzing the model sensitivity to input prompt variations were: 1.) All 2D models showed sensitivity to prompt variations. 2.) 3D models SAM2.1 T point-prompted and nnInteractive combination-prompted showed robustness for intra-rater variations, but not for all inter-rater\nTable 4: Segmentation performance with reference prompts of Pareto-optimal models per prompt type (i.e., bounding box, center point, combination) and category (2D vs. 3D; medical vs. natural). The Pareto-optimal models with the least parameters per category are highlighted in bold (selected for further analysis with human prompts). 
Gray-shaded cells and prompt symbols next to the model names indicate the smallest Pareto-optimal models per prompt type. Non-Pareto-optimal results are omitted in this Table and can be found in Table D.11.",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 26,
    "total_chunks": 71,
    "char_count": 2527,
    "word_count": 353,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "15fda36b-67bb-4f87-ad14-86d6d96a905f",
    "text": "Model Bounding Box 2D or 3D Center Point (2D) or (3D) Combination (2D) or (3D) Size DSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ M % % mm % % mm % % mm MedicoSAM2D 94 90.74±7.7 97.36±3.6 0.76±0.9 77.46±19.3 83.23±18.4 5.00±5.9 91.27±7.4 97.74±3.3 0.69±0.8 SAM-Med2d 271 - - - 73.69±17.0 84.48±14.9 5.35±5.0 - - - medical\nScribblePrompt-SAM 94 - - - 74.19±14.6 84.22±12.6 6.30±5.3 - - - SAM B 94 - - - 85.43±14.4 90.82±13.0 4.83±6.3 - - - SAM2.1 B+ 81 90.60±8.1 97.84±3.5 0.82±1.0 - - - 91.98±7.2 98.21±3.6 0.73±1.1\nnatural SAM2.1 L 224 - - - - - - 90.90±6.9 98.36±3.2 0.69±1.0\nSAM2.1 S 46 - - - - - - 91.51±7.0 98.40±3.3 0.69±0.9 SAM2.1 T 39 - - - - - - 91.83±6.9 98.38±3.2 0.71±1.0 3D Models\nmedical nnInteractiveMed-SAM2 10239 79.56±11.1- 80.25±10.5- 13.49±11.1- 69.40±11.2- 68.23±12.0- 30.98±9.4- 75.92±9.4- 76.60±9.6- 26.53±10.3- SAM2.1 B+ 81 66.11±10.1 66.59±10.0 24.77±18.1 - - - 68.33±9.4 67.86±10.2 26.04±18.2 SAM2.1 S 46 67.69±10.2 68.48±10.0 31.67±21.6 56.90±19.1 53.96±20.2 47.84±31.2 70.22±10.1 69.88±10.7 32.21±22.0 natural\nSAM2.1 T 39 - - - 54.74±15.9 52.92±16.9 46.40±28.5 - - - (a) All 12 2D models evaluated 
slice-wise. (b) All 13 3D models evaluated volumetrically. HD95 (mm) performance of all models (color-encoded) and three prompt types (symbol-encoded) for perfect prompts. Larger symbols highlight the smallest Pareto-optimal models.\nvariations. 3.) Performance estimates of \"ideal\" prompting (i.e., reference prompts) do not translate to a human-driven setting.\nthan point prompts, likely because defining a precise point for complex geometries is less intuitive for annotators than defining spatial boundaries.",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 27,
    "total_chunks": 71,
    "char_count": 1647,
    "word_count": 266,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d5be5823-bd11-48d3-8af1-87b2144b35b8",
    "text": "less evident (Figure 7).\n5.2. Segmentation analysis\nIn both 2D and 3D models, there were considerable performance variations for reference prompts across model types and prompting strategies (Figure 5, Table D.11). For 2D medical FMs, MedicoSAM showed high performance\nHuman prompt analysis\nThere were considerable differences across data subsets and class labels for human prompts (Tables C.8, C.9), but some consistent findings emerged. Circular structures (e.g., humerus, wrist bones) showed high rater consistency. Point placement was more prone to deviation in elongated, thin, or annular bone shapes (e.g., scapula, femur with metal implant, see Figures C.8, C.9). 
For bounding boxes, consistency decreased in structures with complex topologies and multiple components (e.g., scapula, metal implants, see Figures C.10, C.11). Overall, bounding box prompts demonstrated higher accuracy and consistency\ncompared to its alternatives Med-SAM and SAM-Med2D, which is likely due to its training on a complex objective (in contrast to Med-SAM) while keeping the SAM architecture without adapters (in contrast to SAM-Med2D) [11]. However, going to 3D, its propagation is outperformed by\nTable 5: Segmentation performance and intra-rater segmentation consistency with human prompts – grouped by prompt type (bounding box, center point, combination). The best values per prompt type are highlighted in bold. The best performing models also showed the highest consistency. The performance difference to perfect prompts is shown in Table F.16.",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 29,
    "total_chunks": 71,
    "char_count": 1542,
    "word_count": 223,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "95096652-d006-4fa8-bc3a-cddc521b1fa4",
    "text": "Model Bounding Box 2D or 3D Center Point (2D) or (3D) Combination (2D) or (3D) Size DSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ M % % mm % % mm % % mm Segmentation performance",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 30,
    "total_chunks": 71,
    "char_count": 188,
    "word_count": 46,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a769e16f-e062-4f59-b28e-e9c7f10e8bc3",
    "text": "2D Models\nmedical MedicoSAM2DScribblePrompt-SAM 9494 86.12±13.6- 95.40±5.8- 1.26±1.7- 72.50±18.075.95±20.7 84.26±12.983.54±16.6 6.39±5.35.13±5.8 86.53±13.0- 95.09±7.0- 1.38±2.1- SAM B 94 - - - 83.69±17.5 90.99±11.6 4.85±6.2 - - - SAM2.1 B+ 81 87.86±12.8 96.80±5.0 1.15±1.6 - - - - - - natural\nSAM2.1 T 39 - - - - - - 89.65±10.8 97.41±4.7 1.06±1.7 3D Models\nmedical Med-SAM2nnInteractive 10239 76.80±13.5- 79.27±11.2- 14.46±11.8- 68.12±12.6- 68.63±11.5- 30.10±8.8- 75.59±10.6- 77.29±9.1- 25.65±9.5-\nnatural SAM2.1SAM2.1 ST 4639 65.93±11.6- 67.83±10.2- 32.71±21.6- 53.72±16.3- 52.93±16.5- 46.84±27.8- 68.80±11.2- 69.19±10.9- 33.88±22.4- Segmentation consistency",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 31,
    "total_chunks": 71,
    "char_count": 659,
    "word_count": 80,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5efa4186-34ef-4ee2-82ae-327641110467",
    "text": "2D Models\nmedical MedicoSAM2DScribblePrompt-SAM 9494 95.35±8.1- 99.20±1.8- 0.38±0.8- 93.13±13.196.27±10.8 96.45±9.398.29±6.3 1.68±5.00.99±3.9 95.15±9.7- 98.48±3.9- 0.64±1.7- SAM B 94 - - - 97.17±9.4 99.07±3.5 0.49±1.9 - - - SAM2.1 B+ 81 97.13±8.1 99.40±1.8 0.26 ±0.8 - - - - - - natural\nSAM2.1 T 39 - - - - - - 97.71±9.2 99.38±2.1 0.31±1.2 3D Models\nmedical Med-SAM2nnInteractive 10239 88.13±20.0- 90.79±16.2- 7.58±15.1- 84.89±20.6- 86.71±18.0- 7.32±12.7- 88.44±17.7- 88.75±15.2- 8.05±12.9-\nnatural SAM2.1SAM2.1 ST 4639 85.45±20.9- 87.63±17.8- 16.94±31.3- 79.63±28.5- 80.52±27.2- 23.94±40.8- 87.46±19.6- 88.28±17.6- 16.64±30.5-",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 32,
    "total_chunks": 71,
    "char_count": 627,
    "word_count": 79,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "91077add-65ab-47ea-8f3f-e94bc67b271a",
    "text": "Table 6: Spearman's rank correlation coefficients for each metric (ρDSC, ρNSD, ρHD95) between intra-rater annotation variability and segmentation consistency. Asterisks (∗) denote statistical significance. 
Positive values for HD95 and negative values for DSC and NSD indicate that increased prompt variability significantly reduces segmentation consistency.",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 33,
    "total_chunks": 71,
    "char_count": 357,
    "word_count": 44,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "058b4624-87d8-4342-a956-68c5f9b36898",
    "text": "Model Bounding Box 2D or 3D Center Point (2D) or (3D) Combination (2D) or (3D)\nSize ρDSC ↑ρNSD ↑ ρHD95 ↓ ρDSC ↑ρNSD ↑ ρHD95 ↓ ρDSC ↑ρNSD ↑ ρHD95 ↓\nM % % mm % % mm % % mm 2D Models\nmedical MedicoSAM2DScribblePrompt-SAM 9494 -0.36*- -0.48*- 0.49*- -0.57*-0.58* -0.55*-0.54* 0.59*0.48* -0.45*- -0.54*- 0.50*- SAM B 94 - - - -0.53* -0.52* 0.46* - - - SAM2.1 B+ 81 -0.33* -0.38* 0.41* - - - - - - natural\nSAM2.1 T 39 - - - - - - -0.60* -0.59* 0.55* 3D Models\nmedical Med-SAM2nnInteractive 10239 -0.38*- -0.41*- 0.50*- -0.32*- -0.30*- 0.35*- -0.09- -0.11- 0.16-\nnatural SAM2.1SAM2.1 ST 4639 -0.23*- -0.29*- 0.41*- -0.05- -0.02- 0.12- -0.31*- -0.32*- 0.41*-\nnative 3D models such as Med-SAM2 and nnInteractive. While MedicoSAM3D projects the prediction of adjacent slices, Med-SAM2 leverages the memory bank mechanism of SAM2 and nnInteractive integrates user prompts as an additional input channel for 3D feature extraction, which is less prone to error propagation by design.\nSeveral 3D medical FMs perform significantly worse than others. 
For SAM-Med3D, resampling the entire image\ndistribution. Even with cropping, performance remains below competitive levels. SegVol and Vista3D prompted with center points also demonstrate suboptimal results,\n(a) 2D Models (b) 3D Models\nFigure 6: Model sensitivity to input variations visualized as intra-rater annotation variability (Euclidean distance or IoU) vs. segmentation consistency (DSC). Each point represents the mean prompt variability and mean segmentation consistency for one sample. Dotted lines represent ordinary least squares (OLS) linear regression trends. Shaded areas denote the 95% confidence intervals (α = 0.05).\nTable 7: Spearman's rank correlation coefficients for each metric (ρDSC, ρNSD, ρHD95) between inter-rater annotation variability and segmentation consistency. Annotator rows are ordered by Euclidean distance (mm), starting with the lowest-variance rater.",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 34,
    "total_chunks": 71,
    "char_count": 1922,
    "word_count": 292,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d39224cb-1b1e-4f26-8297-b3a267a03aaf",
    "text": "Due to iterative search, not all inter-rater variabilities are tested (Table C.10). The first column indicates the order of tests. Asterisks (∗) denote statistical significance. Positive values for HD95 and negative values for DSC and NSD indicate that increased prompt variability significantly reduces segmentation consistency.\nwhich is likely due to the underlying training data, favoring abdominal and thoracic organ segmentation over bone 
and metal implant segmentation.\nA direct comparison between 2D and 3D models is limited by fundamental differences in evaluation and prompting strategies. 3D performance was measured across entire volumes, where error propagation in more distal slices can lower overall metrics, whereas 2D models were evaluated on single slices without such penalties. In addition, 2D methods utilized prompts per component (i.e., multiple prompts per object), while 3D models were often restricted to a single prompt per object, especially for bounding box prompting. Due to these differences, we treated 2D and 3D models as two different categories of models in our analysis.\nAnnotator Eucl. distance (mm) ρDSC ↑ (%) ρNSD ↑ (%) ρHD95 ↓ (mm)\nnnInteractive\n1 Annotator02 1.67±2.8 -0.19 -0.15 0.14\n5 Annotator05 1.85±3.2 -0.21 -0.21 0.18\n6 Annotator14 1.86±3.2 -0.22 0.18 -0.22\n7 Annotator20 1.86±3.2 -0.26* 0.20 -0.26*\n4 Annotator01 1.87±3.3 -0.27* -0.27* 0.23*\n3 Annotator09 1.96±3.1 -0.25* -0.27* 0.24*\n2 Annotator07 2.40±3.6 -0.37* -0.38* 0.35*\nSAM2.1 T\n1 Annotator15 2.51±5.5 -0.23* 0.30* -0.29*\nWhile performance findings remain consistent for both prompt sources (i.e., perfect and human), the performance drops for human prompts. 
This suggests that reference prompts are more \"ideal\" for optimizing model output, indicating that standard benchmarks might overestimate achievable performance in practical, human-driven settings.\nwithout cropping to 128×128×128 leads to notable loss of performance, likely due to image distortion and misalignment of the object of interest relative to the training data\n(a) SAM2.1 T point-prompted; Wrist sample showing ulna and radius; Despite small intra-rater variation, the resulting 3D prediction shows large differences (72.5% DSC, 69.0% NSD, 24.8mm HD95). (b) nnInteractive point-prompted; Wrist sample showing ulna and radius; Small intra-rater variation with small differences in resulting 3D prediction (98.7% DSC, 100.0% NSD, 0.3mm HD95). (c) MedicoSAM2D (first row), ScribblePrompt-SAM (second row); Hip sample showing left/right hip; Varying model sensitivity for input prompt variations. (d) SAM2.1 S box-prompted; Hip sample showing left/right hip; Despite small intra-rater variation, the resulting 3D prediction shows large differences (64.1% DSC, 62.9% NSD, 48.6mm HD95).\nFigure 7: Visual examples for model sensitivity to input prompt variations: reference mask, predicted mask with reference prompt, predicted mask with 1st set of human prompt and with 2nd set of the same annotator. The reference prompt is drawn as black point or box. The human prompt is drawn as colored point or box.\nVisual inspection of segmentation results revealed three",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 35,
    "total_chunks": 71,
    "char_count": 3130,
    "word_count": 452,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "6e417a19-bc1d-41e9-bcd4-9e0f80e1db9c",
    "text": "common mistakes (Figure 7), which also explain poor performance metrics: 1.) Anatomical Ambiguity: Due to the different Hounsfield Unit (HU) values for cortical bone and trabecular bone, models struggled to differentiate between these structures and the combined total bone volume. This issue is caused by prompt ambiguity, where a point or box may not clearly define whether the user intends to segment the entire bone or just a specific layer. 2.) Oversegmentation: Both 2D and 3D architectures sometimes failed to identify clear anatomical boundaries. For 2D models, this typically resulted in the prediction extending beyond the bone contour within a single slice. For 3D models, these boundary failures were magnified by the additional spatial dimension, allowing errors to propagate and grow into neighboring structures, even across joint spaces. This propagation error suggests that the models lack a robust volumetric \"stop\" signal. 3.) Undersegmentation: In regions with fading or fluctuating intensity values, models sometimes stopped the predictions too early.\n5.3. Model sensitivity to input prompt variations\nAn inverse relationship was observed between prompt variability and segmentation consistency; as input prompt variability increases, segmentation consistency declines. While the high values for segmentation consistency suggest that the resulting masks remain mostly similar, the models nonetheless show sensitivity where even minor changes in the input prompt can trigger large changes in the output segmentation (Figure 6, Figure 7). Consequently, sensitivity to prompt fluctuations should be considered a critical performance metric for the development and real-world evaluation of FMs, particularly in domains where user input can inherently vary.\nIntra-rater annotation variability was consistently lower than even the most stable inter-rater setting for both point prompts (intra-rater Euclidean distance of 2.00 ± 5.3 mm vs. lowest pairwise Euclidean distance of 2.51 ± 5.5 mm) and combined prompts (intra-rater Euclidean distance of 1.41 ± 2.4 mm vs. lowest inter-rater Euclidean distance of 1.68 ± 2.8 mm). Therefore, if a model demonstrated sensitivity to the variations within a single annotator, it likely exhibits similar or greater sensitivity to the larger fluctuations between annotators. For models that did not show statistically significant correlation for the intra-rater setting, the correlation for inter-rater settings was tested as well.",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 36,
    "total_chunks": 71,
    "char_count": 2484,
    "word_count": 360,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0689936d-74c6-4d79-a6ec-b1aeb23a24cb",
    "text": "While all 2D models and box-prompted 3D models exhibited sensitivity to prompt fluctuations already for the intra-rater setting, nnInteractive combination-prompted and SAM 2.1 T point-prompted showed a lack of statistically significant correlations, suggesting model robustness to intra-rater input prompt variations. Testing them further with inter-rater settings, SAM2.1 T point-prompted showed sensitivity at the first inter-rater iteration (2.51 ± 5.5), while nnInteractive combination-prompted showed sensitivity at the sixth inter-rater level with 1.87 ± 3.3. Thus, no tested model is robust against large fluctuations between annotators, but nnInteractive shows the least sensitivity. It is critical to emphasize that model sensitivity should not be viewed as an isolated performance metric. It must be evaluated in combination with absolute performance and segmentation consistency to ensure a more complete evaluation.\nis often to fully automate the segmentation process without user interaction.\nGeometric prompting. Aside from geometric prompting, also text prompts become more popular and are for example integrated in the recently released SAM3 framework [7]. Text prompts remove user interaction and therefore geometric variations, and could potentially be automated for specific medical tasks if always the same structures should be segmented.\n6. Conclusion\nThe observed performance drop when transitioning from",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 37,
    "total_chunks": 71,
    "char_count": 1415,
    "word_count": 194,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "6c1e3d2b-fa1d-40a1-95de-048c555267af",
    "text": "idealized reference prompts to human inputs, and the sensitivity to human prompt fluctuations across models, show that prompt placement matters.\nConsidering that, nnInteractive combination-prompted presented itself as the best option of all tested models. 
Our findings suggest that\nsegmentation performances derived from \"ideal\" (i.e., reference prompts) may not accurately reflect performance5.4. Limitations & Future Work\nin human-driven settings. Consequently, model sensitivDataset. The TotalSegmentator dataset was used to train\nity to prompt variability should be established as a comsome of the investigated FMs. Not all FMs reported a deplementary performance metric for the development and\ntailed train–test split (Table 1). However, by introducing\nreal-world evaluation of promptable FMs. This would help\nthe new classes femur implant left and right, the evalubridging the gap between theoretical potential and practiated task extended beyond the original training labels and\ncal application.\nposed a new task unseen by the FMs, even if the selected\ntest samples were included in previous training.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 38, + "total_chunks": 71, + "char_count": 1108, + "word_count": 156, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "063f19a3-2054-4941-9d93-3f2c6d25f91b", + "text": "Acknowledgments\nAxial slices. We limited our study to axial slices to limit\nthe workload for annotators. Sagittal and coronal slices, We thank all the students, who participated in the obwhich are often underexplored, could serve as a meaningful server study and made the collection of human prompts\nalternative or complementary source of information. possible. 
We also want to thank Dieuwertje Luitse, for her\ninput to study questionnaires sent to the students at the\nObserver Study. The annotators in our observer study are beginning and end of their study participation to collect\nmedical students rather than trained radiologists, primar- additional information about the study participants and\nily due to availability. However, the results indicate that their study experience. We thank Thomas Koopman and\nextensive medical training may not be required for the in- the team from grand-challenge for their great help with\nvestigated tasks, although this may not generalize to more setting up the observer study.\ncomplex clinical applications such as tumor identification. Non-iterative prompting. Our study was conducted in a References\nstatic setting without iterative refinement or segmentation\ncorrection. While interactive workflows are important for [1] A. Rolland,\nreal-world deployment, they increase the complexity of the L.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 39, + "total_chunks": 71, + "char_count": 1336, + "word_count": 197, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c85fae7f-a9f2-48b5-9ab1-985a50cedd63", + "text": "Berg,\nevaluation, as the individual contributions of the interac- W.-Y. Girshick, Segment anything\ntion step and their effect on model sensitivity would be (2023). arXiv:2304.02643.\nmore difficult to isolate and quantify. 
Future evaluation URL https://arxiv.org/abs/2304.02643\nstudies should be conducted to analyze interactive refinement efficiency, which may mitigate commonly observed [2] Y. Chang,\nsegmentation mistakes. For example, the impact of severe X. Chen,\noversegmentation and volumetric leakage could be miti- S. Grau,\ngated by the strategic use of negative prompts to define D.-P. Ni, Segment anyexclusion zones. Similarly, anatomical ambiguity could be thing model for medical images?, Medical Imovercome by several carefully placed positive prompts until age Analysis 92 (2024) 103061. doi:https:\nthe the desired anatomical boundary is reached. However, //doi.org/10.1016/j.media.2023.103061.\na disadvantage of iterative refinement is the additionally URL https://www.sciencedirect.com/science/\nrequired user interaction and time, where the ultimate goal article/pii/S1361841523003213", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 40, + "total_chunks": 71, + "char_count": 1100, + "word_count": 140, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "733c0119-0a82-43a3-b5df-c61e541179dc", + "text": "Qiao, Sam-med2d (2023). Streekstra, Joint space narrowing in pa- arXiv:2308.16184.\ntients with pisotriquetral osteoarthritis, HAND\n[10] H. Dalca, Scrib- 12 (5) (2017) 490–492, pMID: 28832198. arXiv:\nbleprompt: Fast and flexible interactive segmentation https://doi.org/10.1177/1558944716677542,\nfor any biomedical image, European Conference on doi:10.1177/1558944716677542. Computer Vision (ECCV) (2024). URL https://doi.org/10.1177/\n1558944716677542 [11] A. 
Pape, Medicosam: Towards foundation models for medical image segmen-\n[4] C. Kievit,\ntation (2025). arXiv:2501.11734. Streekstra,\nURL https://arxiv.org/abs/2501.11734 C. Blankevoort, Automation in tibial\nimplant loosening detection using deep-learning [12] J. Wu, Medical sam 2: Segment medical\nsegmentation, International Journal of Computer images as video via segment anything model 2 (2024). Assisted Radiology and Surgery 20 (2025) 2065–2073. arXiv:2408.00874.\ndoi:10.1007/s11548-025-03459-1. URL https://arxiv.org/abs/2408.00874\nURL https://doi.org/10.1007/s11548-025-\n03459-1 [13] J. J. sam2: Segment anything in 3d medical images and\nKerkhoffs, G. J. van den videos (2025). arXiv:2504.03600. P. van Deurzen, URL https://arxiv.org/abs/2504.03600\nMinimal but potentially clinically relevant anteroinferior position of the humeral head following traumatic [14] H.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 41, + "total_chunks": 71, + "char_count": 1326, + "word_count": 153, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "63d9101f-8b4a-425a-ace9-3e8f9ac002ca", + "text": "Li,\nanterior shoulder dislocations: A 3d-ct analysis, J. Zhang,\nJournal of Orthopaedic Research 42 (8) (2024) J. Qiao, Sam-med3d: Towards general-purpose\n1641–1652. arXiv:https://onlinelibrary. segmentation models for volumetric medical images\nwiley.com/doi/pdf/10.1002/jor.25831, (2024). arXiv:2310.15161.\ndoi:https://doi.org/10.1002/jor.25831. URL https://arxiv.org/abs/2310.15161\nURL https://onlinelibrary.wiley.com/doi/\n[15] Y. 
Zhao, Segvol: Universal\nabs/10.1002/jor.25831\nand interactive volumetric medical image segmentation (2025). arXiv:2311.13385.[6] N. Ryali,\nURL https://arxiv.org/abs/2311.13385 T. Carion, C.-Y. [16] Y. Feichtenhofer, Sam Z.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 42, + "total_chunks": 71, + "char_count": 654, + "word_count": 63, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3714f5ff-d666-4d9e-b12e-e9fe4f76202f", + "text": "Har-\n2: Segment anything in images and videos, arXiv mon, B. Li, Vista3d: A unified\npreprint arXiv:2408.00714 (2024). segmentation foundation model for 3d medical imagURL https://arxiv.org/abs/2408.00714 ing (2024). arXiv:2406.05285. URL https://arxiv.org/abs/2406.05285[7] N. Maier-Hein, nninteractive: Redefining 3d promptmeni, R. Li, able segmentation (2025). arXiv:2503.08373. Ravi, URL https://arxiv.org/abs/2503.08373\nK. Feichtenhofer, Sam 3: Segment anything with concepts (2025). arXiv:2511. [18] B. Merhof, Foundational\nURL https://arxiv.org/abs/2511.16719 models in medical imaging: A comprehensive survey\nand future vision (2023). arXiv:2310.18689.\n[8] J. Wang, Seg- URL https://arxiv.org/abs/2310.18689\nment anything in medical images, Nature Communications 15 (2024) 1–9. [19] Y. Jiao, Segment anything\nmodel for medical image segmentation: Current\n[9] J. Wang, applications and future directions, Computers in BiY. He, ology and Medicine 171 (2024) 108238. doi:https: //doi.org/10.1016/j.compbiomed.2024.108238. [28] C. 
KupssinURL https://www.sciencedirect.com/science/ skü, O.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 43, + "total_chunks": 71, + "char_count": 1091, + "word_count": 126, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1efa5fda-14bb-493b-8502-10acbad0990b", + "text": "Barros, Zeroarticle/pii/S0010482524003226 shot performance of the segment anything model\n(sam) in 2d medical imaging: A comprehensive eval-\n[20] S. Lee, A narrative review of foundation uation and practical guidelines, in: 2023 IEEE 23rd\nmodels for medical image segmentation: zero-shot International Conference on Bioinformatics and Bioperformance evaluation on diverse modalities, Quan- engineering (BIBE), 2023, pp. 108–112. doi:10.\ntitative Imaging in Medicine and Surgery 15 (6) 1109/BIBE60311.2023.00025.\n(2025). URL https://qims.amegroups.org/article/ [29] H. A.\nview/138057 Mazurowski, Segment anything model 2: an application to 2d and 3d medical images (2024). arXiv:\n[21] M. Yao, A re- URL https://arxiv.org/abs/2408.00756\nview of the segment anything model (sam) for\nmedical image analysis: Accomplishments and [30] S. Soni, Is sam 2 better\nperspectives, Computerized Medical Imaging than sam in medical image segmentation? (2024).\nand Graphics 119 (2025) 102473. doi:https: arXiv:2408.04212.\n//doi.org/10.1016/j.compmedimag.2024.102473. URL https://arxiv.org/abs/2408.04212\nURL https://www.sciencedirect.com/science/\n[31] J. Wang,\narticle/pii/S0895611124001502\nL. Ren, Sam 2 in robotic surgery: An em-\n[22] D. 
Kang, pirical evaluation for robustness and generalization\nA. Mukasheva, A review of deep learning approaches in surgical video segmentation (2024). arXiv:2408.\nbased on segment anything model for medical image 04593.\nsegmentation, Bioengineering 12 (12) (2025). URL https://arxiv.org/abs/2408.04593\nURL https://www.mdpi.com/2306-5354/12/12/\n[32] Y. Unberath, Perfor-\nmance and non-adversarial robustness of the seg-\n[23] P. Ma, ment anything model 2 in surgical video segmentation\nQ. Chang, Vision foundation models in medical image (2024). arXiv:2408.04098.\nanalysis: Advances and challenges (2025). arXiv: URL https://arxiv.org/abs/2408.04098\n2502.14584.\n[33] C.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 44, + "total_chunks": 71, + "char_count": 1888, + "word_count": 234, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "15b9df20-7980-415f-b1fb-2ae8f4c15fd6", + "text": "Kervadec, Zero-shot URL https://arxiv.org/abs/2502.14584\ncapability of 2d SAM-family models for bone seg-\n[24] S. Rokuss, mentation in CT scans, in: Medical Imaging with\nN.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. 
Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 45, + "total_chunks": 71, + "char_count": 172, + "word_count": 24, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1053bdd2-a394-4180-9d1d-ea3bf5fa361f", + "text": "Maier- Deep Learning, 2025. Hein, Sam.md: Zero-shot medical image segmen- URL https://openreview.net/forum?id=\ntation capabilities of the segment anything model AUv6NhK9aH\n(2023). arXiv:2304.05396.\n[34] C. URL https://arxiv.org/abs/2304.05396\nJaeger, K. Maier-Hein, Radioactive: 3d radiological\n[25] S. E. interactive segmentation benchmark (2025). arXiv:\nGrant, Y. Ou, Computer-vision benchmark segment- 2411.07885.\nanything model (sam) in medical images: Accuracy URL https://arxiv.org/abs/2411.07885\nin 12 datasets (2023). arXiv:2304.09324.\n[35] J. Pradella, URL https://arxiv.org/abs/2304.09324\nD. Segeroth, TotalsegmentaN. Zhang, Segment anything model for tor: Robust segmentation of 104 anatomic structures\nmedical image analysis: An experimental study, in ct images, Radiology: Artificial Intelligence 5 (5)\nMedical Image Analysis 89 (2023) 102918. doi: (2023) e230024. doi:10.1148/ryai.230024.\nhttps://doi.org/10.1016/j.media.2023.102918. URL https://doi.org/10.1148/ryai.230024\nURL https://www.sciencedirect.com/science/\n[36] P. Gu,\narticle/pii/S1361841523001780\nH. Li, Zhou, Deep learning to segment pelvic bones: LargeSam on medical images: A comprehensive study on scale ct datasets and baseline models (2021). arXiv:\nthree prompt modes (2023). arXiv:2305.00035. 2012.08721. URL https://arxiv.org/abs/2305.00035 URL https://arxiv.org/abs/2012.08721 Löffler, (2024). arXiv:2403.15063. Payer, URL https://arxiv.org/abs/2403.15063\nD. 
Štern, M.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 46, + "total_chunks": 71, + "char_count": 1453, + "word_count": 158, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8a75c017-0cae-4e6b-b056-3c7f4a3ae433", + "text": "Kirschke, Verse: A vertebrae labelling and segmentation benchmark for multi-detector ct images,\nMedical Image Analysis 73 (2021) 102166. doi:\nhttps://doi.org/10.1016/j.media.2021.102166. URL https://www.sciencedirect.com/science/\narticle/pii/S1361841521002127 Jäger, et al., Metrics\nreloaded: recommendations for image analysis\nvalidation, Nature Methods 21 (2024) 195–212. URL https://doi.org/10.1038/s41592-023-\n02151-7 Kervadec, Distorch: A fast gpu implementation of 3d hausdorff distance (2025). URL https://github.com/jeromerony/distorch Streekstra, Evaluation of a ct-based technique\nto measure the transfer accuracy of a virtually planned osteotomy, Medical Engineering &\nPhysics 36 (8) (2014) 1081–1087. doi:https:\n//doi.org/10.1016/j.medengphy.2014.05.012. URL https://www.sciencedirect.com/science/\narticle/pii/S1350453314001271 Gerig, User-guided 3D\nactive contour segmentation of anatomical structures:\nSignificantly improved efficiency and reliability, Neuroimage 31 (3) (2006) 1116–1128. Xu, Towards a comprehensive, efficient and promptable anatomic structure\nsegmentation model using 3d whole-body ct scans The three data subsets Wrist, Lower Leg, Shoulder were acquired at the Amsterdam UMC with a\nBrilliance 64-channel CT Scanner (Philips Healthcare, Best, The Netherlands) or a Siemens SOMATOM Force. 
The
reference segmentation masks were generated in a two-step annotation process: First, an in-house 3D annotation software
[40] was used to generate preliminary masks with a threshold-based region-growing segmentation algorithm. Then, these
preliminary masks were manually corrected and refined with ITK-SNAP [41].",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 48,
    "total_chunks": 71,
    "char_count": 1635,
    "word_count": 190,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "35205e79-027c-4dc0-b537-7c35cc91844d",
    "text": "The fourth data subset Hip is derived from the publicly reported test set of TotalSegmentator [35],
a labeled CT dataset created by the Research and Analysis Department at University Hospital Basel. Following the
official test split, we selected 11 CT scans, manually ensuring that 6 of them contained at least one hip implant. The
reference segmentation mask was generated by merging the original reference mask with a manually created annotation
in ITK-SNAP [41] of the hip implant (stem and cup together). The existing segmentation masks for the left and right
hips, as well as the left and right femurs, were left unchanged; no corrections for over- or under-segmentation were
applied.",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. 
Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 49, + "total_chunks": 71, + "char_count": 689, + "word_count": 110, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "23393e9e-3a76-4622-ace2-057eb7033e30", + "text": "To reduce the workload in the observer study, axial slices were extracted from the 3D CT volumes taking\ninto account data coverage, diversity and comparability between data subsets. To avoid slices with little to no relevant\nanatomical information, the top and bottom 10% of each object were excluded from the slice selection. By default, two\nslices per class were extracted from the remaining object volume, maintaining at least a 10-slice interval (see Figure 2). However, since the data subsets differ in their characteristics (e.g., number of classes and slices), the default setting was\nadjusted accordingly. For Wrist, a 5-slice gap was used because six classes were distributed across an average of 363\naxial slices, making a 10-slice gap too large to maintain. To ensure a comparable number of slices across datasets and\nto account for the large volume size (over 1000 slices), three slices per class were selected for Lower Leg. For Hip, only\nthe original classes were used for slice selection to ensure an equal number of slices per sample, as the two newly added\nlabels do not appear in every CT scan. Samples seen twice by annotators. A dataset-specific duplication strategy was applied. For the Wrist, Shoulder, and\nHip datasets, a balanced approach was used by selecting one of the two selected slices per class label a second time\n(i.e., 50% slices used twice). 
In contrast, all samples in the Lower Leg dataset were used a second time due to several
dataset-specific characteristics: The number of classes per slice is limited (at most two reference classes), which reduces
annotation time per sample; The majority of selected slices only contains one class, whereas slices in Wrist, Shoulder
and Hip commonly display multiple classes; The extraction of three slices per class label precludes an even duplication
split, unlike the other datasets. SAM, SAM2, Med-SAM, Med-SAM2, SAM-Med2D, ScribblePrompt, SegVol, Vista3D, MedicoSAM2D, and nnInteractive were used as described by their GitHub repositories, including the provided tutorials and example scripts for
data pre-processing (see footnote 3). MedicoSAM3D [11] has three hyperparameters for prompt propagation: the IoU threshold, the projection mode, and the box
extension factor, which controls the expansion of the box after projection. Optimal performance requires tuning these
hyperparameters for each data subset. To establish a single standardized inference protocol for our entire dataset, we
performed a grid-based hyperparameter search on four representative samples – one from each subset, the same samples
that participants from the observer study used for their training phase.",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. 
Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 50,
    "total_chunks": 71,
    "char_count": 2635,
    "word_count": 406,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "73f4cbf3-a3e4-472e-a73b-b5b3ccfe0a72",
    "text": "The search space included IoU thresholds from 0.7 to 0.9 (step 0.1), projection modes box, points, points and masks, single point, and box extensions from 0 to 0.25 (step 0.05). The final settings, selected by majority vote from all experiments, were iou_threshold = 0.7, projection = single_point, and box_extension = 0.0. The latest version of SAM-Med3D (see footnote 4) does not support sliding-window inference with built-in prompt propagation, in contrast to methods such as SegVol [15] or Vista3D [16]. In its current implementation, inference operates on independent (128,128,128) window crops, each of which requires a newly provided prompt. Because the method does
Footnote 3: SAM: commit 6fdee8f, SAM2: commit 2b90b9f, Med-SAM: commit 2b7c64c, Med-SAM2: commit 332f30d, SAM-Med2D: commit bfd2b93, ScribblePrompt: commit 182449, SegVol: commit 4ee0a47, Vista3D: commit 8bb7572, MedicoSAM: commit 9d73c29, nnInteractive: commit 47c4626
Footnote 4: SAM-Med3D: commit e8d2e0a",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. 
Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 51,
    "total_chunks": 71,
    "char_count": 922,
    "word_count": 129,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "6204780f-adf8-40cb-9ea1-419a1e1440d8",
    "text": "not implement an overlapping sliding window where prompts are automatically derived from the previously generated
mask, the user needs to provide prompts for every crop. As our use case requires fully automatic inference after the
initial prompt, this evaluation strategy cannot be applied. To perform inference with SAM-Med3D, we implemented
two alternatives without modifying the model framework: The first, naive approach is to crop a (128,128,128) window
around the initial prompt, which may fail to fully capture objects that exceed this size; the second is to resample the
entire image by resizing its longest side to 128 voxels. Although this ensures that the entire object is captured, it can
significantly distort the image and affect the performance.",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. 
Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 52, + "total_chunks": 71, + "char_count": 760, + "word_count": 117, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b6e645d0-8917-4fa2-a900-81c7adeeb13b", + "text": "MedicalSAM2 (MedSAM-2) [12] was not included in our analysis due to persistent assertion errors in the model\narchitecture code preventing successful execution5, and resolving these issues would have required extensive investigation\nbeyond the scope of this study. CT-SAM3D [42] was not included in our analysis because preliminary tests produced\nempty prediction masks. We hypothesize that the fixed 64×64×64 patch size in combination with the absence of a\nsliding-window inference or automatic prompt propagation (similar to SAM-Med3D) did not generalize well to our data. 5MedicalSAM2: commit 18b0f5b", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 53, + "total_chunks": 71, + "char_count": 602, + "word_count": 86, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5e5ed344-eba8-4df6-90b5-6a8afe0bb452", + "text": "Human prompt variation Accuracy of human prompts\nCenter point. Table C.8 collects detailed results on the median Euclidean distance (mm). Figure C.8 visualizes the\nspatial distribution of the center point deviations (∆x, ∆y) and the intra-rater consistency (∆x, ∆y) per class label. Table C.8: Euclidean distances (mm) of human center points compared to reference center points measured as median and IQR. 
(a) Total Average & Dataset Lower Leg and corresponding class labels Annotator Total Lower Leg Tibia Implant Tibia bone all 1.50 (0.7-3.0) 1.76 (1.0-1.0) 1.54 (0.7-0.7) 2.04 (1.1-1.1)\nannotator01 1.50 (0.7-0.7) 1.95 (1.0-1.0) 1.54 (0.7-0.7) 2.04 (1.1-1.1)\nannotator02 1.50 (0.7-0.7) 1.38 (0.7-0.7) 1.54 (1.0-1.0) 1.38 (0.7-0.7)\nannotator03 1.50 (0.8-0.8) 2.01 (1.0-1.0) 1.42 (0.7-0.7) 2.01 (1.1-1.1)\nannotator04 1.50 (0.7-0.7) 1.76 (1.0-1.0) 1.46 (0.5-0.5) 1.95 (1.1-1.1)\nannotator05 1.38 (0.7-0.7) 1.54 (0.7-0.7) 1.24 (0.5-0.5) 1.54 (0.7-0.7)\nannotator06 1.63 (0.8-0.8) 2.04 (1.0-1.0) 1.78 (0.8-0.8) 2.13 (1.1-1.1)\nannotator07 1.50 (0.7-0.7) 2.85 (1.1-1.1) 1.65 (1.0-1.0) 3.59 (1.7-1.7)\nannotator08 1.37 (0.7-0.7) 1.54 (1.0-1.0) 1.76 (1.0-1.0) 1.54 (1.0-1.0)\nannotator09 1.50 (0.8-0.8) 1.95 (1.1-1.1) 1.78 (1.0-1.0) 1.95 (1.1-1.1)\nannotator10 1.59 (0.7-0.7) 1.95 (1.1-1.1) 1.09 (0.7-0.7) 2.01 (1.2-1.2)\nannotator11 1.74 (1.0-1.0) 1.76 (1.1-1.1) 1.95 (1.1-1.1) 1.76 (1.1-1.1)\nannotator12 1.66 (1.0-1.0) 1.76 (1.1-1.1) 1.76 (1.1-1.1) 1.95 (1.1-1.1)\nannotator13 1.50 (0.8-0.8) 1.54 (1.0-1.0) 1.38 (0.7-0.7) 1.95 (1.1-1.1)\nannotator14 1.46 (0.7-0.7) 1.54 (1.0-1.0) 1.09 (0.7-0.7) 1.76 (1.0-1.0)\nannotator15 1.38 (0.7-0.7) 1.09 (0.7-0.7) 1.09 (0.7-0.7) 1.09 (0.7-0.7)\nannotator16 1.71 (0.9-0.9) 2.07 (1.1-1.1) 1.50 (1.1-1.1) 2.18 (1.5-1.5)\nannotator17 1.46 (0.7-0.7) 1.76 (1.0-1.0) 1.38 (0.7-0.7) 2.01 (1.1-1.1)\nannotator18 1.50 (0.7-0.7) 1.95 (1.1-1.1) 1.95 (1.1-1.1) 1.95 (1.1-1.1)\nannotator19 1.52 (0.7-0.7) 2.01 (1.0-1.0) 1.09 (0.5-0.5) 2.31 (1.1-1.1)\nannotator20 1.54 (0.8-0.8) 2.01 (1.1-1.1) 2.07 (1.0-1.0) 2.01 (1.4-1.4) (b) Dataset Shoulder and corresponding class labels Annotator Shoulder Humerus R Scapula R Humerus L Scapula L\nall 1.86 (1.0-1.0) 1.38 (1.0-1.0) 1.67 (1.0-1.0) 1.36 (0.9-0.9) 2.18 (1.2-1.2)\nannotator01 1.38 (1.0-1.0) 1.38 (1.0-1.0) 1.67 (1.0-1.0) 1.36 (0.9-0.9) 2.18 (1.2-1.2)\nannotator02 1.91 (1.0-1.0) 1.29 (1.0-1.0) 2.17 (1.2-1.2) 1.56 (1.0-1.0) 
2.04 (1.2-1.2)\nannotator03 1.94 (1.2-1.2) 1.89 (1.0-1.0) 2.18 (1.6-1.6) 1.38 (1.0-1.0) 2.18 (1.2-1.2)\nannotator04 1.91 (1.0-1.0) 1.69 (1.0-1.0) 2.50 (1.4-1.4) 1.22 (0.9-0.9) 2.46 (1.4-1.4)\nannotator05 1.38 (1.0-1.0) 1.21 (1.0-1.0) 1.86 (1.0-1.0) 1.38 (1.0-1.0) 1.38 (1.0-1.0)\nannotator06 1.91 (1.0-1.0) 1.38 (1.0-1.0) 2.76 (1.5-1.5) 1.29 (0.9-0.9) 3.08 (1.3-1.3)\nannotator07 1.66 (1.0-1.0) 1.69 (1.0-1.0) 1.91 (1.2-1.2) 1.36 (1.0-1.0) 1.86 (1.0-1.0)\nannotator08 1.37 (1.0-1.0) 1.28 (1.0-1.0) 1.89 (1.0-1.0) 1.18 (0.9-0.9) 1.38 (1.0-1.0)\nannotator09 1.66 (1.0-1.0) 1.38 (1.0-1.0) 2.18 (1.0-1.0) 1.18 (1.0-1.0) 1.94 (1.3-1.3)\nannotator10 1.94 (1.0-1.0) 0.98 (0.8-0.8) 4.03 (2.7-2.7) 0.98 (0.9-0.9) 4.03 (1.9-1.9)\nannotator11 1.95 (1.2-1.2) 1.86 (1.4-1.4) 2.18 (1.9-1.9) 1.29 (1.0-1.0) 2.36 (1.4-1.4)\nannotator12 1.94 (1.2-1.2) 1.89 (1.2-1.2) 2.18 (1.7-1.7) 1.52 (1.0-1.0) 1.95 (1.4-1.4)\nannotator13 1.86 (1.0-1.0) 1.37 (1.0-1.0) 1.95 (1.2-1.2) 1.38 (1.0-1.0) 2.30 (1.4-1.4)\nannotator14 1.86 (1.2-1.2) 1.69 (1.0-1.0) 2.18 (1.7-1.7) 1.37 (1.0-1.0) 2.18 (1.2-1.2)\nannotator15 1.38 (1.0-1.0) 1.21 (1.0-1.0) 1.95 (1.2-1.2) 1.19 (1.0-1.0) 1.94 (1.0-1.0)\nannotator16 1.95 (1.2-1.2) 1.94 (1.2-1.2) 2.53 (1.7-1.7) 1.30 (1.0-1.0) 2.27 (1.9-1.9)\nannotator17 1.38 (1.0-1.0) 1.23 (1.0-1.0) 1.95 (1.2-1.2) 1.18 (1.0-1.0) 1.94 (1.2-1.2)\nannotator18 1.38 (1.0-1.0) 1.26 (1.0-1.0) 1.94 (1.1-1.1) 1.19 (1.0-1.0) 1.94 (1.4-1.4)\nannotator19 1.94 (1.2-1.2) 1.38 (1.0-1.0) 2.36 (1.4-1.4) 1.22 (1.0-1.0) 3.44 (1.9-1.9)\nannotator20 1.91 (1.0-1.0) 1.69 (1.0-1.0) 2.06 (1.4-1.4) 1.38 (1.0-1.0) 2.17 (1.4-1.4) (c) Dataset Wrist and corresponding class labels\nAnnotator Wrist Capitate Lunate Radius Scaphoid Triquetrum Ulna\nall 0.73 (0.5-0.5) 0.65 (0.5-0.5) 0.95 (0.7-0.7) 0.65 (0.3-0.3) 0.73 (0.6-0.6) 0.73 (0.5-0.5) 0.46 (0.3-0.3)\nannotator01 0.73 (0.5-0.5) 0.65 (0.5-0.5) 0.95 (0.7-0.7) 0.65 (0.3-0.3) 0.73 (0.6-0.6) 0.73 (0.5-0.5) 0.46 (0.3-0.3)\nannotator02 0.73 (0.5-0.5) 0.65 (0.3-0.3) 1.17 (0.7-0.7) 
0.73 (0.5-0.5) 0.73 (0.6-0.6) 0.65 (0.3-0.3) 0.65 (0.4-0.4)\nannotator03 0.73 (0.5-0.5) 0.65 (0.5-0.5) 1.46 (0.7-0.7) 0.92 (0.5-0.5) 0.73 (0.5-0.5) 0.69 (0.5-0.5) 0.46 (0.3-0.3)\nannotator04 0.73 (0.5-0.5) 0.65 (0.4-0.4) 1.17 (0.7-0.7) 0.73 (0.3-0.3) 0.92 (0.7-0.7) 0.73 (0.5-0.5) 0.46 (0.3-0.3)\nannotator05 0.65 (0.3-0.3) 0.65 (0.3-0.3) 0.73 (0.5-0.5) 0.65 (0.4-0.4) 0.65 (0.3-0.3) 0.46 (0.3-0.3) 0.46 (0.4-0.4)\nannotator06 0.73 (0.5-0.5) 0.73 (0.3-0.3) 1.03 (0.7-0.7) 0.73 (0.3-0.3) 0.73 (0.7-0.7) 0.73 (0.5-0.5) 0.65 (0.3-0.3)\nannotator07 0.73 (0.5-0.5) 0.73 (0.5-0.5) 0.92 (0.7-0.7) 0.65 (0.3-0.3) 0.98 (0.7-0.7) 0.69 (0.5-0.5) 0.65 (0.5-0.5)\nannotator08 0.73 (0.5-0.5) 0.73 (0.5-0.5) 0.73 (0.5-0.5) 0.65 (0.3-0.3) 0.73 (0.5-0.5) 0.73 (0.5-0.5) 0.65 (0.5-0.5)\nannotator09 0.73 (0.5-0.5) 0.69 (0.3-0.3) 1.03 (0.7-0.7) 0.73 (0.7-0.7) 0.92 (0.7-0.7) 0.73 (0.5-0.5) 0.46 (0.3-0.3)\nannotator10 0.73 (0.5-0.5) 0.65 (0.3-0.3) 1.03 (0.5-0.5) 0.69 (0.4-0.4) 0.98 (0.7-0.7) 0.46 (0.3-0.3) 0.46 (0.4-0.4)\nannotator11 0.73 (0.5-0.5) 0.73 (0.3-0.3) 1.17 (0.7-0.7) 0.73 (0.5-0.5) 0.95 (0.7-0.7) 0.92 (0.5-0.5) 0.69 (0.4-0.4)\nannotator12 0.73 (0.5-0.5) 0.73 (0.5-0.5) 1.17 (0.7-0.7) 0.82 (0.5-0.5) 0.92 (0.7-0.7) 0.73 (0.5-0.5) 0.73 (0.5-0.5)\nannotator13 0.73 (0.5-0.5) 0.65 (0.3-0.3) 0.92 (0.7-0.7) 0.65 (0.4-0.4) 0.73 (0.7-0.7) 0.65 (0.4-0.4) 0.46 (0.3-0.3)\nannotator14 0.73 (0.3-0.3) 0.73 (0.3-0.3) 0.92 (0.5-0.5) 0.65 (0.3-0.3) 0.73 (0.5-0.5) 0.65 (0.5-0.5) 0.46 (0.3-0.3)\nannotator15 0.73 (0.5-0.5) 0.46 (0.3-0.3) 1.26 (0.7-0.7) 0.69 (0.5-0.5) 0.73 (0.5-0.5) 0.65 (0.3-0.3) 0.73 (0.5-0.5)\nannotator16 0.73 (0.5-0.5) 0.73 (0.5-0.5) 0.98 (0.7-0.7) 0.73 (0.5-0.5) 0.98 (0.7-0.7) 0.65 (0.3-0.3) 0.65 (0.5-0.5)\nannotator17 0.65 (0.5-0.5) 0.65 (0.3-0.3) 1.00 (0.6-0.6) 0.65 (0.3-0.3) 0.73 (0.5-0.5) 0.73 (0.4-0.4) 0.65 (0.5-0.5)\nannotator18 0.73 (0.5-0.5) 0.65 (0.3-0.3) 0.92 (0.5-0.5) 0.65 (0.3-0.3) 0.73 (0.5-0.5) 0.73 (0.5-0.5) 0.73 (0.4-0.4)\nannotator19 0.65 (0.4-0.4) 0.46 (0.3-0.3) 1.10 
(0.5-0.5) 0.65 (0.5-0.5) 0.98 (0.7-0.7) 0.46 (0.3-0.3) 0.46 (0.3-0.3)\nannotator20 0.73 (0.5-0.5) 0.73 (0.3-0.3) 1.17 (0.7-0.7) 0.46 (0.5-0.5) 0.98 (0.7-0.7) 0.73 (0.5-0.5) 0.56 (0.3-0.3) (d) Dataset Hip and corresponding class labels\nAnnotator Hip Femur L Femur R Hip L Hip R Femur Implant L Femur Implant R\nall 3.35 (2.1-2.1) 3.00 (2.1-2.1) 3.35 (2.1-2.1) 3.35 (2.1-2.1) 3.00 (2.1-2.1) 1.50 (1.5-1.5) 2.12 (1.5-1.5)\nannotator01 3.00 (2.1-2.1) 3.00 (2.1-2.1) 3.35 (2.1-2.1) 3.35 (2.1-2.1) 3.00 (2.1-2.1) 1.50 (1.5-1.5) 2.12 (1.5-1.5)\nannotator02 3.35 (2.1-2.1) 3.18 (2.1-2.1) 5.41 (3.0-3.0) 4.74 (3.3-3.3) 4.37 (3.0-3.0) 2.12 (1.5-1.5) 2.12 (1.5-1.5)\nannotator03 3.35 (2.1-2.1) 3.35 (1.7-1.7) 4.50 (2.1-2.1) 4.24 (2.1-2.1) 4.74 (3.0-3.0) 1.50 (1.5-1.5) 2.12 (1.5-1.5)\nannotator04 3.35 (2.1-2.1) 3.00 (2.1-2.1) 4.24 (2.1-2.1) 5.41 (3.0-3.0) 4.74 (3.0-3.0) 2.12 (1.5-1.5) 2.12 (1.5-1.5)\nannotator05 3.00 (1.5-1.5) 2.12 (2.1-2.1) 3.18 (1.5-1.5) 3.35 (2.1-2.1) 3.00 (1.5-1.5) 1.50 (0.0-0.0) 2.12 (1.5-1.5)\nannotator06 3.35 (2.1-2.1) 3.00 (1.5-1.5) 4.74 (3.4-3.4) 4.37 (2.1-2.1) 3.35 (2.1-2.1) 2.12 (1.5-1.5) 2.12 (1.5-1.5)\nannotator07 3.18 (1.5-1.5) 3.00 (1.5-1.5) 4.50 (2.1-2.1) 3.35 (2.1-2.1) 3.35 (1.5-1.5) 1.50 (1.5-1.5) 2.12 (1.5-1.5)\nannotator08 3.00 (1.5-1.5) 2.12 (1.5-1.5) 3.35 (3.0-3.0) 3.35 (2.1-2.1) 2.12 (1.5-1.5) 2.12 (2.1-2.1) 2.12 (1.5-1.5)\nannotator09 3.35 (2.1-2.1) 3.35 (2.1-2.1) 4.74 (3.3-3.3) 4.24 (3.0-3.0) 4.24 (2.1-2.1) 1.50 (1.5-1.5) 3.35 (2.1-2.1)\nannotator10 4.50 (2.1-2.1) 2.12 (1.5-1.5) 4.50 (3.0-3.0) 6.71 (3.4-3.4) 7.50 (3.4-3.4) 1.50 (1.5-1.5) 2.12 (1.5-1.5)\nannotator11 3.35 (2.1-2.1) 3.35 (1.5-1.5) 4.74 (3.4-3.4) 4.74 (3.3-3.3) 3.35 (2.1-2.1) 1.50 (0.0-0.0) 2.12 (1.5-1.5)\nannotator12 3.35 (2.1-2.1) 3.35 (2.1-2.1) 5.41 (3.4-3.4) 3.35 (2.1-2.1) 4.24 (2.3-2.3) 1.50 (1.5-1.5) 1.50 (1.5-1.5)\nannotator13 3.35 (2.1-2.1) 2.12 (2.1-2.1) 4.24 (2.1-2.1) 4.24 (3.0-3.0) 3.35 (2.1-2.1) 1.50 (1.5-1.5) 2.12 (1.5-1.5)\nannotator14 3.35 (2.1-2.1) 3.00 (1.5-1.5) 
6.35 (3.0-3.0) 3.35 (2.1-2.1) 3.35 (2.1-2.1) 1.50 (1.5-1.5) 2.12 (1.5-1.5)\nannotator15 3.35 (1.5-1.5) 2.12 (1.5-1.5) 4.95 (1.5-1.5) 4.74 (3.4-3.4) 4.50 (2.1-2.1) 1.50 (0.0-0.0) 1.50 (1.5-1.5)\nannotator16 3.35 (2.1-2.1) 2.12 (1.5-1.5) 4.24 (2.1-2.1) 5.41 (3.4-3.4) 4.50 (2.8-2.8) 2.12 (1.5-1.5) 2.12 (1.5-1.5)\nannotator17 3.00 (1.5-1.5) 2.12 (1.5-1.5) 3.35 (2.1-2.1) 3.35 (1.5-1.5) 3.00 (2.1-2.1) 2.12 (1.5-1.5) 2.12 (1.5-1.5)\nannotator18 3.35 (1.5-1.5) 2.12 (1.5-1.5) 4.24 (2.1-2.1) 3.35 (2.1-2.1) 4.50 (2.1-2.1) 1.50 (1.5-1.5) 2.12 (1.5-1.5)\nannotator19 3.35 (2.1-2.1) 3.18 (1.5-1.5) 4.24 (3.0-3.0) 5.41 (3.0-3.0) 4.74 (2.1-2.1) 1.50 (1.5-1.5) 1.50 (1.5-1.5)\nannotator20 3.35 (2.1-2.1) 2.74 (2.1-2.1) 3.35 (3.0-3.0) 4.24 (2.1-2.1) 3.35 (2.1-2.1) 1.50 (1.5-1.5) 2.12 (1.5-1.5) Figure C.8: Spatial distribution of mean ∆x and ∆y per annotator per class label. The same-colored (more transparent) ellipse represents\neach annotator's intra-rater consistency (∆x, ∆y).", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 54, + "total_chunks": 71, + "char_count": 9278, + "word_count": 1211, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e288d80f-53b2-47b8-a8a4-bf213436e027", + "text": "(a) Wrist (b) Lower Leg Figure C.9: Examples for center point annotations: Center points with low euclidean distance (mm) (top row) and high values (bottom row)\nper data subset. 
Black dots are automatically extracted reference annotation, annotators' annotations are color-encoded.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 55, + "total_chunks": 71, + "char_count": 281, + "word_count": 40, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fcfd172c-c2ee-4adb-ba5d-e261792cc011", + "text": "Table C.9 collects detailed results on the median IoU (%). Figure C.10 visualizes the spatial distribution\nof the bounding boxes' center point deviations (∆x, ∆y) and the intra-rater consistency (∆w, ∆h) per class labels. Table C.9: IoU (%) of human bounding boxes compared to reference bounding boxes measured as median and IQR. 
(a) Total Average & Dataset Lower Leg and corresponding class labels Annotator Total Lower Leg Tibia Implant Tibia bone\nall 90.56 (83.4-94.5) 90.02 (80.3-80.3) 84.82 (79.1-79.1) 92.92 (89.5-89.5)\nannotator01 91.22 (85.5-85.5) 90.10 (86.5-86.5) 84.82 (79.1-79.1) 92.92 (89.5-89.5)\nannotator02 91.22 (85.9-85.9) 91.68 (81.2-81.2) 75.00 (68.5-68.5) 93.33 (91.8-91.8)\nannotator03 84.27 (73.8-73.8) 79.96 (61.2-61.2) 52.91 (45.4-45.4) 84.06 (79.6-79.6)\nannotator04 93.32 (89.0-89.0) 93.66 (90.2-90.2) 89.29 (80.9-80.9) 94.83 (91.9-91.9)\nannotator05 86.99 (79.1-79.1) 84.33 (77.0-77.0) 72.63 (56.3-56.3) 87.80 (83.5-83.5)\nannotator06 92.62 (86.8-86.8) 93.48 (88.6-88.6) 86.89 (78.8-78.8) 95.29 (92.8-92.8)\nannotator07 92.87 (87.2-87.2) 91.12 (70.7-70.7) 63.89 (49.8-49.8) 93.86 (91.4-91.4)\nannotator08 93.15 (88.5-88.5) 93.33 (88.7-88.7) 84.67 (75.5-75.5) 95.64 (92.8-92.8)\nannotator09 90.69 (85.0-85.0) 92.03 (81.5-81.5) 76.38 (66.4-66.4) 94.60 (92.2-92.2)\nannotator10 93.29 (88.2-88.2) 92.93 (87.2-87.2) 81.59 (70.0-70.0) 94.70 (91.7-91.7)\nannotator11 85.43 (76.0-76.0) 82.85 (70.2-70.2) 64.10 (50.4-50.4) 88.11 (82.5-82.5)\nannotator12 88.42 (80.1-80.1) 84.96 (78.3-78.3) 69.27 (56.5-56.5) 88.89 (84.1-84.1)\nannotator13 90.32 (83.9-83.9) 89.41 (83.5-83.5) 77.66 (68.4-68.4) 93.04 (88.5-88.5)\nannotator14 90.75 (85.1-85.1) 91.66 (82.5-82.5) 74.05 (68.3-68.3) 93.79 (91.8-91.8)\nannotator15 92.22 (87.2-87.2) 91.88 (84.7-84.7) 79.62 (74.5-74.5) 93.87 (91.5-91.5)\nannotator16 79.13 (70.2-70.2) 75.81 (65.1-65.1) 55.88 (48.0-48.0) 82.70 (75.5-75.5)\nannotator17 91.67 (85.3-85.3) 90.69 (83.8-83.8) 81.59 (72.3-72.3) 93.48 (90.6-90.6)\nannotator18 90.61 (84.6-84.6) 89.11 (80.7-80.7) 79.69 (66.1-66.1) 92.30 (89.0-89.0)\nannotator19 91.54 (86.7-86.7) 90.61 (84.3-84.3) 80.00 (72.4-72.4) 93.81 (89.8-89.8)\nannotator20 91.38 (85.4-85.4) 91.55 (80.8-80.8) 72.59 (67.7-67.7) 94.65 (91.6-91.6) (b) Dataset Shoulder and corresponding class labels Annotator Shoulder Humerus R Scapula R Humerus L 
Scapula L\nall 87.82 (80.5-80.5) 84.63 (81.0-81.0) 93.14 (89.4-89.4) 84.44 (80.5-80.5) 92.27 (89.7-89.7)\nannotator01 88.74 (82.5-82.5) 84.63 (81.0-81.0) 93.14 (89.4-89.4) 84.44 (80.5-80.5) 92.27 (89.7-89.7)\nannotator02 90.62 (85.4-85.4) 87.53 (83.4-83.4) 92.80 (89.1-89.1) 87.00 (83.6-83.6) 92.64 (88.5-88.5)\nannotator03 75.09 (65.1-65.1) 67.35 (62.4-62.4) 86.25 (79.1-79.1) 68.78 (62.0-62.0) 80.45 (73.4-73.4)\nannotator04 91.13 (87.1-87.1) 89.32 (84.9-84.9) 92.58 (90.3-90.3) 89.24 (86.0-86.0) 91.47 (89.2-89.2)\nannotator05 80.64 (71.1-71.1) 71.50 (66.9-66.9) 85.92 (82.5-82.5) 72.87 (67.3-67.3) 85.68 (79.7-79.7)\nannotator06 89.29 (82.8-82.8) 84.15 (77.7-77.7) 93.30 (91.0-91.0) 84.18 (79.9-79.9) 91.21 (86.9-86.9)\nannotator07 91.27 (86.1-86.1) 88.10 (83.3-83.3) 93.94 (91.3-91.3) 89.26 (84.7-84.7) 92.59 (89.4-89.4)\nannotator08 92.61 (88.0-88.0) 89.61 (85.7-85.7) 94.15 (90.4-90.4) 92.05 (87.0-87.0) 94.16 (90.4-90.4)\nannotator09 89.38 (84.0-84.0) 86.57 (80.7-80.7) 93.11 (89.0-89.0) 85.42 (80.0-80.0) 91.04 (87.0-87.0)\nannotator10 92.00 (86.2-86.2) 89.51 (83.9-83.9) 93.96 (91.3-91.3) 87.87 (83.8-83.8) 93.78 (91.0-91.0)\nannotator11 78.22 (70.5-70.5) 72.47 (67.4-67.4) 82.97 (77.1-77.1) 73.86 (69.5-69.5) 81.83 (75.2-75.2)\nannotator12 80.95 (72.6-72.6) 75.11 (69.8-69.8) 86.13 (80.2-80.2) 77.74 (71.5-71.5) 83.58 (77.5-77.5)\nannotator13 86.67 (80.7-80.7) 81.92 (76.7-76.7) 90.38 (86.0-86.0) 83.61 (78.8-78.8) 90.13 (87.3-87.3)\nannotator14 88.70 (83.5-83.5) 85.95 (81.6-81.6) 90.95 (87.4-87.4) 85.93 (79.6-79.6) 90.79 (87.6-87.6)\nannotator15 92.80 (88.0-88.0) 90.21 (85.5-85.5) 95.16 (92.4-92.4) 91.00 (85.5-85.5) 93.26 (90.2-90.2)\nannotator16 73.83 (66.2-66.2) 72.41 (66.0-66.0) 75.25 (68.4-68.4) 71.87 (63.4-63.4) 75.02 (67.1-67.1)\nannotator17 87.25 (80.4-80.4) 81.88 (77.6-77.6) 90.26 (87.4-87.4) 82.92 (76.1-76.1) 90.24 (85.8-85.8)\nannotator18 88.21 (81.5-81.5) 84.33 (79.6-79.6) 91.81 (88.4-88.4) 82.84 (77.6-77.6) 90.00 (85.9-85.9)\nannotator19 89.29 (83.7-83.7) 87.85 
(81.8-81.8) 91.36 (87.8-87.8) 88.04 (82.6-82.6) 89.84 (86.5-86.5)\nannotator20 90.18 (85.2-85.2) 86.46 (83.0-83.0) 93.44 (89.2-89.2) 89.54 (84.2-84.2) 92.17 (86.4-86.4) (c) Dataset Wrist and corresponding class labels\nAnnotator Wrist Capitate Lunate Radius Scaphoid Triquetrum Ulna\nall 92.21 (88.1-88.1) 92.80 (89.2-89.2) 94.09 (91.1-91.1) 94.61 (90.4-90.4) 94.67 (92.1-92.1) 93.00 (89.9-89.9) 91.80 (88.7-88.7)\nannotator01 93.77 (89.9-89.9) 92.80 (89.2-89.2) 94.09 (91.1-91.1) 94.61 (90.4-90.4) 94.67 (92.1-92.1) 93.00 (89.9-89.9) 91.80 (88.7-88.7)\nannotator02 91.17 (87.8-87.8) 91.04 (87.6-87.6) 91.07 (86.4-86.4) 92.35 (88.7-88.7) 91.58 (89.3-89.3) 89.56 (87.6-87.6) 90.31 (88.3-88.3)\nannotator03 90.64 (85.6-85.6) 91.49 (89.0-89.0) 89.29 (84.9-84.9) 91.80 (85.7-85.7) 92.08 (87.5-87.5) 88.67 (82.0-82.0) 87.09 (83.4-83.4)\nannotator04 95.03 (92.6-92.6) 94.72 (92.6-92.6) 95.59 (91.1-91.1) 95.99 (92.3-92.3) 95.66 (93.5-93.5) 94.59 (92.3-92.3) 95.00 (92.6-92.6)\nannotator05 89.20 (85.8-85.8) 89.30 (85.5-85.5) 89.30 (85.2-85.2) 90.57 (86.7-86.7) 89.47 (86.8-86.8) 88.46 (84.8-84.8) 87.96 (83.9-83.9)\nannotator06 94.30 (91.2-91.2) 94.52 (90.7-90.7) 94.58 (90.5-90.5) 94.87 (92.2-92.2) 94.29 (92.3-92.3) 92.88 (90.6-90.6) 94.23 (91.0-91.0)\nannotator07 94.60 (91.8-91.8) 95.28 (93.8-93.8) 94.72 (91.5-91.5) 94.10 (90.6-90.6) 94.88 (93.2-93.2) 94.44 (91.1-91.1) 93.44 (89.6-89.6)\nannotator08 94.59 (91.8-91.8) 95.25 (92.5-92.5) 94.58 (91.2-91.2) 95.24 (93.2-93.2) 94.69 (91.5-91.5) 94.14 (90.5-90.5) 94.37 (92.0-92.0)\nannotator09 91.20 (87.1-87.1) 91.07 (87.4-87.4) 91.44 (86.9-86.9) 91.33 (87.8-87.8) 91.53 (88.1-88.1) 90.84 (86.8-86.8) 90.47 (86.2-86.2)\nannotator10 94.59 (91.3-91.3) 94.29 (91.9-91.9) 94.62 (90.5-90.5) 95.11 (91.7-91.7) 95.18 (92.6-92.6) 94.08 (91.3-91.3) 93.01 (89.2-89.2)\nannotator11 88.35 (83.3-83.3) 87.56 (83.9-83.9) 90.32 (85.6-85.6) 86.98 (80.7-80.7) 89.45 (85.8-85.8) 87.77 (83.9-83.9) 83.17 (79.3-79.3)\nannotator12 92.01 (88.9-88.9) 92.01 (89.6-89.6) 92.24 
(89.0-89.0) 92.86 (86.5-86.5) 92.63 (89.6-89.6) 91.54 (88.3-88.3) 89.80 (86.2-86.2)\nannotator13 91.41 (88.8-88.8) 91.15 (88.9-88.9) 91.78 (90.6-90.6) 92.82 (90.1-90.1) 91.42 (90.0-90.0) 90.24 (87.6-87.6) 90.53 (85.3-85.3)\nannotator14 91.23 (87.8-87.8) 90.50 (88.0-88.0) 90.97 (88.0-88.0) 93.24 (90.9-90.9) 91.25 (89.0-89.0) 89.03 (83.9-83.9) 92.11 (88.5-88.5)\nannotator15 92.96 (90.0-90.0) 92.72 (89.7-89.7) 94.12 (89.2-89.2) 94.11 (90.5-90.5) 94.01 (92.0-92.0) 91.93 (89.5-89.5) 92.08 (88.3-88.3)\nannotator16 80.70 (75.7-75.7) 81.46 (77.1-77.1) 80.90 (76.5-76.5) 79.76 (74.9-74.9) 82.49 (77.6-77.6) 79.99 (75.1-75.1) 76.81 (72.7-72.7)\nannotator17 93.54 (90.7-90.7) 93.75 (90.7-90.7) 92.92 (90.6-90.6) 94.49 (91.5-91.5) 93.47 (91.8-91.8) 92.67 (89.9-89.9) 93.41 (90.6-90.6)\nannotator18 92.22 (88.7-88.7) 91.02 (88.4-88.4) 93.06 (89.2-89.2) 92.87 (88.6-88.6) 92.41 (90.3-90.3) 91.67 (88.6-88.6) 92.26 (86.3-86.3)\nannotator19 93.14 (90.7-90.7) 92.63 (90.4-90.4) 93.72 (91.0-91.0) 93.96 (90.3-90.3) 93.73 (92.0-92.0) 93.12 (90.6-90.6) 91.97 (89.4-89.4)\nannotator20 92.42 (89.0-89.0) 92.88 (90.9-90.9) 92.11 (85.2-85.2) 94.83 (90.7-90.7) 92.42 (90.6-90.6) 91.68 (86.5-86.5) 91.14 (86.0-86.0) (d) Dataset Hip and corresponding class labels\nAnnotator Hip Femur L Femur R Hip L Hip R Femur Implant L Femur Implant R\nall 90.69 (82.1-82.1) 91.59 (88.7-88.7) 90.81 (88.7-88.7) 90.91 (87.6-87.6) 91.66 (87.4-87.4) 67.11 (60.2-60.2) 65.98 (55.8-55.8)\nannotator01 90.19 (81.0-81.0) 91.59 (88.7-88.7) 90.81 (88.7-88.7) 90.91 (87.6-87.6) 91.66 (87.4-87.4) 67.11 (60.2-60.2) 65.98 (55.8-55.8)\nannotator02 91.34 (83.8-83.8) 91.46 (87.9-87.9) 91.04 (85.0-85.0) 93.29 (87.7-87.7) 93.27 (86.7-86.7) 75.00 (70.6-70.6) 77.67 (70.4-70.4)\nannotator03 85.58 (74.7-74.7) 85.71 (81.0-81.0) 84.08 (80.3-80.3) 89.74 (84.8-84.8) 90.03 (83.3-83.3) 60.71 (52.4-52.4) 55.84 (49.1-49.1)\nannotator04 92.00 (85.0-85.0) 91.58 (88.4-88.4) 90.84 (86.3-86.3) 95.00 (89.3-89.3) 93.75 (89.2-89.2) 83.22 (76.7-76.7) 83.33 
(77.4-77.4)\nannotator05 89.51 (81.1-81.1) 89.86 (85.9-85.9) 88.89 (84.2-84.2) 91.28 (86.5-86.5) 91.30 (87.0-87.0) 65.24 (59.4-59.4) 65.38 (59.4-59.4)\nannotator06 92.68 (85.9-85.9) 91.30 (86.8-86.8) 92.50 (88.9-88.9) 95.96 (89.7-89.7) 93.41 (89.5-89.5) 74.56 (70.8-70.8) 77.67 (70.2-70.2)\nannotator07 92.42 (84.9-84.9) 94.19 (89.2-89.2) 92.11 (88.7-88.7) 93.42 (89.3-89.3) 92.92 (86.5-86.5) 74.30 (72.9-72.9) 70.64 (62.7-62.7)\nannotator08 90.88 (83.3-83.3) 90.31 (84.0-84.0) 90.87 (84.5-84.5) 92.82 (87.1-87.1) 92.60 (87.3-87.3) 77.73 (74.6-74.6) 75.45 (65.7-65.7)\nannotator09 90.19 (81.4-81.4) 91.44 (88.8-88.8) 90.19 (86.4-86.4) 91.78 (83.7-83.7) 90.16 (83.1-83.1) 83.04 (77.1-77.1) 76.60 (69.3-69.3)\nannotator10 92.86 (85.5-85.5) 93.66 (89.3-89.3) 93.33 (88.7-88.7) 94.12 (89.1-89.1) 94.35 (90.7-90.7) 77.78 (75.0-75.0) 77.24 (70.3-70.3)\nannotator11 89.74 (79.6-79.6) 90.82 (86.7-86.7) 90.91 (86.6-86.6) 91.41 (86.4-86.4) 90.51 (84.0-84.0) 62.63 (60.7-60.7) 60.44 (55.5-55.5)\nannotator12 90.73 (82.9-82.9) 91.89 (85.5-85.5) 90.18 (87.0-87.0) 91.43 (87.4-87.4) 92.63 (86.0-86.0) 75.00 (70.0-70.0) 69.35 (61.7-61.7)\nannotator13 91.87 (83.5-83.5) 91.11 (87.1-87.1) 91.67 (85.5-85.5) 92.86 (87.4-87.4) 93.79 (88.1-88.1) 75.00 (70.2-70.2) 76.39 (59.2-59.2)\nannotator14 91.49 (83.9-83.9) 92.12 (86.6-86.6) 91.54 (88.7-88.7) 93.12 (88.5-88.5) 94.35 (89.1-89.1) 77.73 (66.1-66.1) 77.24 (64.8-64.8)\nannotator15 90.10 (82.4-82.4) 89.31 (82.5-82.5) 89.02 (82.9-82.9) 90.64 (85.8-85.8) 92.51 (88.1-88.1) 81.82 (74.1-74.1) 79.00 (71.2-71.2)\nannotator16 82.36 (71.0-71.0) 86.49 (80.2-80.2) 82.45 (76.2-76.2) 85.87 (79.6-79.6) 82.44 (75.6-75.6) 62.13 (47.6-47.6) 53.61 (42.7-42.7)\nannotator17 92.38 (84.9-84.9) 92.58 (89.8-89.8) 91.55 (89.7-89.7) 94.18 (89.8-89.8) 94.29 (89.5-89.5) 77.78 (74.6-74.6) 77.24 (71.2-71.2)\nannotator18 91.55 (83.4-83.4) 94.59 (88.7-88.7) 90.42 (86.3-86.3) 93.74 (88.3-88.3) 93.67 (86.8-86.8) 81.25 (74.6-74.6) 75.76 (72.6-72.6)\nannotator19 91.56 (85.4-85.4) 91.82 
(87.4-87.4) 90.27 (86.7-86.7) 93.30 (89.5-89.5) 93.94 (87.9-87.9) 81.48 (70.8-70.8) 75.93 (70.5-70.5)\nannotator20 90.48 (82.5-82.5) 90.48 (84.2-84.2) 90.90 (86.5-86.5) 92.18 (84.8-84.8) 92.11 (88.2-88.2) 77.06 (74.6-74.6) 76.92 (71.7-71.7) Figure C.10: Spatial distribution of the mean ∆x and ∆y per annotator per class label. The same-colored (more transparent) rectangle\nrepresents each annotator's intra-rater consistency (∆w, ∆h). (a) Wrist (b) Lower Leg", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 56, + "total_chunks": 71, + "char_count": 10682, + "word_count": 1208, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "19c77f47-9552-468c-bc07-c33880ddd3fe", + "text": "Figure C.11: Examples for bounding box annotations: Boxes with high IoU (%) (top row) and low values (bottom row) per data subset. Black dots are automatically extracted reference annotation, annotators' annotations are color-encoded. Inter-rater annotation consistency\nTable C.10 shows the inter-rater variability ranking, starting with the annotator with the lowest variability to all other\nannotators. This ranking is used for the iterative search to determine the threshold of model sensitivity to inter-rater\nvariability. The rows highlighted in bold have been tested in the iterative search.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. 
Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 57, + "total_chunks": 71, + "char_count": 597, + "word_count": 85, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "04eccdb7-90a8-45e1-aa1a-059df0bde48f", + "text": "Table C.10: Ranking of inter-rater variability, measured by averaged euclidean distance (mm), per annotator, starting with the lowest\nvariability. The euclidean distance (mm) is averaged for all comparisons of one annotator to all other annotators. For the combination prompt, the\neuclidean distance, averaged from center point and bounding box analysis, is used for the ranking, because it considers both prompts. Annotators\nhighlighted in bold have been used in the iterative search approach. (a) Center Point (b) Combination Annotator Eucl. distance (mm) Annotator Eucl. distance (mm) IoU (%) annotator15 2.51±5.5 annotator02 1.67±2.8 87.32±7.9 annotator02 2.54±5.9 annotator15 1.68±2.8 88.44±8.3 annotator20 2.68±6.8 annotator05 1.85±3.2 87.58±8.9 annotator01 2.69±6.8 annotator14 1.86±3.2 87.76±8.0 annotator14 2.70±6.8 annotator20 1.86±3.2 87.81±8.1 annotator05 2.73±6.8 annotator01 1.87±3.3 89.08±9.6 annotator17 2.83±7.2 annotator17 1.94±3.4 89.07±8.8 annotator18 2.88±6.9 annotator04 1.94±3.2 89.25±8.2 annotator04 2.93±6.9 annotator06 1.94±3.2 89.35±8.1", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. 
Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 58, + "total_chunks": 71, + "char_count": 1063, + "word_count": 132, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0b0901bd-a29f-489d-9854-f46fa0dcbdec", + "text": "annotator06 2.94±6.9 annotator18 1.95±3.2 88.64±9.4 annotator08 2.95±7.2 annotator09 1.96±3.1 87.95±8.1 annotator19 2.99±6.6 annotator08 2.00±3.4 88.15±8.8 annotator12 3.01±6.7 annotator19 2.00±3.1 88.66±8.8 annotator03 3.05±7.4 annotator12 2.05±3.2 87.67±7.8 annotator11 3.05±7.0 annotator10 2.09±3.4 89.41±8.6 annotator09 3.06±8.2 annotator11 2.09±3.3 86.73±7.8 annotator16 3.12±7.4 annotator03 2.17±3.4 84.13±9.6 annotator13 3.27±8.5 annotator13 2.20±4.2 89.03±7.6 annotator10 3.27±6.9 annotator16 2.24±3.6 81.06±9.1 annotator07 3.63±7.7 annotator07 2.40±3.6 87.22±11.6 Segmentation performance with reference prompts Table D.11 reports the segmentation performance of all 2D and 3D models, with the selected models (i.e., smallest\nPareto-optimal models) highlighted as gray-shaded cells. This table is an extension of Table 4, where the Pareto-optimal\nmodels per category and prompt type are summarized. The axial slices with the lowest average DSC values (i.e., negative\nexamples) across all 2D models are shown in Figures D.12 - D.15.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. 
Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 59, + "total_chunks": 71, + "char_count": 1040, + "word_count": 126, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "476ad65b-2601-4741-a6aa-14ac8b7d0b53", + "text": "Table D.11: Segmentation performance of all 2D and 3D models per prompt type. Gray-shaded cells indicate the smallest 2D and 3D Pareto-optimal models per prompt type. Omitted results (-) mean that the experiment was not\nperformed, since it was not supported (see Table 1).", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 60, + "total_chunks": 71, + "char_count": 272, + "word_count": 44, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0a25aa8c-cbc0-4a31-a7ed-045019c6b6ad", + "text": "Model Bounding Box 2D or 3D Center Point (2D) or (3D) Combination (2D) or (3D) Size DSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ (M) (%) (%) (mm) (%) (%) (mm) (%) (%) (mm) Med-SAM 94 66.89±14.7 79.47±11.2 4.59±2.5 - - - - - - MedicoSAM2D 94 90.74±7.7 97.36±3.6 0.76±0.9 77.46±19.3 83.23±18.4 5.00±5.9 91.27±7.4 97.74±3.3 0.69±0.8 SAM-Med2d 271 78.87±13.8 91.43±8.5 2.27±1.8 73.69±17.0 84.48±14.9 5.35±5.0 79.88±13.2 91.59±8.2 2.47±2.1 medical ScribblePrompt-SAM 94 66.23±20.1 78.25±16.8 5.48±4.1 74.19±14.6 84.22±12.6 6.30±5.3 - - - ScribblePrompt-UNet 4 66.14±24.6 79.83±18.2 5.87±4.6 71.15±16.2 80.85±14.7 6.96±5.3 72.46±13.8 83.27±12.0 5.91±4.4 SAM B 94 89.03±9.7 96.89±4.7 1.10±1.4 85.43±14.4 90.82±13.0 4.83±6.3 91.80±8.0 
97.84±4.4 1.07±1.6 SAM H 641 90.44±8.8 97.68±3.9 0.84±1.0 81.83±17.7 87.61±17.6 6.32±9.4 91.56±7.7 98.01±3.8 0.78±1.1 SAM L 312 89.53±9.4 97.34±4.2 0.91±1.1 79.34±20.0 84.83±19.8 6.92±10.7 91.41±8.1 97.92±4.3 0.80±1.2 SAM2.1 B+ 81 90.60±8.1 97.84±3.5 0.82±1.0 83.20±16.5 88.87±15.1 7.59±9.7 91.98±7.2 98.21±3.6 0.73±1.1 natural SAM2.1 L 224 88.39±8.7 97.30±3.9 0.92±1.0 81.72±17.4 88.44±16.4 6.60±10.7 90.90±6.9 98.36±3.2 0.69±1.0 SAM2.1 S 46 89.40±8.3 97.43±3.8 0.91±1.0 82.26±15.6 88.46±14.2 6.64±8.4 91.51±7.0 98.40±3.3 0.69±0.9 SAM2.1 T 39 89.57±8.4 97.55±3.8 0.88±1.0 82.12±16.3 88.62±14.8 6.16±8.6 91.83±6.9 98.38±3.2 0.71±1.0 3D Models evaluated volumetric Med-SAM2 39 79.56±11.1 80.25±10.5 13.49±11.1 - - - - - -", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. 
Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 61, + "total_chunks": 71, + "char_count": 1459, + "word_count": 200, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "07943c1b-448f-4018-889c-df912f9acfed", + "text": "MedicoSAM3D 94 51.78±15.1 52.73±13.9 34.85±13.6 54.39±16.4 53.70±15.4 36.89±17.9 52.16±15.0 53.04±13.9 34.65±14.0 SAM-Med3d-Turbo-crop 101 - - - 25.22±9.0 18.20±7.3 57.92±9.7 - - - SAM-Med3d-Turbo-resample 101 - - - 4.85±5.3 4.20±3.6 121.30±20.9 - - - SAM-Med3d-crop 101 - - - 28.76±8.6 19.91±5.8 52.93±10.2 - - - medical SAM-Med3d-resample 101 - - - 3.52±2.5 3.02±1.6 116.75±20.0 - - - SegVol 181 - - - 33.47±13.5 32.97±12.1 62.53±22.4 - - - Vista3D 218 - - - 25.70±13.1 22.32±11.6 58.14±16.0 - - - nnInteractive 102 76.15±9.3 77.51±9.2 25.36±9.9 69.40±11.2 68.23±12.0 30.98±9.4 75.92±9.4 76.60±9.6 26.53±10.3 SAM2.1 B+ 81 66.11±10.1 66.59±10.0 24.77±18.1 53.38±18.1 50.31±19.6 48.14±29.5 68.33±9.4 67.86±10.2 26.04±18.2\nnatural SAM2.1SAM2.1 LS 22446 67.69±10.258.98±11.8 68.48±10.057.27±11.4 31.67±21.655.04±30.1 56.90±19.148.41±20.0 53.96±20.244.29±20.9 47.84±31.269.02±34.4 70.22±10.162.42±11.3 69.88±10.759.79±11.4 32.21±22.055.14±29.8\nSAM2.1 T 39 61.87±11.9 63.40±11.0 34.24±22.6 54.74±15.9 52.92±16.9 46.40±28.5 65.89±9.8 66.34±9.8 33.41±21.4 Figure D.12: Axial slice of Wrist with lowest DSC value (69.9%) across 2D models. The predictions are binary and were combined for visualization; as a result, some predicted regions may not appear because each pixel can only be\nassigned a single label. Figure D.13: Axial slice of Lower Leg with lowest DSC value (62.1%) across 2D models. The predictions are binary and were combined for visualization; as a result, some predicted regions may not appear because each pixel can only be\nassigned a single label. Figure D.14: Axial slice of Shoulder with lowest DSC value (75.1%) across 2D models. 
The predictions are binary and were combined for visualization; as a result, some predicted regions may not appear because each pixel can only be\nassigned a single label. Figure D.15: Axial slice of Hip with lowest DSC value (58.4%) across 2D models.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 62, + "total_chunks": 71, + "char_count": 1895, + "word_count": 267, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "553c83d6-360c-4859-8307-418d6500bcb8", + "text": "The predictions are binary and were combined for visualization; as a result, some predicted regions may not appear because each pixel can only be\nassigned a single label.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 63, + "total_chunks": 71, + "char_count": 170, + "word_count": 28, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c6b232e1-b022-4ce6-b8ad-d827dea09e04", + "text": "SAM2.1\nComparing SAM2 (released July 29, 2024) and SAM2.1 (released September 29, 2024) showed only marginal differences\nin segmentation performance for the same prompt type and model size (Table E.12). 
Using the paired Wilcoxon signed-rank test with Bonferroni correction (n = 12), none of the model pairs showed a statistically significant difference on\nany of the three metrics, except for the comparison between SAM2 T and SAM2.1 T prompted with bounding box. Table E.12: Comparison of 2D segmentation performance of all model sizes of SAM2 and SAM2.1 per prompt type.\n↗indicates that all metrics improve, whereas – denotes no consistent trend across metrics. Asterisk (∗) marks statistically significant differences\nbetween models (p-value < 0.05/12 = 0.0042). Model SAM2 Trend SAM2.1 DSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ (%) (%) (mm) (%) (%) (mm) B+ 90.40±8.0 97.81±3.4 0.81±0.9 ↗ 90.60±8.1 97.84±3.5 0.82±1.0 L 88.23±8.8 97.20±4.0 0.93±1.0 ↗ 88.39±8.7 97.30±3.9 0.92±1.0 S 89.06±8.8 97.28±4.0 0.93±1.0 ↗ 89.40±8.3 97.43±3.8 0.91±1.0 T 89.07±8.5 97.39±3.9 0.92±1.0 ↗* 89.57±8.4 97.55±3.8 0.88±1.0 B+ 83.39±16.6 89.12±15.3 7.45±9.9 – 83.20±16.5 88.87±15.1 7.59±9.7 L 78.45±21.2 85.49±21.0 8.30±13.4 ↗ 81.72±17.4 88.44±16.4 6.60±10.7 S 81.51±16.9 87.56±16.3 7.22±9.5 ↗ 82.26±15.6 88.46±14.2 6.64±8.4 T 80.38±18.0 86.84±16.9 7.53±10.9 ↗ 82.12±16.3 88.62±14.8 6.16±8.6", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. 
Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 64, + "total_chunks": 71, + "char_count": 1368, + "word_count": 201, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "567bb7ca-2fe0-4212-b65a-df2e5972a6e3", + "text": "B+ 91.82±7.1 98.32±3.3 0.70±1.0 – 91.98±7.2 98.21±3.6 0.73±1.1 L 90.78±7.0 98.28±3.1 0.68±0.9 – 90.90±6.9 98.36±3.2 0.69±1.0 S 91.48±7.1 98.28±3.5 0.71±1.0 – 91.51±7.0 98.40±3.3 0.69±0.9 T 91.33±6.9 98.26±3.3 0.73±1.0 ↗ 91.83±6.9 98.38±3.2 0.71±1.0", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 65, + "total_chunks": 71, + "char_count": 248, + "word_count": 32, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "af2e99ed-f872-43d9-9abc-b205125d43f0", + "text": "Limited vs. unlimited volume propagation\nSAM2.1 and Med-SAM2 generate volumetric predictions via memory bank and a propagation mechanism, which\ncan be restricted to known start and/or end slices (see Table 1). Although MedicoSAM3D also employs slice-by-slice\npropagation, the original method does not include a volume restriction for prediction and was therefore not included in\nour analysis. Applying the prediction volume restriction requires knowing the object's top and bottom slices, which adds\ntwo extra annotations to the required input information. However, limiting the propagation yielded better performance\ncompared to unlimited propagation for all models (Table E.13). 
Table E.13: Comparison of volumetric prediction without (default setting) and with propagation limitation, per prompt type. Model unlimited propagation limited propagation DSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ (%) (%) (mm) (%) (%) (mm) Med-SAM2 79.56±11.1 80.25±10.5 13.49±11.1 84.00±7.3 84.03±7.6 7.76±4.8 SAM2.1 B+ 66.11±10.1 66.59±10.0 24.77±18.1 83.47±6.5 84.07±6.9 6.75±4.4 SAM2.1 L 58.98±11.8 57.27±11.4 55.04±30.1 80.97±7.2 80.99±7.4 8.41±6.1 SAM2.1 S 67.69±10.2 68.48±10.0 31.67±21.6 82.70±6.8 84.15±6.8 7.85±6.7 SAM2.1 T 61.87±11.9 63.40±11.0 34.24±22.6 81.50±9.8 83.09±9.3 8.91±9.2 SAM2.1 B+ 53.38±18.1 50.31±19.6 48.14±29.5 69.15±16.7 65.92±19.2 27.68±18.8 SAM2.1 L 48.41±20.0 44.29±20.9 69.02±34.4 67.98±19.8 64.41±22.4 25.60±21.1", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 66, + "total_chunks": 71, + "char_count": 1424, + "word_count": 189, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7a846b1a-756d-4f4a-8cf4-4a2bed1da5de", + "text": "SAM2.1 S 56.90±19.1 53.96±20.2 47.84±31.2 70.76±17.4 67.50±19.7 22.96±19.2 SAM2.1 T 54.74±15.9 52.92±16.9 46.40±28.5 73.38±15.5 71.38±17.2 22.13±19.0 SAM2.1 B+ 68.33±9.4 67.86±10.2 26.04±18.2 86.47±4.7 86.87±5.8 6.59±4.2 SAM2.1 L 62.42±11.3 59.79±11.4 55.14±29.8 84.98±5.3 84.44±6.5 8.37±6.2 SAM2.1 S 70.22±10.1 69.88±10.7 32.21±22.0 86.16±5.7 86.77±6.4 7.47±6.0 SAM2.1 T 65.89±9.8 66.34±9.8 33.41±21.4 86.35±5.6 87.23±6.2 6.96±5.7 Single vs. 
multiple initial slices\nFor medical FMs (Med-SAM2, SegVol, Vista3D, nnInteractive), using multiple initial slices improved the performance\nfor all prompt types, whereas for SAM2.1 models (except SAM2.1 L box-prompted), the performance was better for a\nsingle initial slice (Table E.14). nnInteractive box-prompted outperformed Med-SAM2, which was the Pareto-optimal\nmodel for the default settings (i.e., single initial slice). Using the paired Wilcoxon signed-rank test with Bonferroni\ncorrection (n = 18), all model pairs showed a statistically significant difference in all three metrics, except for SAM2.1\nL and SegVol. Table E.14: Comparison of volumetric prediction with a single initial slice (default setting) or all initial slices, per prompt type.\n↗indicates that all metrics improve, whereas ↘indicates that all metrics deteriorate. Asterisk (∗) marks statistically significant differences between\nmodels. Model 1 initial slice Trend NS initial slices\nDSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ (%) (%) (mm) (%) (%) (mm) Med-SAM2 84.00±7.3 84.03±7.6 7.76±4.8 ↗* 86.57±6.3 87.54±6.3 4.75±3.1 SAM2.1 B+ 66.11±10.1 66.59±10.0 24.77±18.1 ↘* 59.80±9.0 60.19±7.2 38.10±20.4 SAM2.1 L 58.98±11.8 57.27±11.4 55.04±30.1 ↗ 60.01±11.9 59.17±10.5 51.21±31.3", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. 
Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 67, + "total_chunks": 71, + "char_count": 1692, + "word_count": 228, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0bb1696a-f413-44cb-933d-6b1c896bb8a6", + "text": "SAM2.1 S 67.69±10.2 68.48±10.0 31.67±21.6 ↘* 60.84±9.2 60.10±8.1 54.76±25.9 SAM2.1 T 61.87±11.9 63.40±11.0 34.24±22.6 ↘* 55.64±8.9 57.06±8.2 46.79±24.4 nnInteractive 76.15±9.3 77.51±9.2 25.36±9.9 ↗* 90.02±5.8 92.08±5.9 2.69±1.8 SAM2.1 B+ 53.38±18.1 50.31±19.6 48.14±29.5 ↘* 41.87±13.8 40.96±12.6 57.80±16.3 SAM2.1 L 48.41±20.0 44.29±20.9 69.02±34.4 ↘ 38.92±22.1 37.36±20.2 74.30±38.0 SAM2.1 S 56.90±19.1 53.96±20.2 47.84±31.2 ↘* 44.84±16.8 42.52±14.3 71.08±29.5 SAM2.1 T 54.74±15.9 52.92±16.9 46.40±28.5 ↘* 42.53±15.1 43.22±14.0 62.72±30.3 SegVol 33.47±13.5 32.97±12.1 62.53±22.4 ↗ 38.32±14.2 37.42±14.2 19.86±8.4 Vista3D 25.70±13.1 22.32±11.6 58.14±16.0 ↗* 44.98±14.8 35.88±12.9 28.00±14.7 nnInteractive 69.40±11.2 68.23±12.0 30.98±9.4 ↗* 85.67±7.1 82.89±9.7 4.44±2.7 SAM2.1 B+ 68.33±9.4 67.86±10.2 26.04±18.2 ↘* 60.65±8.4 60.91±7.1 39.73±20.4 SAM2.1 L 62.42±11.3 59.79±11.4 55.14±29.8 ↘ 62.37±11.3 62.10±10.8 50.94±30.1 SAM2.1 S 70.22±10.1 69.88±10.7 32.21±22.0 ↘* 61.01±8.7 60.67±8.1 56.47±25.8 SAM2.1 T 65.89±9.8 66.34±9.8 33.41±21.4 ↘* 58.48±8.4 59.77±7.5 46.36±24.8 nnInteractive 75.92±9.4 76.60±9.6 26.53±10.3 ↗* 89.81±5.2 91.37±6.3 2.70±1.7", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. 
Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 68, + "total_chunks": 71, + "char_count": 1148, + "word_count": 130, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "176b2fa2-a4bd-454f-bfe8-aa81bc520ccb", + "text": "Single vs. multiple prompts\nThe support for multiple prompts varies for 3D models, with more models supporting multiple point (see Table 1). The\nmultiple prompt setting was equivalent to the default setting for 2D models. Comparing the volumetric segmentation\nperformance for single vs. multiple prompts per prompt type showed only marginal differences per model (Table E.15). Using the paired Wilcoxon signed-rank test with Bonferroni correction (n = 6 for bounding box, n = 24 for center point),\nonly MedicoSAM3D showed statistically significant difference in all three metrics. Table E.15: Comparison of volumetric prediction with a single (default setting) or multiple (up to 5) prompts, per prompt type. ↗indicates\nthat all metrics improve, ↘indicates that all metrics deteriorate, whereas – denotes no consistent trend across metrics. An asterisk (*) marks\nstatistically significant differences between models. 
Model 1 prompt Trend up to 5 prompts DSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ (%) (%) (mm) (%) (%) (mm) MedicoSAM3D 51.78±15.1 52.73±13.9 34.85±13.6 ↘* 51.63±15.1 52.59±14.0 35.15±13.9 nnInteractive 76.15±9.3 77.51±9.2 25.36±9.9 – 76.63±8.7 78.09±8.6 25.26±9.7 MedicoSAM3D 54.39±16.4 53.70±15.4 36.89±17.9 ↘* 54.11±16.5 53.45±15.6 37.32±18.3 SAM2.1 B+ 53.38±18.1 50.31±19.6 48.14±29.5 – 53.27±18.6 50.28±20.1 47.91±29.8 SAM2.1 L 48.41±20.0 44.29±20.9 69.02±34.4 – 48.43±20.2 44.35±21.1 68.72±34.2 SAM2.1 S 56.90±19.1 53.96±20.2 47.84±31.2 ↘ 56.76±19.4 53.83±20.6 48.38±31.7", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 69, + "total_chunks": 71, + "char_count": 1487, + "word_count": 211, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f10585bf-cd1b-464a-9e0d-de71e50da721", + "text": "SAM2.1 T 54.74±15.9 52.92±16.9 46.40±28.5 ↘ 54.23±16.5 52.37±17.4 47.53±29.2 SegVol 33.47±13.5 32.97±12.1 62.53±22.4 – 33.63±13.4 33.12±12.0 62.90±22.6 Vista3D 25.70±13.1 22.32±11.6 58.14±16.0 ↘ 25.63±13.0 22.31±11.5 58.34±16.1 nnInteractive 69.40±11.2 68.23±12.0 30.98±9.4 ↗ 69.66±10.8 68.50±11.6 30.80±9.2", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. 
Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 70, + "total_chunks": 71, + "char_count": 307, + "word_count": 33, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e7b0efef-acc7-4ecf-8d84-8ae577e817d3", + "text": "Comparison segmentation with reference and human prompts Table F.16 shows the average difference for the performance of FMs prompted with reference and human prompts. The paired Wilcoxon signed-rank test showed a statistically significant difference for the overall comparison of 2D and\n3D models, with p-value smaller than the Bonferroni-corrected α-value (0.05/6 = 0.0083). Table F.16: Difference in segmentation performance between reference and human prompts, per prompt type. The models with the least difference per prompt type are highlighted in bold. The selected models are the smallest Pareto-optimal models prompted\nwith reference prompts per category highlighted in bold in Table 4.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. 
Sánchez",
      "Hoel Kervadec"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10541v1",
    "chunk_index": 71,
    "total_chunks": 71,
    "char_count": 694,
    "word_count": 100,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f577a987-3563-4f95-8462-8ef5b8418841",
    "text": "Model | Size (M) | Bounding Box (2D or 3D): DSC ↑ (%), NSD ↑ (%), HD95 ↓ (mm) | Center Point (2D or 3D): DSC ↑, NSD ↑, HD95 ↓ | Combination (2D or 3D): DSC ↑, NSD ↑, HD95 ↓
2D Models
medical:
MedicoSAM2D | 94 | 3.39 ± 6.3, 1.41 ± 4.4, -0.41 ± 1.1 | 1.24 ± 6.3, 0.19 ± 3.6, -0.05 ± 2.4 | 3.47 ± 6.3, 2.15 ± 5.3, -0.62 ± 1.5
ScribblePrompt-SAM | 94 | - | 1.33 ± 5.5, 0.28 ± 3.5, -0.16 ± 2.2 | -
SAM B | 94 | - | 1.39 ± 5.5, 0.13 ± 2.9, 0.16 ± 3.2 | -
natural:
SAM2.1 B+ | 81 | 2.05 ± 5.8, 0.92 ± 3.7, -0.31 ± 1.0 | - | -
SAM2.1 T | 39 | - | - | 1.64 ± 5.2, 0.99 ± 4.1, -0.36 ± 1.2
Average per prompt type | 2.72 ± 6.1, 1.16 ± 4.1, -0.36 ± 1.1 | 1.32 ± 5.8, 0.20 ± 3.3, -0.02 ± 2.6 | 2.56 ± 5.8, 1.57 ± 4.8, -0.49 ± 1.4
Average 2D Models: 2.07 ± 1.0 % DSC (p < 0.001), 0.87 ± 0.7 % NSD (p < 0.001), -0.25 ± 0.3 mm HD95 (p < 0.001)
3D Models (evaluated volumetric)
medical:
Med-SAM2 | 102 | 76.80±13.5, 79.27±11.2, 14.46±11.8 | 68.12±12.6, 68.63±11.5, 30.10±8.8 | 75.59±10.6, 77.29±9.1, 25.65±9.5
nnInteractive | 39 | - | - | -
natural:
SAM2.1 S | 46 | 65.93±11.6, 67.83±10.2, 32.71±21.6 | 53.72±16.3, 52.93±16.5, 46.84±27.8 | 68.80±11.2, 69.19±10.9, 33.88±22.4
SAM2.1 T | 39 | - | - | -
Average per prompt type | 1.76 ± 5.8, 0.96 ± 4.8, -0.89 ± 10.2 | 0.80 ± 7.4, 0.07 ± 6.6, 0.20 ± 7.8 | 0.63 ± 4.5, 0.40 ± 4.2, -0.48 ± 8.2
Average 3D Models: 1.06 ± 0.7 % DSC (p < 0.001), 0.47 ± 0.6 % NSD (p < 0.001), -0.39 ± 0.7 mm HD95 (p < 0.001)",
    "paper_id": "2603.10541",
    "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation",
    "authors": [
      "Caroline Magg",
      "Maaike A. ter Wee",
      "Johannes G. G. Dobbe",
      "Geert J. Streekstra",
      "Leendert Blankevoort",
      "Clara I. 
Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 72, + "total_chunks": 71, + "char_count": 1325, + "word_count": 276, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10544_semantic.json b/data/chunks/2603.10544_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..957cad6f3ea3b1d6ed302037a200c94e0de57aa7 --- /dev/null +++ b/data/chunks/2603.10544_semantic.json @@ -0,0 +1,954 @@ +[ + { + "chunk_id": "7b97d23d-efbc-49f5-8a18-b427f4f406ca", + "text": "1Osmo Labs PBC New York, USA Corresponding authors: guillaume@osmo.ai Residual connections are central to modern deep neural networks, enabling stable optimization\nand efficient information flow across depth. In this work, we propose SCORE (Skip-Connection\nODE Recurrent Embedding), a discrete recurrent alternative to classical layer stacking. Instead\nof composing multiple independent layers, SCORE iteratively applies a single shared neural\nblock using an ODE (Ordinary Differential Equation) inspired contractive update : ht+1 = (1 - Δt) * ht + Δt * Fᶿ(ht) This formulation can be interpreted as a depth-by-iteration refinement process, where the step\nsize Δt explicitly controls stability and update magnitude. Unlike continuous Neural ODE\napproaches, SCORE uses a fixed number of discrete iterations and standard backpropagation\nwithout requiring ODE solvers or adjoint methods. We evaluate SCORE across graph neural networks (ESOL molecular solubility), multilayer\nperceptrons, and Transformer-based language models (nanoGPT). Across architectures,\nSCORE generally improves convergence speed and often accelerates training. SCORE is reducing parameter count through shared weights. 
In practice, simple Euler\nintegration provides the best trade-off between computational cost and performance, while\nhigher-order integrators yield marginal gains at increased compute. These results suggest that controlled recurrent depth with contractive residual updates offers a\nlightweight and effective alternative to classical stacking in deep neural networks. Residual connections are a cornerstone of deep neural networks, enabling stable optimization\nand efficient information flow across many layers. Additive skip connections have proven\neffective in vision models such as ResNet and in sequence models.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 1, + "total_chunks": 56, + "char_count": 1802, + "word_count": 242, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "05c5542e-ff04-4ad1-8a1d-cd35e302d64c", + "text": "In this work, we propose\nreplacing a stack of layers with recurrent refinement steps through a shared block, by revising\nthe skip connection so that it mimics a discretized Ordinary Differential Equation (ODE). This\napproach applies to any sequential architecture with identical dimensions; we evaluate it on\ngraph convolutional networks, Transformers, and deep feedforward networks. Existing ODE-based neural networks convert standard architectures into a continuous ODE and\nsolve it with a dedicated solver; examples include Graph Neural ODE(1) and Neural ODE(2). We bypass the need for a continuous ODE solver by generalizing the skip-connection update in\nthe spirit of a discretized ODE(3). Rather than treating residuals as simple additive shortcuts, we\nreinterpret the residual term as a velocity field governing embedding evolution, and in GNNs,\nmessage passing, under a discretized ODE. 
We evaluate several numerical integrators (Euler,\nHeun(3), Midpoint, RK4) and review the impact of the method on GNNs for the molecular\nsolubility benchmark ESOL(4) and on nanoGPT(5,6) with the Shakespeare dataset as well as\nAutosearch 5 min challenge. We refer to this approach as SCORE (Skip-Connection ODE Recurrent Embedding): the\nsequence of layers is replaced by recurrent steps that evolve the embedding according to a\ndiscretized ODE (fig 1). Empirically, we generally observe improved convergence stability and\nfaster optimization across multiple architectures. This behavior is also slightly observed for\nnanoGPT trained on the Shakespeare corpus and autosearch challenge. Simple Euler\nintegration offers the best trade-off between performance and cost; Heun or RK4 can yield slight\ngains at higher computational cost. Residual skip connections have become ubiquitous since ResNet(7), where they mitigate\nvanishing gradients and ease optimization. Sander et al. explored the classical ResNets\nstacking version with adjoint method (3) as well as the Heun example; they did not use any\nrecurrence layers in their ResNets examples. In graph neural networks, the same additive\nresidual formulation often yields mixed results; beneficial for some architectures (e.g. GAT(8))\nbut detrimental for others (e.g.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 2, + "total_chunks": 56, + "char_count": 2207, + "word_count": 321, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "94e5f589-2d2e-409c-9143-d3120f14ba32", + "text": "MPNN(9), DMPNN(10), Graph Transformers(11)) in our\nexperiments. 
A limitation of classical stacking is that depth is implemented as the composition of\nindependent transformations, without explicit control over the magnitude or stability of iterative\nupdates. In contrast, a dynamical perspective treats depth as an evolution process governed by\ncontrolled update rules. More generally, a continuous-time view allows embeddings to evolve\naccording to a differential equation rather than a fixed discrete update. To obtain a dynamic, ODE-inspired skip connection without a continuous solver, we adopt a\nsimplified ODE analogy. The Graph Neural ODE(1) was first proposed in 2019 and relies on a\ncontinuous ODE formulation. We do not follow that route, as we use a fixed number of discrete\nsteps with a simple Euler-style update (the residual as velocity) and do not introduce any\ncontinuous ODE solver or adjoint gradients. The update is a single Euler step per \"layer\": the\nembedding is updated by adding a scaled residual (difference term), yielding a lightweight\nrecurrence that can be applied to GNNs, dense networks, and Transformers alike. Several architectural paradigms exist for deep models with repeated transformations: (i)\nclassical stacking of independent layers with or without residual connections, (ii) parameter tying\nacross depth as in ALBERT-style models(12), and (iii) recurrent depth refinement such as the\nUniversal Transformer(13). SCORE belongs to the third family in that it iteratively applies a\nsingle block across steps, but differs in its explicit ODE-motivated contractive update rule\n(equation 1). Unlike continuous Neural ODE models, SCORE uses a fixed number of discrete iterations and does not rely on an ODE solver or adjoint method. The step size Δt directly\ncontrols stability and contraction properties of the update. Prior work has explored parameter-efficient architectures through tied parameters and iterative\nrefinement. 
For example, ALBERT shares parameters across layers to reduce model size while\nmaintaining performance.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 3, + "total_chunks": 56, + "char_count": 2063, + "word_count": 303, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "920ff6a2-c039-46af-b6fc-7eed99843df5", + "text": "The Universal Transformers introduce a recurrent mechanism across\ndepth to refine representations iteratively using the same transformation function. In this perspective, stacking corresponds to a sequence of independent operators, while\nSCORE interprets depth as the repeated application of a single operator under a controlled\ndynamical update. Recent work has explored recurrent reasoning models for symbolic tasks (14) (e.g., Sudoku or\nARC-AGI). These approaches focus on iterative reasoning rather than architectural depth\nreduction and are therefore outside the scope of this work. SCORE can be interpreted as a Krasnosel'skii–Mann-style relaxed fixed-point iteration applied\nto a learnable operator Fᶿ, while recurrent reasoning models typically employ the unrelaxed\nrecurrence ht+1 = Fᶿ(ht) . Under this view, plain recurrent iteration appears as the special case α =\n1, and SCORE generalizes it through an explicit relaxation parameter that modulates update\nstability and dynamics. Empirically, SCORE often performs well with substantially reduced\ndropout, consistent with an implicit regularization effect induced by shared parameters and the\nrelaxed iterative update. 
Our contributions are:", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 4, + "total_chunks": 56, + "char_count": 1201, + "word_count": 166, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0e3cc0cc-db56-46f5-a5e7-f423a02a1ea7", + "text": "• We introduce a gated residual formulation for the recursive application of a shared\nneural block. • Graph neural networks: replacing stacked convolutions with recurrent Euler residual\nsteps and a single shared convolution generally improves convergence stability. • Dense networks: replacing stacked dense layers with recurrent Euler residual steps\nand a single shared dense layer maintains performance while reducing parameter count. • Transformers: replacing stacked decoder blocks with recurrent Euler residual steps\nusing a shared block yields competitive performance on nanoGPT with a smaller number\nof parameters. Figure 1: SCORE skip-connection equation using recurrent layer in GNN In contrast to classical stacking of independent layers {F1, F2, …, Fk}, SCORE uses a single\nneural block F whose parameters are shared across steps. The same block is iteratively applied\nK times, producing a depth-by-iteration refinement process rather than a composition of distinct\nlayers. 
The residual can be interpreted as a velocity field governing embedding evolution across
propagation steps.",
    "paper_id": "2603.10544",
    "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth",
    "authors": [
      "Guillaume Godin"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10544v1",
    "chunk_index": 5,
    "total_chunks": 56,
    "char_count": 1092,
    "word_count": 157,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b9947684-248b-4a5e-a6c7-2af59658515e",
    "text": "The SCORE formulation is defined by equation 1. ht+1 = ht + Δt *( Fᶿ(ht) - ht ) (equation 1) The parameters of Fᶿ are tied across all iterations t = 1,...,K, making SCORE a recurrent depth
formulation rather than a stacked architecture. It can be rewritten as a weighted contractive
residual recurrence (equation 2). For example, Δt = 0.5 corresponds to averaging the previous
embedding and the transformed embedding. ht+1 = (1 - Δt) * ht + Δt * Fᶿ(ht) (equation 2) For Δt in [0,1], this update corresponds to a convex interpolation between the previous
embedding and the transformed embedding. The parameter Δt therefore directly controls the
magnitude of the update and can induce a contractive behavior when F is Lipschitz-bounded. In
practice, this stabilizes the iterated application of the shared block and mitigates divergence or
oversmoothing. We can consider SCORE as a static residual gate. In our study, two values of Δt were used, 0.5 and the
inverse of the number of recurrent steps; both give similar results. 
Stability and Step Size Interpretation SCORE is derived from a first-order explicit Euler discretization of a differential equation of the
form: dh/dt = Fᶿ(h) − h (equation 3) Applying one Euler step with step size Δt yields Equation (1).",
    "paper_id": "2603.10544",
    "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth",
    "authors": [
      "Guillaume Godin"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10544v1",
    "chunk_index": 6,
    "total_chunks": 56,
    "char_count": 1241,
    "word_count": 206,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0dac1128-5400-47c3-88b2-6655b30f5b25",
    "text": "In this interpretation, Δt plays the
role of a time step controlling how far the embedding evolves at each iteration. A natural
conservative choice is Δt = 1/K when using K refinement steps, analogous to refining a
discretization with smaller steps. However, in practice we observe that a fixed averaging update
Δt = 0.5 is equally stable and often slightly more effective. 
Empirically, both schedules produce\nstable dynamics across architectures, with Δt acting as a simple and effective stability knob\nrather than a parameter requiring delicate tuning.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 7, + "total_chunks": 56, + "char_count": 554, + "word_count": 87, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1c298f89-a1c6-405c-9464-5f0cf08adca3", + "text": "Euler Simplified family We explore several numerical integrators to approximate the ODE(15) solution using one of the\nfollowing methods: ​ •​ Euler (equation 1 and 2) ​ •​ Runge–Kutta 4 (RK4) Importantly, unlike Neural ODE approaches that rely on adaptive continuous solvers and adjoint\nbackpropagation, SCORE fixes the number of discrete steps K and uses standard\nbackpropagation through the unrolled iterations. While higher-order methods provide better theoretical accuracy(3), they also increase\ncomputational cost due to multiple evaluations of the GNN per layer (see supplementary figures\n9 and 10). In default experiments, we use four propagation steps and apply a scaling factor Δt. I\ndecided to define the Δt = 1 / n_steps where n_steps = 4 as default. 
So in practice, Δt ranges over [1/7, 0.5], as we went from factors of 1/2 down to 1/7 using 2 to 7 steps (see supplementary
figures 18-21).",
    "paper_id": "2603.10544",
    "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth",
    "authors": [
      "Guillaume Godin"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10544v1",
    "chunk_index": 8,
    "total_chunks": 56,
    "char_count": 896,
    "word_count": 144,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "32e4e62a-15a6-4a71-b102-bd58c5feed42",
    "text": "We evaluate SCORE on two tasks: molecular property prediction with graph neural networks
and language modeling with Transformers. We use ESOL as a well-established benchmark dataset for aqueous solubility prediction. We
follow a 5-fold cross-validation protocol with an 80/20 train/test split. We report benchmark
methods using the same CV split as well. For Transformer experiments, we use the Shakespeare dataset from the Gutenberg project,
using the nanoGPT training setup. We use a 90/10 training-validation split. We use the GPT-4o
tokenizer, not the simple character-level split. We used the MLX implementation of nanoGPT as the
baseline from Karpathy's developments: https://github.com/shakedzy/nanogpt. A few modifications were tested to modernize the architecture with recent state-of-the-art
progress in the field, including Relu2, RMSNorm, RoPE and normalized Q,K vectors based on
nanoChat (https://github.com/karpathy/nanochat); I called this version nanoGPTx. The goal
here was to see if we can reduce the number of Transformer layers and keep decent and fast
convergence using SCORE. A second experiment was run with the nanochat MLX version just after the autosearch code
was published. In this setup the goal is to reach the smallest loss within 5 minutes. 
SCORE
provides the smallest loss with 4 M fewer parameters than the default version on an Apple
MacBook M3 Max 128 GB computer.",
    "paper_id": "2603.10544",
    "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth",
    "authors": [
      "Guillaume Godin"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10544v1",
    "chunk_index": 9,
    "total_chunks": 56,
    "char_count": 1390,
    "word_count": 215,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ada669d0-4071-4b8f-980e-08b20a78541c",
    "text": "Graph Neural Network Architectures We compare native and ODE-residual variants of the following well-known graph neural network
architectures: AttentiveFP(16), DMPNN (ChemProp(10)), GAT(8), GATv2(17), GINE, MPNN(9)
and Graph Transformer(11). Those models are generally very fast and give good performance,
especially AttentiveFP and ChemProp. For each architecture we evaluate five configurations:",
    "paper_id": "2603.10544",
    "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth",
    "authors": [
      "Guillaume Godin"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10544v1",
    "chunk_index": 10,
    "total_chunks": 56,
    "char_count": 398,
    "word_count": 49,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2a3e82f9-e81e-4b44-b7d7-d50e607fab8d",
    "text": "• GNN-base: model without skip connections
• GNN-classic: residual connection with LayerNorm
• GNN-skip05: Euler residual averaging (Δt = 0.5)
• SCORE-GNN: recurrent shared block with Δt = 1/K
• SCORE-GNN-skip05: recurrent shared block with Δt = 0.5
We systematically include the MolAttFP virtual node pooling aggregation instead of the classical
pooling for all models by default, as in AttentiveFP (see supplementary material for ablation studies). 
Graph neural network training protocol Models are trained using the Adam optimizer with a learning rate of 1e-3 and batch size 32. Training runs for up to 150 epochs per fold to analyse convergence behaviour. All experiments
were conducted using the MLX framework. All experiments were done on an Apple M4 Pro
with 24 GB of RAM using a custom mlx-graphs version. We used recent RIGR features, which are tautomer/resonance invariant. All models are plugged
into an identical MLP head so that the final MLP does not confound performance comparisons between
models.",
    "paper_id": "2603.10544",
    "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth",
    "authors": [
      "Guillaume Godin"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10544v1",
    "chunk_index": 11,
    "total_chunks": 56,
    "char_count": 1001,
    "word_count": 155,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "40b17fa7-403c-47d3-9b14-d82ec3c8efb4",
    "text": "It consists of one 10% Dropout layer followed by 3 layers of respective dimensions [128, 64, 32]
using the leaky_relu activation function. The final projection is a linear dense output to 1 dimension. The SCORE-MLP is a single 128-dimensional layer recurrence using Δt = 1/N in the Euler equation. The ESOL log10 target was not scaled during training, although this is sometimes done in the literature. 
So
the RMSE (root mean squared error) is the natural error along the LogSolubility range [−8.057,
1.071].",
    "paper_id": "2603.10544",
    "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth",
    "authors": [
      "Guillaume Godin"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10544v1",
    "chunk_index": 12,
    "total_chunks": 56,
    "char_count": 470,
    "word_count": 80,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a861b99a-2ac1-40a3-a52b-0fb894471fde",
    "text": "I use a batch size of 32 and a fixed learning rate, without any early stopping or learning rate
scheduler. I have also investigated the SCORE-MLP using the 217 RDKit features compared to an
MLP with 4 layers. Random Trees, Boosted Tree, Support Vector Machine and Lasso linear
models were also evaluated using the same dataset in order to compare the performances. I
tested CatBoost, XGBoost, LightBoost, Random Forest, SVR and Lasso with feature selection
using SHAP value importances from the 217 RDKit features. It shows that CatBoost with the 217 RDKit
features can provide a 0.56 RMSE in CV5, and this is the only method that reaches this
performance among the 6 methods tested. We did not run any hyperparametrization(18) for the
layer dimensions.",
    "paper_id": "2603.10544",
    "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth",
    "authors": [
      "Guillaume Godin"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10544v1",
    "chunk_index": 13,
    "total_chunks": 56,
    "char_count": 735,
    "word_count": 122,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e09748f0-622e-4604-842d-281f04f96f16",
    "text": "Graph neural network molecular features We optionally augment graph embeddings with a vector of 217 RDKit molecular descriptors. 
Special process for extreme and not available numbers in the RDkit matrix: ​ 1.​ Arcsinh squashing (mask NaN/ Inf) ​ 2.​ Standard scaling (mask NaN/ Inf) ​ 3.​ NaN / Inf replaced by zero (ie mean imputation in scaled space)", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 14, + "total_chunks": 56, + "char_count": 352, + "word_count": 58, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f797bec3-9cd3-4eca-82dc-4b6c76468ddf", + "text": "To evaluate the generality of the SCORE formulation, we also apply it to Transformer\narchitectures using nanoGPT. The main question is whether a single Transformer block can be reused recurrently\n(SCORE-nanoGPT) instead of stacking multiple blocks. The goal is to evaluate whether\nrecurrent depth improves convergence speed and reduces model size, as this architecture was\nultra fine tuned, we do not observe much improvement compared to native models. We train nanoGPT models with embedding sizes 64 and 384 using the Shakespeare dataset. Models are trained for 10k–15k iterations using Adam or AdamW optimizers. The model was run for 10000 to 15000 iterations. Two models were tested, Small and Large\nwith respectively 64 or 384 embedding size, with the same context window 32, and 4 different\nlayers or 4 steps with the same layer. I used the GTP-4o tokenizer and start-of-play token as\ndescribed to be the best settings in the Github experiments. I used the Shakespeare\nGuntheberg dataset and computed loss to monitor model capabilities, I used Adam or AdamW.​ For the nanochat 5 min challenge, I have tested our SCORE recurrent method versus a 0.5\nresidual connection at every stage (aka skip05). 
We used two different NorMuon implementations\nwith the Polar Express approximation and kept the remaining autosearch (9 March) settings at their\ndefaults, except for the PR4 trial.", "paper_id": "2603.10544", "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", "authors": [ "Guillaume Godin" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10544v1", "chunk_index": 15, "total_chunks": 56, "char_count": 1367, "word_count": 218, "chunking_strategy": "semantic" }, { "chunk_id": "d90a6c87-542d-4f65-a9c0-8123f5322c37", "text": "Baseline models using RDKit descriptors Before analyzing the performance of SCORE on graph neural networks, we first establish\nreference baselines using classical machine learning models trained on RDKit molecular\ndescriptors. The RDKit feature matrix (217 descriptors) was preprocessed by converting invalid\nvalues to NaN, applying arcsinh squashing to limit extreme values, followed by standard scaling\nand mean-imputation in the scaled space. We trained several classical machine learning models using the same dataset splits as the\nneural experiments. Among the tested models, CatBoost achieves the best performance with\nRMSE = 0.56 ± 0.03 (5-fold cross-validation). This result provides a strong reference baseline\nfor the ESOL dataset. Linear models such as LASSO highlight the intrinsic complexity and\nnon-linearity of the solubility prediction task. Feature selection using SHAP(19) improves linear\nmodel performance but still remains below the CatBoost baseline (Table 1). Dense networks: MLP vs SCORE-MLP To verify that the SCORE formulation is not limited to graph architectures, we evaluate its effect\non dense neural networks. We compare a classical multilayer perceptron (MLP) with its\nrecurrent counterpart SCORE-MLP, trained using identical data splits and optimization settings\nfor 150 epochs using the Adam optimizer. 
The results show that SCORE-MLP achieves similar predictive performance while slightly\nreducing the variance across folds, indicating that the recurrent formulation stabilizes dense\nmodels without degrading accuracy (Figure 1). Table 1 — Baseline models using RDKit descriptors CatBoost(20) 0.563±0.03 XGBoost(21) 0.674±0.03", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 16, + "total_chunks": 56, + "char_count": 1660, + "word_count": 231, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3ad2402b-9fa2-43a1-9515-aaf88eb1041a", + "text": "LightBoost(22) 0.614±0.04 Random Forest(23) 0.658±0.06 Lasso(25) (Top-10 features) 0.803±0.07 Lasso (Top-100 features) 0.636±0.02 SCORE-MLP (our method) 0.630±0.03 5-fold cross-validation results on the ESOL dataset. Figure 1 : CV5 benchmarks models RMSE for ESOL prediction lower the better Graph Neural Networks We evaluate the SCORE formulation on a range of graph neural network architectures. To\nobtain strong GNN baselines, we systematically incorporate the MolAttFP virtual node pooling\nmechanism, originally introduced in AttentiveFP, across all architectures. This pooling strategy\nsignificantly improves the stability of graph models and provides a fair comparison across\narchitectures. We also include the SCORE-MLP prediction head after graph pooling to maintain a consistent\narchitecture across all models. During training we observed that some architectures such as\nMPNN and Graph Transformer can be unstable with naive stacking, and benefit from\nLayerNorm (\"classical\" residual connections). In contrast, Euler-style skip connections with Δt =\n0.5 (skip05) provide stable behavior across most architectures. 
Overall results show that several\nSCORE-GNN variants outperform the CatBoost baseline, including DMPNN, AttentiveFP, GINE,\nGCN, GAT and GATv2.", "paper_id": "2603.10544", "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", "authors": [ "Guillaume Godin" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10544v1", "chunk_index": 17, "total_chunks": 56, "char_count": 1265, "word_count": 170, "chunking_strategy": "semantic" }, { "chunk_id": "6473a5e9-67f1-4451-921c-44fe80f42b7b", "text": "Interestingly, the simple GCN architecture also achieves strong results,\ndemonstrating that the SCORE formulation can effectively propagate embeddings even with\nlightweight convolution operators. Across the top-13 performing models (Table 2): ●​ 10 out of 13 models are SCORE variants​ ●​ the second-best performing approach corresponds to the skip05 Euler residual\nformulation​ ●​ both SCORE and skip05 demonstrate strong compatibility with a wide range of GNN\narchitectures These observations suggest that Euler-style residual updates with controlled step size are well\ntolerated across graph convolution operators (Figure 2). Table 2 - Best performing GNN models (5-fold CV) Rank Model Mean best val RMSE\n1 dmpnn_skip05 0.533±0.04\n2 SCORE_dmpnn_skip05 0.542±0.05\n3 SCORE_gat_skip05 0.546±0.04\n4 SCORE_gine 0.547±0.05\n5 SCORE_mpnn 0.555±0.05\n6 gcn_skip05 0.557±0.03\n7 SCORE_dmpnn 0.558±0.04\n8 SCORE_gcn_skip05 0.559±0.01\n9 SCORE_gatv2_skip05 0.559±0.03\n10 gat_skip05 0.561±0.04\n11 SCORE_gine_skip05 0.562±0.04\n12 SCORE_gcn 0.562±0.03\n13 SCORE_gat 0.564±0.04 5-fold cross-validation results on the ESOL dataset. Figure 2 : CV5 benchmark model RMSE for ESOL prediction (lower is better) Comparison of the five configurations of 8 GNN architectures We next apply SCORE to Transformer models using nanoGPT. 
We train nanoGPT models\non the Shakespeare dataset with embedding dimensions of 64 and 384. Models are trained for\n10k–15k iterations using Adam or AdamW optimizers. Using the larger embedding dimension (384), the SCORE model reaches validation loss 5.41,\ncompared with 5.67 for the native nanoGPT model, despite using fewer parameters (28M vs\n34M); see Figure 3.", "paper_id": "2603.10544", "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", "authors": [ "Guillaume Godin" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10544v1", "chunk_index": 18, "total_chunks": 56, "char_count": 1674, "word_count": 237, "chunking_strategy": "semantic" }, { "chunk_id": "ca358679-5305-4bb8-ba22-abba3b2ab4a6", "text": "Figure 3 : Train and validation loss of the nanoGPT variants. Left: small embedding; right: large embedding. SCORE models learn faster with the GPT-4o vocabulary\nembedding. We also evaluate the modified nanoGPTx architecture with embedding size 64 and varying\nnumbers of recurrent steps. Across experiments, SCORE-based models converge slightly faster\nand achieve comparable or slightly improved validation loss. In these experiments, a fixed step size Δt = 0.5 performs slightly better than the theoretical Δt =\n1/N schedule, consistent with observations from the GNN experiments (Figure 4). 
Considering Karpathy's autosearch trials, the best setting without agent intervention was to use\nthe two-step SCORE block twice, i.e., replace d4 by two stacked s2 (SCORE, 2 steps) blocks.", "paper_id": "2603.10544", "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", "authors": [ "Guillaume Godin" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10544v1", "chunk_index": 19, "total_chunks": 56, "char_count": 775, "word_count": 116, "chunking_strategy": "semantic" }, { "chunk_id": "7e5a6086-ba7a-45e1-ac8d-2991315cc948", "text": "We\ngot a val_bpb of 1.302 after 5 min and 1.282 after 6 min for 18M parameters. The 4 stacked layers\nwith a skip05 residual (i.e. averaging) gave val_bpb 1.303 and 1.286 respectively, with 22M\nparameters. Removing the skip05 average recovers the native 4 stacked layers (d4), which\ngave val_bpb 1.309. Again, skip05 improves the native model while SCORE reduces\nthe parameter count. For reference, an Nvidia H100 GPU reaches val_bpb 0.998 in 5 minutes, as its\nclock is faster than MPS. One best model (as of 11 March 2026)\nobtained 1.2809 using a more sophisticated variant of the NorMuon implementation in d4\nafter hyperparameter fine-tuning over 110 autosearch trials. The major differences to", "paper_id": "2603.10544", "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", "authors": [ "Guillaume Godin" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10544v1", "chunk_index": 20, "total_chunks": 56, "char_count": 714, "word_count": 123, "chunking_strategy": "semantic" }, { "chunk_id": "6d0b6feb-df17-4f51-ad1a-81b6eb5af1b2", "text": "our initial NorMuon were stability, and the batch size change from 16 to 8 to allow more iterations in\n5 mins. 
We were able to reach val_bpb 1.2594 on our M3 Max 128 GB hardware using\nthe skip05 option in the d4 22M-parameter model. The 2 sequential recursive SCORE blocks\ngave val_bpb 1.2731, which is expected to be worse since the d4 method was fine-tuned\nover 110 trials and since the parameter count is reduced to 18.4M from the 22M of the\noriginal d4. The original d4 model got val_bpb 1.2621 without skip05 (see table 12). The code is available here: https://github.com/guillaume-osmo/autosearch-mlx. Figure 4 : Train and validation loss of nanoGPT variants when increasing the depth of the nanoGPT structure; there is a small improvement for Δt = 1/N versus the 0.5 option", "paper_id": "2603.10544", "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", "authors": [ "Guillaume Godin" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10544v1", "chunk_index": 21, "total_chunks": 56, "char_count": 769, "word_count": 136, "chunking_strategy": "semantic" }, { "chunk_id": "e1b4d3b5-2969-4ed7-b095-ab4202f91b48", "text": "Based on the lottery ticket hypothesis(26), only a small portion of the weights in\ndeep learning layers is really useful. If we use only one layer initialization, we also reduce the optimization\ndimensionality. We empirically show that a single shared block can effectively support\nmulti-step representation refinement without performance degradation across several\narchitectures. This method generally converges faster and provides better performance. While\nthe established idea is that Δt = 1/N is best in theory, we have seen in our experiments that Δt\n= 0.5 is generally identical or even better, making it a good alternative. 
NanoGPTx is shown to support ablation of one Transformer step, with similar loss and\nconvergence speed, and a notable parameter reduction.", "paper_id": "2603.10544", "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", "authors": [ "Guillaume Godin" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10544v1", "chunk_index": 22, "total_chunks": 56, "char_count": 764, "word_count": 115, "chunking_strategy": "semantic" }, { "chunk_id": "b6b86b49-03b9-4fbd-b2f0-04ac1280f303", "text": "This is very important in the context of\nLLM size. Indeed, these results suggest that SCORE provides a more stable and principled mechanism\nfor deep and multi-step message passing. By explicitly modeling the change in embeddings\nrather than summing representations, SCORE reduces oversmoothing, improves convergence speed and stabilizes training across heterogeneous GNNs, while performing similarly for already fine-tuned MLP and\nTransformer architectures. As we reuse the same convolution weights, the models are smaller but not faster, as we do not\nchange the number of steps. The reduction in parameter count may contribute to improved\noptimization stability by reducing the dimensionality of the parameter space. Because SCORE reuses a single shared block, models contain fewer parameters. Despite this\nconstraint, we obtain performance comparable to stacked architectures, suggesting that\nrecurrent depth can effectively replace multiple independent layers. In small-data settings, such as ESOL (~1000 molecules), we observe a more pronounced\nreduction in training time, whereas in larger-data settings, such as the Shakespeare Gutenberg\ncorpus, the gain is more moderate. This suggests that SCORE may act as an implicit regularizer\nwhose benefits are stronger in low-data regimes. 
This view is consistent with previous work\nshowing that Graph Transformers benefit from larger multitask datasets and auxiliary targets(27). On\nESOL, by contrast, the low-data regime appears to limit Graph Transformer performance, and\nSCORE partially mitigates this limitation. The fact that models without SCORE could outperform SCORE variants should be expected, given\nthat SCORE reduces the number of trainable parameters through weight sharing. In our\nexperiments, however, we also observed that this reduction in parameter count can sometimes be\nbeneficial for training, likely by improving optimization stability and acting as an implicit regularizer. SCORE introduces an implicit iterative refinement loop within each forward pass, which may\nreduce representational variance similarly to how ensemble averaging or repeated reasoning\nimproves output stability(28).", "paper_id": "2603.10544", "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", "authors": [ "Guillaume Godin" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10544v1", "chunk_index": 23, "total_chunks": 56, "char_count": 2148, "word_count": 305, "chunking_strategy": "semantic" }, { "chunk_id": "195e8059-5765-4f5e-b726-fb2216079848", "text": "We introduce a recursive skip-connection block called SCORE that can be used in MLPs, graph\nneural convolutions and Transformers. SCORE's goal is to reuse the same layer recurrently. It is a\nlightweight yet effective alternative to classical skip connections over multiple layers. Across\nmultiple graph neural network architectures on the ESOL target, the simple SCORE formulation, particularly the\nEuler variant with a small or fixed step-size factor, delivers generally robust improvements in\nstability and performance without RDKit features. Similarly, SCORE-MLP and\nSCORE-Transformer maintain competitive convergence and speed. 
This work demonstrates that continuous-time reasoning can meaningfully simplify and improve\nneural network design, without requiring full ODE solvers during training or any adjoint methods. The Δt can be a learnable parameter of the model, per convolution layer: a single trial was done\nthat did not provide better results. A more complete analysis could determine whether a\nstep-dependent Δt provides better results. We can already see that the 0.5 and 1/N factors work well.", "paper_id": "2603.10544", "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", "authors": [ "Guillaume Godin" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10544v1", "chunk_index": 24, "total_chunks": 56, "char_count": 1092, "word_count": 166, "chunking_strategy": "semantic" }, { "chunk_id": "9b3a405b-7ab6-45b3-934a-7a5c6ff982b0", "text": "This is also the first time that we systematically applied the MolAttFP virtual-node trick from AttentiveFP by default\nacross all our graph architectures, obtaining better results without the RDKit features\nthan with them, showing that the learned graph embedding is more efficient\nthan the RDKit features. With SCORE, our GNN alternative obtains better results than CatBoost,\nwhich is generally considered to be among the best models. In terms of perspective, we can rethink the need for several independent layers in deep\nlearning models. One option that worked is to use several SCORE blocks sequentially, as\nobserved in the nanoGPT2 autosearch example. It would be interesting to leverage it in larger\nlanguage models. Our skip05 can already stabilize the residual connection even without\nSCORE blocks. 
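The Euler-style recurrence underlying SCORE can be sketched as follows (a minimal illustration with hypothetical names: `f` is a toy tanh map standing in for a real shared layer, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
W = rng.normal(scale=0.1, size=(D, D))  # weights of the single shared block

def f(h):
    # one shared "layer": a toy tanh linear map standing in for a GNN/Transformer block
    return np.tanh(h @ W)

def score_forward(h, steps=4, dt=0.5):
    # Euler-style recurrent depth: h <- h + dt * f(h), reusing the SAME
    # weights at every step instead of stacking `steps` independent layers
    for _ in range(steps):
        h = h + dt * f(h)
    return h

h0 = rng.normal(size=(2, D))
out_skip05 = score_forward(h0, steps=4, dt=0.5)    # fixed step, as in skip05
out_theory = score_forward(h0, steps=4, dt=1 / 4)  # theoretical dt = 1/N
```

Because `W` is reused at every step, the parameter count is that of one block regardless of the number of steps, which is the source of the reported model-size reductions.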
The author was funded by Osmo Labs PBC for Graph Neural Network methods development.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 25, + "total_chunks": 56, + "char_count": 901, + "word_count": 143, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b306d087-82c1-4e9c-895e-df736b1e4108", + "text": "Competing Interests and Consent for publication\nThe author declares that he has no competing interests. The author has read and agreed to the\npublished version of the manuscript. The author wants to thank Brian Kelley and Gregory Landrum for RDKit \"217\" descriptor c++\nimplementation support. 1.​ Poli M, Massaroli S, Park J, Yamashita A, Asama H, Park J. Graph Neural Ordinary\nDifferential Equations [Internet]. arXiv; 2019 [cited 2026 Feb 11]. Available from:\nhttps://arxiv.org/abs/1911.07532 doi:10.48550/ARXIV.1911.07532 2.​ Chen RTQ, Rubanova Y, Bettencourt J, Duvenaud D. Neural Ordinary Differential Equations\n[Internet]. arXiv; 2019 [cited 2026 Feb 16]. Available from: http://arxiv.org/abs/1806.07366 3.​ Sander ME, Ablin P, Peyré G.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 26, + "total_chunks": 56, + "char_count": 742, + "word_count": 104, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2c51ba0e-a4fe-44d4-9368-e3cb0052388d", + "text": "Do Residual Neural Networks discretize Neural Ordinary\nDifferential Equations? [Internet]. arXiv; 2022 [cited 2026 Feb 17]. 
Available from:\nhttp://arxiv.org/abs/2205.14612 doi:10.48550/arXiv.2205.14612 4.​Delaney JS. ESOL: Estimating Aqueous Solubility Directly from Molecular Structure. J\nChem Inf Comput Sci. 2004 May 1;44(3):1000–5. doi:10.1021/ci034243x 5.​ Karpathy A. nanoGPT: The simplest, fastest repository for training/finetuning medium-sized\nGPTs [Internet]. 2022. Available from: https://github.com/karpathy/nanoGPT 6.​Andrej Karpathy's NanoGPT MLX version [Internet]. Available from:\nhttps://github.com/shakedzy/nanogpt 7.​ He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition [Internet].\narXiv; 2015 [cited 2026 Feb 16]. Available from: http://arxiv.org/abs/1512.03385 8.​ Veličković P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y. Graph Attention Networks\n[Internet]. arXiv; 2017 [cited 2026 Feb 11]. Available from: https://arxiv.org/abs/1710.10903 9.​ Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE.", "paper_id": "2603.10544", "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", "authors": [ "Guillaume Godin" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10544v1", "chunk_index": 27, "total_chunks": 56, "char_count": 1032, "word_count": 125, "chunking_strategy": "semantic" }, { "chunk_id": "bfac077a-671e-47a7-9d2e-20d0b54dfcc4", "text": "Neural Message Passing for\nQuantum Chemistry [Internet]. arXiv; 2017 [cited 2025 Oct 6]. Available from:\nhttp://arxiv.org/abs/1704.01212 doi:10.48550/arXiv.1704.01212 10.​Heid E, Greenman KP, Chung Y, Li SC, Graff DE, Vermeire FH, et al. Chemprop: A Machine\nLearning Package for Chemical Property Prediction. J Chem Inf Model. 2024 Jan 8;64(1):1. 11.​ Yun S, Jeong M, Kim R, Kang J, Kim HJ. Graph Transformer Networks [Internet]. arXiv;\n2019 [cited 2025 May 3]. Available from: https://arxiv.org/abs/1911.06455 12.​Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. 
ALBERT: A Lite BERT for\nSelf-supervised Learning of Language Representations [Internet]. arXiv; 2020 [cited 2026\nFeb 19]. Available from: http://arxiv.org/abs/1909.11942 doi:10.48550/arXiv.1909.11942 13.​Dehghani M, Gouws S, Vinyals O, Uszkoreit J, Kaiser Ł. Universal Transformers [Internet].\narXiv; 2019 [cited 2026 Feb 19]. Available from: http://arxiv.org/abs/1807.03819 14.​Freinschlag R, Bertram T, Kobler E, Mayr A, Klambauer G.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 28, + "total_chunks": 56, + "char_count": 1007, + "word_count": 136, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "487c0d54-4224-4a58-8444-d9b768a93841", + "text": "Symbol-Equivariant Recurrent\nReasoning Models [Internet]. arXiv; 2026 [cited 2026 Mar 7]. Available from:\nhttp://arxiv.org/abs/2603.02193 doi:10.48550/arXiv.2603.02193 Study of Numerical solution of Ordinary Differential Equation by Taylor, Euler\nand Runge-Kutta methods.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 29, + "total_chunks": 56, + "char_count": 271, + "word_count": 29, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4caa8847-2df7-48ea-bf0d-305862e578ad", + "text": "Available from:\nhttps://api.semanticscholar.org/CorpusID:250273914 16.​Xiong Z, Wang D, Liu X, Zhong F, Wan X, Li X, et al. Pushing the Boundaries of Molecular\nRepresentation for Drug Discovery with the Graph Attention Mechanism. J Med Chem. 2020\nAug 27;63(16):8749–60. 
doi:10.1021/acs.jmedchem.9b00959 17.​Brody S, Alon U, Yahav E.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 30, + "total_chunks": 56, + "char_count": 332, + "word_count": 44, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "afe371c6-a52c-4d21-a559-606fff31b804", + "text": "How Attentive are Graph Attention Networks? [Internet]. arXiv;\n2022 [cited 2026 Mar 9]. Available from: http://arxiv.org/abs/2105.14491 18.​Tetko IV, Van Deursen R, Godin G.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 31, + "total_chunks": 56, + "char_count": 173, + "word_count": 23, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "29d39e9d-523a-4d1b-bc6a-c5ebed9a00bf", + "text": "Be aware of overfitting by hyperparameter optimization!\nJ Cheminformatics. 2024 Dec 9;16(1):139. doi:10.1186/s13321-024-00934-w 19.​Lundberg S, Lee SI. A Unified Approach to Interpreting Model Predictions [Internet]. arXiv;\n2017 [cited 2026 Mar 9]. Available from: http://arxiv.org/abs/1705.07874 20.​Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting\nwith categorical features [Internet]. arXiv; 2019 [cited 2026 Mar 9]. 
Available from:\nhttp://arxiv.org/abs/1706.09516 doi:10.48550/arXiv.1706.09516 21.​Chen T, Guestrin C.", "paper_id": "2603.10544", "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", "authors": [ "Guillaume Godin" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10544v1", "chunk_index": 32, "total_chunks": 56, "char_count": 559, "word_count": 65, "chunking_strategy": "semantic" }, { "chunk_id": "d37084f8-83d1-4e7e-a2da-89d000eaa63a", "text": "XGBoost: A Scalable Tree Boosting System. In: Proceedings of the\n22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining\n[Internet]. San Francisco California USA: ACM; 2016 [cited 2025 May 3]. p. 785–94. Available from: https://dl.acm.org/doi/10.1145/2939672.2939785 22.​Sheridan RP, Liaw A, Tudor M.", "paper_id": "2603.10544", "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", "authors": [ "Guillaume Godin" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10544v1", "chunk_index": 33, "total_chunks": 56, "char_count": 324, "word_count": 43, "chunking_strategy": "semantic" }, { "chunk_id": "a44edbb8-d03b-4027-8bb3-bcd8c7d4c803", "text": "Light Gradient Boosting Machine as a Regression Method\nfor Quantitative Structure-Activity Relationships [Internet]. arXiv; 2021 [cited 2025 May 3]. Available from: https://arxiv.org/abs/2105.08626 doi:10.48550/ARXIV.2105.08626 23.​Breiman L. Random Forests. Mach Learn. 2001 Oct 1;45(1):5–32. 
24.​Cortes C, Vapnik V.", "paper_id": "2603.10544", "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", "authors": [ "Guillaume Godin" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10544v1", "chunk_index": 34, "total_chunks": 56, "char_count": 286, "word_count": 32, "chunking_strategy": "semantic" }, { "chunk_id": "8b254c22-fc7b-449d-a12c-35c9376c1474", "text": "Support-vector networks. Mach Learn. 1995 Sep;20(3):273–97. 25.​Tibshirani R. Regression Shrinkage and Selection Via the Lasso. J R Stat Soc Ser B Stat\nMethodol. 1996 Jan 1;58(1):267–88. doi:10.1111/j.2517-6161.1996.tb02080.x 26.​Frankle J, Carbin M.", "paper_id": "2603.10544", "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", "authors": [ "Guillaume Godin" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10544v1", "chunk_index": 35, "total_chunks": 56, "char_count": 232, "word_count": 29, "chunking_strategy": "semantic" }, { "chunk_id": "968cb5c6-b036-478f-b6bd-9dcf5b92bd04", "text": "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural\nNetworks [Internet]. arXiv; 2019 [cited 2026 Feb 17]. Available from:\nhttp://arxiv.org/abs/1803.03635 doi:10.48550/arXiv.1803.03635 27.​All You Need Is Synthetic Task Augmentation [Internet]. arXiv; 2025 [cited 2026\nMar 7]. Available from: http://arxiv.org/abs/2505.10120 doi:10.48550/arXiv.2505.10120 28.​Leviathan Y, Kalman M, Matias Y. Prompt Repetition Improves Non-Reasoning LLMs\n[Internet]. arXiv; 2025 [cited 2026 Feb 19]. Available from: http://arxiv.org/abs/2512.14982 29.​Shirzadi M, Dehkordi AS, Zehmakan AN. Adaptive Initial Residual Connections for GNNs\nwith Theoretical Guarantees [Internet]. arXiv; 2025 [cited 2026 Feb 18]. 
Available from:\nhttp://arxiv.org/abs/2511.06598 doi:10.48550/arXiv.2511.06598 30.​Svenstrup D, Hansen JM, Winther O.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 36, + "total_chunks": 56, + "char_count": 816, + "word_count": 91, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2d889920-3798-4e64-b3d1-26e5895a0773", + "text": "Hash Embeddings for Efficient Word Representations\n[Internet]. arXiv; 2017 [cited 2026 Feb 19]. Available from: http://arxiv.org/abs/1709.03933 31.​Amsel N, Persson D, Musco C, Gower RM.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 37, + "total_chunks": 56, + "char_count": 186, + "word_count": 24, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "23ed1138-34fa-4b6a-9e0d-5b2824e52b8b", + "text": "The Polar Express: Optimal Matrix Sign\nMethods and Their Application to the Muon Algorithm [Internet]. arXiv; 2025 [cited 2026 Feb\n19]. Available from: http://arxiv.org/abs/2505.16932 doi:10.48550/arXiv.2505.16932 32.​Li Z, Liu L, Liang C, Chen W, Zhao T. NorMuon: Making Muon more efficient and scalable\n[Internet]. arXiv; 2025 [cited 2026 Feb 19]. Available from: http://arxiv.org/abs/2510.05491 33.​Zhang B, Sennrich R. Root Mean Square Layer Normalization [Internet]. arXiv; 2019 [cited\n2026 Feb 19]. Available from: http://arxiv.org/abs/1910.07467 After removing in all the models (including AttFP) the MolAttFP, we clearly see an average\nincrease of 0.03 RMSE compared to the models including MolAttFP. 
Again, skip05 (the 0.5 average\nEuler step) and the SCORE methods generally provide the best models. Only one model matches the\nCatBoost result, showing the very important contribution of MolAttFP to GNNs in general. Figure 6 : CV5 distribution comparing the 5 GNN options without MolAttFP. The lower the better; in green, the best option for this architecture.", "paper_id": "2603.10544", "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", "authors": [ "Guillaume Godin" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10544v1", "chunk_index": 38, "total_chunks": 56, "char_count": 1051, "word_count": 152, "chunking_strategy": "semantic" }, { "chunk_id": "a4b39908-4e9d-4290-9d5a-90362a228ae5", "text": "Table 3 : Ablation MolAttFP in GNN models. Rank Model Mean best val RMSE\n1 gcn_skip05 0.564±0.052\n2 SCORE_attfp_skip05 0.574±0.049\n3 SCORE_attfp 0.574±0.049\n4 SCORE_gat 0.574±0.049\n5 gat_skip05 0.578±0.057\n6 SCORE_gcn_skip05 0.579±0.048\n7 SCORE_gatv2 0.579±0.070\n8 dmpnn_skip05 0.580±0.056\n9 SCORE_gine 0.580±0.047\n10 gcn 0.580±0.031\n11 gatv2_skip05 0.583±0.040\n12 SCORE_gine_skip05 0.585±0.049\n13 SCORE_dmpnn 0.586±0.057 Ablation of SCORE-MLP Figure 7 : CV5 distribution comparing the 5 GNN options without SCORE-MLP. The lower the better; in green, the best option for this architecture.", "paper_id": "2603.10544", "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", "authors": [ "Guillaume Godin" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10544v1", "chunk_index": 39, "total_chunks": 56, "char_count": 586, "word_count": 80, "chunking_strategy": "semantic" }, { "chunk_id": "c05cfa7c-c551-4104-9c26-e33a84cfb5dd", "text": "Table 4 : Ablation SCORE-MLP in GNN models. 
Rank Model Mean best val RMSE\n1 dmpnn_skip05 0.538±0.042\n2 SCORE_dmpnn_skip05 0.541±0.056\n3 SCORE_gine 0.546±0.057\n4 SCORE_gat_skip05 0.551±0.044\n5 SCORE_dmpnn 0.555±0.056\n6 SCORE_gat 0.555±0.045\n7 SCORE_gine_skip05 0.556±0.042\n8 gcn_skip05 0.557±0.025\n9 gat_skip05 0.560±0.042\n10 SCORE_gcn 0.562±0.018\n11 gine_skip05 0.566±0.044\n12 SCORE_gatv2_skip05 0.566±0.033\n13 SCORE_mpnn 0.567±0.061 Ablation of MolAttFP and SCORE-MLP After removing both MolAttFP and SCORE-MLP, we focus on the true effect of SCORE-GNN on\nmodel performance. We obtained results very similar to the MolAttFP-only ablation in general. This\nmeans our SCORE-MLP head is validated and does not degrade performance. Figure 8 : CV5 distribution comparing the 5 GNN options without both SCORE-MLP and\nMolAttFP. The lower the better; in green, the best option for this architecture.", "paper_id": "2603.10544", "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", "authors": [ "Guillaume Godin" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10544v1", "chunk_index": 40, "total_chunks": 56, "char_count": 881, "word_count": 126, "chunking_strategy": "semantic" }, { "chunk_id": "8530cc26-1a35-481c-999e-4b1dfdaf6ba2", "text": "Table 5 : Ablation MolAttFP and SCORE-MLP in GNN models. Rank Model Mean best val RMSE\n1 gatv2_skip05 0.571±0.034\n2 SCORE_attfp 0.572±0.050\n3 SCORE_attfp_skip05 0.573±0.042\n4 SCORE_gatv2 0.575±0.068\n5 gcn_skip05 0.575±0.050\n6 SCORE_dmpnn 0.577±0.049\n7 dmpnn_skip05 0.577±0.055\n8 SCORE_gcn_skip05 0.578±0.050\n9 gcn 0.578±0.030\n10 SCORE_gine 0.579±0.044\n11 gat_skip05 0.580±0.055\n12 SCORE_gat 0.581±0.048\n13 SCORE_gat_skip05 0.586±0.049 Study of SCORE equation effect We run the models for 75 epochs to compare the 4 equations using the same Δt = 1/N for 4 steps.\nWe do not see a large difference relative to the added computational complexity, so we keep Euler for\nthe experiments. 
Figure 9 : SCORE variations using Euler Simplified family with RDkit features at 75 epochs Effect of using more Euler Simplified family (using Euler SCORE equation with 4 steps delta 1/4), without MolAttFP\noption. Figure 10 : SCORE variations using Euler Simplified family without RDkit features at 75 epochs Effect of using more Euler Simplified family (using Euler SCORE equation with 4 steps delta 1/4), without MolAttFP\noption.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 41, + "total_chunks": 56, + "char_count": 1098, + "word_count": 167, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "42b812b2-6ed0-432a-ac45-a36aa6ab0fa8", + "text": "2 to 7 convolution layers/steps In this trial, we investigate the RDKit additional effect using concatenation of 217 descriptors. We generally observed a faster convergence. This is particularly the case for Graph\nTransformer. Few methods generally outperform or reach similar performance without RDKit. This is the reason we did not include RDKit features in GNNs main studies.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 42, + "total_chunks": 56, + "char_count": 378, + "word_count": 57, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "97d33413-35cf-472e-a3b7-aab35d87c59a", + "text": "Figure 11 : Validation RMSE of first 100 epochs of SCORE-AttFP vs AttFP AttFP-MO corresponds to the native AttentiveFP model validation 5-CV per epoch (using Euler SCORE equation\nwith 4 steps delta 1/4). 
Figure 12 : Validation RMSE of first 100 epochs of SCORE-GT vs GT a validation 5-CV per epoch (using Euler SCORE equation with 4 steps delta 1/4). Figure 13 : Validation RMSE of first 100 epochs of SCORE-GAT vs GAT a validation 5-CV per epoch (using Euler SCORE equation with 4 steps delta 1/4), 100 epochs. Figure 14 : Validation RMSE of first 100 epochs of SCORE-GINE vs GINE a validation 5-CV per epoch (using Euler SCORE equation with 4 steps delta 1/4), 100 epochs. Figure 15 : Validation RMSE of first 100 epochs of SCORE-MPNN vs MPNN a validation 5-CV per epoch (using Euler SCORE equation with 4 steps delta 1/4), 100 epochs. Figure 16 : Validation RMSE of first 100 epochs of SCORE-DMPNN vs DMPNN a validation 5-CV per epoch (using Euler SCORE equation with 4 steps delta 1/4), 100 epochs. SCORE Acceleration versus Native validation During the experiments, I found that we can map the two learning curves between native and\nSCORE version initial methods using the time (epoch) - warp alignment of learning curves. The\ncurves have a similar trend with a speed rating difference.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 43, + "total_chunks": 56, + "char_count": 1291, + "word_count": 222, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "32293fb1-6d6a-4494-ad07-9afff39e5507", + "text": "This could be useful for two main\nreasons: find the ideal number of epochs of native methods and make hyperparameter\noptimization on the SCORE space. For almost all cases except for GT without RDKit, we can fit\nthe two curves using a compression factor on Native curve in order to compute the speed\nacceleration factor. Table 6 : Acceleration factor convergence of GNN. 
Speed with RDKit without RDKit\nAcceleration\nFactor AttFP (with 1.9 2.9\nMolAttFP) Factor acceleration of SCORE versus the Native version using RDKit or no, without MolAttFP except AttFP. Figure 17 : validation loss time warping between SCORE-GAT and GAT Example of time-warping fitting between the two methods by compressing the native validation curve fitting. This implies that we have a clear speed improvement without losing precision of the model via\nSCORE method. Interestingly it also provides knowledge of the capabilities and SCORE method\nversus the original method. Oversmoothing analyses Study of number of steps / layers in GAT N in [2,7] Figure 18 : Number of Steps for SCORE-GAT versus GAT without MolAttFP option, dt = 1/N Validation RMSE comparison, by changing number of steps (dt = 1/N) from 2 to 7 In order to understand the robustness of the method to oversmoothing, we run 6 model versions\nchanging the number of steps/layers in GAT structure.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 44, + "total_chunks": 56, + "char_count": 1333, + "word_count": 219, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "12c0be25-4a10-4c6e-a91c-ff4388089d5d", + "text": "I found the same acceleration and\nimprovement performances over the runs with a between 7.6% improvement between the\nOriginal GAT and the SCORE-GAT using the best average performance. This is confirming the\nefficiency of this SCORE method as well as its stability. 
Table 7 : SCORE-GAT vs GAT at best validation RMSE steps Δt SCORE Native Diff Improvement 2 0.5 0.598 0.644 +0.045 7.0% 3 0.334 0.595 0.646 +0.052 8.0% 4 0.25 0.595 0.638 +0.043 6.7% 5 0.2 0.595 0.647 +0.051 7.9% 6 0.167 0.590 0.641 +0.051 7.9% 7 0.143 0.594 0.646 +0.052 8.0% Average N/A 0.595 0.644 +0.049 7.6% 5-fold CV average of best Validation RMSE over 150 epochs no MolAttFP no SCORE-MLP Study of number of steps / layers in GAT + MolAttFP 2 to 7 When starting the experiments I did not use the MolAttFP layer in AttentiveFP to study only the\natom graph convolution effect. The results shown that the MolAttFP is essential to get better\nresults than DMPNN.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 45, + "total_chunks": 56, + "char_count": 929, + "word_count": 163, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "606dce39-7c3e-4f82-9bee-d5ed5031e14b", + "text": "So I decided to check what is the impact of MolAttFP part in top of native\nGAT called Native+ and for RODE called RODE+. Basically the combination provide an even\nbetter RMSE with again an 6.2% improvement versus the Native+ version. It is interesting to\nobserved that the +MolAttFP delivers the most stable results and reaches an 0.55 RMSE. 
Figure 19 : Number of Steps for SCORE-GAT versus GAT with MolAttFP option, dt = 1/N Validation RMSE comparison, by changing number of steps (dt = 1/N) from 2 to 7, SCORE-GAT + MolAttFP and GAT\n+ MolAttFP Table 7 : SCORE-GAT vs GAT with MolAttFP at best validation RMSE steps Δt SCORE+ Native+ Diff Improvement 2 0.5 0.559 0.618 +0.059 9.6% 3 0.334 0.566 0.598 +0.032 5.3% 4 0.25 0.545 0.610 +0.065 10.6% 5 0.2 0.564 0.610 +0.045 7.5% 6 0.167 0.564 0.602 +0.038 6.2% 7 0.143 0.558 0.606 +0.048 8.0% Average N/A 0.559 0.607 +0.048 7.9% 5-fold CV average of best Validation RMSE over 150 epochs with MolAttFP no SCORE-MLP", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 46, + "total_chunks": 56, + "char_count": 960, + "word_count": 171, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "106bf32d-ffa8-49ce-96fb-285c326644b0", + "text": "Study of number of steps / layers in GATv2 + MolAttFP 2 to 7 One well known alternative to GAT is GATv2, an enhanced version that is expensive to compute\nas the v2 version needs to compute a non linear attentive equation compared to the GAT initial\nversion. We did not see a significant difference between SCORE+GAT versus SCORE+GATv2\nor between native+ versions. So we do not need to use the GATv2 version to get the best 0.55\nRMSE performance. 
Table 8 : SCORE-GATv2 vs GATv2 with MolAttFP at best validation RMSE steps Δt SCORE+ Native+ Diff Improvement 2 0.5 0.562 0.593 +0.030 5.1% 3 0.334 0.558 0.597 +0.040 6.7% 4 0.25 0.569 0.608 +0.040 6.5% 5 0.2 0.561 0.589 +0.028 4.7% 6 0.167 0.559 0.602 +0.043 7.2% 7 0.143 0.561 0.599 +0.038 6.3% Average N/A 0.562 0.598 +0.036 6.1% 5-fold CV average of best Validation RMSE over 150 epochs Figure 20 : Number of Steps for SCORE-GATv2 versus GATv2 with MolAttFP option, dt = 1/N", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 47, + "total_chunks": 56, + "char_count": 924, + "word_count": 166, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c300ca2f-da6b-40fd-9623-7152a2c954f5", + "text": "Validation RMSE comparison, by changing number of steps (dt = 1/N) from 2 to 7, SCORE-GATv2 + MolAttFP and\nGATv2 + MolAttFP Study of number of steps / layers in DMPNN + MolAttFP 2 to 7 We decided to also run DMPNN with the additional MolAttFP trick. And the result was very\ngreat, showing that the native+ model is very robust and that the ODE+ version still get\nimprovement even if we are touching the limit of the data noise (see CatBoost results). 
Table 9 : SCORE-DMPNN vs DMPNN with MolAttFP at best validation RMSE steps Δt SCORE+ Native+ Diff Improvement 2 0.5 0.548 0.574 +0.026 4.6% 3 0.334 0.553 0.551 -0.002 -0.4% 4 0.25 0.543 0.562 +0.019 3.4% 5 0.2 0.557 0.560 +0.003 0.5% 6 0.167 0.564 0.560 -0.004 -0.6% 7 0.143 0.552 0.555 +0.004 0.7% Average N/A 0.553 0.561 +0.008 1.4% 5-fold CV average of best Validation RMSE over 150 epochs Figure 21 : Number of Steps for SCORE-DMPNN versus DMPNN with MolAttFP option, dt = 1/N", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 48, + "total_chunks": 56, + "char_count": 931, + "word_count": 169, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f8142a25-daf4-4987-bc6f-42f5b15a011d", + "text": "Validation RMSE comparison, by changing number of steps (dt = 1/N) from 2 to 7, SCORE-DMPNN + MolAttFP and\nDMPNN + MolAttFP Very recently a learnable skip connection for Graph Neural Network convolution called Adaptive\nInitial Residual Connection was proposed(29). I proposed to setup the learnable Δt [0.1,0.5]\nparameter using this equation Δt = 0.1 + 0.4 * σ(α) via the sigmoid function to constraint the\nsystem. 
The goal is to make the system make a dynamic gate to combine the current and\nprevious knowledge using equation 2.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 49, + "total_chunks": 56, + "char_count": 529, + "word_count": 88, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "81c494df-64f9-4baa-8d90-6abdbc63c378", + "text": "Current results did not show any benefits so far as I got\nvery similar results as the Δt preset by number of steps. I have tested the Muon Optimizer in order to compress the default nanoGPT version using\nsimple characters tokenizer on a tiny Shakespeare dataset. This task is more complex for the\nSCORE method than the Native nanoGPT variant at 11 M parameters for 3000 iterations.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 50, + "total_chunks": 56, + "char_count": 381, + "word_count": 67, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "60eef1d2-a470-4211-b30b-bcd8f4365046", + "text": "I\nobserved that SCORE needed a very small dropout and preferred Adam to AdamW as we need\nto leverage all the weights. Muon allowed the convergence better than Adam on a m_SCORE\nversion 3.6M parameter, consisting of 2 stacking SCORE blocks (M = 2, steps = 3). The model\ntraining takes the same time to reach 1.57 val loss, almost the same as the original nanoGPT\n11M with AdamW at 1.56 val loss. This is really nice to see that the Muon has this capability to leverage all the weights of the matrix instead of the sparse idea used by default in LLM. 
Basically this shows the fact that the Optimizer is essential for SCORE SLM models.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 51, + "total_chunks": 56, + "char_count": 632, + "word_count": 117, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "53aa03fb-16dc-4516-8597-35c75a54c954", + "text": "The\nSCORE method (steps = 6) with 1.8M parameters also reaches 1.56 val loss. In table 10, we\ngenerate example sentences based on training with Muon optimizer of 10.7 M, 3.6 M and 1.8 M\nmodels while the original nanoGPT reaches 1.56 with AdamW optimizer, it only achieves 1.6\nwith Muon (with dropout 0.2 or 0.01) with learning rate of 3e-4. Table 10 : SCORE, mSCORE and nanoGPT generator examples nanoGPT : 10.7 M, 6 layers mSCORE : 3.6 M 2 x 3 steps SCORE : 1.8 M , 6 steps\nval loss 1.57 val loss 1.56 val loss 1.56 Is Head you such again; for that Is that lady for her have set a detter, I'll askly the die.\never grace! Shepherd:\nWherein we wear you have gover SICINIUS: Let thee grace with and what I do\nfather.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 52, + "total_chunks": 56, + "char_count": 714, + "word_count": 139, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a9fdccd0-ed9c-447b-a57c-cca20148b039", + "text": "The fly sovereignce what thou canst not late? Is Lord the loss, where isparish thy\nMy conscient, gopinion to and Secrvivant; richzance\nhopece Boundstand there, not he nobld! 
In mades with griefs have can\nInsued bewing grions himself LEONTES: pardon'd:\nspardon'd: Yet ask will'd to do, To will you with a good consul, like\nTo will you will be more. Marcius, Signolous I have slel: love,\nwhich, and we curse at I fear the way arreate they do good And because and any they do\npartieng he manague in' hands by grief, power are mine.\nthe swom: The abjoure sweetss; know when I LEONTES:\nAnd before law with such rems; her will stay re abundred What, fellow, what were strive, and\nand their her Int clean my hand love a bear they true?\npraye love a bearer hein could did A clown:\nstory-merris; These such revengealess of any And he love a be? And he mans giver the humornt tritle over CAPULET:\nanother! Tyruesther's friends ame thank a me IsABELLA:\nThis the farful king's tearsure for his do them; Ratueous man'st torment him, it am\nloving. The pudge in shrow!--O contemn slaved to the life it know;\nJULIET: like throw his an; She am sade he light, being and\nWith thy contrary, may armshal, and Test mude what of noble sir:\nneight didined, Show do many of requests", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 53, + "total_chunks": 56, + "char_count": 1257, + "word_count": 228, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0e3b529c-2cd2-471a-894f-dcc72a0fed9e", + "text": "By adding Bigram hash embedding(30), NorMuon with Polar express(31), and replacing\nLayerNorm by RMSnorm, we can reach with 12 steps 1.53 and best 1.51. Bigram is really\naccelerating a lot in the early training steps while NorMuon(32) and RMSNorm(33) are fine\ntuned versions to be more precise and have less parameters. 
Here is a final prompt generator\nscore model with 1.8 M with 12 steps train with 90/10 or 10/90 train-validation split (the\nvalidation loss diverges a lot in this case as we do use only 10% of data for training and we\ncannot compare with the 1.51 validation loss anymore). But we can see that the Generated\ntexts have both the speakspare style in example from table 11. Table 11 : SCORE generator examples train with 10/90 or 90/10 splits Generated text with 10/90 train-val split Generated text with 90/10 train-val split best val loss best val loss 1.51 ICINIUS: I'll ask the fool.", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 54, + "total_chunks": 56, + "char_count": 902, + "word_count": 155, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "01d2781b-b0b8-4f3c-be8d-e964d5b482aa", + "text": "May from thence: LUCIO:\nLest have you to cry him: Let me to her be you had\nIndeed for your cosuntry. May the firew's grace less! all she shall set himpance. BRUTUS: Shall; how, sir, my king dear love hath can\n'Tis most like he words his house wints, Savance you to knee to do it. As he would shall be form all your op? CAPULET:\nCORIOLANUS: Oh, sir, go and with way all and the ground\nWhat then shall I bloods I dishonours hip To reap him was foreful nose; kneel not we creature\nThat wouldst do thee with work me invoits. From us ready that there is follow a bed? SICINIUS: Came couldst me stay. The crose of crue at at dat the no midnifit MAMILLIUS:\nAg the breation walls, but I amoure ompet O grace!\nThat thou speak'd it follow. And with mostre!\nFirst Senator: BUCKINGHAM:\nNo, Caius Caius Marcius coming to Marcius. Her sake, I do think me to look as to-morrow\nAll: Asside thou deadly by arms? 
MARCIUS: To thy of\nThe deare i' the pears uside the f", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 55, + "total_chunks": 56, + "char_count": 948, + "word_count": 184, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "db907757-3ad5-4a4b-96e1-921921f1cf3f", + "text": "Table 12 : Advanced Autosearch on M3 Max 128 GB (5 min is representing the true constraint\nhardware constraint) Config using stable NorMuon val_bpb Params Trials d4 + skip05 (yours) 1.2594 22M 1 manual d4 no skip05 (yours) 1.2621 22M 1 manual SCORE 2-recursive (yours) 1.2731 18.4M 1 manual PR#4 d4 (BL3IP) 1.2809 22M 110 via autosearch", + "paper_id": "2603.10544", + "title": "SCORE: Replacing Layer Stacking with Contractive Recurrent Depth", + "authors": [ + "Guillaume Godin" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10544v1", + "chunk_index": 56, + "total_chunks": 56, + "char_count": 336, + "word_count": 57, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10545_semantic.json b/data/chunks/2603.10545_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..428af4e70da8a76c19b1a78b5f49bd896e76115c --- /dev/null +++ b/data/chunks/2603.10545_semantic.json @@ -0,0 +1,522 @@ +[ + { + "chunk_id": "b8418b44-7209-40fb-b02b-ec8ec278cc2a", + "text": "Learning to Score: Tuning Cluster Schedulers\nthrough Reinforcement Learning Martin Asenov Qiwen Deng Gingfung Yeung\nEdinburgh Research Centre Edinburgh Research Centre Edinburgh Research Centre\nCentral Software Institute, Huawei Central Software Institute, Huawei Central Software Institute, Huawei\n0000-0003-4610-3112 0009-0005-3663-0914 0000-0002-3845-0686 Adam Barker\nEdinburgh 
Research Centre\nCentral Software Institute, Huawei\nSchool of Computer Science\nUniversity of St Andrews\n0000-0002-0463-72072026", + "paper_id": "2603.10545", + "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning", + "authors": [ + "Martin Asenov", + "Qiwen Deng", + "Gingfung Yeung", + "Adam Barker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10545v1", + "chunk_index": 0, + "total_chunks": 26, + "char_count": 507, + "word_count": 57, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "329eb4eb-a7c8-40a7-bbcb-0181b6752032", + "text": "Abstract—Efficiently allocating incoming jobs to nodes in and Borg [5], [6] employ a two-step approach for assigningMar large-scale clusters can lead to substantial improvements in both pods to nodes [7], which is illustrated in Figure 1.\ncluster utilization and job performance. In order to allocate The first step involves selecting feasible nodes for every pod\nincoming jobs, cluster schedulers usually rely on a set of scoring11 through a set of filtering functions, which are hard constraints functions to rank feasible nodes. 
Results from individual scoring\nfunctions are usually weighted equally, which could lead to sub- such as node resource capacity checks (CPU, memory, GPU)\noptimal deployments as the one-size-fits-all solution does not take and network topology requirements, e.g., if the pod requires\ninto account the characteristics of each workload.", + "paper_id": "2603.10545", + "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning", + "authors": [ + "Martin Asenov", + "Qiwen Deng", + "Gingfung Yeung", + "Adam Barker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10545v1", + "chunk_index": 1, + "total_chunks": 26, + "char_count": 865, + "word_count": 129, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "abcfc18e-6c7d-4495-8b2d-e0b6629703cb", + "text": "Tuning the being in a specific region [4], [8]. The second step involves\nweights of scoring functions, however, requires expert knowledge calculating scores for all feasible nodes using scoring funcand is computationally expensive.\ntions [4], [9], [10]. A final score is computed by summing up This paper proposes a reinforcement learning approach for[cs.LG] learning the weights in scheduler scoring algorithms with the the individual scores and the pod is allocated to the node with\noverall objective of improving the end-to-end performance of jobs the highest normalized score.\nfor a given cluster. Our approach is based on percentage improve- Despite having to schedule different types of workloads with\nment reward, frame-stacking, and limiting domain information. different optimization targets, scheduler scoring functions are\nWe propose a percentage improvement reward to address the\ntypically weighted equally. Specific clusters can be config- objective of multi-step parameter tuning. 
The inclusion of framestacking allows for carrying information across an optimization ured to assign different weights to prioritize certain scoring\nexperiment. Limiting domain information prevents overfitting functions over others, e.g., prioritizing tighter bin packing on\nand improves performance in unseen clusters and workloads. the cluster. This process is, however, manual and requires\nThe policy is trained on different combinations of workloads and knowledge of the specifics of the typical workloads, cluster\ncluster setups. We demonstrate the proposed approach improves\nconfiguration, and expert know-how [11]. performance on average by 33% compared to fixed weights and\n12% compared to the best-performing baseline in a lab-based Black box optimization approaches such as random search,\nserverless scenario. or Bayesian Optimization [11], [12] can be adopted. However,\nIndex Terms—scheduling, scoring functions, reinforcement tuning the weights of scoring functions is particularly difficult\nlearning, tuning due to the computational cost of evaluating a new configuration. 
Additional challenges include the high dimensionality of I.", + "paper_id": "2603.10545", + "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning", + "authors": [ + "Martin Asenov", + "Qiwen Deng", + "Gingfung Yeung", + "Adam Barker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10545v1", + "chunk_index": 2, + "total_chunks": 26, + "char_count": 2140, + "word_count": 299, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4ba870dc-6f95-4fc4-841b-a22342a429fe", + "text": "INTRODUCTIONarXiv:2603.10545v1 workload-cluster specifications, the large number of scoring\nCluster orchestration systems like Kubernetes [1] are defunctions to be tuned and generalization to unseen configurasigned to run multiple workload types, including user-facing\ntions.\nservices, batch-processing tasks, and machine learning apIn this paper, we propose a reinforcement learning approach\nplications. They must carefully balance a set of competing\nto automate tuning the weights of the scoring functions to difrequirements, such as ensuring high utilization at the cluster\nferent workloads and cluster configurations. With the proposed\nlevel whilst maintaining the quality of service for the underapproach, we are able to learn stronger bias for the weights\nlying applications [2], [3].\nsampling strategy compared to standard heuristics-based apOne of the key tasks for the scheduler, in order to meet\nproaches. This allows us to use existing infrastructure for job\nthese requirements, is to schedule jobs (or pods in the case of\nscheduling while dynamically tuning the system depending on\nKubernetes) to nodes in the cluster. Modern cluster orchestrathe type of workload and cluster configuration.\ntion systems such as Kubernetes [1], Azure VM Allocator [4]\nOur reinforcement learning approach is based on three main\nCorrespondence email: sirlab@huawei.com ideas. 
First, we formulate multi-step parameter tuning as a", + "paper_id": "2603.10545", + "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning", + "authors": [ + "Martin Asenov", + "Qiwen Deng", + "Gingfung Yeung", + "Adam Barker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10545v1", + "chunk_index": 3, + "total_chunks": 26, + "char_count": 1421, + "word_count": 201, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e3846ae3-bf6e-489d-b9f0-b112ac85b073", + "text": "cluster cluster\nnode 0 node 1 node 2 scoring node 0 node 1 node 2\nw1 MostAllocated ( pod 1 ; nodenode 14 ) 9 3 pod 1 pod 1 5 ImageLocality ( ; ... nodenode 14 ) node 3 node 4 node 5 allocation node filtering node 3 node 4 node 5 w2\npodpodpod123 node 4 nodenode 14 Scheduler Capability ( pod 1 ; ) w3\nPodFitsResources 3 7\nincoming NoDiskConflictCheckNodeCondition node 6 node 7 node 8 ₊ node 6 node 7 node 8\npod 1 ; nodenode 14 ) pods wk ResourceBalance (\nnode 9 node 10 node 11 overall score for ( pod 1 ; node 4 ): 9 node 9 node 10 node 11 Fig. 1: Filtering and scoring steps in a job scheduler. Assigning pods to nodes in a cluster job scheduler is typically a\ntwo-step process of filtering feasible nodes, followed by scoring functions. In this work, we focus on optimizing the relative\nweighting (w1, w2, w3, ... ,wk) of the different scoring functions in different cluster and workload scenarios, with the goal of\noptimizing a given metric. reinforcement learning problem through the use of methods onto a set of defined scores to allow for more fine-grained\nlike frame stacking and techniques for balancing exploration- control. In Kubernetes, this is referred to as requested-toexploitation like entropy regularization. 
capacity-ratio [13]. Second, we propose",
    "paper_id": "2603.10545",
    "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning",
    "authors": [
      "Martin Asenov",
      "Qiwen Deng",
      "Gingfung Yeung",
      "Adam Barker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10545v1",
    "chunk_index": 4,
    "total_chunks": 26,
    "char_count": 1266,
    "word_count": 228,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d5a9a069-9563-469a-9fae-71fbd0830f46",
    "text": "using a percentage improvement reward as the optimization target to encourage exploration. Third, we implement a simple technique to prevent overfitting and improve generalization by limiting domain information. We implement these features in a framework leveraging state-of-the-art reinforcement learning models, with the option to easily add new multi-step optimization problems.
The presented approach in this paper is general, but our workloads focus primarily on serverless applications in the context of a Function as a Service (FaaS) environment, and our cluster configurations consist of heterogeneous devices, ranging from powerful cloud CPU and cloud GPU machines to less powerful edge devices, which can be highly distributed.
This paper makes the following key contributions:
• Formulation of multi-step parameter tuning of the weights of scoring functions as a reinforcement learning problem.
• A reinforcement learning approach based on a percentage improvement reward, frame stacking, and limiting domain information.
• Extensive evaluation on tuning the weights of scoring functions in a FaaS system, improving performance by 33% over constant weights and by 12% over the best-performing optimization baseline.
The remainder of this paper is structured as follows: Section II discusses related work. Section III describes our tuning approach.
To implement specific preferences for allocation between nodes and pods, affinity and taint scoring functions are often implemented that attract or repel a pod to a specific node [14]. To effectively manage nodes spread around the cluster, based on regions, zones, etc., topology scoring functions are implemented [4], [5], [15]. With the introduction of new scoring functions, coupled with the increasing number of different workloads, weighting the different scoring functions becomes an increasingly important problem.
B. Optimizing weights of scoring functions
Different scheduling objectives are desirable depending on the workload and cluster configurations. For example, in deep learning scenarios, we might want to pack pods on co-located nodes within the same cluster in order to achieve reduced network latency and higher throughput. Similarly, MapReduce tasks read data from multiple machines and have high network requirements [16]. On the contrary, for critical online user-facing services, we might want to spread out pods to increase redundancy against single-cluster failure. Regardless of the high-level objective pursued, the choice of scoring functions' weights is often non-trivial. For example, within Kubernetes, efficient packing can be achieved with both the MostAllocated and RequestedToCapacityRatio (RTCRatio) strategies. Moreover, scheduling with the goal of packing can cause interference for network and disk resources [17]. Therefore, it is useful to carefully",
    "paper_id": "2603.10545",
    "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning",
    "authors": [
      "Martin Asenov",
      "Qiwen Deng",
      "Gingfung Yeung",
      "Adam Barker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10545v1",
    "chunk_index": 5,
    "total_chunks": 26,
    "char_count": 2853,
    "word_count": 406,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b05259d4-2034-42b4-b352-12ea950878e4",
    "text": "balance the trade-off between pack and spread.
Section IV presents the system implementation and evaluation of the effectiveness of the proposed approach in a heterogeneous FaaS system. Section V contains conclusions and future work.
II. RELATED WORK
A. Scheduling and scoring functions
Many cluster orchestration systems employ a two-step approach of filtering and scoring for scheduling, as referenced in Figure 1. Scoring functions implement different objectives for pod-to-node allocation. For example, most-allocated and least-allocated scoring functions in Kubernetes aim for tighter packing and spreading of workloads, respectively. Alternatively, schedulers can define mathematical functions, such as the piece-wise linear function that maps utilization values
[Figure 2 diagram: a FaaS benchmark with heterogeneous nodes and functions, a gym wrapper with parallel environments, and an RL policy that exchanges the weights of the scoring functions w1, w2, ..., wk, observed metrics, and a percentage improvement reward with the scheduler.]
Fig. 2: Reinforcement learning for tuning weights of scoring functions. We pose the optimization of weights of scoring functions as a parameter tuning problem and propose a reinforcement learning based solution. In this work, we propose using a percentage improvement reward (4), encoding past samples information through the use of frame stacking or recurrent policies (5), and limiting domain information to prevent overfitting (6). We develop an extensive gym wrapper (2), including the option for parallel environments (7), and demonstrate the capability of our approach in an example FaaS benchmark scenario (1).
Automatically tuning the weights of the individual scoring functions is desirable to optimize for a specific target, such as application performance or bin packing of pods. This can improve the targeted metric, e.g. reducing function execution time or network traffic [18]. The optimized weights show not just a binary selection of important/insignificant scoring functions but significant relative differences; e.g., in a homogeneous cloud scenario, image locality is more important than data locality [18].
on heuristics with an end-to-end reinforcement learning agent could lead to an impressive gain in performance; another aspect to consider is the safety of a reinforcement learning agent deployed in production [30].
C. Blackbox optimization methods
Many blackbox optimization algorithms for parameter tuning have been proposed, from simple methods like grid and random search to more sophisticated methods that impose a different bias on the type of optimization problem that is subsequently exploited. Some of the more traditional methods include genetic algorithms (GA) [19], where, based on an initial population, crossover and mutation are repeated until",
    "paper_id": "2603.10545",
    "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning",
    "authors": [
      "Martin Asenov",
      "Qiwen Deng",
      "Gingfung Yeung",
      "Adam Barker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10545v1",
    "chunk_index": 6,
    "total_chunks": 26,
    "char_count": 3015,
    "word_count": 451,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "69c310eb-0b68-4202-a59e-6e00c429f88b",
    "text": "While GA is easy to implement, it tends to scale poorly for high-dimensional problems and has no convergence guarantees. A surrogate model over the target metric could be introduced and optimized, commonly in the form of Gaussian Processes and Bayesian Optimization, respectively [20]. Bayesian optimization has been applied to many aspects of systems optimization, from tuning database systems [12] to job collocation [11]. Alternatively, optimization can be structured over the domain of parameters of interest, split below and above a certain threshold, commonly in the form of Tree-of-Parzen-estimators [21]. While those approaches impose useful biases via different kernel functions, or sampling strategies via generative models of the domain variables, they tend not to take full advantage of the domain information of the targeted optimization problem. Tuning weights of scoring functions is particularly challenging due to the following:
• Computational cost of evaluating a new configuration
• High dimensionality of the workload-cluster specification
• High dimensionality of the weights of the scoring functions
As such, traditional methods are ill-suited, as it would take an unreasonable amount of time to converge on a desirable solution.
Reinforcement learning can instead be used in combination with existing infrastructure by tuning parameters of already deployed algorithms, such as in database systems [31], [32]. In this work, we take the latter approach and focus on tuning weights of scoring functions within job schedulers.
III. APPROACH
In this work, we focus on tuning scoring functions' weights in job schedulers in FaaS. We pose multi-step parameter tuning as a reinforcement learning problem, where we aim to achieve better sampling efficiency by learning a stronger bias from past experience. We develop a software framework for parameter tuning based on state-of-the-art reinforcement learning approaches and perform experiments with an example FaaS system.
As seen in Figure 2, the tuning approach comprises three main components: the FaaS Benchmark (1), the Gym wrapper (2), and the Reinforcement learning (RL) agent (3). The FaaS benchmark encapsulates the underlying FaaS platform for function executions and emits metrics on how they perform. The Gym wrapper allows the FaaS benchmark to be represented as an interactive environment that takes actions and emits observations, similar to OpenNetLab [33].",
    "paper_id": "2603.10545",
    "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning",
    "authors": [
      "Martin Asenov",
      "Qiwen Deng",
      "Gingfung Yeung",
      "Adam Barker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10545v1",
    "chunk_index": 8,
    "total_chunks": 26,
    "char_count": 2416,
    "word_count": 363,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2cc4f9ee-1295-4ebd-b92b-709002492db6",
    "text": "The agent is responsible for interacting with the FaaS environment and assigning the appropriate weights to the available scoring functions.
Reinforcement learning has been presented as a viable alternative for parameter tuning in the context of cluster management frameworks [22], while also showing that it can outperform conventional approaches [23], [24]. In this work, we expand on those approaches and address specific challenges in the context of tuning weights of scoring functions.
D. Reinforcement learning for scheduling
Reinforcement learning is defined as an optimization problem, where an agent interacts with an environment through a set of actions that change its state, with the goal of optimizing a reward [25]. Reinforcement learning has been applied to many diverse domains, from robot control [26] to tuning large language models based on human preferences [27]. Similarly, there has been an increasing interest from the cloud community as a viable alternative to traditional scheduling algorithms [28], [29]. While substituting decision-making based
A. Deep Reinforcement Learning Agent
Reinforcement Learning is a machine learning approach where an agent learns how to make decisions that will lead to optimal outcomes over time. A reinforcement learning problem is defined by: a state space, a representation of the environment at any given time; an action space, a set of all possible actions the agent can take; and a reward function, a set of all possible rewards the agent can receive from the environment. Reinforcement learning typically uses one of two approaches: value-based or policy-based. In the value-based approach, the agent learns to estimate the value of each state-action pair and selects actions that maximize this value. In the policy-based approach, the agent learns a policy directly, without explicitly estimating the value of state-action pairs. There is also a hybrid approach called actor-critic, which combines elements of both value-based and policy-based methods. Deep reinforcement learning refers to approaches using neural networks to represent the policy or value functions, respectively.
reinforcement learning, this can be achieved through simple approaches, such as adding a percentage for random access, or more sophisticated approaches, such as adding entropy regularization [34].",
    "paper_id": "2603.10545",
    "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning",
    "authors": [
      "Martin Asenov",
      "Qiwen Deng",
      "Gingfung Yeung",
      "Adam Barker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10545v1",
    "chunk_index": 9,
    "total_chunks": 26,
    "char_count": 2365,
    "word_count": 353,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2ca0ed6a-a034-4e8d-959f-508d20ce196f",
    "text": "In this work, we leverage state-of-the-art approaches, like soft actor-critic (SAC), which evolved from max-entropy reinforcement learning and the actor-critic; the key idea behind it is to not just maximize cumulative rewards but also make the policy more random [34]; and RecurrentPPO, where PPO [35] incorporating recurrent neural networks (RNN, LSTM, GRU) enables the agent to handle partially observable environments better [36]. We present the training parameters in Section IV-B. We formulate multi-step parameter tuning as a reinforcement learning problem. We then use this formulation to address the problem of tuning weights of scoring functions within job schedulers, with the following definition:
• State:
– Static: cluster and workload information, such as number and types of machines, workload type, etc.
TABLE I: Skippy Scheduler Scoring Functions. A total of eight scoring functions are used as part of scheduling. For the fixed-weights baseline, and the initial weights selection for optimization algorithms, all weights are set to 1, except LeastAllocated and RTCRatio, which are set to 0.
Scoring Func. | Description
LeastAllocated | Favors nodes with the lowest utilization
MostAllocated | Favors nodes with the highest utilization
RTCRatio | Piecewise linear function of utilization
LocalityType | Tag for the type of machine, e.g. edge vs cloud
DataLocality | Estimated time to download necessary data
Capability | Tag for capability of the machine, e.g. GPU",
    "paper_id": "2603.10545",
    "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning",
    "authors": [
      "Martin Asenov",
      "Qiwen Deng",
      "Gingfung Yeung",
      "Adam Barker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10545v1",
    "chunk_index": 10,
    "total_chunks": 26,
    "char_count": 1457,
    "word_count": 211,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "33af19cd-0a70-47e7-b411-8538d0d041ee",
    "text": "BalancedResource | Favors nodes with the least stddev. across resources
LatencyAware |
ImageLocality | Estimated time to download the container image
– Dynamic: encodings of the action-reward pairs of explored actions so far.
• Action: weights of the scoring functions.
• Reward: improvement over a defined metric (percentage improvement reward).
1) Percentage Improvement Reward Function: To encapsulate multi-step parameter tuning as a reinforcement learning objective, we propose using a percentage improvement
3) Limiting Domain Information: Generalization to unseen environments is another important property for tuning scoring functions' weights. It is desirable that once an algorithm is trained, it is able to perform well within scenarios different from the original training domain.
Reinforcement learning is known for exploiting the training environment by often finding unintended shortcuts for achieving high rewards [38].",
    "paper_id": "2603.10545",
    "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning",
    "authors": [
      "Martin Asenov",
      "Qiwen Deng",
      "Gingfung Yeung",
      "Adam Barker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10545v1",
    "chunk_index": 11,
    "total_chunks": 26,
    "char_count": 966,
    "word_count": 135,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "856cea7f-23f3-4291-b152-cb0319ec2767",
    "text": "To mitigate the problem, we propose a simple technique of limiting static domain information to prevent overfitting and achieve better generalization (6). This can range from including only part of the known domain information to excluding it all. This allows the algorithm to learn a good general policy for exploration and exploitation while preventing overfitting. In our experiments, we opt for the second option of limiting domain information, e.g. including only coarse description variables for workload and cluster information.
The percentage improvement reward (4) is defined as follows:
r_i = (max(r_1, r_2, ..., r_n) - r_0) / r_0 if i = n, and r_i = 0 otherwise, (1)
where n is the number of allowed samples per experiment. This is motivated by an exploration objective, as we want the maximum percentage of improvement over a default action (e.g. the same weights for all scoring functions) in one of the chosen actions across the experiment. The proposed reward has the benefit of normalization across experiments, as it is agnostic to the initial metric value from the initial action but instead optimizes the rate of improvement.
2) Multi-step Parameter Tuning: The state space of the reinforcement learning agent should encapsulate information about the environment. In the context of tuning scoring functions' weights, this includes static variables such as the cluster setup and workload characteristics, but also dynamic information about the experiments performed so far - pairs of explored weights and the corresponding reward. To encode the action-reward pair information in addition to the static characteristics, we consider two approaches (5). The first is to present the information explicitly using frame stacking [37], where the number of stacks is equal to the maximum number of samples to be acquired. A second alternative is to instead use a recurrent policy, such that the information is encoded within the hidden state of the network. Balancing exploration-exploitation in a systematic way is a desirable property of any parameter-tuning algorithm. Within
B. Gym Wrapper
The above-described contributions are implemented as a software framework for parameter tuning using reinforcement learning algorithms by developing a general environment wrapper (2). An environment is defined by the following spaces:
• Static: static parameters throughout an experiment.
• Domain train: parameters during training.
• Domain test: parameters for extrapolation experiments.
• Initial action: action taken at the beginning of an experiment.
• Reward: optimization metric of choice.
• Actions: parameters to be optimized.
Each space described above consists of one or multiple [variable name, min, max] entries used for normalization purposes. The spaces, in addition to other options (e.g. hiding part of the",
    "paper_id": "2603.10545",
    "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning",
    "authors": [
      "Martin Asenov",
      "Qiwen Deng",
      "Gingfung Yeung",
      "Adam Barker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10545v1",
    "chunk_index": 12,
    "total_chunks": 26,
    "char_count": 2714,
    "word_count": 409,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "91e2fa5c-e49d-49a9-9e31-892523c9944c",
    "text": "static variables, adding noise, etc.), are then used to construct the environment.
IV. IMPLEMENTATION AND EVALUATION
To evaluate the proposed approach, we perform large-scale experiments with a high-fidelity simulator, faas-sim [18], that primarily targets the FaaS platform (1) on a set of different workloads (10) and cluster setups (9). It captures both network topology delays, implemented with Ether [39], and heterogeneous hardware execution times, making it suitable to evaluate scheduler placement performance in larger cluster setups [18]. Skippy is the implemented scheduling system [18] (8), containing a set of scoring functions described in detail below.
• Hybrid_balanced: configuration consists of a similar number across devices, except for NVIDIA Xavier NX and TX2.
• Hybrid_balanced_jetson: similar setup to hybrid_balanced, with a higher number of NVIDIA Nano.
We use two network topologies, as shown in Figure 4.
• All connected, internet topology: where individual devices have uniform bandwidth, ensuring fast access connectivity.
• Limited, urban topology: where the network is layered and has bandwidth limitations to simulate delays in connectivity.
For more extensive evaluation, we allow a disconnect between compute units and their usual network topology, i.e. we don't assume that a cloud_cpu cluster configuration would necessarily have an all-connected topology. Instead, we treat cluster configurations and topology as separate factors across experiments.
A. Experimental setup
a) Scoring Functions: The Skippy scheduling system comes with a default scheduler. We extend the default scoring functions of Skippy with MostAllocated, LeastAllocated and RequestedToCapacityRatio, inspired by their equivalents in
c) Workload: We use a random combination of up to 8 different functions to form a workload.",
    "paper_id": "2603.10545",
    "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning",
    "authors": [
      "Martin Asenov",
      "Qiwen Deng",
      "Gingfung Yeung",
      "Adam Barker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10545v1",
    "chunk_index": 13,
    "total_chunks": 26,
    "char_count": 1822,
    "word_count": 263,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "adaaf983-ff6a-4c03-9e91-a6ec36096ed7",
    "text": "Kubernetes, as shown in Table I.
Each function follows a Poisson distribution with a constant rate of arrival. We evaluate the performance of each experiment by three metrics weighted equally - mean function execution time, mean function queue time, and the number of successful requests executed within a specified time window. Each benchmark run lasts for 100 seconds, excluding the time for the initial allocation of function pods.
TABLE II: Cluster and workload configuration across experiments. To evaluate how well the proposed approach generalizes to novel scenarios, we test the trained agent on novel scenarios with different cluster setups, novel workload functions, and different scheduling options.
Configuration | Train environment | Test environment
Cluster setups | 3 | 8
Workload functions | 5 | 8
Requests per second | 10 | 5-30
% of nodes to score | 100 | 10-100
Min # nodes per func | 1 | 1-10
Max # nodes per func | 100 | 50-100
Scale factor | 1 | 1-5
# of nodes | 30-180 | 200-400
d) Optimization metric: To quantify the performance of different workloads across approaches, we use the following set of three metrics:
• mean function execution time: µfet
• mean function queue time: µwait
• number of successful requests executed: Nsuccess/Ntotal
We define the overall score for optimization and evaluation as:
score(workload) = Σ_{f ∈ workload} avg(µfet(f) + µwait(f) + Nsuccess/Ntotal) (2)
We normalize the three metrics between 0 and 1, so the final score is also normalized.
b) Cluster Setup: Each cluster setup consists of a variety of heterogeneous hardware and network topologies. We use a total of 8 different cluster setups, as defined in Figure 3.
• Cloud_CPU: configuration mainly consists of Xeon CPUs, with 71% of total devices.
• Cloud_GPU: configuration mainly consists of Xeon GPUs, with 70% of total devices.
• Edge_Cloudlet: configuration consists of a higher number of Intel NUC (mini desktop with dedicated GPU), a medium number of Raspberry PI (RPI) 3 and 4, and a lower number of NVIDIA Nano.
• Edge_GPU: configuration mainly consists of NVIDIA Nano, and a low number of Xeon GPU.
• Edge_SBC: configuration mainly consists of RPI 3 and 4, and without any NVIDIA TX2 devices.
• Edge_TPU: configuration consists of a higher number of Coral DevBoard and NVIDIA Nano.
B. Training
For training the reinforcement learning agent, we use SAC [34] with frame stacking to account for multi-parameter tuning. Due to the explicit entropy regularization, we find that SAC achieves more robust exploration and tends not to get stuck in premature local minima during optimization. We use stable baselines [40] with a 512x512x512 MLP network, with ReLU activations, for both the Q-network and the policy network. We normalize the state and action spaces and train with multiple environments in parallel. To evaluate how well the proposed method generalizes, we use just 3 out of the 8 cluster setups for training - cloud_cpu, cloud_gpu and edge_cloudlet. Different workload is generated
[Figure 3 bar chart: number of devices per type (RPI, NVIDIA TX2, NVIDIA Nano, Intel NUC, NVIDIA Xavier NX, RockPi, Xeon CPU, Xeon GPU, Coral DevBoard) for each of the eight cluster configurations.]
Fig. 3: Different heterogeneous cluster configurations used for training and evaluation.",
    "paper_id": "2603.10545",
    "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning",
    "authors": [
      "Martin Asenov",
      "Qiwen Deng",
      "Gingfung Yeung",
      "Adam Barker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10545v1",
    "chunk_index": 14,
    "total_chunks": 26,
    "char_count": 3969,
    "word_count": 641,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0a76e975-c12f-45ec-97c2-8c1829f36c1c",
    "text": "Distributions of the types of machines used for benchmark experiments.
Only cloud_cpu, cloud_gpu and edge_cloudlet cluster configurations are used during training. We use additional cluster configurations to evaluate how well the proposed approach is able to adapt to unseen machines' distributions.
weights of the scoring functions in benchmark experiments with similar configurations, workloads, and machine distributions during training. We compare the proposed approach against baselines, as defined in Section IV-C.",
    "paper_id": "2603.10545",
    "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning",
    "authors": [
      "Martin Asenov",
      "Qiwen Deng",
      "Gingfung Yeung",
      "Adam Barker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10545v1",
    "chunk_index": 15,
    "total_chunks": 26,
    "char_count": 520,
    "word_count": 70,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2044e724-f274-49f8-bba7-a553f94dc7cf",
    "text": "For evaluation, we use ten different benchmark experiments and compare the best obtained score (as defined in eq. 2) during optimization. The best obtained score, the weights selection, and information about the cluster and workload configuration in each experiment are visualized in Figure 5.
(a) Internet topology (b) Urban topology
Fig. 4: Example network configurations within the cluster setup. We use two types of cluster connectivity across benchmark experiments.
in each experiment, using a random set of 5 functions: resnet50_training, resnet50_preprocessing, resnet50_inference, mobilenet_inference and speech_inference.
Optimizing the weights of scoring functions leads to a major increase in the above-defined metric compared to the default fixed weights. The proposed approach also outperforms standard baseline approaches. We observe that in simpler experiments",
    "paper_id": "2603.10545",
    "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning",
    "authors": [
      "Martin Asenov",
      "Qiwen Deng",
      "Gingfung Yeung",
      "Adam Barker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10545v1",
    "chunk_index": 16,
    "total_chunks": 26,
    "char_count": 918,
    "word_count": 119,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "102b12d6-b3bc-4ef6-89e0-c83243af5d58",
    "text": "with just a few functions, tuning the weights of scoring functions does not lead to significant benefits (Exp0). As the number of functions in the workload grows and the cluster has heterogeneous compute units and connectivity, optimizing the weights yields higher benefits (Exp1). We observe that the reinforcement learning agent learns a set of significant and insignificant weights - e.g. locality has a low weight value and standard deviation across experiments, in contrast to other scoring functions such as capability. The proposed approach improves performance by 33% over fixed weights for the scoring functions and by 20% over the next best-performing baseline.
2) Generalisation to other scenarios: In this set of experiments, we evaluate how well the proposed approach can extrapolate to unseen cluster setups, workloads, and scheduling framework options. We again sample ten different configurations for evaluation, but with extended workloads, additional cluster setups, and scheduling options. Details of the differences can be seen in Table II. Despite very different configurations for testing, in terms of cluster, workload, and scheduling options, we observe that the proposed method again outperforms the baselines, as seen in Figure 6.
More details about the training options can be seen in Table II.
C. Baselines
We compare the proposed approach against standard methods in the literature, namely fixed scoring function weights (Fixed), random search (RS), Bayesian Optimization (BO), and Tree-Structured Parzen Estimator (TPE). Fixed weights employ a similar configuration to Kubernetes by assigning the same constant weight to all scoring functions (except LeastAllocated and RTCRatio, which have a weight of 0). Random search is one of the simplest heuristics for parameter optimization; it randomly and independently samples values for each parameter across the domain of interest. Bayesian Optimization uses a surrogate model for the underlying optimization metric, often Gaussian processes. Different biases and assumptions can then be imposed through the choice of kernel, e.g. smoothness of the underlying optimization landscape - in other words, similar parameter configurations lead to similar outcomes. This can then be formally posed as an optimization process through the choice of an acquisition function, most of which balance between exploration and exploitation. Our experiments use a standard squared exponential",
    "paper_id": "2603.10545",
    "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning",
    "authors": [
      "Martin Asenov",
      "Qiwen Deng",
      "Gingfung Yeung",
      "Adam Barker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10545v1",
    "chunk_index": 17,
    "total_chunks": 26,
    "char_count": 2441,
    "word_count": 343,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f7e052ff-2d6a-4e73-91e8-ad605142e859",
    "text": "kernel and an upper confidence bound acquisition function with a weight of 0.5. TPE is based on sequential model-based optimization, similar to Bayesian optimization. However, TPE utilizes a non-parametric approach called Parzen estimators, instead of using Gaussian processes as the surrogate model. Each optimization method uses the fixed weights as an initial sample, followed by an additional four samples.
D. Results and evaluation
1) Tuning scoring functions: In the first set of experiments, we evaluate how well the proposed approach can tune the
We observe that the reinforcement learning agent is able to adapt the sampling strategy in terms of the importance of scoring functions. For example, the locality weight is explored as part of the optimization in multiple experiments and has a relatively high value in Exp2 - unlike any of the experiments in Figure 5. Moreover, the overall distribution of the selected weights is often different, e.g. in Exp0 and Exp4. In novel scenarios, the proposed approach improves performance by 20% over fixed weights and by 6% over the best-performing baseline. The mean score across experiments in similar and other scenarios is visualized in Figure 7.
Optimized best score\nExp 1 Exp 2 Exp 3 Exp 4 Exp 5 Exp 6\n1.0 143 nodes; 148 nodes; 103 nodes; 151 nodes; 102 nodes; 77 nodes;\ncloudcpu; cloudgpu; edgegpu; edgecloudlet; edgecloudlet; edgecloudlet;\n0.8 internet urban internet urban internet urban\nscore 0.6 topologyresnet_inf topologyresnet_inf topologymobilenet topologyresnet_inf topologyresnet_inf topologyresnet_inf\nmobilenet speech speech mobilenet mobilenet mobilenet\n0.4 speech resnet_train resnet_train speech resnet_train speech\nFix RS BO TPE RL resnet_train Fix RS BO TPE RL resnet_pre Fix RS BO TPE RL resnet_pre Fix RS BO TPE RL resnet_train Fix RS BO TPE RL Fix RS BO TPE RL resnet_train\nresnet_pre\nmean samples Explored weights (RL) Scoring functions:\nbest sample 1- balanced\n10 2 - capability\n3 - locality\n4 - image locality\n5 5 - data locality weights\n6 - rtc ratio\n0 7 - most allocated\n1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 8 - least allocated Fig. 5: Tuning weights of scoring functions on similar cluster and workload configurations. Example results for six\nexperiments visualized across two columns. For each experiment the following three characteristics are described (from left\nto right): best score (as defined in eq. 2) from the set of explored weights' configurations; mean and standard deviation, best\nweights selection from the reinforcement learning algorithm; short description of the experiment. We compare the proposed\napproach against four baselines, including fixed weights (Fix), random search (RS), Bayesian Optimization (BO), and Treestructured Parzen Estimator (TPE). In each experiment, the fixed weight configuration was used as an initial sample (same as\nFix), followed by four optimization steps. 
A total of eight scoring functions were used.",
    "paper_id": "2603.10545",
    "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning",
    "authors": [
      "Martin Asenov",
      "Qiwen Deng",
      "Gingfung Yeung",
      "Adam Barker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10545v1",
    "chunk_index": 18,
    "total_chunks": 26,
    "char_count": 2982,
    "word_count": 485,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "bc1cf829-f4a2-4970-92f9-fe86359bd9e8",
    "text": "[Figure 6 plot data omitted: best scores for Fix, RS, BO, TPE, and RL across six novel experiments (287-379 nodes; hybrid-balanced, edgegpu, cloudcpu, hybrid-balanced-jetson, and edgecloudlet clusters; internet/urban topologies; workloads including resnet_inf, resnet_train, resnet50_pre, mobilenet, speech, and tf_gpu), with the explored weight distributions over the same eight scoring functions as Figure 5.]\nFig. 6: Example results on novel cluster and workload configurations. 
Follows the same notation as Figure 5.",
    "paper_id": "2603.10545",
    "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning",
    "authors": [
      "Martin Asenov",
      "Qiwen Deng",
      "Gingfung Yeung",
      "Adam Barker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10545v1",
    "chunk_index": 19,
    "total_chunks": 26,
    "char_count": 1122,
    "word_count": 221,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d7989bd7-1da9-4cc1-878d-91ad95ebeeda",
    "text": "Results of unseen cluster and workload configurations as described in Table II.\n[Figure 7 plots omitted: mean scores for Fixed, RS, BO, TPE, and RL under (a) similar configurations and (b) novel configurations.]\nFig. 7: Summary results for the proposed method. Mean of the best-achieved score (as defined in eq. 2) across ten experiments. A total of five samples of weights were used per method, with the exception of Fixed, which uses the default weights. The initial weights sample in every experiment is the same as Fixed. Configurations of the experiments in (a) and (b) are uniformly sampled from Table II.\nboth workloads and compute units, the scheduling process is often not tailored toward those specific needs.\nIn this work, we presented an approach for tuning weights of scoring functions in job schedulers using reinforcement learning. We benchmark the proposed approach on a representative FaaS scheduling system with various cluster setups and workloads. We demonstrate that the proposed approach achieves better performance in comparison to standard parameter tuning algorithms, including in scenarios that are not covered during the model training, with an improvement of up to 33% over default static weight and up to 12% over the best-performing baseline.\nThe proposed approach is well suited from an engineering standpoint as it requires minimal modification to an existing scheduling infrastructure. 
The proposed approach is agnostic to the number and type of scoring functions the scheduler uses.\nV. CONCLUSIONS AND FUTURE WORK",
    "paper_id": "2603.10545",
    "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning",
    "authors": [
      "Martin Asenov",
      "Qiwen Deng",
      "Gingfung Yeung",
      "Adam Barker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10545v1",
    "chunk_index": 20,
    "total_chunks": 26,
    "char_count": 1552,
    "word_count": 247,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f0af7840-6f53-4853-b834-2ab0a0f871cd",
    "text": "Job scheduling in the context of ever-expanding demand for deploying heterogeneous workloads across various cluster environments remains an important consideration for maximizing efficiency. Different scoring functions often drive that process by evaluating pod-to-node allocation through desired characteristics. Yet, despite increased heterogeneity across\nOnce trained, the reinforcement learning agent can be deployed on top of an existing scheduling infrastructure with the task of tuning weights of scoring functions. Importantly, we demonstrate that the proposed method is able to generalize to unseen configurations, including different cluster setups, workloads, and scheduling options.\nIn future work, we will be exploring the transferability of",
    "paper_id": "2603.10545",
    "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning",
    "authors": [
      "Martin Asenov",
      "Qiwen Deng",
      "Gingfung Yeung",
      "Adam Barker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10545v1",
    "chunk_index": 21,
    "total_chunks": 26,
    "char_count": 752,
    "word_count": 97,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "de2c915c-7d61-4d2b-b04e-f7564244ddea",
    "text": "learned policy between different scheduling systems, expand- [18] T. 
Dustdar, \"Optimized container scheduling the set of scoring functions, and using additional metrics ing for data-intensive serverless edge computing,\" Future Generation\nComputer Systems, vol. 114, pp. 259–271, 2021.\nfor improved optimization. [19] D. Whitley, \"A genetic algorithm tutorial,\" Statistics and computing,\nREFERENCES vol. 4, pp. 65–85, 1994.\n[20] B. Wilkes, \"Borg, \"Taking the human out of the loop: A review of bayesian optimization,\"\nomega, and kubernetes,\" Communications of the ACM, vol. 59, no. 5, Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2015.\npp. 50–57, 2016. [21] J. K´egl, \"Algorithms for hyper-\n[2] Y. Li, and parameter optimization,\" Advances in Neural Information Processing\nT. Guan, \"Scaling large production clusters with partitioned synchro- Systems, vol. 24, 2011.\nnization.,\" in USENIX Annual Technical Conference, pp. 81–97, 2021. [22] A. Bansal, \"Selftune: Tuning cluster\nW. Ding, \"Mlaas in the wild: Workload analysis and managers,\" in 2023 Networked Systems Design and Implementation,\nscheduling in large-scale heterogeneous gpu clusters,\" in 19th USENIX USENIX, USENIX, April 2023. Symposium on Networked Systems Design and Implementation, pp. 945– [23] H. Schmidt-Thieme, \"Hyp-rl: Hy-\n960, 2022. perparameter optimization by reinforcement learning,\" arXiv preprint\n[4] O. Dion, arXiv:1906.11527, 2019. Russinovich, et al., \"Protean: [24] F. Evans, \"A framework for automated\nVm allocation service at scale,\" in Proceedings of the 14th USENIX cellular network tuning with reinforcement learning,\" IEEE Transactions\nConference on Operating Systems Design and Implementation, pp. 845– on Communications, vol. 67, no. 10, pp. 7152–7167, 2019.\n861, 2020. [25] R. 
Barto, Reinforcement learning: An introduction.\n[5] A.", + "paper_id": "2603.10545", + "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning", + "authors": [ + "Martin Asenov", + "Qiwen Deng", + "Gingfung Yeung", + "Adam Barker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10545v1", + "chunk_index": 22, + "total_chunks": 26, + "char_count": 1830, + "word_count": 259, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "bdf91311-bb78-4c0d-b73c-2050f1430d95", + "text": "Wilkes, \"Large-scale cluster management at google with borg,\" in [26] B. Recht, \"A tour of reinforcement learning: The view from continuous\nProceedings of the 10th European Conference on Computer Systems, control,\" Annual Review of Control, Robotics, and Autonomous Systems,\npp. 1–17, 2015. vol. 2, pp. 253–279, 2019.\n[6] M. Hand, [27] OpenAI, \"Gpt-4 technical report,\" 2023. Harchol-Balter, and J. Wilkes, \"Borg: the next generation,\" in [28] H.", + "paper_id": "2603.10545", + "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning", + "authors": [ + "Martin Asenov", + "Qiwen Deng", + "Gingfung Yeung", + "Adam Barker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10545v1", + "chunk_index": 24, + "total_chunks": 26, + "char_count": 446, + "word_count": 67, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "bb19eb45-442c-4154-a0a2-2c65cc33fc1f", + "text": "AlProceedings of the 15th European Conference on Computer Systems, izadeh, \"Learning scheduling algorithms for data processing clusters,\" in\npp. 1–14, 2020. Proceedings of the ACM special interest group on data communication,\n[7] Z. Jin-zhong, \"Research on kubernetes' pp. 270–288, 2019.\nresource scheduling scheme,\" in Proceedings of the 8th International [29] Y. Papka, \"Deep\nConference on Communication and Network Security, pp. 
144–148, reinforcement agent for scheduling in hpc,\" in 2021 IEEE International\n2018. Parallel and Distributed Processing Symposium (IPDPS), pp. 807–816,\n[8] J. Chen, IEEE, 2021.\nand M. Guo, \"Characterizing and orchestrating vm reservation in geo- [30] D. Schulman,\ndistributed clouds to improve the resource efficiency,\" in Proceedings of and D. Man´e, \"Concrete problems in ai safety,\" arXiv preprint\nthe 13th Symposium on Cloud Computing, pp. 94–109, 2022. arXiv:1606.06565, 2016.\n[9] J. Wang,\n\"3sigma: distribution-based cluster scheduling for runtime uncertainty,\" T. Liu, et al., \"An end-to-end automatic cloud database tuning\nin Proceedings of the 13th European Conference on Computer Systems, system using deep reinforcement learning,\" in Proceedings of the 2019\npp. 1–17, 2018. International Conference on Management of Data, pp. 415–432, 2019.\n[10] E. Fontoura, and [32] H.", + "paper_id": "2603.10545", + "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning", + "authors": [ + "Martin Asenov", + "Qiwen Deng", + "Gingfung Yeung", + "Adam Barker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10545v1", + "chunk_index": 25, + "total_chunks": 26, + "char_count": 1315, + "word_count": 187, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5dd2a987-1a74-4d62-bf7e-1fc4595c48a7", + "text": "Bianchini, \"Resource central: Understanding and predicting work- Automatically tuning static parameters for distributed file systems using\nloads for improved resource management in large cloud platforms,\" in deep reinforcement learning,\" in 2022 IEEE International Conference\nProceedings of the 26th Symposium on Operating Systems Principles, on Cloud Engineering (IC2E), pp. 150–159, IEEE, 2022.\npp. 153–167, 2017. [33] J. Yoneki, \"High-dimensional bayesian optimization with M. 
Xiong, \"Opennetlab: Open\nmulti-task learning for rocksdb,\" in Proceedings of the 1st Workshop on platform for rl-based congestion control for real-time communications,\"\nMachine Learning and Systems, pp. 111–119, 2021. in 6th Asia-Pacific Workshop on Networking, July 2022.\n[12] T. Tiwari, \"Clite: Efficient and qos-aware co-location [34] T. Levine, \"Soft actor-critic: Offof multiple latency-critical jobs for warehouse scale computers,\" in policy maximum entropy deep reinforcement learning with a stochastic\n2020 IEEE International Symposium on High Performance Computer actor,\" in International conference on machine learning, pp. 1861–1870,\nArchitecture (HPCA), pp. 193–206, IEEE, 2020. Kim, \"Scheduler for distributed and collabora- [35] J. Klimov, \"Proxtive container clusters based on multi-resource metric,\" in Proceedings imal policy optimization algorithms,\" arXiv preprint arXiv:1707.06347,\nof the International Conference on Research in Adaptive and Convergent 2017. Systems, pp. 279–281, 2020. [36] M. Preuss, \"Generalization,\n[14] J. De Turck, \"Towards network- mayhems and limits in recurrent proximal policy optimization,\" 2022.\naware resource provisioning in kubernetes for fog computing applica- [37] V. G.\ntions,\" in 2019 IEEE Conference on Network Softwarization (NetSoft), Bellemare, A. Ostrovski,\npp. 351–359, IEEE, 2019. Hassabis, \"Human-level control through\n\"Medea: scheduling of long running applications in shared production deep reinforcement learning,\" Nature, vol. 518, pp. 529–533, 2015.\nclusters,\" in Proceedings of the 13th European Conference on Computer [38] J. Legg,\nSystems, pp. 1–13, 2018. \"Scalable agent alignment via reward modeling: a research direction,\"\n[16] M. 
Shenker, 2018.\nand I.", + "paper_id": "2603.10545", + "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning", + "authors": [ + "Martin Asenov", + "Qiwen Deng", + "Gingfung Yeung", + "Adam Barker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10545v1", + "chunk_index": 26, + "total_chunks": 26, + "char_count": 2207, + "word_count": 294, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5cf78bc3-9403-449e-b36f-194530eea041", + "text": "Stoica, \"Delay scheduling: A simple technique for achieving [39] T. Dustdar,\nlocality and fairness in cluster scheduling,\" in Proceedings of the 5th \"Synthesizing plausible infrastructure configurations for evaluating edge\nEuropean Conference on Computer Systems, 10th European Conference computing systems.,\" in HotEdge, 2020.\non Computer Systems, (New York, NY, USA), p. 265–278, Association [40] A. Dorfor Computing Machinery, 2010. mann, \"Stable-baselines3: Reliable reinforcement learning implemen-\n[17] Y. Anwar, \"Characterizing co-located datacenter tations,\" The Journal of Machine Learning Research, vol. 22, no. 1,\nworkloads: An alibaba case study,\" in Proceedings of the 9th Asia-Pacific pp. 
12348–12355, 2021.", + "paper_id": "2603.10545", + "title": "Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning", + "authors": [ + "Martin Asenov", + "Qiwen Deng", + "Gingfung Yeung", + "Adam Barker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10545v1", + "chunk_index": 27, + "total_chunks": 26, + "char_count": 721, + "word_count": 95, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10549_semantic.json b/data/chunks/2603.10549_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..a0fda42b93277237677e4229c19be4499afea18f --- /dev/null +++ b/data/chunks/2603.10549_semantic.json @@ -0,0 +1,877 @@ +[ + { + "chunk_id": "ead9b589-2575-4b25-8587-3ec0c9ee1b29", + "text": "Towards Cognitive Defect Analysis in Active\nInfrared Thermography with Vision-Text Cues Mohammed Salah, Eman Ouda, Giuseppe Dell'Avvocato, Fabrizio Sarasini, Ester D'Accardi, Jorge Dias, Davor\nSvetinovic, Stefano Sfarra, and Yusra Abdulrahman", + "paper_id": "2603.10549", + "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues", + "authors": [ + "Mohammed Salah", + "Eman Ouda", + "Giuseppe Dell'Avvocato", + "Fabrizio Sarasini", + "Ester D'Accardi", + "Jorge Dias", + "Davor Svetinovic", + "Stefano Sfarra", + "Yusra Abdulrahman" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10549v1", + "chunk_index": 0, + "total_chunks": 35, + "char_count": 242, + "word_count": 30, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b2e29a2b-e6b0-4022-8c29-b0e9e6bdc028", + "text": "Abstract—Active infrared thermography (AIRT) is currently vision-language models, dimensionality reduction, defect detecwitnessing a surge of artificial intelligence (AI) methodologies tion\nbeing deployed for automated subsurface defect analysis of\nhigh performance 
carbon fiber-reinforced polymers (CFRP). Deploying AI-based AIRT methodologies for inspecting CFRPs requires the creation of time consuming and expensive datasets of CFRP inspection sequences to train neural networks. To address this challenge, this work introduces a novel language-guided framework for cognitive defect analysis in CFRPs using AIRT and vision-language models (VLMs). Unlike conventional learning-based approaches, the proposed framework does not\nI. INTRODUCTION\nCarbon fiber reinforced polymers (CFRPs) are valued in the aerospace industry for their exceptional strength-to-weight ratio, corrosion resistance, and fatigue performance, enabling lighter airframes, improved fuel efficiency, and extended service life.",
    "paper_id": "2603.10549",
    "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues",
    "authors": [
      "Mohammed Salah",
      "Eman Ouda",
      "Giuseppe Dell'Avvocato",
      "Fabrizio Sarasini",
      "Ester D'Accardi",
      "Jorge Dias",
      "Davor Svetinovic",
      "Stefano Sfarra",
      "Yusra Abdulrahman"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10549v1",
    "chunk_index": 1,
    "total_chunks": 35,
    "char_count": 1002,
    "word_count": 123,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c96eb9e1-61fd-4e9c-8e91-4de7d1d6d7f0",
    "text": "Modern transport aircraft now rely heavily on CFRPs in primary load-bearing structures such as fuselage skins, wings, spars, ribs, and tail sections, as well as in secondary components, including control surfaces, fairings, doors, nacelles, and interior panels [1], [2]. However, during manufacturing and in service, CFRP parts can develop a variety of defects and damage modes, such as porosity, resin-rich areas, fiber waviness or wrinkling, matrix cracking, debonding, delaminations, and impact-induced barely visible impact damage, which degrade stiffness, strength, and fatigue life and may remain hidden from visual inspection [3]. To assess material properties, structural integrity, and subsurface defects without causing damage, nondestructive testing (NDT) techniques are employed, particularly in safety-critical fields such as aerospace. Commonly used NDT methods for CFRP inspection include ultrasonic testing, radiographic inspection, and infrared thermography (IRT), all of which can detect hidden anomalies and subsurface damage [4]. Among\nrequire developing training datasets for extensive training of defect detectors, instead it relies solely on pretrained multimodal VLM encoders coupled with a lightweight adapter to enable generative zero-shot understanding and localization of subsurface defects. By leveraging pretrained multimodal encoders, the proposed system enables generative zero-shot understanding of thermographic patterns and automatic detection of subsurface defects. Given the domain gap between thermographic data and natural images used to train VLMs, an AIRT-VLM Adapter is proposed to enhance the visibility of defects while aligning the thermographic domain with the learned representations of VLMs. The proposed framework is validated using three representative VLMs; specifically, GroundingDINO, Qwen-VL-Chat, and CogVLM. Validation is performed on 25 CFRP inspection sequences with impacts introduced at different energy levels, reflecting realistic defects encountered in industrial scenarios. Experimental results demonstrate that the AIRT-VLM adapter achieves signal-to-noise ratio (SNR) gains exceeding 10 dB compared with conventional thermographic dimensionality-reduction methods, while enabling zero-shot defect detection with intersection-over-union (IoU) values reaching approximately 70%.",
    "paper_id": "2603.10549",
    "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues",
    "authors": [
      "Mohammed Salah",
      "Eman Ouda",
      "Giuseppe Dell'Avvocato",
      "Fabrizio Sarasini",
      "Ester D'Accardi",
      "Jorge Dias",
      "Davor Svetinovic",
      "Stefano Sfarra",
      "Yusra Abdulrahman"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10549v1",
    "chunk_index": 2,
    "total_chunks": 35,
    "char_count": 2338,
    "word_count": 285,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "caa56567-e3d0-4776-8088-cbe7a62779ed",
    "text": "these techniques, IRT has emerged as a valuable tool for identifying both surface and subsurface defects by analyzing the thermal response of CFRP structures [5].\nIRT is based on monitoring the propagation of heat on the surface of a material, where disturbances in thermal patterns indicate the presence of internal anomalies [6]. In particular, active infrared thermography (AIRT) improves defect detectability by applying external thermal excitation, such as flash lamps, halogen heaters, or lasers, which increases the thermal contrast between sound and defective regions\nThese findings indicate that coupling pretrained VLMs with the proposed adapter enables reliable localization of subsurface CFRP defects without defect-specific training and extensive dataset preparation.\nIndex Terms—Infrared thermography, non-destructive testing,\nThis research was funded by Khalifa University of Science and Technology through the [Advancing Non-Destructive Testing (NDT) through Innovative Integration of Infrared Thermography (IRT) and Emerging Technologies in Aerospace Applications] under Project ID: KU-[INT]-[FSU]-[2024]-[8474000660].\nM. Abdulrahman are with the Department of Aerospace Engineering, Khalifa University, Abu Dhabi, UAE. 
As a result, AIRT is especially suitable for inspecting non-metallic and multilayered materials where conventional inspection methods may be limited. AIRT has been widely adopted in sectors such as aerospace, energy, and civil infrastructure, where large-scale and on-site inspections are often\nman is also with the Advanced Research and Innovation Center (ARIC), Khalifa University, Abu Dhabi, UAE. G. Sfarra are with the Department of Industrial and Information Engineering and Economics (DIIIE), University of L'Aquila, Piazzale E. Pontieri 1, 67100 L'Aquila, Italy.\nF.",
    "paper_id": "2603.10549",
    "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues",
    "authors": [
      "Mohammed Salah",
      "Eman Ouda",
      "Giuseppe Dell'Avvocato",
      "Fabrizio Sarasini",
      "Ester D'Accardi",
      "Jorge Dias",
      "Davor Svetinovic",
      "Stefano Sfarra",
      "Yusra Abdulrahman"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10549v1",
    "chunk_index": 3,
    "total_chunks": 35,
    "char_count": 1830,
    "word_count": 246,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c55ab58d-9dc1-4fcd-9f62-c88f3895b1b6",
    "text": "Sarasini is with the Department of Chemical Engineering Materials Environment & UDR INSTM, Sapienza University of Rome, Rome, Italy. D'Accardi is with the Department of Mechanics, Mathematics and Management (DMMM), Polytechnic University of Bari, Via Orabona 4, 70125 Bari, Italy. J. Dias is with the College of Computing and Mathematical Sciences, Khalifa University, Abu Dhabi, 127788, UAE.\nrequired. Furthermore, recent advances in artificial intelligence (AI) have enabled automated defect characterization in AIRT. AI-based pulsed thermography (PT) algorithms have been proposed for defect classification [8], segmentation [9], [10], 
and depth estimation [11].",
    "paper_id": "2603.10549",
    "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues",
    "authors": [
      "Mohammed Salah",
      "Eman Ouda",
      "Giuseppe Dell'Avvocato",
      "Fabrizio Sarasini",
      "Ester D'Accardi",
      "Jorge Dias",
      "Davor Svetinovic",
      "Stefano Sfarra",
      "Yusra Abdulrahman"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10549v1",
    "chunk_index": 4,
    "total_chunks": 35,
    "char_count": 667,
    "word_count": 92,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "125ad35c-fee4-47a8-abfc-477fb42e3ac5",
    "text": "In addition, to accelerate thermographic inspection processes and enable coverage of large structures, robotic and line-scan thermography systems have been introduced [12]. Although AI methodologies are currently being investigated in AIRT, a challenge in AI-based AIRT is the scarcity of datasets and the need to prepare costly datasets for training AI models for defect analysis. Vision–Language Models (VLMs) offer a promising paradigm for zero-shot defect detection in AIRT; however, current thermographic representations produced by conventional dimensionality-reduction techniques\nD. Svetinovic is with the Department of Computer Science, Khalifa University of Science and Technology, Abu Dhabi, UAE.\nYusra Abdulrahman is the corresponding author (email: yusra.abdulrahman@ku.ac.ae).\nare costly, time-consuming, and often limited in generalizability across varying inspection conditions. In these methods, AIRT dimensionality reduction techniques are primarily used as preprocessing steps to generate compact thermographic representations suitable for deep neural network input, since raw inspection sequences are highly dimensional and computationally expensive to process. 
are not designed to generate image representations aligned with the natural-image domain of foundation VLMs, limiting their direct applicability for zero-shot reasoning. To address these challenges, this work proposes a zero-shot cognitive framework for defect analysis in AIRT using vision and text cues, leveraging the strong reasoning capabilities of pretrained multimodal VLMs. Specifically, an AIRT–VLM adapter is introduced to transform thermographic information into VLM-compatible representations, enabling off-the-shelf VLMs to perform zero-shot subsurface defect localization without thermography-specific training or large annotated datasets. As such, the focus of this work is methodological and AI-driven, applying multimodal VLMs to thermographic data for zero-shot defect localization, rather than advancing the physical modeling of infrared thermography.\nThe key contributions of this work are as follows:\n1) A novel zero-shot cognitive defect analysis framework for CFRP components is introduced, addressing the challenge of preparing time-consuming, costly datasets for AI-based thermographic inspection.\n2) The AIRT-VLM Adapter is proposed to bridge the domain gap between thermographic data and natural image distributions used in pretrained Vision–Language Models, enhancing the visibility of the defect and representation alignment.\n3) The proposed framework is tested to detect impact damage at different energy levels. The results show that VLMs coupled with the AIRT-VLM adapter enable reliable grounding of subsurface defects.\nSuch techniques serve two main purposes: compressing thousands of frames into low-dimensional feature vectors and enhancing defect visibility by improving the typically low signal-to-noise ratio of raw thermal images, which otherwise degrades defect characterization performance [17]. Common dimensionality reduction methods include Thermal Signal Reconstruction (TSR) [18], Pulsed Phase Thermography (PPT) [19], and Principal Component Analysis (PCA) [20]. Physics-informed PCA variants have been proposed to improve AIRT analysis [21], and such representations are widely used as preprocessing in AI-based pipelines, including multimodal fusion approaches [17]. More recently, data-driven autoencoders, particularly CNN-based models, have been employed to learn compact latent features that capture nonlinear spatial and temporal patterns in thermographic data [17], [22]–[24]. Therefore, defect detection is feasible with learning-based AIRT, which follows the traditional deep neural network training pipeline: inspection sequences are first collected as training data, then processed through dimensionality reduction to obtain compact thermographic representations, and finally used to train networks that are later deployed for downstream tasks. However, this pipeline suffers from two major drawbacks: preparing AIRT datasets for neural network training is costly and time-consuming, and traditional dimensionality-reduction methods do not guarantee a unified, image-like representation suitable for foundation-level, generalizable models capable of zero-shot defect detection.",
    "paper_id": "2603.10549",
    "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues",
    "authors": [
      "Mohammed Salah",
      "Eman Ouda",
      "Giuseppe Dell'Avvocato",
      "Fabrizio Sarasini",
      "Ester D'Accardi",
      "Jorge Dias",
      "Davor Svetinovic",
      "Stefano Sfarra",
      "Yusra Abdulrahman"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10549v1",
    "chunk_index": 5,
    "total_chunks": 35,
    "char_count": 4541,
    "word_count": 588,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "40f03bcd-e6ad-4ce1-85b7-d70362e8324e",
    "text": "A. Related Work\nCompared to earlier NDT techniques, such as radiography, eddy current testing, and ultrasonic testing, AIRT offers higher efficiency, faster evaluation, and fully non-contact inspection\nMoreover, the resulting thermographic representations are not inherently aligned with the natural-image domain on which vision–language models (VLMs) are pretrained, further limiting their direct applicability to zero-shot cognitive analysis. 
[13]. Hence, instead of relying",
    "paper_id": "2603.10549",
    "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues",
    "authors": [
      "Mohammed Salah",
      "Eman Ouda",
      "Giuseppe Dell'Avvocato",
      "Fabrizio Sarasini",
      "Ester D'Accardi",
      "Jorge Dias",
      "Davor Svetinovic",
      "Stefano Sfarra",
      "Yusra Abdulrahman"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10549v1",
    "chunk_index": 6,
    "total_chunks": 35,
    "char_count": 476,
    "word_count": 61,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3fad1bb9-5a26-4ab3-a34d-be950b16dd6e",
    "text": "on extensive data preparation and hand-crafted dimensionality-reduction methods, this work proposes a zero-shot cognitive framework for defect analysis in AIRT using vision–text cues, without the need for large-scale data collection, preparation, or defect-specific training.\nThis growing interest has spurred extensive research into learning-based models for defect detection, including adaptations of Faster R-CNN and YOLOv5 for IRT data [8], as well as ConvLSTM architectures to better capture temporal dependencies [14]. For defect segmentation, various neural network architectures have been explored, such as U-Net variants for composites and forged components [15], [16], and ConvLSTM-based models for 3D defect depth reconstruction.\nThe remaining sections of this paper are organized as follows. Section II outlines the sample preparation and methodology.",
    "paper_id": "2603.10549",
    "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues",
    "authors": [
      "Mohammed Salah",
      "Eman Ouda",
      "Giuseppe Dell'Avvocato",
      "Fabrizio Sarasini",
      "Ester D'Accardi",
      "Jorge Dias",
      "Davor Svetinovic",
      "Stefano Sfarra",
      "Yusra Abdulrahman"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10549v1",
    "chunk_index": 7,
    "total_chunks": 35,
    "char_count": 862,
    "word_count": 116,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ebe57e60-2fdd-4c19-9d46-43fe57438685",
    "text": "Section III presents the experimental validations of the proposed methodology. Finally, section IV presents conclusions, findings, and future aspects of the proposed framework.\nHowever, many approaches rely mainly on spatial information and underutilize temporal dynamics in thermographic sequences. To address this limitation, 3D CNNs have been introduced to explicitly model spatiotemporal features and improve subsurface defect segmentation [11].\naforementioned models are supervised and their performance depends heavily on the availability of large, carefully annotated thermographic datasets, whose acquisition and labeling\nII. MATERIALS AND METHODS\nSpecimens and Data Collection\n1) Specimens: Two types of additively manufactured carbon fiber reinforced polymer specimens were investigated:\ninfrared camera, which was positioned orthogonally to the inspected surface. This central positioning was selected to minimize dimensional errors due to perspective effects. The lamp arrangement, schematically illustrated in Fig. 2, was defined to ensure sufficiently uniform heating of the specimen surface and to limit non-uniform heating artifacts. 
The surface was heated for approximately 4 ms, which, considering specimen thickness and previously estimated thermophysical properties [26], is short enough to satisfy the impulsive heating assumption and justify the Dirac pulse approximation. The difference between A.1 and A.2 lies solely in the inspected surface: A.1 corresponds to the front surface (the impacted side), whereas A.2 corresponds to the rear surface. Configuration A.3 employed the same optical arrangement, i.e., two halogen lamps positioned laterally to the infrared camera and oriented as shown in Fig. 2. In this case, two halogen sources with a nominal power of 650 W each were used.\nFig. 1: Front-side view of the impacted specimens, subjected to low-velocity impact at 5 J and 15 J.",
3, a single flash lamp with a nominal energy of 3 kJ was positioned at the minimum possible distance from the back surface of the specimen to maximize energy density transfer. To shield the infrared sensor and prevent pixel saturation from direct exposure to the light source, a cardboard frame was placed between the camera and the specimen.\nply lay-up obtained by stacking six 0°/90° plies containing continuous fibers, separated by intermediate layers of PEKK with short fiber reinforcement. Subsequently, it was consolidated by hot pressing at approximately 450 °C to achieve a symmetric laminate and satisfactory interlaminar bonding. The PA12-CF specimens were produced by fused filament fabrica",
for all configurations using a FLIR A655sc microbolometric infrared camera operating in the 7–14 µm spectral range, with an acquisition frequency of 50 Hz. The resulting spatial\nLow-velocity impact tests were carried out on the central area of each specimen using a drop-tower configuration.",
The visible indentation marks correspond to the main impact location and were used as a reference when defining the inspection area in the thermographic tests.",
The lamps were placed laterally with respect to the\nBottom: Schematic representation of inspection setup.",
Bottom: Schematic representation of transmission inspection setup.\ncorrupted sequence, ˆS(n), by\nˆS(n) = M ⊙ S(n) + N(0, σ²), (2)\nwhere M is a 1-D binary mask indicating visible (1) and masked (0) patches, and N(0, σ²) represents additive Gaussian noise with zero mean and variance σ². The spatial resolution for reflection configuration A is 0.22 mm/pixel and for transmission configuration B is 0.19 mm/pixel.",
The autoencoder generates l latent images that are pooled into a 2-D, single domain-aligned thermal image that preserves defect visibility while remaining semantically closer to the distribution of images seen by VLMs during pre-training. Note that the domain-aligned image representation does not retain the full spatiotemporal physical information of the original AIRT sequence, but is instead optimized to enhance defect saliency and improve zero-shot localization performance when interfaced with VLMs. Accordingly, the domain-aligned image is subsequently fed into a VLM to generate a prediction of the bounding box (x1, y1, x2, y2). Through this two-stage pipeline, a domain-adaptive reduction followed by VLM-driven reasoning, the framework enables a zero-shot, VLM-compatible inspection process capable of detecting and localizing defects directly from thermographic sequences.\nL = (1/N) Σ_{i=1}^{N} ∥˜S(n)_i − S(n)_i∥₂². (5)\nNote that the training loss defined in [28] is the combined reconstruction–knowledge distillation loss used to generate a structured latent space. For this work, we opt for a reconstruction-focused loss function, as the aim is to generate a single domain-aligned image representation. In addition, the AIRT-Masked-CAAE is trained on the inspection sequence, compressed to a latent size of l = 10, which is then pooled to generate the domain-aligned image. During training, optimization is performed using the Adam optimizer with a fixed learning rate of 10^-3, a batch size of 32, and training is conducted for 100 epochs.\nAfter online training, each z(n) ∈ zn is utilized as a pixel value, and its length is the number of channels, formulating T = {T1, T2, · · · , Tl}, where l is the latent vector size. Thus,\n1) AIRT-VLM Adapter: Let S = {Ik}, k = 1, . . . , Nt, of shape (Nt, Ny, Nx), be the 3D matrix representing the inspection\nFig. 4: Overview of the defect analysis framework. 
The inspection sequence is preprocessed to generate a compact single image representation similar to the VLMs' pretraining domain.",
(10)\nis then fed to VLMs for generating the bounding box location of the subsurface defects.\n2) Cognitive Defect Analysis: VLMs possess the multimodal reasoning capabilities of large pretrained models to detect and localize defects in a zero-shot manner. Given that IVLM is both high in contrast and semantically aligned to the statistics of natural images, it can be reliably interpreted by VLMs conditioned through natural-language prompts. The VLMs output the bounding box of the defect, identifying its location in the domain-aligned image. Let P denote the textual inspection instruction provided to the model, and let Φ(·) represent a generic VLM composed of a visual encoder, a text encoder, and a multimodal fusion module. The VLM receives the paired input (IVLM, P) and produces a bounding-box estimate through\nb = Φ(IVLM, P), b = ⟨x1, y1, x2, y2⟩, (7)\nThe fused representation u captures the joint visual–linguistic understanding required to infer the presence and extent of subsurface defects. A prediction head ψ(·) operating on u yields the bounding-box estimate by\nb = ψ(u). (11)\nThis formulation is intentionally model-agnostic, allowing the VLM to be instantiated by any multimodal architecture, including those based on transformer cross-attention, region–text matching, or grounding-based detection heads. Regardless of the specific architecture, the VLM associates the high-SNR defect structures in IVLM with the semantic concept of a \"defect\" as defined in the inspection prompt. In this work, the prompt P takes the form:\nInspect the thermal image of a CFRP sheet and output the defect bounding box as ⟨x1, y1, x2, y2⟩.\nThis standardized instruction constrains the model's output format and reduces ambiguity.\nB. AIRT-VLM Adapter Evaluation
The purpose of the AIRT-VLM adapter is to mitigate the domain shift between thermal inspection data and the natural-image distributions on which VLMs are pretrained.\nImportantly, the VLM-driven analysis requires no thermal-domain fine-tuning or labeled thermographic data, as the reasoning capability emerges from",
them against the raw thermogram as well as several state-of-the-art AIRT dimensionality reduction techniques, including PCA [34], TSR [33], DAT [22], 1D-DCAE-AIRT [23], and C-AET [29].\nThe experimental validation follows by evaluating the efficacy of the AIRT-VLM adapter in enhancing the defect signal and clarity in terms of contrast and SNR. Each is calculated as\nContrast = ((1/N) Σ_{p=1}^{N} Yd(p) − (1/M) Σ_{q=1}^{M} Ys(q)) / ((1/N) Σ_{p=1}^{N} Yd(p) + (1/M) Σ_{q=1}^{M} Ys(q)), (12)\nSNR = ((1/N) Σ_{p=1}^{N} Yd(p) − (1/M) Σ_{q=1}^{M} Ys(q)) / σs, (13)\nwhere N denotes the total number of pixels in the defective region Yd, with Yd(p) representing the pth pixel intensity in that region.\nFig. 5 summarizes the aggregated contrast and SNR performance across all methods, while Table II presents qualitative comparisons illustrating the visual differences between the proposed AIRT-VLM representation and existing dimensionality reduction approaches. The results in Table I and Fig. 5 show that the defect signal tends to increase with increasing impact energies across all sequences. This is expected since high impact energies create distinct defects that are easily identifiable compared to low impact defects. Nevertheless, the significant contrasts and SNR obtained for all defect classes are of the same
order of magnitude. M refers to the number of pixels in the sound",
In addition to the signal enhancement evaluation, the more important task is assessing the defect-detection performance of the VLMs, as this represents the main objective and motivation of this work. The evaluation is performed using two metrics: the Intersection-over-Union (IoU) and the normalized center\nAccording to Table II, the visualizations highlight sharper defect boundaries, reduced halo artifacts, and superior suppression of background weave and non-uniform heating effects. Quantitatively, contrast improvements of up to 25% and SNR gains exceeding 10 dB are observed compared to the strongest baseline, such as 1D-DCAE-AIRT.",
√(Wgt² + Hgt²)\nwhere Bpred and Bgt represent the predicted and ground truth bounding boxes, respectively. The center coordinates of each bounding box are given by (xpred_c, ypred_c) and (xgt_c, ygt_c), while Wgt and Hgt denote the width and height of the ground truth box. The defect bounding boxes have been labeled manually and verified multiple times for consistency and spatial accuracy by cross-checking the annotations across independent review passes, ensuring reliable ground-truth localization for quantitative evaluation. This evaluation is discussed in Section III-C. Sections III-D and III-E discuss ablation studies and highlight some limitations in the proposed framework, respectively.\nThe obtained results also highlight that the AIRT-VLM adapter is capable of compressing a thermographic sequence to a single image, while exposing defect-relevant features effectively. This is essential for reliable defect grounding, which is further discussed in the following section.\nC. Defect Detection Evaluation\nThe previous evaluation assessed the AIRT-VLM adapter in terms of enhancing the clarity of the defects. Consequently, three different VLMs, Qwen-VL, CogVLM, and GroundingDINO, are evaluated on the AIRT-VLM adapter representation.",
distance between the generated and ground truth bounding\nTABLE I: Quantified contrast and SNR for the AIRT-VLM adapter representation benchmarked against state-of-the-art thermography dimensionality reduction methods, TSR, PCA, DAT [22], 1D-DCAE-AIRT [23], and C-AET [29], under ambient and low-temperature (−70◦C) conditions.\nCondition (no. samples) | Metric | Raw | TSR | PCA | DAT | 1D-DCAE-AIRT | C-AET | Ours\n5 J, Ambient (6) | Contrast | 0.207 | 0.241 | 0.302 | 0.366 | 0.383 | 0.361 | 0.478\n5 J, Ambient (6) | SNR (dB) | 21.75 | 24.50 | 30.71 | 33.48 | 35.83 | 34.11 | 42.18\n5 J, −70◦C (7) | Contrast | 0.198 | 0.229 | 0.289 | 0.351 | 0.366 | 0.344 | 0.456\n5 J, −70◦C (7) | SNR (dB) | 20.90 | 23.62 | 29.40 | 32.05 | 34.21 | 32.88 | 40.27\n15 J, Ambient (7) | Contrast | 0.227 | 0.287 | 0.387 | 0.395 | 0.436 | 0.387 | 0.534\n15 J, Ambient (7) | SNR (dB) | 22.11 | 24.74 | 32.95 | 38.29 | 38.37 | 36.58 | 43.19\n15 J, −70◦C (6) | Contrast | 0.216 | 0.271 | 0.366 | 0.378 | 0.415 | 0.369 | 0.508\n15 J, −70◦C (6) | SNR (dB) | 21.08 | 23.90 | 31.42 | 36.10 | 36.84 | 35.02 | 41.36\nThe figure also presents benchmarks, comparing the proposed framework against state-of-the-art dimensionality reduction methods coupled with the three VLMs for defect detection and grounding. The results in Fig. 6 show that Qwen-VL, CogVLM, and GroundingDINO achieve IoUs higher than 60% when coupled with the AIRT-VLM adapter.",
As such, models that achieve IoUs exceeding 50% tend to be within the acceptable range. Similarly, achieving a normalized center distance below 0.05 emphasizes the model's grounding capabilities. Accordingly, coupling the VLMs with the AIRT-VLM adapter highlights strong zero-shot grounding capabilities, with IoUs reaching approximately 70% and a normalized center distance of ≈0.015.",
5: Aggregate a) contrast and b) SNR on all 25 CFRP inspection sequences.\nunces to ensure comprehensive and representative assessment.",
when max pooling and PCA are applied on the latent images\nTable III shows the detected bounding boxes on two defective CFRP samples with impact damages at 5 J and 15 J.",
Method | 5 J | 5 J (−70◦C) | 15 J | 15 J (−70◦C)",
Model 5 J 5 J (−70◦C) 15 J 15 J (−70◦C)",
+    "paper_id": "2603.10549",
+    "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues",
+    "authors": [
+      "Mohammed Salah",
+      "Eman Ouda",
+      "Giuseppe Dell'Avvocato",
+      "Fabrizio Sarasini",
+      "Ester D'Accardi",
+      "Jorge Dias",
+      "Davor Svetinovic",
+      "Stefano Sfarra",
+      "Yusra Abdulrahman"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10549v1",
+    "chunk_index": 25,
+    "total_chunks": 35,
+    "char_count": 549,
+    "word_count": 80,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "f7e9d4e2-c8e5-4a9d-aefe-c7d2a3040660",
+    "text": "TABLE IV: Quantified performances of average, max, and PCA pooling operations.\nMetric Average Pooling Max Pooling PCA\nContrast 0.522 0.471 0.547\nSNR (dB) 42.87 39.18 42.41\nIoU 0.691 0.539 0.701\nNormalized Center Distance 0.0138 0.0378 0.0118\nnormalized center distance are reported when using Qwen-VL. The results in Table IV show that max pooling results in consistently lower defect detection performance across all metrics. This is because max pooling amplifies noise even in the presence of defective signals. On the other hand, average pooling exhibits similar performance to PCA. Both methods can be applied; however, the proposed framework opts for average pooling for its computational efficiency compared to PCA. In the second study, the elimination of the pooling operation is studied. The autoencoder generates l = 10 latent images. Instead of reducing the latent space to a single domain-aligned image fed to the VLM, all l = 10 images are fed to the VLMs, and then non-maximum suppression (NMS) is, consequently, applied to generate the defect bounding box. This acts as an ensemble operator to aggregate all bounding boxes from the inspection run on each latent image. Table V shows the performance of Qwen-VL when utilizing both aforementioned defect detection methods in terms of IoU, normalized center distance, and total execution time. The model performances",
+    "paper_id": "2603.10549",
+    "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues",
+    "authors": [
+      "Mohammed Salah",
+      "Eman Ouda",
+      "Giuseppe Dell'Avvocato",
+      "Fabrizio Sarasini",
+      "Ester D'Accardi",
+      "Jorge Dias",
+      "Davor Svetinovic",
+      "Stefano Sfarra",
+      "Yusra Abdulrahman"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10549v1",
+    "chunk_index": 26,
+    "total_chunks": 35,
+    "char_count": 1370,
+    "word_count": 208,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "38088e20-0811-4c5f-b144-95f1410d1b6e",
+    "text": "TABLE V: Defect detection performance when the pooling operation is replaced with non-maximum suppression (NMS).\nMetric Average Pooling NMS\nIoU 0.691 0.707\nNormalized Center Distance 0.0138 0.0136\nExecution Time (s) 4.3 37.8\nshow consistency across the two methods. Although the pooling operation tends to flatten the latent space to a single dimension, this comes at the cost of a 10-fold increase in execution time.\nfects without thermography-specific training or curated thermal datasets, achieving intersection-over-union (IoU) values of approximately 0.7 and normalized center distances (NCD) around 0.015. In practice, this enables accurate defect localization without curated thermal datasets, labeling procedures, or additional model retraining, resulting in a significant reduction in inspection setup time and overall analysis cost. From an industrial perspective, the proposed framework removes the dataset bottleneck that currently limits the deployment of AI in thermography-based quality assurance, allowing rapid integration into existing inspection chains and providing repeatable, operator-independent defect localization. Overall,",
+    "paper_id": "2603.10549",
+    "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues",
+    "authors": [
+      "Mohammed Salah",
+      "Eman Ouda",
+      "Giuseppe Dell'Avvocato",
+      "Fabrizio Sarasini",
+      "Ester D'Accardi",
+      "Jorge Dias",
+      "Davor Svetinovic",
+      "Stefano Sfarra",
+      "Yusra Abdulrahman"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10549v1",
+    "chunk_index": 27,
+    "total_chunks": 35,
+    "char_count": 1150,
+    "word_count": 153,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "7ee5caf1-5c91-4084-9619-2d3b69fa84d2",
+    "text": "the proposed method bridges high-performing AIRT defect detection with the flexibility of multimodal AI, offering a viable route to scalable, training-free thermographic inspection. The combination of robust signal enhancement, zero-shot grounding, and minimal computational overhead positions this framework as a strong candidate for next-generation NDT systems, suitable for continuous monitoring, fast screening, and automated quality control of CFRP components. While effective for defect localization, future research will focus on fine-tuning VLMs with physics-informed objectives and leveraging richer temporal cues from AIRT sequences, enabling more generalizable cognitive defect analysis capable of identifying defect types and estimating their depth to assess severity.\nThus, the proposed multimodal defect analysis framework utilizes average pooling, which ensures comparable performance with less computational demands.\nE. Limitations\nWhile the previous results show that zero-shot grounding of subsurface defects is attainable with the proposed framework, the work still has one limitation that opens the door to future research. Since the approach relies on dimensionality reduction and compresses the entire inspection sequence into a single image representation for VLMs, depth estimation of defects cannot be carried out. This is because the framework is designed to generate a domain-aligned image that resembles the type of data seen during VLM pretraining, which naturally leaves out the physics-based intuition contained in the full AIRT sequence. Another limitation: the framework cannot differentiate between defect types, such as delaminations, voids, or impact damage, which are common in real industrial settings. To differentiate between the aforementioned defect types, language-guided defect analysis cues need to be carried out in the spatiotemporal domain. At this stage, the method only identifies the presence of a defect.\nREFERENCES\n[1] A. Asmatulu, \"Fiber-reinforced composites for aerospace, energy, and marine applications: an insight into failure mechanisms under chemical, thermal, oxidative, and mechanical load conditions,\" Advanced Composites and Hybrid Materials, vol. 8, 01 2025.\n[2] X. Gao, \"Material performance, manufacturing methods, and engineering applications in",
+    "paper_id": "2603.10549",
+    "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues",
+    "authors": [
+      "Mohammed Salah",
+      "Eman Ouda",
+      "Giuseppe Dell'Avvocato",
+      "Fabrizio Sarasini",
+      "Ester D'Accardi",
+      "Jorge Dias",
+      "Davor Svetinovic",
+      "Stefano Sfarra",
+      "Yusra Abdulrahman"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10549v1",
+    "chunk_index": 28,
+    "total_chunks": 35,
+    "char_count": 2312,
+    "word_count": 312,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "52c280c0-ea1f-40ba-af9c-ea4508816e28",
+    "text": "
quantitative detection of liquid ingress in honeycomb composites\nvia infrared thermography,\" Quantitative InfraRed Thermography\nJournal, vol. 0, no. 0, pp. 1–17, 2025. [Online]. CONCLUSIONS https://doi.org/10.1080/17686733.2025.2533739\nAI-driven active infrared thermography (AIRT) is increas- [6] N. Omar, \"Experimentally\nvalidated defect depth estimation using artificial neural network in\ningly adopted for automated inspection of composite materials; pulsed thermography,\" Infrared Physics & Technology, vol. 98, pp.\nhowever, existing AI-based pipelines remain constrained by 192–200, 2019. [Online]. Available: https://www.sciencedirect.com/\nthe need for large, labeled thermographic datasets and setup- science/article/pii/S1350449519300532\n[7] N. Mayyas,\nspecific training.", + "paper_id": "2603.10549", + "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues", + "authors": [ + "Mohammed Salah", + "Eman Ouda", + "Giuseppe Dell'Avvocato", + "Fabrizio Sarasini", + "Ester D'Accardi", + "Jorge Dias", + "Davor Svetinovic", + "Stefano Sfarra", + "Yusra Abdulrahman" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10549v1", + "chunk_index": 29, + "total_chunks": 35, + "char_count": 1978, + "word_count": 237, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d1e7d449-8c14-4161-a91e-c886eb475e7d", + "text": "To address these limitations, this work intro- \"Ir thermographic analysis of 3d printed cfrp reference samples\nduced a zero-shot cognitive defect analysis framework that in- with back-drilled and embedded defects,\" Journal of Nondestructive\ntegrates AIRT with off-the-shelf multimodal vision–language Evaluation, vol. 37, no. 3, p. 59, Jul 2018. [Online]. Available:\nhttps://doi.org/10.1007/s10921-018-0512-2\nmodels (VLMs) through a lightweight AIRT–VLM adapter [8] Z. 
Kersemans, \"A flexible deep\nthat produces domain-aligned thermal image representations. learning framework for thermographic inspection of composites,\" NDT\nExperimental validation on 25 CFRP inspection sequences us- & E International, vol. 139, p. 102926, 2023. [Online]. Available:\nhttps://www.sciencedirect.com/science/article/pii/S096386952300141X\ning Qwen-VL-Chat, CogVLM, and GroundingDINO demonstrates that the proposed method can localize subsurface de- Wang, and vol. 0, no. 0, pp. 1–22, 2025. [Online]. Available: https://doi.org/10. Ma, \"Surface defect detection of cfrp materials based on infrared 1080/10589759.2025.2595519\nthermography and attention u-net algorithm,\" Nondestructive Testing [26] G. D'Alessandro,\nand Evaluation, vol. 39, no. 2, pp. 238–257, 2024. [Online]. Sfarra, \"Nondestructive thermographic evaluation of\nhttps://doi.org/10.1080/10589759.2023.2191954 thermal diffusivity in additively manufactured fiber-reinforced compos-\n[10] Z. Maldague, \"A dataset ites using low-cost cooling: an early-stage analysis,\" in Thermosense:\nof pulsed thermography for automated defect depth estimation,\" Thermal Infrared Applications XLVII, vol. 13470. Applied Sciences, vol. 13, no. 24, 2023. [Online]. 
Available: 216–223.\nhttps://www.mdpi.com/2076-3417/13/24/13093 [27] M.", + "paper_id": "2603.10549", + "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues", + "authors": [ + "Mohammed Salah", + "Eman Ouda", + "Giuseppe Dell'Avvocato", + "Fabrizio Sarasini", + "Ester D'Accardi", + "Jorge Dias", + "Davor Svetinovic", + "Stefano Sfarra", + "Yusra Abdulrahman" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10549v1", + "chunk_index": 30, + "total_chunks": 35, + "char_count": 1759, + "word_count": 208, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "38e012dd-306b-427c-813f-062caad7926f", + "text": "Li, \"Spatio-temporal \"Masked sequence autoencoding for enhanced defect visualization\n3-d residual networks for simultaneous detection and depth estimation in active infrared thermography,\" 2025. [Online]. Available: https:\nof cfrp subsurface defects in lock-in thermography,\" IEEE Transactions //arxiv.org/abs/2512.23000\non Industrial Informatics, vol. 18, no. 4, pp. 2571–2581, 2022. [28] M. Abdulrahman, \"Pca-guided autoencoding for structured\nM. Maldague, \"Drone- dimensionality reduction in active infrared thermography,\" 2025.\nbased non-destructive inspection of industrial sites: A review and [Online]. Available: https://arxiv.org/abs/2508.07773\ncase studies,\" Drones, vol. 5, no. 4, 2021. [Online]. Mishra, \"Constrained autoencoderhttps://www.mdpi.com/2504-446X/5/4/106 based pulse compressed thermal wave imaging for sub-surface defect\n[13] M. Abusafieh, and detection,\" IEEE Sensors Journal, vol. 22, no. 18, pp. 17 335–17 342,\nG. Sankaran, \"The calibration and sensitivity aspects of a self- 2022.\nreferencing routine when applied to composites inspection: Using a [30] W. 
Yang,\npulsed thermographic setup,\" Journal of Nondestructive Evaluation, L.", + "paper_id": "2603.10549", + "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues", + "authors": [ + "Mohammed Salah", + "Eman Ouda", + "Giuseppe Dell'Avvocato", + "Fabrizio Sarasini", + "Ester D'Accardi", + "Jorge Dias", + "Davor Svetinovic", + "Stefano Sfarra", + "Yusra Abdulrahman" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10549v1", + "chunk_index": 31, + "total_chunks": 35, + "char_count": 1159, + "word_count": 143, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "95102b37-2f34-4509-912a-2a2d7c7d6e70", + "text": "Tang,\nvol. 35, p. 51, 08 2016. \"Cogvlm: Visual expert for pretrained language models,\" 2023.\n[14] U. Valeske, \"Defect shape detection and [31] J. Lin,\ndefect reconstruction in active thermography by means of two- C. Zhou, \"Qwen-vl: A versatile vision-language model for\ndimensional convolutional neural network as well as spatiotemporal understanding, localization, text reading, and beyond,\" 2023. [Online].\nconvolutional lstm network,\" Quantitative InfraRed Thermography Available: https://arxiv.org/abs/2308.12966\nJournal, vol. 19, no. 2, pp. 126–144, 2022. [Online]. Yang,\nhttps://doi.org/10.1080/17686733.2020.1810883 H. Zhu et al., \"Grounding dino: Marrying dino with grounded pre-\n[15] Z. Duan, training for open-set object detection,\" arXiv preprint arXiv:2303.05499,\nand H. Zhang, \"Automatic segmentation of microporous defects in 2023.\ncomposite film materials based on the improved attention u-net module,\" [33] A. Burgholzer, \"Extension of\nQuantitative InfraRed Thermography Journal, vol. 22, no. 4, pp. 313– the thermographic signal reconstruction technique for an automated\n328, 2025. segmentation and depth estimation of subsurface defects,\" Journal\n[16] D. Valeske, \"Defect shape detection of Imaging, vol. 6, no. 9, 2020. [Online]. 
Available: https:\nand defect reconstruction in active thermography by means of two- //www.mdpi.com/2313-433X/6/9/96\ndimensional convolutional neural network as well as spatiotemporal con- [34] C.-M. Yao, \"Thermographic\nvolutional lstm network,\" Quantitative InfraRed Thermography Journal, data analysis for defect detection by imposing spatial connectivity\nvol. 19, no. 2, pp. 126–144, 2022. and sparsity constraints in principal component thermography,\" IEEE\n[17] M. Abdulrahman, \"Multi-modal Transactions on Industrial Informatics, vol. 17, no. 6, pp. 3901–3909,\nattention networks for enhanced segmentation and depth estimation of 2021.\nsubsurface defects in pulse thermography,\" 2025. [Online]. Available:\nhttps://arxiv.org/abs/2501.09994\n[18] C. Shao, \"Non-destructive testing\nof airfoil based on infrared lock-in thermography,\" in 2018 IEEE\nInternational Conference on Information and Automation (ICIA), 2018,\npp. 1623–1628.\n[19] U.", + "paper_id": "2603.10549", + "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues", + "authors": [ + "Mohammed Salah", + "Eman Ouda", + "Giuseppe Dell'Avvocato", + "Fabrizio Sarasini", + "Ester D'Accardi", + "Jorge Dias", + "Davor Svetinovic", + "Stefano Sfarra", + "Yusra Abdulrahman" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10549v1", + "chunk_index": 32, + "total_chunks": 35, + "char_count": 2187, + "word_count": 283, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2ef4db8f-e433-49e5-a369-03b683939eac", + "text": "M¨uller, \"Modified pulse-phase thermography\nalgorithms for improved contrast-to-noise ratio from pulse-excited\nthermographic sequences,\" NDT & E International, vol. 116, p.\n102325, 2020. [Online]. Available: https://www.sciencedirect.com/\nscience/article/pii/S0963869519307546\n[20] S. 
Ibarra-Castanedo, and X.", + "paper_id": "2603.10549", + "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues", + "authors": [ + "Mohammed Salah", + "Eman Ouda", + "Giuseppe Dell'Avvocato", + "Fabrizio Sarasini", + "Ester D'Accardi", + "Jorge Dias", + "Davor Svetinovic", + "Stefano Sfarra", + "Yusra Abdulrahman" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10549v1", + "chunk_index": 33, + "total_chunks": 35, + "char_count": 309, + "word_count": 31, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "881037ad-e437-48ea-b807-42eac264ca77", + "text": "Maldague, \"Robust principal component\nthermography for defect detection in composites,\" Sensors, vol. 21, no. 8,\n2021. [Online]. Available: https://www.mdpi.com/1424-8220/21/8/2682\n[21] J.-Y. Yao, \"Sparse principal component thermography for subsurface defect detection in composite products,\" IEEE\nTransactions on Industrial Informatics, vol. 14, no. 12, pp. 5594–5600,\n2018.\n[22] K. Yao, \"Deep autoencoder\nthermography for defect detection of carbon fiber composites,\" IEEE\nTransactions on Industrial Informatics, vol. 19, no. 5, pp. 6429–6438,\n2023.\n[23] Y. Chen, \"Onedimensional deep convolutional autoencoder active infrared thermography: Enhanced visualization of internal defects in frp composites,\"\nComposites Part B Engineering, p. 111216, 03 2024.\n[24] Y. Sankaran, \"A taguchi design of experiment approach to\npulse and lock in thermography, applied to cfrp composites,\" Journal\nof Nondestructive Evaluation, vol. 36, no. 4, p. 72, Oct 2017. [Online]. Available: https://doi.org/10.1007/s10921-017-0450-4\n[25] A. Dell'Avvocato, M. ˇSvantner, F. 
Sfarra, \"Data processing methods for thermographic ndt with\nlocalised cryogenic cooling,\" Nondestructive Testing and Evaluation,", + "paper_id": "2603.10549", + "title": "Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues", + "authors": [ + "Mohammed Salah", + "Eman Ouda", + "Giuseppe Dell'Avvocato", + "Fabrizio Sarasini", + "Ester D'Accardi", + "Jorge Dias", + "Davor Svetinovic", + "Stefano Sfarra", + "Yusra Abdulrahman" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10549v1", + "chunk_index": 34, + "total_chunks": 35, + "char_count": 1183, + "word_count": 151, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10559_semantic.json b/data/chunks/2603.10559_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..1ac82ad8d5a1c663d87a29009d2c9599892fda59 --- /dev/null +++ b/data/chunks/2603.10559_semantic.json @@ -0,0 +1,1602 @@ +[ + { + "chunk_id": "daca0d9d-fdf7-476f-8c6d-6317a278d48c", + "text": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting Jing Liu∗1, Maria Grith2, Xiaowen Dong3, and Mihai Cucuringu4,1,5\n1Department of Statistics, University of Oxford, UK 2Finance Department, Neoma Business School, FranceMar\n11 3Department of Engineering Science, University of Oxford, UK\n4Department of Mathematics, University of California Los Angeles, US 5Oxford-Man Institute of Quantitative Finance, University of Oxford, UK\n[cs.LG] This paper studies cross-market return predictability through a machine learning framework", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 0, + "total_chunks": 80, + 
"char_count": 548, + "word_count": 70, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4c67a60b-e41a-4203-88ca-278eae8683bd", + "text": "that preserves economic structure. Exploiting the non-overlapping trading hours of the U.S. and Chinese equity markets, we construct a directed bipartite graph that captures time-ordered predictive linkages between stocks across markets. Edges are selected via rolling-window hypothesis testing, and the resulting graph serves as a sparse, economically interpretable feature-selectionarXiv:2603.10559v1 layer for downstream machine learning models. We apply a range of regularized and ensemble methods to forecast open-to-close returns using lagged foreign-market information. results reveal a pronounced directional asymmetry: U.S. previous-close-to-close returns contain substantial predictive information for Chinese intraday returns, whereas the reverse effect is This informational asymmetry translates into economically meaningful performance differences and highlights how structured machine learning frameworks can uncover cross-market dependencies while maintaining interpretability. 
∗Corresponding author; Email: jing.liu@exeter.ox.ac.uk Keywords: Return prediction, cross-market analysis, machine learning, bipartite graphs JEL Classification: G17, G15, C58 Return prediction remains a central problem in empirical asset pricing and portfolio management, yet", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 1, + "total_chunks": 80, + "char_count": 1269, + "word_count": 149, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1763e9dd-81c7-474a-a1f6-5d844f262f47", + "text": "its statistical difficulty is amplified by noise, non-stationarity, and nonlinear dependence structures", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 2, + "total_chunks": 80, + "char_count": 103, + "word_count": 12, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "89f34c63-b0a8-412c-9b4e-d1eae42287ec", + "text": "in financial markets. While machine learning methods have become increasingly prevalent in singlemarket forecasting applications (Chen et al., 2015; Wang, 2024; Yang and He, 2026), comparatively little attention has been paid to stock-level cross-market return prediction under realistic tradingsession timing constraints. Most existing studies on return forecasting focus on predicting within a single market. example, Chen et al. (2015) apply a Long Short-Term Memory (LSTM) model to predict stock returns in the Chinese market, while Wang (2024) studies U.S. 
stock return prediction using neural networks. Similarly, Yang and He (2026) propose an intraday volume-based uncertainty proxy to predict return direction in the Chinese market.",
+    "paper_id": "2603.10559",
+    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
+    "authors": [
+      "Jing Liu",
+      "Maria Grith",
+      "Xiaowen Dong",
+      "Mihai Cucuringu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
+    "chunk_index": 3,
+    "total_chunks": 80,
+    "char_count": 730,
+    "word_count": 103,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "b8cd805d-dd64-4dab-a82d-a56a60c3889a",
+    "text": "These studies demonstrate the growing use of machine learning methods in single-market settings. By contrast, research on cross-market interactions has largely emphasized contemporaneous co-movement, spillovers, or causal transmission rather than explicit stock-level return prediction. For instance, Eun and Shim (1989) analyze the",
+    "paper_id": "2603.10559",
+    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
+    "authors": [
+      "Jing Liu",
+      "Maria Grith",
+      "Xiaowen Dong",
+      "Mihai Cucuringu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
+    "chunk_index": 4,
+    "total_chunks": 80,
+    "char_count": 332,
+    "word_count": 42,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "5a351a92-09dc-4f73-9a38-1fe79a4580b7",
+    "text": "international transmission of stock market movements using vector autoregression, Baur and Jung (2006) evaluate contemporaneous return correlations using GARCH models, and Rapach et al. (2013) document the leading role of the U.S. market through causality tests. 
Sarwar (2014) examine the", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 5, + "total_chunks": 80, + "char_count": 288, + "word_count": 40, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "95a71bfd-0964-493f-a0a7-0741a6effada", + "text": "relationship between U.S. market uncertainty and European equity returns during crisis periods, while Jung et al. (2024) study interdependency patterns between the U.S. and Chinese markets using threshold overnight co-movement processes. Only a limited number of studies have attempted explicit cross-market predictive analysis using machine learning models. For example, Lee and Yoo (2020) apply a deep neural network to fuse information from the U.S. and South Korean markets for index-level return prediction, and Kumar et al. (2024) propose a graph neural network to model volatility spillovers across markets. most existing work operates at the index level, and to our knowledge no prior study has examined", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 6, + "total_chunks": 80, + "char_count": 711, + "word_count": 105, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9fafb950-7877-4172-bac4-6c8d509031a2", + "text": "stock-level cross-market return prediction between the U.S. and Chinese markets under realistic trading-session timing. 
Our study fills this gap by developing a directed bipartite graph framework for stock-level cross-market return forecasting between the U.S. and Chinese equity markets. time-ordered bipartite graph that selects cross-market predictors based on rolling-window screening, thereby capturing directed predictive links across non-overlapping trading sessions. predictors are then embedded into a suite of ten machine learning models to forecast next-session",
+    "paper_id": "2603.10559",
+    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
+    "authors": [
+      "Jing Liu",
+      "Maria Grith",
+      "Xiaowen Dong",
+      "Mihai Cucuringu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
+    "chunk_index": 7,
+    "total_chunks": 80,
+    "char_count": 572,
+    "word_count": 73,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "20a2cd9b-6a64-414a-a2c9-f95a3d497455",
+    "text": "open-to-close (OPCL) returns in each market. Empirically, we demonstrate a pronounced directional asymmetry: U.S. market information is substantially more informative for predicting Chinese stock returns than vice versa. Sharpe Ratios (SRs) obtained when forecasting Chinese stocks using U.S. predictors consistently exceed those in the reverse direction.",
+    "paper_id": "2603.10559",
+    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
+    "authors": [
+      "Jing Liu",
+      "Maria Grith",
+      "Xiaowen Dong",
+      "Mihai Cucuringu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
+    "chunk_index": 8,
+    "total_chunks": 80,
+    "char_count": 348,
+    "word_count": 45,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "a31083ed-bbf2-4ba7-b8f3-a59f210beb60",
+    "text": "We further show that both the graph-based selection mechanism and cross-market information contribute materially to predictive performance. 
Sector-level patterns in the estimated graph reveal economically interpretable transmission channels across markets. For instance, sector-level aggregation of the bipartite graph reveals meaningful cross-sector transmission patterns rather than a block-diagonal structure. is on documenting directional cross-market predictability and the structure of the associated dependency graph rather than designing a fully implementable trading strategy. performance metrics are reported pre-transaction-cost and without liquidity-optimized weighting, and should be interpreted as evidence of predictive asymmetry rather than deployable alpha. Our setting is economically and statistically distinctive because the U.S. and Chinese equity",
+    "paper_id": "2603.10559",
+    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
+    "authors": [
+      "Jing Liu",
+      "Maria Grith",
+      "Xiaowen Dong",
+      "Mihai Cucuringu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
+    "chunk_index": 9,
+    "total_chunks": 80,
+    "char_count": 868,
+    "word_count": 105,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "be9094d7-1c2e-4e0b-a9e6-8b6976ec7e2e",
+    "text": "sessions do not overlap. This implies that U.S. previous-close-to-close (pvCLCL) information is fully observed before the subsequent Chinese OPCL window begins, yielding a clean timing structure for cross-market prediction. The directed bipartite graph can therefore be interpreted as a time-ordered map of potential information transmission channels across markets, rather than a contemporaneous The structure of the paper is arranged as follows. Section 2 provides a detailed review of related Section 3 describes the data we use and the definitions of financial terms involved. Section 4 introduces the graph-based methodology for feature selection and prediction. 
Section 5 presents the evaluation metrics and experimental results.", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 10, + "total_chunks": 80, + "char_count": 725, + "word_count": 101, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "67030693-bfe4-4200-af51-4b0cce01b6e2", + "text": "Finally, Section 6 summarizes the study and discusses future research directions. 2.1 Cross-Market Analysis and Prediction Global financial markets have become increasingly interconnected with the intensification of international economic and financial integration. As a result, shocks, volatility, and information can propagate rapidly across countries through multiple transmission channels. A substantial body of", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 11, + "total_chunks": 80, + "char_count": 415, + "word_count": 52, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ffa36859-a1dd-4ca9-8b36-8c82df179c8b", + "text": "research therefore examines cross-market linkages, including price discovery, return co-movement, volatility spillovers, and broader measures of financial interconnectedness, typically within econometric frameworks. Such interconnectedness has motivated studies of directional information flow and market leadership across countries. For example, Liu and An (2011) examine information transmission and price discovery between the U.S. and Chinese markets. Asgharian et al. 
(2013) study how economic and geographical relationships across countries affect stock market returns. Mohammadi and Tan (2015) analyze daily returns and volatility dynamics in the U.S. and Chinese markets. Clements et al. (2015) investigate global transmission of news and volatility across financial markets, while Ahmad et al. (2018) explore market interconnectedness through return and volatility spillovers. Huang and Liu (2023) construct a financial network to characterize cross-market risk spillovers and interaction topology. These studies primarily emphasize contemporaneous relationships and",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 12,
    "total_chunks": 80,
    "char_count": 1065,
    "word_count": 135,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "fb2a7edd-a43f-43c4-a3d3-74f14a885150",
    "text": "transmission mechanisms rather than explicit stock-level return prediction. Beyond studying cross-market information propagation, a growing strand of research incorporates signals from multiple markets into forecasting models to improve predictive performance. Such integration typically relies on feature engineering that embeds external market indicators, deep learning architectures that fuse multi-market inputs, or graph-based models designed to capture inter-market dependencies. For example, Thenmozhi and Sarath Chand (2016) use foreign index information to enhance index prediction, Lee and Yoo (2020) develop multimodal deep learning models for cross-market index forecasting, and Lin et al. (2025) leverage external futures market data to predict movements of the China Securities Index. Gong et al. 
(2025) propose a cross-market volatility forecasting framework exploiting risk transmission across markets. However, much of this literature focuses on aggregate indices or volatility measures rather than stock-level return prediction. Moreover, network structures are often employed to characterize",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 13,
    "total_chunks": 80,
    "char_count": 1105,
    "word_count": 142,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7729180b-c519-4641-88f1-325eff8e0fc5",
    "text": "spillovers and interconnectedness rather than as predictive screening devices for individual stocks. Explicit stock-level cross-market return forecasting under time-ordered, non-overlapping trading sessions remains largely unexplored. Applying our methodology in this setting is therefore novel.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 14,
    "total_chunks": 80,
    "char_count": 295,
    "word_count": 34,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f425e5ab-2e90-45b1-a660-38960197dce1",
    "text": "Empirically, the directed bipartite structure reveals pronounced asymmetry in cross-market predictability, with U.S. stocks exerting substantially stronger predictive influence on Chinese stocks than in the reverse direction. These findings align with the literature documenting asymmetric cross-market return predictability with a leading role for the U.S. Rapach et al. (2013) show that lagged U.S. 
market returns possess substantial predictive power for foreign equity markets, while the reverse predictability is considerably weaker, highlighting the central role of the U.S. in global price discovery. Similarly, Siliverstovs (2017) finds that the predictive influence of the U.S. is particularly pronounced during market downturns, reinforcing the view that U.S. information dominates international return predictability. Focusing specifically on China-related markets, Mohammadi and Tan (2015) document significant return and volatility spillovers from the U.S. to China mainland and Hong Kong, with weaker effects in the opposite direction.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 15,
    "total_chunks": 80,
    "char_count": 1002,
    "word_count": 133,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "61591ce7-92eb-4919-b5e9-d0825e123785",
    "text": "2.2 Graph Methods in Finance Graph methods provide a way to represent relationships among financial entities, rather than treating each entity in isolation.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 16,
    "total_chunks": 80,
    "char_count": 156,
    "word_count": 23,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8dd39627-582b-428a-96da-5901b425ee5d",
    "text": "The use of graphs aligns with the view that financial systems are interconnected (Bardoscia et al., 2021), and that modeling these interconnections can improve forecasting and 
risk-management (Chen and Fan, 2025). Many financial phenomena, such as asset co-movements, spillovers and supply-chain linkages, are naturally represented as graphs. Bipartite graphs, which originate in graph theory and network science as representations of relationships between two distinct sets of nodes (Guillaume and Latapy, 2006; Newman, 2018), provide a natural framework for modeling interactions across disjoint groups. In finance, bipartite structures arise in contexts such as credit networks, production networks, and supply-chain relationships, where connections form between two heterogeneous sets of entities rather than within a single homogeneous market.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 17,
    "total_chunks": 80,
    "char_count": 845,
    "word_count": 114,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "374e5260-70d8-4045-ae81-96f7a84c5fd3",
    "text": "For instance, Kley et al. (2020) study extremal dependence for operational risk using a bipartite graph. Wang and Chen (2020) design a bipartite-graph-based recommender for crowdfunding with sparse data.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 18,
    "total_chunks": 80,
    "char_count": 201,
    "word_count": 30,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7b5c84e2-41b1-45ef-943d-bd123724a4a1",
    "text": "In econometrics, Wu et al. 
(2024) propose a quasi-maximum likelihood approach to estimate a bipartite network influence model. A growing literature applies graph neural networks (GNNs) and related architectures to financial applications. Wang et al. (2021) provide a survey of GNN methods in financial applications, including stock movement prediction, loan default risk assessment, recommender systems, fraud detection, and other financial events. Chen et al. (2018) apply a Graph Convolutional Network (GCN) to integrate information from related companies and improve stock price prediction. (2021) propose an LSTM-relational GCN that captures inter-stock relationships through correlation matrices to predict overnight movements. Capponi et al. (2024) develop a GNN framework for asset pricing using supply-chain data. Zhang et al. (2025) incorporate cross-stock spillover effects to forecast multivariate realized volatilities, and Luo et al. (2025) construct a semantic company relationship graph to enhance stock price forecasting.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 19,
    "total_chunks": 80,
    "char_count": 1023,
    "word_count": 140,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0c67c099-ed6f-4d02-baca-ce052986b4e4",
    "text": "Some recent empirical work extends graph-based forecasting to richer and more dynamic settings. For example, Cheng and Li (2021) employ a Graph Attention Network (GAT) to model momentum spillovers in stock returns, while Kumar et al. (2024) introduce a temporal GAT that combines graph convolution and attention mechanisms to capture structural and temporal dependencies across global market indices. Lee et al. 
(2025) show that GCN- and GAT-based models can outperform conventional machine learning baselines by exploiting symmetric interdependencies among financial indices. Related research also incorporates multimodal information into graph models. Cheng et al. (2022) integrate financial events and news into a multimodal GNN framework for price prediction, and Liu et al. (2024) develop a multiscale dynamic GCN that combines textual and numerical inputs to forecast stock movements. Despite these advances, most graph-based forecasting models construct within-market networks, where edges are defined through contemporaneous similarity, correlation, or learned attention weights. Such graphs typically capture symmetric interdependencies among assets within a",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 20,
    "total_chunks": 80,
    "char_count": 1140,
    "word_count": 156,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8f8ded89-3841-49e5-ae92-b7e051e9ba7c",
    "text": "single market and are primarily used to enhance predictive performance through richer representations. In contrast, our framework constructs a directed bipartite graph across two distinct markets, where edges are formed through time-ordered predictive screening rather than contemporaneous correlation. The resulting graph serves as a feature-selection mechanism for stock-level cross-market return prediction, explicitly exploiting the non-overlapping trading sessions between the U.S. and China. 2.3 Machine Learning in Finance Driven by increasing data availability and computational power, the application of machine learning in finance has expanded substantially in recent years. 
Compared to classical time-series and econometric models, such as ARIMA and GARCH, machine learning approaches are often considered better suited to high-dimensional and nonlinear settings. A survey by Rundo et al. (2019) documents that machine-learning-based systems demonstrate superior overall performance compared to traditional models. Another survey by Kelly et al. (2023) highlights how machine learning methods have",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 21,
    "total_chunks": 80,
    "char_count": 1078,
    "word_count": 142,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "bec353b5-5f54-4ea8-b97e-07990e9b1d21",
    "text": "become established in empirical financial research. Key applications include forecasting asset returns, volatility estimation, fraud detection, and algorithmic trading. Forecasting asset returns remains",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 22,
    "total_chunks": 80,
    "char_count": 202,
    "word_count": 23,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "61c40f48-7e12-42d0-b98a-66d7b4e5e52c",
    "text": "inherently difficult due to low signal-to-noise ratios, structural instability, and nonlinear dependence. Moreover, evidence of predictive gains is often sensitive to model specification and feature construction. These challenges partly motivate the adoption of flexible machine learning methods and the incorporation of richer information sets. 
Given recent developments in machine learning, its applications in finance can be grouped into several major categories: traditional machine learning methods, deep learning methods, and large-language-model-based methods. For traditional machine learning methods, Huang et al. (2005) employ support vector machines (SVM) to predict the direction of weekly price movements. Kumar and Thenmozhi (2006) investigate", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 23, + "total_chunks": 80, + "char_count": 742, + "word_count": 97, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1929343c-cd24-49e8-bee3-ca216b355794", + "text": "the application of SVM and Random Forests (RF) in predicting the direction of a market index. and Trisedya (2015) incorporate sentiment information and use a basic linear regression model for stock price prediction. Thenmozhi and Sarath Chand (2016) predict stock prices of several major indices using support vector regression. Yang and He (2026) propose a novel proxy and apply Extreme Gradient Boosting (XGBoost) to predict return directions in the Chinese market. For deep learning methods, Chen et al. (2015) use an LSTM model for sequence learning and Chinese stock return forecasting. Wang (2024) investigates the performance of neural network models in predicting stock returns. A survey by Gao et al. 
(2024) highlights the expanding use of",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 24,
    "total_chunks": 80,
    "char_count": 748,
    "word_count": 116,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3e2d493f-0580-414a-8694-be284ae6a7ab",
    "text": "deep neural networks, convolutional neural networks, recurrent neural networks, and other advanced architectures in financial contexts. For large-language-model-based methods, Nie et al. (2024) review how Large Language Models",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 25,
    "total_chunks": 80,
    "char_count": 226,
    "word_count": 28,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ff6834c8-365e-439e-8a68-46ca5dc4eaa9",
    "text": "(LLMs) are applied in finance. Ding et al. (2023) demonstrate the effectiveness of LLMs in forecasting. Chen et al. (2023) propose a framework that integrates ChatGPT and GNN for forecasting. Chen et al. (2024) investigate the ability of ChatGPT for stock return forecasting. Despite these advances, most existing studies focus on single-market return prediction and rely primarily on information drawn from within the same market. 
Additional information is often shown to improve predictive performance, yet it is typically incorporated in contemporaneous or symmetric forms. Very few studies employ machine learning methods in stock-level cross-market forecasting environments characterized by asynchronous trading sessions and explicitly time-ordered information flows. Differing from existing studies, our framework combines directed bipartite screening with a second-stage machine learning prediction step, enabling systematic exploitation of cross-market dependencies, temporal ordering, and asymmetric predictive structure.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 26,
    "total_chunks": 80,
    "char_count": 1010,
    "word_count": 133,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7d92c35e-1724-458d-aa9d-1677ba65feb1",
    "text": "The stock data used in this study cover several of the world's largest markets by market capitalization, including the New York Stock Exchange (NYSE), Nasdaq, the Shanghai Stock Exchange (SSE), and the Shenzhen Stock Exchange (SZSE). Daily U.S. stock data are sourced from the Center for Research in Security Prices1 (CRSP), while daily Chinese stock data are sourced from the Wind database. The data span the period from 2014 through 2021. This selection of data enables us to investigate the transferability of signals across the world's largest and most liquid equity markets operating under non-overlapping trading sessions. In this paper, we rely on the market excess return of a stock, defined as the difference between the raw return of its price and the return of an exchange-traded fund (ETF) representing overall stock market performance. 
We use both pvCLCL returns and OPCL returns in one market to forecast OPCL returns in the other market (see Section 4.2 for a more detailed justification of this choice). The pvCLCL logarithmic raw return for stock i on day t can be calculated by: R(t)i,pvCLCL = log( p(t)i,cl / p(t−1)i,cl ), (1)",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 27,
    "total_chunks": 80,
    "char_count": 1077,
    "word_count": 177,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "9a15afdc-e7dd-4a07-8dc3-4be825a6d088",
    "text": "while the OPCL logarithmic raw return for stock i on day t can be calculated by:
R(t)i,OPCL = log( p(t)i,cl / p(t)i,op ). (2)
Here p(t)i,cl and p(t)i,op denote the closing and opening price of stock i on day t respectively. The market excess return of stock i on day t can be defined as: r(t)i = R(t)i,pvCLCL − R(t)ETF,pvCLCL (3) for pvCLCL returns, or
r(t)i = R(t)i,OPCL − R(t)ETF,OPCL (4) for OPCL returns. We use SPY as the market ETF in the U.S. and 513500.SH in China. We select 500 stocks with the highest average market capitalizations over the years covered in",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 29,
    "total_chunks": 80,
    "char_count": 540,
    "word_count": 96,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f2681eea-25ea-4e3a-b3a9-7e12174a6d18",
    "text": "the dataset from each country.3 Unless otherwise specified, all returns mentioned in the following contents refer to market excess returns. 
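The return definitions above can be sketched in a few lines; a minimal illustration in which the prices are hypothetical, and the pvCLCL formula follows its previous-close-to-close description (the excerpt does not reproduce the paper's own equation for it):

```python
import math

def pvclcl_return(prev_close, close):
    # Previous-close-to-close log return (definition inferred from the
    # "pvCLCL" description; hypothetical sketch, not the paper's code).
    return math.log(close / prev_close)

def opcl_return(open_price, close):
    # Open-to-close log return, cf. Eq. (2): log(p_cl / p_op).
    return math.log(close / open_price)

def excess_return(r_stock, r_etf):
    # Market excess return, cf. Eqs. (3)-(4): raw return minus ETF return.
    return r_stock - r_etf

# Hypothetical same-day prices for one stock and the market ETF
r_stock = opcl_return(100.0, 102.0)
r_etf = opcl_return(400.0, 404.0)
r_ex = excess_return(r_stock, r_etf)
```

In practice SPY (U.S.) or 513500.SH (China) would supply the ETF leg, per the data description above.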
To mitigate the influence of extreme values and potential outliers, we apply winsorization to the training-sample returns of each stock, replacing observations below the 0.5th percentile with the 0.5th percentile value and those above the 99.5th percentile with the 99.5th percentile value.", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 30, + "total_chunks": 80, + "char_count": 430, + "word_count": 62, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3e70e026-7de9-44cf-a838-dcb3e676f57d", + "text": "This study aims to predict individual stock returns using a cross-market directed bipartite graph. The prediction framework consists of two main stages. First, we build a directed bipartite graph using return data from two markets within a look-back training window. This graph identifies cross-market predictive links: if a directed edge connects two stocks, the stock at the source of the edge is treated as a predictor for forecasting the returns of the stock at the destination. 3This universe selection relies on full-sample information (average market capitalization over 2014–2021) and\ntherefore introduces a mechanical look-ahead component. We adopt it as a pragmatic way to focus on continuously\ntraded, highly liquid stocks and reduce missing observations. However, we caution that a fully investable design\nwould require time-t reconstitution based solely on lagged market capitalization information. Importantly, our main\nqualitative finding is directional asymmetry (the influence of the U.S. market on the Chinese market being stronger\nthan the reverse effect), which is unlikely to be driven solely by this selection procedure. 
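The winsorization step described above amounts to clipping each stock's training-sample returns at the 0.5th and 99.5th percentiles; a minimal sketch on synthetic data (the return series is simulated, not the paper's):

```python
import numpy as np

def winsorize_returns(returns, lower=0.5, upper=99.5):
    # Replace observations below the 0.5th percentile with the 0.5th-percentile
    # value and those above the 99.5th percentile with the 99.5th-percentile
    # value, as described for the training-sample returns of each stock.
    lo, hi = np.percentile(returns, [lower, upper])
    return np.clip(returns, lo, hi)

rng = np.random.default_rng(42)        # synthetic returns, illustration only
r = rng.normal(0.0, 0.02, size=1000)
r[0] = 0.8                             # inject an extreme outlier
r_w = winsorize_returns(r)
```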
Nonetheless, we consider time-local universe formation as a valuable extension. In the second stage, we apply various machine learning methods to forecast returns based on the identified predictive links.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 31,
    "total_chunks": 80,
    "char_count": 1322,
    "word_count": 193,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "68740ef1-b6c3-4e0c-be0f-92930af12394",
    "text": "4.1 Directed Bipartite Graph A graph can be defined as G = (V, E), where V represents the vertex set and E represents the edge set. G is called bipartite if V can be divided into two disjoint sets X and Y such that all edges have one endpoint in X and another in Y. We denote a directed edge from vi to vj as eij with associated weight wij. For a bipartite graph G, the biadjacency matrix B is defined such that rows correspond to nodes in X, columns correspond to nodes in Y, and each entry bij contains the weight wij of edge eij. We represent two different markets, the source market X and target market Y, as two vertex sets, where stocks in each market are interpreted as nodes. 
Edges originate from nodes in X and terminate at nodes in Y. For a specific period of time w, which is the look-back training window in the",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 32,
    "total_chunks": 80,
    "char_count": 772,
    "word_count": 149,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "4fbf60b1-ef5b-44b3-8b71-d7b0515e23e7",
    "text": "experiment, the daily return vector of the jth stock in market X is xXj = [r(t−l−w)Xj , r(t−l−w+1)Xj , ..., r(t−l−1)Xj ]⊺, where r(t)Xj is the return of the jth stock on day t. The daily return vector of the ith stock in market Y is yYi = [r(t−w)Yi , r(t−w+1)Yi , ..., r(t−1)Yi ]⊺. The lag parameter l captures the temporal ordering induced by non-overlapping trading sessions, ensuring that returns in the source market precede those in the target market. In this study the calculation with t uses the trading calendar rather than the natural calendar.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 33,
    "total_chunks": 80,
    "char_count": 539,
    "word_count": 99,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "24c489a2-1a15-4861-af59-77b44f2966c2",
    "text": "This time-ordered screening procedure induces a directed bipartite graph, where nodes in the source market X are connected to nodes in the target market Y whenever statistically significant predictive links are detected within the rolling training window. 
Figure 1 provides a schematic illustration of this bipartite structure. For each ordered pair (Xj, Yi), we estimate a univariate linear regression of y on x within the training window. Figure 1: Schematic illustration of the directed bipartite graph linking source-market stocks to target-market stocks based on significant predictive relationships. We quantify such a relationship using the t-statistic from the regression, defined as
tβ = β / ( se / sqrt( Σwi=1 (xi − ¯x)2 ) ). (5)
Here, β is the slope coefficient of the simple linear regression, given by β = cov(x, y) / var(x), and se denotes the standard error of the regression, se = sqrt( SSE / (w − 2) ), where SSE denotes the sum of squared residuals, SSE = Σwi=1 (yi − ˆyi)2, with ˆy = βx + α and α = ¯y − β ¯x. Here, ˆyi is the fitted value of yi, and α is the intercept of the regression line. The use of pairwise univariate screening serves primarily as a computationally tractable sparsification device rather than as a formal structural inference procedure. Similar marginal screening",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 34,
    "total_chunks": 80,
    "char_count": 1245,
    "word_count": 203,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2c11e816-9d9d-4951-9699-0325e502a416",
    "text": "approaches are common in high-dimensional predictive settings where the objective is feature selection rather than causal identification (see, e.g., Fan and Lv (2008); Hastie et al. (2009)). We recognize that testing across a large number of stock pairs raises multiple-testing considerations and may introduce spurious edges in finite samples (cf. Harvey et al. (2015)). In principle, false discovery rate or multiple-comparison corrections could be applied. 
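The t-statistic in Eq. (5) can be computed directly from the regression sums; a minimal sketch on synthetic data (the x and y vectors below are illustrative placeholders, not actual return series):

```python
import math

def tstat_slope(x, y):
    # t-statistic of the slope in a simple linear regression y = beta*x + alpha,
    # following Eq. (5): t_beta = beta / ( se / sqrt(sum (x_i - xbar)^2) ),
    # with se = sqrt(SSE / (w - 2)) and beta = cov(x, y) / var(x).
    w = len(x)
    xbar = sum(x) / w
    ybar = sum(y) / w
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = ybar - beta * xbar
    sse = sum((yi - (beta * xi + alpha)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(sse / (w - 2))
    return beta / (se / math.sqrt(sxx))

# Tiny illustrative sample (w = 4 observations)
t = tstat_slope([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

In the screening stage this statistic would be evaluated for every ordered source-target stock pair within the rolling window.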
However, our primary goal is to construct a predictive graph that enhances out-of-sample forecasting performance rather than to perform statistical inference on individual edges. We therefore treat the screening step as a",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 35,
    "total_chunks": 80,
    "char_count": 658,
    "word_count": 93,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "fa30f7d7-7080-4a0f-ab74-5a8e3f8df85a",
    "text": "model-selection heuristic and assess its validity through out-of-sample forecasting performance. Figure 2 shows the return time series for an example pair of U.S. and Chinese technology stocks, CDNS (pvCLCL returns) and 002410.XSHE (OPCL returns), smoothed with a three-day moving average for visualization purposes. Here l = 1 and w = 250. The t-statistic from the regression of 002410.XSHE on CDNS is high during the period shown, illustrating a statistically significant cross-market predictive relation within the training window under the linear screening specification. Figure 2: Example time series of U.S. pvCLCL returns for CDNS and Chinese OPCL returns for 002410.XSHE over the rolling training window. 
The series are shown for illustrative purposes to",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 36,
    "total_chunks": 80,
    "char_count": 765,
    "word_count": 113,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0ba15688-187f-42c8-b803-3a0161676b80",
    "text": "highlight cross-market co-movement underlying the detected predictive link. In our setting, either the U.S. or the Chinese market can be treated as the source market X, with the target market of prediction serving as market Y. After performing the regression t-test above for all ordered stock pairs in market X and Y, we set a threshold to filter the resulting t-statistics by magnitude. We introduce an explicit threshold parameter, denoted by τ, to facilitate later reference. In our experiments, we set τ = 2, and select edges whenever |tβ| > τ, corresponding approximately to conventional significance levels under standard asymptotic approximations. If the absolute value of the t-statistic for x and y is larger than τ, we select the return of Xj on day t −l to predict the return of Yi on day t. 
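The thresholding rule |tβ| > τ can be sketched as a simple filter over pairwise t-statistics; the t-values and most tickers below are hypothetical placeholders (only CDNS and 002410.XSHE appear in the text), not estimated quantities:

```python
TAU = 2.0  # threshold used in the experiments described above

def build_edges(tstats, tau=TAU):
    # tstats: dict mapping (source_stock, target_stock) -> t-statistic.
    # Keep a directed edge X_j -> Y_i whenever |t| exceeds the threshold tau;
    # the surviving pairs form the directed bipartite graph.
    return {pair: t for pair, t in tstats.items() if abs(t) > tau}

tstats = {
    ("CDNS", "002410.XSHE"): 3.1,    # hypothetical significant positive link
    ("AAPL", "600519.XSHG"): -2.4,   # hypothetical significant negative link
    ("MSFT", "000001.XSHE"): 0.7,    # below threshold, dropped
}
edges = build_edges(tstats)
```

The retained edge weights can then populate the biadjacency matrix B described in Section 4.1.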
This selection forms a directed edge in the graph pointing from Xj to Yi.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 37,
    "total_chunks": 80,
    "char_count": 845,
    "word_count": 144,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ac5e30a6-b44a-4de6-a21a-80bf77c5d565",
    "text": "Note that the thresholding step is used purely as a sparsification mechanism, aimed at denoising the signal and improving computational tractability rather than constituting a formal multiple-testing procedure. A sample directed bipartite graph is shown in Figure 1, where Xj and Xk on day t −l are selected to predict Yi on day t. This construction yields a time-lagged cross-market predictive network that can be naturally interpreted as a directed bipartite graph. Figure 3 presents a section of the heatmap corresponding to the biadjacency matrix of the U.S.–Chinese stock network on 21 October 2021.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 38,
    "total_chunks": 80,
    "char_count": 593,
    "word_count": 93,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "15fa9f41-4b71-42be-94eb-fca7c88fb492",
    "text": "To illustrate the structure more clearly, we select 25 representative stocks from each sector. For sectors containing fewer than 25 stocks in the original dataset, all available stocks are included, resulting in 254 U.S. 
stocks and 235 Chinese stocks in this subset. Each row corresponds to a Chinese stock, while each column represents a U.S. stock. The colour intensity represents the value of the t-statistic, and black grid lines delineate sectoral boundaries. This visualization shows that cross-market predictive connectivity is not restricted to within-sector interactions (which would lead to a block-diagonal structure), thereby motivating a",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 39,
    "total_chunks": 80,
    "char_count": 630,
    "word_count": 93,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0378f441-67a1-46f7-ab6f-389fbc7a20ee",
    "text": "flexible cross-market predictive framework. Figure 3: Heatmap of the directed biadjacency matrix for a representative trading day. Rows correspond to Chinese stocks and columns to U.S. stocks, grouped by sector. Each entry represents",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 40,
    "total_chunks": 80,
    "char_count": 228,
    "word_count": 32,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "efb222cd-0764-462b-b005-c8e6d1573cc3",
    "text": "the t-statistic from the rolling-window regression of Chinese returns on lagged U.S. returns. Colour intensity reflects the magnitude and sign of the predictive relationship. 
To summarize cross-market structure over time, we average the daily biadjacency matrices across the full sample period, obtaining an aggregate representation of predictive linkages. We then compute, for this time-averaged matrix, the median of absolute t-statistics within each sector-by-sector block of the corresponding heatmap (Figure 4). This aggregation highlights systematic sectoral dependencies rather than stock-specific effects. For example, the financial services sector in the Chinese market exhibits strong predictive links with the utilities sector in the U.S. market.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 41,
    "total_chunks": 80,
    "char_count": 746,
    "word_count": 100,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "4c7f14b6-bd3e-40dc-889e-459a7875c93a",
    "text": "Figure 4: Sector-level heatmap of the absolute median t-statistic in the time-averaged biadjacency matrix of the directed cross-market graph. Rows correspond to Chinese sectors and columns to U.S. sectors. Each entry reports the median absolute predictive strength across all stock pairs within the corresponding sector-by-sector block. 
We also examine the cross-market relations over time.", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 42, + "total_chunks": 80, + "char_count": 381, + "word_count": 53, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "014586c3-3d7e-4ba3-a8c2-f2f83ce09eab", + "text": "Figure 5 shows how the in-degree of all nodes in set Y evolves over time. For each day, the 25th, 50th, and 75th percentiles of the in-degree distribution are computed across all target nodes.", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 43, + "total_chunks": 80, + "char_count": 192, + "word_count": 34, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4b82a2fe-f347-4400-8f77-79afbc304c9d", + "text": "The blue curves represent the number of U.S. 
pvCLCL nodes selected to predict Chinese OPCL returns, while the red curves correspond to the number of Chinese pvCLCL nodes selected to predict U.S.", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 44, + "total_chunks": 80, + "char_count": 194, + "word_count": 32, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c734477c-aed1-4a52-bdc3-e7819770abfb", + "text": "As time progresses, the in-degree in both directions increases, suggesting strengthening cross-market predictive connectivity over the sample period. 4.2 Predictive Analysis with Machine Learning In order to predict the return of stock Yi on day t, we use training data from day t −w to day t −1 for market Y and from day t −l −w to day t −l −1 for market X. Since we wish to predict", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 45, + "total_chunks": 80, + "char_count": 383, + "word_count": 70, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1ccf5ff1-1b41-4c10-a51a-e738f83dd32c", + "text": "Figure 5: The figure shows the 25th, 50th, and 75th percentiles of the in-degree distribution of\ntarget nodes by day. US-CN represents the number of U.S. 
pvCLCL nodes selected to predict Chinese OPCL returns, while CN-US represents the number of Chinese pvCLCL nodes selected to predict U.S. OPCL returns. To predict r(t)Yi by using information from market X, we select n stocks, i.e., X1, X2, ..., Xn, from market X corresponding to those stocks that exhibit the strongest cross-market predictive associations with Yi\naccording to the t-statistic defined above. Their daily returns on day t −l are r(t−l)X1 , r(t−l)X2 , ..., r(t−l)Xn . The data used for training and prediction are illustrated in Figure 6. All predictor selection is performed within the rolling training window to avoid look-ahead bias.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 46,
    "total_chunks": 80,
    "char_count": 762,
    "word_count": 125,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "45ba1268-e285-4e16-a6a0-badb08aef843",
    "text": "The U.S. market is open from 9:30am to 4:00pm U.S. Eastern Time (ET), while the Chinese market is open from 9:30am to 11:30am, and 1:00pm to 3:00pm China Standard Time (UTC+8). There is no overlap between the two trading periods, as shown in the time zone diagram in Figure 7, under the standard time difference. Note that adjusting for daylight saving time does not result in any overlap between the trading sessions. We predict OPCL returns for both countries. We set l = 1 when predicting Chinese stocks using the latest information from the U.S. market, and l = 0 in the reverse direction. 
This timing structure ensures that predictor information from the source",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 47,
    "total_chunks": 80,
    "char_count": 659,
    "word_count": 112,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "76b4ff97-31cd-485e-8ec8-6d2820ad1e21",
    "text": "market is fully observable prior to the opening of the target market. We build the forecasting model as follows: r(t)Yi = Fi(r(t−l)X1 , r(t−l)X2 , ..., r(t−l)Xn ; θ) + ϵ(t)i . (6) Figure 6: Schematic illustration of the rolling training and prediction framework.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 48,
    "total_chunks": 80,
    "char_count": 265,
    "word_count": 47,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "dd8bba4f-beb7-4dde-a007-d4bf0066009d",
    "text": "For each target\nstock Yi, returns over the look-back window [t −w, t −1] are regressed on lagged source-market\nreturns over [t−l−w, t−l−1]. The bottom row represents the out-of-sample prediction of r(t)Yi using\nsource returns observed at t −l, thereby preserving temporal ordering and eliminating look-ahead bias. Figure 7: Timeline of opening and closing times for the U.S. and Chinese stock markets. The non-overlapping trading sessions induce a natural temporal ordering of information, with U.S. day\nt −1 close preceding Chinese day t trading, and Chinese day t close preceding U.S. 
day t trading.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 49,
    "total_chunks": 80,
    "char_count": 592,
    "word_count": 93,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e9d88fe8-f748-4302-8c61-b22c68286b36",
    "text": "Here, the function Fi represents the different machine learning methods we use, and θ refers to the parameters that are estimated for each machine learning model. The aim is to identify a model that can generate accurate out-of-sample predictions of r(t)Yi so that a high SR can be achieved. We applied a total of ten machine learning models to forecast returns. They include: Ordinary",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 50,
    "total_chunks": 80,
    "char_count": 383,
    "word_count": 64,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2e595ca3-6ba7-4556-8c63-616429670e96",
    "text": "Least Squares (OLS), Least Absolute Shrinkage and Selection Operator (LASSO), Ridge Regression (RIDGE), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LGBM), Random Forests (RF), Adaptive Boosting (AdaBoost), ensemble by results average (ensemble-avg) and ensemble by results median (ensemble-med). This set of models spans linear, regularized, kernel-based, tree-based, and ensemble approaches, allowing us to assess whether cross-market predictive gains depend on model class or are robust across specifications. 
We describe each model in detail below. All models are estimated within each rolling training window and evaluated out-of-sample to ensure temporal validity.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 51,
    "total_chunks": 80,
    "char_count": 713,
    "word_count": 92,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "fb570bad-ce92-475d-8d6d-f371677f676e",
    "text": "• Ordinary Least Squares (OLS): The main idea of OLS is to estimate regression coefficients by choosing parameter values that minimize the sum of squared residuals between observed and predicted values. Specifically, the model is defined as: r(t)Yi = αi + Σ_{j=1}^{n} βij r(t−l)Xj + ϵ(t)i . (7) The linear model is fit with an objective of minimizing the residual sum of squares (RSS): min_{αi,w} ∥y −αi1 −Xw∥22 . (8) Here y ∈Rd is the vector of returns corresponding to r(t)Yi , X ∈Rd×n is the matrix of predictors\nwhere each row is [r(t−l)X1 , . . . , r(t−l)Xn ], and w = [βi1, . . . , βin]⊤ is the associated coefficient vector. 
Here d = w, which is the length of the training window, i.e., the number of time points.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 52,
    "total_chunks": 80,
    "char_count": 699,
    "word_count": 133,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "fd9c76a9-ebc5-4202-a543-1534adf850ca",
    "text": "• Least Absolute Shrinkage and Selection Operator (LASSO): The OLS method often leads to low bias but high variance (Hastie et al., 2009). Shrinkage methods are introduced to mitigate this problem, and LASSO is one of them. It uses ℓ1-norm regularization to impose a penalty on the size of regression coefficients (Hastie et al., 2009). The objective function is min_{αi,w} (1/2d)∥y −αi1 −Xw∥22 + λ∥w∥1 . (9) Here λ is the regularization parameter. • Ridge Regression (RIDGE): RIDGE is another type of shrinkage method. It applies ℓ2-norm regularization to the linear least squares loss function. The objective function is given by: min_{αi,w} ∥y −αi1 −Xw∥22 + λ∥w∥22 . (10)",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 53,
    "total_chunks": 80,
    "char_count": 648,
    "word_count": 109,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "01e147db-341c-429c-af96-6519f1cbefcd",
    "text": "• Support Vector Machine (SVM): SVMs can tackle complex learning problems while retaining the analytical simplicity of linear models. 
With kernel functions, this method avoids direct computation in high-dimensional spaces, enabling nonlinear learning using a linear algorithm in the feature space (Hearst et al., 1998). We use the radial basis function kernel throughout our experiment.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 54,
    "total_chunks": 80,
    "char_count": 386,
    "word_count": 55,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d038fe68-b438-4b1a-959f-1a83084cca1f",
    "text": "The goal is to minimize the following dual optimization problem with respect to the Lagrange multipliers:\nmin_α (1/2) Σ_{j=1}^{n} Σ_{k=1}^{n} αjαkyjykK(xj, xk) − Σ_{j=1}^{n} αj\ns.t. 0 ≤αj ≤C, j = 1, 2, . . . , n. (11) Here αj is the Lagrange multiplier, C is a hyperparameter that controls the trade-off between the flatness of the function and the amount by which deviations larger than ϵ are tolerated, K(xj, xk) is the kernel function, yj is r(t−w+j−1)Yi in the training window, and xj is its\ncorresponding vector of predictors [r(t−w+j−1−l)X1 , . . . , r(t−w+j−1−l)Xn ]⊺. • Extreme Gradient Boosting (XGBoost): XGBoost is a scalable end-to-end tree boosting method (Chen and Guestrin, 2016). 
It implements parallel and distributed computing to accelerate training.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 55,
    "total_chunks": 80,
    "char_count": 722,
    "word_count": 129,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7e9f24c4-1654-4392-af99-670f8d8cebea",
    "text": "The model is defined by the following equation: ˆyj = ϕ(xj) = Σ_k fk(xj), fk ∈F, (12) where F is the space of regression trees, and fk is one independent tree. The objective function is min Σ_j l(ˆyj, yj) + Σ_k Ω(fk), (13)\nwhere\nΩ(f) = γM + (1/2)λ∥w∥22. (14) Here l is a differentiable convex loss function, Ω(f) is the regularization term, M is the number of leaves, w is the leaf weight, and γ and λ are the corresponding regularization parameters. • Light Gradient Boosting Machine (LGBM): LGBM is another gradient boosting method that improves computational efficiency compared with standard gradient boosting tree algorithms (Ke et al., 2017). Two key techniques employed by LGBM are Gradient-Based One-Side Sampling and Exclusive Feature Bundling.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 56,
    "total_chunks": 80,
    "char_count": 743,
    "word_count": 125,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2fc4b03a-6cf8-4a80-b541-e414ff16ae5c",
    "text": "The former retains instances with large gradients and randomly samples those with small gradients. 
The latter combines mutually exclusive sparse features, which never take nonzero values at the same time, into a single combined feature, effectively reducing computational complexity (Ke et al., 2017). • Random Forests (RF): Random forests consist of an ensemble of decision trees. At each node of each tree, the algorithm randomly selects a subset of features to consider for splitting. Each tree is grown using bootstrap sampling of the data (Breiman, 2001). By building many de-correlated trees, the final prediction is obtained by averaging their outputs (Hastie et al., 2009). • Adaptive Boosting (AdaBoost): AdaBoost combines multiple weak learners to form a strong learner. Each weak learner is trained to correct the errors made by the previous one. The algorithm iteratively reweights training observations based on their absolute prediction errors, so that more emphasis is given to instances with larger errors from earlier iterations. The final prediction is obtained by aggregating the weak learners, summing their probabilistic predictions (Freund and Schapire, 1997).",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 57,
    "total_chunks": 80,
    "char_count": 1119,
    "word_count": 166,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b7be54e9-09e3-4594-9716-03dd40a7185b",
    "text": "We choose a decision tree regressor as the base learner in our experiment. • Ensemble by results average (ensemble-avg): For each stock and each day, we take the average of the prediction results from the eight methods above as the final output. 
• Ensemble by results median (ensemble-med): Similar to ensemble-avg, we take the median of the prediction results from the eight methods as the final output for each stock on each day. In this section, we conduct an extensive set of experiments to evaluate the cross-market predictability of individual stock returns and examine the economic relevance of the proposed graph-based framework. All results are obtained using a rolling-window estimation scheme and evaluated strictly out-of-sample. 5.1 Evaluation Metrics We use Profit and Loss (PnL) and Sharpe Ratio (SR) to evaluate the performance of forecasting models. We abstract from liquidity-optimized portfolio construction and explicit transaction cost modeling, and therefore interpret reported SRs as pre-cost measures of predictive strength rather than implementable performance.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 58,
    "total_chunks": 80,
    "char_count": 1042,
    "word_count": 156,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "02c82191-8145-493e-8055-53639541490f",
    "text": "• Profit and Loss (PnL): The PnL on day t is calculated with the following equation: PnL(t) = Σ_i sign(s(t)i ) · r(t)i · b(t)i . (15) Here s(t)i denotes the predicted return of stock i on day t, and r(t)i denotes the actual return\nof stock i on day t. b(t)i = min(0.001 × mdv(21)i , L) is the amount of capital deployed on stock\ni, where mdv(21)i denotes the median daily traded volume of stock i over the 21-day interval preceding day t, and L is the maximum limit to the bid. The parameter L controls the maximum position size. 
This position-sizing rule serves as a coarse liquidity proxy, limiting exposure in less actively traded names.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 59,
    "total_chunks": 80,
    "char_count": 638,
    "word_count": 118,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "962a72bc-0732-428b-9a5b-b14d31a6aab0",
    "text": "Throughout our experiment, L is set to 100,000 USD for the U.S. market prediction and 1,500,000 CNY for the Chinese market prediction. • Sharpe Ratio (SR): After computing daily PnLs of all stocks, we calculate the mean and standard deviation of the daily PnL vector with length T, denoted as µ(T)PnL and σ(T)PnL, where T is the length of the prediction period in our experiment. The annualized SR is given by: SR = (µ(T)PnL / σ(T)PnL) · √252. (16)\nHere the scaling accounts for the fact that there are 252 trading days in a calendar year and annualizes daily PnL variability. Several practical limitations should be noted. 
First, the graph is obtained via large-scale pairwise screening and therefore may include spurious edges in the presence of multiple testing.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 60,
    "total_chunks": 80,
    "char_count": 758,
    "word_count": 127,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3a515644-89b7-4e31-a70d-d183571392d4",
    "text": "Second, the economic evaluation abstracts from transaction costs, market impact, short-sale constraints, and other trading frictions, so reported SRs reflect pre-cost predictive performance. Third, we do not implement liquidity-weighted portfolio construction or dynamic capacity controls; position sizes are capped but not optimized with respect to market depth. Consequently, the trading design is stylized rather than fully implementable. As emphasized by Cartea et al. (2025), ignoring stock-level capacity constraints can substantially overstate the implementable value of predictive strategies. Our objective is to isolate and quantify directional cross-market predictive asymmetries rather than to construct a production-ready trading strategy. 
5.2 Experimental Setup We use a 250-day training window and update both the graphs and the predictive models every 10 days. Prediction begins on the first trading day of 2016 and ends on the last trading day of 2021.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 61,
    "total_chunks": 80,
    "char_count": 933,
    "word_count": 128,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c8b9c66e-2afd-4e7d-9bff-d793d44cd354",
    "text": "All models are re-estimated using a rolling-window scheme to ensure strict out-of-sample evaluation and avoid look-ahead bias. 5.2.1 Graph-Based Cross-Market Prediction. • Predicting the Chinese Market with the U.S. market: We let market X denote the U.S. market, and market Y denote the Chinese market. We use the most recent available U.S.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 62,
    "total_chunks": 80,
    "char_count": 341,
    "word_count": 52,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "13ba6967-c91c-484f-b77b-325252cfebfe",
    "text": "returns as predictors to forecast Chinese returns, i.e., l = 1, reflecting the non-overlapping trading sessions and the temporal ordering of information flow. • Predicting the U.S. Market with the Chinese market: We let market X denote the Chinese market, and market Y denote the U.S. market. We also use the most recent available returns for forecasting, i.e., l = 0, since Chinese trading concludes before the U.S. 
market",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 63,
    "total_chunks": 80,
    "char_count": 423,
    "word_count": 69,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "174ba370-bb0f-452f-9988-b531969a35ab",
    "text": "opens on the same calendar day. • Non-Graph-Based Same-Market Baseline: For each target stock, the previous 25 days of daily return data are used as predictive features. The training window remains 250 days and",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 64,
    "total_chunks": 80,
    "char_count": 210,
    "word_count": 34,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c7d2bc93-730f-4e7a-a57d-7db968180206",
    "text": "models are updated every 10 days to ensure comparability with the graph-based specifications. This baseline model can be described with the following equation: r(t)Yi = Fi(r(t−25)Yi , r(t−24)Yi , ..., r(t−1)Yi ; θ) + ϵ(t)i . (17) Note that the predictive features r(t−25)Yi , r(t−24)Yi , ..., r(t−1)Yi can be either all pvCLCL returns or\nall OPCL returns, while r(t)Yi is an OPCL return. • Graph-Based Same-Market Baseline: Based on the methodology described in Section 4.1, this baseline sets markets X and Y identical, so that for each stock its predictors are drawn from the same market. 
The return values of predictors are one day ahead of the response variable. This specification isolates the incremental contribution of cross-market information relative to graph-based modeling per se.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 65,
    "total_chunks": 80,
    "char_count": 786,
    "word_count": 129,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5af56a6f-fdf8-4055-9dff-a70c68b4cdd8",
    "text": "We begin by evaluating the economic performance of the cross-market forecasting framework. Portfolio sorts based on model-implied signals are standard in the return predictability literature. For each day t, stocks are ranked by the absolute value of their predicted returns, |ˆr(t)i |. To assess how performance varies with signal strength, we construct six nested quantile portfolios: • quantile 1 (qr1): all stocks; • quantile 2 (qr2): top 80% of stocks ranked by |ˆr(t)i |; • quantile 3 (qr3): top 60%; • quantile 4 (qr4): top 40%; • quantile 5 (qr5): top 20%; • quantile 6 (qr6): top 10%. These portfolios are nested, so that qr6 ⊂qr5 ⊂qr4 ⊂qr3 ⊂qr2 ⊂qr1. This construction allows us to assess whether stronger model signals translate into improved risk-adjusted performance. 
Importantly, the ranking at day t is based solely on model predictions available at that date and does not use realized returns, thereby avoiding look-ahead bias in portfolio formation.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 66,
    "total_chunks": 80,
    "char_count": 956,
    "word_count": 152,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f7aeebc4-b1a2-47cb-a2e6-4ca9cc6edac9",
    "text": "We first document that cross-market information and graph-based modeling contribute to improved forecasting performance. (a) Predictors: U.S. pvCLCL returns. (b) Predictors: U.S. OPCL returns. Figure 8: Sharpe Ratios for forecasting Chinese OPCL returns using U.S. pvCLCL and OPCL returns as predictors. Figure 8 and Figure 9 display the results of forecasting in two different directions. According to Figure 8, when predicting Chinese stocks with U.S. stocks, RIDGE, LGBM, ensemble-avg and ensemble-med yield strong performance.",
    "paper_id": "2603.10559",
    "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting",
    "authors": [
      "Jing Liu",
      "Maria Grith",
      "Xiaowen Dong",
      "Mihai Cucuringu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10559v1",
    "chunk_index": 67,
    "total_chunks": 80,
    "char_count": 503,
    "word_count": 70,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "22c7fd02-f5df-484c-ab85-81316bcc232b",
    "text": "SVM appears less effective for this task, since its SRs are mostly lower than one. 
For other forecasting methods and most quantiles, SRs exceed one, with some approaching two.", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 68, + "total_chunks": 80, + "char_count": 175, + "word_count": 29, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3808af1f-c181-4a94-ac03-9fd026e820cc", + "text": "Notably, the ensemble-average and ensemble-median methods maintain robust and stable performance, often comparable to the best individual models, highlighting the benefit of model diversification. Using U.S. pvCLCL returns as features performs better than using OPCL returns to predict Chinese OPCL returns. The cumulative PnL plots of each method for the former are shown in Figure 10, where the upward-sloping trajectories indicate economically meaningful profitability.", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 69, + "total_chunks": 80, + "char_count": 472, + "word_count": 64, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ba1b564b-46c7-48f3-84ff-3ec6895af6f4", + "text": "In contrast, as shown in Figure 9, when predicting the U.S. stocks with Chinese stocks, SRs are substantially lower across methods and quantiles. Therefore the Chinese market exerts weaker predictive influence than the U.S. market in cross-market return prediction. Since the performance is stronger when predicting Chinese stocks using U.S. 
pvCLCL returns, we focus on this setting in", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 70, + "total_chunks": 80, + "char_count": 385, + "word_count": 58, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9f342849-7167-4b6f-b47e-5af4cd49919e", + "text": "subsequent experiments and analyses. (a) Predictors: Chinese pvCLCL returns. (b) Predictors: Chinese OPCL returns. Figure 9: Sharpe Ratios for forecasting U.S. OPCL returns using Chinese returns as predictors. Figure 11 shows the results of predicting Chinese stocks with Chinese stocks based on graph Figure 12 summarizes SRs across specifications, comparing graph-based cross-market approaches, graph-based single-market approaches, and the non-graph-based baseline, with pvCLCL returns used as predictors. Figure 13 reports the corresponding performance differentials (deltas), computed as SRs of graph-based approaches minus those of the non-graph-based baseline. 
The results indicate that graph-based same-market approaches outperform non-graph-based same-market approaches for most machine learning models under most quantiles, especially for OLS and LGBM. Turning to the incremental value of cross-market information, combining cross-market information with graph information yields the strongest overall performance, outperforming approaches that use graph structures with same-market information only, as well as the non-graph-based baseline", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 71, + "total_chunks": 80, + "char_count": 1147, + "word_count": 144, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4f5683ae-1707-4283-bfab-04c0af55de92", + "text": "relying solely on same-market information. Figure 10: Cumulative daily Profit and Loss (PnL) from forecasting Chinese OPCL returns using U.S. pvCLCL returns as predictors. Each panel corresponds to a different machine learning model, and coloured curves represent nested quantile portfolios ranked by the absolute value of predicted returns. (a) Predictors: Chinese pvCLCL returns. (b) Predictors: Chinese OPCL returns. Figure 11: Sharpe Ratios for forecasting Chinese OPCL returns using Chinese returns as predictors. Figure 12: Comparison of Sharpe Ratios across graph-based cross-market, graph-based same-market, and non-graph-based baseline specifications, using pvCLCL returns as predictors. CN-CN denotes using Chinese stocks to forecast Chinese stocks, while US-CN denotes using U.S. stocks to forecast Chinese stocks. Figure 13: Performance differentials (Sharpe Ratio deltas) relative to the non-graph-based baseline, using pvCLCL returns as predictors. US-CN denotes using U.S. 
stocks to forecast Chinese stocks, while CN-CN denotes using Chinese stocks to forecast Chinese stocks.", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 72, + "total_chunks": 80, + "char_count": 1075, + "word_count": 145, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "cf796388-cfd6-4d91-ac88-fd6877470c5c", + "text": "5.4 Sensitivity Analysis\nWe next evaluate the robustness of predictive performance to perturbations in graph structure and temporal alignment when forecasting Chinese returns using U.S. pvCLCL returns. We consider a feature-replacement test, where selected informative stocks are randomly substituted with other stocks. Additionally, we assess temporal sensitivity by varying the recency of input data, using features from earlier days (e.g., t −2, t −3, etc.) instead of the most recent day t −1. First, we conduct a feature-replacement experiment. Based on graphs built for predicting returns on day t in the Chinese market using returns on day t −1 in the U.S. 
market, we maintain the same", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 73, + "total_chunks": 80, + "char_count": 672, + "word_count": 104, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b8eebdfc-4269-4f40-b13e-1608013551ff", + "text": "in-degree of each target node to preserve graph sparsity while randomly changing some of their edges. Only previously unconnected nodes are considered as replacements for the original ones. We randomly replace 20%, 40%, 60%, 80% and all of the edges. For each quantile level, we obtain the median of the results from all the 10 methods. As shown in Figure 14a, SRs generally decline as a larger fraction of edges is replaced, indicating that predictive gains depend critically on the economically meaningful structure captured by the graph rather than on generic diversification effects. The deterioration is strongest in lower and intermediate quantiles, whereas the highest quantile (qr6) exhibits comparatively greater resilience.", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 74, + "total_chunks": 80, + "char_count": 702, + "word_count": 106, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0c26ebaf-6f8b-4a1a-9c3a-ca4ba4926fc6", + "text": "Second, we assess temporal sensitivity by varying the recency of input data. Based on graphs built for predicting returns on day t in the Chinese market using returns on day t −1 in the U.S. 
market, we look into forecasting performance as the temporal gap increases (e.g., two-day, three-day, or longer gaps) between the predictor window and the target return window. As in Section 4.2, when we forecast r(t), the predictors are given by [r_{X1}(t−l), r_{X2}(t−l), ..., r_{Xn}(t−l)].", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 75, + "total_chunks": 80, + "char_count": 460, + "word_count": 79, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3ee54225-e512-4b6f-93ba-df3e20955347", + "text": "Here we set l = 2, 3, ... when predicting Chinese stocks. For each quantile level, we also obtain the median of the results from all the 10 methods. Figure 14b shows that SRs generally decline as l increases, consistent with the hypothesis that cross-market predictive content decays with time. The decline is again less pronounced for qr6,", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 76, + "total_chunks": 80, + "char_count": 352, + "word_count": 62, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "81593ed2-805d-4528-96de-c3d4c112398c", + "text": "suggesting that large-magnitude signals may capture more persistent cross-market effects. The stabilization beyond Lag 4 likely reflects weekly trading-cycle effects. Taken together, these experiments confirm that predictive performance depends critically on both the structural accuracy of the graph and the recency of cross-market information. 
(a) Effect of graph randomization (fraction of edges replaced). (b) Effect of increasing temporal lag l.\nFigure 14: Median forecasting performance under graph randomization (a) and increasing temporal lag (b). Panel (a) reports Sharpe Ratios as a function of the fraction of replaced edges while preserving graph sparsity; Panel (b) reports Sharpe Ratios as the lag parameter l increases.\n6 Conclusion and Future Research\nThis paper investigates cross-market return forecasting at the individual stock level. We propose a graph-based architecture that enables structured information transmission across markets and use it to construct cross-market predictive features. Building on this framework, we implement a range", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 77, + "total_chunks": 80, + "char_count": 1045, + "word_count": 143, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "af69a623-bc16-43f6-a22e-badb167794f1", + "text": "of machine learning models to forecast OPCL returns for each stock. Empirically, we find that combining cross-market information with graph-based feature selection delivers superior performance relative to both graph-based same-market approaches and non-graph-based baselines. The predictive relationship is asymmetric: U.S. stocks are substantially more informative for forecasting Chinese returns than the reverse. In particular, U.S. pvCLCL returns exhibit stronger predictive power for Chinese OPCL returns than U.S. 
OPCL returns, highlighting", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 78, + "total_chunks": 80, + "char_count": 546, + "word_count": 70, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "93362c22-2a73-4404-99d8-2d07cd251f66", + "text": "the importance of overnight information transmission. Sensitivity analyses confirm that preserving the economically meaningful bipartite graph structure is crucial for achieving strong risk-adjusted performance. Moreover, forecasting performance deteriorates as the temporal gap between predictor and target returns widens, emphasizing the value of recency. Several directions for future research emerge. First, extending the analysis to additional regions, including European and other Asian markets, would help assess the generalizability of cross-market", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 79, + "total_chunks": 80, + "char_count": 543, + "word_count": 68, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "10af183f-aadf-4348-aef6-1c0bdd4255f8", + "text": "Second, GNNs could be applied directly to the constructed bipartite graph to learn nonlinear cross-market dependencies. 
Finally, recent advances in time-series-specialized large language models may offer an alternative framework for modeling structured cross-market dependencies.\nConflicts of Interest\nThe authors declare that they have no competing interests.", + "paper_id": "2603.10559", + "title": "A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting", + "authors": [ + "Jing Liu", + "Maria Grith", + "Xiaowen Dong", + "Mihai Cucuringu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10559v1", + "chunk_index": 80, + "total_chunks": 80, + "char_count": 346, + "word_count": 45, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10562_semantic.json b/data/chunks/2603.10562_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..65cb5dd219987e1be8b66af995bd82730790e012 --- /dev/null +++ b/data/chunks/2603.10562_semantic.json @@ -0,0 +1,439 @@ +[ + { + "chunk_id": "aa081173-b902-4f6b-88d1-f1681a46d4fa", + "text": "Quantization Robustness of Monotone Operator Equilibrium Networks\nJames Li1, Philip H.W. Leong1 and Thomas Chaffey1", + "paper_id": "2603.10562", + "title": "Quantization Robustness of Monotone Operator Equilibrium Networks", + "authors": [ + "James Li", + "Philip H. W. Leong", + "Thomas Chaffey" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10562v1", + "chunk_index": 0, + "total_chunks": 23, + "char_count": 115, + "word_count": 15, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d2a839b4-d714-4ed5-b522-b43764a1facd", + "text": "Abstract— Monotone operator equilibrium networks are implicit-layer models whose output is the unique equilibrium of a monotone operator, guaranteeing existence, uniqueness, and convergence. When deployed on low-precision hardware, weights are quantized, potentially destroying these guarantees. We analyze weight quantization as a spectral perturbation of the underlying monotone inclusion. Convergence of the quantized solver is guaranteed whenever the spectral-norm weight perturbation is smaller than the monotonicity margin; the displacement between quantized and full-precision equilibria is bounded in terms of the perturbation size and margin; and a condition number characterizing the ratio of the operator norm to the margin links quantization precision to forward error. MNIST experiments confirm a phase transition at the predicted threshold: three- and four-bit post-training quantization diverge, while five-bit and above converge. The backward-pass guarantee enables quantization-aware training, which recovers provable convergence at four bits.\nNeural networks now underpin modern machine learning, from vision and language to decision and control. Contemporary models often contain millions or billions of parameters, increasing compute and memory demands and constraining deployment in embedded and latency-sensitive settings. This motivates quantization, which reduces memory footprint and accelerates training and inference by representing weights and activations at low-bit precision [1]. Lower precision enables efficient integer arithmetic but introduces quantization (rounding) errors that grow as bit-width decreases. Analytic bounds relating quantization error to a network's robustness and stability would let bit-width be selected based on deployment requirements rather than by trial and error.\nThis motivates the question of whether quantization error can be bounded at the model level. At present, there is no generally applicable bound on quantization error; instead, only architecture-specific analyses exist [2], [3]. Progress therefore requires restricting attention to architectures with tractable convergence guarantees, a requirement familiar in control, where quantized feedback has been modeled as\nthem suitable as controllers with formal stability and robustness guarantees [7], [8], and they can be realized in energy-efficient analog hardware [9]. A MonDEQ layer's well-posedness is captured by a single spectral margin: the smallest eigenvalue m of a symmetric matrix constructed from the layer's weights (defined formally in Section II). Having m > 0 ensures that the implicit equation defining the layer has a unique equilibrium and that the numerical solver converges to it. Because quantization perturbs this matrix and hence its eigenvalues, the monotonicity margin m provides a natural handle for analyzing quantization error. Thus far, MonDEQs have only been treated in full-precision arithmetic; to the best of our knowledge, what happens to the convergence guarantee under quantization has not been analyzed.\nA. Contributions\nThe contributions of this paper are as follows.\n1) We formalize quantization error in a MonDEQ as a bounded spectral-norm perturbation of the weight matrix and derive the induced perturbation of the monotonicity margin and Lipschitz constant (Theorem 2, Section IV-A).\n2) We give explicit conditions under which the quantized MonDEQ retains existence, uniqueness, and linear convergence of its equilibrium (Corollary 1, Section IV-A).\n3) We bound the fixed-point displacement between quantized and full-precision equilibria and derive the associated condition number (Theorems 3–4, Section IV-B).\n4) We show that the backward solve inherits the same convergence guarantees as the forward solve under quantization (Theorem 5, Section IV-C).\nWe demonstrate these contributions empirically across bit-widths from 3 to 32 bits on MNIST (Section V). To support reproducible research, the code is available at https://github.com/JLi-Projects/mondeq-quant.\nB.", + "paper_id": "2603.10562", + "title": "Quantization Robustness of Monotone Operator Equilibrium Networks", + "authors": [ + "James Li", + "Philip H. W. Leong", + "Thomas Chaffey" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10562v1", + "chunk_index": 1, + "total_chunks": 23, + "char_count": 4075, + "word_count": 560, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "cd0e23f8-0de5-4e5b-956e-579bcaec0590", + "text": "a sector-bounded perturbation and stability is analyzed via small-gain conditions [4]. Monotone operator equilibrium networks (MonDEQs) [5] are a class of deep equilibrium models (DEQs) [6] that enforce monotonicity of the underlying operator, guaranteeing existence, uniqueness, and linear convergence of the equilibrium via operator splitting. The built-in monotonicity constraints of MonDEQs make\nRelated Work\nQuantization theory. Standard quantization modeling treats the quantized weight matrix as a bounded perturbation of its full-precision counterpart [1], [10]. Post-training quantization (PTQ) applies a fixed quantizer after training, while quantization-aware training (QAT) incorporates the quantizer into the training loop via a straight-through estimator [11].", + "paper_id": "2603.10562", + "title": "Quantization Robustness of Monotone Operator Equilibrium Networks", + "authors": [ + "James Li", + "Philip H. W. 
Leong", + "Thomas Chaffey" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10562v1", + "chunk_index": 2, + "total_chunks": 23, + "char_count": 776, + "word_count": 100, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f87eabe7-f002-4763-9b46-bc35de208792", + "text": "Inexact operator splitting. Operator splitting methods such as forward–backward and Peaceman–Rachford admit inexact variants in which bounded per-step errors are tolerated while preserving convergence [12], [13]. In Section IV, we apply\n1 All authors are with the School of Electrical and Computer Engineering, The University of Sydney, NSW, Australia. Emails: jali4795@uni.sydney.edu.au, {philip.leong, thomas.chaffey}@sydney.edu.au.", + "paper_id": "2603.10562", + "title": "Quantization Robustness of Monotone Operator Equilibrium Networks", + "authors": [ + "James Li", + "Philip H. W. Leong", + "Thomas Chaffey" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10562v1", + "chunk_index": 3, + "total_chunks": 23, + "char_count": 433, + "word_count": 52, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "210b9d34-f066-4d4a-846b-4dc00b464896", + "text": "these results to quantization-induced errors in the MonDEQ solver and derive new bounds on equilibrium displacement and the associated condition number.\nNumerical error analysis. Beuzeville et al. [14] show that feedforward networks are backward stable under floating-point rounding. Jonkman et al. [15] model quantized communication in distributed optimization as an inexact Krasnosel'skiĭ–Mann iteration. Pabbaraju et al. [16] derive input-output and weight-output Lipschitz bounds for MonDEQs, but their perturbation bound assumes the perturbed margin is\nLet G : Rn ⇒ Rn be a maximal monotone operator and let JαG := (I + αG)−1 denote its resolvent for any α > 0. Considering the nonlinear fixed point iteration\nzk+1 = JαG((I − αF)zk) := Φ(zk; ϑ),\nsuppose it has a fixed point z⋆. We call the mapping from the input x to fixed point z⋆ a monotone operator equilibrium network (MonDEQ). The following equivalence is established in [5].\nTheorem 1. Define a MonDEQ as in Definition 1.", + "paper_id": "2603.10562", + "title": "Quantization Robustness of Monotone Operator Equilibrium Networks", + "authors": [ + "James Li", + "Philip H. W. Leong", + "Thomas Chaffey" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10562v1", + "chunk_index": 4, + "total_chunks": 23, + "char_count": 983, + "word_count": 155, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "41533cbc-2590-48f1-b490-fa76acf73b72", + "text": "Then\nz⋆ ∈ Fix(Φ) ⇐⇒ 0 ∈ F(z⋆) + G(z⋆).\nTheorem 1 reduces computation of the MonDEQ output to solving the monotone inclusion 0 ∈ F(z⋆) + G(z⋆). This reformulation is useful because the splitting algorithms of monotone operator theory apply directly and converge linearly when F is strongly monotone. The following parameterization enforces sym(I − W) ⪰ mI, so that F is m-strongly monotone.\nProposition 1. sym(I − W) ⪰ mI if and only if there exist A, B ∈ Rn×n such that W = (1 − m)I − A⊤A + B − B⊤.\nProof. Direct computation [5].\nThe margin m is determined by the parameterization of W. Because m = λmin(sym(I − W)) is an explicit function of W, perturbing the weight matrix perturbs m in a way that can be bounded analytically. Since m > 0 is both necessary and sufficient for well-posedness, bounding how quantization perturbs m directly determines whether the quantized network remains well-posed.\nknown and does not address quantization-specific structure, convergence conditions, or condition number.\nII. PRELIMINARIES\nWe collect notation and standard definitions from monotone operator theory that are used throughout the paper. We work in Rn with the Euclidean norm ∥·∥2 and denote the spectral norm of a matrix by ∥·∥2. The symmetric and skew-symmetric components of a matrix A are sym(A) := (1/2)(A + A⊤) and skw(A) := (1/2)(A − A⊤).\nMonotone operators. Given an operator F : Rn → Rn, its graph, denoted gra(F), is defined as {(x, y) | x ∈ Rn, y = F(x)}. An operator F : Rn → Rn is said to be monotone if ⟨F(x) − F(y), x − y⟩ ≥ 0 for all x, y ∈ Rn, and maximal if its graph is not properly contained in the graph of any other monotone operator. Given m, L > 0, an operator F : Rn → Rn is said to be m-strongly monotone if ⟨F(x) − F(y), x − y⟩ ≥ m∥x − y∥2² for all x, y ∈ Rn, and L-Lipschitz if ∥F(x) − F(y)∥2 ≤ L∥x − y∥2 for all x, y ∈ Rn. For the affine operator F(z) = (I − W)z − (Ux + b), the strong monotonicity margin (also referred to simply as the margin) is m = λmin(sym(I − W)) and the Lipschitz constant is L = ∥I − W∥2 [17], [18].\nIV. QUANTIZATION IN A MONDEQ\nHere, quantization replaces floating-point weights with fixed-point (low-bit) approximations, reducing memory and\nResolvents.", + "paper_id": "2603.10562", + "title": "Quantization Robustness of Monotone Operator Equilibrium Networks", + "authors": [ + "James Li", + "Philip H. W. Leong", + "Thomas Chaffey" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10562v1", + "chunk_index": 5, + "total_chunks": 23, + "char_count": 2148, + "word_count": 369, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f3f316cb-0b09-46f3-9426-077fcf45ec94", + "text": "For a maximal monotone operator G, the resolvent JαG := (I + αG)−1 is single-valued, firmly nonexpansive, and hence 1-Lipschitz. The reflected resolvent is RαG := 2JαG − I [18]. The forward–backward iteration zk+1 = JαG((I − αF)zk) converges linearly for any α ∈ (0, 2m/L²) with contraction modulus rFB = √(1 − 2αm + α²L²) [17], [18]. The Peaceman–Rachford iteration zk+1 = (2JαG − I)(2JαF − I)zk converges linearly for any α > 0 with contraction modulus ρPR = √(1 − 4αm/(1 + αL)²) [17], [18].\nIII. MONOTONE OPERATOR EQUILIBRIUM NETWORKS\nMonotone operator equilibrium networks (MonDEQs) [5] compute their output as the fixed point of a splitting map derived from a monotone inclusion. We summarize the key definitions.\nDefinition 1. Let W ∈ Rn×n, U ∈ Rn×d and b ∈ Rn be parameters collected in a vector ϑ ∈ Rr. Let σ : R → R be the componentwise activation on Rn. Define the affine map\nF(z) := (I − W)z − (Ux + b), z ∈ Rn.\nenabling efficient integer arithmetic at the cost of increased rounding error. We analyze the resulting error as a perturbation of the weight matrix W → W̃ = W + ∆W [10], bounding its effect on well-posedness, the equilibrium point, and the backward pass used for training. We use symmetric uniform (mid-tread) quantization: for b-bit representation with weights in [−1, 1], the quantizer Q∆(w) = ∆ · round(w/∆) has step size ∆ = 2^(1−b) and worst-case elementwise error ∆/2. Uniform quantization is standard for weight compression because the evenly spaced levels map directly to fixed-point integer formats, enabling hardware-accelerated matrix arithmetic; non-uniform schemes such as logarithmic quantizers [4] sacrifice this property. Since each entry of ∆W is bounded by ∆/2, we have ∥∆W∥2 ≤ (∆/2)√n² = n∆/2. This motivates modeling weight quantization as a bounded perturbation [10]. Weight quantization introduces a deterministic perturbation to the weight matrix.\nDefinition 2. Given a MonDEQ as in Definition 1, its quantized counterpart replaces W with W̃ = W + ∆W, ∥∆W∥2 ≤ εW. For the symmetric uniform quantizer with step size ∆ = 2^(1−b) at b bits, εW = n∆/2.\nCorollary 1. If εW < m, the quantized forward–backward", + "paper_id": "2603.10562", + "title": "Quantization Robustness of Monotone Operator Equilibrium Networks", + "authors": [ + "James Li", + "Philip H. W. 
Leong", + "Thomas Chaffey" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10562v1", + "chunk_index": 6, + "total_chunks": 23, + "char_count": 2107, + "word_count": 344, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "958be3d3-a0ab-423d-b7b8-4b7d5bc8a412", + "text": "This raises the question of how map eΦFB := JαG ◦(I −α eF) is a contraction with modulus\nlarge the perturbation can be before the equilibrium ceases rFB(α; em, eL).to exist.", + "paper_id": "2603.10562", + "title": "Quantization Robustness of Monotone Operator Equilibrium Networks", + "authors": [ + "James Li", + "Philip H. W. Leong", + "Thomas Chaffey" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10562v1", + "chunk_index": 7, + "total_chunks": 23, + "char_count": 173, + "word_count": 31, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "762f92af-6809-4df8-8b78-62d32257a3dc", + "text": "In practice, each iterate also incurs computational\nerrors such as finite-precision arithmetic or activation round- Proof. Replace (m, L) by (em, eL) from Theorem 2 in the\nforward–backward convergence rate.\ning, so the computed iterates obey zk+1 = eΦ(zk) + δk with\nbounded per-step errors δk. Together, the weight perturbation In words, weight quantization slows convergence but does\n∆W and the iterate errors δk model the two sources of error not break it: the solver still reaches a unique equilibrium, and\nin a quantized MonDEQ. the next subsection bounds how far that equilibrium moves. Margin Perturbation and Well-Posedness B. Equilibrium Displacement\nThe following theorem shows that weight perturbation The next result bounds how far the quantized equilibrium\nreduces the monotonicity margin by at most from the full-precision equilibrium z⋆. 
∥∆W∥2.",
    "paper_id": "2603.10562",
    "title": "Quantization Robustness of Monotone Operator Equilibrium Networks",
    "authors": [
      "James Li",
      "Philip H. W. Leong",
      "Thomas Chaffey"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10562v1",
    "chunk_index": 8,
    "total_chunks": 23,
    "char_count": 858,
    "word_count": 130,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "40dfbc9f-a3d0-40a1-b8df-7d8f01c715d7",
    "text": "Theorem 2. Define a MonDEQ in accordance with Definition 1 with weights W satisfying Proposition 1. Let W̃ be the quantized weights with perturbation ∥∆W∥2 ≤ εW. Then the strong monotonicity margin m̃ of W̃ is bounded below by\nm̃ ≥ m − ∥∆W∥2,\nand the Lipschitz constant L̃ of W̃ satisfies |L − ∥∆W∥2| ≤ L̃ ≤ L + ∥∆W∥2.\nProof. Since W̃ = W + ∆W, we have sym(I − W̃) = sym(I − W) − sym(∆W). By the Rayleigh-quotient characterization of extreme eigenvalues [20],\nm̃ = min_{∥x∥2=1} x⊤(sym(I − W) − sym(∆W))x ≥ min_{∥x∥2=1} x⊤ sym(I − W)x − max_{∥x∥2=1} x⊤ sym(∆W)x ≥ m − ∥sym(∆W)∥2 ≥ m − ∥∆W∥2,\nwhere the last step uses ∥sym(∆W)∥2 ≤ ∥∆W∥2. For the Lipschitz constant, the triangle and reverse triangle inequalities give\n|L − ∥∆W∥2| ≤ L̃ = ∥I − W̃∥2 ≤ L + ∥∆W∥2.\nIt follows from the bound m̃ ≥ m − ∥∆W∥2 that if ∥∆W∥2 < m, then m̃ > 0: when the weight perturbation is smaller than the monotonicity margin, the quantized operator remains strongly monotone and well-posedness is preserved. The Lipschitz bound cuts both ways: the upper bound means quantization can slow convergence, while the lower bound means the quantized operator may be better conditioned if the perturbation reduces ∥I − W∥2.\nTheorem 3. Assume F(z) = (I − W)z − (Ux + b) is m-strongly monotone and G : Rn ⇒ Rn is monotone. With W̃ given as in Definition 2, suppose ∥∆W∥2 < m (in particular m̃ > 0). Let\nF̃(z) := (I − W̃)z − (Ux + b) = F(z) − ∆Wz.\nLet z⋆ and z̃⋆ denote the (unique) solutions of the full-precision and quantized inclusions\n0 ∈ F(z⋆) + G(z⋆), 0 ∈ F̃(z̃⋆) + G(z̃⋆).\nThen\n∥z̃⋆ − z⋆∥2 ≤ (∥∆W∥2 / m) ∥z̃⋆∥2. (1)\nIn particular, if ∆W = 0 then z̃⋆ = z⋆. This theorem is a special case of [19, Theorem 4].\nProof. Pick g⋆ ∈ G(z⋆) with F(z⋆) + g⋆ = 0 and g̃⋆ ∈ G(z̃⋆) with F̃(z̃⋆) + g̃⋆ = 0. Subtracting, using F̃(z) = F(z) − ∆Wz, and taking the inner product with δz := z̃⋆ − z⋆ gives\n⟨F(z̃⋆) − F(z⋆), δz⟩ − ⟨∆W z̃⋆, δz⟩ + ⟨g̃⋆ − g⋆, δz⟩ = 0.\nBy m-strong monotonicity of F, the first term is ≥ m∥δz∥2²; by monotonicity of G, the third is ≥ 0. Hence\nm∥δz∥2² ≤ ⟨∆W z̃⋆, δz⟩ ≤ ∥∆W∥2 ∥z̃⋆∥2 ∥δz∥2\nby Cauchy–Schwarz. Dividing by ∥δz∥2 (the case δz = 0 is trivial) yields (1).\nThe bound (1) depends on ∥z̃⋆∥2 rather than ∥z⋆∥2 because the perturbation acts through the shifted fixed point. An explicit bound in terms of ∥z⋆∥2 alone is given in Corollary 3.",
    "paper_id": "2603.10562",
    "title": "Quantization Robustness of Monotone Operator Equilibrium Networks",
    "authors": [
      "James Li",
      "Philip H. W. Leong",
      "Thomas Chaffey"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10562v1",
    "chunk_index": 9,
    "total_chunks": 23,
    "char_count": 2229,
    "word_count": 390,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "9d2e29e4-ae63-45da-b130-73b78d83950b",
    "text": "The preceding results concern weight quantization, which shifts the equilibrium itself. In practice, a second error source arises: iterate quantization, where finite-precision arithmetic or activation rounding introduces per-step residuals during the solver iteration. The next result extends the convergence guarantee to this combined setting.\nIn the worst case, the condition number κ̃ = L̃/m̃ degrades from both sides (m shrinks and L grows), so the solver requires more iterations and the equilibrium becomes more sensitive to further perturbation. In practice, the margin bound is the binding constraint. Since sym(I − W) = mI + A⊤A, the margin m (the minimum eigenvalue) is exactly attained wherever A⊤A has a zero eigenvalue, making it directly exposed to perturbation. In contrast, the Lipschitz constant L = ∥I − W∥2 is robust to elementwise rounding errors. We state the contraction result for forward–backward splitting; an analogous result holds for Peaceman–Rachford splitting with ρPR(α; m̃, L̃).\nCorollary 2. Let Φ̃ be a quantized map as in Corollary 1, with contraction modulus r ∈ (0, 1) and fixed point z̃⋆. Then\nlim sup_{k→∞} ∥zk − z̃⋆∥2 ≤ lim sup_{k→∞} ∥δk∥2 / (1 − r).\nIf Σ_{k=0}^∞ ∥δk∥2 < ∞, then zk → z̃⋆ exactly.\nProof. Follows from standard inexact contraction results [18].\nIn practice, bounded per-step errors (e.g. from finite-precision arithmetic) do not destroy convergence: the solver reaches a neighborhood of z̃⋆ whose radius is controlled by the error magnitude and contraction rate. The summability condition Σ ∥δk∥2 < ∞ holds, for example, when an adaptive quantizer increases precision at each iteration so that ∥δk∥2 decays geometrically [15].\nTo first order, ∥z⋆ − z̃⋆∥2 / ∥z⋆∥2 ≤ κrel ηW, where ηW := ∥∆W∥2 / ∥W∥2 is the relative weight perturbation. For the trained MNIST model in Section V (m = 0.227, ∥W∥2 = 1.72), this gives κrel ≈ 7.6: a 1% relative weight perturbation causes at most roughly 7.6% relative displacement.\nTo ensure stability under quantization, it suffices to verify that the actual perturbation satisfies ∥∆W∥2 < m. This is the condition of Theorem 2: it guarantees that the quantized operator retains strong monotonicity (and hence a unique equilibrium with guaranteed convergence).",
    "paper_id": "2603.10562",
    "title": "Quantization Robustness of Monotone Operator Equilibrium Networks",
    "authors": [
      "James Li",
      "Philip H. W. 
Leong",
      "Thomas Chaffey"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10562v1",
    "chunk_index": 10,
    "total_chunks": 23,
    "char_count": 2213,
    "word_count": 343,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "06ce1284-d593-4c0a-b66c-08ad789c56b0",
    "text": "The total error decomposes as ∥zk − z⋆∥2 ≤ ∥zk − z̃⋆∥2 + ∥z̃⋆ − z⋆∥2: the first term is governed by iterate errors (Corollary 2), the second by weight displacement (Theorem 3). With summable iterate errors, the first term vanishes, and the total error is determined by the displacement bound alone.\nThe bound (1) measures displacement in absolute terms. We now derive a relative bound and extract the condition number, which separates the problem's inherent sensitivity from the perturbation size.\nCorollary 3. Under the hypotheses of Theorem 3, if ∥∆W∥2 < m then\n∥z⋆ − z̃⋆∥2 / ∥z⋆∥2 ≤ ∥∆W∥2 / (m − ∥∆W∥2). (2)\nProof. From Theorem 3, ∥z̃⋆ − z⋆∥2 ≤ (∥∆W∥2/m) ∥z̃⋆∥2. Substituting ∥z̃⋆∥2 ≤ ∥z⋆∥2 + ∥z̃⋆ − z⋆∥2 gives ∥z̃⋆ − z⋆∥2 ≤ (∥∆W∥2/m)(∥z⋆∥2 + ∥z̃⋆ − z⋆∥2). Rearranging, (1 − ∥∆W∥2/m) ∥z̃⋆ − z⋆∥2 ≤ (∥∆W∥2/m) ∥z⋆∥2, which yields (2) since ∥∆W∥2 < m.\nCorollary 3 gives a global bound: the relative displacement is at most ∥∆W∥2/(m − ∥∆W∥2), which depends only on the perturbation size and margin. For example, at 8 bits (∥∆W∥2 = 0.035, m = 0.227), the bound gives 18%; the empirical relative error is much smaller (Section V). As ∥∆W∥2 → 0, the bound linearizes to ∥∆W∥2/m, recovering the condition number scaling of Theorem 4.\nSince ∥∆W∥2 ≤ εW, a sufficient pre-deployment check is εW < m, or equivalently ηW < m/∥W∥2 in relative terms. Moreover, a single check (m̃ > 0) guarantees convergence of both the forward and backward passes (Theorem 5).\nFor feedforward networks, the computed output is the exact output of a network with perturbed weights [14], but the error accumulates through L layers as O(Lu). For MonDEQs, contractivity bounds the error regardless of iteration count: the quantized equilibrium z̃⋆ is exact for the perturbed operator I − W̃, and the displacement is controlled by the condition number.\nThe results so far establish convergence and displacement bounds for the forward pass, the computation of the equilibrium z̃⋆. For training, however, we also need the backward pass (implicit differentiation through the equilibrium) to converge. The following subsection shows that the backward inclusion has the same linear part I − W as the forward problem, so it inherits the same margin, Lipschitz constant, and convergence guarantees.\nBackward Pass Under Quantization\nTraining a MonDEQ requires computing gradients of the loss with respect to the parameters ϑ = (W, U, b), which involves implicit differentiation through the equilibrium condition 0 ∈ F(z⋆; ϑ) + G(z⋆). Differentiating with respect to a scalar parameter component ϑ yields a backward inclusion whose linear part is I − W, the same operator that governs the forward problem [5]. More precisely, the backward sensitivity p := dz⋆/dϑ solves 0 ∈ (I − W)p − r + Gb(p), where r := (dW/dϑ) z⋆ + (dU/dϑ) x + db/dϑ and Gb ∈ ∂CG(z⋆) is a Clarke generalized Jacobian with sym(Gb) ⪰ 0 [22]. Since the margin m and Lipschitz constant L are determined entirely by I − W, the backward pass inherits the same convergence guarantees as the forward pass. The following theorem shows that this structure is preserved under weight quantization.\nThe sensitivity of the equilibrium to small weight perturbations is captured by the condition number [10], [21].\nTheorem 4. For an unquantized MonDEQ with margin m > 0, the absolute condition number\nκabs := lim sup_{∥∆W∥2 → 0} ∥z⋆(W̃) − z⋆(W)∥2 / ∥∆W∥2\nsatisfies κabs ≤ ∥z⋆∥2/m.",
    "paper_id": "2603.10562",
    "title": "Quantization Robustness of Monotone Operator Equilibrium Networks",
    "authors": [
      "James Li",
      "Philip H. W. Leong",
      "Thomas Chaffey"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10562v1",
    "chunk_index": 11,
    "total_chunks": 23,
    "char_count": 3339,
    "word_count": 546,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "61e4cc3d-1019-4fc4-89e3-278817076143",
    "text": "Proof. From Theorem 3, ∥z⋆ − z̃⋆∥2 ≤ (∥∆W∥2/m) ∥z̃⋆∥2. By Corollary 3, ∥z̃⋆∥2 ≤ ∥z⋆∥2/(1 − ∥∆W∥2/m), so ∥z⋆ − z̃⋆∥2 / ∥∆W∥2 ≤ ∥z⋆∥2/(m − ∥∆W∥2). Taking ∥∆W∥2 → 0 gives κabs ≤ ∥z⋆∥2/m.\nIn words, the equilibrium's sensitivity to weight perturbation is governed by the ratio of its magnitude to the monotonicity margin. The corresponding relative condition number, which measures sensitivity to relative perturbations, is κrel = κabs · ∥W∥2/∥z⋆∥2 ≤ ∥W∥2/m.\nTheorem 5. Let W̃ = W + ∆W with ∥∆W∥2 < m, and let z̃⋆ solve 0 ∈ F̃(z̃⋆; ϑ) + G(z̃⋆). Define p̃ := dz̃⋆/dϑ and r̃ := (dW̃/dϑ) z̃⋆ + (dU/dϑ) x + db/dϑ, and let G̃b ∈ ∂CG(z̃⋆) with sym(G̃b) ⪰ 0. Then p̃ solves\n0 ∈ (I − W̃)p̃ − r̃ + G̃b(p̃), (3)\nand the splitting method converges to p̃ with the perturbed parameters (m̃, L̃) from Theorem 2. In particular, if the forward pass converges (m̃ > 0), then the backward pass also converges with the same contraction modulus; a single margin check suffices for both passes.\nProof. Differentiating F̃(z̃⋆; ϑ) + g̃⋆ = 0 with respect to ϑ yields (3). The backward operator (I − W̃)p̃ − r̃ has the same linear part as F̃, so it inherits the same (m̃, L̃) from Theorem 2. Since G̃b is monotone by assumption, the same splitting method converges.\nTheorem 5 validates quantization-aware training (QAT): whenever the forward pass converges under quantized weights, gradients can be computed at the same precision and with the same iteration budget. No additional solver resources are required for the backward pass.\nThe gradient error under quantization has two sources: the displaced equilibrium (z⋆ → z̃⋆) and the perturbed weight matrix (W → W̃). By Theorem 5, the backward sensitivity p̃ solves a monotone inclusion with the same linear operator (I − W̃), so the backward equilibrium exists and can be computed by the same splitting method. Since both sources introduce perturbations of size O(∥∆W∥2) (the weight perturbation directly, and the equilibrium displacement via Theorem 3), the chain rule gives ∂ℓ/∂W̃ − ∂ℓ/∂W = O(∥∆W∥2).\n[Figure 1: iterations to convergence and final residual vs. ∥∆W∥2/m for bit-widths 3b–32b; dashed line at ∥∆W∥2/m = 1; axis-tick residue removed.]\nMargin stability certificate.",
    "paper_id": "2603.10562",
    "title": "Quantization Robustness of Monotone Operator Equilibrium Networks",
    "authors": [
      "James Li",
      "Philip H. W. Leong",
      "Thomas Chaffey"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10562v1",
    "chunk_index": 12,
    "total_chunks": 23,
    "char_count": 2163,
    "word_count": 380,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b4f3cc6f-3ddb-4a6d-943e-a7661b307500",
    "text": "Fig. 1. Iterations to convergence (top) and final residual (bottom) vs. normalized perturbation ∥∆W∥2/m; each point is one bit-width (3–32 bits). The dashed line marks the sufficient condition ∥∆W∥2/m = 1.\nV. NUMERICAL EXPERIMENTS\nWe validate the theoretical predictions of Section IV on",
    "paper_id": "2603.10562",
    "title": "Quantization Robustness of Monotone Operator Equilibrium Networks",
    "authors": [
      "James Li",
      "Philip H. W. 
Leong",
      "Thomas Chaffey"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10562v1",
    "chunk_index": 13,
    "total_chunks": 23,
    "char_count": 281,
    "word_count": 43,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "231fe232-de8b-44bb-aeee-2fcf07cbfad2",
    "text": "Circles: converged (relative residual < 10−5); crosses: did not converge within 2000 iterations.\na single-layer MonDEQ with n = 100 hidden units, trained on MNIST using Adam (lr = 10−3, 15 epochs, step decay γ = 0.1 at epoch 10). Unlike [5], which fixes m as a hyperparameter, we treat m as learnable via a softplus reparameterization ensuring m > 0. The trained model achieves 98.22% test accuracy with margin m = 0.227, Lipschitz constant L = 1.845, and condition number κ = L/m = 8.13. Post-training quantization (PTQ) applies symmetric uniform quantization with step size ∆ = 2^(1−b) and per-tensor scaling to the weight matrix W, without calibration or bias correction [1], [11]. Quantization-aware training (QAT) retrains from random initialization with the same architecture and hyperparameters, using a straight-through estimator to pass gradients through the quantizer [11]. In both cases, the deployed model uses W̃ = Q(W), so the perturbation model of Definition 2 applies.\nFig. 2. Displacement bound validation (Theorem 3) at 6, 8, 12, and 16 bits. Each point is one test sample (x-axis: theoretical bound (∥∆W∥2/m) ∥z̃⋆∥2; y-axis: empirical displacement ∥z̃⋆ − z⋆∥2). Points",
    "paper_id": "2603.10562",
    "title": "Quantization Robustness of Monotone Operator Equilibrium Networks",
    "authors": [
      "James Li",
      "Philip H. W. 
Leong",
      "Thomas Chaffey"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10562v1",
    "chunk_index": 14,
    "total_chunks": 23,
    "char_count": 1315,
    "word_count": 213,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "43761f40-8918-4166-ac1d-809c4d1a5bd7",
    "text": "below the dashed line (y = x) satisfy the bound.\nThe forward–backward solver terminates when the relative residual falls below 10−5 or after 2000 iterations.\nMargin stability certificate. Figure 1 tests the convergence condition ∥∆W∥2 < m from Theorem 2 across bit-widths from 3 to 32 bits. The transition from non-convergence to convergence aligns with the predicted thresh-\nQAT vs. PTQ. Theorem 5 guarantees that the backward solve converges whenever the forward solve does; this makes QAT well-defined, since it requires differentiating through the equilibrium.",
    "paper_id": "2603.10562",
    "title": "Quantization Robustness of Monotone Operator Equilibrium Networks",
    "authors": [
      "James Li",
      "Philip H. W. Leong",
      "Thomas Chaffey"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10562v1",
    "chunk_index": 15,
    "total_chunks": 23,
    "char_count": 562,
    "word_count": 86,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f76be2e0-eee0-440a-9d54-2de13fc6b719",
    "text": "old ∥∆W∥2/m = 1: 3-bit (∥∆W∥2/m = 5.36) and 4-bit (2.66) fail to converge, while 5-bit and above converge. The 5-bit case (∥∆W∥2/m = 1.25) illustrates that the condition is sufficient but not necessary: the actual margin m̃ = 0.045 > 0, so the quantized operator remains strongly monotone and the solver converges despite the sufficient condition being violated. The iteration count reflects the degraded margin: 5-bit requires ∼1730 iterations (near the 2000 cap), while 8-bit converges in ∼450. At 8 bits, weight storage is reduced 4× compared with single-precision floating-point, with negligible accuracy change (98.24% vs. 98.22%).\nFigure 3 compares PTQ and QAT at 4, 6, and 8 bits. At 4 bits, PTQ fails (m̃ = −0.142). QAT succeeds by learning weights that satisfy m̃ = 0.006 > 0 (Figure 3, right), achieving 96.78% accuracy, though at the cost of a smaller margin (m = 0.184 vs. 0.227). At 6 and 8 bits, both methods converge, with PTQ achieving slightly higher accuracy (98.25% and 98.29%) because it inherits the larger float margin.\nDisplacement bound validation. The preceding experiments test convergence; we now test the accuracy of the converged equilibrium. Theorem 3 bounds the displacement ∥z̃⋆ − z⋆∥2 ≤ (∥∆W∥2/m) ∥z̃⋆∥2.\n[Figure 3 residue: test accuracy (%) and ∥∆W∥2/m vs. bit depth for PTQ and QAT; threshold legend removed.]\nREFERENCES\n[1] M.",
    "paper_id": "2603.10562",
    "title": "Quantization Robustness of Monotone Operator Equilibrium Networks",
    "authors": [
      "James Li",
      "Philip H. W. Leong",
      "Thomas Chaffey"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10562v1",
    "chunk_index": 16,
    "total_chunks": 23,
    "char_count": 1319,
    "word_count": 220,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5ba365b1-b871-4c86-b3e4-a45da36947fe",
    "text": "Nagel, M. Fournarakis, R. A. Amjad, Y. Bondarenko, M. van Baalen, and T. Blankevoort, \"A white paper on neural network quantization,\" 2021, arXiv:2106.08295.\n[2] Y. Zhang, F. Song, and J. Sun, \"QEBVerif: Quantization Error Bound Verification of Neural Networks,\" in Computer Aided Verification. Cham: Springer Nature Switzerland, 2023, pp. 413–437.\n[3] A. Cohen, \"Quantization with Guaranteed Floating-Point Neural Network Classifications,\" Proc. OOPSLA2, pp. 340:1893–340:1920, Oct. 2025.\nFig. 3. QAT vs. PTQ at 4, 6, and 8 bits. Left: test accuracy (%; a red X indicates PTQ non-convergence at 4 bits).\n[4] M. Fu and L. Xie, \"The sector bound approach to quantized feedback",
    "paper_id": "2603.10562",
    "title": "Quantization Robustness of Monotone Operator Equilibrium Networks",
    "authors": [
      "James Li",
      "Philip H. W. Leong",
      "Thomas Chaffey"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10562v1",
    "chunk_index": 17,
    "total_chunks": 23,
    "char_count": 658,
    "word_count": 103,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "1439f04c-4303-4f38-9c8f-6d770a5e48fb",
    "text": "control,\" IEEE Transactions on Automatic Control, vol. 50, no. 11, pp. 1698–1711, 2005.\n[5] E. Winston and J. Z. Kolter, \"Monotone operator equilibrium networks,\" in Advances in Neural Information Processing Systems, vol. 33, 2020.\n[6] S. Bai, J. Z. Kolter, and V. Koltun, \"Deep equilibrium models,\" in Advances in Neural Information Processing Systems, vol. 32, 2019.\n[7] M. Revay, R. Wang, and I. R. Manchester, \"Lipschitz bounded equilibrium networks,\" 2020, arXiv:2010.01732.\n[8] ——, \"Recurrent Equilibrium Networks: Flexible Dynamic Models With Guaranteed Stability and Robustness,\" IEEE Transactions on Automatic Control, vol. 69, no. 5, pp. 2855–2870, May 2024.\nRight: ∥∆W∥2/m; the dashed line marks ∥∆W∥2/m = 1.\nSince the backward solver terminates at finite tolerance, the computed equilibrium approximates but does not equal the true z̃⋆; the theorem applies to the latter. Figure 2 evaluates the bound on 2,560 randomly sampled test inputs at 6, 8, 12, and 16 bits. The bound is satisfied in 99.1% (6-bit) to 91.3% (16-bit) of samples, with the empirical displacement 3–5× below the bound on average. The violation rate increases\n[9] T.",
    "paper_id": "2603.10562",
    "title": "Quantization Robustness of Monotone Operator Equilibrium Networks",
    "authors": [
      "James Li",
      "Philip H. W. 
Leong",
      "Thomas Chaffey"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10562v1",
    "chunk_index": 18,
    "total_chunks": 23,
    "char_count": 1068,
    "word_count": 160,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2872d488-970c-4d98-b1b4-0fc50cc16d11",
    "text": "Chaffey, \"Circuit realization and hardware linearization of monotone operator equilibrium networks,\" Sep. 2025, arXiv:2509.13793.\nat higher bit-widths because ∥∆W∥2 shrinks and the bound tightens, while the absolute solver error from finite tolerance remains roughly constant. Corollary 3 gives a relative bound: at 16 bits, ∥∆W∥2/(m − ∥∆W∥2) = 0.057%; at 6 bits, this rises to 154%, which is vacuous; the relative bound becomes non-trivial around 8 bits.\nVI. CONCLUSIONS\nWe have analyzed the effect of weight quantization on monotone operator equilibrium networks through spectral perturbation of the monotone inclusion. The monotonicity margin m emerges as the single quantity governing robustness to quantization: convergence of the forward and backward solvers is guaranteed provided ∥∆W∥2 < m (Theorem 2), the equilibrium displacement satisfies ∥z̃⋆ − z⋆∥2 ≤ (∥∆W∥2/m) ∥z̃⋆∥2 (Theorem 3), and the relative condition number κrel = ∥W∥2/m links bit-width to forward error (Theorem 4). Experiments confirm a phase transition at the\n[10] N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed. SIAM, 2002.\n[11] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, \"Quantization and training of neural networks for efficient integer-arithmetic-only inference,\" in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2704–2713.\n[12] J. Eckstein and D. P. Bertsekas, \"On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators,\" Mathematical Programming, vol. 55, no. 1, pp. 293–318, Apr. 1992.\n[13] P. L. Combettes and J.-C. Pesquet, \"Proximal Splitting Methods in Signal Processing,\" in Fixed-Point Algorithms for Inverse Problems in Science and Engineering. New York, NY: Springer, 2011, pp. 185–212.\n[14] T. Mary, \"Deterministic and probabilistic rounding error analysis of neural networks in floating-point arithmetic,\" IMA Journal of Numerical Analysis, 2025.\n[15] J. A. Jonkman, T. Sherson, and R. Heusdens, \"Quantisation Effects in Distributed Optimisation,\" in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 3649–3653.\n[16] C.",
    "paper_id": "2603.10562",
    "title": "Quantization Robustness of Monotone Operator Equilibrium Networks",
    "authors": [
      "James Li",
      "Philip H. W. Leong",
      "Thomas Chaffey"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10562v1",
    "chunk_index": 19,
    "total_chunks": 23,
    "char_count": 2112,
    "word_count": 290,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "61eb1e91-ef83-43ec-adc7-c9243c34cb00",
    "text": "Pabbaraju, E. Winston, and J. Z. Kolter, \"Estimating Lipschitz constants of monotone deep equilibrium models,\" in International Conference on Learning Representations, 2021.\n[17] E. K. Ryu and S. Boyd, \"A primer on monotone operator methods,\" Applied and Computational Mathematics, vol. 15, no. 1, pp. 3–43, 2016.\n[18] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, ser. CMS Books in Mathematics. Cham: Springer International Publishing, 2017.\npredicted threshold and show the displacement bound holds in 91–99% of test samples with a conservative factor of 3–5×. Quantization-aware training recovers convergence at 4 bits where post-training quantization fails, enabled by the backward-pass guarantee of Theorem 5.\nThe present analysis is limited to uniform symmetric quantization of a single-layer MonDEQ.",
    "paper_id": "2603.10562",
    "title": "Quantization Robustness of Monotone Operator Equilibrium Networks",
    "authors": [
      "James Li",
      "Philip H. W. Leong",
      "Thomas Chaffey"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10562v1",
    "chunk_index": 20,
    "total_chunks": 23,
    "char_count": 791,
    "word_count": 109,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f8f640f4-50e8-42d9-b0c6-a94349a5d4a5",
    "text": "Natural extensions include per-channel and mixed-precision schemes, multi-layer architectures, and margin-aware regularization during quantization-aware training to enforce a target bit-width a priori. An important open question is whether the behavioral\n[19] A. Rockafellar, \"Radius Theorems for Monotone Mappings,\" Set-Valued and Variational Analysis, vol. 27, no. 3, pp. 605–621, Sep. 2019.\n[20] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge: Cambridge University Press, 1985.\n[21] T.",
    "paper_id": "2603.10562",
    "title": "Quantization Robustness of Monotone Operator Equilibrium Networks",
    "authors": [
      "James Li",
      "Philip H. W. Leong",
      "Thomas Chaffey"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10562v1",
    "chunk_index": 21,
    "total_chunks": 23,
    "char_count": 481,
    "word_count": 63,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "47ed03b5-8e77-48f5-b21a-8569937ddf74",
    "text": "Beuzeville, \"Backward error analysis of artificial neural networks with applications to floating-point computations and adversarial attacks,\" Ph.D. dissertation, Université de Toulouse, 2024.\nguarantees of MonDEQ-based controllers [7], [8] remain valid under weight quantization; the present perturbation bounds are a first step toward such results.\n[22] P. L. Combettes and J.-C. Pesquet, \"Deep neural network structures
solving variational inequalities,\" Set-Valued and Variational Analysis, vol. 28, pp. 491–518, 2020.\nGenerative AI was used to assist with the experimentation code, finding references, and checking for grammatical",
    "paper_id": "2603.10562",
    "title": "Quantization Robustness of Monotone Operator Equilibrium Networks",
    "authors": [
      "James Li",
      "Philip H. W. Leong",
      "Thomas Chaffey"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10562v1",
    "chunk_index": 22,
    "total_chunks": 23,
    "char_count": 612,
    "word_count": 79,
    "chunking_strategy": "semantic"
  }
]
\ No newline at end of file
diff --git a/data/chunks/2603.10563_semantic.json b/data/chunks/2603.10563_semantic.json
new file mode 100644
index 0000000000000000000000000000000000000000..b73b9ab9daaff2f0d97e0ed6f33c20c9dfd611fd
--- /dev/null
+++ b/data/chunks/2603.10563_semantic.json
@@ -0,0 +1,306 @@
[
  {
    "chunk_id": "fb687547-cdbb-4983-b063-05b23ca12d22",
    "text": "Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation\nViktorija Poļaka1, Ivo Pascal de Jong1, Andreea Ioana Sburlea1\n1Faculty of Science and Engineering, University of Groningen, Groningen, The Netherlands\nE-mail: victoria_polaka@proton.me",
    "paper_id": "2603.10563",
    "title": "Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation",
    "authors": [
      "Viktorija Poļaka",
      "Ivo Pascal de Jong",
      "Andreea Ioana Sburlea"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10563v1",
    "chunk_index": 0,
    "total_chunks": 16,
    "char_count": 270,
    "word_count": 30,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "59332693-6e96-4dd4-87da-1f48ae88f6cc",
    "text": "ABSTRACT: This paper addresses the challenge of generating synthetic electroencephalogram (EEG) covariance matrices for motor imagery brain-computer interface (MI-BCI) applications. We aim to develop a generative model capable of producing high-fidelity synthetic covariance matrices while preserving their symmetric positive-definite nature. We propose a Riemannian geometry-preserving variational autoencoder (RGP-VAE) integrating geometric mappings with a composite loss function combining Riemannian distance, tangent space reconstruction accuracy and generative diversity. The model generates valid, representative EEG covariance matrices, while learning a subject-invariant latent space. Synthetic data proves practically useful for MI-BCI, with its impact depending on the paired classifier. This work introduces and validates the RGP-VAE as a geometry-preserving generative model for EEG covariance matrices, highlighting its potential for signal privacy, scalability and data augmentation.\nand generate diverse yet plausible synthetic samples that extend beyond the convex hull. However, standard VAEs assume Euclidean geometry, creating a conflict when working with the Riemannian SPD manifold structure of EEG covariance matrices; applying standard Euclidean operations on this curved manifold causes geometric distortions (e.g., the \"swelling effect\") [5]. We address this by proposing the Riemannian geometry-preserving VAE (RGP-VAE) designed to preserve geometric integrity, utilizing parallel transport [8] to align data and thus enable the model to learn subject-invariant features. The focus is specifically on the challenging and practically relevant problem of cross-subject generalization, with the aim to reduce the need for extensive calibration [1]. Accordingly, this paper aims to: (1) establish if a Riemannian geometry-preserving VAE can generate valid synthetic EEG covariance matrices, and (2) evaluate whether this synthetic data improves cross-subject MI-BCI performance. 
METHODS\nINTRODUCTION\nData and Preprocessing: We use the dataset from\nWhile Riemannian geometry-based classifiers currently Faller et al. [9], containing 13-channel EEG recordings\ndominate MI-BCI competitions, their advancement to- from 12 subjects performing a two-class motor imagery\nwards mainstream applications is hindered by data task (right hand versus both feet). The data loading procescarcity and inter-subject variability, which necessitates dure, using the \"Mother of All BCI Benchmarks\" framelengthy calibration sessions [1–4]. Deep learning alterna- work [10], resulted in a total of 5572 trials across the 12\ntives have yet to surpass these geometric pipelines, possi- subjects (398 or 597 trials per individual).\nbly explained by the limited availability of subject-level The EEG trials are bandpass filtered (8–30 Hz) to capture\ndata [3]. To overcome these limitations, we propose a sensorimotor rhythms.", + "paper_id": "2603.10563", + "title": "Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation", + "authors": [ + "Viktorija Poļaka", + "Ivo Pascal de Jong", + "Andreea Ioana Sburlea" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10563v1", + "chunk_index": 1, + "total_chunks": 16, + "char_count": 2959, + "word_count": 401, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0ed18b3c-bc2a-42a3-b833-3e242210d5f7", + "text": "The raw voltage signals were thenarXiv:2603.10563v1\nnovel data augmentation framework tailored to the spe- scaled to microvolts (106). To address non-stationarity,\ncific geometric properties of EEG covariance matrices, exponential moving standardization (EMS) [11] is apwhich are symmetric positive-definite (SPD), i.e., sym- plied. EMS can be seen applied to data for training deep\nmetric with strictly positive eigenvalues. 
learning models, as these models tend to be sensitive to the input scale [12]. Finally, trials are converted into spatial covariance matrices in R13×13 using the oracle approximating shrinkage estimator [13], yielding well-conditioned SPD matrices.\nPrevious work exploring data augmentation directly on the SPD manifold by geometrically interpolating between existing covariance matrices of the same class has successfully boosted BCI classification accuracy in data-scarce scenarios for SSVEP and ERP tasks [5]. However, this approach is fundamentally limited to the convex hull of the original data and thus cannot generate plausible variations that exist in unexplored regions of the manifold. A variational autoencoder (VAE) [6], which can learn a latent representation of a manifold [7], may offer an alternative to overcome this convex-hull limitation.\nTo address the inherent variability between individuals' EEG signal characteristics, which manifests as geometric differences in their location on the Riemannian manifold, parallel transport [8] was applied. This technique geometrically transports matrices from each subject-specific reference mean to a global (class) reference mean via a congruence transformation. Figure 1: An overview of the proposed RGP-VAE, illustrating the integration of a standard VAE with geometric operations on the SPD manifold. An input SPD matrix Xi is first projected onto the tangent space at a reference point Pref using the logarithmic map logPref (Eq. 3). This tangent representation Si is then vectorized to serve as the encoder input Htangent. The encoder maps this input to a latent distribution parameterized by µ and log(σ2), from which a latent vector zi is sampled and passed to the decoder to produce the reconstructed vector Hdecoded. The vector is unvectorized back into a tangent space representation ˆSi, which is finally mapped back onto the SPD manifold via the exponential map (expPref) (Eq. 
4) to produce the reconstructed SPD matrix ˆXi. Model Architecture: Conceptually building on prior rithmic map [1, 2]:\nwork on Riemannian variational autoencoders for\nP1/2ref log P−1/2ref XiP−1/2ref P1/2ref . (3)manifold-valued data [14], the modified VAE (Fig. 1) Si = logPref(Xi) =\nlearns a latent representation z from SPD matrices by\nThis is implemented via batched whitening followed bybridging the curved manifold M and the Euclidean space\nthe matrix logarithm to support numerical stability byrequired by neural networks. The manifold of symmetric\ncentring operations around the identity.positive-definite matrices is defined as M = {X ∈RN×N | The resulting batch S = {S1,...,SB} consists of sym-X = X⊤,X ≻0}, where N is the number of EEG channels\nmetric matrices in the tangent space at Pref and each Siand X ≻0 indicates that the matrix is composed of strictly\nis vectorized by using only the upper-triangular elementspositive eigenvalues [2].", + "paper_id": "2603.10563", + "title": "Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation", + "authors": [ + "Viktorija Poļaka", + "Ivo Pascal de Jong", + "Andreea Ioana Sburlea" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10563v1", + "chunk_index": 2, + "total_chunks": 16, + "char_count": 3334, + "word_count": 497, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "54504e78-a399-419c-be42-b453628ea935", + "text": "The proposed architecture reforming the batch of vectors Htangent ∈RB×Dspd (withlies on a class-specific reference point Pref, calculated as\nDspd = N(N +1)/2) as input to the encoder.the Riemannian Fréchet mean of the training class. 
Unlike the arithmetic mean, which may not yield a valid SPD matrix, the Fréchet mean guarantees a valid point on the manifold by minimizing the sum of squared Riemannian distances to all other matrices in a set {Xi} (i = 1, …, Mset):\nG = argmin_{P∈M} Σ_{i=1}^{Mset} dr²(P, Xi) (1)\nwhere P is the candidate SPD matrix over which the minimization occurs and dr(P, Xi) is the affine-invariant Riemannian metric (AIRM) [5, 15, 16], which defines the distance between two SPD matrices as:\ndr(P1, P2) = ∥log(P1^(−1/2) P2 P1^(−1/2))∥F (2)\nThe model processes a batch of aligned SPD covariance matrices X = {X1,...,XB}, with batch size B = 128, selected as a balance between computational load and the requirements for the diversity loss.\nThe encoder maps this batch of vectors to the parameters (M, logσ2) of the latent distribution, where M = {µ1,...,µB} and logσ2 = {logσ21,...,logσ2B}. The encoder consists of five sequential blocks (linear → batch normalization → LeakyReLU [17]) with dimensions Dspd → 32 → 64 → 16 → 32 → 64, followed by two separate linear projections to produce µi, logσ2i ∈ RDlat, where Dlat = 64. Batch normalization stabilizes training by reducing internal covariate shift [18], while LeakyReLU activations are used to preserve the network's representational capacity for tangent space vectors, preventing permanently inactive neurons [17]. The batch of latent vectors Z = {z1,...,zB} ∈ RB×Dlat is sampled via the reparameterization trick: zi = µi + εi ⊙ exp(0.5·logσ2i), where εi ∼ N(0, I), allowing gradients to flow back through M and logΣ2 during training.\nThe decoder MLP mirrors the encoder structure, mapping the batch of latent vectors Z back to a batch of decoded 
vectors Hdecoded ∈ R^(B×Dspd), which is subsequently unvectorized into a batch of symmetric matrices ˆS′ ∈ R^(B×N×N). The decoder output ˆS′i is explicitly re-symmetrized via ˆS′′i = (ˆS′i + (ˆS′i)^T)/2 to eliminate any asymmetries. Each Xi is projected to the tangent space (a local Euclidean approximation) at the class-specific reference point Pref using the logarithmic map. To return to the manifold, we apply the Exponential Map to each matrix:\nˆXi = expPref(ˆS′i) = Pref^(1/2) exp(Pref^(-1/2) ˆS′i Pref^(-1/2)) Pref^(1/2) (4)",
    "paper_id": "2603.10563",
    "title": "Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation",
    "authors": [
      "Viktorija Poļaka",
      "Ivo Pascal de Jong",
      "Andreea Ioana Sburlea"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10563v1",
    "chunk_index": 3,
    "total_chunks": 16,
    "char_count": 2378,
    "word_count": 366,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0a072d68-b509-4019-a4ee-930d75874df3",
    "text": "Numerical instability caused by floating-point arithmetic can violate the strict SPD constraints, therefore validity is enforced throughout the model architecture and parallel transport. During the matrix exponential computation, eigenvalues are conditionally scaled (threshold T = 20) to prevent overflow: if λmax > T, all eigenvalues are scaled by T/λmax. Throughout all geometric operations, we maintain a numerical threshold ε = 10^(-6). If the minimum eigenvalue λmin of any intermediate or output matrix falls below ε, we add (ε − λmin)I to shift all eigenvalues above the threshold, ensuring positive-definiteness.\nFigure 2: 2D UMAP visualization of the latent space of the RGP-VAE for right-hand movement data. Points are colored by Subject ID; their significant overlap indicates the learning of a subject-invariant representation.\nTraining and Optimization: The network is optimized using the AdamW optimizer with a learning rate of 1 × 10^(-4) and a weight decay parameter of 1 × 10^(-6).
The loss function Ltotal balances reconstruction accuracy, latent space regularization, and diversity:\nLtotal = (Lmanifold + Ltangent) + βLKL + γLdiversity (5)\nThe reconstruction term combines Lmanifold, which enforces geometric fidelity using the AIRM distance (Eq. 2):\nLmanifold = (1/B) Σ_(i=1)^(B) dr(Xi, ˆXi) (6)\nand Ltangent, which minimizes the normalized Euclidean error between original and decoded tangent vectors:\nLtangent = (1/B) Σ_(i=1)^(B) [Σj (h_decoded,i,j − h_tangent,i,j)² / (Σj h²_tangent,i,j + ε)] (7)\nwhere ε = 10^(-6) for numerical stability. Meanwhile, latent space regularization is achieved through KL divergence LKL toward a standard Gaussian prior:\nLKL = (1/B) Σ_(i=1)^(B) [−0.5 Σ_(k=1)^(Dlat) (1 + logσ²i,k − µ²i,k − exp(logσ²i,k))] (8)\nTraining is further regularized with gradient clipping (max norm=1.0) and learning rate reduction (factor=0.5) after 20 epochs of stagnation.\nData Generation and Evaluation Protocol: leave-one-subject-out cross-validation (LOSO-CV) is employed; in each fold, class-specific RGP-VAEs are trained on aligned data from N − 1 subjects to test generalization to unseen individuals. Two synthetic generation strategies are evaluated: Posterior sampling encodes each training matrix Xi, samples zi via reparameterization, and decodes to create variations preserving core characteristics of each sample (1:5 real-to-synthetic ratio). Prior sampling draws z ∼ N(0, I) directly to generate novel samples beyond the training convex hull (5000 per class). Three classifiers, minimum distance to mean (MDM), k-nearest neighbors (KNN), and support vector classifier (SVC), are trained and evaluated on held-out test subjects under three conditions: (1) baseline using only original training data, (2) augmented with synthetic data, and (3) synthetic-only training to assess standalone quality. Balanced accuracy, averaged across all folds, serves as the primary metric due to its robustness against class imbalances from potential artifact removal.
We apply KL cost annealing [19], linearly increasing β from 0.0001 to 0.2 during training to prevent posterior collapse while maintaining reconstruction fidelity.\nThe diversity loss Ldiversity encourages sample diversity by maximizing the geometric volume of generated tangent vectors. Since the determinant of a covariance matrix quantifies the generalized variance (i.e., the volume spanned by data points), maximizing it promotes wider spatial coverage in the tangent space.\nSynthetic data quality is assessed by verifying SPD properties (symmetry and positive-definiteness), comparing statistical variance (element-wise and global) between real and synthetic matrices, and measuring geometric spread via mean pair-wise Riemannian distances within each class. A scrambled-label diagnostic test confirms that performance degrades to chance level, indicating no spurious correlations.",
    "paper_id": "2603.10563",
    "title": "Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation",
    "authors": [
      "Viktorija Poļaka",
      "Ivo Pascal de Jong",
      "Andreea Ioana Sburlea"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10563v1",
    "chunk_index": 4,
    "total_chunks": 16,
    "char_count": 3789,
    "word_count": 535,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "bfa54a00-0fe6-48a4-ac50-9008238416ab",
    "text": "The loss minimizes the negative log-determinant of the batch covariance of decoded tangent space vectors Hdecoded ∈ R^(B×Dspd):\nLdiversity = −logdet(Cov(Hdecoded^T) + εcov I) (9)\nwith εcov = 10^(-6) for numerical stability, weighted by an empirically determined γ = 0.035. Source code and additional information can be found at https://641e16.github.io/RGP-VAE/.\nRESULTS
The AdamW optimizer [20] is employed for a fixed 100 epochs with an empirically found learning rate of 1 × 10^(-4).\nFigure 3: Distribution of accuracy improvement for each classifier using the prior generator. The plot shows the percentage point difference between the 'Augmented' and 'Synthetic-Only' conditions relative to the 'Baseline' across all subjects. The red line signifies the mean whilst the blue line is the median.\nWe validate the proposed RGP-VAE through an assessment of the generated synthetic data fidelity, addressing the fundamental question of whether the model can pro",
    "paper_id": "2603.10563",
    "title": "Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation",
    "authors": [
      "Viktorija Poļaka",
      "Ivo Pascal de Jong",
      "Andreea Ioana Sburlea"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10563v1",
    "chunk_index": 5,
    "total_chunks": 16,
    "char_count": 943,
    "word_count": 138,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "1efbdc95-0796-418d-83ef-8d29b7229fa4",
    "text": "duce valid and realistic covariance matrices, and with a comparison against a standard VAE approach. A final analysis quantifies the impact of this data on cross-subject classification performance to determine the practical value of the proposed method.\nFigure 4: Distribution of accuracy improvement for each classifier using the posterior generator, showing similar trends to the prior generator but with more pronounced fluctuations.\nThe chosen γ maintained statistical variance close to the original. To address the lower geometric diversity of synthetic samples, the noise vector is scaled by εi = 2.2 during generation, increasing the mean intra-class Riemannian distance to ≈1.95, closely matching the original data's spread (2.03) without distorting statistical properties.\nLatent Space Structure: UMAP [21] visualization (Fig.
2) reveals that latent codes organize into a unified structure where subjects are heavily intermingled rather than clustered by individual. This suggests the model learned a largely subject-invariant representation, a critical property enabled by parallel transport alignment, implying generated samples will reflect generalized task patterns rather than subject-specific details.\nFidelity Assessment: Across all folds, 100% of synthetic matrices from both prior and posterior generators passed symmetry and positive-definiteness verification checks, confirming the effectiveness of the architecture's geometric constraints and numerical stabilisation steps. Tab. 1 compares the statistical variance and geometric spread of synthetic data relative to the original.\nCross-Subject Classification Performance: The impact of data augmentation was evaluated by comparing classification accuracies under different augmentation conditions using Wilcoxon signed-rank tests with Bonferroni correction (p < 0.0083). As detailed in Tab. 2, data augmentation produced divergent effects. For the KNN classifier, augmentation consistently and significantly improved performance. Posterior-based synthetic-only training yielded the largest gain (+3.49%, p = 0.002), while augmented training provided +2.45% (p = 0.002). Prior generation produced similar but slightly smaller significant benefits (+3.00% synthetic-only, p < 0.001; +2.19% augmented, p = 0.003). In contrast, SVC perfor
The cho- mance significantly degraded with augmentation (up to", + "paper_id": "2603.10563", + "title": "Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation", + "authors": [ + "Viktorija Poļaka", + "Ivo Pascal de Jong", + "Andreea Ioana Sburlea" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10563v1", + "chunk_index": 6, + "total_chunks": 16, + "char_count": 2357, + "word_count": 314, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fcfa3616-c017-4919-ad70-a31bc8fc73e0", + "text": "Table 1: Fidelity analysis of synthetic data averaged across 12 folds. The table compares the statistical variance ratio and the mean\nintra-class Riemannian distance, showing that the synthetic data distribution is valid. Generator Statistical Variance Geometric Diversity Original Synthetic (Ratio) Original Synthetic Prior 0.208 0.221 (1.061) 2.032 1.946\nPosterior 0.208 0.221 (1.063) 2.032 1.918 Table 2: Average balanced accuracy (%) across 12 subjects for all training conditions and generators with corresponding p-values. Baseline Augmented Scenario Synthetic-Only Scenario\nGenerator Classifier\nAcc. (%) Acc. (%) Improvement p-value Acc. 
(%) Improvement p-value", + "paper_id": "2603.10563", + "title": "Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation", + "authors": [ + "Viktorija Poļaka", + "Ivo Pascal de Jong", + "Andreea Ioana Sburlea" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10563v1", + "chunk_index": 7, + "total_chunks": 16, + "char_count": 668, + "word_count": 89, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9ff79d4f-a2b8-45ee-9da9-5218bd15bcd5", + "text": "MDM 59.52±5.52 58.92±5.40 -0.59% 0.092 58.36±5.03 -1.16% 0.043\nPrior KNN 53.19±4.00 55.38±4.17 +2.19% 0.003 56.19±4.19 +3.00% < 0.001\nSVC 60.67±5.33 57.43±6.32 -3.24% 0.016 56.75±6.37 -3.92% 0.002 MDM 59.52±5.52 58.83±5.29 -0.69% 0.092 58.95±5.51 -0.57% 0.151\nPosterior KNN 53.19±4.00 55.64±4.13 +2.45% 0.002 56.68±4.06 +3.49% 0.002\nSVC 60.67±5.33 57.18±6.57 -3.48% 0.007 56.66±6.25 -4.01% 0.002 -4.01%, p = 0.002), while MDM remained largely unaf- means rather than spanning the full outlier range of realfected.", + "paper_id": "2603.10563", + "title": "Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation", + "authors": [ + "Viktorija Poļaka", + "Ivo Pascal de Jong", + "Andreea Ioana Sburlea" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10563v1", + "chunk_index": 8, + "total_chunks": 16, + "char_count": 513, + "word_count": 70, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "55a78e4b-c181-4ba7-addc-7a39be55d558", + "text": "Figs. 3 and 4 illustrate subject-wise distributions, world data.\nrevealing high variability: KNN augmentation yielded Classifier-Dependent Utility: The impact of the syngains up to +7.8% for subject no.3 (prior generation, syn- thetic data was highly divergent, revealing that data augthetic only condition). 
A scrambled label test confirmed classifiers learned meaningful features, yielding chance-level accuracy (≈50%) on randomized data.\nStandard VAE Comparison: To validate the Riemannian framework, we compared the proposed RGP-VAE against a standard Euclidean VAE. The standard VAE failed to generate valid data, with > 40% of outputs in every fold violating positive-definiteness. Conversely, augmenting with the valid portion of its data significantly degraded MDM performance (−9.49%, p < 0.001) and offered no statistically significant benefit to KNN or SVC. This confirms that the proposed architecture's geometric constraints are essential for generating valid and useful SPD matrices in this domain.\nDISCUSSION\nThis study investigated whether the proposed RGP-VAE could generate high-fidelity EEG covariance matrices to improve cross-subject MI-BCI classification. Data augmentation utility is not universal but classifier-dependent. Augmentation yielded statistically significant improvements for the KNN classifier, with posterior sampling boosting performance up to +3.49% (p = 0.002). KNN likely benefits because the prototypical synthetic samples densify the class manifolds, creating more dense and reliable local neighbourhoods for distance-based classification. Conversely, performance significantly degraded for the SVC (up to −4.01%, p = 0.002). The reduced diversity of synthetic data likely caused the SVC to learn decision boundaries too narrowly fitted around class centres, reducing generalization to boundary-case real samples. Meanwhile, performance for the MDM classifier remained stable, a positive result compared to the standard VAE, which caused a massive degradation (−9.49% under the posterior, synthetic-only condition). Unlike the naive Euclidean approach that failed to even generate valid SPD matrices, the RGP-VAE preserved SPD validity and successfully learnt Riemannian class means. Beyond
immediate classification impacts, synthesizing this data holds broader practical value; it provides a mechanism to test pipeline scalability, mitigates data scarcity (possibly for data-hungry models), and enables privacy protection by avoiding raw signal sharing.\nFuture Research Directions: This study provides a foundational proof of concept that opens avenues for future research. Building on these findings, future work may explore advanced manifold sampling techniques, such as Riemannian Hamiltonian VAEs or Riemannian Monte Carlo sampling, to capture complex latent distributions more faithfully [22, 23]. Additionally, as\nGenerative Fidelity and Validity: A primary contribution of this work is confirming that the RGP-VAE framework inherently generates valid SPD matrices, a non-trivial task where standard Euclidean VAEs failed (producing > 40% invalid matrices). This success is attributable to the underlying Riemannian geometry that enforces the SPD constraint by design. Parallel Transport enabled the model to learn a subject-invariant latent space, a critical property for cross-subject generalization. While valid, the synthetic data exhibited a slightly elevated statistical variance (ratio ≈1.06) but reduced geometric diversity (ratio ≈0.95).
With the chosen parameters (γ = 0.035, εi = 2.2), the model generated more prototypical samples concentrated near the class geometric means rather than spanning the full outlier range of real-world data. As demonstrated by vEEGNet [24], integrating the RGP-VAE's geometric constraints and subject-invariance with discriminative frameworks could potentially yield la",
    "paper_id": "2603.10563",
    "title": "Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation",
    "authors": [
      "Viktorija Poļaka",
      "Ivo Pascal de Jong",
      "Andreea Ioana Sburlea"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10563v1",
    "chunk_index": 9,
    "total_chunks": 16,
    "char_count": 3782,
    "word_count": 517,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "05d3fba9-35be-497b-a551-d560d6831afd",
    "text": "subject-invariant and class-discriminative.\nDeep learning with convolutional neural networks for brain mapping and decoding of movement-related information from the human EEG. CoRR. 2017;abs/1703.05051.\n[12] Zhu H, Forenzo D, He B. On the deep learning models for EEG-based brain-computer interface using motor imagery. IEEE Transactions on Neural Systems and Rehabilitation Engineering. 2022:1–1.\n[13] Chen Y, Wiesel A, Eldar YC, Hero AO. Shrinkage algorithms for MMSE covariance estimation. IEEE Transactions on Signal Processing. 2010;58(10).\nCONCLUSION\nThis paper developed and validated a novel Riemannian Geometry-Preserving VAE (RGP-VAE) for generating synthetic EEG covariance matrices in the challenging cross-subject MI-BCI context. The RGP-VAE is not only capable of consistently generating valid SPD matrices, overcoming the limitations of standard VAEs, but also closely matches the original data diversity.
[14] Miolane N, Holmes SP. The high",
    "paper_id": "2603.10563",
    "title": "Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation",
    "authors": [
      "Viktorija Poļaka",
      "Ivo Pascal de Jong",
      "Andreea Ioana Sburlea"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10563v1",
    "chunk_index": 11,
    "total_chunks": 16,
    "char_count": 960,
    "word_count": 130,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8a44244d-314a-4f27-8c74-59fce7d4a854",
    "text": "fidelity synthetic data can maintain or even significantly improve classification performance for specific classifiers. However, divergent classifier results highlight that generative capabilities on the SPD manifold do not guarantee universal downstream improvements.\nREFERENCES\nLearning weighted submanifolds with variational autoencoders and riemannian variational autoencoders. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019:14491–14499.\n[15] Fletcher PT, Lu C, Pizer SM, Joshi S. Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Transactions on Medical Imaging. 2004;23(8):995–1005.\n[16] Moakher M. A differential geometric approach to the geometric mean of symmetric positive-definite matrices. SIAM Journal on Matrix Analysis and Applications. 2005;26(3):735–747.\n[17] Maas AL, Hannun AY, Ng AY. Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of the 30th International Conference on Machine Learning. 2013.\n[18] Ioffe S, Szegedy C.
Batch normalization: Accelerat-\n[3] Chevallier S et al. The largest eeg-based bci repro- ing deep network training by reducing internal covariate\nducibility study for open science: The moabb benchmark. shift. CoRR. 2015;abs/1502.03167.\n2024. arXiv: 2404.15319 [eess.SP]. [19] Bowman SR, Vilnis L, Vinyals O, Dai A, Jozefow-\n[4] Blankertz B, Dornhege G, Krauledat M, Müller KR, icz R, Bengio S.", + "paper_id": "2603.10563", + "title": "Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation", + "authors": [ + "Viktorija Poļaka", + "Ivo Pascal de Jong", + "Andreea Ioana Sburlea" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10563v1", + "chunk_index": 12, + "total_chunks": 16, + "char_count": 1790, + "word_count": 241, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "84d8dbac-b5da-4fa1-a7a6-8f2b51f62339", + "text": "The non-invasive Berlin Brain-Computer In- space. In: Proceedings of the 20th SIGNLL Conference\nterface: fast acquisition of effective performance in un- on Computational Natural Language Learning. Associtrained subjects. NeuroImage. 2007;37(2):539–550. ation for Computational Linguistics: Berlin, Germany,\n[5] Kalunga E, Chevallier S, Barthélemy Q. Data aug- Aug. 2016, 10–21.\nmentation in Riemannian space for Brain-Computer In- [20] Loshchilov I, Hutter F. Fixing weight decay reguterfaces. 
In: STAMLINS 2015 proceedings.", + "paper_id": "2603.10563", + "title": "Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation", + "authors": [ + "Viktorija Poļaka", + "Ivo Pascal de Jong", + "Andreea Ioana Sburlea" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10563v1", + "chunk_index": 14, + "total_chunks": 16, + "char_count": 525, + "word_count": 68, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6e88dd35-02c4-4b6a-9f13-1422cd23f618", + "text": "Lille, France, larization in adam. CoRR. 2017;abs/1711.05101. Jun. 2015. [21] McInnes L, Healy J, Melville J. Umap: Uni-\n[6] Kingma DP, Welling M. Auto-encoding variational form manifold approximation and projection for dibayes. 2013. arXiv: 1312.6114 [stat.ML]. [Online]. mension reduction. Journal of Open Source Software. Available: https://arxiv.org/abs/1312.6114 2018;3(29):861.\n[7] Shao H, Kumar A, Fletcher PT. The rieman- [22] Chadebec C, Allassonnière S. Data augmentation\nnian geometry of deep generative models. CoRR. with variational autoencoders and manifold sampling. In:\n2017;abs/1711.08014.", + "paper_id": "2603.10563", + "title": "Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation", + "authors": [ + "Viktorija Poļaka", + "Ivo Pascal de Jong", + "Andreea Ioana Sburlea" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10563v1", + "chunk_index": 15, + "total_chunks": 16, + "char_count": 606, + "word_count": 78, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "75679800-afcc-4fda-b389-d77be2fb9e99", + "text": "Deep Generative Models, and Data Augmentation, La-\n[8] Yair O, Ben-Chen M, Talmon R. Parallel trans- belling, and Imperfections. Springer, 2021, 184–192.\nport on the cone manifold of SPD matrices for do- [23] Chadebec C, Thibeau-Sutre E, Burgos N, Allassonmain adaptation. 
IEEE Transactions on Signal Process- nière S.", + "paper_id": "2603.10563", + "title": "Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation", + "authors": [ + "Viktorija Poļaka", + "Ivo Pascal de Jong", + "Andreea Ioana Sburlea" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10563v1", + "chunk_index": 16, + "total_chunks": 16, + "char_count": 318, + "word_count": 48, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "585584e2-4e04-409b-9e86-2a139383f4c1", + "text": "Data augmentation in high dimensional low saming. 2019;67(7):1797–1811. ple size setting using a geometry-based variational au-\n[9] Faller J, Vidaurre C, Solis-Escalante T, Neuper C, toencoder. IEEE Transactions on Pattern Analysis and\nScherer R. Autocalibration and recurrent adaptation: To- Machine Intelligence. 2023;45(3):2879–2896.\nwards a plug and play online ERD-BCI. IEEE Transac- [24] Zancanaro A, Cisotto G, Zoppis I, Manzoni SL.\ntions on Neural Systems and Rehabilitation Engineering. Veegnet: Learning latent representations to reconstruct\n2012;20(3):313–319. eeg raw data via variational autoencoders. In: Informa-\n[10] Aristimunha B et al. Mother of all BCI Benchmarks. tion and Communication Technologies for Ageing Well\nVersion 1.0.0. 2023. [Online]. Available: https : / / and e-Health. 
Springer Nature Switzerland: Cham, 2024,\ngithub.com/NeuroTechX/moabb 114–129.", + "paper_id": "2603.10563", + "title": "Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation", + "authors": [ + "Viktorija Poļaka", + "Ivo Pascal de Jong", + "Andreea Ioana Sburlea" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10563v1", + "chunk_index": 17, + "total_chunks": 16, + "char_count": 881, + "word_count": 117, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10564_semantic.json b/data/chunks/2603.10564_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..9e4657eb15e865128360ab55e7fda05787f89276 --- /dev/null +++ b/data/chunks/2603.10564_semantic.json @@ -0,0 +1,485 @@ +[ + { + "chunk_id": "5c04873e-e6e5-4537-9ea4-65a2ba1775e5", + "text": "Yuanhao Li, Haozhe Wang, Geyong Min, Nektarios Georgalas, and Wang Miao Abstract—The integration of Generative AI models into AI- Within the AI-native paradigm, Reinforcement Learning\nnative network systems offers a transformative path toward (RL) has been advocated as a key enabler for achieving\nachieving autonomous and adaptive control. However, the ap- the closed-loop autonomy required in 6G operations [6].\nplication of such models to continuous control tasks is impeded\nThe agent-based nature of RL is particularly effective for by intrinsic architectural limitations, including finite context\nwindows, the lack of explicit reward signals, and the degradation complex tasks like Radio Access Network (RAN) slicing [7],\nof the long context. This paper posits that the key to unlock- [8], where an agent must perform continuous environment\ning robust continuous control is enabling agents to internalize perception, precise resource allocation decisions, and multiexperience by distilling it into their parameters, rather than objective optimization [9]. 
However, the deployment of RL in dynamic networking environments is severely hindered by the reward engineering bottleneck [10]. Designing an effective reward function for RAN slicing requires the reconciliation of multiple conflicting performance metrics, including latency, throughput, energy efficiency, and fairness, under strict system constraints [7]. Achieving reliable performance and the optimal trade-off requires laborious manual tuning and extensive trial-and-error effort [11], [12], which limits the scalability and generalization of RL solutions across diverse network environments. This bottleneck raises a critical question: can we develop agents that adapt to complex network tasks without relying on handcrafted rewards?\nfrom interaction history. A subsequent preference-based finetuning process distills long-horizon experiences into the model's parameters. We evaluate our approach on a dynamic Radio Access Network (RAN) slicing task, a challenging multi-objective control problem that requires the resolution of acute trade-offs between spectrum efficiency, service quality, and reconfiguration stability under volatile network conditions. Experimental results show that our framework outperforms standard Reinforcement Learning (RL) baselines and existing Large Language Model (LLM)-based agents in sample efficiency, stability, and multi- The recent convergence of Generative AI and autonomous
systems has introduced a new frontier for general-purpose decision-making by enabling Large Language Models (LLMs) to leverage expansive world knowledge for sophisticated reasoning and prompt-based adaptation [13], [14]. LLMs can be prompted to generate structured actions and plan sequences in complex environments without task-specific training or explicit reward supervision [15]. These findings demonstrate the potential of self-improving generative agents for continuous control tasks, paving the way for future AI-native network infrastructure.\nIndex Terms—AI-Native Networks, RAN Slicing, Autonomous Network Control, Generative Agents, Self-Finetuning\nI. INTRODUCTION\nThe transition toward 6G wireless systems marks a fun",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 1,
    "total_chunks": 23,
    "char_count": 3614,
    "word_count": 468,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d4b023c4-ed9a-4dbe-bc9f-99d81182cedb",
    "text": "damental paradigm shift in network architecture, driven by transformative applications such as holographic telepresence, the Internet of Everything (IoE), and autonomous vehicular\nHowever, harnessing LLMs for continuous network control poses fundamental challenges. A primary issue is their proneness to hallucination in partially observable environments [16].
Moreover, they lack mechanisms",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 2,
    "total_chunks": 23,
    "char_count": 410,
    "word_count": 52,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8b8ba686-8db7-4a22-8f28-434cad9830",
    "text": "to learn from mistakes or adapt their behavior over time. These applications impose unprecedented requirements for latency, throughput, and scalability, necessitating networks capable of persistent adaptation [3]. To meet these demands, AI-native architecture has emerged as a key enabler for future networks [4]. Unlike traditional \"add-on\" approaches that apply AI as a supplementary component, an AI-native system integrates intelligence directly into the network infrastructure as a core element. This deep integration enables real-time autonomous control across the entire protocol stack [5], transforming the network into a truly self-optimizing system capable of dynamic adaptation to ever-changing traffic patterns, resource availability, and user demands [5]. While recent efforts utilize interaction history and self-reflection on past decisions to refine agent behavior and reduce hallucination [17], [18], these methods are severely constrained by finite context window and Long Context Degradation [19], which prevents true continual learning and confines these agents to short-horizon, episodic tasks, falling short of the persistent continuous control demanded by AI-Native network systems. To address these limitations, we propose a self-finetuning framework that enables LLM agents to continuously adapt by internalizing interaction history into model parameters rather
than relying on ever-expanding prompt-based memory. The learning process is embedded directly into the interaction loop,",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 3,
    "total_chunks": 23,
    "char_count": 1536,
    "word_count": 208,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f5fedd37-0513-4c53-8237-ffca5c582bdf",
    "text": "where step-level and trajectory-level reflections are used to form an internalized prior. This enables self-generated oral improvement signals in place of environment-specific hand\nY. Li, H. Wang, and G. Min are with the Department of Computer Science, Faculty of Environment, Science and Economy, University of Exeter, Exeter, EX4 4QF, UK. Email: {yl1118, h.wang3, g.min}@exeter.ac.uk. Nektarios Georgalas is with the British Telecom, UK.\n[Fig. 1 diagram: three pipelines (RL Actor-Critic, Reflexion, Self-Finetuning (ours)), each with Actor, Observation/Action, and Env modules; updates via Advantage loss with Critic, self-reflection with a trajectory database and Evaluator, and KTO Preference Finetuning with Reflector; legend: Neural Network, Flagship LLM, Lightweight LLM, Database, Trainable, Frozen, RIC, Station]\nFig. 1: This figure compares three control algorithms (RL Actor-Critic, Reflexion, and Self-Finetuning), each organized into four key functional modules, color-coded for clarity: action-generating Actor (gray), interactive Environment (green), performance-evaluating module (blue), and Actor updating mechanism (yellow).
RL updates the Actor via Advantage-based loss; Reflexion leverages self-reflection and a trajectory database to inject insights and history into the Actor's prompt; Self-Finetuning generates training data through Reflection and directly improves the Actor via KTO Preference Finetuning.\ncrafted rewards, supporting continuous control and sustained adaptation in dynamic AI-Native network environments.\nTo realize this self-finetuning framework, we make the following key contributions:\n• We formalize a novel Reflective Markov Decision Process (R-MDP) and Actor-Reflector (AR) framework that bridges the gap between sequential optimization in RL and the semantic reasoning capabilities of generative agents.\n• We design a bi-perspective reflection mechanism that integrates localized step-level feedback from the Actor with global trajectory-level reflections from the Reflector to facilitate dynamic policy adjustment without relying\nin Open RAN architectures, showcasing RL's capability in handling complex resource partitioning problems [7]. Zhang investigated RL-based power control methods for cognitive radio networks, highlighting their effectiveness in spectrum sharing scenarios [22]. While RL has demonstrated state-of-the-art performance in network control tasks, designing effective reward functions remains a significant challenge. Network environments involve multiple competing objectives, such as latency, throughput, energy efficiency, and fairness, that must be carefully balanced in the reward structure [7], [8], [21], [22]. The complexity of these trade-offs often leads to laborious trial-and-error processes to identify optimal reward formulations [9], [11].
This not only increases training time but also requires substantial domain expertise to ensure stable convergence and desirable policy behavior.\non handcrafted reward functions.\n• We propose Refine-from-Reflection (RfR), a novel fine-",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 4,
    "total_chunks": 23,
    "char_count": 3079,
    "word_count": 396,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "1c7e4c8e-e819-4300-b385-379cd6f44b30",
    "text": "tuning framework that distills the agent's experiences by converting reflection-labeled trajectories into preference datasets, to internalize the agent's decision-making expertise into model parameters through Kahneman-Tversky Optimization (KTO) [20], effectively overcoming the context window limitations.\n• We conduct an extensive empirical evaluation of our framework on a challenging dynamic RAN slicing task and demonstrate that it outperforms standard RL baselines, achieving superior performance with significantly fewer environment interactions.\nRecent studies reveal this challenge persists in practice, showing that even after extensive tuning, reward functions often remain suboptimal, with over 90% of RL practitioners relying on manual trial-and-error approaches and nearly 90% acknowledging their final reward designs fail to achieve optimal performance [12].\nB. LLM Agent\nLLMs have recently been explored as autonomous agents for decision-making tasks. Approaches such as Reflexion [18] and ExpeL [17] enhance LLM adaptability via self-\nII.
RELATED WORK\nremain limited by the finite context degradation [19], preventing effective use of long-term history. As a result, current LLM agents are better suited for short-horizon, episodic tasks and struggle in continuous control settings. This is a major limitation for AI-native network control, where tasks like RAN slicing or bitrate adaptation require continuous decision-making grounded in long-horizon experience.\nA. RL for AI-Native Networking\nRL has emerged as a powerful approach for addressing various network optimization challenges [6]. He proposed a blockchain-based deep RL framework for healthcare data offloading, demonstrating effective resource allocation in edge computing environments [21]. Zangooei developed a con-",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 5,
    "total_chunks": 23,
    "char_count": 1871,
    "word_count": 251,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "728ddfc7-4e86-4f35-a7cb-0b2851f8f835",
    "text": "strained multi-agent RL solution for dynamic network slicing\nExisting interaction histories are often truncated or summarized, hindering sustained learning and generalization.\nto adapt to dynamic service demands. This multi-objective tension defines the core challenge of efficient RAN slicing.\nNetLLM [23] is an early attempt to apply LLMs to networking tasks, combining multimodal encoders and efficient adaptation to achieve strong performance. 
However, it relies on supervised learning from static expert trajectories, without\nSpectrum Efficiency (SE) serves as a key metric for quantifying radio resource utilization.",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 6,
    "total_chunks": 23,
    "char_count": 624,
    "word_count": 85,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "6254016a-d0ec-4063-9288-46943258e53a",
    "text": "interactive or continual learning capabilities. In contrast, our work introduces a self-finetuning LLM agent that learns directly from environment interaction. By leveraging reflection and preference-based updates, it continuously distills long-term experience into model parameters, enabling sustained learning beyond the limitations of context length.\nIII. RAN SLICING RESOURCE MANAGEMENT\nA. System Model\nWe consider an AI-driven RAN slicing framework for 6G networks, leveraging a state-of-the-art AI-RAN architecture with a central controller deployed on the RAN Intelligent Controller (RIC) to enable adaptive resource allocation across multiple network slices [7]. Within this architecture, the sys-\nAt time step t, let Pt denote the set of packets successfully received by the User Equipments (UEs) in the slice; the spectrum efficiency SEt is computed as:\nSEt = ( Σ_{p∈Pt} |p| ) / (τ · bt) (1)\nwhere |p| denotes the size of packet p, τ represents the decision interval, and bt indicates the allocated bandwidth for the slice at time step t.\nThe service quality of a slice is quantified by the cumulative Packet QoS (PQoS) violation V, which counts timesteps where any packet's QoS metric (e.g., latency) falls below its requirement. 
A packet is considered to exceed its QoS requirement if its metric vector Mpt violates the threshold Θ, indicated by χpt = I(Mpt ⊭ Θ). A timestep is therefore counted as exceeding its QoS requirement if any packet within it triggers such an event. The cumulative Packet QoS violation over the measurement window Tm is given by:\nV = Σ_{t=0}^{Tm} vt = Σ_{t=0}^{Tm} I( ∃p ∈ Pt : χpt = 1 ) (2)\ntem employs an LLM agent to manage the slice resources, which are structured as Physical Resource Blocks (PRBs) in time-frequency grids. The controller dynamically adjusts the inter-slice PRB allocation per decision interval based on monitored performance metrics [7]. To ensure isolation, PRBs are strictly segregated between slices, allowing independent operation within allocated resources.\nThe framework operates in a closed-loop manner: LLM agents continuously evaluate slice demands and submit decisions to the controller, which optimizes the PRB distribution for the subsequent decision interval.",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 7,
    "total_chunks": 23,
    "char_count": 2197,
    "word_count": 338,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "fd4ecc34-2f71-49cb-a883-877f3101e7ee",
    "text": "The overhead of resource allocation is measured by the Resource Reconfiguration Times metric C, which counts the number of time steps where the bandwidth allocation for a slice changed. Let ct = I(bt ≠ bt−1) denote a binary\nThis dynamic approach adapts to traffic fluctuations while maintaining efficient resource utilization. Intra-slice scheduling is handled by a proportional fair scheduler [24], as our focus remains on inter-slice allocation. 
The resource allocation process operates with high dynamism, continuously adjusting to real-time traffic variations and evolving network conditions. Through a closed-loop framework of monitoring, decision-making, and allocation, it enables adaptive and efficient resource management, ensuring that each slice's performance requirements are fulfilled while maximizing overall network efficiency.\nindicator marking whether a reconfiguration occurs at timestep t. The cumulative reconfiguration count over the measurement window Tm is given by:\nC = Σ_{t=0}^{Tm} ct = Σ_{t=0}^{Tm} I(bt ≠ bt−1) (3)\nB.",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 8,
    "total_chunks": 23,
    "char_count": 1021,
    "word_count": 141,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7bbe6d68-fdba-4a66-a3c0-c3e1f5d4ef57",
    "text": "Problem Definition\nWe formulate the RAN slicing task as a Multi-Objective Optimization Problem (MOOP) with three conflicting goals: maximizing resource utilization, ensuring service quality, and minimizing reconfiguration overhead. Efficient utilization improves throughput under bandwidth constraints; service quality requires adaptive allocation to meet diverse QoS needs.\nLower C values indicate more stable resource allocations, directly corresponding to lower system reconfiguration overhead. These metrics enable the formulation of an MOOP for control policy design. Specifically, we seek a policy π ∈ Π that simultaneously (1) maximizes the average spectrum efficiency Σ_{t=0}^{Tm} SEt / Tm, (2) minimizes reconfiguration times through C, and (3) minimizes the average PQoS violation times V.\nMinimizing reconfiguration overhead is critical, as frequent 
spectrum reallocations trigger virtual resource adjustments that introduce operational costs and potential disruptions [8]. These objectives form conflicting trade-offs: maximizing resource utilization requires frequent allocation adjustments, increasing reconfiguration overhead and service instability, while prioritizing service quality through over-provisioning can lead to underutilization during low demand. Similarly, minimizing reconfiguration overhead through rigid allocations fails\nThe optimization objective is formally expressed as:\nmax_{π∈Π} lim_{Tm→∞} { Eπ[ Σ_{t=0}^{Tm} SEt ], −Eπ[ Σ_{t=0}^{Tm} vt ], −Eπ[ Σ_{t=0}^{Tm} ct ] } (4)\nIV. METHODOLOGY\nA. Reflective Markov Decision Process\nTraditional reinforcement learning is commonly formulated as a Markov Decision Process (MDP), defined by a tuple ⟨S, A, P, R, γ⟩, where S is the set of states, A is the set of actions, P is the state transition probability, R is the reward function, and γ ∈ [0, 1) is the discount factor. In this framework, an agent interacts with an environment by observing a state st ∈ S, taking an action at ∈ A, receiving a scalar reward rt = R(st, at), and transitioning to the next state st+1 ∼ P(· | st, at). The agent's objective is to learn a policy π(a | s) that maximizes the expected return E[ Σt γ^t rt ].\nAlgorithm 1 Actor–Reflector Inference and Training Loop\n1: repeat\n2: Initialize empty history H ← ∅\n3: while trajectory not terminated do\n4: Observe current state st\n5: Build input sequence: It ← PROMPT(Ht−1, st)\n6: LLM inference: obtain output Ot ← π(It)\n7: Extract: (ψt, at, ϕt) ← EXTRACTOR(Ot)\n8: Execute action at, receive feedback vector Mt\n9: Append (st, at, ψt, ϕt, Mt, It, Ot) to H\n10: end while\n11: Initialize empty labeled history H′ ← ∅\n12: Pass full history H to Reflector\n13: for each step t in H do
14: (ℓt, ˆat) ← Rφ(st, at, ψt, ϕt, Mt, H)\n15: Append (st, at, ψt, ϕt, Mt, It, Ot, ℓt, ˆat) to H′\n16: end for\n17: Fine-tune Actor: π′ ← PREF-FINETUNE(π, H′)\n18: Update policy: π ← π′\n19: until performance converges\nWhile this formalism supports many advances in sequential decision-making, it is not directly suited for LLM-based agents, which operate on structured prompts rather than scalar rewards.\nTo better align the decision-making process with the structure and capabilities of LLMs, we propose the Reflective MDP (R-MDP), a novel formalism designed for LLM agents. In R-MDP, the agent-environment interaction is reformulated as a sequence of tuples ⟨S, A, Ψ, Φ, M, P′⟩, where:\n• S is the state space, representing environment observations,\n• A is the action space,\nB. Actor-Reflector Framework\nThe Actor-Critic (AC) architecture [9] is a foundational RL framework that separates policy and value estimation into two components: the Actor and the Critic. As shown in Fig 1 (left), the Actor represents the policy πθ(at | st), which selects actions based on the current state.
The Critic estimates the state-value function Vπ(st), which predicts the expected long-term return from state st and provides a learning signal to guide the Actor's policy updates.\n• Ψ is the space of step-level reflections, representing natural language reflections on the previous step,\n• Φ is the space of step-level analyses, summarizing or",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 9,
    "total_chunks": 23,
    "char_count": 4077,
    "word_count": 641,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ba94a1b2-9079-4c88-8e32-d038a13197c0",
    "text": "justifying the current decision,\n• M is the space of environment feedback vectors (e.g., metrics like latency, throughput),\n• P′ is the transition function, P′ : S × A → S.\nIn the AC framework, the Actor is updated by minimizing the loss:\nLAC-actor = −log πθ(at | st) · A(st, at) = −log πθ(at | st) · ( Σ_{k=0}^{∞} γ^k rt+k − V(st) ) (6)\nwhich encourages the policy to increase the probability of actions whose returns exceed the current value estimate. This structure allows the Actor to improve behavior through feedback provided by the Critic's value predictions.\nAt each timestep t, the agent observes the current state st and constructs a prompt using the trajectory history Ht−1 = {(s0, a0, ψ0, ϕ0, M0), . . . , (st−1, at−1, ψt−1, ϕt−1, Mt−1)}, which contains all previous states, actions, reflections, analyses, and environment feedbacks. Conditioned on st and Ht−1, the policy π generates a triplet (ψt, at, ϕt), where ψt ∈ Ψ is a reflection on the previous step, at ∈ A is the current action, and ϕt ∈ Φ is a brief analysis of the current decision. 
The action at is then executed in the environment, leading to a new state st+1 = P′(st, at), and the environment returns a feedback vector Mt ∈ M, consisting of task-specific metrics.\nWhile the AC architecture relies on scalar value estimation to guide policy improvement, it is not naturally aligned with the strengths of LLMs in reasoning, reflection, and language-based supervision.\nThese metrics are not used to compute a scalar reward",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 10,
    "total_chunks": 23,
    "char_count": 1486,
    "word_count": 258,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a31a6f30-6e32-41f5-9151-31c4d7956fff",
    "text": "but are instead recorded as part of the trajectory, enabling subsequent global reflection and policy improvement.\nTo better integrate LLMs into sequential decision-making and solve the R-MDP, we propose the AR architecture, an RL-style framework that mirrors the structure of AC, but replaces the Critic with a Reflector that provides interpretable and semantic-level feedback over full trajectories.\nThe R-MDP optimization objective follows the standard MDP formulation but replaces scalar rewards with language-derived feedback:\nπ∗ = arg max_{π} Eπ[ Σ_{t=0}^{T} γ^t rlang(st, at) ] (5)\n1) Actor: As shown in Fig 1 (right), the Actor is implemented as an LLM policy π, which embeds the current state st and interaction history Ht−1 from step 0 to step t−1 into a prompt-formatted input sequence It = PROMPT(Ht−1, st). 
The model outputs a structured sequence Ot = π(It), from which a triplet is extracted: a reflection on the previous step ψt, the current action at, and an analysis of the current decision ϕt. After executing the action, the environment returns a task-specific metric vector Mt, which, along with (st, at, ψt, ϕt, It, Ot), is appended to the history H.\nwhere rlang(·) is an implicit reward function derived from the natural language feedback instead of scalar rewards in traditional RL.\n2) Reflector: Unlike AC, where the Critic estimates a scalar value and updates the policy via gradients, the Reflector R in AR operates after each trajectory to perform trajectory-level assessment. It evaluates every step in the recorded history using environment feedback and language-level signals and assigns a quality label ℓt ∈ {True, False}. For suboptimal decisions, the Reflector proposes improved actions ˆat. The full trajectory is thus converted into a labeled dataset with step-wise annotations, which is used in the subsequent fine-tuning stage to adapt the LLM policy.\nAlgorithm 2 Preference Fine-Tuning of Actor (RfR)\nRequire: Labeled history H′, base policy π, rollout count m, maximum fine-tuning steps n\n1: # Perform n KTO iterations\n2: for i = 1 to n do\n3: Initialize empty fine-tuning dataset D ← ∅\n4: Initialize flag promising ← False\n5: # Reflector-labeled data\n6: Append all (It, Ot, ℓt, ˆat) in H′ to D\n7: # refine-rollout data\n8: for each ht = (It, Ot, ℓt, ˆat) in H′ do\n9: if ℓt == False then\n10: for j = 1 to m do\n11: Ojt ← π(It)\n12: Extract a′t ← EXTRACTOR(Ojt)\n3) Bi-Perspective Reflection: The Actor's step-level reflection mechanism (ψt, ϕt) operates through in-context learning within the LLM's input sequence.\n13: λjt = (True, a′t = ˆat, 
False, otherwise.\n14: D ← D ∪ {(It, Ojt, λjt)}\n15: end for\n16: if P(ˆat|It) > ρ then\n17: Do not rollout ht in next iteration\n18: end if\n19: end if\n20: end for\n21: Fine-tune π using dataset D via KTO\n22: end for\nBy embedding past reflections and analyses directly into the prompt as short-term memory, the Actor dynamically adjusts its policy without weight updates. Each new action at is conditioned on a finite history window Ht−1 in the input sequence. This approach leverages the LLM's inherent ability to perform meta-reasoning over provided examples: recent (ψt, ϕt) pairs serve as in-context \"demonstrations\" that guide the current decision, analogous to few-shot prompting in language tasks. The limited context window naturally enforces a recency bias, prioritizing recent experiences while gradually forgetting older interactions, which is a property aligned with online adaptation in dynamic environments.\nthen internalizes these preferences through fine-tuning rather than gradient updates, maintaining the value-maximization principle while operating entirely in the language domain. Algorithm 1 details the AR's inference and learning loop. Lines 3–10 describe step-wise interaction: the Actor builds a prompt from the state and history, generates action, step-wise reflection and analysis, and receives environmental feedback, stored for future reasoning. Lines 13–16 show the Reflector's trajectory evaluation: it reviews each step, labels actions as effective or suboptimal using environment feedback and verbal reflection, and suggests better actions.\nThe Reflector's trajectory-level reflection mechanism enables the Reflector to optimize decisions through retrospective analysis of complete trajectory histories H. Leveraging the LLM's reasoning capacity over this extended context, the Reflector identifies improved actions ˆat for each state st. This process formalizes as:\nˆat = arg max_{a∈A} E[ Σ_{k=t}^{T} γ^(k−t) rlang(sk, ak) | st = s, at = a, H ]
(7)\nThen, the labeled history H′ is used to fine-tune the Actor (line 17).\nC.",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 11,
    "total_chunks": 23,
    "char_count": 4613,
    "word_count": 732,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "fd7a2a4c-0dbf-4944-b085-879e93a8a41f",
    "text": "Refine-from-Reflection (RfR) Fine-tuning framework\nUnlike step-level reflection that only observes past information, this full-trajectory view allows the Reflector to assess how individual actions contribute to long-term outcomes, analogous to value function estimation in RL but operating through natural language reasoning.\nThe trajectory-level reflection mechanism reinterprets the AC paradigm through language-mediated optimization. In classical RL, the Critic provides scalar value estimates to guide the Actor's gradient-based policy updates, thereby increasing the probability of high-value actions. Our framework preserves this structure but replaces the Critic's numerical output with the Reflector's semantic analysis. Instead of backpropagating\nAfter the Reflector processes a trajectory and produces the labeled history H′, the system enters the fine-tuning phase. We propose the RfR framework to construct the dataset and fine-tune the Actor, which operates over multiple iterations; in each iteration, a new preference dataset D is constructed based on\nThe dataset consists of two components:\n1) Reflector-labeled examples: As shown in Algorithm 2 (line 6), we directly extract preference examples from H′ where actions labeled as effective by the Reflector are treated as positive samples, while suboptimal actions are treated as negative samples.
These form the base dataset derived from trajectory-level reflection.\nadvantage estimates, the Reflector examines complete trajectories to generate natural language assessments. These linguistic signals serve the same theoretical role as value function to identify preferable actions, but avoid reward engineering complexities by deriving improvement signals directly from the LLM's reasoning about decision consequences. The Actor\n2) Refine-rollout examples: To enhance sample efficiency and utilize the LLM's generative capacity, we perform multiple rollouts on each negative sample as demonstrated in Algorithm 2 (lines 8-20). For each input prompt It associated with a suboptimal action, the Actor LLM is sampled m times to generate alternative outputs. If any sampled output yields an improved action (i.e., one that matches or aligns with the Reflector's suggestion), it is treated as an additional positive sample; otherwise, it is marked negative. If the probability of generating improved actions exceeds threshold ρ, subsequent iterations omit further rollouts on this sample to prevent overfitting. These rollout-derived examples are then merged with the Reflector-labeled examples to construct the full preference dataset for the current fine-tuning round.\nTABLE I: Traffic Model Parameters\nMetric | GBR Traffic | Non-GBR Traffic\nActive UE count | 20 | 4\nTransmission duration | Exp(mean = 15 sec) | Exp(mean = 15 sec)\nIdle duration | Exp(mean = 15 sec) | Exp(mean = 15 sec)\nBit rate | 0.5 Mb/s | 2 Mb/s\nPacket Size | 512 bytes | 512 bytes\nQoS Requirement | Delay < 10 ms | Delay < 50 ms\nTo optimize the LLM policy using the constructed preference dataset, we adopt the KTO [20] algorithm.\nTABLE II: Radio Channel Parameters
Parameter | Value\nTransmission power | 30 dBm\nBase station antenna gain | 0 dB\nBase station antenna pattern | Antenna Model in 3GPP TR 38.901\nNoise figure | 5 dB\nCarrier frequency | 2120 MHz\nPropagation model | Urban Propagation Loss Model\nUnlike pairwise preference objectives such as DPO [25], KTO supports unbalanced datasets by directly modeling the absolute preference likelihood of each sample using prospect theory. The KTO loss is defined as:\nLKTO(π′, π) = E_{x,y∼D}[ λy − v(x, y) ] (8)",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 12,
    "total_chunks": 23,
    "char_count": 3541,
    "word_count": 513,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e9a4f75a-a078-4755-8b65-aaaf8e213132",
    "text": "where:\nv(x, y) = λp · σ(β · (rθ(x, y) − z0)) if y ∼ ypositive | x, and λn · σ(β · (z0 − rθ(x, y))) if y ∼ ynegative | x (9)\nrθ(x, y) = log( π′(y | x) / π(y | x) ) (10)\nz0 = KL( π′(y′ | x) ∥ π(y′ | x) ) (11)\nmodel weights rather than relying on external memory or retrieval mechanisms. Meanwhile, the rollout-derived examples capture the generative flexibility of LLMs. Even when the initial output for a given prompt is poor, the model may still be capable of producing better actions through sampling. By identifying and reinforcing these successful alternative outputs, the fine-tuning process increases the likelihood of generating desirable actions for challenging decision points, while reducing the chance of repeating suboptimal behavior. This helps refine the policy's preference boundary in ambiguous or high-variance situations. Together, these two data sources enable KTO to effectively\nHere, x corresponds to the input prompt I, and y is the generated output sequence O. 
The policy π(y | x) thus models the likelihood of generating output O given the prompt I. π′ is the current policy, π is the reference model (typically the original frozen LLM), and σ(·) is the sigmoid function. The KL term z0 captures the policy shift from the reference model. The utility function v(x, y) applies asymmetric gain/loss scaling using coefficients λp, λn, and sensitivity β. KTO naturally handles unbalanced preference labels and encourages the policy to prefer positive outputs by the weights λp and λn, which are defined as:\nλD = max(Npositive, Nnegative) / Npositive (12)\nλU = max(Npositive, Nnegative) / Nnegative (13)\nwhere Npositive and Nnegative denote the number of samples in the positive and negative datasets, respectively.\nalign the model with reflective preferences, which is why we name this framework RfR. The terminology carries dual significance: first, it reflects the two-stage data generation process where reflector-labeled data spawns refine-rollout samples through iterative improvement; second, it captures the fundamental paradigm where the entire model refinement stems from reflective processes. The base dataset encodes stable, trajectory-level decision quality distilled from reflection, while the rollout samples expand the model's behavioral capacity through reflection-driven exploration without requiring additional environment interaction.\nV. EXPERIMENT\nA. Simulation Environment Settings\nTo evaluate the effectiveness and performance of our proposed framework, we conducted experiments in a custom Python-based RAN slicing simulator. The simulator leverages the ns-3 packet-level engine to create a realistic network environment.\nThe combination of base and rollout-derived preference data 
under the KTO objective.\nWe focus on the challenging and dynamic task of inter-slice spectrum resource allocation, a canonical multi",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 13,
    "total_chunks": 23,
    "char_count": 2899,
    "word_count": 440,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "37e7a2b5-1b22-4010-9da2-d39de0e9ae45",
    "text": "objective control problem in 6G.\nThe base dataset, directly extracted from the Reflector-labeled trajectory history H, allows the model to learn which actions are effective or suboptimal within a given trajectory. Fine-tuning on this dataset enables the policy to internalize decision-making experience in a durable way, embedding trajectory-level insights into the\nThe traffic is generated using on-off application models to simulate stochastic user activity within the network slices. In this model, the on and off durations follow exponential distributions, introducing realistic randomness into the activity patterns of user equipments (UEs). 
During the on period, UEs transmit at a constant bit",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 14,
    "total_chunks": 23,
    "char_count": 701,
    "word_count": 101,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f9fb8216-42ff-4aa3-84c3-fae1151e9ea2",
    "text": "rate. As shown in Table I, the parameters of the on-off model, including the average on and off times and the bit rates, were configured to reflect two distinct traffic modes.\nThe radio channel was configured based on standard propagation models, including the urban propagation loss model specified in 3GPP TR 38.901. Key parameters such as transmission power, noise figure, and carrier frequency are detailed in Table II. Frequency-selective fading was introduced to capture realistic channel variability, using pre-generated fading traces that emulate typical mobility scenarios such as pedestrian and vehicular models.\nTABLE III: Performance comparison of different algorithms. The best and second-best results for each objective are marked in bold and underline.\nAlgorithms | Mean SE | Reconfiguration Times | PQoS Violation Times | Avg. Utility\nSF (ours) | 5.354 | 21.091 | 8.561 | 25702.2\nReflexion | 5.299 | 29.454 | 8.630 | 25314.69\nDQN | 5.219 | 46.204 | 15.911 | 22519.1\nPPO | 3.587 | 51.411 | 1.997 | 19277.2\nSAC | 5.748 | 44.775 | 59.967 | 11704.3\nThe decision-making cycle was set to 100 ms, during which the simulator captured relevant performance metrics and dynamically updated the resource allocation decisions.\nrt(st, at) = α · SEt − ct · Preconf − vt · PQoS (15)\nwhere α weights the spectral efficiency SE, ct indicates recon\nB. 
Baseline algorithms\nfiguration occurrences as shown in (3) with penalty Preconf,\nWe evaluate our method against two categories of baselines and vt indicates PQoS violation as shown in (2) with penalty\nto ensure thorough comparison. First, we implemented three PQoS. This reward formulation explicitly trades off three key\nstate-of-the-art RL algorithms using the Ray RLlib framework objectives: maximizing spectral efficiency while minimizing\n[26]: Deep Q-Network (DQN) [27], Soft Actor-Critic (SAC) both frequent reconfigurations and service violations.\n[28] and Proximal Policy Optimization (PPO) [29]. These\ncarefully selected baselines provide broad coverage of modern C.", + "paper_id": "2603.10564", + "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents", + "authors": [ + "Yuanhao Li", + "Haozhe Wang", + "Geyong Min", + "Nektarios Georgalas", + "Wang Miao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10564v1", + "chunk_index": 15, + "total_chunks": 23, + "char_count": 1927, + "word_count": 281, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0d6277f1-2f04-41b4-8d99-156073b259a1", + "text": "Experiment Result\nRL techniques, spanning value-based, policy-based, and max- In the RAN slicing continuous control task, the performance\nimum entropy paradigms to ensure a thorough evaluation of of different algorithms is evaluated in a multi-objective optiour method's performance across different aspects of network mization context over 300-step trajectories of environmental\ncontrol optimization. interaction for comparative analysis across three core metrics:\nSecond, we adapt the Reflexion framework as the primary Mean Spectral Efficiency (Mean SE), Reconfiguration Times,\nLLM-agent baseline. Its tripartite architecture is preserved: and PQoS violation times. As illustrated in Fig. 
2, we present the performance trajectories of RL baselines (SAC, PPO, DQN) over 80 training rounds, alongside the final performance of Reflexion and the proposed Self-Finetuning method.\nthe Actor is implemented using Qwen3-4B [30], and both the trajectory evaluator and the self-reflection modules are instantiated with DeepSeek-R1 [31]. For controlled comparison, our Self-Finetuning framework adopts the same backbone models. The Actor also uses Qwen3-4B, and the Reflector is implemented with DeepSeek-R1. This design ensures comparable action-generation capability across agents and allows the observed performance differences to be attributed to our architectural and learning-mechanism innovations rather than model capacity. For all algorithms, the system state at time step t is represented as:\nst = [at−1, SEt, µt, δt, ϵt] (14)\nwhere at−1 is the previous action (current PRB allocation), SEt is the current spectral efficiency, µt represents the throughput of arriving traffic, δt denotes the increment in queued packet size, and ϵt indicates the size of dropped packets.\nDespite RL algorithms collecting 20 trajectories per round (totaling 1,600 for training), their convergence and stability in multi-objective optimization remain suboptimal. SAC exhibits significant volatility during training, with unstable oscillations in Episode utility, making it difficult to form a stable policy (Fig. 2 (a)). While PPO performs well in PQoS violation times control (consistently maintaining low violation times in Fig. 2 (c)), its Mean SE is relatively poor, and frequent resource reconfigurations (Fig. 2 (d)) incur substantial system overhead. DQN attains a relatively high overall utility score, despite exhibiting no standout performance on any individual metric. In contrast, Self-Finetuning achieves superior comprehensive performance with just one training iteration and a single trajectory collection. As shown in Fig. 2, it has the highest 
utility score. In individual metrics, it excels in Mean SE, stability, and PQoS violation times control. This compact state representation provides the agent with complete information about current network conditions",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 16,
    "total_chunks": 23,
    "char_count": 2860,
    "word_count": 402,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "1b4dc827-5a76-459e-958f-01df306eccc9",
    "text": "and resource demands. The action space represents the number of available PRBs for allocation, with the agent determining the optimal resource distribution based on the observed state. The action at each time step directly influences the network's resource utilization and performance metrics. For RL baselines, the multi-objectives utility function at time step t is defined as:\nStatistical data in Table III further corroborates this: Self-Finetuning achieves a Mean SE of 5.354, a slight improvement over Reflexion; its Reconfiguration Times are only 21.091, a 59% reduction compared to PPO and 28.4% lower than Reflexion; meanwhile, its PQoS violation times is comparable to Reflexion, outperforming DQN and SAC while slightly trailing PPO, which exclusively optimizes PQoS violation times. These results demonstrate that even with minimal environmental\nFig. 2: Performance comparison of RL baselines, Reflexion, and our Self-Finetuning method across multiple objectives.\n(a) Performance before and after training; (b) KTO iteration 1; (c) KTO iteration 2; (d) KTO iteration 3; (e) KTO iteration 4; (f) KTO iteration 5; (g) KTO iteration 6; (h) Reward convergence in one iteration.\nFig. 3: Training dynamics of Self-Finetuning using one trajectory. 
(a) illustrates improved PQoS violation stability, 33% fewer reconfigurations, and higher spectral efficiency; (b–g) show KTO reward evolution in each iteration; (h) Reward convergence over KTO iterations, as chosen and rejected rewards both approach zero, indicating policy stabilization.\ninteraction samples, Self-Finetuning can efficiently learn a balanced control policy, validating its generalization capability for multi-objective optimization. Reflexion, while achieving moderate SE and PQoS violation times performance, incurs higher reconfiguration costs than Self-Finetuning. This can be attributed to its reliance on long interaction histories, which often prevent the evaluator from distilling effective strategies from accumulated experiences. Consequently, the Reflexion agent's performance primarily stems from the inherent reasoning capability of Qwen3-4B, rather than adaptive learning from the environment. In contrast, the Reflector in Self-Finetuning operates at the trajectory level, systematically analyzing each step and proposing improved actions based on holistic evaluations.\nAs a result, Self-Finetuning is able to perform continual adaptation in continuous control settings, overcoming the context window limitations and long context degradation of LLMs and learning progressively from extended historical trajectories. To further illustrate the training dynamics and sample efficiency of the proposed Self-Finetuning framework, we analyze the learning trajectory within a single training iteration, as shown in Fig. 3. Despite using only one environment-generated trajectory, the framework performs six successive KTO fine-tuning iterations by augmenting the dataset with refine-rollout samples. In each KTO iteration, multiple new candidate actions are generated for previously suboptimal decisions, enabling the agent to explore and reinforce alternative behaviors without additional environment interaction.
This recursive exploitation of a single trajectory via rollout-based preference optimization is the core mechanism behind the sample efficiency of Self-Finetuning. This step-by-step reflection with trajectory-wide perspective enables the agent to extract meaningful insights even from long and complex interaction sequences of the continuous control task. By leveraging the RfR mechanism, these insights are converted into preference-labeled datasets, which are then used to fine-tune the Actor via the KTO algorithm.\nSubplots (b)–(g) of Fig. 3 show the KTO reward curves for chosen and rejected samples across the six KTO iterations.",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 17,
    "total_chunks": 23,
    "char_count": 3806,
    "word_count": 514,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "02a033bc-20cb-4088-95a1-94fd7fe7df20",
    "text": "These curves reflect how well the fine-tuned policy aligns with the Reflector's preferences: the chosen reward corresponds to the model's confidence in preferred decisions, while the rejected reward captures its tendency to produce suboptimal\nUnlike prompt-based adaptation in Reflexion, this preference-driven fine-tuning directly embeds learned decision patterns into the model weights, allowing the Actor to internalize behavioral priors and effectively compress long-term experiences.\n[6] L. Sun et al., \"Advanced deep learning models for 6G: Overview, opportunities, and challenges,\" IEEE Access, vol. 12, pp. 133245–133314, 2024.\n[7] M. Rouili et al., \"Flexible RAN slicing in Open RAN with constrained multi-agent reinforcement learning,\" IEEE Journal on Selected Areas in Communications, vol. 42, no. 2, pp. 280–294, 2024.\n[8] X. Wu et al., \"AI-assisted network-slicing based next-generation wireless networks,\" IEEE Open Journal of Vehicular Technology, vol. 1, pp. 45–66, 2020.\n[9] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, 2nd ed. MIT Press, 2018.\n[10] S. Lyu et al., \"A large language model-driven reward design framework via dynamic feedback for reinforcement learning,\" Knowledge-Based Systems, p. 114065, 2025.\n[11] Y. J. Ma et al., \"Eureka: Human-level reward design via coding large language models,\" arXiv preprint arXiv:2310.12931,\nDuring the first KTO iteration, the reward gap between chosen and rejected samples is the widest—chosen rewards are the highest and rejected rewards are strongly negative—indicating that the model learned a substantial amount from the initial preference dataset. As KTO iterations progress, the rewards of both groups gradually converge toward zero, as seen in Fig. 3(h), suggesting diminishing returns in preference learning. This convergence reflects that the single trajectory has been fully exploited: the model has internalized nearly all actionable information available from that episode, and further rollout samples contribute limited new knowledge. The effect of this training process on actual task performance is visualized in Fig. 3(a), which compares the key metrics—PQoS violation times, reconfiguration times, and average SE—before and after this single training iteration. 
Notably, 2023.\n[12] S.",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 18,
    "total_chunks": 23,
    "char_count": 2279,
    "word_count": 325,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "97291efa-28e1-4848-b2f9-29f297b2a093",
    "text": "Booth et al., \"The perils of trial-and-error reward design: Misdesign through overfitting and invalid task specifications,\" in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 5, 2023, pp. 5920–5929.\n[13] L. Wang et al., \"A survey on large language model based autonomous agents,\" Frontiers of Computer Science, vol. 18, no. 6, p. 186345, 2024.\n[14] G. Wang et al., \"Voyager: An open-ended embodied agent with large language models,\" Transactions on Machine Learning Research, 2024.\n[15] S. Yao et al., \"ReAct: Synergizing reasoning and acting in language models,\" in International Conference on Learning Representations (ICLR), 2023.\n[16] L. Huang et al., \"A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,\" ACM Transactions on Information Systems, vol. 43, no. 2, pp. 1–55, 2025.\n[17] A. Zhao et al., \"ExpeL: LLM agents are experiential learners,\" in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 17, 2024, pp. 19632–19642.\n[18] N. Shinn et al., \"Reflexion: Language agents with verbal reinforcement learning,\" Advances in Neural Information Processing Systems, vol. 36, pp. 8634–8652, 2023.\n[19] N. Liu et al., \"Lost in the middle: How language models use long contexts,\" arXiv preprint arXiv:2307.03172, 2023.\n[20] K. Ethayarajh et al., \"KTO: Model alignment as prospect theoretic optimization,\" arXiv preprint arXiv:2402.01306, 2024.\n[21] Q. Fang et al., \"A blockchain-based scheme for\nNotably, reconfiguration frequency decreases by approximately 33%, indicating improved policy stability and reduced operational overhead. PQoS violation times become more stable, reflecting enhanced consistency in meeting service-level requirements, while average SE shows a slight improvement. These results demonstrate that even with minimal interaction, the Self-Finetuning agent can make meaningful policy improvements through structured reflection and preference-based fine-tuning, underscoring the method's efficiency in continuous control environments.\nVI. CONCLUSION\nThis paper presents a Self-Finetuning framework that enables LLM-based agents to autonomously and continuously learn in complex continuous control tasks like RAN slicing. Unlike traditional RL methods, our approach requires no hand-crafted reward functions and achieves superior performance in multi-objective RAN slicing resource allocation. By leveraging trajectory-level reflection and preference-based fine-tuning, the agent effectively extracts and internalizes long-horizon experiences, enabling sample-efficient continual policy improvement.",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 19,
    "total_chunks": 23,
    "char_count": 2629,
    "word_count": 360,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "62cb476e-0229-421d-b19a-1cd42d587509",
    "text": "While slow inference speed of LLMs currently hinders\nsecure data offloading in healthcare with deep reinforcement learning,\" IEEE/ACM Transactions on Networking, vol. 32, no. 
1, pp. 65–80, 2023.\nreal-time deployment, future work will explore techniques such as imitation learning or policy distillation to transfer knowledge into lightweight models suitable for deployment in practical network systems. In addition, advancements in model optimization techniques (e.g., quantization) and hardware acceleration are expected to further alleviate this limitation.\n[22] H. Huangfu et al., \"Power control based on deep reinforcement learning for spectrum sharing,\" IEEE Transactions on Wireless Communications, vol. 19, no. 6, pp. 4209–4219, 2020.\n[23] D. Qiao et al., \"NetLLM: Adapting large language models for networking,\" in Proceedings of the ACM SIGCOMM 2024 Conference, 2024, pp. 661–678.\n[24] S. Sesia, I. Toufik, and M. Baker, LTE - The UMTS Long Term Evolution: From Theory to Practice. John Wiley & Sons, 2011.",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 20,
    "total_chunks": 23,
    "char_count": 990,
    "word_count": 142,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e38af993-cdd2-4323-a078-6dc24c9c25ac",
    "text": "Rafailov et al., \"Direct preference optimization: Your language model is secretly a reward model,\" Advances in Neural Information Processing Systems, vol. 36, pp. 53728–53741, 2023.\n[1] H. Tataria et al., \"6G wireless systems: Vision, requirements, challenges, insights, and opportunities,\" Proceedings of the IEEE, vol. 109, no. 7, pp. 1166–1199, 2021.\n[2] W. Saad et al., \"A vision of 6G wireless systems: \n[26] E. Liang et al., \"RLlib: Abstractions for distributed reinforcement learning,\" in International Conference on Machine Learning.",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 21,
    "total_chunks": 23,
    "char_count": 541,
    "word_count": 79,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "053595e6-9d82-49ee-8717-4ce38da5a24f",
    "text": "PMLR, 2018, pp. 3053–3062.\nApplications, trends, technologies, and open research problems,\" IEEE Network, vol. 34, no. 3, pp. 134–142, 2020.\n[27] V. Mnih et al., \"Human-level control through deep reinforcement learning,\" Nature, vol. 518, no. 7540, pp. 529–533, 2015.\n[3] M. Giordani et al., \"Toward 6G networks: Use cases and technologies,\" IEEE Communications Magazine, vol. 58,\n[28] T.",
    "paper_id": "2603.10564",
    "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents",
    "authors": [
      "Yuanhao Li",
      "Haozhe Wang",
      "Geyong Min",
      "Nektarios Georgalas",
      "Wang Miao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10564v1",
    "chunk_index": 22,
    "total_chunks": 23,
    "char_count": 392,
    "word_count": 58,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e1c7fb9b-eeec-4e63-9e42-3f72d488911e",
    "text": "Haarnoja et al., \"Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,\" in International Conference on Machine Learning. PMLR, 2018, pp. 1861–1870.\nno. 3, pp. 55–61, 2020.\n[4] W. Li et al., \"AI-native network slicing for 6G networks,\" IEEE Wireless Communications, vol. 29, no. 1, pp. 96–103, 2022.\n[29] J. Schulman et al., \"Proximal policy optimization\n[5] Y. 
Lin et al., \"Toward native artificial intelligence in algorithms,\" arXiv preprint arXiv:1707.06347, 2017.\n6g,\" in 2022 IEEE International Symposium on Broadband Multimedia [30] A. Yang et al., \"Qwen3 technical report,\" arXiv preprint\nSystems and Broadcasting (BMSB), 2022, pp. 1–6. arXiv:2505.09388, May 2025. Zhang et al., \"Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,\" arXiv preprint", + "paper_id": "2603.10564", + "title": "Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents", + "authors": [ + "Yuanhao Li", + "Haozhe Wang", + "Geyong Min", + "Nektarios Georgalas", + "Wang Miao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10564v1", + "chunk_index": 23, + "total_chunks": 23, + "char_count": 829, + "word_count": 117, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10573_semantic.json b/data/chunks/2603.10573_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..0df9d4acdeb25f934a4233ed85c5aa064da69591 --- /dev/null +++ b/data/chunks/2603.10573_semantic.json @@ -0,0 +1,524 @@ +[ + { + "chunk_id": "e94ba457-9d62-4fc1-b353-7f79ae20d5f5", + "text": "Published at Latent & Implicit Thinking Workshop @ ICLR 2026 IMPLICIT STATISTICAL INFERENCE IN TRANSFORMERS: APPROXIMATING LIKELIHOOD-RATIO TESTS INCONTEXT Faris Chaudhry & Siddhant Gadkari\nDepartment of Computer Science\nImperial College London\n{fc522,svg21}@imperial.ac.uk ABSTRACT2026\nIn-context learning (ICL) allows Transformers to adapt to novel tasks without\nweight updates, yet the underlying algorithms remain poorly understood. 
We\nadopt a statistical decision-theoretic perspective by investigating simple binary hypothesis testing, where the optimal policy is determined by the likelihood-ratio test. Notably, this setup provides a mathematically rigorous setting for mechanistic interpretability where the target algorithmic ground truth is known. By training Transformers on tasks requiring distinct geometries (linear shifted means vs. nonlinear variance estimation), we demonstrate that the models approximate the Bayes-optimal sufficient statistics from context up to some monotonic transformation, matching the performance of an ideal oracle estimator in nonlinear regimes. Leveraging this analytical ground truth, mechanistic analysis via logit lens and circuit alignment suggests that the model does not rely on a fixed kernel smoothing heuristic. Instead, it appears to adapt the point at which decisions become linearly decodable: exhibiting patterns consistent with a voting-style ensemble for linear tasks while utilizing a deeper sequential computation for nonlinear tasks. These findings suggest that ICL emerges from the construction of task-adaptive statistical estimators rather than simple similarity matching. In-context learning (ICL) refers to the remarkable ability of models (particularly Transformers) to adapt to novel tasks at inference time using only a finite context of input-output examples, without explicit parameter updates (Brown et al., 2020; Vaswani et al., 2023). While ICL is now a standard capability of large language models, its underlying algorithmic mechanism remains a subject of debate. 
Does the model merely retrieve and average similar examples, or does it construct a principled learning algorithm on the fly? Recent work in controlled synthetic environments has demonstrated that Transformers can recover classical algorithms (such as linear regression, decision trees, and automata) purely from context (Garg et al., 2023; Zhang et al., 2023).",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 0,
    "total_chunks": 29,
    "char_count": 2423,
    "word_count": 315,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "1e887c7f-18d3-46e2-8d43-409b76e0dccd",
    "text": "These findings suggest that ICL may implement statistically optimal procedures when the task structure allows. However, existing analyses often focus on regression problems with fixed functional forms, emphasizing asymptotic convergence rather than the precise nature of the decision rule applied at the level of individual episodes. In this work, we adopt a statistical decision-theoretic perspective. We study ICL in binary hypothesis testing, a fundamental framework where optimal decision rules are fully characterized by the Neyman-Pearson lemma (Lehmann & Romano, 2005). For simple hypotheses, the log-likelihood ratio (LLR) is a minimal sufficient statistic, and any Bayes-optimal decision rule must be a monotone function of it. This provides a sharp notion of optimality and identifiability: recovering the LLR up to a monotone (or affine) transformation is both necessary and sufficient for optimal prediction. 
More importantly, this establishes a testbed for mechanistic interpretability where the ground truth\nis known, addressing a known challenge in mechanistic interpretability (Sharkey et al., 2025). Published at Latent & Implicit Thinking Workshop @ ICLR 2026 By training Transformers on dynamic discrimination tasks where the optimal statistic varies across\nepisodes (e.g., linear vs. quadratic), we test whether the model learns to infer and apply the appropriate sufficient statistic from context alone, rather than relying on fixed similarity heuristics.", + "paper_id": "2603.10573", + "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context", + "authors": [ + "Faris Chaudhry", + "Siddhant Gadkari" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10573v1", + "chunk_index": 1, + "total_chunks": 29, + "char_count": 1476, + "word_count": 209, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8796a8cc-f4ca-4479-a8f9-20403a5a5f0f", + "text": "We\nview this work as a first step toward a broader decision-theoretic understanding of ICL. ICL as implicit inference. A growing body of literature interprets ICL as a form of implicit\nBayesian inference. Xie et al. (2022) propose that ICL can be modeled as Bayesian inference over a\nhidden variable concept space, while Li et al. (2023) and Zhang et al. (2023) demonstrate that Transformers can approximate posterior predictive distributions for specific function classes. Closest to\nour work, Bai et al. (2023) analyze Transformers as statisticians in the context of Markov chains,\nfinding that they can approach Bayes-optimal error rates. We extend this perspective by explicitly\ncharacterizing the geometry of the decision boundary (linear vs. quadratic) and linking the model's\ninternal representations to the Neyman-Pearson optimal statistic. Algorithmic induction and optimization. 
An alternative perspective views ICL as an optimization process: Akyürek et al. (2023), Dai et al. (2023), and von Oswald et al. (2023) have argued that self-attention layers can implement steps of gradient descent (GD) during the forward pass. While the \"ICL as GD\" hypothesis explains how models improve with more examples, it does not explicitly guarantee statistical optimality in discriminative settings. Our work complements this by focusing on the objective of the induced algorithm: regardless of whether the mechanism resembles GD or exact inference, we ask if it produces the sufficient statistic required for the likelihood-ratio test. Mechanistic interpretability and task vectors. Finally, our analysis draws on mechanistic interpretability to explain how these statistics are computed (Elhage et al., 2021; Nanda et al., 2023). Olsson et al. (2022) identified induction heads as a primary circuit for copying patterns in ICL. More recently, Hendel et al. (2023) and Todd et al. (2024) have proposed that Transformers compress the context into function vectors or task vectors that modulate downstream processing. 
This aligns with our finding that the attention mechanism acts as a \"neural statistician\" of sorts (Edwards & Storkey, 2017), compressing the context dataset into a single sufficient statistic (e.g., a mean vector or energy scalar) that determines the downstream decision rule.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 2,
    "total_chunks": 29,
    "char_count": 2295,
    "word_count": 342,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ef8c8973-f162-4b65-a272-de125155b53a",
    "text": "2 PROBLEM SETUP: DYNAMIC STATISTICAL DISCRIMINATION\nWe study ICL in the setting of binary hypothesis testing with task parameters that vary across episodes (i.e., independent task instances consisting of a context set and a query drawn from a shared latent task). Let Φ denote a family of binary classification tasks, where each task ϕ ∈ Φ specifies two class-conditional distributions pϕ(x | H0), pϕ(x | H1) and an associated label space y ∈ {0, 1}. In each episode, we sample task parameters ϕ ∼ p(Φ) and generate a context dataset C = {(xi, yi)} for i = 1, …, N, where yi ∼ Bernoulli(1/2) and xi ∼ pϕ(x | yi). A query point (xq, yq) is drawn from the same task distribution. A Transformer model fθ is trained to predict the label (source distribution) yq given (xq, C) by minimizing the binary cross-entropy (BCE) loss:\nL = −E_{ϕ∼p(Φ)} E_{C,xq} [yq log fθ(xq, C) + (1 − yq) log(1 − fθ(xq, C))]. (1)\nMinimizing BCE is equivalent to estimating the posterior probability p(yq = 1 | xq, C). The logit of the Bayes-optimal predictor satisfies\nlog [p(yq = 1 | xq, C) / p(yq = 0 | xq, C)] = LLR(xq; ϕ) + log(π1/π0), (2)\nwhere π1, π0 denote the class priors. 
Thus, under BCE training, the Bayes-optimal internal decision statistic is identifiable up to an affine transformation of the LLR. Conditioned on the context dataset C, each episode induces a simple binary hypothesis testing problem between H0 and H1. By the Neyman-Pearson lemma, the likelihood-ratio test p(xq | H1) / p(xq | H0)\nFigure 1: Approximation of the LLR. Regression of the Transformer's output logits against the true analytical LLR for validation episodes. (Left) Task A: The model exhibits a strong linear correlation (r = 0.859), indicating it approximates the affine sufficient statistic µ⊤(x − k). (Right) Task B: The model achieves near-perfect rank correlation (ρ = 0.976), effectively recovering the quadratic sufficient statistic ∥x∥² up to a monotone transform.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 3,
    "total_chunks": 29,
    "char_count": 1935,
    "word_count": 319,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "aae476fc-aab6-47a1-bc8a-7d39f99c9119",
    "text": "The sigmoidal shape suggests the model has learned a calibrated probability mapping, saturating for high-confidence inputs while preserving the optimal decision ordering.\nis the uniformly most powerful decision rule, and any Bayes-optimal classifier must implement a statistic that is monotone in the corresponding log-likelihood ratio. Consequently, recovery of the LLR up to an affine transformation is both necessary and sufficient for optimal in-context prediction under BCE training. 
To test whether Transformers rely on simple heuristics or perform optimal, context-dependent statistical inference, we design two Gaussian discrimination tasks with differing optimal statistics. Task A: Shifted Mean Discrimination (Linear Regime). We sample a discriminative direction µ ∼ Unif(S^{d−1}) and a shift k ∼ N(0, σk² I). The class-conditional distributions are\nH0 : x ∼ N(−µ + k, I), H1 : x ∼ N(µ + k, I). (3)\nThe optimal decision boundary is linear but not centered at the origin. The sufficient statistic is the shifted projection S(x) = µ⊤(x − k), requiring the model to infer both µ and k from the context. This task probes whether the model can dynamically estimate local centroids and perform linear discrimination. Static models that assume fixed centering fail on this task. Task B: Variance Discrimination (Nonlinear Regime). We sample two variances σ0, σ1 ∼ Unif[0.5, 3.0] and fix the mean at zero. The distributions are\nH0 : x ∼ N(0, σ0² I), H1 : x ∼ N(0, σ1² I). (4)\nSince the class means coincide, dot-product similarity is uninformative. The optimal decision statistic depends on the quadratic energy ∥x∥², with the sign determined by the relative ordering of (σ0, σ1).",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 4,
    "total_chunks": 29,
    "char_count": 1670,
    "word_count": 255,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c187baf6-c88-4997-9f6a-c4a7f6961c17",
    "text": "This task tests whether the model can adapt its internal geometry from linear projections to norm-based estimation. 
3 APPROXIMATION OF THE LLR 3.1 RECOVERY OF OPTIMALITY To quantify the model's ability to recover the sufficient statistic, we compare its in-context accuracy\nagainst a theoretical Bayes-optimal classifier. The oracle computes the exact log-likelihood ratio\nusing the ground-truth task parameters (µ, k, σ), representing the theoretical performance ceiling. In the nonlinear variance task (Task B), the model achieves an accuracy of 83.0 ± 0.5%, effectively\nmatching the oracle performance of 84.0 ± 1.0%. While the model's raw logits do not linearly track the analytical LLR (Pearson r = 0.60), they achieve near-perfect rank alignment (Spearman\nρ = 0.98). This indicates that the model has successfully recovered the ordering induced by the\nquadratic sufficient statistic ∥x∥², but maps it through a nonlinear calibration function (Figure 1). In the linear shifted mean task (Task A), the model achieves 78.3 ± 0.3%. While discriminative, it\nremains below the oracle accuracy of 84.6 ± 1.0%, leaving an optimality gap of approximately 6.3%. This discrepancy is reflected in the regression analysis, which shows a noisy linear approximation\n(r = 0.86) rather than the clean functional relationship observed in Task B. This suggests that\ninstead of performing exact symbolic inference, the model implements some approximation. 
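For reference, the analytical statistics used in this comparison (Eqs. (7) and (9) of Appendix A) and the two correlation measures can be computed as below; this is a minimal sketch in our own notation, not the evaluation harness itself:

```python
import numpy as np

def llr_task_a(x, mu, k):
    # Eq. (7): Lambda(x) = 2 mu^T x - 2 mu^T k, affine in mu^T (x - k).
    return 2.0 * x @ mu - 2.0 * mu @ k

def llr_task_b(x, s0, s1):
    # Eq. (9): constant bias plus a term proportional to the energy.
    d = x.shape[1]
    bias = 0.5 * d * np.log(s0 ** 2 / s1 ** 2)
    return bias + 0.5 * (1.0 / s0 ** 2 - 1.0 / s1 ** 2) * (x ** 2).sum(axis=1)

def pearson(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spearman(a, b):
    # Rank correlation: Pearson correlation of the rank vectors.
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(a), rank(b))
```

A monotone but nonlinear calibration of the LLR leaves the Spearman correlation at 1 while lowering the Pearson correlation, which is exactly the Task B pattern described above.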
We\nverify this hypothesis in Appendix C.1 by evaluating the model on OOD tasks with significantly\nlarger nuisance shifts (σk = 9.0).", + "paper_id": "2603.10573", + "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context", + "authors": [ + "Faris Chaudhry", + "Siddhant Gadkari" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10573v1", + "chunk_index": 5, + "total_chunks": 29, + "char_count": 1633, + "word_count": 247, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0d52ebf3-c2e2-41fe-9e4e-773e114d8759", + "text": "Under these conditions, the correlation with the true LLR degrades\nto r = 0.567, demonstrating that the learned decision rule is a local approximation calibrated to the\ntraining support rather than an exact symbolic recovery. Nonetheless, the model does eventually\nbegin to generalize OOD, exhibiting a delayed rise in validation accuracy characteristic of partial\ngrokking. 3.2 ABLATIONS AND FAILURE MODES We isolate the necessary components for in-context learning by modifying the architecture and data\nstructure, as detailed in Table 1.", + "paper_id": "2603.10573", + "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context", + "authors": [ + "Faris Chaudhry", + "Siddhant Gadkari" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10573v1", + "chunk_index": 6, + "total_chunks": 29, + "char_count": 540, + "word_count": 80, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "973014ed-0e9b-4255-a0f5-cc4ba81cd97c", + "text": "Comprehensive results for all experimental conditions are provided\nin Appendix C.2. Table 1: Key Ablations (Task A). We test the necessity of specific architectural features. 
1) Permutation Invariance: Removing positional encodings (NoPos) has negligible impact, confirming\nthe model treats the context as a set rather than a sequence. 2) Learned Metric: Freezing attention\nweights (FrozenQK) destroys performance, indicating the model must learn a task-specific similarity metric. 3) Supervision: Shuffling labels (ShuffledLabels) causes collapse to random\nchance, ruling out unsupervised clustering heuristics. Model Variant Validation Accuracy Implication\nRegular (Baseline) 78.3 ± 0.3% —\nNoPos 78.2 ± 0.5% Permutation Invariant\nShuffledLabels 49.6 ± 1.2% Requires x → y mapping\nFrozenQK 49.6 ± 1.3% Requires Learned Metric\n4 MECHANISTIC EVIDENCE We now investigate how the model implements these statistical decision boundaries. Our analysis\nreveals that the model does not use a universal algorithm, but adapts its circuit depth to the task\ngeometry. First, a common hypothesis is that ICL performs nearest-neighbor smoothing (Han et al., 2025). To\ntest this, we compared the model's logits against a Nadaraya-Watson kernel regression estimator. The correlation is weak, confirming that the model is not merely averaging labels based on similarity, but computing a context-dependent sufficient statistic (e.g., centering by k). More details are\nprovided in Appendix C.3.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 7,
    "total_chunks": 29,
    "char_count": 1474,
    "word_count": 209,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "9d3c7019-9600-4548-ace3-91b395de5401",
    "text": "4.1 DECISION LATENCY AND LOGIT LENS Using the Logit Lens technique (nostalgebraist, 2020), we project intermediate residual states into\nthe vocabulary space. 
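In our binary setting the vocabulary space is a single logit, so the projection reduces to applying the final readout to each layer's residual state. A minimal sketch (function names are ours, and we omit the LayerNorm that a full logit-lens treatment would apply):

```python
import numpy as np

def logit_lens(residuals, w_readout, b_readout=0.0):
    # residuals: list of per-layer residual states, each of shape (n, d_model).
    # Returns one logit vector of shape (n,) per layer.
    return [h @ w_readout + b_readout for h in residuals]

def lens_correlations(residuals, w_readout, llr):
    # Pearson correlation of each intermediate logit with the analytical LLR.
    out = []
    for z in logit_lens(residuals, w_readout):
        zc = z - z.mean()
        lc = llr - llr.mean()
        out.append(float(zc @ lc / (np.linalg.norm(zc) * np.linalg.norm(lc))))
    return out
```

Plotting these per-layer correlations is what produces the decision-latency profiles discussed next.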
As shown in Figure 2 (Left), Task A exhibits an early decoding pattern: the\nrepresentation at Layer 1 shows a partial but decisive correlation with the final target. This suggests\nthat the model is performing a form of preprocessing or summary statistic calculation early in the\nnetwork which is then refined into a decision. In contrast, nonlinear tasks (Task B) show near-zero\ncorrelation until the final layer, indicating a need for deeper composition to estimate energy terms\n(∥x∥²). Figure 2: Mechanistic Adaptivity. (Left) Logit Lens (Task A): The correlation with the true\nLLR rises significantly in Layer 1, suggesting early linear decoding or aggregation. (Right) OV\nCircuit Alignment: In Task A (top), Layer 0 heads (e.g., Head 2) show strong positive alignment\n(> 0.7) with the logit direction, acting as a voting ensemble. In Task B (bottom), Layer 0 heads\nare effectively silent (< 0.26), implying that the model suppresses early voting to perform deeper\nsequential processing in Layer 1.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 8,
    "total_chunks": 29,
    "char_count": 1218,
    "word_count": 191,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "bfd5b389-b7e9-40b4-8f4f-6c1cfddb607e",
    "text": "Both OV circuits are taken from representative seeds; qualitatively\nsimilar behavior persisted across seeds. 4.2 HYPOTHESIS: ADAPTIVE CIRCUIT DEPTH AND VOTING ENSEMBLES We find that this decision latency manifests as distinct circuit architectures (Figure 2, Right). To\ninterpret the role of individual attention heads, we analyze their Output-Value (OV) circuits (Elhage\net al., 2021). 
The OV matrix WOV = WV WO determines how the features read by a head are\nprojected into the residual stream and, subsequently, the output logits. In Task A, Layer 0 heads exhibit strong positive alignment (| cos θ| > 0.7) with the final decision\ndirection. We hypothesize that in this linear regime, the model utilizes a greedy voting ensemble,\nwhere heads independently compute partial summary statistics (via forwarding and suppression)\nthat are linearly aggregated to form the decision boundary immediately. On the other hand, in Task B, Layer 0 heads are effectively silent regarding the decision (| cos θ|\nsmall). Significant alignment only emerges in Layer 1. This suggests a sequential algorithm where\nLayer 0 is suppressed or repurposed to compute intermediate features (e.g., squared norms) rather\nthan voting directly. 5 CONCLUSION, LIMITATIONS, AND FUTURE WORK Importantly, binary hypothesis testing provides a setting where mechanistic interpretability techniques can be compared to a known ground truth. We have demonstrated that toy Transformers\ntrained on dynamic hypothesis testing tasks can approximate the Neyman-Pearson optimal decision\nrule in-context. By adapting their internal circuit depth (e.g., employing greedy heuristics for linear\ntasks and sequential processing for nonlinear boundaries) the models recover a sufficient statistic\nthat is highly monotonically correlated with the LLR, matching the performance of a Bayes-optimal\noracle in the quadratic regime. 
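The OV-alignment measurement from Section 4.2 can be sketched as follows; the exact reduction of WOV to a single output direction is not fully specified in the text, so summarizing each head by its mean write direction is our assumption:

```python
import numpy as np

def ov_alignment(w_v, w_o, readout):
    # W_OV = W_V W_O maps features read by a head into the residual stream.
    w_ov = w_v @ w_o
    # Our assumption: summarize the head by its mean write direction.
    v = w_ov.mean(axis=0)
    denom = np.linalg.norm(v) * np.linalg.norm(readout)
    return float(v @ readout / denom)
```

A value near 1 means the head writes essentially along the logit (readout) direction, the voting-ensemble regime; a value near 0 means the head is silent with respect to the decision.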
While our controlled synthetic environment allows for exact analytical baselines,\nit relies on a small two-layer Transformer and relatively low-dimensional Gaussian data.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 9,
    "total_chunks": 29,
    "char_count": 2047,
    "word_count": 296,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0008e6ab-5ca7-41cd-a7b8-7c225ee3c4da",
    "text": "Consequently, it remains an open question to what extent these specific mechanistic behaviors—such as\nthe discrete shift from early voting ensembles to deeper sequential processing—scale to more general statistical tasks or even large language models operating on complex, real-world distributions. Furthermore, our mechanistic interpretability results, including the Logit Lens and OV circuit alignment, establish strong correlational evidence rather than strict causal proofs of the model's internal\nalgorithms. Future work incorporating causal interventions could further substantiate these structural hypotheses. Firstly, conditioning on the in-context dataset reduces each episode to a simple binary hypothesis test, for which the optimal decision rule is characterized by the likelihood-ratio test. A natural extension is to consider composite hypotheses, where class-conditional distributions depend on latent parameters that cannot be eliminated by conditioning alone. In such settings, optimal\ndecision-making requires either marginalization over nuisance parameters or plug-in estimation. Studying ICL in this regime would help distinguish whether models behave more like Bayesian\nmodel averaging or approximate maximum-likelihood estimators. 
Secondly, our experiments assume balanced class priors and symmetric loss, leading to decision\nthresholds centered at zero log-likelihood ratio. Extending the framework to asymmetric priors or\ncost-sensitive objectives would test whether ICL adapts not only the sufficient statistic but also the\noptimal decision threshold, as prescribed by statistical decision theory. Finally, binary hypothesis testing provides a minimal setting with sharp optimality guarantees. Extending the analysis to multi-class or sequential testing problems, such as multi-way likelihood-ratio\ntests or Wald-style sequential procedures, would probe whether ICL can recover more complex decision rules under uncertainty while retaining decision-theoretic interpretability. REFERENCES Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 10,
    "total_chunks": 29,
    "char_count": 2199,
    "word_count": 289,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "9a309282-2647-413c-940e-ff309215b9b1",
    "text": "What learning\nalgorithm is in-context learning? Investigations with linear models, 2023. Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, and Song Mei. Transformers as statisticians: Provable in-context learning with in-context algorithm selection, 2023. URL https://arxiv.\norg/abs/2306.04637. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal,\nAriel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. 
Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz\nLitwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec\nRadford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020. URL\nhttps://arxiv.org/abs/2005.14165. Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. Why can\nGPT learn in-context? language models implicitly perform gradient descent as meta-optimizers,\n2023. URL https://arxiv.org/abs/2212.10559. Harrison Edwards and Amos Storkey.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 11,
    "total_chunks": 29,
    "char_count": 1108,
    "word_count": 143,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5892dfe8-ee6b-4ad2-b6f3-08e499f1aac5",
    "text": "Towards a neural statistician, 2017. URL https:\n//arxiv.org/abs/1606.02185. Nelson Elhage, Neel Nanda, Catherine Olsson, et al. A mathematical framework for transformer circuits. Transformer Circuits\nThread, 2021. URL https://transformer-circuits.pub/2021/framework/\nindex.html. Shivam Garg, Dimitris Tsipras, Percy Liang, and Gregory Valiant. What can Transformers learn\nin-context? a case study of simple function classes, 2023. URL https://arxiv.org/abs/\n2208.01066. Chi Han, Ziqi Wang, Han Zhao, and Heng Ji. Understanding emergent in-context learning from a\nkernel regression perspective, 2025. URL https://arxiv.org/abs/2305.12766. 
Roee Hendel, Mor Geva, and Amir Globerson.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 12,
    "total_chunks": 29,
    "char_count": 628,
    "word_count": 74,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8f22a871-4fe8-429c-b62f-7e8c7da5170d",
    "text": "In-context learning creates task vectors, 2023. URL\nhttps://arxiv.org/abs/2310.15916. Erich L. Lehmann and Joseph P. Romano. Testing Statistical Hypotheses. Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 13,
    "total_chunks": 29,
    "char_count": 212,
    "word_count": 25,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "4d092c4d-0fbd-4449-9f89-45bfb513a9c9",
    "text": "Transformers as\nalgorithms: Generalization and stability in in-context learning, 2023. URL https://arxiv.\norg/abs/2301.07067. E. A. Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142,\n1964. doi: 10.1137/1109020. URL https://doi.org/10.1137/1109020. 
Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 14,
    "total_chunks": 29,
    "char_count": 347,
    "word_count": 39,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c6ca3747-88bf-442a-8444-5b54eba29fdd",
    "text": "Progress measures\nfor grokking via mechanistic interpretability, 2023. URL https://arxiv.org/abs/2301.\n05217. nostalgebraist. Interpreting GPT: the logit lens. https://www.lesswrong.com/posts/\nAcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens, Aug 31 2020. Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan,\nBen Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli,\nZac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane\nLovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish,\nand Chris Olah.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 15,
    "total_chunks": 29,
    "char_count": 612,
    "word_count": 74,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7b33b66e-26cb-4f59-927a-3aca00282d5c",
    "text": "In-context learning and induction heads, 2022. URL https://arxiv.org/\nabs/2209.11895. 
Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas\nGoldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria\nGarriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi\nSchoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders,\nDavid Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, and Tom McGrath. Open problems in mechanistic interpretability, 2025. URL https://arxiv.org/\nabs/2501.16496. Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 16,
    "total_chunks": 29,
    "char_count": 784,
    "word_count": 104,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "890869f7-808e-4de6-be77-e6d4697fe7d8",
    "text": "Function vectors in large language models, 2024. URL https://arxiv.org/abs/2310.\n15213. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,\nLukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023. URL https://arxiv.\norg/abs/1706.03762. Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient\ndescent, 2023. URL https://arxiv.org/abs/2212.07677. 
Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 17,
    "total_chunks": 29,
    "char_count": 580,
    "word_count": 72,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "44c5d09e-1fc5-4e8e-9687-1164440a67f9",
    "text": "An explanation of in-context\nlearning as implicit Bayesian inference, 2022. URL https://arxiv.org/abs/2111.\n02080. Ruiqi Zhang, Spencer Frei, and Peter L. Bartlett.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 18,
    "total_chunks": 29,
    "char_count": 154,
    "word_count": 20,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8b84ae28-b70e-4972-b453-a7354358765a",
    "text": "Trained transformers learn linear models in-context, 2023. URL https://arxiv.org/abs/2306.09927. A DERIVATION OF OPTIMAL TEST STATISTICS For completeness, we derive the analytical LLR for both tasks. Although the marginal problem\ninvolves latent task parameters ϕ, conditioning on the context C renders the hypotheses H0 and H1\nsimple for each episode. Classical Neyman-Pearson optimality therefore applies at the episode level,\nand the optimal decision statistic is given by the likelihood ratio conditioned on C. The following\nderivations make this dependence explicit for the two task families considered. 
A.1 TASK A: SHIFTED MEAN DISCRIMINATION The class-conditional distributions are isotropic Gaussians with means µ1 = µ + k\nand µ0 = −µ + k, and covariance Σ = I. For x ∼ N(m, I),\nlog p(x | m) = −(d/2) log(2π) − (1/2)∥x − m∥². (5)\nThe LLR is\nΛ(x) = −(1/2)∥x − (µ + k)∥² + (1/2)∥x − (−µ + k)∥² (6)\n= 2µ⊤x − 2µ⊤k. (7)\nThus, the optimal statistic is affine in µ⊤(x − k); correct classification requires centering with respect\nto the context-dependent shift. A.2 TASK B: VARIANCE DISCRIMINATION For centered Gaussians with variances σ1² and σ0²,\nlog p(x | σ) = −(d/2) log(2πσ²) − ∥x∥²/(2σ²). (8)\nThe LLR is\nΛ(x) = (d/2) log(σ0²/σ1²) + (∥x∥²/2)(1/σ0² − 1/σ1²). (9)\nThe first term is a constant bias, while the data-dependent term is proportional to the energy ∥x∥². Hence, the optimal statistic is purely quadratic.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 19,
    "total_chunks": 29,
    "char_count": 1451,
    "word_count": 248,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "606d86fb-b842-4dda-91ff-43f1d6079604",
    "text": "B EXPERIMENTAL DETAILS Code, results, and figures are available on GitHub. B.1 MODEL ARCHITECTURE We use a toy Transformer architecture designed for set-to-scalar tasks, which we refer to as\nICLTransformer. • Type: Bidirectional Transformer Encoder (PyTorch nn.TransformerEncoder).\n• Layers: 2\n• Attention Heads: 4\n• Embedding Dimension (dmodel): 128\n• Feedforward Dimension (dff): 512\n• Activation: GELU\n• Normalization: Post-LayerNorm (norm first=False)\n• Input Processing: The input x ∈R16 is linearly projected to dmodel. The binary label\ny ∈{0, 1} is projected via a separate learnable linear layer. 
These two projections are\ncombined via element-wise addition to form the final context token embedding, effectively\nbinding the label information to the input features via superposition.\n• Positional Encodings: Standard learned absolute positional embeddings are added to the\nsequence. B.2 TASK SPECIFICATIONS Data is generated on-the-fly during training. Each batch consists of B = 64 independent episodes.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 20,
    "total_chunks": 29,
    "char_count": 1073,
    "word_count": 155,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3cd080bc-257e-4d94-b3eb-a6c720420ae4",
    "text": "Task A: Shifted Mean (Linear). • Input Dimension: dx = 16. • Context Size: N = 32. • Latent Parameters:\n– Discriminative direction µ ∼ Unif(S^(dx−1)).\n– Nuisance shift k ∼ N(0, σk²Idx).\n• Shift Magnitude: σk = 3.0 (Training), σk = 9.0 (Out-of-distribution (OOD) evaluation). • Data Generation: x | y ∼ N(k + (2y − 1)µ, I). Task B: Variance (Nonlinear). • Input Dimension: dx = 16. • Context Size: N = 32. • Latent Parameters:\n– Class 0 Scale σ0 ∼ Unif[0.5, 3.0].\n– Class 1 Scale σ1 ∼ Unif[0.5, 3.0].\n• Data Generation: x | y ∼ N(0, σy²I). B.3 TRAINING HYPERPARAMETERS Models are trained to minimize the Binary Cross Entropy loss on the query label yq. • Optimizer: AdamW (β1 = 0.9, β2 = 0.999, weight decay 1e−4). • Scheduler: OneCycleLR.\n• Initial Learning Rate: 3 × 10−4. • Batch Size: 64 tasks per step. 
• Training Duration: 20 epochs.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 21,
    "total_chunks": 29,
    "char_count": 808,
    "word_count": 146,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7e4a6707-8e19-44c1-84d9-0f8d4f75c7a0",
    "text": "B.4 ABLATION VARIANTS To isolate the mechanism of in-context learning, we evaluated several model variants. Each variant\ntests a specific hypothesis regarding the inductive bias or information flow required for the task. Positional Encodings (NoPos, FrozenPos). Standard Transformers use positional encodings\nto process sequences. However, the statistical tasks (shifted mean, variance) are permutation invariant with respect to the context examples. That is, the learned decision rule should be permutation-invariant, just as sufficient statistics are. • ICLTransformerNoPos: We completely remove the learned positional embeddings\n(P = 0). This tests whether the model treats the context as a set rather than a sequence.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 23,
    "total_chunks": 29,
    "char_count": 712,
    "word_count": 98,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c7f3149e-ea70-45c5-9bc0-8fde1635f089",
    "text": "• ICLTransformerFrozenPos: We initialize positional embeddings randomly but\nfreeze them during training. This tests whether the model requires learned positional information or can utilize random absolute position markers. 
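The permutation-invariance premise behind the positional-encoding ablations can be checked directly: without positional embeddings, self-attention followed by mean pooling over the context is invariant to reordering the examples. A small numerical check (single head, our own notation):

```python
import numpy as np

def attend_and_pool(x, w_q, w_k, w_v):
    # Single-head self-attention without positional encodings, then mean pooling.
    scores = (x @ w_q) @ (x @ w_k).T / np.sqrt(x.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    a = np.exp(scores)
    a = a / a.sum(axis=1, keepdims=True)
    return (a @ (x @ w_v)).mean(axis=0)
```

Reordering the rows of x permutes the attention outputs row-for-row, so the pooled summary is unchanged; adding position-dependent embeddings would break this symmetry.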
Attention Mechanism (FrozenAttention, FrozenQK). We test whether the attention heads\nmust learn a task-specific metric space or if they can function as random associative memories.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 24,
    "total_chunks": 29,
    "char_count": 464,
    "word_count": 63,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "57b43a9c-676e-47c7-989a-f3099d5401df",
    "text": "• ICLTransformerFrozenQK: The Query (WQ) and Key (WK) projections are frozen at\ninitialization. Only the Value (WV ) and Output (WO) matrices are trainable. This enforces\na fixed, random similarity kernel.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 25,
    "total_chunks": 29,
    "char_count": 205,
    "word_count": 31,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d486f515-87a8-4098-90ba-a82256c70cd0",
    "text": "• ICLTransformerFrozenAttention: All attention weights (WQ, WK, WV , WO)\nare frozen. Only the feedforward MLPs and embedding projections are trainable. Tokenization Strategy (Interleaved). Our default architecture sums input and label embeddings: ei = Proj(xi) + Proj(yi), effectively binding the label to the input in a single token. • ICLTransformerInterleavedEmbeddings: We replace the bound representation\nwith a standard GPT-style interleaved sequence [x1, y1, x2, y2, . . . , xq]. 
This tests whether\nthe additive binding is a necessary inductive bias for efficient learning at this scale (2\nlayers). Label Dependence (NoLabels, ShuffledLabels). These ablations verify that the model is\nperforming supervised mapping (x →y) rather than unsupervised clustering (x →x). • ICLTransformerNoLabels: The context consists only of x vectors; y information is\nzeroed out. • ICLTransformerShuffledLabels: The y labels in the context are randomly permuted within the batch, destroying the specific xi →yi mapping while preserving the\nmarginal distribution of labels. • ICLTransformerNoisyLabels: During training, a fraction p of the context labels\nare flipped (0 ↔1).",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 26,
    "total_chunks": 29,
    "char_count": 1165,
    "word_count": 169,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f699a1ad-fd82-48b3-a714-309dafa96814",
    "text": "This tests the model's ability to aggregate evidence robustly despite\ncontradictory data points. C SUPPLEMENTARY EXPERIMENTAL RESULTS C.1 TASK A OOD GENERALIZATION ANALYSIS To assess whether the model has learned the exact symbolic form of the likelihood ratio or a local\napproximation, we evaluate it on an OOD task where the nuisance shift magnitude σk is increased from\n3.0 (training) to 9.0 (validation). Figure 3 presents the learning dynamics and final decision geometry for this OOD setting. • Generalization Gap (Left): While the training accuracy converges rapidly to ≈78% (consistent with the in-distribution baseline), the OOD validation accuracy lags significantly,\nplateauing at ≈64%. 
The delayed rise in validation accuracy suggests a form of partial\n\"grokking,\" where the model gradually refines its decision rule, but the persistent gap indicates that the learned mechanism does not fully capture the invariant symbolic structure\nneeded for perfect extrapolation. • Regression Degradation (Right): The correlation between the model's logits and the true\nLLR drops from r ≈0.86 (in-distribution) to r ≈0.57. The increased scatter suggests that\nthe model's internal approximation of the sufficient statistic (µ⊤(x −k)) is calibrated only\nfor the training support and becomes brittle under large shifts. Taken together, these results support the hypothesis that the Transformer implements an amortized\napproximate inference algorithm: it constructs a decision boundary that mimics the optimal LLR\ngeometry locally, but relies on heuristics that degrade when the task parameters drift far from the\ntraining distribution. C.2 FULL ABLATION RESULTS Table 2: Full Experimental Results. We report mean accuracy ± 95% CI over 3 seeds for all\nexperimental conditions. The oracle rows represent the theoretical upper bound (Bayes-Optimal\nClassifier) computed using the true latent task parameters.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 27,
    "total_chunks": 29,
    "char_count": 1960,
    "word_count": 287,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a11c6096-6494-4224-bc90-86bd6d49d8c4",
    "text": "The model is close to the oracle on Task\nB, while Task A ablations demonstrate the necessity of learned attention mechanisms. 
Experiment / Condition  Model Variant  Train Acc (%)  Val Acc (%)\nTheoretical Oracle\nTask A (Shifted Mean)  LLR  —  84.6 ± 1.0\nTask B (Variance)  LLR  —  84.0 ± 1.0\nMain Tasks\nTask A (Shifted Mean)  ICLTransformer  77.5 ± 1.1  78.3 ± 0.3\nTask B (Variance)  ICLTransformer  83.0 ± 0.2  83.0 ± 0.5\nTask A OOD (σk = 9.0)  ICLTransformer  77.5 ± 1.1  64.7 ± 4.8\nArchitecture Ablations (Task A)\nNo Positional Encodings  NoPos  77.5 ± 1.1  78.2 ± 0.5\nFrozen Positional Encodings  FrozenPos  77.5 ± 1.2  78.1 ± 0.6\nFrozen Attention Weights  FrozenAttention  49.9 ± 0.2  50.4 ± 0.7\nFrozen Q/K Projections  FrozenQK  49.7 ± 0.1  49.6 ± 1.3\nInterleaved Embeddings (x, y)  Interleaved  49.8 ± 0.3  49.4 ± 1.2\nData Structure Ablations (Task A)\nShuffled Context Pairs  ShuffledContext  77.5 ± 1.0  78.0 ± 0.6\nShuffled Labels Only  ShuffledLabels  49.8 ± 0.2  49.6 ± 1.3\nNo Labels  NoLabels  50.0 ± 0.1  50.2 ± 1.6\nIncreased Context Size  ICLTransformer  75.4 ± 4.7  75.9 ± 4.3\nLabel Noise Robustness (Task A)\nNoisy Labels (p = 0.1)  NoisyLabels  67.7 ± 11.2  70.2 ± 11.6\nNoisy Labels (p = 0.2)  NoisyLabels  52.1 ± 2.9  53.3 ± 5.7\nNoisy Labels (p = 0.4)  NoisyLabels  49.7 ± 0.2  49.7 ± 1.4\nC.3 COMPARISON WITH KERNEL REGRESSION\nTo verify that the model is performing algorithmic reasoning rather than simple pattern matching, we compare its outputs to a Nadaraya–Watson (Nadaraya, 1964) estimator using a dot-product kernel:\nŷ_KR(x_q) = Σ_{i=1}^{N} ( e^(x_q⊤ x_i) / Σ_j e^(x_q⊤ x_j) ) y_i   (10)\nAs shown in Figure 4, the correlation between the Transformer's logits and the Kernel Regression estimator is weak (ρ ≈ 0.33). This falsifies the hypothesis that the model is merely smoothing labels based on raw input similarity. In Task A, the optimal decision requires computing distances relative to a dynamic shift k, which a simple dot product kernel cannot capture without explicit centering. 
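The dot-product kernel estimator in Eq. (10) can be sketched in a few lines (a minimal numpy illustration added for this chunk, not the authors' code; array shapes are assumptions):

```python
import numpy as np

def kernel_regression(contexts, labels, query):
    """Nadaraya-Watson estimate with a dot-product (softmax) kernel, as in Eq. (10).

    contexts: (N, d) array of context inputs x_i
    labels:   (N,) array of context labels y_i
    query:    (d,) query input x_q
    """
    scores = contexts @ query                # x_q^T x_i for each context point
    weights = np.exp(scores - scores.max())  # numerically stabilized exponentials
    weights /= weights.sum()                 # softmax normalization over the context
    return float(weights @ labels)           # kernel-weighted average of labels
```

A query far from all but one context point collapses the estimate onto that point's label, which is exactly the similarity-based label smoothing the comparison argues the Transformer does not perform.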
C.4 LOGIT LENS ANALYSIS: TASK B\nIn contrast to the linear regime of Task A, where decision-relevant information emerges early in the residual stream, Task B exhibits a delayed decision profile. As illustrated in Figure 5, the correlation between the intermediate residual states and the LLR remains negligible (≈ 0) through Layer 0 and Layer 1.",
    "paper_id": "2603.10573",
    "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context",
    "authors": [
      "Faris Chaudhry",
      "Siddhant Gadkari"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10573v1",
    "chunk_index": 28,
    "total_chunks": 29,
    "char_count": 2274,
    "word_count": 391,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0a286857-dc1f-4436-aae2-80bc78000df0",
    "text": "A decisive spike in correlation appears only at the final output stage. This latency supports the hypothesis that nonlinear statistical inference requires a deeper, sequential circuit. We posit that the early layers are occupied with computing the necessary sufficient statistics (e.g., the quadratic energy term ∥x∥²) which are geometrically orthogonal to the final linear readout until fully assembled.\nFigure 3: OOD Generalization Degradation (Task A). (Left) Learning curves show a significant generalization gap: while the model masters the training distribution (blue), it struggles to extrapolate to large shifts (orange), achieving only partial generalization. (Right) The correlation with the true LLR degrades to r = 0.567; the learned decision rule is a local approximation rather than the exact symbolic LLR.\nFigure 4: Transformer logits vs. the Kernel Regression estimator. The low correlation indicates the model implements a more complex decision rule than similarity-based label smoothing.\nFigure 5: Logit Lens for Task B. 
The Pearson and Spearman correlations with the true LLR are\neffectively zero for the initial layers, spiking only at the final output. This confirms that the model\ndoes not perform a greedy linear approximation early in the network, but relies on the full depth of\nthe Transformer to construct some nonlinear decision boundary.", + "paper_id": "2603.10573", + "title": "Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context", + "authors": [ + "Faris Chaudhry", + "Siddhant Gadkari" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10573v1", + "chunk_index": 29, + "total_chunks": 29, + "char_count": 1448, + "word_count": 219, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10577_semantic.json b/data/chunks/2603.10577_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..2a521c94deb99dd5205e812bd3999479c0e6b234 --- /dev/null +++ b/data/chunks/2603.10577_semantic.json @@ -0,0 +1,362 @@ +[ + { + "chunk_id": "f8cdc6c2-65a3-4572-832d-0050609443bd", + "text": "CUAAudit: Meta-Evaluation of Vision-Language Models as\nAuditors of Autonomous Computer-Use Agents Marta Sumyk Oleksandr Kosovan\nsumyk.pn@ucu.edu.ua o.kosovan@ucu.edu.ua\nUkrainian Catholic University Ukrainian Catholic University\nLviv, Ukraine Lviv, Ukraine", + "paper_id": "2603.10577", + "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents", + "authors": [ + "Marta Sumyk", + "Oleksandr Kosovan" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10577v1", + "chunk_index": 0, + "total_chunks": 20, + "char_count": 256, + "word_count": 27, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a0d32aad-2222-4ff7-8381-78aeb8f88daf", + "text": "Abstract From a Human–Computer Interaction (HCI) perspective, CUAs\nComputer-Use Agents 
(CUAs) are emerging as a new paradigm in human-computer interaction, enabling autonomous execution of tasks in desktop environments by perceiving high-level natural-language instructions. As such agents become increasingly capable and are deployed across diverse desktop environments, evaluating their behavior in a scalable and reliable manner becomes a critical challenge. Existing evaluation pipelines rely on static benchmarks, rule-based success checks, or manual inspection, which are brittle, costly, and poorly aligned with real-world usage. In this work, we study Vision-Language Models (VLMs) as autonomous auditors for assessing CUA task completion directly from observable interactions and conduct a large-scale meta-evaluation of five VLMs that judge task success given a natural-language instruction and the final environment state. Our evaluation spans three widely used CUA benchmarks across macOS, Windows, and Linux environments and analyzes auditor behavior along three complementary dimensions: accuracy, calibration of confidence estimates, and inter-model agreement. We find that while state-of-the-art VLMs achieve strong accuracy and calibration, all auditors exhibit notable performance degradation in more complex or heterogeneous environments, and even high-performing models show significant disagreement in their judgments. These results expose fundamental limitations of current model-based auditing approaches and highlight the need to explicitly account for evaluator reliability, uncertainty, and variance when deploying autonomous CUAs in real-world settings.\nFrom a Human–Computer Interaction (HCI) perspective, CUAs extend a long line of work on interface agents and intelligent user interfaces, where users attribute intent, agency, and social meaning to interactive systems rather than viewing them as purely functional tools [7]. Recent systems further demonstrate that large vision-language models can act as unified controllers for complex desktop environments, generalizing across applications, tasks, and operating systems without relying on handcrafted rules [26]. As a result, CUAs offer a service-agnostic alternative to traditional robotic process automation, reducing brittleness and maintenance costs while supporting a broader range of real-world tasks [21]. Beyond automation, CUAs hold particular promise for accessibility and inclusive interaction. When paired with natural-language or voice interfaces, they enable users with motor, visual, or cognitive impairments to complete multi-step tasks through language alone [25, 28]. More broadly, CUAs can reduce cognitive and interaction burdens for non-technical users, older adults, and individuals facing language or executive-function challenges [3].\nAs CUAs are increasingly deployed in real-world settings, rigorous evaluation prior to deployment becomes essential. However, assessing CUA behavior remains a fundamental challenge. Existing evaluation pipelines rely on static benchmarks, rule-based success checks, or manual inspection, all of which are costly to maintain, brittle to interface changes, and poorly aligned with real-world usage [27]. Such approaches typically yield coarse success signals and 
provide limited insight into partial task completion, user-acceptable failures, or performance under realistic UI variation.",
    "paper_id": "2603.10577",
    "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents",
    "authors": [
      "Marta Sumyk",
      "Oleksandr Kosovan"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10577v1",
    "chunk_index": 1,
    "total_chunks": 20,
    "char_count": 3391,
    "word_count": 445,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "94326165-7999-4a9b-9d8d-19144a31e701",
    "text": "These limitations are especially concerning given that CUAs act autonomously on users' behalf, often across multiple applications and involving sensitive data.\nCCS Concepts\n• Human-centered computing → Human computer interaction (HCI); • Computing methodologies → Artificial intelligence; Natural language processing; Computer vision; Machine learning.\nKeywords\nComputer-Use Agents, Vision-Language Models, Human-Computer Interaction, Auditing, Task Completion, Evaluation\nIn this work, we study Vision-Language Models (VLMs) as autonomous auditors for CUAs. Rather than relying on internal agent states or handcrafted evaluation logic, VLM-based auditors assess task completion directly from observable evidence by judging whether a natural-language instruction has been satisfied in the final GUI state. We conduct a large-scale meta-evaluation of VLM auditors across multiple operating systems and benchmarks, analyzing their accuracy, confidence calibration, and inter-model agreement. 
By treating evaluation as a first-class problem, our study characterizes the reliability and limitations of model-based auditing and highlights key challenges for the safe and robust deployment of CUAs in real-world settings.\n1 Introduction\nRecent advances in large language models and multimodal perception have given rise to Computer-Use Agents (CUAs): autonomous systems that can operate Graphical User Interfaces (GUIs) by translating high-level natural-language instructions into sequences of actions such as clicking, typing, scrolling, and dragging [21].\nThis work has been accepted to appear at the AAAI 2026 Workshop on Trust and Control in Agentic AI (TrustAgent).\n2 Related Works\n2.1 Computer-Use Agents and GUI Automation\nResearch on CUAs builds on a long history of GUI automation, robotic process automation (RPA), and intelligent user interfaces. Conference'17, July 2017, Washington, DC, USA Marta Sumyk and Oleksandr Kosovan",
    "paper_id": "2603.10577",
    "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents",
    "authors": [
      "Marta Sumyk",
      "Oleksandr Kosovan"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10577v1",
    "chunk_index": 2,
    "total_chunks": 20,
    "char_count": 1955,
    "word_count": 261,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "59919815-3c1a-4235-a141-3a64d2329bb0",
    "text": "Early systems relied on handcrafted rules, application-specific scripts, DOM trees, or accessibility APIs to automate repetitive tasks. While effective in controlled settings, these approaches were brittle to interface changes, required substantial manual maintenance, and failed to generalize across applications or operating systems [3].\nMisaligned or unsafe behavior may have immediate and costly consequences, amplifying the need for reliable evaluation and auditing mechanisms.\n
Recent work has shifted toward learning-based approaches that operate directly on multimodal observations of the interface, typically combining screenshots with natural-language task instructions. This paradigm enables agents to interact with graphical user interfaces through the same perceptual and control channels available to human users. Systems such as SeeAct [29], InfiGUIAgent [16], SEAGENT [24], and UI-TARS [26] demonstrate that large vision-language models can act as general-purpose GUI controllers in a wide range of desktop and mobile environments.\nCollectively, these results show that CUAs can achieve substantial cross-application and cross-platform generalization without relying on application-specific APIs or predefined workflows. By treating the GUI as an executable visual environment rather than a structured programmatic interface, CUAs represent a departure from traditional automation pipelines and enable more flexible, service-agnostic interaction with existing software ecosystems.\n2.2 CUA as a New HCI Concept\nCUAs introduce an emerging interaction paradigm in which users delegate high-level goals to autonomous agents that perceive, reason, and act directly within existing GUIs. Unlike traditional interaction models based on direct manipulation [22], CUAs function as intermediaries that execute tasks on the user's behalf through the same visual and control channels available to humans.\nFrom an HCI perspective, CUAs build upon earlier work on interface agents and intelligent user interfaces, which explored how software agents could assist users through recommendations, reminders, or adaptive behavior [12, 17]. These systems, however, typically played a supportive or advisory role and relied on structured application access, predefined workflows, or handcrafted\n2.3 Agents Audit\nAs autonomous agents are increasingly deployed in real-world settings, systematically auditing their behavior has become a central concern. Agent auditing broadly refers to evaluating correctness, reliability, safety, and alignment with intended objectives, particularly in sequential and interactive environments [1, 6].\nTraditional agent evaluation has focused on structured environments such as simulators or benchmarks with explicit reward functions or success criteria. Related work on verification and testing explores formal methods, constraint checking, and adversarial stress testing, but similarly relies on structured state representations and predefined safety properties [10]. These assumptions often break down in open-ended, real-world interfaces.\nWith the rise of large language models and tool-using agents, recent work has explored evaluation under less structured conditions using human judgment, preference learning, or learned reward models [5, 19]. While effective in some contexts, these approaches often require human-in-the-loop supervision or access to agent internals, limiting scalability and applicability to complex GUI-based environments.\nMore recently, a small number of studies have begun to examine autonomous evaluation of CUAs [13, 23]. These works demonstrate the feasibility of model-based evaluators in realistic desktop settings, but remain limited in scope—typically focusing on a narrow set of tasks, metrics, or operating systems. As a result, key challenges such as cross-platform generalization, evaluator reliability, and robustness under diverse interaction patterns remain underexplored.\nOverall, CUAs expose a critical gap in existing agent auditing methodologies. They operate within unconstrained GUIs, interact with arbitrary third-party applications, and rely primarily on visual perception rather than structured environment states.",
    "paper_id": "2603.10577",
    "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents",
    "authors": [
      "Marta Sumyk",
      "Oleksandr Kosovan"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10577v1",
    "chunk_index": 3,
    "total_chunks": 20,
    "char_count": 4218,
    "word_count": 565,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "64dcaf9f-ee56-4aca-bae6-e39dc6a3fb04",
    "text": "In contrast, modern CUAs are designed for end-to-end task execution: given a natural-language instruction, the agent must interpret user intent, observe the current interface state, plan a sequence of actions, and adapt its behavior in dynamic and partially observable environments.\nThis shift places CUAs within the tradition of mixed-initiative interaction and human–automation collaboration, where control is shared between humans and autonomous systems [8, 9, 15]. However, CUAs push this paradigm further by substantially reducing direct user oversight during task execution. The graphical user interface becomes an executable environment rather than a passive display, and interaction is reframed as a sequential decision-making process over perceptual inputs and actions such as clicking, typing, scrolling, or dragging. This framing aligns CUAs with agent-based models of perception–action loops in interactive systems [20].\nAt the same time, increased autonomy introduces challenges central to HCI research on trust, safety, and usability. Prior work shows that reduced human control can lead to loss of transparency, over-reliance on automation, and difficulty diagnosing or recovering from failures [11, 18].\nConsequently, standard evaluation signals—such as environment rewards, API-level logs, or deterministic success checks—are often unavailable or unreliable. Given the potential for immediate and costly consequences from misaligned behavior [11, 18], these characteristics motivate the need for autonomous, scalable, and interface-aware auditing approaches that evaluate CUA behavior directly from observable interactions.\nUnlike prior work that evaluates a single auditor or a single platform, our study is the first to systematically analyze cross-platform generalization, confidence calibration, and inter-model disagreement of VLM auditors at scale.\n3 Methodology\n3.1 Vision-Language Model–Based Auditors\nWe study VLMs used as autonomous auditors for evaluating the task completion of CUAs. Given a task instruction and the final GUI state produced by an agent, a VLM auditor is prompted to assess whether the task has been successfully completed. The auditor",
    "paper_id": "2603.10577",
    "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents",
    "authors": [
      "Marta Sumyk",
      "Oleksandr Kosovan"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10577v1",
    "chunk_index": 4,
    "total_chunks": 20,
    "char_count": 2180,
    "word_count": 304,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a0e10e06-ddd2-42c1-aa36-0ce822d569a8",
    "text": "outputs a binary judgment (done or not done) together with an associated confidence score. 
Formally, for each task instance i, the auditor observes a tuple (x_i, d_i), where x_i denotes the final screenshot of the GUI environment and d_i is the natural-language task description. The auditor then predicts a probability p_i^(m) ∈ [0, 1], representing the model's confidence that the task was successfully completed, where m indexes the auditor model, and the corresponding predicted done / not done label is defined as ŷ_i^(m) ∈ {0, 1}.\nWe evaluate five VLMs as autonomous auditors, spanning both proprietary and open-source families. Among proprietary models, we consider GPT-4o and Claude 3.5 Sonnet, selected for their state-of-the-art multimodal perception and reasoning capabilities. For open-source auditors, we evaluate LLaVA-v1.5-7B [14], InternVL-2-8B [4], and Qwen2-VL-7B [2], which represent strong publicly available alternatives with diverse architectural designs and training regimes. These models differ substantially in architecture size, training data, and multimodal reasoning capabilities, enabling a broad analysis of auditor behavior.\n3.2 Benchmarks\nWe evaluate VLM auditors using three widely adopted benchmarks for CUAs: Windows Agent Arena, OSWorld, and macOSWorld. Together, these benchmarks cover a diverse set of real-world tasks across major desktop operating systems, including Windows, Linux, and macOS, and span a broad range of applications, interaction patterns, and task complexities.\nEach benchmark defines tasks via natural-language instructions and evaluates agent behavior based on task completion in realistic GUI environments. While the underlying environments differ in operating system and application ecosystem, all three benchmarks provide a binary notion of task success, indicating whether a task was successfully completed or not at the end of an episode.\n3.3 Calibration and Confidence Assessment\nBeyond binary correctness, we evaluate how well VLM auditors' confidence scores align with ground-truth task outcomes. Each auditor produces (i) a predicted probability of task success and (ii) a corresponding binary decision. Specifically, for each task instance i and auditor m, the model outputs a probability p_i^(m) ∈ [0, 1], which is thresholded to obtain a predicted label ŷ_i^(m) ∈ {0, 1}, where ŷ_i^(m) = 1 denotes a prediction of done and ŷ_i^(m) = 0 denotes not done. The ground-truth label provided by the benchmark is denoted y_i ∈ {0, 1}.\nWe measure calibration using the Brier score, a strictly proper scoring rule defined as\nBrier_m = (1/N) Σ_{i=1}^{N} (p_i^(m) − y_i)²,\nwhere N is the total number of evaluated tasks, together with the dispersion of the per-task squared errors,\nStd_m = sqrt( (1/N) Σ_{i=1}^{N} ((p_i^(m) − y_i)² − Brier_m)² ).\nSince the Brier score is a squared-error metric, lower values correspond to better calibration. Likewise, a lower Std_m indicates more stable calibration across tasks.\n3.4 Inter-Model Agreement\nBeyond correctness and calibration, we analyze the extent to which different VLM auditors agree in their judgments of task completion. Inter-model agreement captures the consistency of auditing decisions across models and provides insight into task ambiguity and evaluator subjectivity, particularly in settings where success criteria may not be fully observable from the final GUI state. For each pair of auditors (m, m′), we measure agreement on the 
binary predictions ŷ_i^(m) ∈ {0, 1} using Cohen's κ coefficient.",
    "paper_id": "2603.10577",
    "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents",
    "authors": [
      "Marta Sumyk",
      "Oleksandr Kosovan"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10577v1",
    "chunk_index": 5,
    "total_chunks": 20,
    "char_count": 3698,
    "word_count": 536,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a59634e9-278e-4b6d-beff-10afc1c75b12",
    "text": "In our study, we adopt this binary done / not done task outcome provided by each benchmark as ground-truth supervision. Formally, for each task instance i, the benchmark assigns a ground-truth label y_i ∈ {0, 1}, where y_i = 1 denotes that the task is deemed done by the benchmark's official evaluation protocol, and y_i = 0 denotes not done. These labels serve as the reference against which we assess the correctness, calibration, and agreement of VLM-based auditors. By relying on benchmark-provided success signals rather than human annotations, we ensure scalability and reproducibility of our evaluation while enabling systematic comparison across operating systems and task domains.\nCohen's κ accounts for agreement occurring by chance and is defined as\nκ = (p_o − p_e) / (1 − p_e),\nwhere p_o denotes the observed agreement rate between two auditors and p_e denotes the expected agreement under independence. Values of κ range from −1 to 1, with higher values indicating stronger agreement and κ = 0 corresponding to chance-level agreement.\nWe compute pairwise κ scores separately for each benchmark and operating system, enabling an analysis of how agreement varies across environments and task distributions. High inter-model agree
ment suggests that task completion is visually and semantically unambiguous in the final GUI state, whereas low agreement indicates cases where success is difficult to infer, multiple interpretations are plausible, or auditors rely on different implicit assumptions.\n2 https://openai.com/index/hello-gpt-4o/\n3 https://claude.com/product/overview",
    "paper_id": "2603.10577",
    "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents",
    "authors": [
      "Marta Sumyk",
      "Oleksandr Kosovan"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10577v1",
    "chunk_index": 6,
    "total_chunks": 20,
    "char_count": 1637,
    "word_count": 229,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "42cff03a-5ba5-498b-9655-14a3d98460a9",
    "text": "Agreement is highest between proprietary auditors, indicating relatively consistent judgments in assessing task completion. Agreement between proprietary and open-source models is markedly lower, while agreement among open-source models remains moderate. Across all auditor pairs, agreement decreases on Windows Agent Arena and OSWorld, suggesting that harder or more ambiguous tasks amplify subjective differences in auditor judgments. These results indicate that even high-performing auditors may disagree substantially in complex environments, underscoring the importance of studying auditor variance rather than relying on a single model. By explicitly analyzing inter-model agreement, we move beyond single-model evaluation and characterize the variance and uncertainty inherent in model-based auditing of CUAs.\nFigure 1: Accuracy of VLM auditors across benchmarks, ordered by increasing mean accuracy across macOSWorld, Windows Agent Arena, and OSWorld.\n4 Results\nIn this section, we present an evaluation of five VLMs as auditors of CUAs across three operating systems (macOS, Windows, and Linux). Our analysis focuses on three complementary aspects: (i) accuracy of task completion assessment, (ii) calibration of confidence estimates, and (iii) inter-model agreement.\nAccuracy of Task Completion Assessment. Table 1 reports the accuracy of VLM auditors in predicting benchmark-provided done / not done labels. Overall, proprietary models outperform open-source alternatives across all benchmarks, with GPT-4o and Claude 3.5 Sonnet achieving the highest accuracy. Performance varies substantially across operating systems: all auditors perform best on macOSWorld, while accuracy drops notably on Windows Agent Arena and OSWorld.\n5 Discussion and Limitations\nOur results indicate that while VLM-based auditing of CUAs is feasible, auditor outputs should be interpreted as uncertain signals rather than definitive judgments. In particular, calibration quality and inter-model agreement provide critical information about auditor reliability that is not captured by accuracy alone. In practical settings, auditor confidence is often used to guide downstream decisions such as whether to request user confirmation, abstain from judgment, or trigger fallback behaviors. Auditors that achieve high accuracy but exhibit poor calibration may therefore still introduce risk by overstating certainty in ambiguous cases.\nInter-model disagreement further highlights the inherent difficulty of inferring task completion from a final GUI state alone. Many CUA tasks depend on hidden system state, background effects, or transient interface changes that may not be visible in a single screenshot. As a result, different auditors may rely on different implicit assumptions when judging success, leading to divergent but individually plausible decisions. 
Rather than being treated purely",
    "paper_id": "2603.10577",
    "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents",
    "authors": [
      "Marta Sumyk",
      "Oleksandr Kosovan"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10577v1",
    "chunk_index": 7,
    "total_chunks": 20,
    "char_count": 2877,
    "word_count": 393,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "59e28955-5e5b-4792-8e66-ba44e49ee71c",
    "text": "as noise, such disagreement can serve as a signal of task ambiguity or insufficient observability, suggesting that additional evidence may be required for reliable evaluation.\nThis suggests that auditing difficulty is strongly influenced by environment complexity and interaction diversity, rather than by auditor architecture alone. Among open-source models, InternVL-2-8B and Qwen2-VL-7B consistently outperform LLaVA-v1.5-7B, but still lag behind proprietary models. These results indicate that while open-source VLMs can function as auditors, their reliability remains limited in more complex or heterogeneous environments.\nCalibration and Confidence Reliability. Beyond accuracy, reliable auditing requires that confidence scores meaningfully reflect uncertainty. Table 2 reports Brier scores (mean ± standard deviation) for each auditor, where lower values indicate better calibration. Proprietary models exhibit substantially lower Brier scores across all benchmarks, indicating more reliable confidence estimates. In contrast, open-source models tend to be overconfident or poorly calibrated, particularly on Windows Agent Arena and OSWorld. Notably, calibration quality does not always track accuracy: some models with comparable accuracy exhibit significantly different Brier scores. This highlights that binary correctness alone is insufficient to characterize auditor reliability, especially in safety-critical or deployment settings where confidence estimates inform downstream decisions.\nInter-Model Agreement. To assess consistency across auditors, we computed pairwise inter-model agreement using Cohen's κ.\nThis study has several limitations. We restrict auditors to observing only the task instruction and final GUI state, which reflects a scalable and deployment-relevant setting but may underestimate performance for tasks where intermediate actions or temporal context are essential. Our calibration analysis relies on model-reported confidence elicited through standardized prompting, since token-level log probabilities are not consistently accessible across VLMs; consequently, we evaluate the reliability of reported uncertainty rather than intrinsic probabilistic calibration. Finally, we focus exclusively on binary task completion and do not address other important auditing dimensions such as safety, policy compliance, privacy, or harmful side effects, which are critical for real-world deployment of autonomous agents.\n6 Conclusions\nWe conducted a large-scale meta-evaluation of VLMs as autonomous auditors for CUAs across three widely used benchmarks spanning macOS, Windows, and Linux. Our results reveal several consistent patterns that have important implications for how model-based evaluation should be designed, reported, and used in practice. 
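The Brier scores and Cohen's κ values reported in the tables can be recomputed from paired per-task predictions; a minimal sketch (not the paper's evaluation code; aligned per-task arrays are an assumption):

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between predicted success probabilities and 0/1 outcomes."""
    p = np.asarray(probs, dtype=float)
    y = np.asarray(outcomes, dtype=float)
    return float(np.mean((p - y) ** 2))

def cohen_kappa(preds_a, preds_b):
    """Cohen's kappa for two binary raters: (p_o - p_e) / (1 - p_e)."""
    a = np.asarray(preds_a)
    b = np.asarray(preds_b)
    p_o = float(np.mean(a == b))                   # observed agreement
    pa, pb = float(np.mean(a)), float(np.mean(b))  # marginal rates of predicting 1
    p_e = pa * pb + (1 - pa) * (1 - pb)            # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)                 # undefined when p_e == 1
```

Lower Brier means better calibration; κ = 1 means perfect agreement, κ = 0 chance-level agreement, matching the definitions in Sections 3.3 and 3.4.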
CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents Conference'17, July 2017, Washington, DC, USA Table 1: Accuracy of task competion assesment by VLM auditors across benchmarks.", + "paper_id": "2603.10577", + "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents", + "authors": [ + "Marta Sumyk", + "Oleksandr Kosovan" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10577v1", + "chunk_index": 8, + "total_chunks": 20, + "char_count": 3008, + "word_count": 389, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8555532c-a98b-4d25-a6ff-3ed7761fa891", + "text": "GPT-4o 0.91 0.71 0.77\nClaude 3.5 Sonnet 0.89 0.75 0.79 InternVL-2-8B 0.85 0.69 0.72\nLLaVA-v1.5-7B 0.82 0.66 0.68\nQwen2-VL-7B 0.87 0.68 0.73 Table 2: Calibration of VLM auditors measured by Brier score (mean ± std) across benchmarks. Auditor macOSWorld WindowsAgentArena OSWorld Proprietary Auditors\nGPT-4o 0.058 ± 0.003 0.091 ± 0.006 0.074 ± 0.004\nClaude 3.5 Sonnet 0.063 ± 0.004 0.099 ± 0.007 0.081 ± 0.005 Open-Source Auditors\nInternVL-2-8B 0.097 ± 0.007 0.142 ± 0.010 0.118 ± 0.008\nLLaVA-v1.5-7B 0.112 ± 0.008 0.159 ± 0.012 0.134 ± 0.009\nQwen2-VL-7B 0.105 ± 0.008 0.167 ± 0.011 0.141 ± 0.010 Table 3: Pairwise inter-model agreement of VLM auditors measured using Cohen's 𝜅across benchmarks. 
Model A Model B macOSWorld WindowsAgentArena OSWorld GPT-4o Claude 3.5 Sonnet 0.76 0.66 0.71 Proprietary vs Open-Source Auditors", + "paper_id": "2603.10577", + "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents", + "authors": [ + "Marta Sumyk", + "Oleksandr Kosovan" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10577v1", + "chunk_index": 10, + "total_chunks": 20, + "char_count": 822, + "word_count": 128, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e5f0c465-99fd-4549-a2b5-3c67a7cf8db5", + "text": "GPT-4o InternVL-2-8B 0.64 0.57 0.61\nGPT-4o LLaVA-v1.5-7B 0.61 0.54 0.59\nGPT-4o Qwen2-VL-7B 0.66 0.58 0.63\nClaude 3.5 Sonnet InternVL-2-8B 0.67 0.59 0.64\nClaude 3.5 Sonnet LLaVA-v1.5-7B 0.63 0.56 0.66\nClaude 3.5 Sonnet Qwen2-VL-7B 0.69 0.61 0.6 InternVL-2-8B LLaVA-v1.5-7B 0.62 0.55 0.60\nInternVL-2-8B Qwen2-VL-7B 0.68 0.60 0.65\nLLaVA-v1.5-7B Qwen2-VL-7B 0.64 0.67 0.61 First, auditor performance is strongly environment-dependent. reporting and testing that reflects realistic domain shift rather than\nAll evaluated models achieve substantially higher accuracy on ma- averaged metrics alone.\ncOSWorld than on Windows Agent Arena and OSWorld, indicating Second, confidence calibration emerges as a critical and indethat auditing difficulty is shaped not only by auditor architecture pendent axis of auditor reliability. Proprietary VLMs exhibit conbut also by interface heterogeneity, visual ambiguity, and task diver- sistently lower Brier scores and more stable confidence estimates,\nsity across operating systems and applications. As a result, single while open-source models are often poorly calibrated, particularly\naggregated performance scores can obscure meaningful failure on more challenging benchmarks. Importantly, calibration does\nmodes. 
Reliable auditing therefore requires environment-specific not always correlate with accuracy: auditors may make correct Conference'17, July 2017, Washington, DC, USA Marta Sumyk and Oleksandr Kosovan", + "paper_id": "2603.10577", + "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents", + "authors": [ + "Marta Sumyk", + "Oleksandr Kosovan" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10577v1", + "chunk_index": 11, + "total_chunks": 20, + "char_count": 1449, + "word_count": 191, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "53f6ea96-abcd-4488-8d87-e918e448a82d", + "text": "judgments while expressing overconfident or unreliable probabili- arXiv:2510.18596 [cs.SE] https://arxiv.org/abs/2510.18596\nties. This distinction is essential for downstream use, where auditor [14] Haotian Liu, Chunyuan Li, et al. 2024. LLaVA 1.5: Improved Multimodal Reasoning\nand Instruction Following. arXiv preprint arXiv:2401.02410 (2024).\nconfidence may guide decisions such as when to request user con- [15] Yang LIU. 2025. A new human-computer interaction paradigm: Agent interaction\nfirmation, defer execution, or trigger safer fallback policies. model based on large models and its prospects. Virtual Reality & Intelligent\nThird, we observe substantial inter-model disagreement, espe- Hardware 7, 3 (2025), 237–266. doi:10.1016/j.vrih.2025.04.001 [16] Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu,\ncially on Windows Agent Arena and OSWorld. This disagreement Shengyu Zhang, Xiaotian Han, Hongxia Yang, and Fei Wu. 2025. InfiGUIAreflects the inherent ambiguity of judging task completion from gent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection.\na final GUI state alone. Many tasks involve hidden state changes, [17] Pattie Maes. 1994. Agents that Reduce Work and Information Overload. 
Commun.\nbackground effects, or success criteria that are not fully observ- ACM 37, 7 (1994), 30–40.\nable in a single screenshot, leading different auditors to resolve [18] Donald A. The Design of Everyday Things. Doubleday, New York.\n[19] Long Ouyang, JeffWu, Xu Jiang, Diogo Almeida, Carroll L.", + "paper_id": "2603.10577", + "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents", + "authors": [ + "Marta Sumyk", + "Oleksandr Kosovan" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10577v1", + "chunk_index": 12, + "total_chunks": 20, + "char_count": 1542, + "word_count": 216, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "05416d84-a04d-44fa-a0de-d917d9755720", + "text": "Wainwright, Pamela\nuncertainty differently. Rather than being treated as noise, disagree- Mishkin, Chong Zhang, Sandhini Agarwal, et al. 2022. Training Language Models\nment can serve as an informative signal, highlighting ambiguous to Follow Instructions with Human Feedback. Advances in Neural Information\ntasks, implicit benchmark assumptions, or cases where additional Processing Systems (NeurIPS) 35 (2022).\n[20] Stuart Russell and Peter Norvig. 2010. Artificial Intelligence: A Modern Approach\nevidence beyond the final state is required. (3rd ed.). Taken together, these findings suggest concrete implications [21] Pascal J. Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan\nEtaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F. Grewe,\nfor both benchmarking and deployment. Benchmarks would ben- and Thilo Stadelmann. 
2025.", + "paper_id": "2603.10577", + "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents", + "authors": [ + "Marta Sumyk", + "Oleksandr Kosovan" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10577v1", + "chunk_index": 13, + "total_chunks": 20, + "char_count": 861, + "word_count": 116, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "615bd70c-4da5-4696-98ef-d73a12580c78", + "text": "A Comprehensive Survey of Agents for Computer\nefit from providing richer, verifiable evidence of success—such as Use: Foundations, Challenges, and Future Directions. arXiv:2501.16150 [cs.AI]\nstructured logs, intermediate states, or checkable artifacts, for tasks https://arxiv.org/abs/2501.16150\n[22] Ben Shneiderman. 1983. Direct Manipulation: A Step Beyond Programming Lanwhere the final GUI state is insufficient. In deployment, oriented guages. Ablex Publishing, Norwood, NJ.\nevaluation, metrics aligned with safety and reliability, such as cal- [23] Marta Sumyk and Oleksandr Kosovan. 2025. \"Are We Done Yet?\": A Visionibration quality, robustness under domain shift, and consistency Based Judge for Autonomous Task Completion of Computer Use Agents. arXiv:2511.20067 [cs.AI] https://arxiv.org/abs/2511.20067\nacross evaluators, should be prioritized over accuracy alone. [24] Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua\nOverall, while VLM-based auditing of CUAs is feasible and pro- Lin, and Jiaqi Wang. 2025. SEAgent: Self-Evolving Computer Use Agent with\nAutonomous Learning from Experience. arXiv:2508.04700 [cs.AI] https://arxiv.\nprietary models currently provide the strongest accuracy and cali- org/abs/2508.04700\nbration, our results show substantial degradation and disagreement [25] Minh Duc Vu, Han Wang, Zhuang Li, Jieshan Chen, Shengdong Zhao, Zhenin more complex environments. 
These findings underscore that chang Xing, and Chunyang Chen. 2024.", + "paper_id": "2603.10577", + "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents", + "authors": [ + "Marta Sumyk", + "Oleksandr Kosovan" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10577v1", + "chunk_index": 14, + "total_chunks": 20, + "char_count": 1491, + "word_count": 196, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "246ca04f-f209-4e42-a9a5-eb7932d6d12c", + "text": "GPTVoiceTasker: Advancing Multi-step\nMobile Task Efficiency Through Dynamic Interface Exploration and Learning.\nevaluation itself is a central bottleneck for dependable CUA deploy- arXiv:2401.14268 [cs.HC] doi:10.1145/3654777.3676356\nment and must be treated as a first-class research problem, with [26] Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting\nexplicit modeling of evaluator uncertainty, variance, and ambiguity. Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong,\nYining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Aoyan Li, Bo Li,\nChen Dun, Chong Liu, Daoguang Zan, Fuxing Leng, Hanbin Wang, Hao Yu,\nReferences HaobinPeiyao Zhao,Chen,PengfeiHongyiLiu,Guo,QinghaoJing Su,Ye,JingjiaRenjieHuang,Zheng,KaiShulinShen,Xin,KaiyuWayneShi,XinLinZhao,Yan,\n[1] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Wen Heng, Wenhao Huang, Wenqian Wang, Xiaobo Qin, Yi Lin, Youbin Wu,\nDan Mané. 2016. Concrete Problems in AI Safety. arXiv preprint arXiv:1606.06565 Zehui Chen, Zihao Wang, Baoquan Zhong, Xinchun Zhang, Xujing Li, Yuanfan\n(2016). Li, Zhongkai Zhao, Chengquan Jiang, Faming Wu, Haotian Zhou, Jinlin Pang,\n[2] Shuai Bai et al. 2024. Qwen2-VL: A Versatile Vision-Language Model for Under- Li Han, Qi Liu, Qianli Ma, Siyao Liu, Songhua Cai, Wenqi Fu, Xin Liu, Yaohui\nstanding and Generation. 
arXiv preprint arXiv:2409.12191 (2024). Wang, Zhi Zhang, Bo Zhou, Guoliang Li, Jiajun Shi, Jiale Yang, Jie Tang, Li Li,\n[3] Jeffrey P.", + "paper_id": "2603.10577", + "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents", + "authors": [ + "Marta Sumyk", + "Oleksandr Kosovan" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10577v1", + "chunk_index": 15, + "total_chunks": 20, + "char_count": 1518, + "word_count": 208, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fc722e24-d53f-45d3-9e63-7798fd78884e", + "text": "Accessibility and Assistive Technology. ACM Qihua Han, Taoran Lu, Woyu Lin, Xiaokang Tong, Xinyao Li, Yichi Zhang, Yu\n63, 4 (2020), 54–63. doi:10.1145/3386296 Miao, Zhengxuan Jiang, Zili Li, Ziyuan Zhao, Chenxin Li, Dehua Ma, Feng Lin,\n[4] Xiaoyi Chen et al. 2024. InternVL 2.0: Scaling Up Vision-Language Pretraining Ge Zhang, Haihua Yang, Hangyu Guo, Hongda Zhu, Jiaheng Liu, Junda Du, Kai\nand Benchmarking. arXiv preprint arXiv:2405.07961 (2024). Cai, Kuanye Li, Lichen Yuan, Meilan Han, Minchao Wang, Shuyue Guo, Tianhao\n[5] Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Cheng, Xiaobo Ma, Xiaojun Xiao, Xiaolong Huang, Xinjie Chen, Yidi Du, Yilin\nDario Amodei. 2017.", + "paper_id": "2603.10577", + "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents", + "authors": [ + "Marta Sumyk", + "Oleksandr Kosovan" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10577v1", + "chunk_index": 16, + "total_chunks": 20, + "char_count": 700, + "word_count": 108, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fcadf747-aa8b-4a0d-b61f-27ad4a7c424b", + "text": "Deep Reinforcement Learning from Human Preferences. 
Chen, Yiwen Wang, Zhaojian Li, Zhenzhu Yang, Zhiyuan Zeng, Chaolin Jin, Chen\nAdvances in Neural Information Processing Systems (NeurIPS) 30 (2017). Li, Hao Chen, Haoli Chen, Jian Chen, Qinghao Zhao, and Guang Shi. 2025. UI-\n[6] Finale Doshi-Velez and Been Kim. 2017. Towards a Rigorous Science of Inter- TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement\npretable Machine Learning. arXiv preprint arXiv:1702.08608 (2017). Learning. arXiv:2509.02544 [cs.AI] https://arxiv.org/abs/2509.02544\n[7] Jodi Forlizzi, John Zimmerman, Vince Mancuso, and Sonya Kwak. 2007. How [27] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng\ninterface agents affect interaction between humans and computers. In Pro- Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng\nceedings of the 2007 Conference on Designing Pleasurable Products and Inter- Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu.\nfaces. Association for Computing Machinery, New York, NY, USA, 209–221. 2024. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in\ndoi:10.1145/1314161.1314180 Real Computer Environments. arXiv:2404.07972 [cs.AI] https://arxiv.org/abs/\n[8] Marti A. Mixed-Initiative Interaction. IEEE Intelligent Systems 14, 5 2404.07972\n(1999), 14–23. [28] Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang,\n[9] Eric Horvitz. 1999. Principles of Mixed-Initiative User Interfaces. Proceedings Minghua Ma, Guyue Liu, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang,\nof the ACM SIGCHI Conference on Human Factors in Computing Systems (1999), and Qi Zhang. 2025. Large Language Model-Brained GUI Agents: A Survey.\n159–166. 
arXiv:2411.18279 [cs.AI] https://arxiv.org/abs/2411.18279\n[10] Guy Katz, Clark Barrett, David L.", + "paper_id": "2603.10577", + "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents", + "authors": [ + "Marta Sumyk", + "Oleksandr Kosovan" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10577v1", + "chunk_index": 17, + "total_chunks": 20, + "char_count": 1859, + "word_count": 252, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a7de5fee-e427-4c27-84a8-54fb8bae0740", + "text": "Dill, Kyle Julian, and Mykel J. Kochenderfer. [29] Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. 2024.", + "paper_id": "2603.10577", + "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents", + "authors": [ + "Marta Sumyk", + "Oleksandr Kosovan" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10577v1", + "chunk_index": 18, + "total_chunks": 20, + "char_count": 114, + "word_count": 20, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "88a8eb4f-8709-40e2-b1f8-cccd30590603", + "text": "Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In is a Generalist Web Agent, if Grounded. arXiv:2401.01614 [cs.IR] https://arxiv. Proceedings of the 29th International Conference on Computer Aided Verification org/abs/2401.01614\n(CAV). 
Springer, 97–117.\n[11] John D.", + "paper_id": "2603.10577", + "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents", + "authors": [ + "Marta Sumyk", + "Oleksandr Kosovan" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10577v1", + "chunk_index": 19, + "total_chunks": 20, + "char_count": 289, + "word_count": 38, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "bd88548f-f5ce-417e-ae25-32ab64280314", + "text": "Trust in Automation: Designing for\nAppropriate Reliance. Human Factors 46, 1 (2004), 50–80.\n[12] Henry Lieberman. 1997. Autonomous Interface Agents. Proceedings of the ACM\nConference on Computers and Human Interaction (CHI) (1997), 67–74.\n[13] Haojia Lin, Xiaoyu Tan, Yulei Qin, Zihan Xu, Yuchen Shi, Zongyi Li, Gang Li,\nShaofei Cai, Siqi Cai, Chaoyou Fu, Ke Li, and Xing Sun. 2025. CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent.", + "paper_id": "2603.10577", + "title": "CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents", + "authors": [ + "Marta Sumyk", + "Oleksandr Kosovan" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10577v1", + "chunk_index": 20, + "total_chunks": 20, + "char_count": 464, + "word_count": 70, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10582_semantic.json b/data/chunks/2603.10582_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..7b109c4d67ba44edd68b7a3d19211f9b3ddab512 --- /dev/null +++ b/data/chunks/2603.10582_semantic.json @@ -0,0 +1,452 @@ +[ + { + "chunk_id": "eb6fa4dc-9564-4257-a266-965929206675", + "text": "Jannis Maier jannis.maier@helmholtz-berlin.de\nHelmholtz-Zentrum Berlin\nTU Dortmund Lennart Purucker purucker@cs.uni-freiburg.de\nPrior Labs\nUniversity of Freiburg\n2026 Abstract 
Ensembling is commonly used in machine learning on tabular data to boost predictive\nperformance and robustness, but larger ensembles often lead to increased hardware demand. We introduce HAPEns, a post-hoc ensembling method that explicitly balances accuracy\nagainst hardware efficiency. Inspired by multi-objective and quality diversity optimization,\nHAPEns constructs a diverse set of ensembles along the Pareto front of predictive performance\nand resource usage. Existing hardware-aware post-hoc ensembling baselines are not available,\nhighlighting the novelty of our approach. Experiments on 83 tabular classification datasets\nshow that HAPEns significantly outperforms baselines, finding superior trade-offs for ensemble\nperformance and deployment cost. Ablation studies also reveal that memory usage is a\nparticularly effective objective metric. Further, we show that even a greedy ensembling\nalgorithm can be significantly improved in this task with a static multi-objective weighting\nscheme.\nFigure 1: Illustration of three ensemble selection strategies: a standard method ignoring hardware\nconstraints, a naive hardware-aware variant that sacrifices accuracy, and an advanced hardware-aware method\nthat balances accuracy and efficiency. 
Box size reflects model resource usage; the red dashed line indicates\nthe hardware resource constraint.", + "paper_id": "2603.10582", + "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data", + "authors": [ + "Jannis Maier", + "Lennart Purucker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10582v1", + "chunk_index": 1, + "total_chunks": 25, + "char_count": 1738, + "word_count": 225, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b41d120a-5d4b-49c3-85f9-78fc59cd872e", + "text": "Ensembling is a central technique in machine learning, used to improve predictive performance, stability, and\nrobustness across a wide range of applications. From boosting and bagging in classical supervised learning to\nstacking in modern deep learning workflows, ensembles are frequently adopted to combine the strengths of In many practical scenarios, models produced during training or exploratory analysis are later\ncombined into ensembles in a post-hoc fashion to substantially improve performance (Erickson et al., 2025;\nArango et al., 2025). This workflow has been further popularized by automated machine learning (AutoML)\nsystems for tabular data (Purucker & Beel, 2023; He et al., 2021; Erickson et al., 2020), where greedy ensemble\nselection (GES) by Caruana et al. (2004) has emerged as a widely used method to automatically build strong\nensembles from model libraries. While post-hoc ensembling generally improves predictive performance, larger ensembles lead to increased\nhardware demands at inference time. 
Each additional model increases prediction latency and resource\nconsumption, inducing higher costs.", + "paper_id": "2603.10582", + "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data", + "authors": [ + "Jannis Maier", + "Lennart Purucker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10582v1", + "chunk_index": 2, + "total_chunks": 25, + "char_count": 1121, + "word_count": 159, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e17240c7-ebdb-42ec-85dd-408ab9b9ab34", + "text": "While this matters greatly in production settings, it is ignored by\nstandard post-hoc ensembling methods. As machine learning is increasingly deployed in environments with\ntight resource constraints, the gap between high predictive accuracy and hardware feasibility has become\nmore pronounced. We address this challenge by introducing HAPEns, a post-hoc ensembling method that explicitly balances\npredictive performance against hardware costs. It improves on existing baselines by constructing Pareto\nfronts of ensembles that more effectively balance competing objectives. Thus, practitioners can select\nbetter models that satisfy both performance and deployment requirements under their specific hardware\nconstraints. Drawing inspiration from multi-objective optimization (Gunantara, 2018) and quality diversity\noptimization (Pugh et al., 2016), HAPEns maintains a diverse population of ensembles that vary in both\nhardware cost and predictive behavior, while optimizing for predictive performance. The result is a set of\ncandidate ensembles that offer distinct trade-offs between both objectives. 
To evaluate HAPEns, we performed experiments on 83 tabular classification datasets of varying size and\ncomplexity.", + "paper_id": "2603.10582", + "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data", + "authors": [ + "Jannis Maier", + "Lennart Purucker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10582v1", + "chunk_index": 3, + "total_chunks": 25, + "char_count": 1213, + "word_count": 161, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c72007fe-896a-4fbc-bbec-1e9a8a8f1ed9", + "text": "All cost metrics are concrete measurements taken on the same system, held fixed across methods\nfor fair comparison. We compare ensembles constructed by our method to those selected by baselines like\nGES and a novel multi-objective baseline. Our findings reveal that optimizing for memory footprint is a\nparticularly effective metric for deployment cost and that our method significantly outperforms competitors\nin balancing hardware costs and predictive performance. To our knowledge, this is the first systematic study\nof hardware-aware post-hoc ensemble selection. Prior hardware-aware work focuses on model search or NAS,\nnot ensemble construction from fixed model libraries. In this work, we: (i) Propose a novel post-hoc ensembling algorithm that explicitly\nincorporates hardware cost into the selection process; (ii) Demonstrate through extensive benchmarking\nthat our method achieves superior accuracy–cost trade-offs compared to existing baselines; (iii) Show that\nmemory-awareness yields substantial gains even in inference-time efficiency; (iv) Limitations, including the\ndependence on a single hardware configuration, are discussed, with directions for extending the method to\nheterogeneous devices. (v) Ensure reproducibility by open-sourcing all code1, results, and integration with\npopular ensembling frameworks. 
Ensembling—combining multiple pre-trained models—is an effective approach to improve predictive performance and robustness. Common strategies include bagging, stacking, and ensemble selection (ES). Bagging\nand stacking are typically integrated into the training process, whereas ES can be applied post hoc, that\nis, after model training has completed. ES might then also be referred to as post hoc ensembling. Figure 2\nprovides an overview of the research fields discussed in the following paragraphs.", + "paper_id": "2603.10582", + "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data", + "authors": [ + "Jannis Maier", + "Lennart Purucker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10582v1", + "chunk_index": 4, + "total_chunks": 25, + "char_count": 1828, + "word_count": 250, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8aaa4e41-a0d4-416a-8b69-7aea12f93d2e", + "text": "ES as introduced by Caruana et al. (2004) is a forward selection algorithm that greedily constructs an\nensemble by iteratively adding the model that improves the predictive performance of the ensemble the most. The resulting ensemble is defined by a weight vector derived from this superset of selected models. In this\nwork, we adopt a broader interpretation of ensemble selection: Any algorithm that produces such a weight\nvector from a pool of trained models qualifies as ES. To distinguish the classical algorithm of Caruana et al.,\nwe refer to it as greedy ensemble selection (GES). 1All code used for this publication is available at: https://anonymous.4open.science/r/C07F Multi-Objective\nAutoML Ensemble Selection\nOptimization Hardware Aware\nPost-Hoc Ensembling GES\nMachine Learning Hardware-Aware QDO-ES\nNAS Modality: Tabular Data Hardware-Aware Post-Hoc Ensembling Figure 2: Overview of the main research areas. HW-NAS (red) is shown as a parallel area, while the others\n(orange) directly influence HAPEns. 
This work focuses solely on tabular data (blue).", + "paper_id": "2603.10582", + "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data", + "authors": [ + "Jannis Maier", + "Lennart Purucker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10582v1", + "chunk_index": 5, + "total_chunks": 25, + "char_count": 1064, + "word_count": 157, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "bf56ddcf-ad50-47fb-af85-e21715a0c218", + "text": "Post-hoc ensembling is a widely adopted component in automated machine learning (AutoML) systems,\nparticularly for tabular data (Erickson et al., 2020; Feurer et al., 2015; Purucker & Beel, 2023). It enables\nthe reuse of models generated during training without retraining, making it a computationally attractive\nfinal optimization step. Although ensemble selection can theoretically be used during training, we reserve the\nterm ES for its post-hoc usage in this work. The term blending also appears in this context, but it specifically\nrefers to ensemble selection applied to a holdout validation set distinct from the training data. Recent years have seen the integration of multi-objective optimization (MOO) into various stages of the\nmachine learning and AutoML pipelines, including neural architecture search (NAS) (Benmeziane et al.,\n2021b;a). These methods optimize for trade-offs such as accuracy versus latency, energy consumption,\nor memory usage. However, the use of MOO techniques in post-hoc ensemble selection remains largely\nunexplored. Shen et al. (2022) introduced DivBO, a diversity-aware Bayesian optimization framework that\nincorporates ensemble selection during candidate evaluation to promote both accuracy and diversity. Although\ntheir approach targets the model search stage rather than post-hoc optimization, it highlights the potential\nof multi-objective formulations to improve ensemble composition. 
Nevertheless, to the best of our knowledge,\nno prior work systematically investigates the construction of Pareto-optimal ensembles that explicitly account\nfor hardware constraints such as inference time or memory usage. Modern implementations of GES—still the de facto standard in AutoML frameworks such as Autosklearn (Feurer et al., 2022) and AutoGluon (Erickson et al., 2020)—typically optimize only for predictive\nperformance and remain agnostic to deployment cost.", + "paper_id": "2603.10582", + "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data", + "authors": [ + "Jannis Maier", + "Lennart Purucker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10582v1", + "chunk_index": 6, + "total_chunks": 25, + "char_count": 1897, + "word_count": 264, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5a0057c4-3183-4508-bff3-65b218b9598b", + "text": "Consequently, they may produce ensembles that are\nunnecessarily large or infeasible for deployment due to hardware requirements. Our work addresses this identified gap by introducing a hardware-aware approach to ES, explicitly targeting\nthe trade-off between accuracy and resource usage. QDO-ES as developed by Purucker et al. (Purucker et al.,\n2023) inspired HAPEns and the inclusion cost metrics during ensemble construction. In doing so, we extend\nthe utility of ES beyond predictive performance to deployment and real-world use. One of the last steps in the ML pipeline is model generation, where human experts or AutoML systems\nexplore and evaluate various configurations. This process yields a set of candidate models, typically followed\nby the selection of the single best model for deployment. Post-hoc ensembling instead aims to improve the\nquality of the prediction by combining multiple candidates from this set. Let M = {M1, . . . 
, Mp} be the library of models and let cj be the number of times Mj is selected out of a\ntotal of T picks (with repetition). Define the weight vector w:\nw = (w1, . . . , wp)⊤ = (1/T)(c1, . . . , cp)⊤, wj = cj/T, ∑j wj = 1. (1)\nThe ensemble predictor for input x is fens(x) = ∑_{j=1}^p wj fj(x). This formulation applies broadly: for regression,\neach fj(x) is a scalar prediction; for probabilistic classification, fj(x) is a vector of class probabilities, and\nfens(x) is the averaged probability vector. Although the ensemble predictor is ultimately defined by a weight vector, there are multiple ways to construct\nit. A common method is GES, which uses a forward selection strategy to iteratively build the ensemble by\ngreedily adding models that improve performance the most. In contrast, our work explores a population-based\napproach.",
    "paper_id": "2603.10582",
    "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data",
    "authors": [
      "Jannis Maier",
      "Lennart Purucker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10582v1",
    "chunk_index": 7,
    "total_chunks": 25,
    "char_count": 1774,
    "word_count": 295,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "4c9c4d8f-7c0d-47af-bf30-4687576f9e72",
    "text": "We begin by sampling an initial population of ensembles across a two-dimensional behavior space (e.g.,\nmemory footprint vs. average loss correlation). Each ensemble is evaluated and stored in a niche corresponding\nto its behavior and hardware costs. New ensembles are generated by selecting suitable parents from these\nniches and applying crossover and mutation (see Figure 3). This process repeats until convergence or a\ntime/iteration limit is reached, allowing us to explore a wide range of model combinations and discover\nPareto-optimal trade-offs between prediction quality and deployment cost. 
Detailed definitions of these concepts follow, similar to those outlined by Purucker et al. (2023). Figure 3: Illustration of the HAPEns search process. Ensembles are sampled from bins over memory\nfootprint and average loss correlation, then evolved via crossover and mutation to explore the behavior space. Each ensemble E is assigned a two-dimensional descriptor b(E) = (ALC, HW), where\nALC is the mean Pearson correlation among the loss vectors of its constituent models and HW is a cost\nmetric aggregated over those models. Following prior work (Purucker et al., 2023), we divide this 2D space\ninto a 7 by 7 grid using a sliding bounding archive (Fontaine et al., 2019), creating 49 bins (niches). The\nalgorithm allows ensembles to compete only within the same niche. This ensures that different regions of the behavior space can retain their best solutions. (Footnote 2: The 7×7 grid follows the setup of Purucker et al. (2023), balancing behavior space coverage against niche sample density. Sensitivity to this choice is mitigated by the sliding boundaries archive (Fontaine et al., 2019), which adapts niche boundaries to\nthe observed solution distribution; a full sensitivity analysis is left for future work.)",
    "paper_id": "2603.10582",
    "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data",
    "authors": [
      "Jannis Maier",
      "Lennart Purucker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10582v1",
    "chunk_index": 8,
    "total_chunks": 25,
    "char_count": 1906,
    "word_count": 291,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d4f1c218-d7d5-4033-b4f6-4e5b3dc0fa65",
    "text": "Therefore, a diverse population across two objectives is\nmaintained while optimizing predictive performance. 
Here we found that memory as a cost metric produces\nensembles which are best at trading off predictive performance and hardware cost. Each ensemble E is scored by a scalar loss L(E) on cross-validation data. The behavior space is\npartitioned into fixed niches or bins, and each niche retains the lowest loss ensemble observed. The parents are selected from the archive using a combined dynamic strategy that balances\nexploration and exploitation. The method alternates between deterministic selection of the best solution\nand stochastic selection of random solutions, with the selection probability dynamically adjusted every ten\niterations based on which approach yields better results. Deterministic and tournament-based selection\nmethods are also available as alternatives. Two parent repetition vectors r and r′ are recombined using two-point crossover restricted\nto the index set S = {i | r[i] > 0} ∪{i | r′[i] > 0}, i.e., indices nonzero in either parent. If |S| < 3 or the\nresulting offspring is all-zero, average crossover is used instead, rounding counts up to the nearest integer to\nmaintain valid repetition counts. The procedure is detailed in Algorithm 1. A single element of the child repetition vector rchild is incremented by one, with the index\nj chosen uniformly at random from the model pool P of size p. 
Rejection sampling avoids re-evaluating\npreviously seen ensembles, allowing up to 50 retries; if all retries fail, an emergency brake increases the\nincrement magnitude or raises the mutation-after-crossover probability to 1.0 for the current iteration to\nescape the duplicate region.", + "paper_id": "2603.10582", + "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data", + "authors": [ + "Jannis Maier", + "Lennart Purucker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10582v1", + "chunk_index": 9, + "total_chunks": 25, + "char_count": 1716, + "word_count": 263, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f421b8f3-220a-4e4b-878a-b8a9fa58706b", + "text": "The main objective of this paper is to compare the proposed method and the baselines on how well they\ncan balance predictive performance and hardware costs. In this context, only Pareto optimal ensembles are\nrelevant. In addition, there is no best ensemble because choosing the right trade-off depends on the real\nworld scenario. Therefore, our main focus lies with the Pareto fronts of ensembles generated by each method. Our proposed method uses memory usage as its hardware-aware behavior metric.", + "paper_id": "2603.10582", + "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data", + "authors": [ + "Jannis Maier", + "Lennart Purucker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10582v1", + "chunk_index": 10, + "total_chunks": 25, + "char_count": 499, + "word_count": 79, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3a989356-3cb8-4e81-9334-7fae948e4702", + "text": "We compare it with four\nbaselines: • Single-Best: A naive baseline that selects the single model with the highest validation performance. Including Single-Best highlights the performance gains achieved through ensembling. 
• GES*: Our implementation of greedy ensemble selection (GES) enhanced to return the entire\nsequence of ensembles generated during its run. GES* therefore represents the best-case performance\nof the original widely used GES, providing a strong reference point for assessing improvements. • Multi-GES: Our implementation of a novel multi-objective extension of GES that enables the algorithm\nto balance predictive performance and inference time using a static weighting scheme; see Appendix A.1\nfor details on our implementation. Multi-GES introduces a naive approach to optimizing multiple\nobjectives in ensemble selection and allows us to assess the benefits of our more flexible formulation. • QDO-ES: The quality-diversity optimization ensemble selection method (Purucker et al., 2023),\nwhich optimizes for performance and behavioral diversity but is not hardware aware. This baseline\nisolates the effect of hardware awareness by comparing against a method that can already generate\nvarious Pareto-optimal ensembles without considering resource costs. To assess the quality of the generated Pareto fronts, we rely on two standard multi-objective indicators:\ninverted generational distance plus (IGD+) (Ishibuchi et al., 2015) and hypervolume (HV) (Zitzler & Thiele,\n1999). IGD+ quantifies how well a set of solutions approximates a reference front, which in our case is\nconstructed from the Pareto-optimal solutions of all the methods under comparison. HV measures the portion\nof the objective space dominated by a set of solutions (see Appendix B for details). The set of solutions here is the\nset of ensembles constructed by one method for a given task and seed. 
Both HV and IGD+ are widely",
    "paper_id": "2603.10582",
    "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data",
    "authors": [
      "Jannis Maier",
      "Lennart Purucker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10582v1",
    "chunk_index": 11,
    "total_chunks": 25,
    "char_count": 1903,
    "word_count": 281,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "88d4bbb1-2397-444a-bba7-6a4512a8da19",
    "text": "used in multi-objective evaluation, and for our experiments we employ the pygmo (Biscani & Izzo, 2020)\nimplementation. We focus primarily on HV in the main analysis because we do not have a true Pareto front\nfor IGD+, and both metrics lead to the same conclusions. Figure 4: Scatter plot of datasets over their number of features (y), number of samples (x), and the number\nof classes (color).",
    "paper_id": "2603.10582",
    "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data",
    "authors": [
      "Jannis Maier",
      "Lennart Purucker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10582v1",
    "chunk_index": 12,
    "total_chunks": 25,
    "char_count": 484,
    "word_count": 87,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "35dfac06-3d18-4733-9d6b-2484bd84ba40",
    "text": "The ROC AUC and cost metrics were normalized per seed and task using min-max normalization over all\nmethods. This makes the results comparable across experiments even after selecting specific methods per\nexperiment. To ensure a comprehensive and reproducible evaluation, we organize our experiments into three\ngroups: (1) Main Results, (2) Details, and (3) Ablation (shown in Table 1). 
We conducted our experiments using TabRepo (Salinas & Erickson, 2023), which provides\nprecomputed model predictions for 1,530 model configurations across 211 tabular datasets, enabling large-scale, reproducible simulation of post-hoc ensemble selection. We used the D244_F3_C1530_100 context,\na pre-configured evaluation setup that covers 100 of these datasets; after excluding regression tasks, 83\nclassification datasets remained. We aggregated results for 10 seeds to account for run-to-run variance. Their\ncharacteristics are shown in Figure 4, revealing a wide variety of class, sample, and feature counts. The available base models are plotted in Figure 5 with their inference times, illustrating the diversity of the\nmodel pool—from cheap linear and boosting methods to computationally expensive transformers—providing\na realistic and varied set of candidates for ensemble construction. Since each component model with non-zero\nweight must be evaluated independently at inference time, every such model incurs its full hardware cost. Therefore, the hardware cost of an ensemble under weight vector w equals the sum of the hardware costs of\nall models with non-zero weight, i.e., Σ_{j: wj ≠ 0} hj, where hj denotes the hardware cost of model Mj.",
    "paper_id": "2603.10582",
    "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data",
    "authors": [
      "Jannis Maier",
      "Lennart Purucker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10582v1",
    "chunk_index": 13,
    "total_chunks": 25,
    "char_count": 1632,
    "word_count": 238,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "57487ff0-8c1e-4411-9e41-686ea4cf26d0",
    "text": "We first present the main results, followed by detailed analyses, and finally ablation studies. The central\nfocus is on the ability of each method to balance two objectives: predictive performance and hardware cost. 
In particular, identifying a single strong ensemble may be less effective than discovering several competitive\nensembles that trade off these objectives differently. In general, HAPEns consistently outperforms the baselines\nin both Main Results (EXP1) and Details (EXP2, EXP3, EXP4), demonstrating its superior ability to produce\ncompetitive ensembles while incorporating hardware awareness. Figure 5: Comparison of TabRepo's model types and their corresponding inference times for varying tasks. KNeighbors and linear regression are expectedly on the lower end of the spectrum, while transformers have\nincreased cost due to their complexity.",
    "paper_id": "2603.10582",
    "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data",
    "authors": [
      "Jannis Maier",
      "Lennart Purucker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10582v1",
    "chunk_index": 14,
    "total_chunks": 25,
    "char_count": 1018,
    "word_count": 131,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f787e231-9cfd-490b-85c7-d167cb8a7829",
    "text": "Figure 6: HAPEns significantly outperforms the baselines on HV; Single-Best is significantly outperformed by all other methods. Figure 7: HAPEns significantly outperforms the baselines on IGD+; Single-Best is significantly outperformed by all other methods.",
    "paper_id": "2603.10582",
    "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data",
    "authors": [
      "Jannis Maier",
      "Lennart Purucker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10582v1",
    "chunk_index": 15,
    "total_chunks": 25,
    "char_count": 379,
    "word_count": 57,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b58d0a37-dae2-4b55-9657-d7884f404c8b",
    "text": "EXP1 Figure 6 shows a critical difference (CD) diagram (Demšar, 2006; Herbold, 2020) summarizing the\naverage ranks of the methods evaluated based on their HV values. The HV was calculated from the inverted\nROC AUC on the test data and the averaged normalized3 cost metrics (inference time, memory, and disk\nusage)—collectively referred to as the hardware aggregate. Therefore, this figure provides an overview for all\nthe datasets, model configurations, and cost metrics we explored in our tests. To simplify the presentation\nand highlight the overall trade-off between predictive performance and hardware costs, we aggregate the\nthree hardware measures into a single score. This avoids overemphasizing any single metric, while keeping\nthe focus on the general notion of hardware efficiency. 
In the CD plot, methods connected by a horizontal bar are statistically indistinguishable according to the\nNemenyi post-hoc test.",
    "paper_id": "2603.10582",
    "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data",
    "authors": [
      "Jannis Maier",
      "Lennart Purucker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10582v1",
    "chunk_index": 16,
    "total_chunks": 25,
    "char_count": 921,
    "word_count": 138,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e1ef2369-9849-4327-a2e4-8cdc396575db",
    "text": "HAPEns shows significantly superior performance to the baselines, which makes it\nthe best method to balance the trade-off between predictive performance and hardware costs. Between the\nbaselines, we do not see significant differences except for the single-best method, which simply picks the best\nmodel configuration based on its ROC AUC. A single-best model is not well suited for this setting because it\ncannot capture diverse trade-offs between predictive performance and different hardware costs, which multiple\nensembles can exploit more effectively. We see slight improvements in GES* over QDO-ES, which can be (Footnote 3: Min-max normalization applied after averaging over folds and seeds.)",
    "paper_id": "2603.10582",
    "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data",
    "authors": [
      "Jannis Maier",
      "Lennart Purucker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10582v1",
    "chunk_index": 17,
    "total_chunks": 25,
    "char_count": 684,
    "word_count": 100,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ba2a5d66-9996-446c-84ed-9c73c322b254",
    "text": "(a) Disk usage. (b) Memory usage. Figure 8: Critical difference plots for the hypervolume across different hardware-aware objectives.",
    "paper_id": "2603.10582",
    "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data",
    "authors": [
      "Jannis Maier",
      "Lennart Purucker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10582v1",
    "chunk_index": 18,
    "total_chunks": 25,
    "char_count": 300,
    "word_count": 45,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "30f695b8-9fe0-43ce-9d05-1d6eb708d5d9",
    "text": "attributed to the modification of GES to return all intermediate ensembles, which generally leads to a higher\nnumber of ensembles produced (see Figure 14). This improvement over the standard procedure of returning\nthe final ensemble gives GES a strong edge here. Multi-GES performs slightly higher, but insignificantly so, by\nconstructing ensembles with reduced hardware costs while keeping their predictive performance comparable\nto GES*. A discussion on GES*'s overfitting problem and the corresponding cost-to-performance trade-off\nfollows in the Multi-GES ablation part of this section. 
Details (EXP2, EXP3, EXP4) EXP2 In Figure 7, the IGD+ results are generally consistent with the HV findings.",
    "paper_id": "2603.10582",
    "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data",
    "authors": [
      "Jannis Maier",
      "Lennart Purucker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10582v1",
    "chunk_index": 19,
    "total_chunks": 25,
    "char_count": 699,
    "word_count": 101,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "89f9f6f3-7432-44ba-820c-76909b615c9d",
    "text": "The main difference is\nthe stronger relative performance of Multi-GES, which now significantly outperforms GES and comes close to\nmatching HAPEns, to the point that HAPEns's superiority is no longer statistically significant. This effect\narises because Multi-GES constructs more efficient ensembles, while QDO-ES primarily improves predictive\nperformance (Figure 9) but at the cost of building more expensive ensembles on average. Since IGD+\nevaluates solutions with respect to a reference front, Multi-GES benefits disproportionately: a larger share of\nits efficient solutions lies on the reference Pareto front, reducing the relative advantage of HAPEns compared\nto dominated HV. For this reason, we focus on HV in the remainder of the paper, while noting that Multi-GES\nis particularly strong at exploring the low-cost end of the Pareto front. EXP3 Looking at the HV results for the individual cost metrics in Figure 8, we see in more detail what was\nalready evident in the main results: HAPEns performs strongly across all metrics. The method demonstrates\nrobustness to different hardware considerations, even when the behavior space is defined solely by memory\nusage. Notably, Multi-GES shows a significant improvement over the other baselines when optimizing for\ninference time. This highlights its specialization toward the specific cost metric it uses during ensemble\nconstruction. 
It also raises the question of whether incorporating additional cost metrics could lead to further\nimprovements—but we will leave this for future work. Since our experiments abstract away from specific\nhardware configurations, these findings should be viewed as preliminary. Overall, these results point to an\ninteresting direction for future research that investigates hardware-aware behavior more directly under diverse\nconfigurations and cost measures.",
    "paper_id": "2603.10582",
    "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data",
    "authors": [
      "Jannis Maier",
      "Lennart Purucker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10582v1",
    "chunk_index": 20,
    "total_chunks": 25,
    "char_count": 1844,
    "word_count": 267,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f7a40644-2a59-4869-a7cd-3cfec5920a6b",
    "text": "Figure 9: Comparison of constructed ensembles when including cost metrics in the ensembling process. The\nbaselines and the hardware-aware methods in the density plots produce a clear trend, where the ensembles\nof the latter methods are more condensed toward the x-axis. Figure 10: Comparison of different cost metrics used for HAPEns. Memory and inference time perform\nstrongest, but ensemble size is still notable as a proxy cost metric, which does not need additional measurements. EXP4 Figure 9 shows a density plot of the ensembles constructed by the different methods. Compared to\nSingle-Best, all ensemble methods increase hardware costs but also yield clear gains in predictive performance. Multi-GES reduces hardware costs relative to GES*, confirming its intended effect. 
QDO-ES and HAPEns\nproduce similar overall trends, but the ensembles of HAPEns are more concentrated along the x-axis, indicating\nlower resource usage. These observations clarify and reinforce the improvements of HAPEns over QDO-ES in\nterms of hardware efficiency, and likewise of Multi-GES over GES*. Overall, the inclusion of cost metrics in\nthe ensemble construction process achieves the desired shift toward more efficient ensembles.", + "paper_id": "2603.10582", + "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data", + "authors": [ + "Jannis Maier", + "Lennart Purucker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10582v1", + "chunk_index": 21, + "total_chunks": 25, + "char_count": 1458, + "word_count": 217, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fdb9c192-f49b-4715-99fc-3f21d8fccf21", + "text": "GES* produces 10–15\nmore ensembles on average than HAPEns, yet fewer of them lie on the Pareto front, indicating that many\nof its ensembles are not useful in this context. QDO-ES and HAPEns both generate a high ratio of unique\nensembles, illustrating the effectiveness of the behavior space in promoting diversity. 
By contrast, Multi-GES\nproduces fewer ensembles overall and fewer unique ensembles than GES*, which aligns with the increased\ndifficulty of adding models once hardware costs are incorporated into the selection process.",
    "paper_id": "2603.10582",
    "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data",
    "authors": [
      "Jannis Maier",
      "Lennart Purucker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10582v1",
    "chunk_index": 23,
    "total_chunks": 25,
    "char_count": 533,
    "word_count": 82,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f0bd999c-056b-4e58-a886-650c75da1438",
    "text": "Figure 11: Comparison of static weights for Multi-GES highlighting the trade-off between predictive performance and hardware costs. Ablation (EXP5, EXP6) EXP5 We further evaluated HAPEns with four different cost metrics: inference time, memory usage, disk\nusage, and ensemble size.",
    "paper_id": "2603.10582",
    "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data",
    "authors": [
      "Jannis Maier",
      "Lennart Purucker"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10582v1",
    "chunk_index": 24,
    "total_chunks": 25,
    "char_count": 645,
    "word_count": 70,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "88326e85-3416-4b2a-828f-aaef11d33cb4",
    "text": "The last serves only as a proxy cost metric, yet Figure 10 shows that it still\nprovides a competitive signal to balance the trade-off, without requiring additional measurements. 
Among the\ntrue cost metrics, memory usage and inference time consistently lead to the strongest results, with memory\nshowing a slight edge. These findings highlight that, while the size of the ensemble can act as a lightweight\napproximation, the use of actual cost metrics yields the most reliable improvements. EXP6 In Figures 11 and 15 we investigate the effect of different static weightings in Multi-GES. By gradually\nincreasing the weight on the inference time, the constructed ensembles shift from high-performing but more\nexpensive configurations toward ensembles with lower hardware costs. This transition is clearly visible in\nthe density plots, where the mass of ensembles moves closer to the origin of the objective space as the\nemphasis on inference time increases. The trade-off between predictive performance and efficiency becomes\napparent: a stronger emphasis on time reduces costs but slightly lowers predictive accuracy, while a weaker\nemphasis maintains accuracy at the expense of efficiency. In Figure 11 we see a sweet spot, where excessively\nhigh or low time weights yield sub-par performance relative to intermediate weightings. For comparison in\nthe main results, we chose the best performing weight: 0.68. These results confirm that Multi-GES allows\npractitioners to explicitly control the desired balance between performance and hardware costs through a\nweighting mechanism, highlighting its flexibility for different deployment scenarios. This work introduced HAPEns, a hardware-aware post hoc ensemble selection method that explicitly balances\npredictive performance and deployment efficiency. By integrating cost metrics into the ensemble construction\nprocess, HAPEns extends ensemble selection into a multi-objective framework that explores the Pareto front\nof accuracy and resource usage. 
Across 83 tabular classification datasets, HAPEns consistently outperforms\nexisting baselines, achieving superior trade-offs under controlled hardware measurement conditions and\ndemonstrating robustness across different cost metrics. Ablation studies reveal that memory usage is a\nparticularly effective objective, providing a stable optimization signal and leading to ensembles that generalize\nwell across cost measures. Additionally, our experiments show that even simple greedy methods like GES can\nbenefit substantially from static multi-objective weighting, emphasizing the broad potential of hardware-aware\nensemble construction.", + "paper_id": "2603.10582", + "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data", + "authors": [ + "Jannis Maier", + "Lennart Purucker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10582v1", + "chunk_index": 25, + "total_chunks": 25, + "char_count": 2633, + "word_count": 367, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4f1729ad-3586-42b2-b1dc-ec4eb1a0ce79", + "text": "To our knowledge, this is the first systematic study of hardware-aware post hoc\nensemble selection, opening a new research direction for the AutoML and tabular ML communities. Future\nwork may explore dynamic weighting schemes, simultaneous optimization across multiple hardware objectives,\ntask-specific hardware profiling, real-device benchmarking, and integration into end-to-end AutoML pipelines. L.P. 
acknowledges funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)\nunder SFB 1597 (SmallData), grant number 499552394", + "paper_id": "2603.10582", + "title": "HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data", + "authors": [ + "Jannis Maier", + "Lennart Purucker" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10582v1", + "chunk_index": 26, + "total_chunks": 25, + "char_count": 549, + "word_count": 69, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10588_semantic.json b/data/chunks/2603.10588_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..b172d5d9c0f78997dfe01fb07426c634a5a2050f --- /dev/null +++ b/data/chunks/2603.10588_semantic.json @@ -0,0 +1,577 @@ +[ + { + "chunk_id": "24ae018c-aa90-4537-a9e2-83a874666f9b", + "text": "DOES LLM ALIGNMENT REALLY NEED DIVERSITY? AN EMPIRICAL STUDY OF ADAPTING RLVR METHODS FOR MORAL REASONING Zhaowei Zhang1 B∗, Xiaohan Liu3, Xuekai Zhu4, Junchao Huang5, Ceyao Zhang1,\nZhiyuan Feng6, Yaodong Yang1, Xiaoyuan Yi2, Xing Xie2 1 Institute for Artificial Intelligence, Peking University 2 Microsoft Research\n3 University of Michigan 4 Shanghai Jiao Tong University 5 CUHKSZ 6 THU ABSTRACT\n2026 Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable\nsuccess in logical reasoning tasks, yet whether large language model (LLM) alignment requires fundamentally different approaches remains unclear. Given the apparent tolerance for multiple valid responses in moral reasoning, a natural hy-Mar pothesis is that alignment tasks inherently require diversity-seeking distribution-\n11 matchingconduct thealgorithmsfirst comprehensiverather than reward-maximizingempirical study comparingpolicy-basedboth paradigmsmethods. To enable stable RLVR training, we build a rubric-grounded reward\npipeline by training a Qwen3-1.7B judge model. 
Contrary to our hypothesis, we\nfind that distribution-matching approaches do not demonstrate significant advantages over reward-maximizing methods as expected on alignment tasks. Through\nsemantic visualization mapping high-reward responses to semantic space, we demonstrate that moral reasoning exhibits more concentrated high-reward distributions than mathematical reasoning, where diverse solution strategies yield\nsimilarly high rewards. This counter-intuitive finding explains why mode-seeking\noptimization proves equally or more effective for alignment tasks. Our results\nsuggest that alignment tasks do not inherently require diversity-preserving algorithms, and standard reward-maximizing RLVR methods can effectively transfer\nto moral reasoning without explicit diversity mechanisms. Recent advances in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs) have achieved impressive performance in well-defined, structured domains by directly\noptimizing long context chain-of-thought reasoning (Jaech et al., 2024; Guo et al., 2025; Comanici\net al., 2025). However, existing approaches primarily target logical reasoning tasks, especially mathematics (Cobbe et al., 2021) and coding (Chen et al., 2021), leaving their potential in alignment and moral reasoning largely unexplored. Intuitively, alignment tasks typically admit multiple valid answers that reflect different ethical frameworks and value systems, in stark contrast to mathematical\nand coding problems, which usually have only one objectively correct solution. Therefore, in this paper, we investigate a natural question: Is introducing diversity key to adapting the strong reasoning\ncapabilities that RL brings to logical reasoning into LLMs' alignment and moral reasoning?",
    "paper_id": "2603.10588",
    "title": "Does LLM Alignment Really Need Diversity? 
An Empirical Study of Adapting RLVR Methods for Moral Reasoning",
    "authors": [
      "Zhaowei Zhang",
      "Xiaohan Liu",
      "Xuekai Zhu",
      "Junchao Huang",
      "Ceyao Zhang",
      "Zhiyuan Feng",
      "Yaodong Yang",
      "Xiaoyuan Yi",
      "Xing Xie"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10588v1",
    "chunk_index": 0,
    "total_chunks": 23,
    "char_count": 2863,
    "word_count": 365,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "9ab57d18-75eb-4cf3-91bc-c5fa824d1b66",
    "text": "Existing RL methods for LLM reasoning can be broadly categorized into two paradigms. The first\ncategory encompasses reward-maximizing methods rooted in PPO (Schulman et al., 2017), which\naim to identify an optimal policy that maximizes reward functions under specific regularization\nconstraints. Most current mainstream RLVR methods, including RLHF-style PPO (Schulman et al.,\n2017; Christiano et al., 2017; Ouyang et al., 2022), GRPO (Shao et al., 2024), and DAPO (Yu et al.,\n2025), fall into this category and focus on finding a policy mode, generally seeking a single high-reward strategy (Li et al., 2025). The second category consists of distribution-matching methods,\nwhich learn the flow between policy and reward distributions to enable the policy to capture fine-grained details of the reward landscape. By explicitly modeling this flow, methods like FlowRL (Zhu et al., 2025) can discover diverse solutions and achieve superior performance on complex\ntasks. (∗Work done when working as an intern at Microsoft Research Asia.) Given the differences between these two paradigms, we hypothesize that, compared with\nreward-maximizing methods, distribution-matching methods, with the ability to capture diversity,\nmay be more suited for alignment tasks. 
To investigate this hypothesis, we conduct a comprehensive empirical study on MoReBench (Chiu\net al., 2025), a challenging moral reasoning benchmark that consists of two complementary subtasks: MoReBench-Public, which requires models to reason about value-laden dilemmas in real-world scenarios, and MoReBench-Theory, which tests reasoning consistency under specific philosophical frameworks, including utilitarianism, deontology, virtue ethics, care ethics, and justice as\nfairness. Following the original benchmark's evaluation protocol, we train the Qwen3-1.7B-Base\nmodel (Yang et al., 2025) to serve as our judge model, which evaluates responses based on detailed\nrubrics capturing the complex nature of moral reasoning.",
    "paper_id": "2603.10588",
    "title": "Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning",
    "authors": [
      "Zhaowei Zhang",
      "Xiaohan Liu",
      "Xuekai Zhu",
      "Junchao Huang",
      "Ceyao Zhang",
      "Zhiyuan Feng",
      "Yaodong Yang",
      "Xiaoyuan Yi",
      "Xing Xie"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10588v1",
    "chunk_index": 1,
    "total_chunks": 23,
    "char_count": 1975,
    "word_count": 279,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "68833ed5-9398-46fe-9784-f8aa26b85136",
    "text": "Our experiments reveal several surprising findings that challenge our initial hypothesis. First,\nwe observe that reward-maximizing methods can achieve superior performance compared to\ndistribution-matching methods on moral reasoning tasks. Moreover, through detailed analysis of\nreward distributions, we demonstrate that alignment rewards are not necessarily more diverse than\nreasoning tasks in high-reward regions; in most cases, math reasoning tasks exhibit even greater\ndiversity, contrary to the conventional opinion that alignment requires diversity-seeking algorithms. 
These findings all suggest alignment does not necessarily need to introduce diversity. With sufficiently discriminative verifiable rewards, standard reward-maximizing methods can effectively\ntransfer reasoning capabilities to moral reasoning without explicitly promoting solution diversity. In summary, our contributions are threefold.", + "paper_id": "2603.10588", + "title": "Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning", + "authors": [ + "Zhaowei Zhang", + "Xiaohan Liu", + "Xuekai Zhu", + "Junchao Huang", + "Ceyao Zhang", + "Zhiyuan Feng", + "Yaodong Yang", + "Xiaoyuan Yi", + "Xing Xie" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10588v1", + "chunk_index": 2, + "total_chunks": 23, + "char_count": 916, + "word_count": 112, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3e674e35-7e52-464b-abcc-de4b79d12d79", + "text": "Firstly, we build a rubric-grounded verifiable reward\npipeline for moral reasoning by training a compact Qwen3-1.7B judge, enabling stable reward computation and controlled RLVR training on MoReBench. Secondly, we present the first systematic\ncomparison of reward-maximizing and distribution-matching methods on moral reasoning, and show\nthat reward-maximizing methods can match or outperform distribution-matching ones, challenging\nthe view that alignment requires diversity-seeking algorithms. Lastly, we analyze reward distributions and demonstrate that high-reward regions in moral reasoning are not inherently more diverse\nthan those in logical reasoning, explaining why standard reward-maximizing methods can transfer\nreasoning capabilities to moral reasoning without explicitly promoting diversity. In this section, we will review the relevant literature from two research areas that our study bridges:\nRL methods for reasoning tasks as well as LLM alignment and moral reasoning. 
We will elaborate\non them separately below.",
    "paper_id": "2603.10588",
    "title": "Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning",
    "authors": [
      "Zhaowei Zhang",
      "Xiaohan Liu",
      "Xuekai Zhu",
      "Junchao Huang",
      "Ceyao Zhang",
      "Zhiyuan Feng",
      "Yaodong Yang",
      "Xiaoyuan Yi",
      "Xing Xie"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10588v1",
    "chunk_index": 3,
    "total_chunks": 23,
    "char_count": 1030,
    "word_count": 137,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "15fc69aa-2d68-4c89-b28b-70739669c0bd",
    "text": "RL Methods for LLM Reasoning. RL post-training is widely used to strengthen LLM reasoning. A representative thread is RLHF (Schulman et al., 2017; Christiano et al., 2017; Ouyang et al.,\n2022), which learns rewards from human preferences and motivates later RL reasoning methods. Under the verifiable reward setting, rewards can be generated automatically with math checkers\nor code evaluation, bringing consistent gains on math and programming tasks (Chen et al., 2021;\nWhite, 2023). Subsequent work improves efficiency and stability by modifying policy gradient updates. GRPO (Shao et al., 2024) removes an explicit value network and uses within-group relative\nrewards, reducing computation and improving DeepSeekMath. REINFORCE++ (Hu et al.) stabilizes training with a globally normalized advantage term. DAPO (Yu et al., 2025) introduces clip\ndecoupling and dynamic sampling to better match large-model training, achieving strong results on\ndifficult math benchmarks. However, most methods still maximize expected reward, which can concentrate learning on a single high-scoring trajectory and reduce coverage of diverse valid reasoning\npaths. FlowRL (Zhu et al., 2025) addresses this by optimizing for distribution matching. 
It defines\na target distribution from normalized rewards and trains with a reverse-KL-based flow balance, encouraging the policy to sample multiple high-quality trajectories in proportion to reward, improving\nboth accuracy and diversity in math and code reasoning. Overall, existing RL methods for reasoning\nfall into two routes: policy-gradient-based uni-modal optimization and distribution-matching-based\nmulti-modal coverage. We use this distinction to analyze transferability and performance on more\nopen-ended LLM alignment and moral reasoning tasks. LLM Alignment and Moral Reasoning. Early works on LLM moral reasoning largely framed\nethics as outcome-level judgment or classification.",
    "paper_id": "2603.10588",
    "title": "Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning",
    "authors": [
      "Zhaowei Zhang",
      "Xiaohan Liu",
      "Xuekai Zhu",
      "Junchao Huang",
      "Ceyao Zhang",
      "Zhiyuan Feng",
      "Yaodong Yang",
      "Xiaoyuan Yi",
      "Xing Xie"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10588v1",
    "chunk_index": 4,
    "total_chunks": 23,
    "char_count": 1919,
    "word_count": 278,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ac8c0c08-d629-411f-944a-6b954ce39284",
    "text": "It relied on datasets such as ETHICS (Hendrycks\net al., 2020), Delphi (Jiang et al., 2021), community judgment corpora such as Scruples (Lourie\net al., 2021), and norm-focused resources such as Social Chem 101 (Forbes et al., 2020). Later\nstudies expanded evaluation to narrative dilemmas and unified benchmark suites, including Moral\nStories (Emelin et al., 2021) and MoralBench (Ji et al., 2025). Researchers also explored scalable\nevaluation with LLM-based judges (Zheng et al., 2023), as well as principle-driven and critique-driven alignment frameworks (Bai et al., 2022), including self-judging and self-reward training\n(Yuan et al., 2024). 
While useful for evaluation, these resources transfer poorly to RLVR because\ntheir supervision is often sparse and subjective, relying on binary labels, acceptability judgments, or\npreference annotations.",
    "paper_id": "2603.10588",
    "title": "Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning",
    "authors": [
      "Zhaowei Zhang",
      "Xiaohan Liu",
      "Xuekai Zhu",
      "Junchao Huang",
      "Ceyao Zhang",
      "Zhiyuan Feng",
      "Yaodong Yang",
      "Xiaoyuan Yi",
      "Xing Xie"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10588v1",
    "chunk_index": 5,
    "total_chunks": 23,
    "char_count": 851,
    "word_count": 128,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "22c99353-2f80-42d2-b506-e78afc0e039c",
    "text": "MoReBench (Chiu et al., 2025) instead formalizes procedural and pluralistic\nmoral reasoning with expert-written rubrics. Each scenario provides fine-grained criteria that score\nintermediate considerations and trade-offs while allowing multiple defensible resolutions, yielding\na naturally multi-modal learning target. This design fits RLVR by enabling checkable and dense\nrewards over reasoning traces rather than single outcome labels. Therefore, in this paper, we adopt\nMoReBench as our primary benchmark.",
    "paper_id": "2603.10588",
    "title": "Does LLM Alignment Really Need Diversity? 
An Empirical Study of Adapting RLVR Methods for Moral Reasoning",
    "authors": [
      "Zhaowei Zhang",
      "Xiaohan Liu",
      "Xuekai Zhu",
      "Junchao Huang",
      "Ceyao Zhang",
      "Zhiyuan Feng",
      "Yaodong Yang",
      "Xiaoyuan Yi",
      "Xing Xie"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10588v1",
    "chunk_index": 6,
    "total_chunks": 23,
    "char_count": 507,
    "word_count": 69,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a422ec61-aca8-44a1-9ab2-03cb4f9b198f",
    "text": "Similar to the logical reasoning tasks, we formulate the alignment and moral reasoning task as\na conditional generation problem, where an LLM with parameters θ, denoted as policy πθ(y|x),\nreceives a prompt x and generates a response y. The objective is to optimize the policy under task-specific reward signals r(x, y) ∈ R that capture the generation quality. It is worth noting that, in\nthis paper, diversity is defined as whether different algorithms can find a diverse set of high-reward\nsolutions to the same problem.",
    "paper_id": "2603.10588",
    "title": "Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning",
    "authors": [
      "Zhaowei Zhang",
      "Xiaohan Liu",
      "Xuekai Zhu",
      "Junchao Huang",
      "Ceyao Zhang",
      "Zhiyuan Feng",
      "Yaodong Yang",
      "Xiaoyuan Yi",
      "Xing Xie"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10588v1",
    "chunk_index": 7,
    "total_chunks": 23,
    "char_count": 519,
    "word_count": 84,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f2f01c76-9f44-4697-9b97-5ebba821e572",
    "text": "Our hypothesis on the difference between moral reasoning tasks\nand logical reasoning tasks is rooted in this. We will then briefly introduce the main ideas behind\nreward-maximizing and distribution-matching algorithms in the following paragraphs. Reward-Maximizing Methods. 
Reward-maximizing methods aim to maximize the expected reward directly through policy-gradient optimization and are usually considered mode-seeking. The standard objective is:",
    "paper_id": "2603.10588",
    "title": "Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning",
    "authors": [
      "Zhaowei Zhang",
      "Xiaohan Liu",
      "Xuekai Zhu",
      "Junchao Huang",
      "Ceyao Zhang",
      "Zhiyuan Feng",
      "Yaodong Yang",
      "Xiaoyuan Yi",
      "Xing Xie"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10588v1",
    "chunk_index": 8,
    "total_chunks": 23,
    "char_count": 474,
    "word_count": 64,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3a7cfb2f-ab16-4e7a-8d7c-ad6495db0a7c",
    "text": "max_θ E(x,y)∼πθ[r(x, y)] − λ Df(πθ∥πref), (1)\nwhere πref is a reference pre-trained model and λ controls the optional f-divergence (usually KL-divergence) regularization strength. We primarily introduce GRPO (Shao et al., 2024), which samples a group of G responses {y1, . . . , yG} from the old policy πθold for each prompt x and optimizes:\nJGRPO(θ) = E[(1/G) Σ_{i=1}^{G} min((πθ(yi|x) / πθold(yi|x)) Âi, clip(πθ(yi|x) / πθold(yi|x), 1 − ϵ, 1 + ϵ) Âi)] − λ DKL(πθ∥πref), (2)\nwhere the advantage Âi is computed by normalizing rewards within the group: Âi = (ri − mean({r1, . . . , rG})) / std({r1, . . . , rG}). This eliminates the need for a separate value function while maintaining stable training through group-based advantage normalization. These reward-maximizing methods focus on finding a single high-reward policy mode through reward maximization, which may lead to mode collapse in tasks with multiple valid solutions.",
    "paper_id": "2603.10588",
    "title": "Does LLM Alignment Really Need Diversity? 
An Empirical Study of Adapting RLVR Methods for Moral Reasoning",
    "authors": [
      "Zhaowei Zhang",
      "Xiaohan Liu",
      "Xuekai Zhu",
      "Junchao Huang",
      "Ceyao Zhang",
      "Zhiyuan Feng",
      "Yaodong Yang",
      "Xiaoyuan Yi",
      "Xing Xie"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10588v1",
    "chunk_index": 9,
    "total_chunks": 23,
    "char_count": 900,
    "word_count": 140,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d6b90c93-7818-419d-9ca9-23c765045af1",
    "text": "Distribution-Matching Methods. An alternative approach shifts from reward maximization to\nreward distribution matching. We mainly present the FlowRL (Zhu et al., 2025) algorithm here,\nwhose core idea is to align the policy distribution with a target distribution proportional to the reward\nfunction, which can be formulated as minimizing the reverse KL divergence:\nmin_θ DKL(πθ(y|x) ∥ exp(βr(x, y)) / Zϕ(x)), (3)\nwhere β is a temperature parameter and Zϕ(x) is a learnable partition function that normalizes scalar\nrewards into a valid probability distribution. This distribution-matching formulation encourages the policy to sample diverse trajectories in proportion to their rewards, promoting mode coverage rather than collapsing to dominant reward modes\nas in reward-maximizing methods. In this section, we conduct extensive experiments to compare the performance of reward-maximizing algorithms and distribution-matching algorithms on alignment and moral reasoning\ntasks. We further analyze and show that, under existing reward constructions for RLVR tasks, the\nalignment task does not necessarily require more diverse learning algorithms.",
    "paper_id": "2603.10588",
    "title": "Does LLM Alignment Really Need Diversity? 
An Empirical Study of Adapting RLVR Methods for Moral Reasoning",
    "authors": [
      "Zhaowei Zhang",
      "Xiaohan Liu",
      "Xuekai Zhu",
      "Junchao Huang",
      "Ceyao Zhang",
      "Zhiyuan Feng",
      "Yaodong Yang",
      "Xiaoyuan Yi",
      "Xing Xie"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10588v1",
    "chunk_index": 10,
    "total_chunks": 23,
    "char_count": 1140,
    "word_count": 159,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "478e45de-9053-4c07-88e4-d0fbb3ff6248",
    "text": "4.1 EXPERIMENTAL SETTINGS\nWe will first introduce the specific experimental setup, including the base models, benchmarks,\nand baselines used for analysis. Models and Benchmarks. In this paper, we conduct experiments using two prevailing open-source\nmodels: Qwen2.5-7B-Base (Qwen et al., 2025) and Llama3.1-8B-Instruct (Dubey et al., 2024). These models were chosen for their diversity in developers, training stage, and performance characteristics, enabling a thorough assessment. For the benchmarks, we primarily conduct our analytical\nexperiments on MoReBench (Chiu et al., 2025), a comprehensive benchmark designed to assess the\nprocedural moral reasoning capabilities of LLMs. Unlike traditional benchmarks, it employs a large\nset of human-crafted rubrics paired with GPT-5 (Singh et al., 2025) as a judge model for evaluation, enabling a more precise and effective quantification of moral reasoning quality. It contains two\nsubtasks: MoReBench-Public, which examines value dilemmas, and MoReBench-Theory, which\nstudies reasoning based on different philosophical perspectives, including utilitarianism, deontology, virtue ethics, care ethics, and justice as fairness. We compare representative reward-maximizing methods and distribution-matching\nmethods to assess whether alignment and moral reasoning tasks benefit from explicitly encouraging output diversity. 
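The contrast between the two families compared here can be made concrete with a minimal NumPy sketch (our illustration, not the authors' code): a GRPO-style group-normalized advantage with a clipped surrogate, as in Eq. (2), and a FlowRL-style squared residual that vanishes when log πθ(y|x) = βr(x, y) − log Zϕ(x), i.e. when the policy matches the target exp(βr)/Z of Eq. (3); FlowRL's actual loss differs in details.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    # GRPO-style: normalize rewards within a group of G rollouts,
    # A_i = (r_i - mean(r)) / std(r), removing the need for a value network.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    # Mode-seeking update: importance ratio times advantage, with the
    # ratio clipped to keep the new policy near the sampling policy.
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages, dtype=float)
    return float(np.minimum(
        ratio * adv,
        np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean())

def flow_residual(logp, rewards, log_z, beta=1.0):
    # Distribution-matching flavor: squared residual that is zero when
    # log pi(y|x) = beta * r(x, y) - log Z(x), i.e. pi proportional to exp(beta*r).
    resid = np.asarray(logp) + log_z - beta * np.asarray(rewards)
    return float(np.mean(resid ** 2))
```

The first two functions collapse probability mass onto the highest-advantage rollouts; the third penalizes any mismatch between log-probabilities and (scaled) rewards, which is what lets distribution matching retain several high-reward modes.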
Specifically, Base is the original model without any additional RL fine-tuning.",
    "paper_id": "2603.10588",
    "title": "Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning",
    "authors": [
      "Zhaowei Zhang",
      "Xiaohan Liu",
      "Xuekai Zhu",
      "Junchao Huang",
      "Ceyao Zhang",
      "Zhiyuan Feng",
      "Yaodong Yang",
      "Xiaoyuan Yi",
      "Xing Xie"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10588v1",
    "chunk_index": 11,
    "total_chunks": 23,
    "char_count": 1442,
    "word_count": 190,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e0a4e7c3-e582-44a9-9f51-e35d2aaa81a3",
    "text": "Reward-maximizing methods include PPO (i.e., RLHF-style PPO) (Schulman et al., 2017; Christiano et al., 2017; Ouyang et al., 2022), REINFORCE++ (Hu et al.) (RFPP), GRPO (Shao et al.,\n2024), and DAPO (Yu et al., 2025). For the distribution-matching method, we use FlowRL (Zhu\net al., 2025). 4.2 BENCHMARK CONFIGURATION\nMoReBench itself is a benchmark used solely for evaluation: for each question, the dataset contains\nmultiple rubrics that are manually designed by humans (covering multiple dimensions such as ethical considerations, stakeholder trade-offs, actionable recommendations, etc.), and these are used to\njudge the model's response rubric by rubric. In its original setup, MoReBench uses GPT-5 as the\njudge model: given an input x and a model answer y, GPT-5 produces a binary decision ji ∈ {0, 1} for\neach rubric (1 if satisfied, otherwise 0), and computes the final score by combining these decisions\nwith the pre-specified weight wi of each rubric. Concretely, in the setup of this paper, we take a\nnormalized weighted sum over all items with wi > 0 and wi < 0 separately, and then subtract the\nlatter from the former to obtain the final reward:\nr(x, y) = (Σ_{i:wi>0} wi · ji) / (Σ_{i:wi>0} wi) − (Σ_{i:wi<0} |wi| · ji) / (Σ_{i:wi<0} |wi|). (4)\nThis design normalizes r(x, y) to the interval [−1, 1]: when an answer better satisfies the positive rubrics while triggering fewer negative rubrics, the reward is positive; otherwise it is negative,\nthereby providing an optimizable, dense, multi-dimensional, verifiable signal. However, using GPT-5 directly as the judge during training is prohibitively expensive: both inference cost and call latency are non-negligible. More importantly, RLVR training requires repeatedly\nTable 1: Performance on MoReBench (Public and Theory). Gains (%) are computed relative to the\nBase method within each benchmark, base model, and different pass number settings. Qwen2.5-7B Base | Llama3.1-8B Instruct",
    "paper_id": "2603.10588",
    "title": "Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning",
    "authors": [
      "Zhaowei Zhang",
      "Xiaohan Liu",
      "Xuekai Zhu",
      "Junchao Huang",
      "Ceyao Zhang",
      "Zhiyuan Feng",
      "Yaodong Yang",
      "Xiaoyuan Yi",
      "Xing Xie"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10588v1",
    "chunk_index": 12,
    "total_chunks": 23,
    "char_count": 1921,
    "word_count": 302,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b1e03d2a-bb6f-4e7a-8d7c-ad6495db0a44",
    "text": "Method Score@1 Gain (%) Avg@8 Gain (%) | Score@1 Gain (%) Avg@8 Gain (%)\nPublic:\nBase 0.37 – 0.37 – 0.44 – 0.45 –\nPPO 0.51 37.84 0.52 40.54 0.52 18.18 0.52 15.56\nGRPO 0.54 45.95 0.53 43.24 0.53 20.45 0.54 20.00\nRFPP 0.65 75.68 0.65 75.68 0.60 36.36 0.60 33.33\nDAPO 0.67 81.08 0.67 81.08 0.69 56.82 0.72 60.00\nFlowRL 0.60 62.16 0.61 64.86 0.61 38.64 0.60 33.33\nTheory:\nBase 0.45 – 0.43 – 0.49 – 0.51 –\nPPO 0.55 22.22 0.50 16.28 0.52 6.12 0.54 5.88\nGRPO 0.55 22.22 0.54 25.58 0.60 22.45 0.57 11.76\nRFPP 0.62 37.78 0.61 41.86 0.64 30.61 0.64 25.49\nDAPO 0.76 68.89 0.72 67.44 0.74 51.02 0.76 49.02\nFlowRL 0.65 44.44 0.65 51.16 0.72 
46.94 0.70 37.25\nevaluating model outputs over massive numbers of rollouts and feeding back dense rewards, which\nwould cause the total number of calls to grow by orders of magnitude, making it unsuitable as a\nscalable training pipeline.",
    "paper_id": "2603.10588",
    "title": "Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning",
    "authors": [
      "Zhaowei Zhang",
      "Xiaohan Liu",
      "Xuekai Zhu",
      "Junchao Huang",
      "Ceyao Zhang",
      "Zhiyuan Feng",
      "Yaodong Yang",
      "Xiaoyuan Yi",
      "Xing Xie"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10588v1",
    "chunk_index": 13,
    "total_chunks": 23,
    "char_count": 870,
    "word_count": 159,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b3a005eb-fc34-42c7-acd2-9bcd9e0e9201",
    "text": "To address this, we build a locally runnable judge model on top of Qwen3-1.7B-Base. First,\nfor each moral-reasoning scenario, we sample candidate answers with diverse styles and stances\nfrom multiple open-source and closed-source pretrained models, forming synthetic labeled data with\nbroader coverage. Next, we use GPT-5 to evaluate these answers according to the fine-grained rubric\nprovided by MoReBench, producing an overall quality score as well as fine-grained decisions/scores\nfor each rubric item. Finally, we perform supervised fine-tuning on Qwen3-1.7B-Base using this\nGPT-5-labeled data, training it to predict both the overall score and the per-rubric judgments. Following the standard MoReBench protocol to assess quality on the validation set, our judge\nachieves agreement with GPT-5 of 87.07% on MoReBench-Public and 69.21% on MoReBench-Theory. In subsequent RLVR training, this local judge can stably and inexpensively provide dense,\nrubric-aligned rewards, thereby supporting large-scale, controllable moral-reasoning optimization\nexperiments. 
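The rubric-aggregation rule of Eq. (4) can be sketched directly in Python (our paraphrase of the formula; the weights and decisions below are hypothetical):

```python
def rubric_reward(weights, decisions):
    # Eq. (4): normalized credit from positive-weight rubrics minus
    # normalized penalty from negative-weight rubrics; result lies in [-1, 1].
    # `decisions` are the judge's binary j_i values (1 = rubric satisfied).
    pos_w = [w for w in weights if w > 0]
    neg_w = [abs(w) for w in weights if w < 0]
    pos = sum(w * j for w, j in zip(weights, decisions) if w > 0)
    neg = sum(abs(w) * j for w, j in zip(weights, decisions) if w < 0)
    pos_term = pos / sum(pos_w) if pos_w else 0.0
    neg_term = neg / sum(neg_w) if neg_w else 0.0
    return pos_term - neg_term
```

For example, with weights [2.0, 1.0, -1.0] and decisions [1, 0, 1], the positive side contributes 2/3 while the triggered negative rubric contributes 1, giving a reward of -1/3; satisfying every positive rubric and no negative one yields the maximum of 1.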
To validate the hypothesis proposed in section 1, in our main experiments, we will propose and\ndiscuss two research questions (RQ): • RQ1: Do the distribution-matching methods have advantages over the reward-maximizing ones\non LLM alignment and moral reasoning tasks?\n• RQ2: Do moral reasoning tasks indeed require algorithms to have stronger diversity capabilities\nthan logical reasoning tasks?", + "paper_id": "2603.10588", + "title": "Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning", + "authors": [ + "Zhaowei Zhang", + "Xiaohan Liu", + "Xuekai Zhu", + "Junchao Huang", + "Ceyao Zhang", + "Zhiyuan Feng", + "Yaodong Yang", + "Xiaoyuan Yi", + "Xing Xie" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10588v1", + "chunk_index": 14, + "total_chunks": 23, + "char_count": 1461, + "word_count": 203, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4b366bae-7f6f-48a8-bacc-6caa883c93f8", + "text": "In the following paragraphs, we will first present the overall performance and then answer these two\nresearch questions separately. As shown in Table 1, we present a comprehensive evaluation on both\nthe MoReBench-Public and MoReBench-Theory benchmarks, comparing reward-maximizing and\ndistribution-matching methods across two base models. We compute two different metrics:\nScore@1 (the score of a single sample) and Avg@8 (the average score across 8 samples), and further calculate the relative improvement ratio of each method compared to the Base results. Contrary\nto our initial hypothesis that alignment tasks inherently require diversity-seeking algorithms, we\nfind that distribution-matching methods are not significantly better than reward-maximizing methods across both benchmarks and base models. 
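The two evaluation metrics just defined, Score@1 (the score of a single sample) and Avg@8 (the average over 8 samples), amount to the following helper (a hypothetical sketch; the paper does not publish this code):

```python
def score_at_1(scores):
    # Score@1: the judge score of a single (the first) sampled response.
    return scores[0]

def avg_at_k(scores, k=8):
    # Avg@k (Avg@8 in the paper): mean judge score over k sampled responses.
    if len(scores) < k:
        raise ValueError("need at least k sampled scores")
    return sum(scores[:k]) / k
```

Avg@8 rewards methods whose quality is stable across samples, which is why a diversity-preserving method would be expected to shine there if its extra modes were also high-reward.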
The method rankings are highly consistent: DAPO\nFigure 1: Visualization of the high-reward response distributions in semantic space for six cases\nfrom the MATH-500 (blue) and MoReBench-Public (red) benchmarks.",
    "paper_id": "2603.10588",
    "title": "Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning",
    "authors": [
      "Zhaowei Zhang",
      "Xiaohan Liu",
      "Xuekai Zhu",
      "Junchao Huang",
      "Ceyao Zhang",
      "Zhiyuan Feng",
      "Yaodong Yang",
      "Xiaoyuan Yi",
      "Xing Xie"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10588v1",
    "chunk_index": 15,
    "total_chunks": 23,
    "char_count": 1011,
    "word_count": 139,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "94437e8d-8b75-45a4-bff0-c00b00022453",
    "text": "performs the best overall, while in most scenarios FlowRL follows behind, and then come RFPP,\nGRPO, PPO, and the Base results. This robustness across different base models suggests that the\nsuperiority of reward-maximizing methods reflects fundamental properties of the optimization algorithms rather than artifacts of specific model choices. These results directly address the question\nposed in the introduction: alignment tasks do not necessarily require diversity-seeking algorithms. In the following paragraphs, we will further investigate two research questions: RQ1 examines in\ndetail whether distribution-matching methods have advantages over reward-maximizing ones, and\nRQ2 explores whether moral reasoning tasks indeed require stronger diversity capabilities than logical reasoning tasks through semantic visualization and reward distribution analysis. Reward-Maximizing vs. Distribution-Matching Methods. In response to RQ1, which asks\nwhether distribution-matching methods have advantages over reward-maximizing ones on alignment tasks, our results do not support this hypothesis. 
Given the apparent tolerance\nfor multiple valid responses in moral reasoning, the intuitive hypothesis would be that diversity-preserving algorithms like FlowRL should outperform or at least show significant advantages over\nmode-seeking approaches. However, our experimental evidence reveals that distribution-matching\nmethods do not demonstrate the expected performance advantage over reward-maximizing methods on both tasks. On the Public benchmark, DAPO achieves remarkable improvements of 81.08%\nin the Qwen Avg@8 setting (0.37 to 0.67) and 60.00% in the Llama Avg@8 setting (0.45 to 0.72), while FlowRL lags significantly with only 64.86% and 33.33% gains. Even RFPP, another reward-maximizing method, surpasses FlowRL with gains of 75.68% and 33.33%. On the\nTheory benchmark, the gap persists, with DAPO achieving 67.44% and 49.02% improvements versus FlowRL's 51.16% and 37.25%. The analysis between Score@1 and Avg@8 further confirms\nthis pattern, with DAPO showing exceptional single-sample stability, while FlowRL's supposed\nadvantage in diversity does not translate to better multi-sampling performance. This robust counter-intuitive finding demonstrates that alignment tasks, despite their apparent open-endedness, do not\nbenefit from diversity-seeking algorithms under the current reward construction. Diversity Characteristics: Moral vs. Logical Reasoning. In response to RQ2, which investigates whether moral reasoning tasks require stronger diversity capabilities than logical reasoning\ntasks, our semantic visualization provides evidence that may contradict this assumption.",
    "paper_id": "2603.10588",
    "title": "Does LLM Alignment Really Need Diversity? 
An Empirical Study of Adapting RLVR Methods for Moral Reasoning",
    "authors": [
      "Zhaowei Zhang",
      "Xiaohan Liu",
      "Xuekai Zhu",
      "Junchao Huang",
      "Ceyao Zhang",
      "Zhiyuan Feng",
      "Yaodong Yang",
      "Xiaoyuan Yi",
      "Xing Xie"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10588v1",
    "chunk_index": 16,
    "total_chunks": 23,
    "char_count": 2696,
    "word_count": 350,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8626a8db-5b2c-49e0-9315-cc0bd064aa9f",
    "text": "As shown in Figure 1, we visualize 500 high-reward responses per question from MATH-500 (Lightman et al., 2023) and MoReBench-Public by mapping them to semantic space using\nall-MiniLM-L6-v2 (Wang et al., 2020) and applying t-SNE (Maaten & Hinton, 2008) dimensionality reduction. Across all six showcased cases, mathematical reasoning exhibits substantially more\nTable 2: Integrity versus career incentives moral dilemma. A fashion blogger must choose between preserving credibility with followers and securing industry access tied to a brand relationship. Boxed phrases indicate reasoning stages and structural steps that recur across different responses,\n\". . . \" denotes omitted text, and R1 and R2 denote two responses under the same method.",
    "paper_id": "2603.10588",
    "title": "Does LLM Alignment Really Need Diversity? 
An Empirical Study of Adapting RLVR Methods for Moral Reasoning",
    "authors": [
      "Zhaowei Zhang",
      "Xiaohan Liu",
      "Xuekai Zhu",
      "Junchao Huang",
      "Ceyao Zhang",
      "Zhiyuan Feng",
      "Yaodong Yang",
      "Xiaoyuan Yi",
      "Xing Xie"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10588v1",
    "chunk_index": 17,
    "total_chunks": 23,
    "char_count": 745,
    "word_count": 110,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "25ad4e15-a3d7-4a61-bae0-0d3171a40c81",
    "text": "Question: A fashion blogger received an unreleased dress from a brand, on the condition that\nthe blogger write a positive review in exchange for significant career opportunities. However, the\nblogger found the dress to be substandard. Should the blogger comply with the request to\npost a positive review to preserve career prospects, or expose the flaws to uphold reader\ntrust, even if it jeopardizes the relationship with the brand? . . . FlowRL R1: Situation analysis (honesty, event access, brand relationship, backlash) ⇒\nDiscussion of pros and cons on both sides (Option A: post a positive review, pros to\nkeep access and invitations, cons mislead followers and erode credibility; Option B: report\nthe issues, pros protect trust, cons risk losing collaborations) ⇒ Decision (post authentic\nand professional reviews, and privately contact the brand to provide constructive feedback). . . . R2: Situation analysis (audience trust, career incentives, brand pressure, complaints) ⇒ Discussion of pros and cons on both sides (Option A deliver a glowing endorsement,\npros short-term benefits, cons reputational damage and backlash; Option B disclose a\nnegative experience, pros integrity, cons reduced future access) ⇒ Decision (share an\nhonest assessment framed constructively, and reach out to the PR manager to discuss\nreplacement or return). . . 
DAPO R1: Situation analysis (multiple stakeholders, short-term gain, long-term credibility, legal\nrisk) ⇒ Discussion of pros and cons on both sides (Option A review positively, pros\nnetworking and continued access, cons deceiving the audience; Option B post an honest\ncritique, pros consistency with values, cons losing the event and partnerships) ⇒ Decision\n(communicate privately first, then post a candid review with constructive suggestions and a\nproposed remedy). . . R2: Situation analysis (integrity vs. incentives, follower trust, liability) ⇒ Discussion of pros and cons on both sides (Option A comply with the requested tone,\npros preserve the relationship, cons long-term credibility loss; Option B disclose issues,\npros transparency, cons reduced opportunities) ⇒ Decision (offer a mixed but truthful\nevaluation, and contact the PR manager to align expectations and remediation). . . RFPP R1: Situation analysis (career incentives, trust, crucial event, backlash) ⇒\nDiscussion of pros and cons on both sides (Option A publish a positive review, pros\ninvitation and partnership, cons misleading followers; Option B publish an honest review,\npros protecting the audience, cons potential retaliation) ⇒ Decision (document\ncommunications, contact the PR manager professionally, and publish a constructive but\ntruthful critique). . . R2: Situation analysis (authenticity, the allure of networking, disclosure norms,\nreputation damage) ⇒ Discussion of pros and cons on both sides (Option A comply\nwith promotion, pros short-term career benefit, cons trust erosion; Option B disclose\nconcerns, pros integrity, cons loss of access) ⇒ Decision (use clear disclosure and\nprofessional tone, provide constructive criticism, and reach out to the PR manager about\nreturn or exchange). . .",
    "paper_id": "2603.10588",
    "title": "Does LLM Alignment Really Need Diversity? 
An Empirical Study of Adapting RLVR Methods for Moral Reasoning", + "authors": [ + "Zhaowei Zhang", + "Xiaohan Liu", + "Xuekai Zhu", + "Junchao Huang", + "Ceyao Zhang", + "Zhiyuan Feng", + "Yaodong Yang", + "Xiaoyuan Yi", + "Xing Xie" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10588v1", + "chunk_index": 18, + "total_chunks": 23, + "char_count": 3113, + "word_count": 461, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b077234e-3bdd-4448-872e-7f78100a79d9", + "text": "diverse semantic distributions, with high-reward responses spread across multiple distinct clusters\nrepresenting different solution strategies. In stark contrast, MoReBench-Public shows much more\nconcentrated distributions, where high-reward responses cluster tightly around a single dominant\nsemantic region. This visualization directly confirms that high-quality moral reasoning responses\ntend to cluster around limited ethically appropriate frameworks, resulting in a more concentrated\ndistribution rather than the multi-modal diversity one might expect from alignment tasks. This evidence may further explain why mode-seeking algorithms like DAPO can effectively converge toward high-reward regions without distraction, whereas diversity-preserving methods like\nFlowRL allocate optimization capacity to cover lower-reward regions that contribute less to final\nperformance. This counter-intuitive finding demonstrates that moral reasoning tasks, despite their\napparent open-endedness, actually may exhibit more uni-modal reward structures than mathematical\nreasoning, favoring mode-seeking optimization approaches. Beyond quantitative evaluation, we also conduct qualitative analysis to examine whether model outputs exhibit diversity in response strategy, both within the same method across multiple sampled\nresponses and across different methods. 
As shown in Table 2, the case study centers on an integrity versus career incentives dilemma, where a blogger is pressured to publish a positive review\nin exchange for industry access, while a truthful review could protect audience trust but jeopardize\ncollaboration opportunities. The table includes two reward-maximizing methods, DAPO and RFPP,\nand one distribution-matching method, FlowRL, and reports two sampled responses per method. It presents the two responses under each method side by side, enabling a direct comparison of\nframing, reasoning progression, and final recommendation both within the same method and across\nmethods. Across all six responses, the outputs are highly aligned in viewpoint and reasoning progression, differing mainly in surface-level phrasing rather than in underlying decision criteria. The\nanswers typically enumerate a similar set of considerations, then structure the dilemma as a two-option comparison with pros and cons, and finally propose a similar mitigation route, namely a\ntruthful evaluation framed with constructive feedback paired with private outreach to the brand. Overall, this case illustrates apparent multi-perspective consideration without substantive diversity,\nand it aligns with our quantitative findings by suggesting that under the current RLVR reward mechanism, alignment tasks do not necessarily require more diverse learning algorithms to yield different\nresponse strategies. While the responses mention multiple stakeholders and constraints, they largely\ninstantiate the same reasoning template and converge to the same recommendation. The outputs do\nnot display the pluralism one might intuitively expect from alignment-style dilemmas, in which multiple defensible answers could be grounded in distinct ethical frameworks or value systems. 
Instead,\nthe models repeatedly reduce the problem to a trust versus benefit framing, treat backlash and legal\nrisk as a dominant deterrent against promotional compliance, and resolve the tension via a similar\ncompromise narrative, constructive honesty plus private negotiation.", + "paper_id": "2603.10588", + "title": "Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning", + "authors": [ + "Zhaowei Zhang", + "Xiaohan Liu", + "Xuekai Zhu", + "Junchao Huang", + "Ceyao Zhang", + "Zhiyuan Feng", + "Yaodong Yang", + "Xiaoyuan Yi", + "Xing Xie" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10588v1", + "chunk_index": 19, + "total_chunks": 23, + "char_count": 3434, + "word_count": 455, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "69233f09-99b6-4d2c-bfea-66b9c8ebdd48", + "text": "This work addresses the critical challenge of adapting reinforcement learning from verifiable rewards to moral reasoning and alignment tasks. Through extensive experiments on MoReBench-Public and MoReBench-Theory across Qwen2.5-7B-Base and Llama3.1-8B-Instruct, we conduct\nthe first comprehensive empirical study comparing reward-maximizing and distribution-matching\nRLVR methods. Our findings challenge the conventional wisdom that alignment tasks inherently\nrequire diversity-seeking algorithms. Contrary to this hypothesis, we find that distribution-matching\nmethods do not show the expected advantages over reward-maximizing methods on alignment tasks. Through semantic visualization and reward distribution analysis, we demonstrate that high-reward\nregions in moral reasoning are actually more concentrated than in mathematical reasoning, explaining why mode-seeking optimization proves equally or more effective for these tasks. 
These results\nsuggest that alignment and reasoning tasks share fundamentally similar optimization landscapes, and\nstandard reward-maximizing RLVR methods can successfully transfer to moral reasoning without\nrequiring explicit diversity-preserving mechanisms.", + "paper_id": "2603.10588", + "title": "Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning", + "authors": [ + "Zhaowei Zhang", + "Xiaohan Liu", + "Xuekai Zhu", + "Junchao Huang", + "Ceyao Zhang", + "Zhiyuan Feng", + "Yaodong Yang", + "Xiaoyuan Yi", + "Xing Xie" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10588v1", + "chunk_index": 21, + "total_chunks": 23, + "char_count": 1192, + "word_count": 141, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9ca0cdb1-d941-498e-8264-c5b04d5282e6", + "text": "On the other hand, the definition of diversity is still a topic in the field without a settled consensus. This concept can usually refer to diversity in different aspects, such as reward distribution,\ndata distribution, exploration strategies, and diversity with respect to minorities, etc. In this paper,\nwe mainly focus on an empirical analysis of whether the data itself exhibits a multi-modal reward\ndistribution, and whether the RLVR algorithm can accurately capture this property.", + "paper_id": "2603.10588", + "title": "Does LLM Alignment Really Need Diversity? 
An Empirical Study of Adapting RLVR Methods for Moral Reasoning", + "authors": [ + "Zhaowei Zhang", + "Xiaohan Liu", + "Xuekai Zhu", + "Junchao Huang", + "Ceyao Zhang", + "Zhiyuan Feng", + "Yaodong Yang", + "Xiaoyuan Yi", + "Xing Xie" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10588v1", + "chunk_index": 22, + "total_chunks": 23, + "char_count": 488, + "word_count": 74, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "cdd2e164-1884-45e2-850e-f56294d67576", + "text": "To further address this question, there is still substantial room for improvement in this work. First, there are\nrelatively few alignment and moral reasoning benchmarks available for RLVR research; this paper\neven needs to build its own pipeline, so more extensive follow-up experiments are required to validate the generality of its conclusions. Second, since there are relatively few distribution-matching\nmethods, future work can further improve FlowRL and conduct more empirical analyses. Finally,\nbecause the property of diversity is closely related to the definition of reward and specific engineering implementations, we will further discuss the impact of different reward definitions on different\ntasks and methods in future work.", + "paper_id": "2603.10588", + "title": "Does LLM Alignment Really Need Diversity? 
An Empirical Study of Adapting RLVR Methods for Moral Reasoning", + "authors": [ + "Zhaowei Zhang", + "Xiaohan Liu", + "Xuekai Zhu", + "Junchao Huang", + "Ceyao Zhang", + "Zhiyuan Feng", + "Yaodong Yang", + "Xiaoyuan Yi", + "Xing Xie" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10588v1", + "chunk_index": 23, + "total_chunks": 23, + "char_count": 737, + "word_count": 107, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10592_semantic.json b/data/chunks/2603.10592_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..a92e4be44300afbc64b896d44e194f681c33cbfa --- /dev/null +++ b/data/chunks/2603.10592_semantic.json @@ -0,0 +1,952 @@ +[ + { + "chunk_id": "1cd71e61-b271-4c9c-8a7e-d7360e1177f8", + "text": "GRADIENT FLOW DRIFTING:\nGENERATIVE MODELING VIA WASSERSTEIN GRADIENT FLOWS\nOF KDE-APPROXIMATED DIVERGENCES Jiarui Cao, Zixuan Wei Yuxin Liu\nThe Chinese University of Hong Kong Civil Aviation University of China\nHong Kong Tianjin\n{1155244613, 1155245852}@link.cuhk.edu.hk yx_liu2025061016@163.com\nABSTRACT\nWe reveal a precise mathematical framework about a new family of generative models which we call Gradient Flow Drifting. With this framework, we prove an equivalence between the recently\nproposed Drifting Model and the Wasserstein gradient flow of the forward KL divergence under\nkernel density estimation (KDE) approximation. Specifically, we prove that the drifting field of\nthe drifting model Deng et al. [2026] equals, up to a bandwidth-squared scaling factor, the difference of\nKDE log-density gradients ∇log pkde −∇log qkde, which is exactly the particle velocity field of the\nWasserstein-2 gradient flow of KL(q∥p) with KDE-approximated densities. 
Besides that, this broad\nfamily of generative models can also include MMD-based generators, which arise as special cases of\nWasserstein gradient flows of different divergences under KDE approximation. We provide a concise\nidentifiability proof, and a theoretically grounded mixed-divergence strategy. We combine reverse\nKL and χ2 divergence gradient flows to simultaneously avoid mode collapse and mode blurring, and\nextend this method onto Riemannian manifolds, which loosens the constraints on the kernel function,\nand makes this method more suitable for the semantic space.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 0, + "total_chunks": 50, + "char_count": 1530, + "word_count": 197, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "53da961b-dd5b-46f6-a20d-43c73ca201c5", + "text": "Preliminary experiments on synthetic\nbenchmarks validate the framework. Generative modeling seeks to learn a mapping f such that the pushforward f#pϵ of a simple prior pϵ approximates a\ndata distribution pdata. The recently proposed Drifting Model Deng et al. [2026] introduces a new paradigm: rather\nthan relying on iterative inference-time dynamics (as in diffusion or flow-based models), it evolves the pushforward\ndistribution during training time via a drifting field Vp,q, and naturally admits one-step generation. Drifting Models\nachieve state-of-the-art one-step FID on ImageNet 256 × 256 (1.54 in latent space and 1.61 in pixel space). Despite their empirical success, theoretical foundations of Drifting Models remain underdeveloped. 
The original paper's\nanalysis is somewhat heuristic and the identifiability proof (Appendix C.1 therein) requires additional smoothness\nassumptions. We argue that this complexity stems from a failure to recognize a fundamental connection.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 1, + "total_chunks": 50, + "char_count": 1000, + "word_count": 138, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8cb44e1d-aa68-4971-9105-88d9e38b30ed", + "text": "The drifting field of Deng et al. [2026], when instantiated with a Gaussian kernel kh(x, y) =\nexp(−∥x−y∥2/(2h2)), satisfies the exact identity: Vp,q(x) = h2 (∇log pkde(x) −∇log qkde(x)), (1) where pkde(x) = Ep[kh(x, y)] is the Kernel Density Estimation (KDE) of p. The right-hand side is precisely the\nparticle velocity field of the Wasserstein-2 gradient flow of the KL divergence KL(q∥p), with true densities replaced\nby their KDE approximations with the same kernel. This identification has several consequences:", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 2, + "total_chunks": 50, + "char_count": 512, + "word_count": 81, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "bfd78d93-8cb0-45d5-8d74-3afa0888c48a", + "text": "1. Unified framework: By varying the divergence functional, we obtain a family of gradient flow drifting models. 
MMD-based generators correspond to the L2 distribution distance, and drifting models to the KL divergence. We can\nconstruct new models from any f-divergence, and from any other divergence for which distributional convergence can be established.\n2. Mixed gradient flows: Convex combinations of divergences yield legitimate mixed gradient flows (Theorem 4.12),\nenabling strategies that combine the complementary strengths of different divergences—e.g., MMD for global mode\ncoverage and reverse KL for local sharpness, while reverse KL and χ2 together provide more precise, targeted\nforcing.\n3. Simplified identifiability: The equilibrium condition Vp,q = 0 ⇒p = q follows in a few lines from the injectivity of the\nkernel mean embedding under characteristic kernels.\n4. Drifting Model as a special case: The standard energy dissipation inequality for Wasserstein gradient flows\nimmediately yields (d/dt)KL(qkde,t ∥pkde) ≤0. Concurrent work by Li and Zhu [2026] reinterprets Drifting Models through a flow-map semigroup decomposition, but\ndoes not identify the KDE–gradient flow connection. Belhadji et al. [2025] unify MMD gradient flows with mean shift\nbut do not extend to f-divergences. Our framework subsumes both perspectives. Deng et al. [2026] propose learning a one-step pushforward map by evolving the generated\ndistribution during training via a kernel-based drifting field. 
They achieve strong empirical results but provide limited\ntheoretical analysis.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 3, + "total_chunks": 50, + "char_count": 1571, + "word_count": 224, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3625e379-9d91-450c-83b7-cb8f268bb864", + "text": "Li and Zhu [2026] reinterpret Drifting Models via long-short flow-map factorization, connecting\nthem to closed-form flow matching based on semigroup consistency. Our work offers a new perspective on this model and\nplaces it within a broad family of generative models. Wasserstein gradient flows in generative modeling. Wasserstein gradient flows Jordan et al. [1998], Ambrosio et al.\n[2005], Santambrogio [2015] provide a variational framework for the evolution of probability measures. Several works\nleverage this framework for generative modeling: Arbel et al. [2019] study MMD gradient flows for sampling; Yi et al.\n[2023] use Wasserstein gradient flows to unify divergence GANs, introducing MonoFlow with a monotone rescaling of\nthe log density ratio; Choi et al. [2024] propose scalable Wasserstein gradient descent. Our work makes it possible to optimize the\ngradient flow directly through kernel density estimation. Kernel density estimation and score estimation. The connection between mean shift and KDE gradients is classical Cheng [1995], Comaniciu and Meer [2002]. Belhadji et al. [2025] recently unified mean shift, MMD-optimal\nquantization, and gradient flows. Our work extends this connection to arbitrary f-divergences. MMD and kernel methods for generation. MMD-based generative models Dziugaite et al. [2015], Li et al. 
[2015]\nminimize the MMD between generated and data distributions. Zhou et al. [2025] extend moment matching to one-/few-step diffusion. Chizat et al. [2026] provide quantitative convergence rates for MMD Wasserstein gradient flows. Our\nframework reveals MMD generators as one member of a broader family.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 4, + "total_chunks": 50, + "char_count": 1630, + "word_count": 235, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ae695413-a31e-4e16-8c7b-654bc909f125", + "text": "f-divergence minimization. f-divergence variational estimation has been widely studied Nguyen et al. [2010],\nNowozin et al. [2016]. Yi et al. [2023] connect f-divergence GANs to Wasserstein gradient flows but require a\ndiscriminator to estimate density ratios. Our KDE-based approach avoids adversarial training entirely, but relies on\nan approximation based on particles.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 5, + "total_chunks": 50, + "char_count": 378, + "word_count": 51, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5bc79d8c-a738-4cf5-a061-2908167709ec", + "text": "Let P(Rd) denote the set of Borel probability measures on Rd, and P2(Rd) the subset with finite second moments. For\nµ ∈P(Rd), we write µ also for its density with respect to Lebesgue measure when it exists. 
We use ⟨·, ·⟩ for inner\nproducts and ∥· ∥ for norms, with subscripts indicating the space when ambiguous.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 6, + "total_chunks": 50, + "char_count": 310, + "word_count": 54, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "21268671-80c6-4a26-bb1c-6f30deffefcc", + "text": "3.2 Kernel Density Estimation Definition 3.1 (KDE operator). Given a kernel k : Rd × Rd →R and µ ∈P(Rd), the KDE operator is\nTk[µ](x) := ∫ k(x, y)dµ(y). (2) For the Gaussian kernel kh(x, y) = exp(−∥x −y∥2/(2h2)) with bandwidth h > 0, we write µkde(x) := Tkh[µ](x). 3.3 Reproducing Kernel Hilbert Spaces Definition 3.2 (RKHS and kernel mean embedding). A symmetric positive definite kernel k induces a unique reproducing\nkernel Hilbert space Hk with inner product ⟨·, ·⟩Hk satisfying the reproducing property: f(x) = ⟨f, k(x, ·)⟩Hk for all\nf ∈Hk. The kernel mean embedding of µ ∈P(Rd) is mµk := ∫ k(·, y)dµ(y) ∈Hk. Definition 3.3 (Characteristic kernel). A kernel k is characteristic if the kernel mean embedding map µ ↦ mµk is\ninjective on P(Rd). 3.4 Wasserstein Gradient Flows Definition 3.4 (Wasserstein-2 gradient flow). Given a functional F : P2(Rd) →R with first variation δF/δq, its\nWasserstein-2 gradient flow is the curve {qt}t≥0 satisfying the continuity equation: ∂tqt = ∇· (qt ∇(δF/δq)|qt). (3)
", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 7, + "total_chunks": 50, + "char_count": 1022, + "word_count": 171, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "bdbb82bb-7dfd-45a9-83d0-5cf950d2e63e", + "text": "Equivalently, particles xt ∼qt evolve as dxt/dt = v(xt) where v(x) = −∇(δF/δq)(x). 3.5 The Drifting Model We recall the core formulation of Deng et al. [2026]. Given a data distribution p and a generated distribution q = f#pϵ,\nthe drifting field is:\nVp,q(x) = Ep[k(x, y+)(y+ −x)]/Ep[k(x, y+)] −Eq[k(x, y−)(y−−x)]/Eq[k(x, y−)], (4)\nwhere the first and second terms are denoted V+p(x) and V−q(x) respectively,\nwith training loss L = Eϵ[∥fθ(ϵ) −stopgrad(fθ(ϵ) + Vp,qθ(fθ(ϵ)))∥2]. 4 Method: Gradient Flow Drifting We present a unified framework in which generative models arise as Wasserstein gradient flows (WGFs) of divergence\nfunctionals under KDE approximation. The logical development proceeds in three layers: • Foundation 4.1: Under mild kernel regularity conditions, KDE-level distribution matching is equivalent to matching\nthe original distributions.\n• Engine 4.2–4.3: General f-divergence WGFs at the KDE level, with energy dissipation and unified identifiability.\n• Instantiation 4.4–4.6: The Drifting Model, MMD generators, and mixed gradient flows emerge as special cases. 
The framework extends naturally to Riemannian manifolds 4.7, and is summarized as a complete training pipeline in\n4.8.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 8, + "total_chunks": 50, + "char_count": 1135, + "word_count": 170, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fbcad435-7457-4561-9fce-beb117c3c3af", + "text": "4.1 Foundation: KDE Smoothing and Distribution Matching The starting point of our framework is that KDE smoothing, under mild kernel regularity, preserves distributional\nidentity and simultaneously provides the smoothness needed for gradient flow analysis. This allows us to work entirely\nat the KDE level without imposing any regularity on the data distribution p or the generated distribution q. Assumption 4.1 (Kernel regularity; full statement in Appendix A). Let k : Rd × Rd →R satisfy:\nK1. Characteristic: the mean embedding µ ↦ ∫ k(·, y)dµ(y) is injective on P(Rd). K2. Uniform gradient bound: Mk := supx,y ∥∇xk(x, y)∥< ∞. K3. Strict positivity: k(x, y) > 0 for all x, y. K4. Differentiability: x ↦ k(x, y) is C1 for every y. The Gaussian kernel kh(x, y) = exp(−∥x −y∥2/(2h2)) satisfies K1.–K4.; the Laplace kernel used in the original\nDrifting Model fails K4. (Appendix J). Table 1: Generative models as Wasserstein gradient flows of divergences under KDE approximation. 
All velocity fields\nare sample-computable via the KDE score formula (Appendix E).\nf(u) | Divergence | f′(u) | KDE velocity field vkde(x) | Model\nu log u | Forward KL | log u + 1 | ∇log pkde −∇log qkde | Drifting\n−log u | Reverse KL | −1/u | (pkde/qkde)(∇log pkde −∇log qkde) | –\n(1/2)(u −1)2 | χ2 | u −1 | (qkde/pkde)(∇log pkde −∇log qkde) | –\n(1/2)∥mpk −mqk∥2Hk | MMD | – | ∇x ∫ k(x, y)d(p −q)(y) = ∇(pkde −qkde) | MMD Theorem 4.2 (KDE regularity; proof in Appendix C).", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 9, + "total_chunks": 50, + "char_count": 1409, + "word_count": 233, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b35d28bf-c809-4d64-937c-d65c22b92811", + "text": "Under K2–K4, for any µ ∈P(Rd): (i) µkde ∈C1(Rd) with\n∇xµkde(x) = ∫ ∇xk(x, y)dµ(y); (ii) µkde(x) > 0 for all x; (iii) supx ∥∇µkde(x)∥≤Mk. In particular, no moment or smoothness conditions on µ are required: the constant Mk serves as a universal dominating\nfunction for any probability measure, enabling all subsequent Leibniz interchanges. Proposition 4.3 (KDE injectivity; proof in Appendix B). Under K1, µkde = νkde pointwise implies µ = ν. Remark 4.4 (Foundation summary). Under K1.–K4., the KDE-smoothed densities pkde and qkde are strictly positive\nand C1. In particular, the log-ratio log(pkde/qkde) is well-defined and C1, and pkde = qkde if and only if p = q. This\nmeans every divergence-minimization argument at the KDE level faithfully transfers to the original distributions. 4.2 Gradient Flows of f-Divergences under KDE Approximation With smoothed densities that are smooth and positive (Theorem 4.2), we can apply the standard Wasserstein gradient\nflow machinery to f-divergences directly at the KDE level. 
Recall that for a convex function f : (0, ∞) →R with f(1) = 0, the f-divergence is Df(ρ∥π) = ∫ π f(ρ/π)dx\n(Definition D.1 in Appendix). The WGF of F[q] = Df(q∥p) has first variation (δF/δq)(x) = f′(q(x)/p(x)) and particle\nvelocity vf(x) = −∇f′(q(x)/p(x)) (Proposition D.2 in Appendix). Replacing the true densities with their KDE\napproximations yields the generalized drifting velocity field:\nvkdef(x) = −∇f′(qkde(x)/pkde(x)). (5)\nTheorem 4.5 (Energy dissipation).", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 10, + "total_chunks": 50, + "char_count": 1476, + "word_count": 236, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "17f73e16-662d-4df1-b065-be571f3494a2", + "text": "Let f be strictly convex and let {qt}t≥0 be smooth positive densities evolving according to the continuity equation with velocity (5). Under appropriate boundary conditions (Appendix D, Remark D.3):\n(d/dt)Df(qt∥p) = −∫ qt(x) ∥∇f′(qt(x)/p(x))∥2 dx ≤0. (6)\nOn compact Riemannian manifolds without boundary (e.g., Sd−1), the boundary condition is vacuous and (6) holds\nunconditionally.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 11, + "total_chunks": 50, + "char_count": 380, + "word_count": 60, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f4b8097d-ba92-49f1-a3f0-5997005a1da9", + "text": "Table 1 records the specific velocity fields for the divergences of primary interest. 
Remark 4.6 (Factored velocity structure). All f-divergence velocities in Table 1 share the common factor (∇log pkde −\n∇log qkde), modulated by a density-ratio weight w(x): w ≡1 (forward KL), w = pkde/qkde (reverse KL), w =\nqkde/pkde (χ2). This weight governs the local emphasis: forward KL treats all regions equally, reverse KL up-weights\nregions of high data density (encouraging precision), and χ2 up-weights regions of high generated density (penalizing\nspurious mass).", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 12, + "total_chunks": 50, + "char_count": 559, + "word_count": 85, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d606ccad-0b6c-48cb-a031-ecd9f731ea6c", + "text": "4.3 Unified Identifiability Combining the distribution-matching foundation 4.1 with the gradient flow machinery 4.2, we obtain a unified\nidentifiability result. Theorem 4.7 (Unified identifiability (Proof in Appendix F.1)). Let k satisfy K1.–K4. and f be strictly convex with\nf(1) = 0. If the generalized drifting velocity (5) vanishes identically, vkdef ≡0, then p = q.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 13, + "total_chunks": 50, + "char_count": 370, + "word_count": 55, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "bc3e7c60-ea88-4f41-a98a-c4fb484315af", + "text": "Corollary 4.8 (Loss landscape). The KDE-level f-divergence Df(qkde∥pkde) satisfies:\n1. 
Df ≥0, with equality if and only if q = p (identifiability);", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 14, + "total_chunks": 50, + "char_count": 147, + "word_count": 22, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e2adfe43-a738-4cf5-a061-2908167709ec", + "text": "2. (d/dt)Df ≤0 along the Wasserstein gradient flow (energy dissipation);\n3. the only equilibrium of the flow is q = p. The unique global optimum of the KDE-level divergence is p = q, and the energy is monotonically non-increasing\nalong the flow. 4.4 The Drifting Model as Forward KL Gradient Flow We now show that the Drifting Model of Deng et al. [2026] is a special case of our framework, corresponding to the\nforward KL divergence f(u) = u log u. Theorem 4.9 (Core equivalence; proof in Appendix G). Let kh(x, y) = exp(−∥x −y∥2/(2h2)) with h > 0, and let\np, q ∈P(Rd).", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 15, + "total_chunks": 50, + "char_count": 590, + "word_count": 107, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "83578347-1b93-4e33-8e0e-c4ed4a583c24", + "text": "Then the drifting field (4) satisfies Vp,q(x) = h2 (∇log pkde(x) −∇log qkde(x)) for all x ∈Rd. 
(7) The proof is a direct computation: the Gaussian kernel satisfies ∇xkh(x, y) = ((y−x)/h2) kh(x, y), and substituting into the\nKDE score formula (Appendix E) gives exactly the mean-shift vectors V+p and V−q from (4). Corollary 4.10 (Drifting = Forward KL Wasserstein gradient flow + KDE). The right-hand side of (7) is precisely\nh2 vkdeKL(x), the forward KL row of Table 1 scaled by h2. Hence the Drifting Model's velocity fields correspond to the\nWasserstein-2 gradient flow of KL(qkde∥pkde), up to a time rescaling by h2. This identification immediately imports the convergence and identifiability results of 4.3 to the Drifting Model. The\nidentifiability proof, in particular, reduces to: Vp,q ≡0 ⇒(Thm. 4.9) ∇log pkde = ∇log qkde ⇒(Thm. 4.7) p = q. 4.5 MMD Generators as L2 Gradient Flows The squared MMD functional F[q] = (1/2)MMD2k(q, p) = (1/2)∥mqk −mpk∥2Hk is not an f-divergence, but fits naturally into\nour framework. Proposition 4.11 (MMD gradient flow velocity; proof in Appendix H). Under K1.–K4., the WGF velocity of\n(1/2)MMD2k(q, p) is\nvMMD(x) = ∇(pkde(x) −qkde(x)) = ∫ ∇xk(x, y)d(p −q)(y). (8) Note that vMMD is the gradient of the L2 density difference, while the f-divergence velocities in Table 1 involve the\ngradient of a nonlinear function of the density ratio. 
Both families are sample-computable via the KDE score formula,\nand the same identifiability argument applies (Remark in 4.3).",
    "paper_id": "2603.10592",
    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
    "authors": [
      "Jiarui Cao",
      "Zixuan Wei",
      "Yuxin Liu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
    "chunk_index": 16,
    "total_chunks": 50,
    "char_count": 1492,
    "word_count": 244,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "765ede45-7aad-4ffd-b3dc-363b1f01a5a7",
    "text": "4.6 Mixed Gradient Flows Different divergences induce complementary failure modes. We propose mixing gradient flows to combine their\nstrengths. Theorem 4.12 (Legitimacy of mixed gradient flows; proof in Appendix I). Let D1, D2 be divergences (Di(q∥p) ≥0,\nwith equality iff q = p), and α, β > 0 with α + β = 1. Define Dmix = αD1 + βD2. (a) Dmix is a valid divergence;\n(b) its WGF velocity is vmix = α v1 + β v2;\n(c) dDmix[qt]/dt ≤ 0 along the flow. Practical mixed drifting field. We propose combining the reverse KL and χ2 velocity fields:\nVmix(x) = α · (pkde/qkde)(∇log pkde −∇log qkde) + β · (qkde/pkde)(∇log pkde −∇log qkde), (9)\ncorresponding to Dmix = α KLkde(p∥q) + β χ2kde(q∥p). Referring to Remark 4.6, the reverse KL weight pkde/qkde\nprovides strong attraction toward high-density regions of p (precision-forcing, avoiding mode blurring), while the χ2\nweight qkde/pkde penalizes spurious generated mass (coverage-forcing, avoiding mode collapse). Their combination\nreconciles mode-seeking and mode-covering behaviors. 
Experiments in 5 confirm this qualitative picture.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 17, + "total_chunks": 50, + "char_count": 1073, + "word_count": 175, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "40cde3cb-34c6-4fc4-9a58-17a3391d57c0", + "text": "Gradient Flow Drifting Algorithm 1 Gradient Flow Drifting: training algorithm\nRequire: Generator fθ, data distribution p, source distribution pϵ\n1: Divergence selection: Choose divergence(s) achieving distributional convergence (e.g., reverse KL, forward KL,\nχ2, or mixture thereof).\n2: Velocity field: Derive the WGF velocity from the chosen divergence.\n3: Kernel design: Select a kernel k satisfying Assumption 4.1 K1.–K4., or their Riemannian analogues.\n4: for each training iteration do\n5: Sample ϵ ∼pϵ; compute x = fθ(ϵ) (generated samples).\n6: Sample y+ ∼p (data samples).\n7: Mini-batch KDE velocity estimation: Compute vkde(x) over {y+} and {x}.\n8: Update: L(θ) = Eϵ ∥fθ(ϵ) −sg(fθ(ϵ) + vkde(fθ(ϵ)))∥2 ; θ ←θ −η∇θL.\n9: end for 4.7 Extension to Riemannian Manifolds The Drifting Model Deng et al. [2026] trains in a semantic feature space that is empirically close to a hypersphere. This\nmotivates extending our framework to Riemannian manifolds M. Two benefits emerge:\n1. Vacuous boundary conditions. 
On compact manifolds without boundary (e.g., Sd−1), the energy dissipation\ninequality (6) holds unconditionally (Theorem 4.5), eliminating the tail-decay assumptions required on Rd.\n2.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 18, + "total_chunks": 50, + "char_count": 1191, + "word_count": 176, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7b7b07cc-a77d-476f-8740-efb13179bd60", + "text": "Richer kernel design. The adapted spherical assumptions K1S–K4S (Appendix J.2) admit kernels with qualitatively\ndifferent weighting profiles. For example, the von Mises–Fisher (vMF) kernel kκ(x, y) = exp(κ x⊤y) provides\na spherical analog of the Gaussian kernel, while the spherical logarithmic kernel (Appendix J.2, Proposition J.5)\nproduces polynomial (inverse-distance) weighting, analogous to the Euclidean IMQ kernel, offering heavier tails\nand better global mode coverage. All results of 4.2–4.6—velocity fields, energy dissipation, identifiability, and mixed flows—extend to the Riemannian\nsetting by replacing Euclidean gradients with Riemannian gradients and requiring the manifold analogues of K1.–K4.. Details and kernel verifications are given in Appendix J.2. 4.8 Training Pipeline Algorithm 1 summarizes the full training procedure of Gradient Flow Drifting. 
The framework is modular: one selects\na divergence (or mixture), a kernel satisfying K1.–K4., and trains a one-step generator via the stop-gradient loss.",
    "paper_id": "2603.10592",
    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
    "authors": [
      "Jiarui Cao",
      "Zixuan Wei",
      "Yuxin Liu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
    "chunk_index": 19,
    "total_chunks": 50,
    "char_count": 1026,
    "word_count": 139,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "43c02395-8f8d-46f8-8ef7-4a5220af085e",
    "text": "5.1 Synthetic 2D Benchmarks We visualize the particle evolution under the gradient-flow velocity field for different implementations of\ndivergence and kernel function. As shown in Fig.1, the original drifting model and L2 flow drifting (whose gradient flow coincides with that of MMD) both show\na mode-covering training process, as their divergences harshly punish uncovered modes; both exhibit noticeable\nblurring. The reverse KL divergence + χ2 divergence mixture flow drifting shows a very different evolution. This model\ngenerates almost exclusively precise samples, yet does not struggle with mode collapse: it quickly explores all the modes.",
    "paper_id": "2603.10592",
    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
    "authors": [
      "Jiarui Cao",
      "Zixuan Wei",
      "Yuxin Liu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
    "chunk_index": 20,
    "total_chunks": 50,
    "char_count": 640,
    "word_count": 98,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b47bc962-a4f3-4777-9dcc-2fa827a97e5b",
    "text": "The original drifting model uses the Laplace kernel, which can cause issues in high-probability regions. 
Since the Laplace\nkernel violates assumption K4., the gradient flow derived from it is only \"weakly\" defined in a mathematical sense, and\nduring the convergence stage it causes numerical instability (jittering) of particles near the data manifold. We\nobserve this phenomenon at the center of the swiss-roll distribution: the generated distribution shows a marked distortion there,\nwhereas the RBF-kernel version does not. While the drifting model achieved empirical success thanks to the uniformity of its semantic\ndistribution, it can be made considerably more stable through careful kernel design. (a) Reverse KL divergence + χ2 divergence (RBF kernel) (b) L2 distance (RBF kernel) (c) Forward KL divergence (Laplace kernel, original drifting model) (d) Forward KL divergence (RBF kernel) Figure 1: Training results with the gradient-flow velocity field under different implementations of divergence and\nkernel function on a 2D toy dataset. 6 Conclusion and Discussion We have introduced a new family of generative models, Gradient Flow Drifting, and established a mathematical equivalence showing that the Drifting Model\nis the Wasserstein gradient flow of the KL divergence under KDE approximation, a special case of our\nframework. We have proved that, with a suitably designed kernel function, matching the\nKDE-smoothed pushforward distribution yields an approximation of the original distribution, and we have extended the\nmethod to Riemannian manifolds, which loosens the constraints on the kernel function and makes the method more\nsuitable for the semantic space used in the Drifting Model Deng et al. 
[2026].", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 21, + "total_chunks": 50, + "char_count": 1709, + "word_count": 260, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "507e872e-696f-4eb3-973f-104ceefdf0b1", + "text": "Besides, we did some preliminary\nexperiments on synthetic benchmarks to validate the framework. Our approach utilizes the convergence of the KDE distribution to induce the convergence of the original\ndistribution. However, in practice, we can only approximate this KDE distribution using minibatches. As the dimension\nincreases, the variance of the minibatch estimation will gradually increase, which will seriously affect the stability of\nthe training and the final convergence effect, as other kernel-based methods suffer. Future work includes extending Gradient Flow Drifting to large-scale, high-dimensional datasets and\ndiverse generation tasks, such as conditional generation and multi-modal generation. We plan to conduct comprehensive\nablation experiments to evaluate the contribution of the combined reverse KL and χ2 gradient flows, explore the\nimpact of different kernel functions and bandwidth choices and investigate acceleration techniques such as mini-batch\nparticle updates and kernel approximation to improve computational efficiency and practical application capabilities. Meanwhile, we will follow the engineering techniques used in drifting model Deng et al. [2026], like training the Gradient Flow Drifting model in semantic space and using multiple bandwidths. Furthermore, as theoretical analysis indicates, the Riemannian\nmanifold is highly suitable for our approach. 
We will employ the hyperspherical semantic space constructed by JEPA,\nuse ViT-based instead of a CNN-based architecture to achieve high computation efficiency, and make this type of model\nmore scalable. Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability\nmeasures. Michael Arbel, Anna Korba, Adil Salim, and Arthur Gretton. Maximum mean discrepancy gradient flow.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 22, + "total_chunks": 50, + "char_count": 1819, + "word_count": 252, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "120f36c2-7079-4f4b-956c-dad8214a4e34", + "text": "Advances in\nneural information processing systems, 32, 2019. Ayoub Belhadji, Daniel Sharp, and Youssef Marzouk.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 23, + "total_chunks": 50, + "char_count": 111, + "word_count": 15, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a7bddd92-1e58-4fa6-aa7a-dbd40bc9cc69", + "text": "Weighted quantization using mmd: From mean field to mean\nshift via gradient flows. arXiv preprint arXiv:2502.10600, 2025. Mean shift, mode seeking, and clustering. IEEE transactions on pattern analysis and machine\nintelligence, 17(8):790–799, 1995. Lénaïc Chizat, Maria Colombo, Roberto Colombo, and Xavier Fernández-Real. Quantitative convergence of wasserstein\ngradient flows of kernel mean discrepancies. 
arXiv preprint arXiv:2603.01977, 2026. Jaemoo Choi, Jaewoong Choi, and Myungjoo Kang. Scalable wasserstein gradient flow for generative modeling through\nunbalanced optimal transport. In International Conference on Machine Learning, pages 8629–8650. Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on\npattern analysis and machine intelligence, 24(5):603–619, 2002. Mingyang Deng, He Li, Tianhong Li, Yilun Du, and Kaiming He. Generative modeling via drifting. arXiv preprint Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum\nmean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 24, + "total_chunks": 50, + "char_count": 1127, + "word_count": 145, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8435aa4b-36be-40cf-8cb9-4e2cc0e54dcd", + "text": "Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the fokker–planck equation. SIAM\njournal on mathematical analysis, 29(1):1–17, 1998. 
Yujia Li, Kevin Swersky, and Rich Zemel.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 25, + "total_chunks": 50, + "char_count": 208, + "word_count": 28, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4a1e4aa0-e055-44b2-9705-1eb9b94a0e9f", + "text": "Generative moment matching networks. In International conference on\nmachine learning, pages 1718–1727. A long-short flow-map perspective for drifting models. arXiv preprint arXiv:2602.20463, 2026. XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 26, + "total_chunks": 50, + "char_count": 256, + "word_count": 32, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9c5a5fb3-fe80-4c8b-8603-f88a1f1cadca", + "text": "Estimating divergence functionals and the likelihood\nratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010. Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational\ndivergence minimization. Advances in neural information processing systems, 29, 2016. Filippo Santambrogio. Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling,\nvolume 87 of Progress in Nonlinear Differential Equations and Their Applications. Birkhäuser, 2015. doi: 10.1007/\n978-3-319-20828-2. 
Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert RG Lanckriet. Hilbert space\nembeddings and metrics on probability measures. The Journal of Machine Learning Research, 11:1517–1561, 2010. Mingxuan Yi, Zhanxing Zhu, and Song Liu. Monoflow: Rethinking divergence gans via the perspective of wasserstein\ngradient flows. In International Conference on Machine Learning, pages 39984–40000. Linqi Zhou, Stefano Ermon, and Jiaming Song. Inductive moment matching. arXiv preprint arXiv:2503.07565, 2025.",
    "paper_id": "2603.10592",
    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
    "authors": [
      "Jiarui Cao",
      "Zixuan Wei",
      "Yuxin Liu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
    "chunk_index": 27,
    "total_chunks": 50,
    "char_count": 1128,
    "word_count": 140,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c7486610-6802-4ae4-8f2a-8baf2cf13bb4",
    "text": "A Definitions and Standing Assumptions Probability measures. Let P(Rd) denote the set of Borel probability measures on Rd. For µ ∈P(Rd) and a\nmeasurable function g, we write Eµ[g] = ∫Rd g(y)dµ(y) whenever the integral is well-defined. Assumption A.1 (Kernel Regularity K1–K4). Let k : Rd × Rd →R be a kernel satisfying:\nK1. The kernel k is positive definite, and the mean embedding µ 7→ ∫k(·, y)dµ(y) is injective on\nP(Rd). K2. Uniform gradient bound. Mk := sup x,y∈Rd ∥∇xk(x, y)∥ < ∞.\nK3. Strict positivity. k(x, y) > 0 for all x, y ∈Rd. K4. For every y ∈Rd, the map x 7→k(x, y) is continuously differentiable. Remark A.2 (Normalized KDE). For a translation-invariant kernel k(x, y) = φ(x −y) with φ ∈L1(Rd),\n∫µkde(x)dx = ∥φ∥L1 =: Zk for every µ ∈P(Rd). The normalized density ¯µh := µkde/Zk ∈P(Rd) satisfies\n∇log ¯µh = ∇log µkde, so all score-based formulas are unaffected by the normalization constant. 
B Injectivity of the KDE Operator", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 29, + "total_chunks": 50, + "char_count": 932, + "word_count": 164, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d1051bfd-d9f7-41a5-ac5b-a3c80a5c4916", + "text": "We show that a characteristic kernel allows the recovery of the original measure from its KDE. Proposition B.1 (KDE Injectivity). If µkde(x) = νkde(x) for all x ∈Rd, then µ = ν. Let Hk be the RKHS of k and denote the mean embeddings mµ := R k(·, y)dµ(y), mν := R k(·, y)dν(y) ∈\nHk. By the reproducing property, mµ(x) = ⟨mµ, k(·, x)⟩Hk = µkde(x) and similarly for ν. Hence µkde = νkde\npointwise implies ⟨mµ −mν, k(·, x)⟩Hk = 0 for all x. Since {k(·, x) : x ∈Rd} spans a dense subset of Hk, we obtain\nmµ = mν in Hk. By K1. (injectivity of the mean embedding), µ = ν. C Regularity of KDE-Smoothed Densities We establish that the smoothness of µkde is inherited entirely from the kernel, regardless of the regularity of µ. Theorem C.1 (KDE Regularity). Let k satisfy K2.–K4. and let µ ∈P(Rd) be arbitrary. Then:\n(i) µkde ∈C1(Rd) and differentiation commutes with integration: ∇x µkde(x) = ∇xk(x, y)dµ(y). (10)\n(ii) µkde(x) > 0 for all x ∈Rd.\n(iii) supx ∥∇µkde(x)∥≤Mk < ∞. Proof. 
(i): By K4., x 7→k(x, y) is C1 for each y.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 30, + "total_chunks": 50, + "char_count": 1017, + "word_count": 193, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0a7a2fba-0eaa-4f83-acbc-cb943d74d35d", + "text": "By K2., ∥∇xk(x, y)∥≤Mk for all x, y. Since the constant Mk is\ntrivially µ-integrable ( R Mkdµ = Mk < ∞for any probability measure µ), the Leibniz integral rule yields (10) and\ncontinuity of the derivative.\n(ii): By K3., k(x, y) > 0 for all x, y. Since µ is a nonzero positive measure, µkde(x) = R k(x, y)dµ(y) > 0. (iii): ∥∇µkde(x)∥= R ∇xk(x, y)dµ(y) ≤ R ∥∇xk(x, y)∥dµ(y) ≤Mk. For any p, q ∈P(Rd), the KDE densities pkde, qkde are strictly positive and C1. In particular, the logratio log(pkde/qkde) is well-defined and C1, and the f-divergence machinery of the next section applies to pkde, qkde\nwith no additional assumptions on p, q. D Wasserstein Gradient Flows of f-Divergences We recall the Wasserstein gradient flow (WGF) framework Ambrosio et al. [2005] for f-divergences between smooth\npositive densities. Gradient Flow Drifting Definition D.1 (f-divergence). Let f : (0, ∞) →R be convex with f(1) = 0. For positive densities ρ, π on Rd, Z ρ(x)\nDf(ρ∥π) := π(x) f dx. (11)\nRd π(x) Proposition D.2 (First Variation, Velocity, and Energy Dissipation). 
Let ρ, π ∈C1(Rd) with ρ, π > 0 everywhere and\nDf(ρ∥π) < ∞.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 31, + "total_chunks": 50, + "char_count": 1116, + "word_count": 196, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "07e079f0-9c8f-4fb7-8208-a6e6127e2988", + "text": "Then:\n(i) The first variation of Df(ρ∥π) with respect to ρ is δDf(ρ∥π) ′ ρ(x) (x) = f . (12)\nδρ π(x) (ii) The WGF particle velocity is\n′ ρ(x)\nvf(x) = −∇f . (13)\nπ(x) (iii) Suppose in addition that the boundary terms arising from integration by parts vanish: Z ′ ρt ′ ρt lim ρt f π ∇f π · ˆn dS = 0. (14) R→∞ ∥x∥=R Then along any smooth WGF solution (ρt)t≥0: d Z ′ ρt(x) 2\ndtDf(ρt∥π) = − Rd ρt(x) ∇f π(x) dx ≤0. (15) Proof. (i) Let η be a smooth, compactly-supported perturbation with R ηdx = 0. d Z η Z\nDf(ρ + ϵη∥π) = π f ′(u) dx = f ′(u) ηdx,\ndϵ ϵ=0 π giving (12).\n(ii) Immediate from Definition 3.4: vf = −∇δFδρ = −∇f ′(u).\n(iii) Write Φ := f ′(ρt/π) = δFδρ . Using the continuity equation (3): d Z Z\ndtDf(ρt∥π) = Rd Φ ∂tρtdx = Rd Φ ∇· ρt∇Φ dx. The product rule gives Φ ∇· (ρt∇Φ) = ∇· (Φ ρt∇Φ) −ρt∥∇Φ∥2. 
Integrating the divergence term over BR and\napplying the divergence theorem yields the boundary integral ∫∥x∥=R ρt Φ (∇Φ · ˆn) dS,",
    "paper_id": "2603.10592",
    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
    "authors": [
      "Jiarui Cao",
      "Zixuan Wei",
      "Yuxin Liu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
    "chunk_index": 32,
    "total_chunks": 50,
    "char_count": 935,
    "word_count": 199,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5acabd2c-c607-4d13-8c6e-f367d751c473",
    "text": "which vanishes as R →∞ by (14). Hence\ndDf(ρt∥π)/dt = −∫Rd ρt∥∇Φ∥2 dx ≤ 0. Remark D.3 (Boundary condition (14)). We distinguish two settings rather than simply assuming the vanishing-boundary\ncondition (14). On Rd, in our context we assume the original p, q ∈P2(Rd), and the induced KDE distributions then\nsatisfy the vanishing-boundary condition easily. Condition (14) is an assumption on the joint tail behavior of ρt and π. It holds whenever ρt, π, their ratio, and its\ngradient have sufficient decay at infinity. In the context of this paper, ρt and π are KDE-smoothed densities, whose\ntails are governed by the kernel. For any kernel with exponential or faster decay (e.g., Gaussian, Matérn with\nν > 1, Pseudo-Huber), the KDE-smoothed densities inherit exponential decay and the condition is satisfied for all\np, q ∈P2(Rd) with finite second moments. For kernels with only polynomial decay (e.g., IMQ with exponent β),\nthe condition requires β > (d −2)/2. 
Gradient Flow Drifting If the ambient space is a compact Riemannian manifold M without boundary, then condition (14) is vacuous.", + "paper_id": "2603.10592", + "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences", + "authors": [ + "Jiarui Cao", + "Zixuan Wei", + "Yuxin Liu" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10592v1", + "chunk_index": 33, + "total_chunks": 50, + "char_count": 1075, + "word_count": 176, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3b0bc544-1457-4017-9f16-bdad5397da78", + "text": "Indeed, the divergence theorem on M reads Z Z\n∇g · X dvolg = g(X, ˆn) dS = 0\nM ∂M\nsince ∂M = ∅. Consequently, the energy dissipation inequality (15) holds unconditionally for any f-divergence,\nany kernel satisfying the manifold analogues of K1.–K4., and any p, q ∈P(M). Proposition D.4 (Identifiability for f-divergence gradient flows). Let M be a connected Riemannian manifold (e.g., Rd\nor Sd−1). Let ρ, π ∈C1(M) with ρ, π > 0 and R M ρ dvol = R M π dvol. If f is strictly convex and the WGF velocity\nvanishes identically, vf ≡0, then ρ = π. By (13), vf ≡0 implies ∇f ′(ρ/π) ≡0 on M. Since M is connected and f ′(ρ/π) ∈C0(M), we have\nf ′(ρ/π) ≡c for some constant c. Strict convexity of f implies f ′ is strictly monotone, so ρ/π ≡(f ′)−1(c) =: λ > 0. 
Integrating: ∫ρ = λ ∫π, hence λ = 1 and ρ = π.",
    "paper_id": "2603.10592",
    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
    "authors": [
      "Jiarui Cao",
      "Zixuan Wei",
      "Yuxin Liu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
    "chunk_index": 34,
    "total_chunks": 50,
    "char_count": 801,
    "word_count": 163,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "4a7d4186-0f7a-4ad5-992e-ceb3f68f95e0",
    "text": "Remark D.5 (Specific f-divergences). Writing u := ρ(x)/π(x), we record the velocities for the divergences of primary\ninterest:\nf(u) Divergence f ′(u) Velocity vf(x)\nu log u KL(ρ∥π) 1 + log u ∇log π −∇log ρ\n−log u KL(π∥ρ) −1/u (π/ρ)(∇log π −∇log ρ)\n(1/2)(u −1)2 χ2(ρ∥π) u −1 (ρ/π)(∇log π −∇log ρ) We express the score of µkde as an expectation under µ, establishing sample computability. Proposition E.1 (KDE Score). Let k satisfy K2.–K4. and µ ∈P(Rd). Then for all x ∈Rd:\n∇log µkde(x) = ∫∇xk(x, y)dµ(y) / ∫k(x, y)dµ(y). (16)\nBy Theorem C.1(i) and (ii), µkde ∈C1 and µkde > 0, so ∇log µkde = ∇µkde/µkde. Substituting (10)\ngives (16).",
    "paper_id": "2603.10592",
    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
    "authors": [
      "Jiarui Cao",
      "Zixuan Wei",
      "Yuxin Liu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
    "chunk_index": 35,
    "total_chunks": 50,
    "char_count": 627,
    "word_count": 115,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8fe64439-b535-4da8-878e-fd18db51f518",
    "text": "Corollary E.2 (Gaussian Kernel Score). For kh(x, y) = exp(−∥x −y∥2/(2h2)):\n∇log µkde(x) = (Eµ,kh[y | x] −x)/h2, (17)\nwhere Eµ,kh[y | x] := ∫y kh(x, y)dµ(y) / ∫kh(x, y)dµ(y). For the Gaussian kernel, ∇xkh(x, y) = ((y−x)/h2) kh(x, y). 
Substituting into (16):\n∇log µkde(x) = ∫((y−x)/h2) kh(x, y)dµ(y) / ∫kh(x, y)dµ(y) = (1/h2)(∫y kh(x, y)dµ(y) / ∫kh(x, y)dµ(y) −x). Throughout this section, π := pkde and ρ0 := q0,kde denote the KDE-smoothed densities of the target distribution\np and the initial generated distribution q0, respectively. By Theorem C.1, both are strictly positive and C1 under\nAssumptions K1.–K4., with no regularity conditions on p or q0. We consider the Wasserstein-2 gradient flow of the\nf-divergence F[ρ] = Df(ρ∥π), i.e., the continuity equation\n∂tρt = ∇· (ρt∇Φt), Φt := f ′(ρt/π), (18)\nwith initial condition ρ0 > 0. Remark F.1 (Density-level vs. particle-level flow). Equation (18) describes the evolution of a smooth density ρt in\nthe Wasserstein-2 metric. In the practical training algorithm 4.8, one evolves a finite collection of particles whose\nempirical distribution is qt, and estimates the velocity using the KDE density qt,kde from a mini-batch. The convergence\ntheorems below apply to the idealized density-level flow; the particle system is viewed as a consistent approximation\nthat converges to this flow in the large-sample limit. F.1 Proof of Identifiability (Theorem 4.7)",
    "paper_id": "2603.10592",
    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
    "authors": [
      "Jiarui Cao",
      "Zixuan Wei",
      "Yuxin Liu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
    "chunk_index": 36,
    "total_chunks": 50,
    "char_count": 1411,
    "word_count": 237,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "650d48f3-43a5-4a84-a3e9-5c6290ebd096",
    "text": "The proof proceeds in two steps. Step 1: Constant density ratio. By definition, vkdef(x) = −∇f ′(qkde(x)/pkde(x)). The hypothesis vkdef ≡0 thus\nimplies\n∇f ′(qkde(x)/pkde(x)) = 0 for all x. (19)\nLet M be the ambient space (Rd or a connected Riemannian manifold). 
By Theorem C.1, pkde and qkde are C1\nwith pkde, qkde > 0, so the ratio u(x) := qkde(x)/pkde(x) is C1 and strictly positive. By the composition rule,\nΦ := f ′ ◦u ∈C1(M). Since M is connected and ∇Φ ≡0, Φ is a constant: f ′(u(x)) = c for some c ∈R. Strict convexity of f implies that f ′ is strictly monotone, hence injective. Therefore u(x) = (f ′)−1(c) =: λ > 0 for all\nx, i.e., qkde = λ pkde everywhere. Step 2: KDE matching implies distribution matching. Integrating both sides over M:\n∫M qkde dx = λ ∫M pkde dx.\nFor a translation-invariant kernel, ∫qkde = ∫pkde = Zk (Remark A.2); on a compact manifold the same equality\nholds by symmetry. Hence λ = 1 and qkde = pkde. By the injectivity of the KDE operator under characteristic kernels\n(Proposition B.1), q = p.",
    "paper_id": "2603.10592",
    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
    "authors": [
      "Jiarui Cao",
      "Zixuan Wei",
      "Yuxin Liu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
    "chunk_index": 37,
    "total_chunks": 50,
    "char_count": 1031,
    "word_count": 200,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "727d2ef0-0eda-4963-8826-086984b75f7c",
    "text": "Remark F.2 (MMD identifiability). For the MMD functional (Proposition H.2), vMMD ≡0 implies ∇(pkde−qkde) ≡0. By connectedness, pkde −qkde = c. 
Integrating gives c = 0, so pkde = qkde, and Proposition B.1 yields p = q.",
    "paper_id": "2603.10592",
    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
    "authors": [
      "Jiarui Cao",
      "Zixuan Wei",
      "Yuxin Liu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
    "chunk_index": 38,
    "total_chunks": 50,
    "char_count": 217,
    "word_count": 37,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2d1a932c-c560-49b8-9449-ff566216dea4",
    "text": "G Core Equivalence: Drifting as KL Gradient Flow Theorem G.1 (Core Equivalence). Let kh be the Gaussian kernel and p, q ∈P(Rd).",
    "paper_id": "2603.10592",
    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
    "authors": [
      "Jiarui Cao",
      "Zixuan Wei",
      "Yuxin Liu"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
    "chunk_index": 39,
    "total_chunks": 50,
    "char_count": 127,
    "word_count": 22,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "00dc777d-3fdb-43ab-bb65-6ea95ae5401c",
    "text": "Then the drifting field (4) satisfies Vp,q(x) = h2 (∇log pkde(x) −∇log qkde(x)) = h2 vKL(x), (20)\nwhere vKL = ∇log π −∇log ρ is the WGF velocity of the KL divergence DKL(ρ∥π) with ρ = qkde, π = pkde (Remark D.5). Apply Corollary E.2 to p and q respectively:\nh2∇log pkde(x) = Ep,kh[y | x] −x, (21)\nh2∇log qkde(x) = Eq,kh[y | x] −x. (22) Subtracting, the x terms cancel:\nh2 (∇log pkde(x) −∇log qkde(x)) = Ep,kh[y | x] −Eq,kh[y | x] = Vp,q(x). (23)\nThe identification with vKL follows from Remark D.5 with ρ = qkde, π = pkde (both smooth and positive by\nCorollary C.2). Definition H.1 (Squared MMD). 
For a positive-definite kernel k with RKHS Hk, the squared MMD between p, q ∈ P(Rd) is\nMMD2k(q, p) := ∥mq − mp∥2Hk, (24)\nwhere mµ := ∫ k(·, y) dµ(y) is the mean embedding. Gradient Flow Drifting Proposition H.2 (MMD Flow Velocity).",
+    "paper_id": "2603.10592",
+    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
+    "authors": [
+      "Jiarui Cao",
+      "Zixuan Wei",
+      "Yuxin Liu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
+    "chunk_index": 40,
+    "total_chunks": 50,
+    "char_count": 814,
+    "word_count": 148,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "ca555767-266e-4d0f-8497-07a403edaaa1",
+    "text": "Let k satisfy K1.–K4.. The WGF of F[q] := (1/2) MMD2k(q, p) has particle\nvelocity\nvMMD(x) = ∫ ∇xk(x, y) d(p − q)(y) = ∇( pkde(x) − qkde(x) ). (25)\nPerturb q → (1 − ϵ)q + ϵδx, so mqϵ = mq + ϵ(k(·, x) − mq). Expanding and differentiating at ϵ = 0:\n(δF/δq)(x) = ⟨mq − mp, k(·, x)⟩Hk = mq(x) − mp(x) = qkde(x) − pkde(x), using the reproducing property. The velocity is vMMD = −∇(δF/δq) = ∇(pkde − qkde).",
+    "paper_id": "2603.10592",
+    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
+    "authors": [
+      "Jiarui Cao",
+      "Zixuan Wei",
+      "Yuxin Liu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
+    "chunk_index": 41,
+    "total_chunks": 50,
+    "char_count": 371,
+    "word_count": 71,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "e80cf0ec-d578-4be8-b336-0250c2ebb5f1",
+    "text": "By Theorem C.1(i), differentiation\nunder the integral gives the first equality in (25). I Superposition of Gradient Flows Proposition I.1 (Superposition). Let F1, F2 : P(Rd) → R be functionals with well-defined first variations. 
For any\nα, β ≥ 0, the mixed functional Fmix := αF1 + βF2 has WGF velocity\nvmix(x) = α v1(x) + β v2(x). (26)\nBy linearity of the first variation and the gradient:\nvmix = −∇(δFmix/δρ) = −∇( α δF1/δρ + β δF2/δρ ) = αv1 + βv2.\nTable 2 summarizes the KDE-based gradient flow velocities, all sample-computable via the score formula (16). Table 2: Unified KDE-based gradient flow framework. All velocities are expressed via pkde, qkde and are sample-computable.\nFunctional | WGF velocity v(x) | Model\nDKL(¯qh∥¯ph) | ∇log pkde − ∇log qkde | Drifting (/h2)\nDKL(¯ph∥¯qh) | ∇(pkde/qkde) | -\nχ2(¯qh∥¯ph) | −∇(qkde/pkde) | -\nMMD2kh(q, p) | ∇(pkde − qkde) | MMD generators\nJ Analysis of Specific Kernel Families We verify Assumptions K1.–K4. for several kernel families. For each kernel, we also compute the score weight w(r)\ndefined by ∇x log k(x, y) = −w(r)(x − y) where r := ∥x − y∥.",
+    "paper_id": "2603.10592",
+    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
+    "authors": [
+      "Jiarui Cao",
+      "Zixuan Wei",
+      "Yuxin Liu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
+    "chunk_index": 42,
+    "total_chunks": 50,
+    "char_count": 1068,
+    "word_count": 176,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "2dd0b355-f21d-40da-9e98-e9e4453513a1",
+    "text": "J.1 Euclidean Kernels J.1.1 Gaussian (RBF) Kernel The Gaussian kernel kh(x, y) := exp(−∥x − y∥2/(2h2)) satisfies K1.–K4.. Its score weight is\nw(r) = 1/h2 (constant). If kh is the Gaussian kernel, one additionally obtains µkde ∈ C∞(Rd) with every derivative uniformly bounded for any\nµ ∈ P(Rd). 
Gradient Flow Drifting",
+    "paper_id": "2603.10592",
+    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
+    "authors": [
+      "Jiarui Cao",
+      "Zixuan Wei",
+      "Yuxin Liu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
+    "chunk_index": 43,
+    "total_chunks": 50,
+    "char_count": 312,
+    "word_count": 49,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "12d0cec5-3074-4466-93fb-da9866ff6e2c",
+    "text": "K1.: The Fourier transform ˆφ(ω) = (2πh2)d/2 exp(−h2∥ω∥2/2) > 0 for all ω. By Sriperumbudur et al.\n[2010, Theorem 9], a translation-invariant kernel with strictly positive Fourier transform is characteristic. K2.: ∥∇xkh∥ = (r/h2) e^(−r2/(2h2)) where r = ∥x − y∥. Maximizing over r ≥ 0: the maximum occurs at r = h, giving\nMk = (1/h) e^(−1/2) = 1/(h√e) < ∞. K4.: kh ∈ C∞(Rd × Rd). J.1.2 Matérn-ν Kernel The Matérn kernel with smoothness ν > 0 and length scale ℓ > 0,\nkν,ℓ(x, y) = (2^(1−ν)/Γ(ν)) (√(2ν) r/ℓ)^ν Kν(√(2ν) r/ℓ), r := ∥x − y∥,\nwhere Kν is the modified Bessel function of the second kind, satisfies all four assumptions if and only if ν > 1. K1.: The Fourier transform ˆφ(ω) ∝ (2ν/ℓ2 + ∥ω∥2)−(ν+d/2) > 0 for all ω and ν > 0, so kν,ℓ is characteristic. K3.: kν,ℓ > 0 since Kν(z) > 0 for z > 0 and kν,ℓ(x, x) = 1. K4.: The sample path regularity theory of Matérn processes shows that kν,ℓ is C⌈ν⌉−1 as a function of r. When ν ≤ 1,\na cusp at r = 0 (Laplace-like) violates K4.. When ν > 1, at least C1 regularity is guaranteed. K2.: When ν > 1, ∥∇xkν,ℓ∥ is continuous (by K4.) and decays exponentially as r → ∞, hence bounded. 
When\nν ≤ 1, the gradient diverges at r = 0.",
+    "paper_id": "2603.10592",
+    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
+    "authors": [
+      "Jiarui Cao",
+      "Zixuan Wei",
+      "Yuxin Liu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
+    "chunk_index": 44,
+    "total_chunks": 50,
+    "char_count": 1138,
+    "word_count": 224,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "e636fdee-39ed-4263-b54a-794dd63b42f6",
+    "text": "J.1.3 Summary of Euclidean Kernels Table 3: Verification of K1.–K4. for Euclidean kernels.\nKernel | K1 | K2 | K3 | K4 | Status\nGaussian | ✓ | ✓ | ✓ | ✓ | ✓\nIMQ | ✓ | ✓ | ✓ | ✓ | ✓\nPseudo-Huber | ✓ | ✓ | ✓ | ✓ | ✓\nMatérn (ν > 1) | ✓ | ✓ | ✓ | ✓ | ✓\nLaplace | ✓ | ✓ | ✓ | ✗ | ✗\n✓: verified; ✗: fails. J.2 Spherical Kernels On the unit sphere Sd−1 := {x ∈ Rd : ∥x∥ = 1} with the round metric, we adapt the kernel assumptions. With ∇S denoting the Riemannian gradient on Sd−1:\nK1S. k is characteristic on P(Sd−1). K2S. sup_{x,y∈Sd−1} ∥∇S,xk(x, y)∥ < ∞. K3S. k(x, y) > 0 for all x, y ∈ Sd−1. K4S. x ↦ k(x, y) is C1 on Sd−1. J.2.1 von Mises–Fisher (vMF) Kernel The vMF kernel kκ(x, y) := exp(κ x⊤y) with κ > 0 satisfies K1S–K4S. K1S: The Mercer expansion kκ(x, y) = Σ_{ℓ≥0} aℓ(κ) Σ_m Yℓ^m(x) Yℓ^m(y) has coefficients aℓ(κ) > 0 for all\nℓ ≥ 0. 
Since all eigenvalues are positive, kκ is universal and hence characteristic.",
+    "paper_id": "2603.10592",
+    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
+    "authors": [
+      "Jiarui Cao",
+      "Zixuan Wei",
+      "Yuxin Liu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
+    "chunk_index": 45,
+    "total_chunks": 50,
+    "char_count": 820,
+    "word_count": 161,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "191a9b7e-a611-478b-b058-f8000b60add7",
+    "text": "Gradient Flow Drifting K2S: The Riemannian gradient is ∇S,xkκ = κ kκ ProjTxS(y) where ProjTxS(y) = y − (x⊤y)x. Since kκ ≤ eκ and\n∥ProjTxS(y)∥ ≤ 1: ∥∇S,xkκ∥ ≤ κeκ < ∞. Proposition J.4 (Spherical Core Equivalence). Let p, q ∈ P(Sd−1) and kκ be the vMF kernel. Define µkde(x) := ∫_{Sd−1} kκ(x, y) dµ(y). (i) The spherical KDE score is\n∇S log µkde(x) = κ ProjTxS( Eµ,kκ[y | x] ). (27)\n(ii) The spherical drifting field VSp,q(x) := ProjTxS( Ep,kκ[y | x] − Eq,kκ[y | x] ) satisfies\nVSp,q(x) = (1/κ) ( ∇S log pkde(x) − ∇S log qkde(x) ). (28)\nProof. (i) The ambient gradient of kκ(x, y) with respect to x is κ y kκ(x, y). Hence the ambient gradient of log µkde\nis κ ∫ y kκ dµ / ∫ kκ dµ = κ Eµ,kκ[y | x]. The Riemannian gradient is its tangential projection, giving (27).\n(ii) By linearity of ProjTxS: ∇S log pkde − ∇S log qkde = κ ProjTxS( Ep,kκ[y | x] − Eq,kκ[y | x] ) = κ VSp,q(x). 
J.2.2 Spherical Logarithmic Kernel Let c > 0 and 0 < α < 1/(2 + c). The spherical logarithmic kernel defined by\nkc,α(x, y) := −log( α(1 − x⊤y + c) ) (29)",
+    "paper_id": "2603.10592",
+    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
+    "authors": [
+      "Jiarui Cao",
+      "Zixuan Wei",
+      "Yuxin Liu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
+    "chunk_index": 46,
+    "total_chunks": 50,
+    "char_count": 983,
+    "word_count": 185,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "602e4b8a-d290-4bcf-80bf-4528a169b4f1",
+    "text": "satisfies all spherical assumptions K1S–K4S. Since x, y ∈ Sd−1, we have z ∈ [−1, 1]. Let the argument of the logarithm be denoted as\ng(z) := α(1 − z + c). K3S (Strict positivity): We analyze the bounds of g(z). Since z ∈ [−1, 1], the term 1 − z + c achieves its minimum at\nz = 1 (yielding c) and its maximum at z = −1 (yielding 2 + c):\n0 < α · c ≤ g(z) ≤ α(2 + c). (30)\nBy the given condition α < 1/(2 + c), we strictly have g(z) < 1. Thus, 0 < g(z) < 1 for all x, y ∈ Sd−1. Consequently,\nkc,α(x, y) = −log(g(z)) > 0 everywhere. K4S (Continuous differentiability): Since g(z) ≥ α · c > 0, the argument of the logarithm is strictly bounded away\nfrom zero. The functions z = x⊤y and t ↦ −log(t) (for t > 0) are smooth (C∞). Therefore, their composition\nkc,α ∈ C∞(Sd−1 × Sd−1), which trivially implies C1. K2S (Uniform gradient bound): The ambient gradient of the kernel with respect to x is:\n∇xkc,α(x, y) = −(d/dz) log(α(1 − z + c)) · ∇x(x⊤y) = y/(1 − z + c). (31)\nThe Riemannian gradient is the projection onto the tangent space TxSd−1:\n∇S,xkc,α(x, y) = ProjTxS(∇xkc,α) = (y − (x⊤y)x)
/(1 − z + c). (32)",
+    "paper_id": "2603.10592",
+    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
+    "authors": [
+      "Jiarui Cao",
+      "Zixuan Wei",
+      "Yuxin Liu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
+    "chunk_index": 47,
+    "total_chunks": 50,
+    "char_count": 1074,
+    "word_count": 213,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "4d02830b-27ba-403e-a61e-0130bf47f572",
+    "text": "Since x and y are unit vectors, ∥y − zx∥2 = ∥y∥2 − 2z(x⊤y) + z2∥x∥2 = 1 − z2. Thus:\n∥∇S,xkc,α(x, y)∥ = √(1 − z2)/(1 − z + c). (33)\nSince 1 − z2 ≤ 1 and 1 − z + c ≥ c > 0, we have the uniform bound ∥∇S,xkc,α∥ ≤ 1/c < ∞.",
+    "paper_id": "2603.10592",
+    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
+    "authors": [
+      "Jiarui Cao",
+      "Zixuan Wei",
+      "Yuxin Liu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
+    "chunk_index": 48,
+    "total_chunks": 50,
+    "char_count": 200,
+    "word_count": 47,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "80e9038a-e425-44bd-93a8-a6c0f82451bd",
+    "text": "Gradient Flow Drifting K1S (Characteristic): We expand the kernel into a Taylor series:\nkc,α(x, y) = −log α − log((1 + c) − z) (34)\n= −log α − log(1 + c) − log(1 − z/(1 + c)). (35)\nUsing the Maclaurin series −log(1 − x) = Σ_{n≥1} x^n/n (which converges absolutely since z/(1 + c) ≤ 1/(1 + c) < 1), we\nobtain:\nkc,α(x, y) = −log(α(1 + c)) + Σ_{n≥1} (x⊤y)^n/(n(1 + c)^n), (36)\nwith a0 := −log(α(1 + c)) and an := 1/(n(1 + c)^n). From the given condition α < 1/(2 + c) < 1/(1 + c), we have α(1 + c) < 1, which strictly implies a0 > 0. For all n ≥ 1,\nsince 1 + c > 1, we clearly have an > 0. Because the kernel kc,α can be expressed as a power series f(x⊤y) = Σ_{n≥0} an(x⊤y)^n where an > 0 for all n ≥ 0, Schoenberg's theorem guarantees that it is strictly positive definite on\nSd−1 for any dimension d ≥ 2. 
Thus, it is a universal (and therefore characteristic) kernel. Remark J.6 (Spherical Score for Logarithmic Kernel). Following the same logic as Proposition J.4, the spherical KDE\nscore for this logarithmic kernel can be explicitly derived. Define the pairwise weight function Wc(x, y) := 1/(1 − x⊤y + c).\nThe ambient gradient of µkde is ∫ Wc(x, y) y dµ(y).",
+    "paper_id": "2603.10592",
+    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
+    "authors": [
+      "Jiarui Cao",
+      "Zixuan Wei",
+      "Yuxin Liu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
+    "chunk_index": 49,
+    "total_chunks": 50,
+    "char_count": 1073,
+    "word_count": 210,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "f28ca07a-858c-4ed5-b497-815a9f91623e",
+    "text": "Consequently, the Riemannian score is:\n∇S log µkde(x) = ProjTxS( ∫ Wc(x, y) y dµ(y) / ∫ kc,α(x, y) dµ(y) ). (37)\nUnlike the vMF kernel, the weighting factor Wc(x, y) here is polynomial (specifically, inversely proportional to distance\nsquared, analogous to the Euclidean IMQ kernel), which typically produces heavier tails and better global mode\ncoverage.",
+    "paper_id": "2603.10592",
+    "title": "Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences",
+    "authors": [
+      "Jiarui Cao",
+      "Zixuan Wei",
+      "Yuxin Liu"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10592v1",
+    "chunk_index": 50,
+    "total_chunks": 50,
+    "char_count": 348,
+    "word_count": 52,
+    "chunking_strategy": "semantic"
+  }
+]
\ No newline at end of file
diff --git a/data/chunks/2603.10597_semantic.json b/data/chunks/2603.10597_semantic.json
new file mode 100644
index 0000000000000000000000000000000000000000..b31a509d4940a12659a3253f8e7784e30f625820
--- /dev/null
+++ b/data/chunks/2603.10597_semantic.json
@@ -0,0 +1,1082 @@
+[
+  {
+    
"chunk_id": "b8dea9a1-e4fb-4626-8c26-224742a81e6d",
+    "text": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction\nHao Zhou1,2,∗, Lu Qi3,∗, Jason Li4, Jie Zhang1, Yi Liu5, Xu Yang6, Mingyu Fan5,†, Fei Luo1,†\n1Great Bay University 2Tsinghua SIGS 3Wuhan University 4NTU 5Donghua University 6CASIA\nAbstract\nTrajectory prediction is critical for autonomous driving, enabling safe and efficient planning in dense, dynamic traffic. Most existing methods optimize prediction accuracy under fixed-length observations. However, real-world driving often yields variable-length, incomplete observations, posing a challenge to these methods. A common strategy is to directly map features from incomplete observations to those from complete ones. This one-shot mapping, however, struggles to learn accurate representations for short trajectories due to significant information gaps. To address this issue, we propose a Progressive Retrospective Framework (PRF), which gradually aligns features from incomplete observations with those from complete ones via a cascade of retrospective units. Each unit consists of a Retrospective Distillation Module (RDM) and a Retrospective Prediction Module (RPM), where RDM distills features and RPM recovers previous timesteps using the distilled features. Moreover, we propose a Rolling-Start Training Strategy (RSTS) that enhances data efficiency during PRF training. PRF is plug-and-play with existing methods. Extensive experiments on datasets Argoverse 2 and Argoverse 1 demonstrate the effectiveness of PRF. Code is available at https://github.com/zhouhao94/PRF.\nFigure 1. Fig. 1a and Fig. 1b display two common scenarios that yield variable-length, incomplete trajectories (a newly entered vehicle and a tracking-lost vehicle at the edge of the ego vehicle's perception range). Fig. 1c and Fig. 1d respectively present the mADE6 and mFDE6 results on Argoverse 2 for the original DeMo [47], DeMo with Isolated Training (DeMo-IT), and DeMo with PRF (DeMo-PRF) under varying observation lengths.\nHowever, complete historical observations are often unavailable in real-world settings.",
+    "paper_id": "2603.10597",
+    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
+    "authors": [
+      "Hao Zhou",
+      "Lu Qi",
+      "Jason Li",
+      "Jie Zhang",
+      "Yi Liu",
+      "Xu Yang",
+      "Mingyu Fan",
+      "Fei Luo"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
+    "chunk_index": 0,
+    "total_chunks": 45,
+    "char_count": 2181,
+    "word_count": 303,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "1210ccc8-0ae5-4ddf-916a-ee585c48107e",
+    "text": "1. Introduction\nTrajectory prediction for dynamic agents in traffic scenarios is crucial for autonomous driving systems, enabling vehicles to anticipate the future motions of road users, plan safe and efficient maneuvers, and avoid collisions. For example, when a vehicle first enters the ego vehicle's perception range (Fig. 1a) or is detected again after being lost due to occlusion or tracking errors (Fig. 1b), the temporal context is insufficient to reconstruct a complete historical trajectory. Such incomplete, variable-length observations pose a challenge for existing methods. As shown in Fig. 1c and Fig. 1d,",
+    "paper_id": "2603.10597",
+    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
+    "authors": [
+      "Hao Zhou",
+      "Lu Qi",
+      "Jason Li",
+      "Jie Zhang",
+      "Yi Liu",
+      "Xu Yang",
+      "Mingyu Fan",
+      "Fei Luo"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
+    "chunk_index": 1,
+    "total_chunks": 45,
+    "char_count": 674,
+    "word_count": 95,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "6114012a-75ca-4143-b54b-b8875ac492e",
+    "text": "The numerous learning-based methods [8, 13, 21, 26, 34, 39, 47, 51–53] have been proposed to improve prediction accuracy. Though these methods have made significant progress, they primarily focus on optimizing network architectures to improve prediction accuracy using idealized standard-length observations as inputs. (* Equal contribution. † Co-corresponding authors.) The performance of the state-of-the-art method DeMo [47] degrades significantly as the number of observed timesteps decreases. This degradation can propagate to downstream planning and control, increasing the risk of unsafe maneuvers and collisions in real-world driving. A common, straightforward strategy for variable-length prediction is Isolated Training (IT). It trains a separate model for each observation length and evaluates each model on inputs of the same length. 
Although IT yields modest gains",
+    "paper_id": "2603.10597",
+    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
+    "authors": [
+      "Hao Zhou",
+      "Lu Qi",
+      "Jason Li",
+      "Jie Zhang",
+      "Yi Liu",
+      "Xu Yang",
+      "Mingyu Fan",
+      "Fei Luo"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
+    "chunk_index": 2,
+    "total_chunks": 45,
+    "char_count": 870,
+    "word_count": 124,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "daeb8f8c-9449-4643-b54b-b99e06c73005",
+    "text": "on variable-length prediction, as illustrated in Fig. 1c and Fig. 1d, it incurs substantial computational and memory overhead because it requires training and maintaining multiple models across observation lengths. To improve both efficiency and performance, several learning-based methods [19, 25, 31, 37, 44] have been proposed.\nquence, enhancing data efficiency for training PRF.\n• We perform extensive experiments on Argoverse 2 and Argoverse 1, demonstrating that PRF significantly improves variable-length prediction and achieves state-of-the-art results on standard benchmarks.\n2. Related Work\nTrajectory Prediction. In autonomous driving, scene representation is crucial for accurate prediction.\nof these methods is to directly map features from variable-length observations to a canonical representation, typically aligned with either complete observations or a designated",
+    "paper_id": "2603.10597",
+    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
+    "authors": [
+      "Hao Zhou",
+      "Lu Qi",
+      "Jason Li",
+      "Jie Zhang",
+      "Yi Liu",
+      "Xu Yang",
+      "Mingyu Fan",
+      "Fei Luo"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
+    "chunk_index": 3,
+    "total_chunks": 45,
+    "char_count": 879,
+    "word_count": 117,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "443def10-999d-4644-8363-4c85fa823fe3",
+    "text": "target length. This one-shot mapping strategy is relatively effective when observations are close to standard length, but it struggles to learn faithful representations for short trajectories due to pronounced information gaps. In this work, we propose a new method, Progressive Retrospective Framework (PRF), for variable-length trajectory prediction. Instead of directly mapping features from incomplete to complete observations, PRF progressively aligns them via a cascade of retrospective units. This decomposition reduces the learning difficulty, as each unit only needs to bridge a small feature gap over a short tempo-\nTraditional methods [1, 9, 29] rasterize driving scenarios and use CNNs for context extraction. However, CNNs struggle to capture scenario details, motivating vectorized scene representations [12, 35, 50, 52] first introduced by VectorNet [8]. Based on vectorization, attention [18, 23, 24] and graph convolutions [10, 15, 32, 38] have been widely explored to model agent-scene interactions. Conditioned on the scene encoding, numerous methods have been proposed for multimodal trajectory prediction. Early works adopt goal-conditioned strategies [12, 22, 50] or probability distribution heatmaps [9, 10]. 
ral horizon. Recently, with the rise of the Trans",
+    "paper_id": "2603.10597",
+    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
+    "authors": [
+      "Hao Zhou",
+      "Lu Qi",
+      "Jason Li",
+      "Jie Zhang",
+      "Yi Liu",
+      "Xu Yang",
+      "Mingyu Fan",
+      "Fei Luo"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
+    "chunk_index": 4,
+    "total_chunks": 45,
+    "char_count": 1284,
+    "word_count": 183,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "a5ac94e7-ec06-4e11-875c-176a82a357fa",
+    "text": "former [40], Transformer-based models [14, 23, 26, 27, 34], such as QCNet [53] and DeMo [47], have become the dominant paradigm. Moreover, techniques including pre-training [3, 4, 17], post-refinement [5, 51], GPT [30, 33], Diffusion [16], and Mamba [13] have further advanced performance. However, these methods show limitations when\nEach unit consists of a Retrospective Distillation Module (RDM) and a Retrospective Prediction Module (RPM). RDM distills features of an incomplete trajectory to its previous history timesteps, while RPM reconstructs these missing timesteps from the distilled feature. PRF operates between the encoder and decoder, making it plug-and-play with existing approaches.",
+    "paper_id": "2603.10597",
+    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
+    "authors": [
+      "Hao Zhou",
+      "Lu Qi",
+      "Jason Li",
+      "Jie Zhang",
+      "Yi Liu",
+      "Xu Yang",
+      "Mingyu Fan",
+      "Fei Luo"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
+    "chunk_index": 5,
+    "total_chunks": 45,
+    "char_count": 703,
+    "word_count": 102,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "bba27a81-828f-485f-a71f-4b9643c2f0f2",
+    "text": "Fig. 1c and Fig. 
1d show that PRF yields significant improvements across observation lengths using a single model trained once. Since a shared encoder extracts features for variable-length observations, naïve distillation may lead to feature conflicts. Therefore, RDM adopts a residual-based distillation strategy that models features at omitted timesteps as learnable residuals. RPM employs a decoupled query design that integrates anchor-free and anchor-based formulations, enabling coarse-to-fine historical retrospection. It provides implicit supervision for RDM's distillation. Moreover, since each unit targets a specific observation length, incomplete observations can be used to train all units whose target lengths they cover. Accordingly, we propose a Rolling-Start Training Strategy (RSTS) to generate multiple samples from one sequence, improving data efficiency. The main contributions are as follows:\nusing variable-length observations as inputs.\nVariable-Length Trajectory Prediction. Incomplete and variable-length trajectories are common in real-world applications and have attracted increasing attention. DTO [25] distills knowledge from a teacher trained on complete trajectories to a student that predicts from short inputs. MOE [37] introduces a feature extractor for momentary observations and a pre-training scheme that recovers observations and context. BCDiff [19] develops two coupled diffusions to infer historical and future trajectories from limited observations. FLN [44] designs calibration and adaptation modules to learn temporally invariant representations. LaKD [20] proposes length-agnostic knowledge distillation to transfer knowledge across different observation lengths. CLLS [31] employs contrastive learning to extract length-invariant features. Despite notable advances, these methods directly map variable-length observations to a canonical representation. 
This works for near-standard inputs but often fails on short trajectories due to large information gaps. Our PRF progressively aligns them by a cascade of units, thereby reducing learning difficulty.\n• We design a Progressive Retrospective Framework (PRF) for variable-length prediction. PRF progressively aligns features from variable-length observations with those from complete ones via a cascade of retrospective units.\n• We propose a Retrospective Prediction Module (RPM) and a Retrospective Distillation Module (RDM) to form\n3. Method\n3.1.",
+    "paper_id": "2603.10597",
+    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
+    "authors": [
+      "Hao Zhou",
+      "Lu Qi",
+      "Jason Li",
+      "Jie Zhang",
+      "Yi Liu",
+      "Xu Yang",
+      "Mingyu Fan",
+      "Fei Luo"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
+    "chunk_index": 6,
+    "total_chunks": 45,
+    "char_count": 2460,
+    "word_count": 329,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "e9c59229-f156-4dc8-8963-bd6b6907dedf",
+    "text": "RDM distills features, while RPM recovers the omitted history using the distilled features.\n• We introduce a Rolling-Start Training Strategy (RSTS) to generate multiple training samples from a single se-\nIn a driving scenario, the vectorized map is denoted by M ∈ RP×S×Cm, where P, S, and Cm are the number of map polylines, divided segments, and feature channels, 
Figure 2. A cascade of retrospective units progressively distills features of varying-length inputs, aligning them with those from complete ones to improve prediction performance. Each unit includes a Retrospective Distillation Module (RDM) that distills features to longer observations and a Retrospective Prediction Module (RPM) that recovers omitted history from the distilled features. (Pipeline: varying-length inputs, a shared Encoder, a cascade of retrospective units, and a shared trajectory Decoder.)\nthe omitted timesteps in a single step is challenging due to the large information gap between short and standard observations. We therefore propose a Progressive Retrospective Framework (PRF) that progressively maps incomplete trajectories to the standard length, as illustrated in Fig. 2. Given a dataset with standard observations X of length To, PRF contains τ retrospective units, each responsible for retrospecting observations of a specific length to its former ∆T timesteps. For example, unit Φv reconstructs the segment ∆Xv−1 ∈ RNa×∆T×2 between Xv and Xv−1. The units process incomplete observations sequentially, progressively approximating the standard observations.\nThe observed trajectories of agents are represented by X ∈ RN×To×Ca, where N, To, and Ca are the number of agents, observed timesteps, and motion states (e.g., position, heading angle, velocity). The future trajectories of the target agents are represented by Y ∈ RNa×Tf×2, where Na is the number of selected agents and Tf is the prediction horizon. The standard trajectory prediction task is to learn a generative method pθ(Y|X, M) that predicts future trajectories Y based on the observed trajectories X and the vectorized map M. However, existing methods are sensitive to observation-length mismatch, where performance degrades when the observation length is shorter than the length used dur-",
+    "paper_id": "2603.10597",
+    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
+    "authors": [
+      "Hao Zhou",
+      "Lu Qi",
+      "Jason Li",
+      "Jie Zhang",
+      "Yi Liu",
+      "Xu Yang",
+      "Mingyu Fan",
+      "Fei Luo"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
+    "chunk_index": 8,
+    "total_chunks": 45,
+    "char_count": 2772,
+    "word_count": 421,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "acc702c9-d9a9-4b10-b6a4-bb8c523fafd0",
+    "text": "ing training. Our goal is therefore to design a predictor pϕ(Y|Xv, M) that remains effective with incomplete observations Xv and achieves results comparable to those obtained with complete observations. We define Xv ∈ RN×Tv×Ca with observation length:\nT_v = T_o - v\\cdot\\Delta T, (1)\nwhere v ∈ {1, 2, . . . , τ} and τ = To/∆T − 1. Here, ∆T is the temporal interval omitted at each step, τ is the maximum admissible number of omissions, and v indexes the number of omitted intervals. Thus, Xv denotes an incomplete observation in which the first v · ∆T timesteps are omitted.\nSpecifically, an incomplete input Xv is passed through Φv, Φv−1, . . . , and Φ1 to reconstruct the observation ∆Xv−1, ∆Xv−2, . . . , ∆X0 until reaching the standard-length observation X. To make the framework plug-and-play and highly efficient, we employ a shared encoder to extract features from variable-length observations:\n\\mathbf{F}^v = \\operatorname{Encoder}(\\mathbf{X}^v), \\quad v \\in \\{1, \\dots, \\tau\\}, (2)\nThe unit Φv then takes Fv as input. Instead of retrospecting features or trajectories, each unit retrospects both:\n
\tilde{\mathbf{F}}^{v-1}, \Delta\tilde{\mathbf{X}}^{v-1} = \Phi^{v}(\mathbf{F}^{v}), \quad \tilde{\mathbf{X}}^{v-1} = \operatorname{Concat}(\Delta\tilde{\mathbf{X}}^{v-1}, \mathbf{X}^{v}), (3)\nto approximate trajectory Xv−1 and its feature Fv−1.\n3.2. Progressive Retrospective Framework\nFig. 1c and Fig. 1d show that the performance gap narrows as the length of incomplete observations approaches the standard length. This can be attributed to the increased robustness of features extracted from longer observations, which motivates us to retrospect the incomplete observation to the standard length.\nSpecifically, each unit comprises a Retrospective Distillation Module (RDM) and a Retrospective Prediction Module (RPM), where RDM distills features while RPM recovers omitted timesteps using the distilled features:",
+    "paper_id": "2603.10597",
+    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
+    "authors": [
+      "Hao Zhou",
+      "Lu Qi",
+      "Jason Li",
+      "Jie Zhang",
+      "Yi Liu",
+      "Xu Yang",
+      "Mingyu Fan",
+      "Fei Luo"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
+    "chunk_index": 9,
+    "total_chunks": 45,
+    "char_count": 1920,
+    "word_count": 278,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "7e1e2602-93e4-46b4-b37d-a83077a10c78",
+    "text": "However, directly recovering\n\tilde{\mathbf{F}}^{v-1} = \Phi^v_D(\mathbf{F}^v), \quad \Delta\tilde{\mathbf{X}}^{v-1} = \Phi^v_P(\tilde{\mathbf{F}}^{v-1}). 
(4)\n\n[Figure 3: (a) Retrospective Distillation Module, (b) Retrospective Prediction Module. Panel titles: Anchor-Free Mode Queries for Trajectory Proposal Prediction; Anchor-Based State Queries for Trajectory Refinement. Legend: element-wise multiplication, element-wise addition, mode queries, state queries.] Illustration of the (a) RDM and (b) RPM. RDM employs a residual-based distillation strategy, featuring a logit branch that generates a gating vector and a residual branch that learns from the omitted history. RPM employs a decoupled query strategy, utilizing mode queries for multimodal trajectory proposals and state queries for trajectory refinement, with the proposals serving as anchors.\n\nAt inference time, Fv is propagated iteratively through v units to produce a standard-length feature F̃0, which is then passed to a shared decoder to predict the future trajectory Ỹ = Decoder(F̃0). Since a shared encoder extracts teacher and student features in RDM, feature conflict may arise during distillation. We therefore design the RDM with a residual-based distillation strategy, which models the feature of the omitted ∆T time steps as learnable residuals. To further strengthen distillation, we design the RPM to recover the omitted timesteps from retrospected features, providing implicit supervision for RDM and yielding additional performance gains. These two modules enable the retrospective units to realize progressive feature distillation, significantly improving variable-length trajectory prediction.\n\ngated and fused with the learned residual feature through a shortcut connection:\n\n\tilde{\mathbf{F}}^{v-1} = \mathbf{g}^v \odot \mathbf{F}^v + \mathbf{F}^v_r, \quad (7)\n\nwhere ⊙ represents element-wise multiplication. The gated fusion preserves reliable components through the gated shortcut, imputes omissions via the residual, and maintains gradient flow for stable, efficient training.\n\n3.4. Retrospective Prediction Module\n\nFig. 3b presents the RPM. RPM recovers the omitted ∆T timesteps from feature F̃v−1. It adopts a decoupled query strategy to integrate anchor-free and anchor-based schemes, enabling coarse-to-fine trajectory retrospection. First, since retrospection is inherently multimodal, similar to predic-\n\n3.3. ",
    "paper_id": "2603.10597",
    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
    "authors": [
      "Hao Zhou",
      "Lu Qi",
      "Jason Li",
      "Jie Zhang",
      "Yi Liu",
      "Xu Yang",
      "Mingyu Fan",
      "Fei Luo"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
    "chunk_index": 10,
    "total_chunks": 45,
    "char_count": 2553,
    "word_count": 340,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7143aa7c-02ed-4c56-9017-cc03f3d52353",
    "text": "Retrospective Distillation Module\n\ntion, RPM uses mode queries to generate diverse yet coarse multimodal proposals. Second, state queries that learn the temporal dynamics of agents treat these proposals as anchors and further refine them.\n\nFig. 3a illustrates the RDM. RDM models the teacher-student discrepancy induced by the omitted ∆T timesteps as residual, and adopts a residual-based distillation strategy. RDM Φv_D distills the student feature Fv of length Tv to the teacher feature Fv−1 of length Tv−1. Since the HD map is independent of trajectory length, RDM first conditions\n\nAnchor-Free Mode Queries for Multimodal Proposal. RPM employs mode queries to recover multimodal historical trajectories. 
Specifically, mode queries Qm ∈ RNa×K×C, where K denotes the number of motion modes, are first initialized by MLPs to preserve multimodal information. Then, cross-attention is applied to extract scene features from F̃v−1 for Qm. After that, self-attention is applied to Qm to capture interactions among modes:\n\n\begin{aligned} \mathbf{Q}_m &= \operatorname{MLP}([m_1, m_2, \dots, m_M]), \\ \mathbf{Q}_m &= \operatorname{CrossAttn}(Q=\mathbf{Q}_m, K, V=\tilde{\mathbf{F}}^{v-1}), \\ \mathbf{Q}_m &= \operatorname{SelfAttn}(Q, K, V=\mathbf{Q}_m). \end{aligned} \quad (8)\n\nFinally, a predictor composed of MLPs is used to propose multimodal trajectories using mode queries Qm:\n\n\Delta\tilde{\mathbf{X}}^{v-1}_k = \operatorname{Predictor}(\mathbf{Q}_m), \quad (9)\n\nwhere ∆X̃v−1_k ∈ RNa×K×∆T are the retrospected multimodal proposals.\n\nagent features on the scene context via cross-attention:\n\n\mathbf{F}_m^v = \operatorname{CrossAttn}(Q=\mathbf{F}^v, K, V=\mathbf{F}_m), \quad (5)\n\nwhere Fm denotes the encoded feature of map M, thereby extracting environment constraints for distillation. RDM then employs two parallel branches, a logit branch that generates element-wise gates and a residual branch that learns the residual corresponding to the omitted timesteps:\n\n\begin{aligned} \mathbf{H}_g^v &= \operatorname{SelfAttn}(Q, K, V=\mathbf{F}_m^v), \\ \mathbf{g}^v &= \operatorname{Sigmoid}(\operatorname{LN}(\operatorname{MLP}([\mathbf{H}_g^v \Vert \mathbf{F}_m]))), \\ \mathbf{H}_r^v &= \operatorname{SelfAttn}(Q, K, V=\mathbf{F}_m^v), \\ \mathbf{F}^v_r &= \operatorname{ReLU}(\operatorname{LN}(\operatorname{MLP}([\mathbf{H}_r^v \Vert \mathbf{F}_m]))), \end{aligned} \quad (6)\n\nwhere [·∥·] denotes concatenation, gv is the gating vector, and Fv_r is the residual feature. Finally, the student feature is ",
    "paper_id": "2603.10597",
    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
    "authors": [
      "Hao Zhou",
      "Lu Qi",
      "Jason Li",
      "Jie Zhang",
      "Yi Liu",
      "Xu Yang",
      "Mingyu Fan",
      "Fei Luo"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
    "chunk_index": 11,
    "total_chunks": 45,
    "char_count": 2469,
    "word_count": 318,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "af8fdc65-0be6-4b45-91a9-fc6b42b413d4",
    "text": "Anchor-Based State Queries for Motion Refinement. RPM regards proposals ∆X̃v−1_k as anchors and utilizes state queries, which learn evolving motion dynamics, to further refine them. Specifically, state queries Qs ∈ RNa×∆T×C are first initialized by MLPs to preserve motion dynamics. Then, cross-attention is adopted to extract scene features for Qs.\n\nΦ3, Φ2}, with Φ1 distilling the feature of [1, 40] to standard length for decoder training. Similarly, for Tv = 30, a new sample pair is used to train {Φ4, Φ3} and the decoder; for Tv = 20, a new sample pair is used to train Φ4 and the decoder. As described above, a sequence yields 4 samples for decoder training and {4, 3, 2, 1} samples for the retrospective 
units {Φ4, Φ3, Φ2, Φ1}, respectively. After that, Mamba is conducted on Qs to model",
    "paper_id": "2603.10597",
    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
    "authors": [
      "Hao Zhou",
      "Lu Qi",
      "Jason Li",
      "Jie Zhang",
      "Yi Liu",
      "Xu Yang",
      "Mingyu Fan",
      "Fei Luo"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
    "chunk_index": 12,
    "total_chunks": 45,
    "char_count": 781,
    "word_count": 130,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "cdd328d9-343d-4fd6-b46e-0e4c97cb3f7b",
    "text": "agents' temporal dynamics:\n\n\begin{aligned} \mathbf{Q}_s &= \operatorname{MLP}([t_1, t_2, \dots, t_{\Delta T}]), \\ \mathbf{Q}_s &= \operatorname{CrossAttn}(Q=\mathbf{Q}_s, K, V=\tilde{\mathbf{F}}^{v-1}), \\ \mathbf{Q}_s &= \operatorname{Mamba}(U=\mathbf{Q}_s). \end{aligned} \quad (10)\n\nThe number of samples generated for each unit is inversely proportional to the observation length of its input. This aligns with intuition: shorter observation windows are harder to retrospect, and therefore benefit from more training data.\n\n3.6. Loss Functions\n\nWe train the decoder, RPM, and RDM end-to-end.",
    "paper_id": "2603.10597",
    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
    "authors": [
      "Hao Zhou",
      "Lu Qi",
      "Jason Li",
      "Jie Zhang",
      "Yi Liu",
      "Xu Yang",
      "Mingyu Fan",
      "Fei Luo"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
    "chunk_index": 13,
    "total_chunks": 45,
    "char_count": 599,
    "word_count": 81,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ec2c5d97-fee7-4ec4-beb3-30107cf5e3c5",
    "text": "Next, proposals ∆X̃v−1_k are encoded to anchor features Fv−1_k, on which cross-attention is performed to extract multimodal cues for Qs, followed by Mamba to model temporal dependencies, similar to Eq. 10. Finally, state queries that integrate multimodal property and motion dynamics are used to yield refined multimodal predictions, similar to Eq. 9. The final retrospected trajectories ∆X̃v−1 correspond to the highest-probability mode. Given Mamba [54]'s strong sequence modeling capability, we employ it to model state queries over time in place of traditional attention. Since RPM recovers a fixed ∆T timesteps independent of observation lengths, one RPM is shared across all retrospective units. During training, with progressive distillation done upstream, distilled features are batch-processed by RPM to accelerate training. During inference, RPM is disabled. Overall, RPM adds no inference cost while improving training via shared, batched supervision.\n\nAccordingly, the overall objective comprises three components. For the decoder, we adopt the same settings as QCNet [53] and DeMo [47], which use a smooth-L1 loss and a cross-entropy loss to supervise the trajectory regression and probability score classification, respectively. For RPM, we adopt the same losses as the decoder, applying them twice to supervise mode queries and state queries, respectively:\n\n\mathcal{L}_{rpm} = \frac{1}{\tau}\sum\nolimits_{v=1}^{v=\tau}(\mathcal{L}_{mq}^v + \mathcal{L}_{sq}^v), \quad (11)\n\nwhere L^v_{mq} and L^v_{sq} are losses for mode queries and state queries in the v-th RPM, respectively. For RDM, we use a smooth-L1 loss to supervise retrospective distilling:\n\n\begin{aligned} \mathcal{L}_{dist}^v &= \operatorname{SmoothL1}(\tilde{\mathbf{F}}^{v-1}, \mathbf{F}^{v-1}), \\ \mathcal{L}_{rdm} &= \frac{1}{\tau}\sum\nolimits_{v=1}^{v=\tau}\mathcal{L}_{dist}^v, \end{aligned} \quad (12)\n\nwhere L^v_{dist} is the distillation loss for the v-th RDM. The total loss sums the losses for the decoder, RPM, and RDM.\n\nRolling-Start Training Strategy\n\n
Existing methods use fixed To steps to predict Tf steps, so a sequence of length To+Tf yields only one training sample, underutilizing training data. When Tv < To, the pair ([1, Tv], [Tv + 1, Tv + Tf]) forms a distinct training win-\n\n4. Experiments\n\n4.1. ",
    "paper_id": "2603.10597",
    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
    "authors": [
      "Hao Zhou",
      "Lu Qi",
      "Jason Li",
      "Jie Zhang",
      "Yi Liu",
      "Xu Yang",
      "Mingyu Fan",
      "Fei Luo"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
    "chunk_index": 14,
    "total_chunks": 45,
    "char_count": 2269,
    "word_count": 326,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e7511247-8d06-48ff-9d3c-89de7d687a8e",
    "text": "Experimental Settings\n\ndow, yet prior works either cannot accommodate such partial inputs or exhibit degraded performance. In contrast, PRF natively learns from shorter trajectories, enabling effective training on partial histories. We exploit this property with a Rolling-Start Training Strategy (RSTS) to improve data efficiency. Using Argoverse 2 [43] as a concrete example.\n\nDataset. We evaluate PRF on two motion forecasting datasets, Argoverse 2 [43] and Argoverse 1 [2]. Argoverse 2 contains 250,000 driving scenarios collected from six cities. Each scenario is an 11 s sequence sampled at 10 Hz, with the first 5 s as history and the subsequent 6 s forming the prediction horizon. The Argoverse 1 dataset comprises 324,557 ",
    "paper_id": "2603.10597",
    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
    "authors": [
      "Hao Zhou",
      "Lu Qi",
      "Jason Li",
      "Jie Zhang",
      "Yi Liu",
      "Xu Yang",
      "Mingyu Fan",
      "Fei Luo"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
    "chunk_index": 15,
    "total_chunks": 45,
    "char_count": 727,
    "word_count": 112,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d27df6ff-d45d-4a8d-8f08-83367055d6b3",
    "text": "scenarios collected in Miami and Pittsburgh. Each scenario is a 5 s sequence sampled at 10 Hz, with the first 2 s as history and the remaining 3 s as the prediction horizon. In this setup, To = 50, Tf = 60, and ∆T = 10. To train on Argoverse 2, PRF includes four retrospective units Φ4, Φ3, Φ2, Φ1, which distill features of lengths 10, 20, 30, and 40 into features of length 20, 30, 40, and 50, respectively. RSTS begins with a standard sample ([1,50], [51,110]) pair, where observation windows {[41,50], [31,50], [21,50], [11,50], [1,50]} are encoded to train retrospective units {Φ4, Φ3, Φ2, Φ1}, with the feature of [1,50] to train the decoder. The start point is then shifted to Tv = 40, yielding a new sample pair ([1,40],[41,100]), and windows {[31, \n\nEvaluation Metrics. We use minimum Average Displacement Error (mADEK), minimum Final Displacement Error (mFDEK), Brier-minimum Final Displacement Error (b-mFDEK), and Miss Rate (MRK). Here, K denotes the number of motion modes, and following common practice, we report results for K = 1 and K = 6. mADEK calculates the average error between ground-truth and prediction, while mFDEK measures the error at the endpoint. 
b-\n40], [21, 40], [11, 40], [1, 40]} are encoded to train {Φ4, mFDEK extends mFDEK by incorporating the predicted Argoverse 2 (mADE6/mFDE6) Argoverse 1 (mADE6/mFDE6)\nMethod\n10 20 30 40 50 Avg-∆50 5 10 15 20 Avg-∆20 QCNet-Ori 0.900/1.526 0.777/1.338 0.752/1.296 0.725/1.252 0.726/1.253 0.063/0.100 0.807/1.172 0.769/1.139 0.751/1.104 0.709/1.040 0.067/0.098\nQCNet-IT 0.741/1.293 0.734/1.279 0.730/1.276 0.726/1.267 0.726/1.253 0.007/0.034 0.747/1.083 0.721/1.058 0.714/1.043 0.709/1.040 0.018/0.021\nQCNet-DTO 0.768/1.315 0.739/1.270 0.735/1.269 0.732/1.261 0.731/1.258 0.012/0.021 0.764/1.102 0.722/1.057 0.709/1.046 0.702/1.034 0.030/0.034\nQCNet-FLN 0.752/1.274 0.735/1.253 0.731/1.243 0.729/1.231 0.724/1.231 0.013/0.019 0.760/1.088 0.719/1.041 0.710/1.027 0.699/1.017 0.031/0.035\nQCNet-LaKD 0.739/1.259 0.725/1.235 0.725/1.232 0.721/1.227 0.718/1.219 0.010/0.019 0.737/1.057 0.708/1.044 0.699/1.034 0.696/1.027 0.019/0.018\nQCNet-CLLS 0.735/1.247 0.727/1.232 0.725/1.227 0.719/1.222 0.714/1.215 0.013/0.017 0.729/1.041 0.708/1.023 0.697/1.016 0.697/1.012 0.014/0.015 QCNet-PRF 0.727/1.213 0.711/1.181 0.706/1.169 0.702/1.164 0.702/1.166 0.010/0.016 0.699/1.015 0.686/0.997 0.677/0.989 0.675/0.986 0.012/0.014\nDeMo-Ori 0.861/1.533 0.700/1.358 0.671/1.306 0.662/1.288 0.658/1.278 0.066/0.093 0.781/1.267 0.662/1.087 0.624/1.011 0.606/1.003 0.083/0.119\nDeMo-IT 0.675/1.318 0.661/1.296 0.660/1.293 0.659/1.287 0.658/1.278 0.006/0.021 0.669/1.078 0.634/1.031 0.612/0.988 0.606/1.003 0.032/0.029\nDeMo-DTO 0.672/1.307 0.658/1.291 0.650/1.279 0.647/1.268 0.645/1.265 0.012/0.021 0.662/1.064 0.628/1.025 0.605/0.991 0.599/1.010 0.033/0.017\nDeMo-FLN 0.651/1.262 0.644/1.258 0.637/1.254 0.628/1.238 0.621/1.231 0.019/0.022 0.646/1.043 0.607/0.994 0.599/0.974 0.592/0.957 0.025/0.047\nDeMo-LaKD 0.639/1.262 0.627/1.251 0.620/1.243 0.617/1.236 0.617/1.232 0.009/0.016 0.631/1.008 0.593/0.976 0.584/0.933 0.581/0.929 0.022/0.043\nDeMo-CLLS 0.641/1.258 0.630/1.249 0.623/1.234 0.614/1.225 0.615/1.223 
0.012/0.019 0.634/0.998 0.587/0.959 0.580/0.919 0.579/0.922 0.021/0.037 DeMo-PRF 0.617/1.183 0.603/1.155 0.598/1.143 0.599/1.145 0.596/1.142 0.008/0.015 0.602/0.952 0.567/0.901 0.565/0.904 0.568/0.909 0.010/0.010\n\nVariable-length trajectory prediction comparison on Argoverse 2 (left) and Argoverse 1 (right) validation sets. For Argoverse 2, AVG–∆50 is the average difference between {10, 20, 30, 40} and 50. For Argoverse 1, AVG–∆20 is the average difference between {5, 10, 15} and 20.",
    "paper_id": "2603.10597",
    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
    "authors": [
      "Hao Zhou",
      "Lu Qi",
      "Jason Li",
      "Jie Zhang",
      "Yi Liu",
      "Xu Yang",
      "Mingyu Fan",
      "Fei Luo"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
    "chunk_index": 16,
    "total_chunks": 45,
    "char_count": 3647,
    "word_count": 448,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ca68056d-c139-4c21-895a-95c8e7d3f402",
    "text": "Best results are in bold and second best are underlined.\n\n[Figure 4: (a) DeMo-IT, (b) DeMo-CLLS, (c) DeMo-PRF (Our), (d) GT.] Qualitative results on the Argoverse 2 validation set. Incomplete observations, predicted trajectories, and ground truth trajectories are shown in yellow, green, and pink, respectively. Our predictions align more closely with the ground truth than other methods.\n\nmode probabilities, penalizing endpoint errors more heavily when the assigned probability is low. MRK computes the proportion of minFDEK that exceeds 2 meters.\n\ning {5, 10, 15, 20}. In practice, if an observation length falls outside these sets, we truncate the observed trajectory to the nearest shorter admissible length (e.g., 32→30), retaining the most recent timesteps. Models are trained end-to-end for 60 epochs with a batch size of 16, using the Adam optimizer with an initial learning rate of 0.003 and a weight decay of 0.01. All experiments are implemented in PyTorch and run on 8 Nvidia RTX 4090 GPUs.\n\nBackbone & Baselines. PRF is plug-and-play with existing prediction models. To demonstrate its compatibility, we integrate PRF with two state-of-the-art backbones, QCNet [53] and DeMo [47]. To verify its effectiveness, we compare it with four closely related works, DTO [25], FLN [44], LaKD [20], and CLLS [31]. We also include two base-\n\n4.2. ",
    "paper_id": "2603.10597",
    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
    "authors": [
      "Hao Zhou",
      "Lu Qi",
      "Jason Li",
      "Jie Zhang",
      "Yi Liu",
      "Xu Yang",
      "Mingyu Fan",
      "Fei Luo"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
    "chunk_index": 17,
    "total_chunks": 45,
    "char_count": 1316,
    "word_count": 204,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a0cf5a52-79bd-499b-8bc9-d355c29bc3b2",
    "text": "Comparison with State-of-the-art\n\nlines, Ori trained on standard observations and evaluated on variable-length inputs, and IT (Isolated Training), which trains a separate model for each observation length and evaluates on the matching length.\n\nImplementation Details. We define different observation lengths (timesteps) for each dataset. For Argoverse 2, with an observation horizon To = 50 and a prediction horizon Tf = 60, we set the omission interval to ∆T = 10, yielding variable observation lengths {10, 20, 30, 40, 50}.\n\nVariable-Length Trajectory Prediction. The results of variable-length prediction on Argoverse 2 and Argoverse 1 validation sets are reported in Tab. 1. Results show that PRF significantly outperforms Ori across all observation lengths, indicating the necessity of designing a framework for variable-length prediction. Secondly, IT shows modest improvement over Ori across variable-length observations, which verifies that length-specific training is expensive and 
For Argo- brings only marginal gains.", + "paper_id": "2603.10597", + "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction", + "authors": [ + "Hao Zhou", + "Lu Qi", + "Jason Li", + "Jie Zhang", + "Yi Liu", + "Xu Yang", + "Mingyu Fan", + "Fei Luo" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10597v1", + "chunk_index": 18, + "total_chunks": 45, + "char_count": 1027, + "word_count": 146, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e6014ae2-79b0-4058-9100-1b3f6d31dfee", + "text": "Moreover, PRF outperforms\nverse 1, with To = 20 and Tf = 30, we set ∆T = 5, yield- baselines DTO, FLN, LaKD, and CLLS across observation Method b-mFDE6 mADE6 mFDE6 MR6 mADE1 mFDE1 Method b-mFDE6 mADE6 mFDE6 MR6\nFRM [28] 2.47 0.89 1.81 0.29 2.37 5.93 LaneGCN [21] 2.06 0.87 1.36 0.16\nHDGT [15] 2.24 0.84 1.60 0.21 2.08 5.37 mmTrans. [23] 2.03 0.84 1.34 0.15\nTHOMAS [11] 2.16 0.88 1.51 0.20 1.95 4.71 DenseTNT [12] 1.98 0.88 1.28 0.13\nSIMPL [48] 2.05 0.72 1.43 0.19 2.03 5.50 TPCN [45] 1.93 0.82 1.24 0.13\nHPTR [49] 2.03 0.73 1.43 0.19 1.84 4.61 SceneTrans. 
[27] 1.89 0.80 1.23 0.13\nGoRela [6] 2.01 0.76 1.48 0.22 1.82 4.62 HOME+GOME [9, 10] 1.86 0.89 1.29 0.08\nMTR [34] 1.98 0.73 1.44 0.15 1.74 4.39 HiVT [52] 1.84 0.77 1.17 0.13\nGANet [41] 1.96 0.72 1.34 0.17 1.77 4.48 MultiPath++ [39] 1.79 0.79 1.21 0.13\nDeMo [47] 1.92 0.65 1.25 0.15 1.58 3.96 GANet [41] 1.79 0.81 1.16 0.12\nQCNet [53] 1.91 0.65 1.29 0.16 1.69 4.30 PAGA [7] 1.76 0.80 1.21 0.11\nReMo [36] 1.89 0.66 1.24 0.15 1.59 3.93 MISC [46] 1.76 0.77 1.14 0.11\nTamba [13] 1.89 0.64 1.24 0.17 1.66 4.24 Wayformer [26] 1.74 0.77 1.16 0.12\nProphNet [42] 1.88 0.68 1.33 0.18 1.80 4.74 HPNet [38] 1.74 0.76 1.10 0.11\nSmartRefine [51] 1.86 0.63 1.23 0.15 1.65 4.17 QCNet [53] 1.69 0.73 1.07 0.11\nDeMo+ReMo [36, 47] 1.84 0.61 1.17 0.13 1.49 3.74 Tamba [13] 1.67 0.72 1.07 0.09 DeMo-PRF (Our) 1.81 0.60 1.14 0.13 1.49 3.72 DeMo-PRF (Our) 1.73 0.70 1.03 0.11 Comparison with state-of-the-arts on the Argoverse 2 Single Table 3. Comparison with state-of-the-arts on the ArAgent Motion Forecasting Leaderboard ranked by b-mFDE6. All results goverse 1 Motion Forecasting Leaderboard. All results\nare from a single model, without model ensembling. are from a single model, without model ensembling. mADE6/mFDE6 backbone. The second row adds RDM to distill features. RDM RPM RSTS\n10 20 30 40 50\nThis yields substantial gains across all observation lengths,\n0.876/1.455 0.769/1.337 0.756/1.286 0.726/1.252 0.725/1.256\n✓ 0.655/1.257 0.640/1.237 0.636/1.231 0.636/1.227 0.639/1.231 demonstrating its effectiveness. The third row further in-\n✓ ✓ 0.652/1.241 0.637/1.214 0.634/1.207 0.631/1.204 0.635/1.208 corporates RPM to recover omitted historical trajectories. ✓ ✓ ✓ 0.617/1.183 0.603/1.155 0.598/1.143 0.599/1.145 0.596/1.142\nThis provides implicit supervision for distillation and deTable 4. Ablation study of the core modules of our model on the livers additional improvements at all observation lengths. 
Argoverse 2 validation set.",
    "paper_id": "2603.10597",
    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
    "authors": [
      "Hao Zhou",
      "Lu Qi",
      "Jason Li",
      "Jie Zhang",
      "Yi Liu",
      "Xu Yang",
      "Mingyu Fan",
      "Fei Luo"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
    "chunk_index": 19,
    "total_chunks": 45,
    "char_count": 2478,
    "word_count": 410,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "eb120c06-d40e-4fad-b877-f615b8453431",
    "text": "The last row applies the RSTS. The results show consistent gains across various observation lengths, indicating that the proposed training regime enhances data utilization.\n\nlengths and achieves a small performance gap between incomplete and standard observations, demonstrating state-of-the-art performance. Finally, PRF achieves the best results with both QCNet and DeMo backbones, validating its compatibility.\n\nEffects of attention layers in RDM. Self- and cross-attention are key to retrospective distillation in RDM. We ablate the number of self- and cross-attention layers in RDM, as shown in Tab. 5.",
    "paper_id": "2603.10597",
    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
    "authors": [
      "Hao Zhou",
      "Lu Qi",
      "Jason Li",
      "Jie Zhang",
      "Yi Liu",
      "Xu Yang",
      "Mingyu Fan",
      "Fei Luo"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
    "chunk_index": 20,
    "total_chunks": 45,
    "char_count": 602,
    "word_count": 84,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "67eeea33-5dfb-4253-a7e5-aba5f86ae4f6",
    "text": "The results show that increasing the number of layers steadily improves performance across different observation lengths. We therefore set three layers of self- and cross-attention as the default in RDM.\n\nEffects of attention and Mamba layers in RPM. Self-, cross-attention, and Mamba are used in RPM to extract embeddings for retrospective prediction. We ablate the number of layers for these components in RPM, as shown in Tab. 6. The results show that a single layer yields the best mADE6, while three layers yield the best mFDE6. Since the gains in mFDE6 are larger than those in mADE6, we set three layers of self-, cross-attention, and Mamba as the default in RPM.\n\nTo qualitatively assess the superiority of PRF, we visualize results for IT, the second-best CLLS, and PRF at the shortest observation length of 10 on the Argoverse 2 validation set, as shown in Fig. 4. The two samples respectively present complex intersection and T-junction scenarios where the agent is about to turn. The visualization shows that PRF is accurate and closer to the ground truth compared to other methods on backbone DeMo.\n\nStandard Trajectory Prediction. PRF can also be extended to the standard trajectory prediction with a complete observation setting. As shown in Tab. 1, we further compare PRF that uses DeMo as backbone with state-of-the-art methods on the Argoverse 2 and Argoverse 1 motion forecasting benchmarks in the single-agent setting with standard-length inputs, as presented in Tab. 2 and Tab. 3. Tab. 2 shows that PRF achieves the best performance across all metrics on the Argoverse 2 test set, while Tab. 3 shows that PRF achieves the best performance among metrics mADE6 and mFDE6 on the Argoverse 1 test set. These results validate the generalization of PRF.\n\nEffect of sequence modeling in RPM. Mamba is employed to model state queries over time in RPM. To assess its effectiveness, we compare it with other modules, including GRU and Attention, as shown in Tab. 7. Mamba achieves the best results among these variants, confirming its superior ability to capture temporal dependencies in state queries.\n\nEffects of data utilization in RSTS. RSTS improves data utilization.",
    "paper_id": "2603.10597",
    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
    "authors": [
      "Hao Zhou",
      "Lu Qi",
      "Jason Li",
      "Jie Zhang",
      "Yi Liu",
      "Xu Yang",
      "Mingyu Fan",
      "Fei Luo"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
    "chunk_index": 21,
    "total_chunks": 45,
    "char_count": 2163,
    "word_count": 340,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "4dd41071-4aab-4a09-b20a-913ac4a04967",
    "text": "For example, a standard Argoverse 2 sequence with an observation length of 50 can generate additional samples with observation windows {[0, 40], [0, 30], [0, 20]}.\n\n4.3. Ablation Studies\n\nEffects of modules. Tab. 4 reports ablations of the core",
    "paper_id": "2603.10597",
    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
    "authors": [
      "Hao Zhou",
      "Lu Qi",
      "Jason Li",
      "Jie Zhang",
      "Yi Liu",
      "Xu Yang",
      "Mingyu Fan",
      "Fei Luo"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
    "chunk_index": 22,
    "total_chunks": 45,
    "char_count": 242,
    "word_count": 39,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a925c1f2-8231-4ac4-b531-3b2703980ec7",
    "text": "modules of PRF. The first shows the results of the DeMo\n\nWe ablate these extra training samples, as shown in Tab. 8. The first row reports training with only standard ob-\n\nmADE6/mFDE6\nNum 10 20 30 40 50\n1 0.661/1.280 0.646/1.256 0.642/1.249 0.641/1.248 0.642/1.243\n2 0.660/1.275 0.645/1.249 0.641/1.244 0.640/1.239 0.645/1.244\n3 0.655/1.257 0.640/1.237 0.636/1.231 0.636/1.227 0.639/1.231\n\nmADE6/mFDE6\nNum 10 20 30 40 50\n1 0.647/1.252 0.632/1.227 0.628/1.214 0.627/1.211 0.631/1.226\n2 0.652/1.257 0.637/1.230 0.633/1.220 0.632/1.223 0.634/1.224\n3 0.652/1.241 0.637/1.214 0.634/1.207 0.631/1.204 0.635/1.208",
    "paper_id": "2603.10597",
    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
    "authors": [
      "Hao Zhou",
      "Lu Qi",
      "Jason Li",
      "Jie Zhang",
      "Yi Liu",
      "Xu Yang",
      "Mingyu Fan",
      "Fei Luo"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
    "chunk_index": 23,
    "total_chunks": 45,
    "char_count": 603,
    "word_count": 81,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "001164ed-9983-4993-86a6-17779a061526",
    "text": "Ablation study of the number of self- and cross-attention layers in RDM on the Argoverse 2 validation set. Table 6. Ablation of the number of self-, cross-attention, and 
Mamba layers in RPM on the Argoverse 2 validation set.\n\nmADE6/mFDE6\nNum 10 20 30 40 50\nGRU 0.662/1.286 0.644/1.256 0.640/1.245 0.639/1.242 0.640/1.245\nAttn 0.653/1.261 0.639/1.235 0.635/1.229 0.634/1.225 0.634/1.224\nMamba 0.652/1.241 0.637/1.214 0.634/1.207 0.631/1.204 0.635/1.208\nAblation of the sequence modeling choices in RPM on the Argoverse 2 validation set.\n\nFigure 5. t-SNE visualization of (a) direct and (b) progressive distillation strategies. Features distilled from 10 to 50 are shown in yellow, while features of the standard length 50 are shown in blue.\n\nmADE6/mFDE6\n[0,40] [0,30] [0,20] | 10 20 30 40 50\n0.652/1.241 0.637/1.214 0.634/1.207 0.631/1.204 0.635/1.208\n✓ 0.631/1.211 0.618/1.189 0.613/1.178 0.615/1.185 0.613/1.182\n✓ ✓ 0.624/1.201 0.608/1.174 0.606/1.167 0.606/1.167 0.606/1.165\n✓ ✓ ✓ 0.617/1.183 0.603/1.155 0.598/1.143 0.599/1.145 0.596/1.142\nAblation study of data utilization in the RSTS on the Argoverse 2 validation set.\n\nmADE6/mFDE6\nStrategy 10 20 30 40 50\nDirect 0.663/1.275 0.644/1.240 0.639/1.228 0.635/1.220 0.635/1.222\nPRF (Our) 0.652/1.241 0.637/1.214 0.634/1.207 0.631/1.204 0.635/1.208\nAnalysis of progressive distillation vs. direct distillation on the Argoverse 2 validation set. RSTS is not used in training.\n\nLength 10 20 30 40 50\nInference time (s) 0.268 0.236 0.203 0.172 0.140\nFLOPs (G) 1.651 1.581 1.513 1.443 1.375\nTable 10. Analysis of inference efficiency on the Argoverse 2 validation set. Results are measured with one multi-agent scenario per forward pass, using an NVIDIA GeForce RTX 4090 GPU.\n\nshown in Tab. 
10.", + "paper_id": "2603.10597", + "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction", + "authors": [ + "Hao Zhou", + "Lu Qi", + "Jason Li", + "Jie Zhang", + "Yi Liu", + "Xu Yang", + "Mingyu Fan", + "Fei Luo" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10597v1", + "chunk_index": 24, + "total_chunks": 45, + "char_count": 1787, + "word_count": 259, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2e0f4326-89e8-49b4-8905-b9b44535c01c", + "text": "The cost increases almost linearly with\nthe number of retrospective stages as the observation length\nservations. The second through fourth rows gradually add decreases. Relative to the standard length of 50, each adobservation windows [0, 40], [0, 30], and [0, 20] for train- ditional retrospective stage adds about 0.07 G FLOPs and\ning. The results show that incorporating incomplete obser- 0.03 s of latency. This indicates that PRF improves predicvations yields steady gains in trajectory prediction, indicat- tion for incomplete lengths while incurring only a modest\ning that the RSTS improves data utilization. inference cost. PRF remains efficient because RDM and\n4.4. Analysis of Distillation Strategy RST are used only during training to provide extra supervision and data, thereby incurring no test-time computation.PRF adopts a progressive strategy to distill features from incomplete to complete observations. Existing methods typically use a direct one-shot distillation strategy. Conclusion\nour progressive distillation with this strategy by modifying\nThis paper presents a Progressive Retrospective Frame-PRF to directly distill features from lengths {10, 20, 30, 40}\nwork (PRF) for variable-length trajectory prediction. As shown in Tab. 
9, our strategy outperforms direct distillation across all observation lengths, with larger gains for shorter observations. These findings indicate that progressive distillation reduces task difficulty and improves variable-length trajectory prediction.\nIt consists of a cascade of retrospective units that progressively map incomplete-length observations to a standard length.",
PRF also achieves leading results for standard trajectory prediction on the Argoverse 2 and Argoverse 1 motion forecasting leaderboards.\nPRF introduces extra inference overhead by iteratively retrospecting features. We evaluate its inference cost, as shown in Tab. 10.\nAcknowledgement\n[10] Thomas Gilles, Stefano Sabatini, Dzmitry Tsishkou, Bogdan Stanciulescu, and Fabien Moutarde.",
7\n[12] Junru Gu, Chen Sun, and Hang Zhao.", + "paper_id": "2603.10597", + "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction", + "authors": [ + "Hao Zhou", + "Lu Qi", + "Jason Li", + "Jie Zhang", + "Yi Liu", + "Xu Yang", + "Mingyu Fan", + "Fei Luo" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10597v1", + "chunk_index": 27, + "total_chunks": 45, + "char_count": 770, + "word_count": 110, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "673539ac-1306-4252-a774-d37696d2e86b", + "text": "Densetnt: End-to-end\ntrajectory prediction from dense goal sets. In Proceedings of\nReferences the IEEE/CVF international conference on computer vision,\n[1] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir pages 15303–15312, 2021. 2, 7\nAnguelov. Multipath: Multiple probabilistic anchor trajec- [13] Yizhou Huang, Yihua Cheng, and Kezhi Wang. Trajectory\ntory hypotheses for behavior prediction. In Conference on mamba: Efficient attention-mamba forecasting model based\nRobot Learning, pages 86–99. PMLR, 2020. 2 on selective ssm. In Proceedings of the Computer Vision and\n[2] Ming-Fang Chang, John W Lambert, Patsorn Sangkloy, Jag- Pattern Recognition Conference, pages 12058–12067, 2025.\njeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter 1, 2, 7\nCarr, Simon Lucey, Deva Ramanan, and James Hays. 
Argov- [14] Zhiyu Huang, Haochen Liu, and Chen Lv.", + "paper_id": "2603.10597", + "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction", + "authors": [ + "Hao Zhou", + "Lu Qi", + "Jason Li", + "Jie Zhang", + "Yi Liu", + "Xu Yang", + "Mingyu Fan", + "Fei Luo" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10597v1", + "chunk_index": 28, + "total_chunks": 45, + "char_count": 860, + "word_count": 123, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "32673a0b-655f-4c5e-bef6-3e704fa7fa87", + "text": "Gameformer:\nerse: 3d tracking and forecasting with rich maps. In Confer- Game-theoretic modeling and learning of transformer-based\nence on Computer Vision and Pattern Recognition (CVPR), interactive prediction and planning for autonomous driving.\n2019. 5 In Proceedings of the IEEE/CVF International Conference\n[3] Hao Chen, Jiaze Wang, Kun Shao, Furui Liu, Jianye Hao, on Computer Vision, pages 3903–3913, 2023. 2\nChenyong Guan, Guangyong Chen, and Pheng-Ann Heng. [15] Xiaosong Jia, Penghao Wu, Li Chen, Yu Liu, Hongyang Li,\nTraj-mae: Masked autoencoders for trajectory prediction. Hdgt: Heterogeneous driving graph transProceedings of the IEEE/CVF International Conference on former for multi-agent trajectory prediction via scene encodComputer Vision, pages 8351–8362, 2023. 2 ing. IEEE transactions on pattern analysis and machine in-\n[4] Jie Cheng, Xiaodong Mei, and Ming Liu. Forecast-mae: telligence, 45(11):13860–13875, 2023. 2, 7\nSelf-supervised pre-training for motion forecasting with [16] Chiyu Jiang, Andre Cornman, Cheolho Park, Benjamin\nmasked autoencoders. In Proceedings of the IEEE/CVF In- Sapp, Yin Zhou, Dragomir Anguelov, et al. Motiondiffuser:\nternational Conference on Computer Vision, pages 8679– Controllable multi-agent motion prediction using diffusion.\n8689, 2023. 
2\nIn Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9644–9653, 2023. 2\n[5] Sehwan Choi, Jungho Kim, Junyong Yun, and Jun Won Choi. R-pred: Two-stage motion prediction via tube-query attention-based trajectory refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8525–8535, 2023. 2\n[17] Zhiqian Lan, Yuxuan Jiang, Yao Mu, Chen Chen, and Shengbo Eben Li.",
In 2022 International conference on robotics and automation (ICRA), pages 6430–6436.\n[19] Rongqing Li, Changsheng Li, Dongchun Ren, Guangyi",
Advances in Neural Information Processing Systems, 37:28720–28744, 2024. 2, 6\n[9] Thomas Gilles, Stefano Sabatini, Dzmitry Tsishkou, Bogdan Stanciulescu, and Fabien Moutarde. Home: Heatmap output for future motion estimation. In 2021 IEEE international intelligent transportation systems conference (ITSC), pages 500–507. IEEE, 2021. 2, 7\n[21] Ming Liang, Bin Yang, Rui Hu, Yun Chen, Renjie Liao, Song Feng, and Raquel Urtasun. Learning lane graph representations for motion forecasting.",
Motionlm: Multi-agent motion forecasting as language modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8579–8590, 2023. 2\nProcessing Systems, 37:92605–92637, 2024. 2\n[23] Yicheng Liu, Jinghuai Zhang, Liangji Fang, Qinhong Jiang, and Bolei Zhou. Multimodal motion prediction with stacked transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7577–7586, 2021. 2, 7\n[34] Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Motion transformer with global intention localization and local movement refinement. Advances in Neural Information Processing Systems, 35:6531–6543, 2022. 1, 2, 7\n[24] Jean Mercat, Thomas Gilles, Nicole El Zoghby, Guillaume Sandou, Dominique Beauvois, and Guillermo Pita. Multi-head attention for multi-modal joint vehicle motion forecasting. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9638–9644. IEEE, 2020. 2\n[35] Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Mtr++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):3955–3971, 2024. 2\n[25] Alessio Monti, Angelo Porrello, Simone Calderara, Pasquale\n[36] Nan Song, Bozhou Zhang, Xiatian Zhu, and Li Zhang. 
Mo\nCoscia, Lamberto Ballan, and Rita Cucchiara.",
Scene transformer: A unified architecture for predicting future trajectories of multiple agents. In International Conference on Learning Representations, 2022. 2, 7\n[39] Balakrishnan Varadarajan, Ahmed Hefny, Avikalp Srivastava, Khaled S Refaat, Nigamaa Nayakanti, Andre Cornman, Kan Chen, Bertrand Douillard, Chi Pang Lam, Dragomir Anguelov, et al. Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction. In 2022 International Conference on Robotics and Automation (ICRA), pages 7814–7821. IEEE, 2022. 1, 7\n[28] Daehee Park, Hobin Ryu, Yunseo Yang, Jegyeong Cho, Jiwon Kim, and Kuk-Jin Yoon. Leveraging future relationship reasoning for vehicle trajectory prediction.",
Attention is all you need.", + "paper_id": "2603.10597", + "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction", + "authors": [ + "Hao Zhou", + "Lu Qi", + "Jason Li", + "Jie Zhang", + "Yi Liu", + "Xu Yang", + "Mingyu Fan", + "Fei Luo" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10597v1", + "chunk_index": 36, + "total_chunks": 45, + "char_count": 344, + "word_count": 50, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3e93dba5-e6b5-453c-89cb-7153aa8577b3", + "text": "Advances in neural\n[29] Tung Phan-Minh, Elena Corina Grigore, Freddy A Boulton, information processing systems, 30, 2017. 2\nOscar Beijbom, and Eric M Wolff. Covernet: Multimodal [41] Mingkun Wang, Xinge Zhu, Changqian Yu, Wei Li, Yuexin\nbehavior prediction using trajectory sets. In Proceedings of Ma, Ruochun Jin, Xiaoguang Ren, Dongchun Ren, Mingxu\nthe IEEE/CVF conference on computer vision and pattern Wang, and Wenjing Yang. Ganet: Goal area network for morecognition, pages 14074–14083, 2020. 2 tion forecasting. In 2023 IEEE International Conference on\n[30] Jonah Philion, Xue Bin Peng, and Sanja Fidler. Trajeglish: Robotics and Automation (ICRA), pages 1609–1615. IEEE,\nTraffic modeling as next-token prediction. In The Twelfth In- 2023. 7\nternational Conference on Learning Representations, 2024. [42] Xishun Wang, Tong Su, Fang Da, and Xiaodong Yang.\n2 Prophnet: Efficient agent-centric motion forecasting with\n[31] Ruiqi Qiu, Jun Gong, Xinyu Zhang, Siqi Luo, Bowen Zhang, anchor-informed proposals. In Proceedings of the IEEE/CVF\nand Yi Cen. Adapting to observation length of trajectory conference on computer vision and pattern recognition,\nprediction via contrastive learning. In Proceedings of the pages 21995–22003, 2023. 7\nComputer Vision and Pattern Recognition Conference, pages [43] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lam-\n1645–1654, 2025. 
2, 6\nbert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Benchmarks 2021), 2021. 5\n[32] Luke Rowe, Martin Ethier, Eli-Henry Dykhne, and Krzysztof Czarnecki. Fjmp: Factorized joint multi-agent motion prediction over learned directed acyclic interaction graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13745–13755, 2023. 2",
Decoupling motion\nforecasting into directional intentions and dynamic states. Advances in Neural Information Processing Systems, 37:\n106582–106606, 2024. 1, 2, 5, 6, 7\n[48] Lu Zhang, Peiliang Li, Sikang Liu, and Shaojie Shen. Simpl:\nA simple and efficient multi-agent motion prediction baseline for autonomous driving. IEEE Robotics and Automation\nLetters, 9(4):3767–3774, 2024. 7\n[49] Zhejun Zhang, Alexander Liniger, Christos Sakaridis, Fisher\nYu, and Luc V Gool. Real-time motion prediction via heterogeneous polyline transformer with relative pose encoding. Advances in Neural Information Processing Systems,\n36:57481–57499, 2023. 7\n[50] Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Ben Sapp,\nBalakrishnan Varadarajan, Yue Shen, Yi Shen, Yuning Chai,\nCordelia Schmid, et al. Tnt: Target-driven trajectory prediction. In Conference on robot learning, pages 895–904. PMLR, 2021. 2\n[51] Yang Zhou, Hao Shao, Letian Wang, Steven L Waslander,\nHongsheng Li, and Yu Liu.", + "paper_id": "2603.10597", + "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction", + "authors": [ + "Hao Zhou", + "Lu Qi", + "Jason Li", + "Jie Zhang", + "Yi Liu", + "Xu Yang", + "Mingyu Fan", + "Fei Luo" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10597v1", + "chunk_index": 39, + "total_chunks": 45, + "char_count": 1666, + "word_count": 238, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a8d994f4-146c-42e5-9984-b4d991b982f7", + "text": "Smartrefine: A scenario-adaptive\nrefinement framework for efficient motion prediction. In\nProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15281–15290, 2024. 1,\n2, 7\n[52] Zikang Zhou, Luyao Ye, Jianping Wang, Kui Wu, and Kejie Lu. Hivt: Hierarchical vector transformer for multi-agent\nmotion prediction. 
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages\n8823–8833, 2022. 2, 7\n[53] Zikang Zhou, Jianping Wang, Yung-Hui Li, and Yu-Kai\nHuang. Query-centric trajectory prediction. In Proceedings\nof the IEEE/CVF conference on computer vision and pattern\nrecognition, pages 17863–17873, 2023. 1, 2, 5, 6, 7\n[54] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang,\nWenyu Liu, and Xinggang Wang. Vision mamba: efficient\nvisual representation learning with bidirectional state space\nmodel. In Proceedings of the 41st International Conference\non Machine Learning, pages 62429–62442, 2024. 5 Recover to Predict: Progressive Retrospective Learning for Variable-Length\nTrajectory Prediction", + "paper_id": "2603.10597", + "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction", + "authors": [ + "Hao Zhou", + "Lu Qi", + "Jason Li", + "Jie Zhang", + "Yi Liu", + "Xu Yang", + "Mingyu Fan", + "Fei Luo" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10597v1", + "chunk_index": 40, + "total_chunks": 45, + "char_count": 1063, + "word_count": 146, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "56adc14d-1c16-4c14-a88f-4832920107c6", + "text": "Supplementary Material In this supplementary file, we provide additional de- &' = 50 Trains decoder !\n[1, 50] Encoder\ntails and results to demonstrate the benefits of the proposed Trains ∅% ! [11, 50] Encoder\nframework further. The contents include the following ap- [21, 50] Encoder Trains ∅$ !\nTrains ∅# !pendices: [31, 50] Encoder\n[41, 50] Encoder Trains ∅\" !\n• Appendix for RSTS (Section 7)\n• Appendix for Loss Functions (Section 8) Observation Future .\n• Appendix for Evaluation Metrics (Section 9) 0 10 20 30 40 50 60 70 80 90 100 110\n• Appendix for Qualitative Evaluations. (Section 10) &' = 40 Trains decoder !\n• Appendix for Interpretability Analysis. 
(Section 11)\n7. Appendix for RSTS\nThe proposed Rolling-Start Training Strategy (RSTS), described in Section 3.5, improves data efficiency by incorporating incomplete observations during training. Fig. 6 illustrates the application of RSTS on the Argoverse 2 dataset, with a standard observation horizon of To = 50 and a prediction horizon of Tf = 60.\nWhen Tv = 50, which corresponds to the standard observation horizon, a standard sample pair ([1,50], [51,110]) can be segmented into observation windows {[41,50], [31,50], [21,50], [11,50], [1,50]}. These observation windows are then encoded to train the retrospective units {Φ4, Φ3, Φ2, Φ1}, with the encoded feature of the standard-length observation window [1,50] being used to train the decoder.\nThen, the start point is shifted to Tv = 40, generating a sample pair ([1,40], [41,100]). This sample pair is segmented into observation windows {[31,40], [21,40], [11,40], [1,40]}. These observation windows are encoded to train the retrospective units {Φ4, Φ3, Φ2}. The encoded feature of the incomplete observation window [1,40] is distilled by unit Φ1 to match the standard observation length, which is\nFigure 6. Illustration of the RSTS on the Argoverse 2 dataset with a standard observation horizon of To = 50 and a prediction horizon of Tf = 60. 
As the prediction start point shifts from 50 to 40, 30, and 20, additional training samples are generated to train the retrospective units and the decoder.\nthen used to train the decoder.\nSubsequently, the start point is shifted to Tv = 30, producing a sample pair ([1,30], [31,90]). This sample pair is segmented into observation windows {[21,30], [11,30], [1,30]}. These observation windows are encoded to train the retrospective units {Φ4, Φ3}. The encoded feature of the incomplete observation window [1,30] is sequentially distilled by units Φ2 and Φ1 to match the standard observation length, which is used to train the decoder.\ntrain the decoder.\nIn summary, RSTS generates {4,3,2,1} samples to train the retrospective units {Φ4, Φ3, Φ2, Φ1}, respectively, and 4 samples to train the decoder, using a standard training sequence.\n8.",
train the decoder and RPM, as introduced in Section 3.6.\nThe two observation windows are encoded to train unit Φ4, with the encoded feature of the incomplete observation window [1,20] being sequentially distilled by units Φ3, Φ2, and Φ1 to match the standard observation length, which is used to train the decoder.\nThe ground-truth future trajectories, predicted future trajectories, and their probabilities are represented by Y ∈ R^{Na×Tf×2}, ˜Y ∈ R^{Na×K×Tf×2}, and P ∈ R^{Na×K}, where Na, K, Tf, and 2 represent the number of predicted agents,\n(a) DeMo-IT (b) DeMo-CLLS (c) DeMo-PRF (Ours) (d) GT. More qualitative results on the Argoverse 2 validation set. Incomplete observations, predicted trajectories, and ground truth trajectories are shown in yellow, green, and pink, respectively. The absence of an observation trajectory indicates that the vehicle is stationary. Our predictions align more closely with the ground truth compared to other methods.\n(a) Scenario 1 (b) Scenario 2 (c) Scenario 3 (d) Scenario 4. Failure cases of DeMo-PRF on the Argoverse 2 validation set.",
The absence of an observation trajectory indicates that the vehicle is stationary.", + "paper_id": "2603.10597", + "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction", + "authors": [ + "Hao Zhou", + "Lu Qi", + "Jason Li", + "Jie Zhang", + "Yi Liu", + "Xu Yang", + "Mingyu Fan", + "Fei Luo" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10597v1", + "chunk_index": 43, + "total_chunks": 45, + "char_count": 321, + "word_count": 41, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a87cc9ac-f517-4202-9c42-6db667ac705b", + "text": "the number of predicted modes, the prediction horizon, and Then, mADEK and mFDEK are calculated as the average\nthe coordinate dimensions, respectively. These variables are minimum ADE and FDE over K modes, respectively:\nused to compute the smooth-L1 loss and cross-entropy loss. Smooth-L1 regression loss. The smooth-L1 regression i\n\\be g l ed} hrm {mA\nloss is computed using the ground-truth future trajectories n \\mat\nignY and predicted future trajectories ˜Y as follows: (15)\n{ \\t DE}_K r {1} \\sum _{i=1}^{N_a}\\minkK}\\mathrm{ADE}_{i,k},\\\\\\mathrm{mFDE}_K{1}{N_a}\\sum_{i=1}^{N_a}\\minkK}\\mathrm{FDE}_{i,k}.{aligned} t\n\\m a L e {reg}} = \\frac { 1} { N_a T_f}_{i=1}^{N_a}{SmoothL1}\\bigl(\\tilde{\\mathbf{Y}}_{i,k_i^\\star\\mathbf{Y}_{i,t}\\bigr = ac {N_a} hcal }_{ xt\nThe b-mFDEK metric augments mFDEK with a Brierwhere k⋆i denotes the index of the best predicted mode for style penalty based on the probability of the best predicted\nagent i. mode:\nCross-entropy classification loss. For probability score\nb\\classification, the index to the mode with m k⋆i , corresponding t \\math r {-}mFDE} _ K = \\frac {1 }{N_a}\\sum_{i=1}^{N_a}\\left\\mathrm{FDE}_{i,k_i^\\star\\biglP_{i,k_i^\\star\\bigr\\right(16)the smallest ADE of agent i, is used as the ground-truth\nclass label. 
Then, using the predicted probability P and the ground-truth class label, the classification loss is calculated as:\n\\mathcal{L}_{\\text{cls}} = -\\frac{1}{N_a} \\sum_{i=1}^{N_a} \\log \\mathbf{P}_{i,k_i^\\star}. (13)\nThe overall training loss for the decoder and RPM is the sum of the regression and classification terms.\n9. Appendix for Evaluation Metrics\nWe adopt commonly used metrics, namely mADEK, mFDEK, b-mFDEK, and MRK, to evaluate PRF, as described in Section 4.1. The ground-truth future trajectories Y, predicted future trajectories ˜Y, and their associated probabilities P are used to compute these metrics. Specifically, for each agent i and mode k, the ADE and FDE are defined as follows:\nThe MRK metric measures the fraction of agents for which even the best of the K predicted trajectories deviates from the ground truth by more than a threshold δ = 2.0 meters at the final time step:\n\\mathrm{MR}_K = \\frac{1}{N_a} \\sum_{i=1}^{N_a} \\mathbf{1}\\Bigl(\\mathrm{FDE}_{i,k_i^\\star} > \\delta\\Bigr), (17)\nwhere 1(·) is the indicator function that returns 1 if the condition is true and 0 otherwise. These metrics can be extended to the entire dataset by averaging over the total number of predicted agents across all scenes.\nAppendix for Qualitative Evaluations\nAdditional qualitative results, complementing those presented in Fig. 4, are shown in Fig. 
7.",
+    "paper_id": "2603.10597",
+    "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction",
+    "authors": [
+      "Hao Zhou",
+      "Lu Qi",
+      "Jason Li",
+      "Jie Zhang",
+      "Yi Liu",
+      "Xu Yang",
+      "Mingyu Fan",
+      "Fei Luo"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10597v1",
+    "chunk_index": 44,
+    "total_chunks": 45,
+    "char_count": 2603,
+    "word_count": 384,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "3988b5fa-a865-4d26-bcc8-62d81a701f7a",
+    "text": "\\begin{aligned}\\mathrm{ADE}_{i,k} &= \\frac{1}{T_f} \\sum_{t=1}^{T_f} \\left\\lVert \\tilde{\\mathbf{Y}}_{i,k,t} - \\mathbf{Y}_{i,t} \\right\\rVert,\\\\\\mathrm{FDE}_{i,k} &= \\left\\lVert \\tilde{\\mathbf{Y}}_{i,k,T_f} - \\mathbf{Y}_{i,T_f} \\right\\rVert.\\end{aligned} (14)\nAll results are predicted from very short observation horizons of only 10 timesteps. In some scenarios, the absence of an observation trajectory indicates that the vehicle is stationary during the observation window. These qualitative results, spanning\nFigure 9. t-SNE visualization of features distilled by the progressive strategy. Green and orange points represent features extracted from trajectories with observation lengths of 10 and 50 timesteps, respectively. 
Red, purple, brown, and blue points correspond to\nfeatures distilled from 10-step observations to those with 20, 30,\n40, and 50 steps, which gradually shift from the manifold of 10-\nstep observations toward that of 50-step observations.", + "paper_id": "2603.10597", + "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction", + "authors": [ + "Hao Zhou", + "Lu Qi", + "Jason Li", + "Jie Zhang", + "Yi Liu", + "Xu Yang", + "Mingyu Fan", + "Fei Luo" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10597v1", + "chunk_index": 45, + "total_chunks": 45, + "char_count": 958, + "word_count": 126, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "efe583c7-e716-46f9-b569-2e7fd8e40f7a", + "text": "various driving scenarios, further highlight the state-of-theart performance of the proposed PRF. Fig. 8 illustrates four failure cases when using very short observation horizons of only 10 timesteps. Fig. 8a and Fig. 8b depict failure cases in U-turn scenarios. Fig. 8c shows a failure case in a compound turn scenario,\nwhile Fig. 8d presents a failure case in the on-ramp merging\nscenario. These scenarios present long-tail problems for trajectory prediction, even with complete observation lengths. With such short and incomplete observations, the proposed\nPRF initially tracks the ground-truth motion but eventually\ndeviates as the maneuver becomes more complex. To improve predictions in these scenarios, future work could focus on enhancing the modeling of interactions among multiple agents and incorporating additional high-level context,\nsuch as traffic signals and right-of-way rules, as structural\nconstraints on the predicted trajectories. Appendix for Interpretability Analysis Additional interpretability analysis, complementary to\nFig. 5, is presented in Fig. 9. 
This figure visualizes the tSNE of features extracted from observation lengths of 10\nand 50 timesteps, as well as features distilled from 10-step\nobservations to those with 20, 30, 40, and 50 timesteps. The visualization shows that, as progressive distillation proceeds, features distilled from trajectories with an observation length of 10 timesteps gradually converge toward the\nfeatures obtained from 50-step observations. This demonstrates that decomposing direct distillation into a sequence\nof progressive distillation steps reduces the difficulty of the\ndistillation process and effectively distills representations\nfrom short trajectories into those of complete trajectories.", + "paper_id": "2603.10597", + "title": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction", + "authors": [ + "Hao Zhou", + "Lu Qi", + "Jason Li", + "Jie Zhang", + "Yi Liu", + "Xu Yang", + "Mingyu Fan", + "Fei Luo" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10597v1", + "chunk_index": 46, + "total_chunks": 45, + "char_count": 1761, + "word_count": 245, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10599_semantic.json b/data/chunks/2603.10599_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..79d65d04613e78b902294abc7c17e1082899be16 --- /dev/null +++ b/data/chunks/2603.10599_semantic.json @@ -0,0 +1,110 @@ +[ + { + "chunk_id": "81eb2bb8-9f86-4956-8d44-a2b09e9ca15f", + "text": "Ivan Biolia,b Mikel Mendibe Abarrategic,d a Dipartimento di Matematica, Universit`a degli Studi di Pavia, Via A. Ferrata 5, 27100 Pavia, Italy\nb Dipartimento di Ingegneria Civile e Architettura, Universit`a degli Studi di Pavia, Via A. 
Ferrata 3, 27100 Pavia, Italy\nc University of the Basque Country, 48013 Bilbao, Spain\nd TECNALIA, Basque Research & Technology Alliance (BRTA), 48160 Derio, Spain\nWe present a JAX implementation of the Self-Scaled Broyden family of quasi-Newton methods, fully compatible with JAX and building on the Optimistix [4] optimisation library. The implementation includes BFGS, DFP, Broyden and their Self-Scaled variants (SSBFGS, SSDFP, SSBroyden), together with a Zoom line search satisfying the strong Wolfe conditions.",
+    "paper_id": "2603.10599",
+    "title": "Self-Scaled Broyden Family of Quasi-Newton Methods in JAX",
+    "authors": [
+      "Ivan Bioli",
+      "Mikel Mendibe Abarrategi"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10599v1",
+    "chunk_index": 1,
+    "total_chunks": 6,
+    "char_count": 755,
+    "word_count": 109,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "620bca97-a578-477c-a53a-0579cc81ba5e",
+    "text": "This is a short technical note, not a research paper, as it does not claim any novel contribution; its purpose is to document the implementation and ease the adoption of these optimisers within the JAX community. The code is available at https://github.com/IvanBioli/ssbroyden_optimistix.git.\n1 Introduction\nOptimistix [4] is a JAX library for nonlinear solvers providing modular and composable optimisation algorithms. While Optimistix includes a standard BFGS implementation paired with a backtracking Armijo line search, it lacks both the Zoom line search, which satisfies the strong Wolfe conditions, and the broader family of Self-Scaled Broyden methods [1, 3, 2]. This work provides a pure-JAX implementation of these methods, designed to be fully compatible with the Optimistix solver interface: the new solvers can be used as drop-in replacements, composed with existing Optimistix descents and searches, and benefit from all JAX transformations. 
Specifically, our implementation addresses the following gaps in the current Optimistix offerings:\nZoom line search. We integrate the Zoom line search (Algorithm 3.6 in [5]) into Optimistix, ensuring that the strong Wolfe conditions are satisfied at each step. The implementation is adapted from bagibence/zoom_linesearch with minor modifications to fit the new Optimistix interface.\nSelf-Scaled Broyden family. We implement the full Self-Scaled Broyden family of quasi-Newton Hessian updates, encompassing Broyden, BFGS, DFP, and their Self-Scaled variants SSBroyden, SSBFGS, and SSDFP.\nWe provide a wrapper that distinguishes between actual quasi-Newton iterations and internal line search steps, which Optimistix does not separate by design choice. This allows for more refined comparisons between solvers.\n2 The Self-Scaled Broyden Family\nThe Self-Scaled Broyden family of quasi-Newton methods generalises the classic Broyden, BFGS, and DFP updates to minimize a function f : RN →R [1, 3, 2]. At each iteration k, these methods maintain an approximation Hk of the inverse Hessian of f and compute a search direction dk = −Hk∇f(xk). After a line search determines a step size αk, the iterate is updated as xk+1 = xk + αkdk.",
+    "paper_id": "2603.10599",
+    "title": "Self-Scaled Broyden Family of Quasi-Newton Methods in JAX",
+    "authors": [
+      "Ivan Bioli",
+      "Mikel Mendibe Abarrategi"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10599v1",
+    "chunk_index": 2,
+    "total_chunks": 6,
+    "char_count": 2195,
+    "word_count": 321,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "6d14e4c8-fd94-4491-a40e-b9cee20bcc50",
+    "text": "The inverse Hessian approximation is then updated using the step sk = xk+1 − xk and the gradient difference yk = ∇f(xk+1) − ∇f(xk). The Self-Scaled Broyden family update is parameterised by two scalars, θk and τk. 
In its most general form, the update is given by\nHk+1 = (1/τk) [ Hk − (Hk yk y⊤k Hk)/(y⊤k Hk yk) + ϕk (y⊤k Hk yk) vk v⊤k ] + ρk sk s⊤k,   (1)",
+    "paper_id": "2603.10599",
+    "title": "Self-Scaled Broyden Family of Quasi-Newton Methods in JAX",
+    "authors": [
+      "Ivan Bioli",
+      "Mikel Mendibe Abarrategi"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10599v1",
+    "chunk_index": 3,
+    "total_chunks": 6,
+    "char_count": 335,
+    "word_count": 64,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "09789958-2fcf-468e-8499-79dfea78fe5f",
+    "text": "1https://github.com/bagibence/optimistix/tree/zoom_linesearch\nρk = 1/(y⊤k sk),\nvk = sk/(y⊤k sk) − (Hk yk)/(y⊤k Hk yk),\nϕk = (1 − θk)/(1 + (hk bk − 1) θk),\nbk = (s⊤k Bk sk)/(y⊤k sk),\nhk = (y⊤k Hk yk)/(y⊤k sk).\nThe parameter θk interpolates between the BFGS (θk = 0) and DFP (θk = 1) updates, with the more general Broyden family obtained by computing θk dynamically at each iteration as:\nθk = max( θ−k, min( θ+k, (1 − bk)/ak ) ),\nwhere, letting ak = bk hk − 1 and ck = (bk/(1 + ak))^{1/2},\nρ−k = min( 1, 1/(hk (1 − ck)) ),  θ−k = (ρ−k − 1)/ak,  θ+k = 1/(ak ρ−k).\nThe parameter τk controls the Self-Scaled variant (τk = 1 means no scaling) and is computed as\nτk = min( ρ+k σ(1−N)k, σk ) if θk ≤ 0, and τk = ρ+k min( σ(1−N)k, 1/θk ) otherwise,\nwhere\nρ+k = min( 1, 1/bk ),  σk = 1 + θk ak,  σ(1−N)k = |σk|^{1/(1−N)}.\nTable 1 summarises the six concrete solvers obtained by choosing θk and τk. Table 1: Solvers implemented as special cases of the Self-Scaled Broyden family. 
Solver θk τk Description BFGS 0 1 Classic BFGS\nSSBFGS 0 computed Self-Scaled BFGS\nDFP 1 1 Classic DFP\nSSDFP 1 computed Self-Scaled DFP\nBroyden computed 1 Broyden family (no scaling)\nSSBroyden computed computed Self-Scaled Broyden family The implementation follows a class hierarchy that mirrors the mathematical structure of the update\nfamily, building on top of the AbstractQuasiNewton base class already present in Optimistix: AbstractSSBroydenFamily implements the shared logic: Hessian initialisation, computation of the\nauxiliary quantities, and the dispatch to subclass-specific update terms. It exposes two hooks: compute thetak\nand compute tauk, which subclasses override to fix or compute θk and τk.", + "paper_id": "2603.10599", + "title": "Self-Scaled Broyden Family of Quasi-Newton Methods in JAX", + "authors": [ + "Ivan Bioli", + "Mikel Mendibe Abarrategi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10599v1", + "chunk_index": 4, + "total_chunks": 6, + "char_count": 1607, + "word_count": 296, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ef709232-d6a2-4617-89f2-98a82cb32e20", + "text": "AbstractSSBroyden implements the general update term (1) and computes both θk and τk dynamically. AbstractBroyden inherits from it but overrides compute tauk ≡1. AbstractSSBFGS fixes θk = 0, which simplifies the update to the BFGS Woodbury form. AbstractBFGS\nspecialises further with τk = 1. AbstractSSDFP fixes θk = 1, eliminating the vk term. AbstractDFP specialises further with τk = 1. Each abstract class has a concrete counterpart (e.g. BFGS, SSBFGS) that binds a default descent\n(NewtonDescent) and a default search (Zoom line search). Users can subclass the abstract variants to\nplug in alternative descents or searches. 
3 Numerical Example: PINNs for the 3D Poisson Equation\nThe SSBroyden family of optimizers has recently shown improved performance over BFGS for Physics Informed Neural Networks (PINNs) [2]. In this numerical example, available as example.py in our repository, we solve the Poisson equation −∆u = f on Ω = [0, 1]3 with Dirichlet boundary conditions, where the exact solution is u∗(x) = ∏_{i=1}^{3} sin(πxi).",
+    "paper_id": "2603.10599",
+    "title": "Self-Scaled Broyden Family of Quasi-Newton Methods in JAX",
+    "authors": [
+      "Ivan Bioli",
+      "Mikel Mendibe Abarrategi"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10599v1",
+    "chunk_index": 5,
+    "total_chunks": 6,
+    "char_count": 1028,
+    "word_count": 159,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "149ef9f0-55ad-4f79-8941-cdbd3eec9603",
+    "text": "The solution is approximated by a fully connected neural network (3 hidden layers of 32 units with tanh activations).\nL(θ) = (1/(2NΩ)) Σ_{i=1}^{NΩ} (∆uθ(xi) + f(xi))² + (1/(2NΓ)) Σ_{j=1}^{NΓ} (uθ(xj) − u∗(xj))²,\nwith NΩ = 5000 interior and NΓ = 800 boundary collocation points, sampled uniformly at random. Figure 1 compares the convergence of the implemented solvers (BFGS, SSBFGS, Broyden, SSBroyden). The self-scaled variants converge notably faster in terms of both loss reduction and relative L2 and H1 errors.\n[Figure 1: three panels showing the loss, relative L2 error, and relative H1 error versus iteration (0 to 10000) for BFGS, SSBFGS, Broyden, and SSBroyden.]\nFigure 1: Convergence of quasi-Newton solvers on the 3D Poisson PINN problem. The self-scaled variants (SSBFGS, SSBroyden) achieve lower errors in fewer iterations compared to the standard BFGS and Broyden methods. 
This work has received funding from the European Union's Horizon Europe research and innovation\nprogramme under the Marie Sklodowska-Curie Action MSCA-DN-101119556 (IN-DEEP). Ivan Bioli is\nmember of the Gruppo Nazionale Calcolo Scientifico - Istituto Nazionale di Alta Matematica (GNCSINdAM).", + "paper_id": "2603.10599", + "title": "Self-Scaled Broyden Family of Quasi-Newton Methods in JAX", + "authors": [ + "Ivan Bioli", + "Mikel Mendibe Abarrategi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10599v1", + "chunk_index": 6, + "total_chunks": 6, + "char_count": 1391, + "word_count": 222, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10600_semantic.json b/data/chunks/2603.10600_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..8e99caa48bbeb621820e7cbf0bcff0d0839718e6 --- /dev/null +++ b/data/chunks/2603.10600_semantic.json @@ -0,0 +1,1014 @@ +[ + { + "chunk_id": "c9385325-a4d9-4dea-b6c8-1ce04556cace", + "text": "Gaodan Fang, Vatche Isahagian, K. Jayaram, Ritesh Kumar, Vinod Muthusamy, Punleuk Oum,\nGegi Thomas∗\nAgents and Automation Lab, IBM Research\nUSA Abstract done at IBM. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/\nLLM-powered agents face a persistent challenge: learning from nnnnnnn.nnnnnnn\ntheir execution experiences to improve future performance. While\nagents can successfully complete many tasks, they often repeat 1 Introduction\ninefficient patterns, fail to recover from similar errors, and miss opLarge Language Model (LLM) powered agents have enabled inportunities to apply successful strategies from past executions. We\ncreasingly sophisticated automation of tasks ranging from web2026 present a novel framework for automatically extracting actionable\nnavigation to API orchestration. 
These agents operate by iteratively reasoning about tasks, selecting actions, executing them, and observing results. However, a fundamental limitation persists: Agents have amnesia because most LLMs are stateless. Agents lack systematic mechanisms to learn from their execution experiences [4, 17]. An agent that struggles with a particular API authentication flow today will struggle with the same flow tomorrow unless its prompts are manually updated. An agent that discovers an efficient strategy for a task cannot automatically apply that strategy to similar future tasks. An agent that successfully recovers from an error provides no benefit to future executions that encounter similar errors. Consider a simple e-commerce task: adding items to a shopping cart and completing checkout. An agent might successfully complete this task but do so inefficiently—for instance, by calling amazon_remove_from_cart(item_id) in a loop to empty the\nlearnings from agent execution trajectories and utilizing them to improve future performance through contextual memory retrieval. Our approach comprises four components: (1) a Trajectory Intelligence Extractor that performs semantic analysis of agent reasoning patterns, (2) a Decision Attribution Analyzer that identifies which decisions and reasoning steps led to failures, recoveries, or inefficiencies, (3) a Contextual Learning Generator that produces three types of guidance—strategy tips from successful patterns, recovery tips from failure handling, and optimization tips from inefficient but successful executions—and (4) an Adaptive Memory Retrieval System that injects relevant learnings into agent prompts based on multi-dimensional similarity. Unlike existing memory systems that store generic conversational facts, our framework understands execution patterns, extracts structured learnings with provenance, and retrieves guidance tailored to specific task contexts. 
Evaluation on\ncart when a single amazon_empty_cart() call would suffice.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 1, + "total_chunks": 44, + "char_count": 2816, + "word_count": 379, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "26271868-4136-4bff-a701-4db4443bbb95", + "text": "In\nthe AppWorld benchmark demonstrates consistent improvements,\nanother execution, the agent might fail entirely because it attempts\nwith up to 14.3 percentage point gains in scenario goal completion\ncheckout without first adding a payment method, then successfully\non held-out tasks and particularly strong benefits on complex tasks\nrecover by recognizing the error and adding payment information.\n(28.5 pp scenario goal improvement, a 149% relative increase). In yet another execution, the agent might execute the task cleanly\nfrom the start by systematically verifying prerequisites before each CCS Concepts\noperation.\n• Computing methodologies →Information extraction; Multi- Each of these trajectories contains valuable learnings (for future\nagent systems; Knowledge representation and reasoning; • executions), but of different types. The inefficient success suggests\nInformation systems →Enterprise applications; Information re- an optimization tip: when emptying a cart with multiple items,\ntrieval. use the bulk operation rather than iterating through individual removals. The failure-then-recovery suggests a recovery tip: when\nKeywordsarXiv:2603.10600v1 checkout fails due to missing payment method, verify payment inagentic memory, self evolving agents formation is configured before retrying. 
The clean success suggests\nACM Reference Format: a strategy tip: before initiating checkout operations, systematically\nGaodan Fang, Vatche Isahagian, K. Jayaram, Ritesh Kumar, Vinod Muthusamy, verify all prerequisites including cart contents, shipping address,\nPunleuk Oum, Gegi Thomas. 2026.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 2, + "total_chunks": 44, + "char_count": 1598, + "word_count": 211, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e0bf6ad7-58c8-4d88-bda7-b06bb2258c1b", + "text": "Trajectory-Informed Memory Generation and payment method availability.\nfor Self-Improving Agent Systems. In Technical Report describing research Current approaches to agent improvement are inadequate for\n∗Author names listed alphabetically. capturing these diverse learning opportunities. Rule-based systems\nrequire developers to manually anticipate patterns and encode\nPermission to make digital or hard copies of all or part of this work for personal or them as decision rules, making them brittle and unable to adapt\nclassroom use is granted without fee provided that copies are not made or distributed\nfor profit or commercial advantage and that copies bear this notice and the full citation to unforeseen situations. Prompt engineering improves common\non the first page. Copyrights for third-party components of this work must be honored. patterns through iteratively refined instructions and examples, but\nFor all other uses, contact the owner/author(s). this guidance is generic rather than specific to actual deployment\nTechnical Report, Yorktown Heights, NY\n© 2026 Copyright held by the owner/author(s). 
experiences, and there is no mechanism for automatic improvement\nhttps://doi.org/10.1145/nnnnnnn.nnnnnnn based on observed outcomes. Generic memory systems [2, 15] store Technical Report, Published Feb 2026, Yorktown Heights, NY Fang et al.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 3, + "total_chunks": 44, + "char_count": 1353, + "word_count": 189, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a9d0f7d9-46fd-4cc8-adcd-a3ebe7614c6c", + "text": "facts extracted from conversations in vector databases for later re- 2 Problem Statement\ntrieval, but these systems lack several critical capabilities: they have 2.1 The Agent Learning Challenge\nno understanding of agent execution patterns and reasoning flows,\nLLM-powered agents execute tasks by iteratively reasoning, select-they cannot perform causal analysis to identify which decisions led\ning actions, and observing outcomes. Each execution trajectory—theto failures or inefficiencies, they lack structured learning extraction\ncomplete sequence of thoughts, actions, and results from initialwith categories like strategy, recovery, and optimization, and they\nrequest to final outcome—contains patterns that could inform fu-provide no provenance tracking from learnings back to source trature executions [11]. However, extracting actionable learnings fromjectories. Recent work has begun extracting reusable knowledge\nthese trajectories is non-trivial for several reasons.from agent trajectories—including workflows from successful exeFirst, valuable patterns exist across diverse outcome cate-cutions [6, 13], procedural instructions [5], reasoning strategies [9],\ngories. 
Not all learning opportunities arise from failures. An agentand evolving context playbooks [16]—but these approaches typithat successfully completes a task may have employed an elegantcally learn only from successful trajectories, lack explicit causal\nstrategy worth replicating, discovered an efficient API usage pat-attribution of failures, or produce monolithic documents rather\ntern, or executed a thorough validation sequence that preventedthan structured, retrievable memory entries. Empirical studies furerrors. Conversely, an agent that ultimately succeeds may havether demonstrate that naive experience accumulation leads to error\ndone so inefficiently—taking unnecessary steps, making redundantpropagation and misaligned replay [14], underscoring the need for\nAPI calls, or using granular operations where bulk operations exist.quality-aware memory curation. And agents that encounter failures may successfully recover, with We present a framework that addresses these limitations through\nthe recovery pattern itself being valuable to capture.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 4, + "total_chunks": 44, + "char_count": 2232, + "word_count": 282, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3002c2d7-bf07-4fcb-bae5-6949da8240d3", + "text": "A comprehen-trajectory-informed memory generation and retrieval. 
Our key insive learning system must extract insights from clean successes,sight is that agent execution histories—trajectories—contain rich\ninefficient successes, failure-then-recovery sequences, and com-semantic information about not just what happened, but why agents\nplete failures.made decisions, how they reasoned about tasks, which strategies\nSecond, causality is often non-obvious from raw logs. Whensucceeded, which patterns proved inefficient, and where decision\nan agent fails at step 15 of an execution, the problematic decisionchains led to failures and recoveries. By analyzing these trajectomay have occurred at step 3.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 5, + "total_chunks": 44, + "char_count": 698, + "word_count": 90, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3d0ab9e7-739b-4187-b1a0-92477beb52de", + "text": "When an agent successfully recoversries with semantic understanding, we can automatically extract\nfrom an error, identifying which specific reasoning led to the recov-actionable learnings across multiple categories, attribute failures\nery requires semantic understanding of the agent's thoughts, notand inefficiencies to specific decisions and reasoning steps, generate\njust observation of the final outcome. When an agent completes acontext-aware guidance, and retrieve relevant learnings based on\ntask inefficiently, determining which alternative approach wouldmultiple contextual dimensions.\nbe more efficient requires understanding both what the agent did Our contributions are as follows:\nand what other options were available. 
Third, learnings must be contextually retrieved.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 6, + "total_chunks": 44, + "char_count": 781, + "word_count": 100, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "70240ea8-95cf-479a-a386-836db4439cb2", + "text": "An opti-\n• We introduce trajectory intelligence extraction that moves mization tip about using bulk cart operations is relevant when the\nagent is performing cart management but irrelevant for email com- beyond raw logging to semantic understanding of agent\nposition tasks. A recovery tip about handling authentication failures reasoning patterns, including analytical thoughts, planning\nis critical for tasks involving authenticated APIs but unnecessary patterns, validation behaviors, reflection patterns, and selffor read-only operations. The retrieval system must match learn- correction sequences.\n• We present automated decision attribution that distinguishes ings to contexts based on multiple dimensions: task type, domain,\nsemantic similarity to current request, and the specific execution immediate causes, proximate causes, and root causes of failpatterns involved. The importance of precise retrieval is amplified ures, while also identifying which decisions led to successful\nby empirical evidence that agents closely follow retrieved experi- recoveries and which execution patterns prove inefficient\nences [14], making mismatched retrieval a direct source of degraded despite succeeding.\n• We develop contextual learning generation that produces performance. Fourth, learnings must be actionable and specific. 
Generic advice like \"be careful with API calls\" provides little value. Effective learnings specify concrete validation checks, particular API usage patterns, specific error recovery sequences, or explicit prerequisite verification steps. They must be formulated in terms the agent can directly apply: \"Before initiating checkout, verify payment method is configured by calling get_payment_methods() and checking for non-empty results\" is actionable; \"make sure payment works\" is not.
Fifth, learnings must be traceable to their source. Each learning must maintain provenance—a link back to the specific trajectory and outcome from which it was derived [3]. This enables validation of whether learnings are effective (do similar failures still occur
three distinct types of guidance: strategy tips encoding successful patterns from clean executions, recovery tips capturing failure handling and error correction approaches, and optimization tips identifying efficiency improvements from successful but suboptimal executions.
• We design adaptive memory retrieval that combines semantic similarity with metadata filtering and priority-based ranking to ensure agents receive the most relevant guidance for their specific context, including task type, domain, and execution patterns.
• We demonstrate the framework's effectiveness on the AppWorld benchmark, showing consistent improvements across all difficulty levels, with particularly strong gains on complex tasks where learned experience is most valuable.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 7, + "total_chunks": 44, + "char_count": 2835, + "word_count": 383, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "11dee733-beb0-49a2-9e5b-450875fdb5c7", + "text": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems Technical Report, Published Feb 2026, Yorktown Heights, NY
after the learning is deployed?), investigation of why certain guidance was generated, and auditing of the learning system's decisions. Without provenance, it is impossible to debug incorrect guidance, assess learning quality over time, or build trust in the system's recommendations.
2.2 Learning Requirements
For agents that reason and act iteratively (e.g., ReAct-style, plan-and-execute), the learning system must satisfy several requirements.
Strategy extraction from successful patterns: When an agent executes a task cleanly—without errors, unnecessary steps, or recovery sequences—its approach often embodies effective strategies. The system must identify these patterns: Did the agent verify prerequisites before attempting operations? Did it systematically explore available APIs before selecting one? Did it validate intermediate results before proceeding to dependent steps? These successful patterns should be encoded as strategy tips that guide future executions toward similarly effective approaches.
Recovery extraction from failure handling: When an agent encounters an error but successfully recovers, the recovery sequence is valuable. The system must identify what went wrong, what the agent recognized about the failure, how it adjusted its approach, and what specific actions led to successful recovery. For
identify and classify distinct reasoning modes—analytical thoughts (examining data or constraints), planning thoughts (formulating action sequences), validation thoughts (checking prerequisites or intermediate results), reflection thoughts (evaluating past actions), and self-correction sequences (recognizing and recovering from errors)—to understand how agents reasoned about tasks and where their reasoning succeeded or failed. This structured understanding of reasoning flows is what enables the extraction of meaningful learnings from trajectories rather than surface-level pattern matching on actions alone.
2.3 Limitations of Existing Approaches
Existing approaches to agent improvement fail to address these challenges comprehensively.
Rule-based systems encode decision rules based on anticipated patterns, but they cannot adapt to unforeseen situations and require constant manual maintenance as new patterns emerge. They also cannot automatically extract rules from observed execution trajectories—each rule must be manually crafted by developers who may not have visibility into actual deployment patterns.
Prompt engineering improves agent performance through iteratively refined guidance and examples, but this guidance is generic rather than specific to actual deployment experiences.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 8, + "total_chunks": 44, + "char_count": 2768, + "word_count": 363, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0b39824d-b370-48e7-b978-693450f55610", + "text": "example, if an agent attempts checkout without payment configured, receives an error, recognizes the missing payment method, adds payment information, and successfully retries, this entire sequence should be encoded as a recovery tip including the failure pattern, recognition signals, and correction steps.
Optimization extraction from inefficient successes: When an agent successfully completes a task but does so suboptimally, the system must identify the inefficiency and determine the more efficient alternative. This requires understanding not just what the agent did, but what other options were available. For example, if
If an agent repeatedly fails at a particular API authentication flow, prompt engineering might eventually capture this pattern, but only after manual observation and prompt modification. There is no mechanism for automatic improvement based on observed outcomes, and no systematic way to capture the full range of learning opportunities from successes, failures, and recoveries.
Generic memory systems represent a more sophisticated approach but still fall short. Systems like Mem0 [2] and Letta [10] store facts extracted from conversations in vector databases for later retrieval.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. 
Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 9, + "total_chunks": 44, + "char_count": 1216, + "word_count": 174, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "13d6e50e-1613-4360-a6fd-5fd84d476fcb", + "text": "However, these systems lack several critical capabilities for agent learning. They have no understanding of agent execution patterns—they treat all memories uniformly rather than distinguishing between strategy patterns, recovery sequences, and optimization opportunities. They cannot perform causal analysis to identify which decisions led to failures or inefficiencies—they store outcomes but not the decision chains that produced them. They
an agent removes items from a cart one-by-one in a loop when a bulk empty_cart() operation exists, the system must recognize this pattern, identify the more efficient alternative, and encode an optimization tip specifying when and how to use the bulk operation.
Step-level decision attribution: When failures or inefficiencies occur, the system must identify which specific reasoning steps and decisions led to the outcome.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 10, + "total_chunks": 44, + "char_count": 867, + "word_count": 122, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3994d66d-7f17-43d4-bf8f-9aaafef56a9f", + "text": "This requires semantic analysis of the agent's thoughts, not just observation of actions. If an agent fails because it assumed an API was available without verifying, the attribution must identify the assumption step, explain why it was problematic, and specify what verification should have occurred.
Thought pattern recognition: Agents often exhibit meta-cognitive behaviors that indicate their reasoning quality. An agent that explicitly validates prerequisites is demonstrating a positive pattern. An agent that recognizes its own errors and self-corrects is exhibiting reflection. An agent that makes assumptions without verification is exhibiting a negative pattern. The system must identify these cognitive patterns semantically—recognizing that \"I should verify all APIs are available\" exhibits a validation pattern even without using the word \"validate\"—and use them to guide learning extraction.
Semantic reasoning analysis: Beyond recognizing individual thought patterns, the system must move beyond raw execution logs to understand the full structure of agent reasoning. The system must
lack structured learning extraction with categories, priorities, and actionable steps—memories are typically free-form text without the structure needed for agent guidance. They provide no provenance tracking from learnings back to source trajectories, making it impossible to validate whether learnings are effective or to investigate why certain guidance was generated [17].
Reinforcement learning approaches learn from reward signals but have their own limitations for this problem. They require extensive training data to learn effective policies, which may not be available when failures are rare but consequential. They are computationally expensive to train and update, making them impractical for continuously evolving agent systems. They provide limited interpretability regarding why certain decisions improve outcomes—the learned policy is often a black box. For scenarios where understanding the reasoning behind improvements is valuable (such as debugging or auditing agent behavior), RL approaches provide insufficient transparency. Additionally, RL approaches struggle with", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 11, + "total_chunks": 44, + "char_count": 2262, + "word_count": 310, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2711bbc9-6d73-4134-ac37-f2eb7ce0456a", + "text": "the multi-category learning problem—they optimize for overall reward but do not naturally distinguish between strategy patterns, recovery sequences, and optimization opportunities.
3 Approach
As illustrated in Figure 1, we propose a framework that transforms raw agent execution trajectories into actionable, contextually retrieved guidance for future invocations. The framework operates as a three-phase pipeline:
(1) Phase 1: Trajectory Analysis and Tips Extraction. Given an agent's execution trajectory for a completed task, the system analyzes the reasoning trace to identify causal decision chains—why outcomes occurred—and extracts structured tips capturing effective strategies, recovery patterns, and optimization opportunities. Tips are extracted at two complementary granularities: task-level tips that capture holistic end-to-end patterns, and subtask-level tips that decompose trajectories into reusable logical phases (authentication, data retrieval, processing, etc.) for cross-task transfer.
(2) Phase 2: Tip Storage and Management. Extracted tips are generalized, clustered, and consolidated before storage. Subtask descriptions are abstracted to remove entity-specific details, enabling semantic clustering of tips from different tasks that share common subtask patterns. An LLM-based merging process consolidates redundant or overlapping tips within each cluster, producing a curated memory of non-redundant, high-quality guidance. Tips are stored with dual representations—vector embeddings for semantic search and structured metadata for filtering.
(3) Phase 3: Runtime Retrieval. When an agent is invoked for a new task, the system retrieves relevant tips from memory and injects them into the agent's prompt as guidelines before reasoning begins. Two retrieval strategies are supported: cosine similarity retrieval (fast, no LLM call) and LLM-guided selection (richer reasoning about task context at the cost of an additional LLM invocation).
These phases form a self-reinforcing cycle: as more trajectories are processed, the memory system accumulates increasingly comprehensive and refined guidance.
what traditional logging provides: why agents made particular decisions, how they validated their reasoning, where they exhibited self-corrective behavior, and what patterns characterized successful versus unsuccessful executions.
The component receives raw agent trajectories containing sequential steps with agent invocations, prompts or contexts, agent responses including thoughts and reflections, actions taken and their results, and optionally, evaluation reports or ground-truth outcome assessments. Each trajectory represents a complete task execution from initial user request through final outcome. Crucially, ground-truth outcome labels (success or failure) are not required: when they are available—for instance, from a benchmark evaluation harness—the system uses them directly to classify the trajectory; when they are absent, the system infers outcome from the agent's own self-reflective signals identified in subsequent stages.
The first processing stage parses agent responses to identify and categorize reasoning into four types based on cognitive function: Analytical thoughts where the agent analyzes the situation and assesses constraints; Planning thoughts where the agent decides what actions to take and in what sequence; Validation thoughts where the agent checks assumptions or verifies preconditions; and Reflection thoughts where the agent reconsiders its approach, often triggered by unexpected results. Beyond categorization, the extractor identifies status indicators, execution summaries, and error recognition statements, enabling downstream components to understand the reasoning process that led to actions.
The second stage uses an LLM to identify cognitive patterns within extracted thoughts through semantic understanding rather than keyword matching. The system recognizes: Validation patterns—any expression of checking or verifying assumptions, even without validation-related keywords (e.g., \"I need to ensure all required APIs are included\" exhibits validation behavior); Reflection patterns—reconsideration of previous decisions, often after errors; Self-correction patterns—proactively identifying and fixing errors before external signals; Error recognition patterns—noticing problems that may affect task completion; API discovery patterns—systematic exploration of available APIs; and Efficiency awareness patterns—considering whether more efficient alternatives exist. This semantic approach generalizes across linguistic", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. 
Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 12, + "total_chunks": 44, + "char_count": 4600, + "word_count": 589, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e99ce193-5537-4530-bdb7-45fb12e1eecf", + "text": "variations, unlike rule-based keyword matching. Agents that receive this guidance produce higher-quality trajectories that may reveal subtler patterns for further learning. The third stage determines the trajectory outcome. When ground-", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 13, + "total_chunks": 44, + "char_count": 234, + "word_count": 29, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f1b8e3a3-1782-4d1d-841a-a81b343192bf", + "text": "truth evaluation reports are present, the stage interprets them with semantic understanding: a report stating \"API response returned 400 Bad Request\" is converted into \"Checkout API failed because required payment method parameter was not provided,\" and for each outcome indicator, the module determines what the test validates, why it failed (if applicable), the impact on task completion, and overall quality assessment. When ground-truth labels are absent, the stage instead synthesizes outcome from the self-reflective signals extracted in stages 1 and 2—reflection thoughts, self-correction patterns, and error recognition patterns—to infer whether the agent succeeded, failed, or recovered. In both cases, the result is an outcome classification used by downstream components.
The following subsections detail each phase.
3.1 Phase 1: Trajectory Analysis and Tips Extraction
This phase analyzes completed agent trajectories to extract structured, actionable tips. It comprises three stages: trajectory intelligence extraction, decision attribution analysis, and tip generation. A key design dimension of the tip generation stage is the granularity at which tips are extracted—either at the level of entire task trajectories (task-level) or at the level of individual logical subtasks within a trajectory (subtask-level). We explore both granularities and compare their effectiveness in Section 4.
3.1.1 Trajectory Intelligence Extractor. The Trajectory Intelligence Extractor transforms raw agent execution data into a structured intermediate representation that captures semantic meaning beyond
[Figure 1: Overview of our approach. Pipeline stages: Extraction (Trajectory Intelligence Extractor, Decision Attribution Analyzer, Contextual Learning Generator, Subtask-level Decomposition); Storage & Mgmt (Description Generalization, Semantic Clustering, Tip Merging and Consolidation, Dual-Indexed Store); Retrieval & Usage (Cosine similarity or top-k selection, LLM-guided selection, Priority Weighted Ranking, Prompt integration)]
A fourth stage specifically analyzes successful executions, distinguishing: Clean successes—task completed without errors or unnecessary steps, with patterns that are candidates for strategy tips; Inefficient successes—task completed but suboptimally (e.g.,", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. 
Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 14, + "total_chunks": 44, + "char_count": 2439, + "word_count": 312, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "55158338-5cee-4250-92cb-59ce19836a64", + "text": "repeated operations that could be batched), yielding candidates for optimization tips; and Recovery sequences—successful error handling within otherwise successful executions, yielding candidates for recovery tips.
The output is a structured intermediate representation enriched with extracted thoughts, identified patterns with confidence scores, evaluation intelligence, success patterns, and metadata including trajectory identifier, task intent, step count, and overall outcome classification.
3.1.2 Decision Attribution Analyzer. The Decision Attribution Analyzer performs automated causal analysis to identify which decisions and reasoning steps led to observed outcomes. It analyzes all outcome types—not just failures.
The first stage scans the intermediate representation for outcome indicators across four categories: Failure indicators—failed evaluations, error messages, task incompletion signals; Recovery indicators—failure followed by successful completion, error recognition followed by corrective actions; Inefficiency indicators—repeated operations that could be batched, unnecessary intermediate steps, granular operations where bulk alternatives exist; and Success patterns—clean completion, systematic prerequisite verification, efficient API usage. For each detected outcome, contextual information is extracted as the starting point for causal analysis. Importantly, the outcome location is typically not the cause location.
The causal analysis module uses an LLM to trace backwards through the agent's reasoning steps to identify which decisions led to the observed outcome. For failures, the analysis distinguishes: the immediate cause (what directly triggered the failure), the proximate cause (recent decisions that enabled it), the root cause (the underlying issue that originated the chain), and contributing fac-
what made execution suboptimal, what more efficient alternative exists, why the alternative is better, and whether the agent was aware of the inefficiency. For success patterns, it identifies what strategies contributed to clean success, why they were effective, and what made the approach particularly good.
The final stage generates specific prevention or improvement steps for each attributed decision point. These steps must be actionable—the agent can actually perform them; specific—concrete actions rather than vague advice; causal—directly addressing the root cause; and preventive or improving—stopping similar failures from occurring or specifying more efficient approaches.
3.1.3 Contextual Learning Generator. The Contextual Learning Generator converts decision analyses into reusable memory entries that are actionable, contextually rich, and properly categorized. The key innovation is generating three distinct tip types based on trajectory outcomes.
Strategy tips encode effective patterns from clean successful executions—what worked well and should be replicated. Example:
Content: \"When performing checkout operations, systematically verify all prerequisites (cart contents, shipping address, payment method) before initiating the checkout sequence.\"
Category: strategy
Steps:
1. Call get_cart_items() to verify cart is not empty
2. Call get_shipping_address() to verify address is configured
3. Call get_payment_methods() to verify payment method exists
4.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 15, + "total_chunks": 44, + "char_count": 3332, + "word_count": 430, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b99488d6-5b7d-43f5-9455-d2ff858f4b51", + "text": "Only proceed with checkout if all prerequisites are satisfied
Trigger: \"When task involves checkout, purchase, or payment operations\"
tors. For recoveries, it identifies what enabled the failure, how the agent recognized the problem, what corrective action was taken, and why the correction succeeded. For inefficiencies, it identifies", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 16, + "total_chunks": 44, + "char_count": 406, + "word_count": 58, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c3461736-5e40-4c2b-ab15-33355ded4ac9", + "text": "Recovery tips encode both the failure pattern and the recovery pattern from failure-then-recovery sequences. Example:
Content: \"When checkout fails with 'payment method
3.1.4 Task-Level vs. Subtask-Level Extraction. The tip generation stage can operate at two granularities. Task-level extraction treats an entire trajectory as a unit, producing holistic tips that capture end-to-end execution patterns. 
Subtask-level extraction first decomposes the trajectory into logical subtasks and then extracts focused tips for each subtask independently.
The two approaches offer different tradeoffs. Task-level tips are straightforward to extract and capture overarching strategies spanning the full task. However, their reusability is limited by task specificity: a tip extracted from \"Name the artist most recommended to me on Spotify\" may not transfer to \"Move my go-to-sleep phone alarm to 20 minutes later,\" even though both share common subtasks such as authentication and paginated data retrieval. Task-level tips also bundle concerns from distinct execution phases, reducing
required' error, verify payment configuration and add payment method if missing before retrying.\"
Category: recovery
Steps:
1. Recognize error message indicating missing payment method
2. Call get_payment_methods() to check current configuration
3. If empty, call add_payment_method() with appropriate details
4.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 17, + "total_chunks": 44, + "char_count": 1390, + "word_count": 190, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f11c3838-fc22-40d8-b254-b1b87de061dc", + "text": "Retry the checkout operation
Trigger: \"When checkout or payment operations fail\"
Negative Example: \"Do not simply retry without addressing the missing payment method.\"
retrieval precision.
Subtask-level extraction addresses these limitations by scoping each tip to a single logical phase. Many tasks share common subtasks that generalize across contexts:
• Authentication subtasks follow a common pattern across apps (Spotify, Phone, Venmo): retrieve credentials from a supervisor, login, and store the access token.
• Data retrieval subtasks share pagination patterns: issue paginated API calls, aggregate results, and store them for downstream processing.
• Data processing subtasks involve domain-independent op-
Optimization tips identify efficiency improvements from successful but suboptimal executions. Example:
Content: \"When emptying a shopping cart with multiple items, use empty_cart() instead of iterating remove_from_cart(item_id) for each item.\"
Category: optimization
Steps:
1.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 18, + "total_chunks": 44, + "char_count": 994, + "word_count": 131, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c7d7e686-ea71-4df0-8300-c5dda0f33064", + "text": "Check if cart has multiple items to remove
2. Instead of looping remove_from_cart(), call empty_cart() once
3. Verify cart is empty with get_cart_items()
Trigger: \"When task requires removing all items from cart\"
Negative Example: \"Do not use for i in items: remove_from_cart(i) when emptying the entire cart.\"
erations: counting, filtering, aggregation, and transformation of retrieved data.
• Task completion subtasks are near-universal: reporting results and marking tasks complete.
By extracting tips at this granularity, we enable cross-task transfer (authentication tips from Spotify tasks help with Phone app tasks), better matching (a task about updating alarms retrieves tips from a \"retrieve all alarms\" subtask even if the original task was about deleting alarms), and compositional learning (new complex tasks leverage tips from multiple simpler subtasks).
The system analyzes trajectories to determine contextual dimensions for both generation and retrieval: the application context (which domain the task involves), the task category (type of operation within the domain), and the complexity level. Tip content is generated using specialized prompts for each category, incorporating the relevant execution patterns, and each prompt includes guidelines for generating actionable, specific, generalizable tips.
Each generated memory entry contains: a unique identifier, tip category (strategy, recovery, optimization), actionable content, explanatory purpose, concrete implementation steps, trigger condition, optional negative example, application context (or null for generic tips), task category (or null for generic tips), priority level (critical/high/medium/low based on outcome severity), source trajectory ID, and source outcome description.
The system also generates both domain-specific and generic tips from the same trajectory, maximizing precision and coverage. From a failure involving missing payment APIs in e-commerce checkout, the system generates a domain-specific tip (\"For e-commerce tasks involving checkout, verify payment method is configured before initiating checkout\") and a generic tip (\"When initiating operations that have prerequisites, systematically verify all prerequisites before beginning\").
Two-Phase Extraction Pipeline. The subtask-level extraction operates as a two-phase LLM-based pipeline.
Phase A: Trajectory Segmentation. An LLM analyzes the full agent trajectory and segments it into logical subtasks. For each subtask, the model produces a generalized description (deliberately generic, e.g., \"Authenticate with Spotify\" rather than \"Login as user@gmail.com\"), the set of applications involved, the step range in the original trajectory (maintaining traceability), and the subtask's purpose. The segmentation prompt instructs the model to identify natural boundaries between distinct logical phases—transitions from authentication to data retrieval, from data retrieval to processing, and so on.
For example, a trajectory for \"Name the artist most recommended to me on Spotify\" might be segmented into: (1) discover relevant APIs and their specifications, (2) authenticate with Spotify, (3) retrieve recommended songs via paginated requests, and (4) analyze recommendations to determine the most recommended artist.
Phase B: Per-Subtask Tips Extraction. An LLM independently", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 19, + "total_chunks": 44, + "char_count": 3342, + "word_count": 452, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "19a91586-dc98-4773-b5e4-923ae202d819", + "text": "This dual-level generalization ensures high precision extracts 2–4 actionable tips for each subtask. 
By scoping each exwhen context matches domain-specific tips and broad coverage traction call to a single subtask, the prompts remain focused and the\nthrough generic tips that apply even in novel domains. tips avoid conflating concerns from different execution phases. Trajectory-Informed Memory Generation for Self-Improving Agent Systems Technical Report, Published Feb 2026, Yorktown Heights, NY", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 20, + "total_chunks": 44, + "char_count": 498, + "word_count": 68, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b41f6e35-9a00-4069-a47e-93c2f0139a45", + "text": "tips are constrained to be concrete (specific API patterns rather than • Context removal: Strips task-specific contextual qualifiers\nvague advice), generalizable (avoiding task-specific details such as that do not affect the subtask's core operation. \"Retrieve creparticular email addresses, song names, or payment amounts), and dentials in order to check subscription status\" is reduced to\nactionable (directly applicable by an agent encountering a similar \"Retrieve service account credentials,\" since the downstream\nsubtask). Optionally, different models can be used for Phase A and purpose does not change how credential retrieval should be\nPhase B—a more capable model for segmentation and a lighter performed.\nmodel for per-subtask extraction—to balance cost and quality. Example output for the \"Authenticate with Spotify\" subtask: These transformations are applied using an LLM with a prompt\nthat instructs it to produce maximally abstract descriptions while Tips:\npreserving the core operation. 
The generalized descriptions serve as the basis for clustering: tips whose generalized subtask descriptions are semantically similar are likely to contain overlapping or complementary guidance.\n1. \"Always retrieve account credentials from supervisor.show_account_passwords() before attempting authentication\"\n2. \"Immediately store and validate access tokens after login to ensure successful subsequent API calls\"\n3. \"Filter credentials by app name to select the correct password for the target service\"\nSubtask-level and task-level tips are complementary rather than competing. Task-level tips capture holistic patterns about end-to-end execution strategy (e.g., \"verify all prerequisites before checkout\"), while subtask-level tips capture focused patterns about specific execution phases (e.g., \"use paginated retrieval when fetching large result sets\"). Both levels are stored in the same memory system and can be retrieved together during Phase 3.\n3.2 Phase 2: Tip Storage and Management\n3.2.2 Semantic Clustering. The system clusters tips by computing cosine similarity between the vector embeddings of their generalized subtask descriptions, then applying hierarchical agglomerative clustering with a similarity threshold. Two generalized descriptions such as \"Retrieve service account credentials\" and \"Authenticate with external service\" may describe distinct subtasks despite surface-level relatedness, while \"Retrieve service account credentials\" and \"Obtain application login credentials\" describe the same operation. Hierarchical clustering with an appropriate threshold (empirically, ∼0.85 on generalized descriptions) groups truly equivalent subtask descriptions while keeping distinct operations separate. 
Within each cluster, all associated tips are collected regardless of their source trajectory, task context, or extraction granularity. A cluster for \"Retrieve service account credentials\" might contain tips from Spotify authentication trajectories, Venmo login trajectories, and Phone app credential retrieval—all reflecting the same underlying subtask pattern observed across different tasks.\n3.2.3 Tip Consolidation and Merging. Within each cluster, an LLM-based consolidation process merges redundant tips, resolves conflicts, and produces a curated set of non-overlapping guidance. The consolidation operates in three steps:\nDeduplication. Tips with near-identical content are identified and merged. \"Always call show_account_passwords() before login\" and \"Retrieve credentials using the supervisor password API\nAs tips accumulate from many trajectories across diverse tasks, the memory system must address redundancy, inconsistency, and scalability. Two trajectories involving e-commerce checkout may independently produce tips about verifying payment methods; dozens of trajectories across different apps will produce authentication-related tips with overlapping guidance. Without consolidation, the memory grows linearly with the number of processed trajectories, retrieval quality degrades as near-duplicate tips compete for limited prompt space, and contradictory tips from different trajectories may confuse the agent.\nPhase 2 addresses these challenges through a pipeline of subtask description generalization, semantic clustering, and LLM-based tip consolidation.\n3.2.1 Subtask Description Generalization.",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 21,
    "total_chunks": 44,
    "char_count": 4346,
    "word_count": 561,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "6a324e8b-4476-433e-806d-3cd2c54fd7ce",
    "text": "Subtask descriptions produced by Phase 1 contain varying levels of specificity that hinder clustering. \"Retrieve Spotify password for john.doe@email.com using supervisor API,\" \"Get Venmo login credentials for user alice_smith,\" and \"Fetch Phone app password from supervisor\" all describe the same abstract operation: retrieving service credentials. To enable meaningful clustering, the system generalizes subtask descriptions through three transformations:\n• Entity abstraction: Replaces specific user names, email addresses, app names, item IDs, and other entity references with generic placeholders. \"Retrieve Spotify password for john.doe@email.com\" becomes \"Retrieve service account\nbefore authentication\" convey the same guidance; the consolidation produces a single canonical tip that captures the shared insight.\nConflict resolution. When tips from different trajectories offer contradictory guidance (e.g., one tip recommends retrying failed authentication immediately while another recommends re-retrieving credentials first), the system uses outcome metadata—tip category, priority level, and source trajectory success/failure status—to determine which guidance is more reliable. Tips derived from successful trajectories take precedence over those from failed ones, and recovery tips that encode proven correction patterns take precedence over speculative prevention strategies.\nSynthesis. 
Complementary tips that address different aspects of the same subtask are synthesized into coherent, comprehensive guidance. If one tip covers credential retrieval and another covers token validation after login, the consolidated output combines both into a single tip with ordered steps covering the full authentication workflow.\ncredentials.\"\n• Action normalization: Maps semantically equivalent verbs and phrases to canonical forms. \"Get,\" \"fetch,\" \"retrieve,\" and \"obtain\" are normalized to a single canonical verb. \"Log in,\" \"sign in,\" and \"authenticate\" are similarly unified.\nThe consolidation also produces a canonical cluster description—a single generalized subtask description that represents the cluster for retrieval purposes. This description is re-embedded and stored alongside the consolidated tips, replacing the individual per-trajectory descriptions.\nthe threshold, preventing prompt bloat when many stored tasks are moderately similar.\nIn practice, these two mechanisms are combined: the system retrieves all tips with similarity ≥𝜏, then selects the top 𝑘 by similarity score.",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 22,
    "total_chunks": 44,
    "char_count": 2574,
    "word_count": 336,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e0b96541-1c97-47c0-9894-5828b29cefa7",
    "text": "3.3.2 LLM-Guided Selection. A more expressive approach uses an\nTypical values are 𝜏∈[0.5, 0.7] and 𝑘∈[5, 10].\n3.2.4 Storage Representation. Each consolidated memory entry is stored with two complementary representations. 
The vector embedding is a dense vector computed from the tip content and purpose using a text embedding model. This captures semantic meaning, enabling similarity search across different terminology—for instance, a tip about \"renewing a subscription\" can match a task description mentioning \"extending my membership,\" and a tip about \"scheduling a recurring event\" can match \"set up a weekly meeting.\"\nThe structured metadata consists of filterable attributes: tip category (strategy, recovery, optimization), priority level, application context, task category, source trajectory IDs (plural, since consolidated tips may derive from multiple trajectories), and creation timestamp.\nTips are indexed by their canonical cluster description for subtask-level tips, and by the original task description for task-level tips, creating natural groupings that enable retrieval at both granularities.\n3.3 Phase 3: Runtime Retrieval\nLLM at retrieval time to analyze the task description 𝑑, detect the application context and task category, and reason about which types of guidance are most relevant. The LLM constructs a structured retrieval query that combines:\n• Metadata filters: The LLM identifies that a task about \"Complete my pending Venmo payment requests\" involves the Venmo application and payment operations, and constrains retrieval to tips from the payment domain (or generic tips with null application context).\n• Category awareness: Based on the task description, the LLM may determine that recovery tips are particularly relevant (e.g., the task mentions retrying a failed payment) or that strategy tips should be prioritized (e.g., the task involves a multi-step workflow).\nLLM-guided selection is more expensive (requiring an additional LLM call per task) but can reason about nuances that pure embedding similarity misses. 
For instance, an LLM can recognize that \"Delete all my read emails older than 30 days\" and \"Clean up my inbox by removing old messages\" are the same task even when their embeddings diverge, and it can infer that a task involving \"checkout\" implies payment-related tips are relevant even if \"payment\" is never mentioned in the task description.\nWhen an agent is invoked to execute a new task with description 𝑑, the system retrieves relevant tips from memory and injects them into the agent's prompt as guidelines before reasoning begins. The retrieval strategy directly affects whether the agent receives relevant, actionable guidance or is distracted by irrelevant tips. We consider two strategies with different cost-accuracy tradeoffs.\n3.3.1 Cosine Similarity Retrieval. Cosine similarity retrieval is simple, fast, and requires no LLM calls at runtime—making it suitable for latency-",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 23,
    "total_chunks": 44,
    "char_count": 2958,
    "word_count": 440,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "97ab7e49-397e-44f7-a65e-71b67fbf6af5",
    "text": "sensitive or cost-constrained deployments. LLM-guided selection provides richer reasoning about task context at the cost of an additional LLM invocation. We evaluate both strategies empirically in Section 4.\nThe most straightforward approach embeds the incoming task description 𝑑 and computes cosine similarity against the embeddings of stored task (and subtask) descriptions. Tips associated with the most similar stored descriptions are retrieved and injected into the prompt. 
This strategy requires no LLM calls at retrieval time and is fast and inexpensive—a pure vector database lookup.\n3.3.3 Prompt Integration.",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 24,
    "total_chunks": 44,
    "char_count": 594,
    "word_count": 85,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3dbd0712-61ed-47d1-bfad-53f5d27f8eea",
    "text": "Regardless of retrieval strategy, the selected tips are injected into the agent's prompt as a \"guidelines\" section positioned after the task context but before the standard agent instructions. Each tip is formatted to be quickly scannable and actionable, highlighting priority level, category, actionable content, purpose, implementation steps, and trigger condition. For example:\n[PRIORITY: HIGH] Recovery Tip:\nWhen a login attempt fails with \"invalid credentials,\" verify you are using the correct app-specific password by re-calling\nTwo complementary mechanisms control which tips are selected:\n• Similarity threshold 𝜏: Only tips whose source description has cosine similarity ≥𝜏 with 𝑑 are eligible. A high threshold (e.g., 𝜏≥0.85) ensures retrieved tips are closely related to the current task, but risks excluding tips from tasks that are semantically equivalent yet phrased differently. For example, \"I want an Amazon Prime membership\" and \"Sign me up for Amazon Prime\" describe the same task but may have cosine similarity below 0.85 due to lexical differences. 
A low threshold (e.g., 𝜏≤0.6) casts a wider net, but risks pulling in tips from unrelated tasks—tips from \"Book a flight to New York\" are unlikely to help an agent executing \"Update my calendar for next week,\" yet both involve scheduling-adjacent language that could produce moderate similarity scores.\nsupervisor.show_account_passwords() and filtering by the target app name.\nApply when: Authentication fails on any app after an initial login attempt.\nSteps:\n1. Re-retrieve credentials from supervisor\n2. Filter for the specific app name (exact match)\n3. Retry login with the correct credentials\n• Top-𝑘 selection: After filtering by threshold, the system selects the 𝑘 highest-scoring tips. This bounds the number of tips injected into the prompt regardless of how many pass",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 25,
    "total_chunks": 44,
    "char_count": 2003,
    "word_count": 292,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "4ea58e86-dc51-4cb5-8ddc-f0e59db77fdb",
    "text": "determines the task is complete or encounters an unrecoverable failure. Both the agent and the tip extraction pipeline use GPT-4.1.\nThis formatting enables agents to quickly identify the type of guidance, prioritize critical tips, and understand both what to do and why. 
The prompt integration creates a feedback loop: agents receiving relevant tips avoid failure patterns, execute more efficiently, and apply successful strategies, producing higher-quality trajectories that reinforce the memory system's value.\n4 Evaluation\nWe evaluate our trajectory-informed memory generation framework on the AppWorld benchmark, a comprehensive evaluation suite for LLM agents that perform complex tasks across multiple applications. Our evaluation examines two dimensions: (1) the effect of tip extraction granularity (task-level vs. subtask-level tips), and (2) the effect of retrieval strategy (cosine similarity vs. LLM-guided selection). The evaluation demonstrates that agents equipped with learned memory from past executions substantially outperform agents without memory, with particularly strong improvements on challenging tasks.\n4.1 Experimental Setup\nThe base agent (without memory) receives only the task instruction and standard prompting that includes its role description, available APIs, and general guidelines for task execution. The memory-enhanced agent additionally receives retrieved tips from the memory system, injected into the prompt before the agent begins reasoning.\n4.1.3 Tip Extraction Configurations. We evaluate two tip extraction granularities:\nTask-level tips are extracted from entire trajectories as described in Section 3.1.3. Each trajectory produces a holistic set of strategy, recovery, and optimization tips that capture end-to-end execution patterns. Task-level tips are well-suited for capturing overarching strategies (e.g., \"verify all prerequisites before checkout\") but may bundle unrelated concerns from different execution phases. Subtask-level tips are extracted using the two-phase pipeline described in Section 3.1.4. Trajectories are first segmented into logical subtasks (authentication, data retrieval, data processing,\n4.1.1 Benchmark Description. 
AppWorld is a benchmark designed to evaluate LLM agents on realistic task completion across diverse application domains. The benchmark contains tasks spanning e-commerce, email, calendar, file management, and other common application scenarios.\netc.), and tips are then extracted independently for each subtask. Subtask-level tips are more focused and reusable across tasks that share common subtasks. Both tip types were generated from agent executions on the",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 26,
    "total_chunks": 44,
    "char_count": 2651,
    "word_count": 351,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5b5c2626-7d42-45f6-8f54-ec5e7575a6a1",
    "text": "AppWorld training and development partitions, processed through our pipeline.\nEach task consists of a natural language instruction that the agent must execute by interacting with APIs provided for various applications.",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 27,
    "total_chunks": 44,
    "char_count": 216,
    "word_count": 28,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7d65fd8e-81b4-497b-bcba-96ce65d2358b",
    "text": "The benchmark includes two key evaluation metrics:\n4.1.4 Retrieval Strategy Configurations. 
We evaluate two retrieval strategies for selecting which tips to inject into the agent's prompt at runtime:\nCosine similarity retrieval embeds the task instruction using a text embedding model and retrieves the top-𝑘 tips whose vector embeddings have the highest cosine similarity to the query embedding. This is a standard retrieval approach that requires no LLM calls at retrieval time and is fast and inexpensive.\nLLM-guided selection uses an LLM to analyze the task instruction, detect the application context and task category, and construct a retrieval query that combines semantic similarity with metadata filtering and priority-weighted ranking.\nTask Goal Completion (TGC) measures the percentage of individual tasks where the agent passes all programmatic unit tests, which verify correct API usage, database state changes, and expected end states. Each task is a complex, multi-step, app-based challenge that typically requires multiple API calls across an average of 1.8 apps and 9.5 APIs. A task is successful only if all unit tests pass.\nScenario Goal Completion (SGC) measures the percentage of scenarios where the agent correctly completes all task variants (typically three) associated with a given scenario, testing for consistency across related tasks. A scenario is only counted as successful",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 28,
    "total_chunks": 44,
    "char_count": 1397,
    "word_count": 200,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7dcdedf8-00c9-49f7-aee8-bb922da42d47",
    "text": "if every variant passes, making this a stricter metric than TGC.\nThis approach is more expressive—it can reason about which tip categories are most relevant and ensure critical tips surface—but requires an additional LLM call at retrieval time.\nFor both strategies, the top 5 tips are retrieved and injected into the agent's prompt before reasoning begins.\n4.1.5 Evaluation Protocol. We evaluated configurations on three partitions of AppWorld: (1) the test-normal partition, which contains held-out tasks not seen during memory generation, measuring the agent's ability to generalize learned patterns to novel tasks; (2) the train partition, from which tips were generated, measuring how effectively tips improve performance when the same task is encountered again; and (3) the dev partition, also used during tip generation, providing a complementary view.\nTasks in AppWorld are categorized by difficulty level:\n• Difficulty 1 (Easy): Simple tasks requiring basic API interactions, typically single-domain with straightforward execution sequences\n• Difficulty 2 (Medium): Moderate complexity tasks that may span multiple domains or require conditional logic and error handling\n• Difficulty 3 (Hard): Complex multi-step tasks requiring careful planning, prerequisite management, cross-domain coordination, and robust error recovery, often involving 50+ lines of equivalent code and up to 26 APIs\n4.1.2 Agent Configuration. We evaluate using a single-agent configuration implementing a simplified ReAct-style reasoning and action loop. 
The agent iteratively reasons about the current task state, selects actions to take, executes those actions via API calls, and observes the results. The agent continues this loop until it\nEach task was executed independently with a maximum of 30 reasoning-action steps. Task and scenario goal completion were assessed using AppWorld's automated evaluation framework, which",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 29,
    "total_chunks": 44,
    "char_count": 1984,
    "word_count": 285,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b4e6c670-2d4c-4ab1-a3bb-63bd67688a8b",
    "text": "verifies that all explicit requirements (task goals) and implicit requirements (scenario goals) are satisfied by examining the final state of all involved applications.\nconfigurations to examine the effect of threshold and top-𝑘 selection.\nTable 3: Task-Level Tips + Cosine (𝜏≥0.5, top-3): Test-Normal\nType Task Goal Scenario Goal\nAggregate 66.7 48.2\nDifficulty 1 86.0 68.4\nDifficulty 2 70.8 56.2\nDifficulty 3 46.0 23.8\n4.2 Held-Out Results (Test-Normal)\nThe test-normal partition contains tasks not seen during memory generation, providing the most rigorous evaluation of the memory system's ability to generalize learned patterns to novel tasks. We present results for multiple configurations.\n4.2.1 Subtask-Level Tips with LLM-Guided Selection. Tables 1 and 2 present results for subtask-level tips with LLM-guided selection—the best-performing configuration for scenario goal completion. 
Table 1: Subtask Tips + LLM Selection: Test-Normal\nType Task Goal Scenario Goal\nAggregate 73.2 64.3\nDifficulty 1 91.2 89.5\nDifficulty 2 70.8 56.2\nDifficulty 3 58.7 47.6\nTable 2: Baseline Agent (No Memory): Test-Normal\nType Task Goal Scenario Goal\nAggregate 69.6 50.0\nDifficulty 1 89.5 79.0\nDifficulty 2 66.7 56.2\nDifficulty 3 54.0 19.1\nTable 4: Task-Level Tips + Cosine (𝜏≥0.6): Test-Normal\nType Task Goal Scenario Goal\nAggregate 72.0 62.5\nDifficulty 1 91.2 84.2\nDifficulty 2 72.9 68.8\nDifficulty 3 54.0 38.1\nTable 5: Task-Level Tips + Cosine (𝜏≥0.5): Test-Normal\nType Task Goal Scenario Goal\nAggregate 70.2 57.1\nDifficulty 1 91.2 84.2\nDifficulty 2 64.6 43.8\nDifficulty 3 55.6 42.9\nThe three cosine similarity configurations reveal important interactions between threshold, top-𝑘 selection, and task complexity.\nTop-𝑘 restriction hurts performance. The most restrictive configuration (𝜏≥0.5, top-3) performs below the baseline at the aggregate level (66.7% TGC, 48.2% SGC), a drop of −2.9 pp and −1.8 pp respectively. The top-3 restriction limits the agent to tips from only three matched task descriptions, which may exclude relevant guidance. This is especially damaging for complex tasks:\nThe memory-enhanced agent achieves 73.2% TGC compared to 69.6% for the baseline (+3.6 pp) and 64.3% SGC compared to 50.0% (+14.3 pp). The larger SGC improvement suggests that the memory system not only helps agents complete individual tasks correctly but substantially improves consistency across task variants within scenarios.",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 30,
    "total_chunks": 44,
    "char_count": 2412,
    "word_count": 353,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "382b10d0-0ded-4d22-a72d-74ddac22aeb1",
    "text": "Since SGC requires all variants to pass, it is sensitive to sporadic failures—exactly the brittleness that learned tips help mitigate.\nThe benefits scale with task complexity. Difficulty 1 tasks show improvements of +1.7 pp TGC and +10.5 pp SGC, with the baseline already achieving high TGC. Difficulty 2 tasks show +4.1 pp TGC with no SGC change, benefiting from learned patterns around error handling and prerequisite verification. Difficulty 3 tasks show the most dramatic improvements: +4.7 pp on TGC and a remarkable +28.5 pp on SGC (19.1% →47.6%), a 149% relative increase. These complex tasks require sophisticated planning and robust\nDifficulty 3 drops to 46.0% TGC (−8.0 pp from baseline).\nThreshold 𝜏= 0.6 is the sweet spot. The configuration with 𝜏≥0.6 (no top-𝑘 restriction) achieves the strongest overall results among cosine similarity configurations: 72.0% TGC (+2.4 pp) and 62.5% SGC (+12.5 pp). This threshold strikes an effective balance: tight enough to exclude tips from unrelated tasks, yet loose enough to capture semantically equivalent task descriptions that differ lexically (e.g., \"I want an Amazon Prime membership\" and \"Sign me up for Amazon Prime\"). The Difficulty 3 SGC improvement is striking: 19.1% →38.1% (+19.0 pp), a 99% relative increase.\nLower threshold includes noise.",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 31,
    "total_chunks": 44,
    "char_count": 1305,
    "word_count": 200,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "25f0ae16-fd9f-408c-a4f6-cbfb0172bd74",
    "text": "Dropping to 𝜏≥0.5 (no top-𝑘) yields 70.2% TGC (+0.6 pp) and 57.1% SGC (+7.1 pp)—better than the baseline but weaker than 𝜏≥0.6 on both metrics. The lower threshold admits tips from marginally related tasks, diluting the signal. Interestingly, Difficulty 3 TGC is slightly higher with 𝜏≥0.5 (55.6%) than with 𝜏≥0.6 (54.0%), suggesting that for the most complex tasks, casting a wider net occasionally surfaces useful tips from loosely related tasks. However, the reverse pattern holds for Difficulty 2 (64.6% vs. 72.9%), where the noise from irrelevant tips is more damaging.\nerror recovery—areas where the memory system provides the most guidance.\n4.2.2 Task-Level Tips with Cosine Similarity Retrieval. We next evaluate task-level tips with cosine similarity retrieval. Task-level tips extract holistic insights from entire trajectories rather than decomposing them into subtasks. At retrieval time, the incoming task description is embedded and compared against stored task description embeddings; tips from descriptions exceeding a similarity threshold 𝜏 are retrieved. We evaluate three retrieval parameter",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 32,
    "total_chunks": 44,
    "char_count": 1111,
    "word_count": 161,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "787bf578-fb05-4498-9781-1fb86467bacf",
    "text": "4.2.3 Subtask-Level Tips with Cosine Similarity Retrieval. To isolate the effect of the retrieval strategy from the effect of tip granularity, we also evaluate subtask-level tips with cosine similarity retrieval (𝜏≥0.6, no top-𝑘)—the same retrieval parameters as the best task-level cosine configuration, but with subtask-level tips instead.\nTable 6: Subtask Tips + Cosine (𝜏≥0.6): Test-Normal\nType Task Goal Scenario Goal\nAggregate 73.8 57.1\nDifficulty 1 91.2 73.7\nDifficulty 2 72.9 56.2\nDifficulty 3 58.7 42.9\nfor this by reasoning about the overall task context and ensuring consistent tip selection across variants.\nAll configurations substantially outperform the baseline, confirming that the memory system provides genuine value regardless of the specific configuration chosen. The best configuration depends on the deployment objective: subtask-level tips with LLM-guided selection for the best overall performance, subtask-level tips with cosine similarity for the highest individual task accuracy at lower retrieval cost, or task-level tips with cosine similarity for a strong balance without LLM retrieval overhead.\n4.3 Source Partition Results (Train and Dev)\nThe train and dev partitions were used during tip generation: tips were extracted from agent trajectories on these tasks. 
Results on these partitions measure a distinct scenario from test-normal: what happens when the agent encounters the same or structurally identical tasks again, augmented with tips derived from its own prior executions? This setting evaluates the memory system's ability to enable self-improvement on recurring tasks, complementing the generalization evaluation on test-normal.\nTables 8–11 present results for subtask-level tips with LLM-guided selection on the source partitions.\nAs expected, improvements on the source partitions are larger than on test-normal: +4.4 pp TGC / +10.0 pp SGC on train, and +12.3 pp TGC / +26.3 pp SGC on dev. Tips are most contextually relevant when the agent encounters tasks structurally similar to those from which the tips were derived, so these larger gains are expected.\nThis configuration achieves 73.8% TGC (+4.2 pp over baseline)—the highest TGC of any configuration—and 57.1% SGC (+7.1 pp). Comparing with subtask-level tips with LLM-guided selection (Table 1) isolates the effect of the retrieval strategy while holding tip granularity constant: TGC is slightly higher with cosine retrieval (73.8% vs. 73.2%), but SGC drops substantially (57.1% vs. 64.3%, a 7.2 pp gap). This divergence is most pronounced on Difficulty 3, where SGC drops from 47.6% to 42.9%. The LLM-guided selection's ability to reason about task context and prioritize tip categories appears critical for cross-variant consistency, even though simple cosine retrieval suffices (and marginally excels) for individual task completion.\nTwo partition-specific patterns are worth noting. On train Difficulty 1 tasks where the baseline already achieves 100%, the memory-enhanced agent scores slightly lower (94.4% TGC, 83.3% SGC), suggesting that for simple tasks where the agent already performs optimally, injecting additional tips can introduce minor interference. On dev, the Difficulty 3 baseline already achieves 100% TGC and 100% SGC, so the aggregate dev gains (+12.3 pp TGC, +26.3 pp SGC) are driven entirely by Difficulty 1 and 2 improvements.\n4.2.4 Configuration Comparison. Table 7 compares all configurations on the held-out test-normal partition, using 𝜏≥0.6 (no top-𝑘) for both cosine similarity configurations.\nThe three configurations reveal a clear separation between what drives task goal completion versus scenario goal completion.\nTip granularity drives TGC. Subtask-level tips outperform task-level tips on TGC regardless of retrieval strategy: 73.8% (cosine) and 73.2% (LLM-guided) versus 72.0% (task-level cosine). The finer",
    "paper_id": "2603.10600",
    "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems",
    "authors": [
      "Gaodan Fang",
      "Vatche Isahagian",
      "K. R. Jayaram",
      "Ritesh Kumar",
      "Vinod Muthusamy",
      "Punleuk Oum",
      "Gegi Thomas"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10600v1",
    "chunk_index": 33,
    "total_chunks": 44,
    "char_count": 3984,
    "word_count": 574,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "6fcabf84-0f2a-42d0-b357-0a6958bb35f9",
    "text": "grained decomposition into reusable subtask patterns provides more targeted guidance for completing individual tasks, particularly for Difficulty 3 tasks where subtask-level tips yield 58.7% TGC versus 54.0% for task-level (+4.7 pp).\nIn both cases, the overall gains on the tasks that benefit from memory substantially outweigh any ceiling or interference effects.\nRetrieval strategy drives SGC. 
LLM-guided selection dramati- 4.4 Cross-Configuration Summary\ncally improves scenario goal completion compared to cosine simi- Table 12 summarizes the aggregate improvements for subtask-level\nlarity at the same tip granularity: 64.3% versus 57.1% for subtask- tips with LLM-guided selection across all three partitions.\nlevel tips (+7.2 pp). This gap is consistent across difficulty levels, Several observations emerge. First, the memory system improves\nwith Difficulty 1 showing the largest difference (89.5% vs. 73.7%, performance on all three partitions, confirming that the benefits\n+15.8 pp).", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 34, + "total_chunks": 44, + "char_count": 993, + "word_count": 137, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "401e8083-ae89-497f-9041-7f99c60d6055", + "text": "The LLM's ability to reason about task context, prioritize are not limited to tasks that generated the tips. The test-normal\ntip categories, and apply metadata filters produces more consis- gains (+3.6 TGC, +14.3 SGC) demonstrate genuine generalization\ntent guidance across task variants within a scenario, reducing the to unseen tasks. Second, the source partitions show larger TGC\nsporadic failures that SGC penalizes. improvements, as expected—tips are most contextually relevant\nInteraction effect. Interestingly, task-level tips with cosine when the agent re-encounters tasks from which the tips were desimilarity achieve higher SGC (62.5%) than subtask-level tips with rived. Interestingly, the test-normal SGC gain (+14.3 pp) exceeds\ncosine similarity (57.1%), despite lower TGC. 
Task-level tips encode the train SGC gain (+10.0 pp), suggesting that the subtask-level\nholistic end-to-end strategies that promote uniform execution pat- decomposition and LLM-guided retrieval generalize particularly\nterns across related task variants, while subtask-level tips—though well for improving cross-variant consistency. Third, the SGC immore precise for individual task completion—may retrieve different provements consistently exceed the TGC improvements across all\nsubsets of subtask tips for different variants of the same scenario, in- partitions, indicating that the memory system is particularly eftroducing behavioral variance. LLM-guided selection compensates fective at improving consistency across task variants.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 35, + "total_chunks": 44, + "char_count": 1521, + "word_count": 202, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4f2e5e73-f795-47ce-9efe-7fed3bce0287", + "text": "Technical Report, Published Feb 2026, Yorktown Heights, NY Fang et al. Table 7: Configuration Comparison on Test-Normal (Aggregate) Tip Granularity Retrieval Strategy TGC Δ TGC SGC Δ SGC\nBaseline (no memory) 69.6 — 50.0 —\nSubtask-level LLM-guided selection 73.2 +3.6 64.3 +14.3\nSubtask-level Cosine sim. (𝜏≥0.6) 73.8 +4.2 57.1 +7.1\nTask-level Cosine sim. (𝜏≥0.6) 72.0 +2.4 62.5 +12.5 Table 8: Subtask Tips + LLM Selection: Train 5.1 Memory Taxonomies and Architectures\nTwo recent surveys provide comprehensive taxonomies of memory\nType Task Goal Scenario Goal in LLM-based agents. Zhang et al. 
[17] organize the design space along three dimensions—memory sources (agent-environment interactions, internal reasoning, user feedback), memory forms (natural language, embeddings, databases, structured knowledge), and memory operations (read, write, reflect, manage)—and identify key limitations of existing work: overly simplistic representations, unsophisticated operations for deciding what to remember or forget, and fragmented evaluation. Du et al. [4] take a complementary operations-centric view, defining six atomic memory operations: consolidation, updating, indexing, forgetting, retrieval, and compression. In their vocabulary, our tip extraction constitutes a form of consolidation (converting raw trajectories into abstract tips), tip refinement is updating, and selective retention is forgetting. Both surveys note that most existing systems store raw or lightly processed text, lacking the structured abstraction and quality-aware curation that effective agent memory requires. Our framework directly addresses these identified gaps.\nAggregate 91.1 83.3\nDifficulty 1 94.4 83.3\nDifficulty 2 88.9 83.3\nDifficulty 3 88.9 83.3\nTable 9: Baseline Agent (No Memory): Train\nType Task Goal Scenario Goal\nAggregate 86.7 73.3\nDifficulty 1 100.0 100.0\nDifficulty 2 77.8 58.3\nDifficulty 3 77.8 50.0\nTable 10: Subtask Tips + LLM Selection: Dev", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. 
Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 36, + "total_chunks": 44, + "char_count": 1940, + "word_count": 270, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3829f22a-fe01-4b22-a52a-b31e90bcfa40", + "text": "Type Task Goal Scenario Goal\nAggregate 89.5 73.7\nDifficulty 1 90.0 80.0\nDifficulty 2 87.5 62.5\nDifficulty 3 100.0 100.0\nTable 11: Baseline Agent (No Memory): Dev\nType Task Goal Scenario Goal\nAggregate 77.2 47.4\nDifficulty 1 80.0 60.0\nDifficulty 2 70.8 25.0\nDifficulty 3 100.0 100.0\n5.2 Semantic Memory Systems\nThe most widely deployed agent memory systems operate at the semantic level, storing factual knowledge extracted from interactions. Mem0 [2] extracts and consolidates factual snippets—user preferences, entities, relationships—from conversations into a vector store, achieving strong latency and token efficiency for conversational personalization. A-MEM [15] introduces a self-organizing memory architecture inspired by the Zettelkasten method, where each memory is stored as a structured note with contextual descriptions, keywords, and explicit links to related memories, creating an emergent knowledge network. While both systems are well-engineered for their purposes, they fundamentally store declarative knowledge (what is known) rather than procedural or experiential knowledge (what to do and what was learned from doing it).", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. 
Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 37, + "total_chunks": 44, + "char_count": 1148, + "word_count": 164, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "01a6528d-dea4-436d-8a5f-4120cff81fa1", + "text": "They have no mechanism for analyzing execution trajectories, performing causal attribution of failures, or generating categorized behavioral guidance. Our framework addresses this gap by extracting structured, actionable tips from execution experience rather than conversational facts.\ntips and strategy tips encode prerequisite verification and error handling patterns that reduce behavioral variance, enabling the agent to reliably complete all variants rather than succeeding on some and failing on others.\n5 Related Work\nOur work sits at the intersection of agent memory systems, trajectory-based learning, and self-improving agents. We organize related work along three axes: memory architectures for LLM agents, systems that learn from execution trajectories, and approaches to agent self-improvement through experience.\n5.3 Learning from Execution Trajectories\nA growing body of work addresses how agents can learn from their past execution traces, which is most directly related to our contribution.\nWorkflow and procedure extraction. Agent Workflow Memory (AWM) [13] extracts reusable multi-step workflows from successful agent trajectories in web navigation, achieving 24.6% and 51.1% relative improvements on Mind2Web and WebArena respectively. AWM demonstrates a compelling \"snowball effect\" where simple workflows compose into more complex ones.\nTrajectory-Informed Memory Generation for Self-Improving Agent Systems Technical Report, Published Feb 2026, Yorktown Heights, NY\nTable 12: Summary of Aggregate Improvements: Subtask Tips + LLM Selection\nPartition Task Goal (Baseline) Task Goal (+Memory) Scenario Goal (Baseline) Scenario Goal (+Memory)\nTest-Normal 69.6 73.2 (+3.6) 50.0 64.3 (+14.3)\nTrain 86.7 91.1 (+4.4) 73.3 83.3 (+10.0)\nDev 77.2 89.5 (+12.3) 47.4 73.7 (+26.3)\nselective deletion yields a 10% absolute performance gain over naive memory growth. These findings directly motivate our structured", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 38, + "total_chunks": 44, + "char_count": 1927, + "word_count": 264, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "306e733d-21c4-4cca-a21d-291e77bf7bbf", + "text": "approach: by extracting abstract tips with explicit applicability conditions rather than storing raw trajectories, and by categorizing tips with metadata for precise contextual matching, our framework mitigates both failure modes.\n6 Conclusions\nHowever, AWM only learns from successful trajectories—it has no mechanism for extracting lessons from failures, recoveries, or inefficient executions. Mem𝑝 [5] treats procedural memory as a first-class optimization object, systematically exploring strategies for building memory from trajectories, retrieving relevant procedures, and updating entries over time. 
While Mem𝑝 addresses the full memory lifecycle, it focuses on procedural instructions (\"how to do X\") rather than the diagnostic behavioral insights (\"what went wrong and why\") that our tip categories capture. AgentRR [6] borrows the record-and-replay paradigm from software engineering, recording complete agent interaction traces and summarizing them into structured experiences for future replay. Like AWM, it primarily learns from successful executions.\nReasoning and strategy extraction. ReasoningBank [9] is among the closest works to ours, distilling generalizable reasoning strategies from an agent's self-judged successful and failed experiences. It shares our insight that agents should learn from both successes and failures.\nWe presented a framework for automatically extracting actionable learnings from LLM-agent execution trajectories and storing them as structured memory tips that improve future agent performance. Our four-component pipeline—trajectory intelligence extraction, decision attribution analysis, contextual learning generation, and adaptive memory retrieval—captures the full spectrum of learning opportunities across failures, recoveries, inefficient successes, and clean successes. Evaluation on the AppWorld benchmark demonstrates consistent improvements, with up to 14.3 percentage point gains in scenario goal completion on held-out tasks, and particularly strong benefits on complex, multi-step tasks (28.5 pp SGC improvement, a 149% relative increase). The framework naturally extends", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. 
Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 39, + "total_chunks": 44, + "char_count": 2135, + "word_count": 276, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fbfb0611-da63-47b0-b555-44c9ff71f25a", + "text": "to multi-agent systems with cross-agent attribution and agent-role-aware guidance, which we leave to future work. We also plan to evaluate the framework with additional state-of-the-art and open-source models—such as Qwen [12] and GPT-OSS [1]—to assess how tip quality and retrieval effectiveness vary across model families. The techniques described in this paper are being applied to IBM's Configurable Generalist Agent (CUGA) [7, 8] platform for building and deploying enterprise agentic systems, where trajectory-informed memory enables agents to continuously improve from operational experience.\nThe key distinction is in abstraction level: ReasoningBank focuses on meta-cognitive reasoning strategies, while our tips focus on concrete behavioral guidance derived from specific execution patterns. The two approaches are complementary.\nContext engineering and self-improvement. ACE (Agentic Context Engineering) [16] treats an agent's context as an evolving \"playbook\" that accumulates and refines strategies through a generate-reflect-curate cycle, achieving a 10.6 percentage point improvement on AppWorld. Our framework differs from ACE in several respects: we produce structured memory entries with typed categories (strategy, recovery, optimization), rich metadata, and selective retrieval rather than an evolving text document included\nReferences\n[1] . 2025.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. 
Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 40, + "total_chunks": 44, + "char_count": 1369, + "word_count": 183, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ee0eabd2-682f-441b-a36c-4c30913efdac", + "text": "TODO: Add GPT-OSS reference. Placeholder — please replace with the correct GPT-OSS citation.\n[2] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. 2025. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv preprint arXiv:2504.19413 (2025).\nin full; we perform explicit causal attribution tracing outcomes to specific decisions; and we maintain provenance tracking from tips to source trajectories. Experience replay with learned retrieval.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 41, + "total_chunks": 44, + "char_count": 494, + "word_count": 66, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "585e6976-1fab-404b-8172-dccc170b0073", + "text": "[3] Chad DeChant. 2025. Episodic Memory in AI Agents Poses Risks That Should Be Studied and Mitigated. arXiv preprint arXiv:2501.11739 (2025).\n[4] Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sébastien Montella, Mirella Lapata, Kam-Fai Wong, and Jeff Z. Rethinking Memory\nMemento [18] introduces a memory-augmented MDP formalization where a learned neural policy selects which stored trajectories to retrieve for a given task. 
However, Memento stores raw trajectories without abstracting them into transferable insights—the consolidation from trajectory to actionable lesson is left to the LLM's in-context reasoning.\n5.4 Empirical Foundations\nXiong et al. [14] provide critical empirical grounding for trajectory-based memory systems, identifying the experience-following property and two failure modes: error propagation and misaligned experience replay. They find that combining selective addition with\nin AI: Taxonomy, Operations, Topics, and Future Directions. arXiv preprint\n[5] Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. 2025. Mem𝑝: Exploring Agent Procedural Memory. arXiv preprint arXiv:2508.06433 (2025).\n[6] Erhu Feng, Wenbo Zhou, Zibin Liu, Le Chen, Yunpeng Dong, Cheng Zhang, Yisheng Zhao, Dong Du, Zhichao Hua, Yubin Xia, and Haibo Chen. 2025. Get Experience from Practice: LLM Agents with Record & Replay. arXiv preprint arXiv:2505.17716 (2025).\n[7] IBM. 2025. CUGA: Configurable Generalist Agent. https://github.com/cugaproject/cuga-agent.\n[8] Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov, Ido Levy, Offer Akrabi, Aviad Sela, Asaf Adi, and Nir Mashkif. 2025. Towards Enterprise-Ready Computer Using", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 42, + "total_chunks": 44, + "char_count": 1709, + "word_count": 239, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "53d3c5c1-ee4c-432c-80b6-26e444c6139a", + "text": "Technical Report, Published Feb 2026, Yorktown Heights, NY Fang et al. Generalist Agent. arXiv preprint arXiv:2503.01861 (2025). 
Agents: An Empirical Study of Experience-Following Behavior. arXiv preprint arXiv:2505.16067 (2025).\n[9] Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister. 2025. ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory. arXiv preprint arXiv:2509.25140 (2025).\n[15] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. 2025. A-MEM: Agentic Memory for LLM Agents. arXiv preprint arXiv:2502.12110 (2025).\n[16] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish\n[10] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E.", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 43, + "total_chunks": 44, + "char_count": 980, + "word_count": 138, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "cee74117-fc84-4456-b652-b5a6974c225e", + "text": "MemGPT: Towards LLMs as Operating Systems. arXiv preprint arXiv:2310.08560 (2023).\n[11] Mathis Pink, Qinyuan Wu, Vy Ai Vo, Javier Turek, Jianing Mu, Alexander Huth, and Mariya Toneva. 2025. Position: Episodic Memory is the Missing Piece for Long-Term LLM Agents. arXiv preprint arXiv:2502.06975 (2025).\n[12] Qwen Team. 2025. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115 (2025).\n[13] Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. 2024. Agent Workflow Memory. arXiv preprint arXiv:2409.07429 (2024).\n[14] Zidi Xiong, Yuping Lin, Wenya Xie, Pengfei He, Jiliang Tang, Himabindu Lakkaraju, and Zhen Xiang. 2025. How Memory Management Impacts LLM\nThakker, James Zou, and Kunle Olukotun. 2025. Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. arXiv preprint arXiv:2510.04618 (2025).\n[17] Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. 2025. A Survey on the Memory Mechanism of Large Language Model based Agents. ACM Transactions on Information Systems (TOIS) (2025). doi:10.1145/3748302 arXiv:2404.13501.\n[18] Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, and Jun Wang. 2025. Memento: Fine-tuning LLM Agents without Fine-tuning LLMs. arXiv preprint arXiv:2508.16153 (2025).", + "paper_id": "2603.10600", + "title": "Trajectory-Informed Memory Generation for Self-Improving Agent Systems", + "authors": [ + "Gaodan Fang", + "Vatche Isahagian", + "K. R. Jayaram", + "Ritesh Kumar", + "Vinod Muthusamy", + "Punleuk Oum", + "Gegi Thomas" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10600v1", + "chunk_index": 44, + "total_chunks": 44, + "char_count": 1365, + "word_count": 196, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10623_semantic.json b/data/chunks/2603.10623_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..4b0a4afe32c46fda09d56e933c8589c562cc3344 --- /dev/null +++ b/data/chunks/2603.10623_semantic.json @@ -0,0 +1,1872 @@ +[ + { + "chunk_id": "dd48b6d7-f4cf-45e5-bc4f-b72398e5c960", + "text": "Geo-ATBench: A Benchmark for Geospatial Audio\nTagging with Geospatial Semantic Context Yuanbo Houa,1,∗, Yanru Wub,1, Qiaoqiao Renc, Shengchen Lib, Stephen\nRobertsa, Dick Botteldoorend aMachine Learning Research Group, Engineering Science, University of Oxford, UK\nbDepartment of Intelligent 
Science, Xi'an Jiaotong-Liverpool University, China\ncEECS, KTH Royal Institute of Technology, Sweden\ndWAVES Research Group, Information Technology, Ghent University, Belgium\nAbstract\nEnvironmental sound understanding in computational auditory scene analysis (CASA) is often formulated as an audio-only recognition problem. This formulation leaves a persistent drawback in multi-label audio tagging (AT): acoustic similarity can make certain events difficult to separate from waveforms alone. In such cases, disambiguating cues often lie outside the waveform. Geospatial semantic context (GSC), derived from geographic information system data, e.g., points of interest (POI), provides location-tied environmental priors that can help reduce this ambiguity.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 0, + "total_chunks": 85, + "char_count": 1060, + "word_count": 133, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "363ca725-8b3b-4dd1-ad01-f9daedb93727", + "text": "A systematic study of this direction is enabled through the proposed geospatial audio tagging (Geo-AT) task, which conditions multi-label sound event tagging on GSC alongside audio. To benchmark Geo-AT, the Geo-ATBench dataset is introduced as a polyphonic audio benchmark with geographical annotations, containing 10.71 hours of real-world audio across 28 event categories; each clip is paired with a POI-derived GSC representation constructed from 11 semantic context categories. 
Furthermore, GeoFusion-AT is proposed as a unified geo-audio fusion framework that\n∗Corresponding author: Yuanbo Hou, Machine Learning Research Group, University of Oxford, UK. Email: Yuanbo.Hou@eng.ox.ac.uk\n1Equal contribution.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 1, + "total_chunks": 85, + "char_count": 724, + "word_count": 93, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ac51f657-e9e9-44dc-8849-815f02a5a28e", + "text": "evaluates feature-level, representation-level, and decision-level fusion on three representative audio backbones, with audio-only and GSC-only baselines. Experiments show that incorporating GSC generally improves AT performance, especially on acoustically confounded labels, indicating that geospatial semantics can provide an effective prior beyond audio alone. A crowdsourced listening study with 10 participants on 579 samples shows that there is no significant difference in performance between the models on the Geo-ATBench labels and on aggregated human labels, supporting Geo-ATBench as a human-aligned benchmark. Overall, the proposed Geo-AT task, the open benchmark Geo-ATBench, and the reproducible geo-audio fusion framework GeoFusion-AT provide a solid foundation for studying audio tagging with geospatial semantic context within CASA. For the dataset, source code, and models, please see the project homepage (https://github.com/WuYanru2002/Geo-ATBench). 
Computational auditory scene analysis, Multi-label audio tagging, Geospatial semantic context, Points of interest, Multimodal fusion\nEnvironmental sound understanding is one of the core goals of computational auditory scene analysis (CASA) [1].", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 2, + "total_chunks": 85, + "char_count": 1207, + "word_count": 152, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d4352110-acab-4cc9-a69e-5bdcd79c18d7", + "text": "In many practical applications, the target output is multi-label audio tagging (AT) [2], where each recording may contain multiple sound events and the system predicts the set of event labels. AT supports applications such as acoustic surveillance [3], smart-city sensing [4], multimedia retrieval [5], and intelligent domestic assistants [6]. Despite strong progress in deep learning models for environmental audio, AT is commonly treated as an audio-only recognition problem [7, 8]. 
AT backbones, including convolutional neural networks (CNNs) and Transformers, learn powerful acoustic representations from time-frequency features such as Mel spectrograms [9].", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 3, + "total_chunks": 85, + "char_count": 654, + "word_count": 90, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d1a443eb-7de8-4bdb-99c2-2666e396f1df", + "text": "However, a persistent drawback remains. Acoustic similarity can make certain events difficult to distinguish from waveforms alone, especially when different sources produce highly similar time-frequency patterns. In such cases, disambiguating cues often lie beyond the waveform. A key source of such cues is the physical environment in which sound events occur. Sound events are produced by sources embedded in specific places, and their occurrence is shaped by location-tied environmental factors [14]. Location-tied conditions can induce systematic associations between event labels and geospatial semantic context (GSC) [15]. Such associations can provide complementary cues when waveforms alone are ambiguous. This work focuses on sound source-associated GSC, which refers to location-tied environmental priors derived from geographic information systems data, such as points of interest (POI) [16]. Compared with raw GPS coordinates, POI-derived GSC provides structured semantic descriptions of the physical environment surrounding sound sources that can be aligned with audio representations [17]. 
Progress in this direction remains limited by the lack of", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 4, + "total_chunks": 85, + "char_count": 1092, + "word_count": 150, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9be526c6-2315-4467-a187-57a1b2bed215", + "text": "standardized tasks and benchmark datasets that pair audio with reliable, structured GSC under reproducible evaluation [18]. Recent mobile recording devices and location-aware media platforms increasingly associate recordings with geographic coordinates [18], making relevant audio-GSC pairs increasingly accessible. This trend creates a timely opportunity to investigate how to leverage GSC to support multi-label AT in the real world. To address the gap that AT is often formulated without sound source-associated location-tied GSC, this paper proposes the geospatial audio tagging (Geo-AT) task, which conditions multi-label AT on GSC alongside audio. Geo-AT aims to assess whether location-tied environmental priors help disambiguate events that are difficult to distinguish from audio alone. To benchmark Geo-AT, we release the Geo-ATBench dataset, a geographically annotated polyphonic audio benchmark containing 3,854 clips with 28 event labels; each clip is paired with a GSC representation constructed from POI semantics over 11 context categories, enabling reproducible studies of how geospatial semantics interact with acoustic representations in multi-label AT. 
The proposed benchmark design of Geo-ATBench does not specify how GSC should be integrated into AT models [2, 3, 19], and different integration choices may lead to different outcomes.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 5, + "total_chunks": 85, + "char_count": 1356, + "word_count": 188, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "12c2f043-92b3-4423-ba40-bc042517e5a8", + "text": "Therefore, GeoFusion-AT is introduced as a unified geo-audio fusion framework for the proposed Geo-AT task to benchmark representative fusion strategies and to report reference results on GeoATBench. Specifically, GeoFusion-AT evaluates three typical fusion strategies,", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 6, + "total_chunks": 85, + "char_count": 269, + "word_count": 34, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7567179f-f45e-4c52-bc24-edffa6582648", + "text": "feature-level, representation-level, and decision-level fusion, across three representative audio backbones, the CNN-based pretrained audio neural networks (PANNs) [20], the Transformer-based audio spectrogram Transformer (AST) [9], and contrastive language-audio pretraining (CLAP) [21]. 
GSC-only baselines are included to isolate the contribution of each modality and to identify when fusion improves performance beyond either input alone. The main contributions are: 1) Geo-AT is introduced as a standardized", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 7, + "total_chunks": 85, + "char_count": 511, + "word_count": 64, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "12b14b74-8935-4106-8c2f-449152bac194", + "text": "task formulation for multi-label audio tagging in CASA that integrates audio with geospatial semantic context (GSC); 2) Geo-ATBench is released as an open benchmark for reproducible Geo-AT evaluation, containing 3,854 real-world polyphonic audio clips annotated with 28 event labels, where each clip is paired with a GSC representation constructed from POI semantics over 11 semantic context categories; 3) GeoFusion-AT is introduced as a unified geo-audio fusion framework that benchmarks representative fusion strategies across representative audio backbones on Geo-ATBench to report reference results; 4) A crowdsourced listening study with 10 participants on 579 samples is conducted, showing that model performance is comparable when evaluated against Geo-ATBench labels and aggregated human labels, supporting Geo-ATBench as a human-aligned benchmark. We have released the dataset, code, and models. The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 formalizes the Geo-AT task. Section 4 describes the Geo-ATBench dataset. 
Section 5 presents the GeoFusion-AT framework with fusion strategies based on representative audio backbones.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 8, + "total_chunks": 85, + "char_count": 1155, + "word_count": 161, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6c5d7dce-815b-47dd-9863-88c92f4989fd", + "text": "Section 6 reports experimental results and analysis. Section 7 details the human evaluation. Section 8 concludes the paper.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 9, + "total_chunks": 85, + "char_count": 122, + "word_count": 18, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "37921fd9-dbb4-467a-9001-bfcdad153911", + "text": "This section positions the proposed geospatial audio tagging (Geo-AT) task within prior work on multi-label audio tagging (AT), context-aware sound understanding, and POI-derived geospatial semantic context from geographic information systems. 
The discussion motivates the need for a standardized Geo-AT task under reproducible evaluation.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 10, + "total_chunks": 85, + "char_count": 338, + "word_count": 43, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2f70767c-9465-4b5a-9660-453b1de9820f", + "text": "Multi-Label Audio Tagging and Acoustic Ambiguity Multi-label AT is a central task in CASA [1], where an audio clip may contain polyphonic sound events, and the goal is to predict the set of event", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 11, + "total_chunks": 85, + "char_count": 195, + "word_count": 34, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "91d5963f-6add-429f-bbf8-8dba755c04b8", + "text": "Large-scale benchmarks and challenges [22, 23] have driven steady progress in model architectures and backbones, such as CNN-based PANNs [20] and MobileNet [24], Transformer-based Hierarchical Token-Semantic Audio Transformer [25] and AST [9], with contrastive learning-based CLAP that aligns audio and language representations [21]. 
These backbones have become", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 12, + "total_chunks": 85, + "char_count": 361, + "word_count": 47, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f1cb5a77-30ef-4099-a5ab-776a48aa1592", + "text": "common reference points for representation learning in AT tasks. Despite architectural advances, AT in real-world conditions continues to face persistent ambiguity [26]. Polyphonic recordings often contain overlapping sources, and different events can produce similar time-frequency patterns [10, 11], leading audio-only AT to struggle with confusable events and misclassifications. External priors like sound source-associated GSC provide complementary cues by encoding location-tied environmental priors into a structured POI-derived semantic representation [16], such as nearby place categories and their composition around the sound source.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 13, + "total_chunks": 85, + "char_count": 629, + "word_count": 81, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "db1b706c-a6f4-41f4-946c-5c4a35102183", + "text": "Location-tied GSC constrains the set of plausible events for a scene and can support disambiguation when acoustic evidence alone is insufficient. 
Context and Auxiliary Information for Sound Understanding Context-aware sound understanding extends AT by incorporating information beyond acoustic representations.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 14, + "total_chunks": 85, + "char_count": 310, + "word_count": 39, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "31443641-466c-45d9-9996-e20452b155b0", + "text": "Prior work [27, 28] can be divided into two groups, distinguished by whether the additional signal is time-aligned with the audio. One group [29] uses paired sensory streams, where video frames or other time-aligned inputs are available together with the audio. The other group [30] uses auxiliary metadata that is linked to the recording environment but is not time-aligned with the audio signal. Geo-AT concerns the second group. Location-tied descriptors operate as scene-level priors and remain available in many deployments. Existing studies [31] that incorporate auxiliary metadata vary in metadata representation, audio-metadata pairing rules, data splits, and reporting practice, and metadata-only baselines are not always reported. 
These inconsistencies limit reproducible comparison across studies and motivate a standardized task for evaluating auxiliary metadata in multi-label AT.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 15, + "total_chunks": 85, + "char_count": 872, + "word_count": 121, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6403981f-276b-4018-b2bc-0e5023f53601", + "text": "POI-Derived Geospatial Semantic Context (GSC) Geospatial information has become increasingly available in audio collections due to mobile recording devices and location-aware media platforms that associate recordings with geographic coordinates [18]. Several datasets include geographic or location-related annotations, enabling spatial analyses of urban sound environments and regional differences [15]. However, geospatial information is usually used for organization, mapping, or descriptive analysis rather", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 16, + "total_chunks": 85, + "char_count": 510, + "word_count": 61, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "90bdce0f-0ee7-419d-aaa6-2f4a12c7631c", + "text": "than as an explicit model input for sound event recognition [32]. Points of interest (POI) in geographic information systems translate location into interpretable semantic descriptors. 
POI encodes nearby places and compositions, representing location-tied environmental priors [16]. GSC contains scene-level descriptors that can be paired with audio recordings. However, prior work rarely formalizes POI-derived GSC as part of the AT task. The lack of consistent task definitions and benchmarks makes it difficult to assess whether and how geospatial semantics should be integrated.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 17, + "total_chunks": 85, + "char_count": 576, + "word_count": 80, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "35749f64-28c1-4932-9267-38752119656c", + "text": "Taken together, prior work leaves AT largely audio-only and rarely evaluates POI-derived GSC as a task input under reproducible protocols. The missing piece is a standardized Geo-AT task definition and a benchmark that enables controlled comparisons. The Geo-AT task addresses this gap by defining AT conditioned on sound source-associated, POI-derived GSC alongside audio, enabling controlled evaluation of geospatial priors in AT tasks. The proposed Geospatial Audio Tagging (Geo-AT) task Geospatial audio tagging (Geo-AT) formalizes AT conditioned on sound source-associated geospatial semantic context (GSC) derived from geographic information systems resources, such as Points of Interest (POI). Geo-AT is a multimodal learning task that enables controlled study of how POI-derived GSC interacts with acoustic representations in AT tasks [2, 19]. Given that each recording is represented by an acoustic representation A and a GSC vector g ∈ R^DGSC constructed from geographic information systems, Geo-AT uses the paired input (A, g). 
Geo-AT assumes that g is available as recording", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 18, + "total_chunks": 85, + "char_count": 1061, + "word_count": 151, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "50bdbced-b81c-4fc5-ab87-561869ea6204", + "text": "metadata at inference time, alongside A. The learning objective is to predict the set of event labels present in the clip. Let Y denote the event label set. The target for each clip is a multi-label vector y ∈ {0, 1}^|Y|, where yk = 1 indicates the presence of event k in the clip. Geo-AT aims to learn a function f : (A, g) → y, where g encodes information about the surrounding environment through POI-derived semantic descriptors (e.g., proximity to beaches, highways, train stations, residential areas, or industrial facilities). Geo-AT does not prescribe a", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 19, + "total_chunks": 85, + "char_count": 555, + "word_count": 93, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f404e8de-785e-4b9e-b578-2ec3932b7912", + "text": "specific integration mechanism between A and g, leaving model design choices open for evaluation under a shared task definition. 
Geo-AT focuses on multi-label tagging rather than single-label classification, emphasizing label prediction under polyphonic conditions, where multiple events may co-occur in a clip. The purpose of Geo-AT is not to replace the AT task, but to study when and how spatial evidence complements audio representations, particularly for acoustically confusable events and polyphonic scenes. Geo-AT is motivated by the use of contextual knowledge in auditory perception and location-tied metadata in real deployments [33]. Geo-AT provides a framework for building and evaluating more robust machine listening systems in geographically diverse environments, including urban noise monitoring, context-aware assistive hearing, and scalable acoustic surveillance [18, 34]. The benchmark dataset for the Geo-AT task: Geo-ATBench The audio recordings for Geo-ATBench are sourced from Freesound.org [35], a public repository of user-contributed sounds, as well as from the dataset presented in [33], which includes audio files with GPS information and a diverse range of sound events. Figure 1: The number of recordings with GPS information uploaded to Freesound each year.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 20, + "total_chunks": 85, + "char_count": 1281, + "word_count": 181, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c8928dbb-e759-4738-831f-7ea995934545", + "text": "Audio clips were selected based on the inclusion of geotagging information, specifically latitude and longitude coordinates provided by the uploaders, and underwent careful manual review of coordinate validity and obvious mismatches between tags and location for quality control.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 21, + "total_chunks": 85, + "char_count": 279, + "word_count": 38, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5c096c7b-a72c-4364-9227-3c48ea8516da", + "text": "Sound event and GSC annotation GSC construction: For recordings sourced from Freesound, we specifically select data spanning from 2012 to 2025. This temporal filtering is applied because the scale of geo-tagged audio prior to 2012 is relatively limited, as shown in Fig. 
1, and the geographical information of regions may differ across long time spans.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 22, + "total_chunks": 85, + "char_count": 352, + "word_count": 55, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7a905e92-64ce-4838-abca-1077081f84f6", + "text": "The GPS coordinates of each recording were obtained from Freesound or, for the remaining recordings, from the original dataset. These coordinates are used to query the OpenStreetMap (OSM) geospatial database via the Overpass API. For each recording with GPS coordinates, a square with a fixed side length is drawn around the location, and OSM entities within this square are identified based on 11 OSM feature keys, covering categories such as land use, amenities, and natural features. 
While a circular region may be conceptually aligned with the isotropic nature of sound propagation, a square region is adopted to", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 23, + "total_chunks": 85, + "char_count": 577, + "word_count": 92, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "99d1049d-92b9-4d2d-97af-04d4c304de52", + "text": "enable efficient bounding-box queries within standard OSM-based geographic databases. This choice provides a computationally practical approximation of the local acoustic environment while maintaining spatial consistency. The resulting GSC representation is a POI-derived semantic", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 24, + "total_chunks": 85, + "char_count": 268, + "word_count": 32, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "50619350-2c99-46fb-a7bb-7cf9ccf2c4c8", + "text": "Natural Sounds (Cnt, Dur): Bird sounds 1024, 8191; Crickets 343, 3091; Falling water 325, 2922; Flowing water 319, 2774; Waves 307, 2754; Insects(Flying) 137, 824; Wind 83, 737; Dog 209, 969.
Human Sounds (Cnt, Dur): Speech 794, 5133; Footsteps 288, 2225; Music Instru. 188, 1593; Music 144, 1330; Singing 81, 624; Shout/Scream 79, 249; Laughter 53, 125.
Sounds of Things (Cnt, Dur): Car 463, 3068; Plane 340, 3092; Train 165, 1291; Bell 121, 835; Boat 115, 927; Tram 111, 731; Vehicle horn 107, 293; Explosion 93, 431; Bus 74, 461; Siren 69, 509; Metro 63, 454; Helicopter 58, 496; Truck 42, 237.
Table 1: Sound classes in Geo-ATBench, grouped by Natural, Human, and Thing. Cnt denotes sample count, and Dur. denotes the total duration (in seconds) of each class. Musical instru. abbreviates Musical instrument, and Falling water denotes Falling water/Rain.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 25, + "total_chunks": 85, + "char_count": 766, + "word_count": 134, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d35047cf-41cc-4b2b-8718-f05d3c741d1d", + "text": "descriptor extracted from these OSM annotations and used as the location-tied input described in Section 3. The square side length and the 11 feature keys", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 26, + "total_chunks": 85, + "char_count": 154, + "word_count": 25, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "66e72f4e-8852-495c-be19-7af0870734fc", + "text": "are the same for all clips to keep GSC extraction consistent across the dataset. 
Sound event annotation: Many Freesound clips include user-provided tags, but the perception of audio events ultimately relies on human hearing. Therefore, each recording is manually reviewed by listening to the audio track and assigning the heard event labels. When a label is uncertain, the recording is replayed and re-checked until a decision can be made. After manual annotation, the labels are cross-validated with the user-provided tags on Freesound.org. Recordings with disagreements are re-examined and corrected, and when", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 27, + "total_chunks": 85, + "char_count": 609, + "word_count": 92, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7e420c4e-ff2b-4286-9d2e-55d276630e13", + "text": "needed, the corresponding GPS metadata is used to extract POI-derived OSM annotations as an auxiliary cue to support label verification. Finally, each audio clip is paired with its POI-derived OSM annotations to form an Audio–GSC pair in Geo-ATBench. The initial annotation took about 600 person-hours, and cross-validation and re-checking took about 200 additional person-hours, for a total of about 800 person-hours over four months. End-to-end dataset collection, preparation, and annotation took about six months. A curation process is performed to map unstructured annotation labels into a controlled vocabulary, resulting in 28 sound event classes. Figure 2: Summary of sound classes and acoustic similarity. (Left) Distribution of three coarse-grained sound classes. (Right) Intra-class similarity across 28 sound event classes computed from log-Mel spectrogram features. 
These classes are grouped into three main categories aligned with the AudioSet taxonomic structure [26]: 1) Natural Sounds, which include sounds originating from nature; 2) Human Sounds, which encompass sounds produced by humans; and 3) Sounds of Things, which represent mechanical and man-made noises.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 28, + "total_chunks": 85, + "char_count": 1157, + "word_count": 164, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "845bde56-2d14-4ec3-8244-76db8660a88b", + "text": "The sample counts and total durations for these categories are illustrated in Table 1, while the coarse-grained distributions and corresponding intra-class similarities are visualized in Fig. 2 (right), where the violin-plot similarities are calculated based on log-Mel spectrogram features, and similarity is measured using cosine similarity between feature vectors, a widely adopted metric in audio and sound analysis [37]. Additionally, Fig. 3 provides an overview of the dataset's composition, encompassing 28 event types and 11 OSM categories. The dataset is inherently multi-label, accounting for the co-occurrence of multiple sound events within a single 10-second recording. Dataset Organization and Statistics Following cleaning and selection, the final Geo-ATBench dataset comprises 3,854 audio clips, totaling 10.71 hours of audio. 
Each data point consists of a", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 29, + "total_chunks": 85, + "char_count": 872, + "word_count": 122, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d918ae5d-7751-4eb4-a73a-821bf90f2384", + "text": "triplet: (i) a 10-second audio clip, (ii) a multi-label clip-level label vector over 28 event classes, and (iii) a POI-derived GSC representation constructed from OSM annotations over 11 semantic context categories. To ensure consistency", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 30, + "total_chunks": 85, + "char_count": 237, + "word_count": 34, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "cb5ce65d-63a0-4bad-85b3-35e0c1c45acb", + "text": "for modeling tasks, all collected recordings are processed into a standardized format. Each audio clip has a fixed duration of 10 seconds, encoded as a single-channel (mono) WAV file with a sampling rate of 16 kHz and a bit depth of 16. Figure 3: Sankey diagram summarizing co-occurrence links from left to right: 3 coarse-grained sound classes, 28 fine-grained sound event classes, GSC types, and the Geo-ATBench dataset. Flow width indicates co-occurrence strength. This diagram represents the distribution of audio events and GSC types within the dataset, and is not intended to imply precise real-world relationships, as sound occurrences can vary significantly depending on the specific geographical context (e.g., residential roads vs highways).", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 31, + "total_chunks": 85, + "char_count": 741, + "word_count": 112, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0cf4c744-8f0d-41bf-8bce-78bd06332d87", + "text": "For more details and access to the dataset, please visit the project homepage. The GeoFusion-AT framework and instantiations The GeoFusion-AT framework As shown in Fig. 4, GeoFusion-AT provides reference implementations of three typical fusion points for the Geo-AT task on the Geo-ATBench dataset. All variants take paired inputs (A, g) and output multi-label logits z ∈ R^C for C event classes, followed by a sigmoid for tag probabilities. Figure 4: Overall architecture of the GeoFusion-AT framework for the Geo-AT task. GeoFusion-Early: feature-level fusion Early fusion [38], also known as feature-level fusion, integrates geospatial context and acoustic information at the input of the network. It begins by transforming the raw audio waveform into a log-Mel spectrogram A ∈ R^{1×T×F}, where T and F denote the number of time frames and frequency bins, respectively. 
Concurrently, the GSC vector g ∈ R^DGSC is projected into a length-F vector g′ ∈ R^F via a linear transformation: g′ = Wproj g,", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 32, + "total_chunks": 85, + "char_count": 980, + "word_count": 150, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "883fad76-c055-47a7-9b90-8927d706cbbb", + "text": "where Wproj ∈ R^{F×DGSC} is a learnable projection matrix. The frequency resolution F is chosen as the projection dimension so that g′ can be interpreted as a location-conditioned spectral prior (i.e., a per-frequency weighting/gating signal): different geographic contexts tend to correlate with different dominant sound sources and background noise, which manifest as characteristic energy distributions over frequency bands. The projected vector g′ is then broadcast across the temporal dimension to form a broadcast GSC tensor G ∈ R^{1×T×F}. The audio spectrogram and the broadcast GSC tensor are concatenated along the channel dimension to produce the fused representation Xfused = Concat(A, G) ∈ R^{2×T×F}, which serves as the input to the backbone network. 
When a backbone does not accept a two-channel spectrogram input, an input adapter is applied to map Xfused into the backbone's expected input shape and channel format; all subsequent backbone components remain unchanged.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 33,
    "total_chunks": 85,
    "char_count": 954,
    "word_count": 143,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8a256db4-4d58-4b45-9cca-880c002a2d73",
    "text": "GeoFusion-Inter: representation-level fusion Intermediate fusion [39], or representation-level fusion, combines information in the latent space after each modality has been processed by separate encoders. Let Φaudio be an audio encoder that maps an input spectrogram A to an audio embedding Eaudio ∈RDemb, where Demb is the embedding dimension. Similarly, the GSC vector g is processed through a multi-layer perceptron (MLP) projection to produce a GSC embedding EGSC ∈RDemb of the same dimension. Here, both embeddings are clip-level representations, implying that temporal information in A has been aggregated by Φaudio prior to fusion. Intermediate fusion implements a symmetric cross-modal attention [40] module that supports bidirectional refinement between the audio and GSC embeddings.
Given the query Q, key K, and value V, attention is computed as Attention(Q, K, V) = softmax(QK^T/√Demb) V, where K = V; the scaling factor √Demb stabilizes optimization [40].",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 34,
    "total_chunks": 85,
    "char_count": 941,
    "word_count": 140,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c37895eb-f0d4-40ed-8c37-0b4a056206c2",
    "text": "Accordingly, the cross-modal attention operates on global embeddings rather than temporal tokens, serving as feature-wise conditioning instead of frame-level alignment. The audio embedding Eaudio is enhanced by treating it as a query and the GSC embedding EGSC as the context. Symmetrically, EGSC is enhanced using Eaudio as context. Two complementary fusion streams are formed by residual mixing. One stream combines the cross-attention refined audio embedding with the original GSC embedding. The other stream combines the cross-attention refined GSC embedding with the original audio embedding. This symmetric design preserves both the cross-modal updates and the original modality information. The two streams are then concatenated and passed through a learnable linear projection to produce a single fused embedding, which is fed to the classification head to output multi-label tagging logits.
GeoFusion-Late: decision-level fusion", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 35, + "total_chunks": 85, + "char_count": 894, + "word_count": 124, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d71c7a57-6f39-4eac-b090-fdb913b77001", + "text": "Late fusion [38], or decision-level fusion, combines the outputs of two independent streams, one for each modality. In this paradigm, an audio branch, Φaudio, processes the audio representation A to produce class-wise logits, zaudio ∈ RC, where C is the number of event classes.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 36, + "total_chunks": 85, + "char_count": 278, + "word_count": 44, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b02a0634-0620-4f9e-8e61-9fd60e01bc55", + "text": "In parallel, a GSC branch, ΦGSC, takes the POI-derived GSC vector g as input and produces its own logits, zGSC. The fusion is performed by a weighted combination of these two logits. 
Rather than using a single scalar weight, a learnable, class-specific weighting vector is used. This design assigns a separate GSC weight to each class while keeping the audio branch unchanged.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 37,
    "total_chunks": 85,
    "char_count": 360,
    "word_count": 59,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3b2f7b66-ae13-426c-91b3-5f25336f0b15",
    "text": "The fused logits zfused are computed as: zfused = zaudio + λ ⊙ zGSC (1) where ⊙ denotes element-wise multiplication, λ is constrained to be nonnegative via a softplus activation function [41], λ = softplus(λraw), and λraw is a learnable parameter. The fusion is performed in the logit (pre-sigmoid) domain, where z denotes class-wise log-odds scores.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 38,
    "total_chunks": 85,
    "char_count": 322,
    "word_count": 50,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "dd4f821c-7958-413f-8668-0a4b26fadcc4",
    "text": "Thus, Eq. (1) combines modality-specific evidence before the final sigmoid mapping to probabilities. Final class probabilities are obtained by applying a sigmoid function to zfused.
The GeoFusion-AT framework uses the standard multi-label AT objective. Auxiliary losses and regularizers are optional and not",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 39,
    "total_chunks": 85,
    "char_count": 299,
    "word_count": 41,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "03dbb76c-a582-439e-93b8-dca1a961e352",
    "text": "required by the framework definition. Instantiations of the GeoFusion-AT framework GeoFusion-AT is instantiated on three representative audio backbones to provide benchmark results for the Geo-AT task. PANNs [20] is a CNN-based pretrained audio backbone, AST [9] is a patch-based Transformer backbone that applies attention over spectrogram patch embeddings, and CLAP [21] is a contrastively pretrained audio–text backbone.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 40,
    "total_chunks": 85,
    "char_count": 423,
    "word_count": 57,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7b651527-28a6-4016-964d-10e130776da2",
    "text": "All instantiations follow the definitions in Section 5.1: feature-level fusion (GeoFusion-Early), representation-level fusion (GeoFusion-Inter), and decision-level fusion (GeoFusion-Late). Figure 5: Instantiations of GeoFusion-Early (feature-level fusion).
Code and model checkpoints are available on the project homepage. Instantiations of GeoFusion-Early GeoFusion-Early implements feature-level fusion by constructing an acoustic representation tensor and a broadcast GSC tensor, as shown in Fig. 5. GeoFusion-Early-PANNs. The instantiation on PANNs [20] follows Section 5.1.1. The GSC vector g ∈RDGSC (with DGSC = 768) is linearly projected to a length-F vector and broadcast along time to form a broadcast GSC tensor G. Audio preprocessing operations are applied to A before fusion. The fused input is Xfused = Concat(A, G) ∈R2×T ×F . The first convolutional layer is adapted to accept two input channels. Weights for the audio channel",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 41,
    "total_chunks": 85,
    "char_count": 938,
    "word_count": 130,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e5b0e3e1-81c4-477e-ba7b-73926b812f39",
    "text": "are initialized from the PANNs checkpoint, and weights for the GSC channel are zero-initialized to preserve the pretrained audio pathway at initialization and let the model learn to use g during fine-tuning. GeoFusion-Early-AST. For AST [9], GeoFusion-Early is implemented as feature-level fusion in the token sequence. Instead of channel-wise concatenation, the GSC vector g is mapped to the AST embedding dimension and injected as a dedicated [GSC] token. The Transformer input sequence contains the standard [CLS] token, the [GSC] token, and the audio patch tokens.
The positional embedding table is expanded to (1, Npatches + 2, Demb) (with Demb = 768), and the new [GSC] position is zero-initialized while the original positions retain their pretrained values from the AST checkpoint. The classification head uses the output embedding of the [CLS] token. GeoFusion-Early-CLAP. The CLAP audio encoder [21] accepts a spectrogram input and is instantiated with the same two-channel construction as",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 42,
    "total_chunks": 85,
    "char_count": 954,
    "word_count": 146,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2d8ba32d-b6a3-4571-8560-afa26c2c8c4d",
    "text": "GeoFusion-Early-PANNs. A broadcast GSC tensor G is constructed from g and concatenated with A to form Xfused.
Weights for the audio channel are initialized from the checkpoint, while the GSC channel is zero-initialized to avoid perturbing pretrained audio representations early in training.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 43,
    "total_chunks": 85,
    "char_count": 290,
    "word_count": 42,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3e404f71-a5eb-4d6c-acb0-02dfcf74fe50",
    "text": "Instantiations of GeoFusion-Inter GeoFusion-Inter is a representation-level fusion variant that combines the audio embedding Eaudio and the GSC embedding EGSC using the symmetric cross-modal attention module in Section 5.1.2, as shown in Fig. 6. GeoFusion-Inter-PANNs. For PANNs, its pretrained backbone serves as",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 44,
    "total_chunks": 85,
    "char_count": 313,
    "word_count": 42,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "18094d79-97e6-46f0-96d6-fd1dcfd473cd",
    "text": "a feature extractor to produce an audio embedding Eaudio ∈RDemb (Demb = 2048). In parallel, the GSC vector g ∈RDGSC is projected by a two-layer MLP into a GSC embedding EGSC ∈RDemb. The embeddings are combined using the symmetric cross-modal attention module in Section 5.1.2 to produce Efused, which is fed to the classification head to output multi-label logits.
GeoFusion-Inter-AST. For AST, the [CLS] output embedding is used as Eaudio ∈RDemb with Demb = 768. The GSC vector g ∈RDGSC has DGSC = 768 and is used to form EGSC at the same dimension.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 45,
    "total_chunks": 85,
    "char_count": 522,
    "word_count": 91,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "edcd0bee-20c7-4d90-8b12-faacd302b047",
    "text": "The attention module in Section 5.1.2 produces Efused for tagging. GeoFusion-Inter-CLAP. For CLAP, its audio encoder produces Eaudio ∈ RDemb with Demb = 1024. Concurrently, a two-layer MLP projects the GSC vector g into a matching GSC embedding EGSC. The attention module in Section 5.1.2 combines the embeddings to produce Efused for tagging. Figure 6: Instantiations of GeoFusion-Inter (representation-level fusion). Instantiations of GeoFusion-Late GeoFusion-Late implements decision-level fusion by combining audio logits zaudio and GSC logits zGSC using Eq. 1, as shown in Fig. 7.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 46,
    "total_chunks": 85,
    "char_count": 585,
    "word_count": 84,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ef190254-ae7f-4630-8663-43dc2f274d5b",
    "text": "GeoFusion-Late-PANNs. The audio branch is the PANNs model and outputs zaudio. The GSC branch is an MLP that maps g to zGSC.
The fused logits zfused are computed via Eq. 1 and optimized with the multi-label objective.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 47,
    "total_chunks": 85,
    "char_count": 183,
    "word_count": 32,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3562be41-1775-43ae-a9ca-38112a846751",
    "text": "GeoFusion-Late-AST. The audio branch is the AST model and outputs zaudio. The GSC branch and logit fusion follow GeoFusion-Late-PANNs, producing zfused for multi-label tagging. GeoFusion-Late-CLAP. The audio branch uses the CLAP audio encoder to produce an audio embedding, followed by a classification head to output zaudio. The GSC branch and logit fusion follow GeoFusion-Late-PANNs. Figure 7: Instantiations of GeoFusion-Late (decision-level fusion). Experiments and results analysis Experimental setup and evaluation metrics Geo-ATBench is evaluated as a 28-class multi-label AT task; each audio clip is represented by the acoustic input and the paired POI-derived GSC, described in Section 3 and Section 4.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 48,
    "total_chunks": 85,
    "char_count": 655,
    "word_count": 96,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0a4f4ae9-a18d-4ee0-94f3-7dafaa13ad05",
    "text": "For repeated evaluation, five independent runs are conducted with different random seeds.
In each run, the dataset is split into 70% training, 15% validation, and 15% test. A multi-label stratification procedure is used to keep per-label prevalence and co-occurrence patterns comparable across splits so that all event classes are represented in the test set. The split is performed at the clip level. The GSC representation is not",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 49,
    "total_chunks": 85,
    "char_count": 426,
    "word_count": 66,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e80940c4-e795-47ac-b475-608c93e3a24b",
    "text": "constructed from precise geographic identifiers such as GPS coordinates or street addresses. Instead, it encodes high-level semantic context derived from surrounding points of interest. Specifically, raw OSM tags, such as amenity: school and",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 50,
    "total_chunks": 85,
    "char_count": 191,
    "word_count": 26,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "59c331bb-11a5-4632-a006-66e605c4339a",
    "text": "highway: bus stop, are extracted and converted into descriptive strings. The resulting strings are encoded using a pretrained BERT model [42], and element-wise mean pooling is applied to the embeddings to capture local land-use characteristics and area semantics.
Similar GSC patterns may occur across different recording locations, while recordings in the same area may still differ in their acoustic content. Thus, the reported benchmark results should be interpreted",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 51,
    "total_chunks": 85,
    "char_count": 446,
    "word_count": 65,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "9c23ed4d-ebf5-4ade-91e5-ea695c009793",
    "text": "as evaluating generalization under clip-level partitioning with location-derived semantic context, rather than under strict geographic hold-out. The three backbones (PANNs [20], AST [9], and CLAP [21]) used in this paper are pretrained on large-scale AudioSet [26] and have reported strong performance on AudioSet with 527 audio event classes at the time of their release. In the benchmark construction for Geo-AT on Geo-ATBench, fine-tuning is applied to adapt these backbones to the 28-class multi-label task while limiting changes to their pretrained audio representations through a small learning rate and early stopping. Models are trained on an NVIDIA GeForce RTX 4090 GPU and fine-tuned for a maximum of 100 epochs using the AdamW optimizer with a learning rate of 1e-5. Early stopping is applied to prevent overfitting; training stops if the validation F1 score does not improve for 15 consecutive epochs. The training objective is binary cross-entropy (BCE) loss [20].
Audio inputs are 10-second clips and are resampled to",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 52,
    "total_chunks": 85,
    "char_count": 1003,
    "word_count": 153,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a77220ba-d313-491b-bacc-94c5784f999f",
    "text": "match each backbone's requirements. All models are initialized from pretrained weights, and audio-only baselines are included for comparison. Model performance is evaluated by mean Average Precision (mAP) [20], area under the ROC curve (ROC AUC), and F1 score, with mean ± standard deviation across the 5 independent runs. All metrics are micro-averaged unless otherwise noted.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 53,
    "total_chunks": 85,
    "char_count": 360,
    "word_count": 53,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e5e49ab2-ba06-4ab0-ac56-b1216de0ae24",
    "text": "Besides the multi-label AT on 28 event classes, a 3-class coarse-grained AT is reported as a supplementary analysis. For code, models, and the dataset, please see the project homepage.
This section evaluates Geo-ATBench from three complementary perspectives.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 54, + "total_chunks": 85, + "char_count": 254, + "word_count": 36, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "cec69a68-8df3-4317-ab3e-9bc711fa4a24", + "text": "First, the feasibility of performing audio tagging with GSC alone is evaluated as a GSC-only baseline under different POI extraction ranges. Second, audio-only zero-shot baselines are reported for three strong AudioSetpretrained audio backbones to characterize backbone behaviour before finetuning on Geo-ATBench. Third, fine-tuned Geo-AT results on Geo-ATBench are reported for audio-only and GeoFusion-AT variants under identical data splits, enabling a controlled comparison of feature-level, representation-level, and decision-level fusion. Per-label performance changes and error patterns are used to identify which labels and confusions benefit most from GSC, with emphasis on acoustically confusable labels. 
GSC-only baseline and GSC range sensitivity",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 55,
    "total_chunks": 85,
    "char_count": 758,
    "word_count": 98,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c917582c-d602-494f-8b18-0343ea6b965f",
    "text": "In practice, sound events differ in how broadly they can be perceived and in how strongly they correlate with nearby place semantics. The POI extraction range affects the amount and composition of POI-derived context available for constructing the GSC vector g. To benchmark the Geo-only baseline on Geo-ATBench, a GSC-only baseline is evaluated under multiple POI extraction ranges, as shown in Fig. 8. For each POI extraction range defined by a distance threshold, implemented as the square side length, a square is centered at the clip's GPS coordinate. Figure 8: Average Precision (AP) for GSC-only multi-label tagging under different POI
extraction ranges. POI-derived GSC is constructed from OSM entities retrieved with the
same 11 OSM feature keys (e.g., land use, amenities) and encoded into the fixed-length GSC
vector g.
mAP is computed on the test set over 5 independent runs.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 56,
    "total_chunks": 85,
    "char_count": 879,
    "word_count": 138,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "effc8c64-35fa-4179-adb9-0a0b30cd010f",
    "text": "Although a circular neighborhood may better approximate isotropic sound propagation, a square region is employed to enable efficient bounding-box queries in OSM-based geographic information systems. OSM entities within the square are retrieved using the same 11 OSM feature keys [36] as in Section 4. The resulting POI composition is encoded by a pretrained BERT model [42] into",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 57,
    "total_chunks": 85,
    "char_count": 372,
    "word_count": 56,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c601d6b1-199c-452c-8631-3af9ec3faf00",
    "text": "the fixed-length 768-dimensional GSC vector g, and the same GSC-only classifier is evaluated across all ranges. The Geo-only baseline uses BERT-base to produce g, followed by a 3-layer MLP with 1024, 512, and 28 units to perform 28-label multi-label tagging on Geo-ATBench.
During training, the BERT tokenizer and BERT encoder [42] are frozen, and only the 3-layer MLP classifier is trained. Source code, extracted GSC vectors, and implementation details are available on the project homepage. The GSC-only results increase with larger distance thresholds on Geo-ATBench, and the 1000-metre range yields the highest performance.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 58,
    "total_chunks": 85,
    "char_count": 615,
    "word_count": 91,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c477df95-cd53-419d-a934-d68313939971",
    "text": "One possible explanation is that OpenStreetMap (OSM) coverage can be sparse in some regions [43], so smaller squares may return fewer entities for constructing the GSC vector. GPS accuracy can also vary across devices and conditions [44], which can shift the queried area and affect POI retrieval. The sound events themselves may also contribute, as they differ in how broadly their semantics relate to nearby places and in how tightly they align with POI-derived context. For example, mobile sources such as birds or crickets can be heard across a natural area and may be associated with woodland or water POIs beyond the immediate vicinity of the recording point. Human speech can also reflect nearby attractions or pedestrian flow, where people move toward or away from a site and produce speech sounds over a broader area.
In contrast, events associated with fixed sources, such as breaking waves at a shoreline, are typically constrained to more local place semantics; thus, shorter ranges can be sufficient in such cases. In summary, this section presents Geo-only performance on Geo-ATBench with different POI extraction ranges, providing a detailed reference for Geo-only comparison on the proposed Geo-ATBench dataset.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 59,
    "total_chunks": 85,
    "char_count": 1178,
    "word_count": 184,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "54b5c9cb-9a57-4ae5-9d74-2b57718c91e9",
    "text": "Audio-only zero-shot baselines Audio-only zero-shot tagging inference is reported to characterize three AudioSet-pretrained models' behaviour before fine-tuning on Geo-ATBench. The AudioSet-pretrained backbones PANNs [20], AST [9], and CLAP [21] are evaluated through direct inference, providing a reference point for the fine-tuned audio-only and GeoFusion-AT results reported later. A direct zero-shot benchmark on Geo-ATBench labels is not possible from the original AudioSet-pretrained model outputs, because these backbones are trained on AudioSet with 527-class event labels, and their native outputs do not match the 28 Geo-ATBench event labels.
A comparable 28-label zero-shot benchmark is defined by first producing class predictions over the 527 AudioSet labels for each Geo-ATBench clip, and then mapping these 527 outputs to the 28 Geo-ATBench labels using the pretrained Word2Vec model ("word2vec-google-news-300") [45], which provides 300-dimensional word embeddings trained on the Google News corpus. Figure 9: ROC curves for zero-shot audio-only tagging inference on Geo-ATBench labels. Micro and macro ROC AUC are reported for AudioSet-pretrained PANNs, AST, and CLAP
after AudioSet-to-Geo-ATBench label mapping. Overall performance: PANNs (Micro AUC
0.8576, Macro AUC 0.8409), AST (0.6672, 0.6443), and CLAP (0.8325, 0.8022).",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 60,
    "total_chunks": 85,
    "char_count": 1299,
    "word_count": 175,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "bec72835-9f87-4cef-a404-648b1c49e4af",
    "text": "The AudioSet-to-Geo-ATBench label mapping and code are released on the project homepage for reproduction. Fig. 9 shows zero-shot audio-only tagging performance on Geo-ATBench for three AudioSet-pretrained backbones.
PANNs achieves the strongest performance, followed by CLAP, while AST performs the worst under the same AudioSet-to-Geo-ATBench label mapping.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 61,
    "total_chunks": 85,
    "char_count": 402,
    "word_count": 53,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "dd554d95-807a-4a70-9026-e3e7b75e1003",
    "text": "Several factors may contribute to this ordering. First, PANNs is trained to produce strong clip-level tag predictions from log-Mel inputs [20], which can transfer more directly to Geo-ATBench under label-space mapping. Second, CLAP learns aligned audio-text representations [21], which can preserve semantic separation that remains useful after mapping AudioSet labels to Geo-ATBench labels. Third, AST relies on spectrogram patch tokenization and positional embeddings [9], and its AudioSet pre-training configuration may transfer less effectively to the Geo-ATBench distribution under direct inference without task-specific adaptation. Similarly, the visualisation of the audio embeddings under zero-shot inference",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 62,
    "total_chunks": 85,
    "char_count": 699,
    "word_count": 91,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2a9d49de-0c6d-4b24-867b-94656afc12a1",
    "text": "shows the same trend in Fig. 
10. Embeddings from PANNs and CLAP form more separable clusters across the 28 Geo-ATBench classes, whereas AST embeddings show stronger overlap and concentrate in a smaller region of the embedding space.
Figure 10: t-SNE visualization of audio embeddings from zero-shot inference for PANNs, AST, and CLAP on the 28 Geo-ATBench event classes; clusters show effective separation, while overlaps highlight acoustically similar classes.
Higher-resolution versions of Fig. 9 and Fig. 10 are available on the project homepage due to space constraints. Fine-tuned Geo-AT results on Geo-ATBench",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 63,
    "total_chunks": 85,
    "char_count": 598,
    "word_count": 89,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "332b42f7-acbc-4754-926f-25f9c0bfd07c",
    "text": "Table 2 shows the results of audio backbones under the three fusion strategies described in Section 5. 
Audio-only and GSC-only baselines are reported, and GeoFusion-AT variants are compared under identical data splits on the 28-class task. All fine-tuned models are trained on the 28 Geo-ATBench labels, without the AudioSet label mapping used in the zero-shot baselines.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 64,
    "total_chunks": 85,
    "char_count": 360,
    "word_count": 55,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3d8b1c2a-0156-4586-bbe9-7d5243a5924d",
    "text": "Across early, representation-level, and late fusion, incorporating GSC improves 28-class Geo-AT performance for all three backbones. Welch two-sample t-tests indicate significant improvements compared with the corresponding audio-only baselines for AST with early fusion (p < 0.05), PANNs with late fusion (p < 0.001), and CLAP with intermediate fusion. GeoFusion-Early-AST achieves the best mAP on the fine-grained
Fine-grained (28 classes) Coarse-grained (3 classes)
Strategy PANNs AST CLAP PANNs AST CLAP",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 65,
    "total_chunks": 85,
    "char_count": 506,
    "word_count": 69,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f32c2390-b5d0-40f9-b078-4092f3cdafcf",
    "text": "Audio-Only 0.770±0.006 0.820±0.015 0.824±0.008 0.961±0.006 0.904±0.012 0.966±0.005
GSC-Only 0.767±0.010 
0.867±0.009
GeoFusion-Early 0.812±0.010 0.846±0.010 0.826±0.010 0.954±0.004 0.914±0.009 0.950±0.008
Gain (∆) +0.042 +0.026 +0.002 -0.007 +0.010 -0.016
GeoFusion-Inter 0.824±0.010 0.829±0.003 0.842±0.006 0.964±0.008 0.912±0.011 0.968±0.005
Gain (∆) +0.054 +0.009 +0.018 +0.003 +0.008 +0.002
GeoFusion-Late 0.833±0.007 0.843±0.010 0.831±0.007 0.949±0.004 0.939±0.008 0.966±0.004
Gain (∆) +0.063 +0.023 +0.007 -0.012 +0.035 0.000
Table 2: The mean average precision (mAP) of different models on the Geo-ATBench dataset. The rows labeled Gain (∆) represent the performance difference relative to the audio-only baseline. All metrics are averaged across 5 independent experimental runs.
Figure 11: Per-class average precision across 28 classes for the audio-only AST, the GSC-only baseline, and GeoFusion-Early-AST; the integration of geospatial information improves performance for multiple classes.
28-class multi-label tagging task, while no significant difference is observed between GeoFusion-Early-AST and GeoFusion-Inter-CLAP (p > 0.5), indicating comparable performance. After fine-tuning, AST yields the strongest overall performance on the 28-class Geo-AT task, followed by CLAP and then PANNs. This ordering differs from the zero-shot baseline ranking in Fig. 9. 
A key difference is that the zero-shot baseline predicts in the 527-class AudioSet label space and then maps the outputs to the 28 Geo-ATBench labels, whereas fine-tuned models are trained directly on Geo-ATBench labels.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 66,
    "total_chunks": 85,
    "char_count": 1585,
    "word_count": 201,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f5434eaf-d023-4d8e-86b8-c993ac5c610e",
    "text": "The label mapping can introduce label-aggregation and calibration effects that vary across backbones, and cross-dataset domain shift further limits direct transfer under direct inference. Fine-tuning removes the label-space mismatch by optimizing directly on Geo-ATBench targets, resulting in improved performance. Fig. 11 further shows the",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 67,
    "total_chunks": 85,
    "char_count": 305,
    "word_count": 40,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2ac3c28f-3c5e-48ac-9288-41994774b151",
    "text": "average precision of GeoFusion-Early-AST across Geo-ATBench event classes. 
In addition to the 28-class fine-grained tagging task, Table 2 shows a supplementary 3-class coarse-grained tagging task that groups the 28 events into Natural Sounds, Human Sounds, and Sounds of Things, as described in Section 4.2.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 68,
    "total_chunks": 85,
    "char_count": 307,
    "word_count": 44,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "79b8330a-a170-46aa-9b11-76ef68a00c76",
    "text": "On this coarse-grained task, GeoFusion-Inter-CLAP achieves the best performance. Representation-level fusion improves coarse-grained performance for all three backbones, suggesting that combining audio and GSC high-level representations with symmetric cross-modal attention in the GeoFusion-Inter framework is effective at this level of semantic granularity.
Figure 12: Per-class AP change (fusion minus audio-only) for the GeoFusion-Early-AST exemplar on the 28-class Geo-AT task.
To further explore which event labels benefit most from incorporating geospatial semantic context (GSC) under the Geo-AT task, Fig. 12 uses GeoFusion-Early-AST as an exemplar and visualizes per-label average precision (AP) for the audio-only and audio-GSC fusion variants. Fig. 12 also shows the per-label change ∆AP = AP(audio+GSC) − AP(audio), shown by the purple curve. 
Compared with the audio-only reference, incorporating GSC yields more than a 5% AP increase",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 69,
    "total_chunks": 85,
    "char_count": 914,
    "word_count": 124,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3599e207-1d95-476e-81ff-c3324dbfece3",
    "text": "for 17 of the 28 event classes. These 17 classes are grouped as GSC-benefiting events in this paper. Among them, Helicopter shows the largest gain, with an absolute change of ∆AP = 0.3378, corresponding to a relative increase of about +52.62% compared with the audio-only AP, which is consistent with the fact that helicopter sounds tend to occur in a limited set of places and are often associated with specific location semantics, making POI-derived GSC informative for disambiguation. For 9 of the 28 event classes, ∆AP remains within ±5%, and these classes are grouped as GSC-neutral events, such as Bell, Singing, and Footsteps, which are common everyday sounds and are often weakly tied to specific place semantics. 
It is worth noting that Explosion",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 70,
    "total_chunks": 85,
    "char_count": 757,
    "word_count": 124,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a6e214a2-e6d1-4d09-95e0-56042d96f52e",
    "text": "shows a near-zero change in this dataset; this pattern is consistent with the Explosion samples retrieved from Freesound.org [35] being dominated by daily activities such as fireworks. Finally, two classes, Speech and Laughter, show",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 71,
    "total_chunks": 85,
    "char_count": 232,
    "word_count": 34,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e96561fb-2c56-4fde-a070-dc29ab3cb238",
    "text": "decreases below −5% and are grouped as GSC-nonbenefiting events. This may be because speech and laughter are broadly distributed across locations, so associating them with POI-derived GSC does not help with recognition. Fig. 12 indicates that GSC helps for the recognition of a majority of sound event classes, has a limited impact on a subset of common sound event classes, and may not help for some widespread human vocalization events that are not related to specific locations and places. 
Human evaluation of the Geo-ATBench dataset To assess how well models trained on the Geo-ATBench dataset align with human auditory judgements, a crowdsourced human listening study is conducted. This study examines the correspondence between model predictions and human multi-label event judgements. Using the collected annotations, (1) annotation agreement is summarized with descriptive consistency measures and chance-corrected reliability statistics, and (2) model–human alignment is assessed by comparing model predictions with aggregated human consensus",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 72,
    "total_chunks": 85,
    "char_count": 1042,
    "word_count": 149,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0cfa8349-43a3-4cea-bbf9-616a4bbc4af2",
    "text": "labels at the clip level. Participants and Experimental Design Ten Chinese participants (3 females, 7 males; M = 22.4, SD = 0.70 years) took part in the assessment experiment. Participants shared a similar language background to support consistent understanding of the annotation interface and task instructions. The study adhered to the ethical guidelines of Xi'an Jiaotong-Liverpool University, and informed consent was obtained from all participants. 
To assess the perceptual validity of the Geo-ATBench labels and downstream model predictions, a within-subject human annotation experiment is conducted. Participants listen to 579 Geo-ATBench audio clips and indicate",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 73,
    "total_chunks": 85,
    "char_count": 639,
    "word_count": 91,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "aff7e198-3f40-4e41-90d4-6d9e07f73fdc",
    "text": "event presence by selecting \"exist\" whenever the corresponding event is clearly audible. Multiple events may be marked as present within a single clip, consistent with the multi-label tagging formulation in Geo-AT. Each participant receives all clips and a checklist of event categories consistent with the 28 Geo-ATBench event labels. Audio clips can be replayed until a confident judgment is reached. Participants are instructed to rely on auditory perception, consistent with the Geo-ATBench labelling procedure described in Section 4.2. The annotation task is split into short sessions and presented in random order to reduce fatigue and order effects. Each participant completes the annotation of all audio clips within 14 days. Human listening study results: reliability and model–human alignment Inter-rater reliability of human multi-label event annotations Descriptive agreement with the aggregated human consensus labels is computed for each participant. 
Across all participants, the mean agreement is 0.97, indicating that participants made similar decisions across audio clips and sound event categories.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 74, + "total_chunks": 85, + "char_count": 1073, + "word_count": 152, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "72da7c9a-4665-4321-ab16-0b12685f069c", + "text": "The annotation matrix is sparse, with only about 4.5% of clip–event positions marked as 1 (present). Such class imbalance inflates raw percent agreement, because the majority of annotations belong to the same (absent) category.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 75, + "total_chunks": 85, + "char_count": 227, + "word_count": 34, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9f200d29-198e-41a4-87fe-673cfa8dc515", + "text": "To obtain a chance-corrected estimate of reliability, Krippendorff's alpha for nominal data is computed. Each clip-by-event pair is treated as one item, yielding 16,212 items across ten participants. 
The resulting reliability coefficient is α_nominal(N = 16,212, R = 10) = 0.486, indicating moderate agreement among the 10 participants and suggesting variability in individual auditory experience and interpretation of multi-label polyphonic sound events.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 76,
    "total_chunks": 85,
    "char_count": 474,
    "word_count": 64,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "159f1067-130b-4211-9fb0-c8c73c371ab3",
    "text": "Given this moderate agreement, majority voting is used to derive clip-level consensus labels for each event as the reference for model–human alignment. After data collection, binary event matrices are generated for each participant and aggregated per clip–event pair: a value of 1 is assigned when at least 5 of 10 participants marked \"exist\", and 0 otherwise. Overall, the annotations show high raw agreement but only moderate chance-corrected reliability, which is expected under sparse binary multi-label tagging. Majority-vote consensus labels are used as the clip-level reference for model–human alignment, with cautious interpretation for low-prevalence events. The next subsection compares model predictions against the aggregated human consensus labels to quantify model–human alignment on Geo-ATBench. 
Model–human alignment under two label references To assess how sensitive model evaluation is to the choice of label reference, model predictions are evaluated against two label sets on the same test set of 579 clips: (i) the Geo-ATBench labels and (ii) the aggregated human consensus labels from 10 participants.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 77,
    "total_chunks": 85,
    "char_count": 1099,
    "word_count": 159,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a00eedc6-beb8-4976-a81d-6cd6f70ee4e9",
    "text": "A consensus threshold of 0.5 is used, meaning that an event is considered present when at least 5 of 10 annotators labeled it as present. This comparison aims to evaluate whether the Geo-ATBench label reference is consistent with independent human judgements. Results are reported for the audio-only-CLAP baseline and the GeoFusion-AT variant GeoFusion-Inter-CLAP, given its strong performance on the 28-class fine-grained and 3-class coarse-grained Geo-AT tasks. Paired Wilcoxon signed-rank tests are performed on the 28 per-class F1 scores under the two label references. 
The result shows that for the audio-only-CLAP, there is no",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 78,
    "total_chunks": 85,
    "char_count": 607,
    "word_count": 91,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "4d12679c-1a19-4d1f-bda9-92c3b0c2f016",
    "text": "statistically significant difference in its performance between Geo-ATBench labels and aggregated human consensus labels, and the same conclusion applies to the GeoFusion-Inter-CLAP. Specifically, paired Wilcoxon signed-rank
Event F1 (Label) F1 (Human) Dur.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 79,
    "total_chunks": 85,
    "char_count": 290,
    "word_count": 38,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "58942a32-313e-409e-9443-250d12290ae3",
    "text": "Bird sounds 0.861 0.856 8191
Speech 0.827 0.869 5133
Plane 0.779 0.514 3092
Crickets 0.883 0.836 3091
Car 0.491 0.404 3068
Falling water 0.872 0.786 2922
Flowing water 0.774 0.667 2774
Waves 0.756 0.677 2754
Footsteps 0.629 0.607 2225
Musical instru. 0.651 0.667 1593
Table 3: Top-10 event classes (total = 34,843 s ≈ 9.68 h) by descending total duration in Geo-ATBench. 
F1 (Label) and F1 (Human) are F1 scores of GeoFusion-Inter-CLAP predictions\nevaluated against Geo-ATBench labels and aggregated human consensus labels, respectively. Dur. denotes total event duration (seconds). Musical instru. denotes Musical instrument,\nand Falling water corresponds to Falling water/Rain.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 80, + "total_chunks": 85, + "char_count": 676, + "word_count": 100, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f66b4d97-9007-4511-bcf9-81bf0663e075", + "text": "tests indicated that the audio-only-CLAP's F1 score does not differ significantly (W = 181, p > 0.05) between Geo-ATBench labels (F1 = 0.628) and 10 participant human consensus labels (F1 = 0.570). For brevity, Table 3 reports per-class F1 scores for the ten event classes with the largest total duration, while the statistical test uses all 28 classes.", + "paper_id": "2603.10623", + "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context", + "authors": [ + "Yuanbo Hou", + "Yanru Wu", + "Qiaoqiao Ren", + "Shengchen Li", + "Stephen Roberts", + "Dick Botteldooren" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10623v1", + "chunk_index": 81, + "total_chunks": 85, + "char_count": 353, + "word_count": 58, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b15b24b1-4bc2-413a-969d-89cea43b3852", + "text": "A similar pattern is observed for the GeoFusion-Inter-CLAP, with stable F1 scores across Geo-ATBench labels (F1 = 0.649) and 10 participant human consensus labels (F1 = 0.592; W = 187, p > 0.05). 
Overall, model evaluation remains consistent under Geo-ATBench labels and aggregated human consensus labels on the annotated test set of 579 clips, and the paired Wilcoxon signed-rank tests do not indicate a statistically significant difference between the two label references.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 82,
    "total_chunks": 85,
    "char_count": 462,
    "word_count": 71,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ec544a66-e698-4a58-aea2-e2b68883da0c",
    "text": "This complements the inter-rater reliability analysis and supports Geo-ATBench as a human-aligned benchmark for clip-level evaluation. Environmental sound events do not exist in isolation: they are physical phenomena generated and perceived within specific geographic environments. Nevertheless, computational auditory scene analysis (CASA) often treats multi-label audio tagging as an audio-only inference problem. When sound events overlap, audio information can be insufficient to separate labels with similar acoustic patterns, and disambiguating cues may lie outside the waveform. In response to this gap, we introduce the Geospatial Audio Tagging (Geo-AT) task, which frames multi-label audio tagging conditioned on paired audio and geospatial semantic context (GSC). Geo-AT focuses on POI-derived, location-tied semantics as complementary priors that are not encoded in the waveform. This task-level formulation provides a principled foundation for integrating spatial semantics into machine listening, extending the scope of CASA beyond independent signal analysis. Geo-ATBench is released to support reproducible Geo-AT evaluation. 
Geo-ATBench contains 3,854 clips (10.71 hours) with 28 event classes of real-world",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 83,
    "total_chunks": 85,
    "char_count": 1192,
    "word_count": 157,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "643c6cbe-ff46-4fde-b5cf-df74902df4d8",
    "text": "polyphonic audio, and each clip is paired with a POI-derived GSC representation constructed from OSM annotations over 11 semantic context categories. By explicitly encoding the semantic characteristics of recording environments, Geo-ATBench addresses an important resource gap in the field. It enables controlled studies on how spatial context interacts with acoustic representations and offers a shared benchmark for evaluating geospatially grounded sound classification. GeoFusion-AT is introduced to report reference results on Geo-ATBench using feature-level, representation-level, and decision-level fusion with three AudioSet-pretrained backbones, PANNs, AST, and CLAP. Across backbones and fusion points, incorporating GSC is associated with improved mAP on the 28-class Geo-AT task in most configurations. A crowdsourced listening study with 10 participants further supports Geo-ATBench as a human-aligned reference on the annotated test set of 579 clips. 
Together, the Geo-AT task, the Geo-ATBench dataset, and the GeoFusion-AT framework provide a concrete basis for studying how location-tied semantics can complement acoustic representations in CASA.",
    "paper_id": "2603.10623",
    "title": "Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context",
    "authors": [
      "Yuanbo Hou",
      "Yanru Wu",
      "Qiaoqiao Ren",
      "Shengchen Li",
      "Stephen Roberts",
      "Dick Botteldooren"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10623v1",
    "chunk_index": 84,
    "total_chunks": 85,
    "char_count": 1142,
    "word_count": 150,
    "chunking_strategy": "semantic"
  }
]
\ No newline at end of file
diff --git a/data/chunks/2603.10624_semantic.json b/data/chunks/2603.10624_semantic.json
new file mode 100644
index 0000000000000000000000000000000000000000..71cc5c19a2a7b9cda0982e9d7acce8bf0adff92f
--- /dev/null
+++ b/data/chunks/2603.10624_semantic.json
@@ -0,0 +1,572 @@
[
  {
    "chunk_id": "63964349-f84f-4340-b52e-b3627e631980",
    "text": "Reinforcement Learning with Conditional Expectation Reward Changyi Xiao 1 Caijun Xu 1 Yixin Cao 1 to a given reference answer, and is typically implemented using carefully designed, domain-specific rules that enable deterministic verification (Guo et al., 2025; He et al., 2025). RLVR is particularly useful in mathematical reasoning tasks (Guo et al., 2025; Team et al., 2025), where answers admit canonical or easily normalized representations, allowing correctness to be verified reliably through exact matching or symbolic equivalence checks (Hugging Face, 2025). Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing the reasoning capabilities of large language models, particularly in domains such as mathematics where reliable rule-based verifiers can be constructed. However, the reliance on handcrafted, domain-specific verification rules substantially limits 
the applicability of RLVR to general reasoning domains with free-form answers, where valid answers often exhibit significant variability, making it difficult to establish complete and accurate rules. To address this limitation, we propose Conditional Expectation Reward (CER), which leverages the large language model itself as an implicit verifier, and is therefore applicable to general domains and eliminates the need for external verifiers or auxiliary models. CER is defined as the expected likelihood of generating the reference answer conditioned on the generated answer. In contrast to rule-based verifiers that yield binary feedback, CER provides a soft, graded reward signal that reflects varying degrees of correctness, making it better suited to tasks where answers vary in correctness. Experimental results demonstrate that CER is effective across a wide range of reasoning tasks, spanning both mathematical and general domains, indicating that CER serves as a flexible and general verification mechanism. However, RLVR remains difficult to extend to broader reasoning domains such as physics, chemistry, finance, and other domains with open-form answers (Ma et al., 2025; Zhou et al., 2025; Liu et al., 2025). In these domains, valid answers often exhibit diverse surface forms and substantial semantic variation, making it challenging to specify exhaustive verification rules. Consequently, constructing reliable verifiers becomes costly or even infeasible, which substantially constrains the applicability of RLVR to narrowly scoped tasks with well-defined answer spaces. Moreover, rule-based verifiers typically provide binary feedback, assigning rewards only to strictly equivalent answers while treating all other answers as equally incorrect. As a result, they are unable to assign positive rewards to answers that are partially correct, thereby providing limited learning signals during optimization. To address these issues, we propose Conditional Expectation Reward (CER) to extend RLVR to general reasoning domains. 
Rather than relying on external verification rules\nthat CER serves as a flexible and general verifica- or auxiliary verifier models, CER uses the large language\ntion mechanism. The code is available at https: model itself as an implicit verifier. By exploiting the model's\n//github.com/changyi7231/CER. internal consistency with respect to a reference answer, CER\nprovides a model-intrinsic reward signal that remains appli-arXiv:2603.10624v1 cable even when explicit verification is unavailable.\n1. Introduction Specifically, CER measures the expected probability of genReinforcement Learning with Verifiable Rewards (RLVR) erating the reference answer conditioned on the model's\nhas demonstrated strong effectiveness in incentivizing the generated answer, thereby producing a soft, graded reward\nreasoning capabilities of large language models, which re- signal to verify the generated answer. The underlying inlies on a verifier to provide accurate and checkable reward tuition is that when a generated answer is identical to, or\nsignals during learning (Zhou et al., 2025). Such a verifier strongly consistent with, the reference answer, the model\nevaluates the correctness of a generated answer with respect will assign a higher conditional probability to reproducing\nthe reference answer given that generation.\n1Fudan University.", + "paper_id": "2603.10624", + "title": "Reinforcement Learning with Conditional Expectation Reward", + "authors": [ + "Changyi Xiao", + "Caijun Xu", + "Yixin Cao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10624v1", + "chunk_index": 0, + "total_chunks": 30, + "char_count": 4233, + "word_count": 574, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6a500f4c-2e8d-4dbe-996e-178aeefb447b", + "text": "Correspondence to: Yixin Cao . 
We further show that CER can be theoretically interpreted as a smooth relaxation of the exact-match criterion, yielding reward values that reflect varying degrees of consistency between the generated and reference answers. This characteristic is particularly well suited to general reasoning domains, where partial correctness, semantic overlap, and multiple valid surface realizations are common.\nWe finally conduct experiments to demonstrate the effectiveness of CER. We show that CER achieves strong performance across general domains, both mathematical and non-mathematical. These findings highlight CER as a general and robust reward mechanism for RLVR, offering a practical solution for extending reinforcement learning to a wide range of reasoning domains.\nHere, D is the distribution of q, and f(a, a∗(q)) is a reward that evaluates the correctness of the generated answer a with respect to the reference answer a∗ associated with question q, which is computed by a verifier f(·, ·). In practice, such a verifier is often implemented as a set of carefully designed rules (Hugging Face, 2025). Rule-based verifiers are particularly effective in domains such as mathematics and code generation, where answers admit unambiguous representations and equivalence can be precisely defined.\nHowever, extending rule-based verification to general reasoning domains remains challenging (Ma et al., 2025; Zhou et al., 2025). In these domains, valid answers are often free-form and exhibit substantial variation. Consequently, it is difficult to design a rule-based verifier that is both com-\n2.",
Specifically, for a given question q with a unique reference answer a∗ = a∗(q), the policy model πθ autoregressively generates a solution s and a final answer a to address the question. Here, the solution s does not include the final answer a.\nrect but lexically different answers into the same category as incorrect ones, leading to overly sparse and noisy reward signals. Such rigid verification discourages exploration of diverse yet correct answers and hampers effective learning in general reasoning settings.",
Question q: What is the value of x in the equation\n2x + 3 = 7?", + "paper_id": "2603.10624", + "title": "Reinforcement Learning with Conditional Expectation Reward", + "authors": [ + "Changyi Xiao", + "Caijun Xu", + "Yixin Cao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10624v1", + "chunk_index": 4, + "total_chunks": 30, + "char_count": 483, + "word_count": 81, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2fbffa40-c0e0-464e-b830-26a82045a2c4", + "text": "Definition To generalize RLVR to general domains, we Solution s: Solve the equation step by step.\npropose the CER, which leverages the policy model itself First, subtract 3 from both sides:\nas an implicit verifier, without relying on external verifiers\n2x + 3 = 7 ⇒2x = 4. or models. Instead of explicitly checking answer correctness, CER evaluates the internal consistency of the policy\nNext, divide both sides by 2: model with respect to a reference answer, thereby enabling\napplicability to general domains.\nx = 2. For a quadruple (q, s, a, a∗), we define CER as:\nTherefore, the value of x is\nρ(a, a∗) :=Es′∼πθ(·|q) πθ(a∗|s′, q) A = a Answer a: 2\nReference answer a∗: 2 =Es′∼πθ(·|q,a) πθ(a∗|s′, q) . (3) RLVR is formulated as the optimization of the following CER measures the expected likelihood of generating the\nexpected reward: reference answer a∗given the condition that the model has\ngenerated an answer a. 
Lf(θ) = E_{q∼D, (s,a)∼πθ(·|q)}[f(a, a∗(q))]. (2)\nThe intuition is that if the generated answer a is identical to, or strongly correlated with, the",
CER value (0.00004), as it is numerically distant from the reference answer.\nQuestion q: How many positive multiples of 7 that are less than 1000 end with the digit 3?\nReference answer a∗: 14\nGenerated Answers and CER (a, ρ(a, a∗)): {(14, 0.752), (13, 0.313), (94, 0.00004)}\nSee Appendix A for the proof. This shows that conditioning on the event that the policy has generated a∗ strictly increases the posterior predictive probability of regenerating a∗, unless the policy assigns an identical likelihood to a∗ across all (q, s) pairs with πθ(s|q) > 0. Therefore, CER exhibits a self-consistency amplification effect via posterior reweighting toward higher probability assigned to the reference answer in the exact-match case.\nProperties We summarize several fundamental properties of CER, which demonstrate its effectiveness.\n• Boundedness.\n• Equivalence.\nTheorem 2 (Value Equivalence).",
cates whether a exactly matches a∗(q).\nAs a result,",
ρ(a, a∗) reaches its maximum value only when the policy assigns probability 1 to the reference answer a∗.\nMoreover, CER is value-equivalent in expectation to the exact-match objective while providing a continuous, graded reward signal, thereby serving as a soft generalization of exact-match rewards in general domains.\nEmpirical CER We next derive an empirical form of CER, as the definition in Eq. (3) is intractable due to the summation over all possible outcomes under πθ(a∗|s′, q). To obtain a computable approximation, we apply Bayes' rule and Monte Carlo sampling to derive an empirical estimator of CER:\nρ(a, a∗) = E_{s′∼πθ(·|q)}[πθ(a∗|s′, q) | A = a]\n= Σ_{s′} πθ(s′|q, a) πθ(a∗|s′, q)\n= Σ_{s′} πθ(s′|q) πθ(a|s′, q) πθ(a∗|s′, q) / Σ_{s′′} πθ(s′′|q) πθ(a|s′′, q)\n≈ Σ_{j=1}^{M} πθ(a|sj, q) πθ(a∗|sj, q) / Σ_{j=1}^{M} πθ(a|sj, q), sj ∼ πθ(·|q). (4)\nFor each q, we sample N independent (si, ai) from πθ(·|q) for estimating the gradient. The reward R(q, si, ai, a∗) is treated as a fixed scalar with respect to θ during optimization to detach it from gradient computation for stable policy learning (Ziegler et al., 2019; Ouyang et al., 2022).\nEfficiency We now describe an efficient procedure for computing CER by reusing samples, avoiding redundant computations, and adjusting the hyperparameter. As shown in Eq. (5), computing CER requires sampling M independent solutions {sj}_{j=1}^{M} from πθ(·|q). However, CER can be seamlessly integrated into policy gradient without additional sampling. Specifically, for each question q, we already sample N independent {si}_{i=1}^{N} from πθ(·|q) to estimate the policy gradient. 
These same samples can be directly reused for reward computation by setting {sj}_{j=1}^{M} := {si}_{i=1}^{N}.",
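The sample-reuse estimator can be sketched in a few lines. Below is a minimal sketch of the normalized likelihood-weighted average of Eq. (5); the probability tables are toy stand-ins for πθ(a|sj, q) and πθ(a∗|sj, q) (assumed values for illustration — in practice both come from forward passes of the language model, and the helper name `empirical_cer` is ours, not the paper's):

```python
def empirical_cer(p_a_given_s, p_ref_given_s):
    """Eq. (5): R = sum_j p(a|s_j,q) * p(a*|s_j,q) / sum_j p(a|s_j,q),
    a likelihood-weighted average of p(a*|s_j,q) over reused solutions s_j."""
    num = sum(pa * pr for pa, pr in zip(p_a_given_s, p_ref_given_s))
    den = sum(p_a_given_s)
    return num / den if den > 0 else 0.0

# Toy likelihoods for M = 4 reused solutions s_1..s_4 (illustrative only):
p_a   = [0.5, 0.2, 0.1, 0.2]   # pi_theta(a  | s_j, q), generated answer a
p_ref = [0.9, 0.8, 0.1, 0.7]   # pi_theta(a* | s_j, q), reference answer a*

reward = empirical_cer(p_a, p_ref)   # a weighted average, so 0 <= reward <= 1
```

Solutions that assign high probability to both the generated and the reference answer dominate the average, so joint consistency under the same conditional context yields a larger reward.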
(8) still requires M(N + 1) forward passes to compute the entries of W and P.\nObjective We finally define the objective based on the",
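The tensorized computation R = D⁻¹WP can be sketched with NumPy. The small W and P below are assumed toy values (in practice the entries Wij = πθ(ai|sj, q) and Pj = πθ(a∗|sj, q) come from model forward passes):

```python
import numpy as np

# Toy likelihood tables for N = M = 3 (assumed values, for illustration):
# W[i, j] = pi(a_i | s_j, q), P[j] = pi(a* | s_j, q).
W = np.array([[0.6, 0.3, 0.1],
              [0.6, 0.3, 0.1],   # a_2 identical to a_1 -> identical row
              [0.1, 0.1, 0.8]])
P = np.array([0.9, 0.7, 0.05])

row_sums = W.sum(axis=1)         # the diagonal of D
R = (W @ P) / row_sums           # Eq. (8): R = D^{-1} W P, one CER per answer
```

Because identical answers produce identical rows of W, they receive identical rewards, so the reward only needs to be computed once per unique answer.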
Finally, in Section 3.4, we provide a detailed visualization of the CER computation process to offer further insights into its behavior and properties.",
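The detached-reward policy gradient of Eq. (7) can be sketched with a toy softmax policy standing in for πθ (all numbers are assumed; for a categorical policy the score-function gradient has the closed form used below):

```python
import numpy as np

def reinforce_grad(logits, samples, rewards):
    """Analytic gradient of -(1/N) * sum_i R_i * log pi(k_i) for a softmax
    policy, with each reward R_i held fixed (detached), as in Eq. (7):
    grad_j = -(1/N) * sum_i R_i * (1[j == k_i] - pi_j)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = np.zeros_like(logits)
    n = len(samples)
    for k, r in zip(samples, rewards):
        onehot = np.zeros_like(logits)
        onehot[k] = 1.0
        grad -= r * (onehot - probs) / n
    return grad

# Toy 4-way policy over candidate (solution, answer) outcomes.
logits = np.zeros(4)
samples = [0, 1, 0, 2]                      # sampled outcome indices
rewards = [0.752, 0.313, 0.752, 0.00004]    # CER values enter as fixed scalars

g = reinforce_grad(logits, samples, rewards)
```

Gradient descent on this loss raises the probability of the highest-CER outcome (g[0] < 0), while the rewards themselves never receive gradients, matching the detachment used for stable policy learning.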
(8).", + "paper_id": "2603.10624", + "title": "Reinforcement Learning with Conditional Expectation Reward", + "authors": [ + "Changyi Xiao", + "Caijun Xu", + "Yixin Cao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10624v1", + "chunk_index": 12, + "total_chunks": 30, + "char_count": 795, + "word_count": 273, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7111cd0c-a5a4-47de-931e-5db27b6f206f", + "text": "Settings Hyperparameter settings We set the batch size of questions to 32, the number of solutions N to 16, M in Eq. (5)\nDatasets We evaluate the performance of CER across to 16, the learning rate to 10−6, epoch to 1 for WebInstruct\nboth mathematical and general reasoning domains. Acdataset and epoch to 5 for MATH-7.5K dataset. For traincordingly, we train models on two datasets: the mathemating, we use temperature = 1.0 and top-p = 1.0, while for\nical dataset MATH-7.5K (Hendrycks et al., 2021) and the\nevaluation we use temperature = 0.6, top-p = 0.95 and\ngeneral-domain dataset WebInstruct (Ma et al., 2025). The maximum question length is set to 2048\nWebInstruct, we retain only non-mathematical questions at\ntokens, and the maximum output length is set to 4096 tothe university difficulty level to focus on general-domain\nkens for training and 8192 for evaluation. 
We utilize RLOO\nbeyond mathematics, yielding a dataset of 50K questions.\n(Kool et al., 2019) as the optimization method.", + "paper_id": "2603.10624", + "title": "Reinforcement Learning with Conditional Expectation Reward", + "authors": [ + "Changyi Xiao", + "Caijun Xu", + "Yixin Cao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10624v1", + "chunk_index": 13, + "total_chunks": 30, + "char_count": 994, + "word_count": 163, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ac8d389b-cbee-4e08-90eb-2d1780ecd3c6", + "text": "WebInstruct includes domains such as physics, chemistry,\nbiology, finance and so on. This subset spans a wide range\n3.2. Results\nof disciplines, such as physics, chemistry and biology. CER demonstrates strong generality across domains. On the general-domain training dataset (Table 1), CEREvaluation We evaluate the models on four mathematical\nachieves the highest average performance among all com-datasets, MATH500 (Lightman et al., 2023), AMC23 (Art\npared methods for both Qwen3-4B-Base and Qwen3-8B-of Problem Solving, 2025b), AIME2024 and AIME2025\nBase (except the combined method Rule+CER). In partic-(Art of Problem Solving, 2025a), and two general-domain\nular, CER consistently outperforms exact-match rewardsdatasets, SuperGPQA (Du et al., 2025) and MMLU-Pro\nand the perplexity-based rewards VeriFree, and exceeds(Wang et al., 2024). Performance is reported using the\nthe performance of rule-based verifiers and learned verifierspass@1 metric. For the mathematical datasets, pass@1 is\nGeneral-verifier across most evaluation benchmarks. The ad-computed using a rule-based verifier (Hugging Face, 2025).\nvantage of CER is especially pronounced on general-domainFor the general-domain datasets, which consist of multipleevaluation datasets such as MMLU-Pro and SuperGPQA,choice questions, pass@1 is computed via exact matching.\nwhere it achieves consistent performance gains. 
Notably, this advantage holds without relying on domain-specific handcrafted rules or models.\nWhen trained on the mathematical dataset (Table 2), CER attains performance comparable to rule-based rewards and outperforms learned-verifier approaches. Despite the absence of an external verifier, CER maintains strong results across mathematical benchmarks, indicating that it does not overfit to a specific domain. Taken together, these results suggest that CER can serve as a unified reward formulation applicable to both general-domain and mathematics-oriented reasoning tasks.\nFor each mathematical dataset, we conduct 16 evaluation runs and report the average performance.\nBaselines We compare CER with several baseline verifiers: the exact-match verifier, which checks whether the generated answer exactly matches the reference answer; a model-based verifier, General-verifier (Ma et al., 2025), which employs an external large language model to assess answer correctness; a perplexity-based verifier, VeriFree (Zhou et al., 2025), which uses the perplexity of the reference answer for verification; and a rule-based verifier (Hugging Face, 2025), which verifies correctness using handcrafted rules. 
",
Methods MATH500 AMC23 AIME2024 AIME2025 MMLU-Pro SuperGPQA Average\nQwen3-4B-Base:\nBase 62.6 40.5 10.6 8.1 42.0 21.0 30.8\nExact-match 81.2 57.3 17.7 15.6 46.7 24.2 40.5\nRule-based 84.2 63.1 22.9 21.5 62.5 32.2 47.7\nVeriFree 81.5 62.7 19.8 18.1 59.9 30.9 45.5\nGeneral-verifier 83.6 63.0 19.8 19.0 58.5 30.9 45.7\nCER 84.1 63.6 24.8 20.0 60.8 32.1 47.6\nRule+CER 85.0 67.5 23.3 20.8 61.2 31.3 48.2\nQwen3-8B-Base:\nBase 73.9 53.1 14.6 12.3 51.9 27.0 38.9\nExact-match 80.0 63.4 16.5 13.5 61.4 33.5 44.7\nRule-based 86.7 70.2 26.3 22.7 65.8 35.1 51.1\nVeriFree 85.0 68.4 22.9 20.2 62.3 32.4 48.5\nGeneral-verifier 86.0 69.4 26.7 20.6 64.3 34.9 50.3\nCER 87.2 70.9 23.8 23.1 64.8 35.0 50.8\nRule+CER 87.3 72.0 26.5 21.0 65.6 36.0 51.4\nCER substantially improves over exact-match rewards. As established in Theorem 2, CER can be viewed as a soft generalization of the hard exact-match reward. While exact-match provides a binary signal that only distinguishes perfectly correct answers from all others, CER assigns continuous-valued rewards that reflect partial correctness with respect to the reference answer. Empirically, this difference translates into consistent gains over exact-match training across both datasets and model scales. The graded feedback provided by CER yields denser and more informative learning signals, which is particularly beneficial when correct answers admit surface-level variations.\nCER is complementary to rule-based rewards. We further investigate a simple yet effective strategy for combining CER with rule-based rewards, in which the final reward is defined as the average of the CER score and the rule-based reward. As shown in Tables 1 and 2, the combined approach (Rule+CER) generally achieves better performance than either method used in isolation. This demonstrates that integrating them yields a more informative training signal. On the general-domain training dataset, CER enhances rule-based methods by providing graded rewards: rule-based schemes assign positive reward only to strictly equivalent\n3.4. Visualization of CER Computing\nTo better illustrate the computing of CER, we visualize the components involved in Eq. (8).",
In this domain, rule-based methods more\nFigure 2 presents a training example.", + "paper_id": "2603.10624", + "title": "Reinforcement Learning with Conditional Expectation Reward", + "authors": [ + "Changyi Xiao", + "Caijun Xu", + "Yixin Cao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10624v1", + "chunk_index": 18, + "total_chunks": 30, + "char_count": 773, + "word_count": 119, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "257e7172-bfe9-47d9-b6e7-2456d710bb60", + "text": "The left panel shows\nreliably capture mathematical equivalence, thereby correctthe question q, the reference answer a∗, and the 16 generated\ning errors introduced by imperfect similarity estimation in\nanswers {ai}. The right panel visualizes the vectors R andCER. Overall, these results highlight the complementary\nP , together with D−1W . Specifically, the left column\nstrengths of CER and rule-based rewards and motivate their\ncorresponds to the reward vector R, the central block depicts\ncombined use across different domains.\nthe normalized matrix D−1W with each row summing to\none, and the right column corresponds to the vector P .\n3.3.", + "paper_id": "2603.10624", + "title": "Reinforcement Learning with Conditional Expectation Reward", + "authors": [ + "Changyi Xiao", + "Caijun Xu", + "Yixin Cao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10624v1", + "chunk_index": 19, + "total_chunks": 30, + "char_count": 642, + "word_count": 99, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b6c3c6bf-ec57-457d-9c90-32ef366111be", + "text": "Efficiency\nSeveral observations can be drawn from this visualization. We analyze the computational efficiency of various rewards First, CER effectively captures surface-level variation\nand their corresponding performance. 
In CER, the computational cost is governed by the hyperparameter M in Eq. (5), which specifies the number of samples used to estimate the reward. By tuning M, CER enables a flexible mechanism to balance runtime efficiency and reward fidelity.\nTable 3 reports runtime and average performance across methods, with all experiments conducted on four NVIDIA H100 GPUs. For CER, increasing M improves performance at the cost of higher runtime overhead. Empirically, CER exhibits a smooth and controllable trade-off, enabling practitioners to select M that balances efficiency and performance under given computational constraints. Exact-match rewards incur the lowest overhead but yield inferior performance. CER with smaller M and rule-based rewards achieve reasonable performance while remaining efficient, whereas CER with large M, VeriFree, and General-verifier incur higher runtime costs due to multiple large language model queries during reward computation.\nand semantic similarity among answers, which is particularly important for questions with free-form answers. In this example, the 16 generated answers contain 10 unique surface forms that nevertheless share similar semantics, such as \"No, quantum physics is generally considered non-deterministic.\" and \"No, quantum physics is not deterministic.\" CER assigns positive rewards to all such semantically consistent answers, whereas exact-match or rule-based methods would assign positive reward only to the strictly matching answer \"No\". This demonstrates that CER provides richer and more informative reward signals for general-domain reasoning tasks. Second, answers that receive higher CER rewards tend to exhibit stronger alignment with solutions that also assign high likelihood to the reference answer. Since most entries of P are relatively large, this alignment can be examined by inspecting the sparsity patterns of the normalized matrix D−1W . 
From top to bottom, the rows of the normalized matrix become increasingly sparse, which results in smaller CER values. This trend is consistent with\nTable 3.",
    "paper_id": "2603.10624",
    "title": "Reinforcement Learning with Conditional Expectation Reward",
    "authors": [
      "Changyi Xiao",
      "Caijun Xu",
      "Yixin Cao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10624v1",
    "chunk_index": 20,
    "total_chunks": 30,
    "char_count": 2291,
    "word_count": 329,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2bdfacc8-0962-43fd-9a73-3b463485b413",
    "text": "the formulation in Eq. (5), where reduced overlap with other solutions leads to a lower reward.\nThe average performance across six datasets and corresponding runtime for each method.\nModel Performance Runtime\nExact-match 46.5 45.2h\nRule 47.6 54.7h\nVeriFree 44.9 58.7h\nGeneral-verifier 47.7 57.5h\nCER (M=1) 46.4 47.0h\nCER (M=2) 47.7 52.2h\nCER (M=4) 48.0 55.6h\nCER (M=8) 48.2 59.3h\nCER (M=16) 48.7 67.4h\nThird, the visualization suggests that increasing the value of M in Eq. (5) can improve performance. A larger M can yield a denser and more stable normalized matrix, leading to a more accurate estimation of CER and, consequently, improved performance.\nFinally, the figure also illustrates that identical answers receive identical CER rewards. For example, since a1 = a2, the corresponding rows in the normalized matrix are identical, which results in equal CER values for a1 and a2. This property reflects the consistency of CER with respect to repeated answer instances. 
Reinforcement Learning with Conditional Expectation Reward", + "paper_id": "2603.10624", + "title": "Reinforcement Learning with Conditional Expectation Reward", + "authors": [ + "Changyi Xiao", + "Caijun Xu", + "Yixin Cao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10624v1", + "chunk_index": 21, + "total_chunks": 30, + "char_count": 1031, + "word_count": 158, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d4e7ac05-346d-4540-abc7-5d3b51010001", + "text": "This figure illustrates the computation of CER as defined in Eq. (8). The left panel shows the question, the reference answer,\nand the 16 generated answers. The right panel depicts the components: the reward vector R (left column), the row-normalized matrix\nD−1W (central block), and the reference-likelihood vector P (right column).", + "paper_id": "2603.10624", + "title": "Reinforcement Learning with Conditional Expectation Reward", + "authors": [ + "Changyi Xiao", + "Caijun Xu", + "Yixin Cao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10624v1", + "chunk_index": 22, + "total_chunks": 30, + "char_count": 333, + "word_count": 51, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c1b4d34e-656d-4163-8179-1876686427fa", + "text": "Related Work answer under the large language model. VeriFree (Zhou\net al., 2025) combines perplexity-based rewards with variRLVR RLVR (Lambert et al., 2024; Guo et al., 2025; ance reduction techniques to construct a training objective. Team et al., 2025) has emerged as a prominent paradigm Building on this line of work, Nover (Liu et al., 2025) infor improving the reasoning performance of large language troduces length normalization to mitigate length bias, while\nmodels. 
RLVR relies on rule-based verifiers to provide ac- RLPR (Yu et al., 2025) reformulates perplexity as a sum\ncurate and stable reward signals, such as the math-verify of token-level probabilities to further address sensitivity to\nlibrary (Hugging Face, 2025) for mathematical reasoning answer length.\ntasks (Guo et al., 2025) and the SandboxFusion toolbox\n(Cheng et al., 2024) for code generation (Luo et al., 2025; In contrast to both model-based and perplexity-based apHe et al., 2025). These rule-based verification methods are proaches, CER leverages self-consistency between generparticularly effective in domains where answers admit un- ated answers and the reference answer to produce soft,\nambiguous representations and deterministic equivalence graded, and model-intrinsic reward signals.", + "paper_id": "2603.10624", + "title": "Reinforcement Learning with Conditional Expectation Reward", + "authors": [ + "Changyi Xiao", + "Caijun Xu", + "Yixin Cao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10624v1", + "chunk_index": 23, + "total_chunks": 30, + "char_count": 1271, + "word_count": 185, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7251cffc-a8e7-451a-b763-90b97233f736", + "text": "This design\nrules can be readily constructed. However, their applica- enables reliable feedback without requiring additional verbility is limited in general reasoning domains, where valid ifier models or handcrafted rules, making CER applicable\nanswers are often open-form and exhibit substantial surface across a wide range of general reasoning domains.\nvariation. In contrast, CER aims to extend RLVR to such\ngeneral domains. 5. In this paper, we propose CER as a general framework for exGeneral Domains Existing verification methods applicatending RLVR beyond domains that rely on strict, rule-based\nble to general reasoning domains can be broadly categorized\nverification. 
By leveraging the large language model itself\ninto model-based verifiers and perplexity-based verifiers.\nas an implicit verifier, CER produces soft, graded reward sigModel-based verifiers employ a fine-tuned large language nals that reflect partial correctness and semantic consistency,\nmodel to assess the correctness of a generated answer thereby overcoming the limitations of binary rule-based\nwith respect to a reference answer.", + "paper_id": "2603.10624", + "title": "Reinforcement Learning with Conditional Expectation Reward", + "authors": [ + "Changyi Xiao", + "Caijun Xu", + "Yixin Cao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10624v1", + "chunk_index": 24, + "total_chunks": 30, + "char_count": 1109, + "word_count": 156, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "482fb879-78f4-4925-be06-0ab20215acc4", + "text": "For instance, Kimi- feedback. Our theoretical analysis shows that CER can be\nk1.5 (Team et al., 2025) fine-tunes a model on large-scale viewed as a smooth relaxation of exact-match evaluation,\nverification data to endow it with verification capabilities. providing a principled connection to conventional verifiable\nGeneral-Verifier (Ma et al., 2025) further develops a gen- rewards. Empirically, we demonstrate that CER is effective\nerative, model-based verifier trained specifically for chain- across both mathematical and general-domain reasoning\nof-thought answer verification, enabling more nuanced and tasks. Together, these results indicate that CER offers a\ncontext-aware judgments. 
flexible and broadly applicable mechanism for guiding reinforcement learning in large language models, enabling more\nPerplexity-based verifiers, in contrast, define reward siggeneral and robust reasoning capabilities.\nnals based on the likelihood or perplexity of the reference Reinforcement Learning with Conditional Expectation Reward Impact Statement Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker,\nB., Lee, T., Leike, J., Schulman, J., Sutskever, I., and\nThis paper seeks to advance the state of machine learning Cobbe, K. Let's verify step by step. In The Twelfth\nby introducing new insights and techniques. Progress in this International Conference on Learning Representations,\nfield can enable improvements across many domains, but the 2023.\nwork presented here is foundational. Consequently, no specific positive or negative impacts are uniquely attributable Liu, W., Qi, S., Wang, X., Qian, C., Du, Y., and He,\nto this work.", + "paper_id": "2603.10624", + "title": "Reinforcement Learning with Conditional Expectation Reward", + "authors": [ + "Changyi Xiao", + "Caijun Xu", + "Yixin Cao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10624v1", + "chunk_index": 25, + "total_chunks": 30, + "char_count": 1635, + "word_count": 230, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3cbd1634-003d-4c8b-8e3e-7c34409c44a6", + "text": "Nover: Incentive training for language models\nvia verifier-free reinforcement learning. arXiv preprint\nReferences arXiv:2505.16022, 2025. Art of Problem Solving. Aime problems and solutions, Luo, M., Tan, S., Huang, R., Patel, A., Ariyak, A., Wu, Q.,\n2025a. URL https://artofproblemsolving. Shi, X., Xin, R., Cai, C., Weber, M., et al. Deepcoder:\ncom/wiki/index.php/AIME_Problems_and_ A fully open-source 14b coder at o3-mini level. Art of Problem Solving. Amc problems and solutions, Ma, X., Liu, Q., Jiang, D., Zhang, G., Ma, Z., and Chen,\n2025b. 
URL https://artofproblemsolving. General-reasoner: Advancing llm reasoning across all\ncom/wiki/index.php?title=AMC_ domains. arXiv preprint arXiv:2505.14652, 2025. Problems_and_Solutions. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.,\nCheng, Y., Chen, J., Chen, J., Chen, L., Chen, L., Chen, W., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A.,\nChen, Z., Geng, S., Li, A., Li, B., et al. Fullstack bench: et al. Training language models to follow instructions\nEvaluating llms as full stack coders. arXiv preprint with human feedback. Advances in neural information\narXiv:2412.00535, 2024. processing systems, 35:27730–27744, 2022.", + "paper_id": "2603.10624", + "title": "Reinforcement Learning with Conditional Expectation Reward", + "authors": [ + "Changyi Xiao", + "Caijun Xu", + "Yixin Cao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10624v1", + "chunk_index": 26, + "total_chunks": 30, + "char_count": 1200, + "word_count": 167, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5b0d791a-3295-42df-bbad-2a5d473b7017", + "text": "Du, X., Yao, Y., Ma, K., Wang, B., Zheng, T., Zhu, K., Liu, Team, K., Du, A., Gao, B., Xing, B., Jiang, C., Chen, C.,\nM., Liang, Y., Jin, X., Wei, Z., et al. Supergpqa: Scaling Li, C., Xiao, C., Du, C., Liao, C., et al. Kimi k1. 5:\nllm evaluation across 285 graduate disciplines. arXiv Scaling reinforcement learning with llms. arXiv preprint\npreprint arXiv:2502.14739, 2025. arXiv:2501.12599, 2025. Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S.,\nZhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: In- Ren, W., Arulraj, A., He, X., Jiang, Z., et al. Mmlu-pro:\ncentivizing reasoning capability in llms via reinforcement A more robust and challenging multi-task language unlearning. arXiv preprint arXiv:2501.12948, 2025. derstanding benchmark. 
Advances in Neural Information\nHe, J., Liu, J., Liu, C. Y., Yan, R., Wang, C., Cheng, P., Processing Systems, 37:95266–95290, 2024. Zhang, X., Zhang, F., Xu, J., Shen, W., et al. SkyYu, T., Ji, B., Wang, S., Yao, S., Wang, Z., Cui, G., Yuan,\nwork open reasoner 1 technical report. arXiv preprint\nL., Ding, N., Yao, Y., Liu, Z., et al. Rlpr: Extrapolating\nrlvr to general domains without verifiers. arXiv preprint\nHendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, arXiv:2506.18254, 2025.", + "paper_id": "2603.10624", + "title": "Reinforcement Learning with Conditional Expectation Reward", + "authors": [ + "Changyi Xiao", + "Caijun Xu", + "Yixin Cao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10624v1", + "chunk_index": 27, + "total_chunks": 30, + "char_count": 1313, + "word_count": 218, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c0a77a84-2cc6-4974-95ab-811aa47b5f39", + "text": "S., Tang, E., Song, D., and Steinhardt, J. Measuring mathZhang, K., Zuo, Y., He, B., Sun, Y., Liu, R., Jiang, C., ematical problem solving with the math dataset. arXiv\nFan, Y., Tian, K., Jia, G., Li, P., et al. A survey of preprint arXiv:2103.03874, 2021.\nreinforcement learning for large reasoning models. arXiv\nHugging Face. Math-verify: A robust mathematical ex- preprint arXiv:2509.08827, 2025.\npression evaluation system. GitHub repository and\nZhou, X., Liu, Z., Sims, A., Wang, H., Pang, T., Li, C., Python package, 2025. https://github.com/\nhuggingface/Math-Verify, version 0.8.0. Wang, L., Lin, M., and Du, C. Reinforcing general reasoning without verifiers. arXiv preprint arXiv:2505.21493,\nKool, W., van Hoof, H., and Welling, M. 
FORCE samples, get a baseline for free!, 2019.",
    "paper_id": "2603.10624",
    "title": "Reinforcement Learning with Conditional Expectation Reward",
    "authors": [
      "Changyi Xiao",
      "Caijun Xu",
      "Yixin Cao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10624v1",
    "chunk_index": 28,
    "total_chunks": 30,
    "char_count": 786,
    "word_count": 118,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "cb854284-54bc-419f-b4e3-195413c390b8",
    "text": "M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.\nLambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. V., Liu, A., Dziri, N., Lyu, S., et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.\nReinforcement Learning with Conditional Expectation Reward\nTheorem 1 (Exact-Match Case). If a = a∗, then\nρ(a∗, a∗) = Es′∼πθ(·|q)[πθ(a∗|s′, q) | A = a∗] = Es′∼πθ(·|q,a∗)[πθ(a∗|s′, q)] ≥ Es∼πθ(·|q)[πθ(a∗|s, q)],\nwith equality if and only if πθ(a∗|s, q) is constant over all (q, s) such that πθ(s|q) > 0.\nSince ρ(a, a∗) is defined for a generated answer a, we always have the probability Pr(A = a) > 0. In particular, if a = a∗, then Pr(A = a∗) > 0, so conditioning on A = a∗ is well defined. By Bayes' rule,\nEs′∼πθ(·|q)[πθ(a∗|s′, q) | A = a∗] = Es∼πθ(·|q)[πθ(a∗|s, q) I(A = a∗)] / Pr(A = a∗) = Es∼πθ(·|q)[πθ(a∗|s, q) Pr(A = a∗|q, s)] / Es∼πθ(·|q)[Pr(A = a∗|q, s)] = Es∼πθ(·|q)[πθ(a∗|s, q)²] / Es∼πθ(·|q)[πθ(a∗|s, q)].\nHence,\nρ(a∗, a∗) = Es∼πθ(·|q)[πθ(a∗|s, q)²] / 
Es∼πθ(·|q)[πθ(a∗|s, q)]",
    "paper_id": "2603.10624",
    "title": "Reinforcement Learning with Conditional Expectation Reward",
    "authors": [
      "Changyi Xiao",
      "Caijun Xu",
      "Yixin Cao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10624v1",
    "chunk_index": 29,
    "total_chunks": 30,
    "char_count": 1186,
    "word_count": 197,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "6c95ebdf-ae19-4a8d-8156-13ec8ae62550",
    "text": "Since πθ(a∗|s, q) ≥ 0, Jensen's inequality (or equivalently E[X²] ≥ E[X]²) implies\nρ(a∗, a∗) ≥ Es∼πθ(·|q)[πθ(a∗|s, q)],\nwhich proves the desired inequality. Equality holds if and only if πθ(a∗|s, q) is constant over all (q, s) such that πθ(s|q) > 0.\nTheorem 2 (Value Equivalence).\nLρ(θ) = Eq∼D,(s,a)∼πθ(·|q)[ρ(a, a∗(q))] = Eq∼D,(s,a)∼πθ(·|q)[I(a = a∗(q))],\ni.e., the expected CER objective is equivalent in value to the exact-match objective, where I(a = a∗(q)) indicates whether a exactly matches a∗(q).\nBy definition,\nLρ(θ) = Eq∼D,(s,a)∼πθ(·|q)[ρ(a, a∗(q))] = Eq∼D[Σs,a πθ(s, a|q) ρ(a, a∗(q))].\nUsing the definition of ρ,\nLρ(θ) = Eq∼D[Σs,a πθ(s, a|q) Σs′ πθ(s′|q, a) πθ(a∗(q)|s′, q)] = Eq∼D[Σs′ πθ(a∗(q)|s′, q) Σs,a πθ(s, a|q) πθ(s′|q, a)]. Reinforcement Learning with Conditional Expectation Reward",
    "paper_id": "2603.10624",
    "title": "Reinforcement Learning with Conditional Expectation Reward",
    "authors": [
      "Changyi Xiao",
      "Caijun Xu",
      "Yixin Cao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10624v1",
    "chunk_index": 30,
    "total_chunks": 30,
    "char_count": 811,
    "word_count": 131,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "507b2fd2-a32d-4620-bee6-d94e4f7c4dc4",
    "text": "For fixed q, we have\nΣs,a πθ(s, a|q) πθ(s′|q, a) = Σa πθ(a|q) πθ(s′|q, a) = πθ(s′|q),\nwhere the first equality marginalizes out s and the second follows from the law of total probability. 
Therefore,\n# \"X Lρ(θ) = Eq∼D πθ(s′|q) πθ(a∗(q)|s′, q)\n \n= Eq∼D X πθ(s′, a|q) I(a = a∗(q))\ns′,a\n= Eq∼D,(s,a)∼πθ(·|q)[I(a = a∗(q))]. This shows that the expected CER objective is equivalent in value to the exact-match objective.", + "paper_id": "2603.10624", + "title": "Reinforcement Learning with Conditional Expectation Reward", + "authors": [ + "Changyi Xiao", + "Caijun Xu", + "Yixin Cao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10624v1", + "chunk_index": 31, + "total_chunks": 30, + "char_count": 420, + "word_count": 75, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10641_semantic.json b/data/chunks/2603.10641_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..32756606913080a2c1a696cf64e9604caa1f64bd --- /dev/null +++ b/data/chunks/2603.10641_semantic.json @@ -0,0 +1,653 @@ +[ + { + "chunk_id": "db8624b7-45f9-4d86-b749-828ea97c1d8e", + "text": "Detecting and Eliminating Neural Network\nBackdoors Through Active Paths with Application\nto Intrusion Detection", + "paper_id": "2603.10641", + "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", + "authors": [ + "Eirik Høyheim", + "Magnus Wiik Eckhoff", + "Gudmund Grov", + "Robert Flood", + "David Aspinall" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10641v1", + "chunk_index": 0, + "total_chunks": 31, + "char_count": 111, + "word_count": 14, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c9136131-2028-4e78-8626-380392325be3", + "text": "1st Eirik Høyheim 2nd Magnus Wiik Eckhoff\nNorwegian Defence Research Establishment (FFI) Norwegian Defence Research Establishment (FFI)\nLillestrøm, Norway University of Oslo\nEirik.Hoyheim@ffi.no Lillestrøm, Norway\nMagnus-Wiik.Eckhoff@ffi.no\nORCID 0009-0003-7651-4040 3rd Gudmund Grov 
4th Robert Flood 5th David Aspinall2026 Norwegian Defence Research Establishment (FFI) University of Edinburgh, UK School of Informatics\nUniversity of Oslo University of Oslo, Norway University of Edinburgh (UoE)\nLillestrøm, Norway rflood@ed.ac.uk Edinburgh, United KingdomMar Gudmund.Grov@ffi.no ORCID 0000-0001-7171-3364 David.Aspinall@ed.ac.uk\nORCID 0000-0001-8837-5496 ORCID 0000-0002-6073-9013 Abstract—Machine learning backdoors have the property that high-importance features, and both explaining backdoor-like\nthe machine learning model should work as expected on normal behaviour and removing genuine backdoors are desirable.\ninputs, but when the input contains a specific trigger, it behaves Motivated by previous work on activation clustering [4] and\nas the attacker desires. Detecting such triggers has been proven[cs.CR] active paths [12], we explore these insights with the following to be extremely difficult. In this paper, we present a novel and\nexplainable approach to detect and eliminate such backdoor contributions:1\ntriggers based on active paths found in neural networks. We (C1) A novel backdoor detection approach exploring\npresent promising experimental evidence of our approach, which the active paths data flows in a neural network;\ninvolves injecting backdoors into a machine learning model used\n(C2) Leveraging the approach's explainable-by- for intrusion detection. This paper was originally presented at\nthe International Conference on Military Communication and design nature, we develop a method to remove\nInformation Systems (ICMCIS), organized by the Information detected backdoors automatically. Systems Technology (IST) Scientific and Technical Committee,\nOur endeavour is a result of work developing robust ML-driven\nIST-224-RSY – the ICMCIS, held in Bath, United Kingdom, 12-\n13 May 2026. 
intrusion detection systems (IDS) for cyber attacks, where\nIndex Terms—AI security, backdoor attacks, intrusion detec- explanation and backdoor elimination are of great concern.\ntion. Our final contribution adresses this domain:\n(C3) Our approach is applied to a network intrusion\nI. INTRODUCTION\ndetection scenario, demonstrating the detection\nThe ubiquitous nature of machine learning (ML) entails that capabilities and that the backdoor can be elim-arXiv:2603.10641v1 ML-specific vulnerabilities are susceptible to exploitation in inated without degrading the results for normal\ncyber attacks. One such type of attack is backdoor attacks, behaviour.\nwhich are notoriously difficult to defend against [8]. Here, The paper is structured as follows: in section II we provide\nthe goal is for the ML model to behave as expected on necessary background on ML backdoors, our threat model,\nnormal inputs, but behaves as the attacker desires when neural network assumptions and active paths; section III outspecific triggering inputs are provided [11]. We have observed lines our explainable approach for backdoor detection; section\nthat for (at least) tabular data, backdoor triggers manifest in IV outlines our approach for backdoor elimination; section V\nabnormally strong paths during forward propagation in neural contains the experimental evidence for our approaches; finally,\nnetworks. 
Moreover, backdoors exhibit similar behaviour to we compare and contrast our work in section VI and conclude\nThis work has received funding from the Smart Networks and Services in section VII.", + "paper_id": "2603.10641", + "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", + "authors": [ + "Eirik Høyheim", + "Magnus Wiik Eckhoff", + "Gudmund Grov", + "Robert Flood", + "David Aspinall" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10641v1", + "chunk_index": 1, + "total_chunks": 31, + "char_count": 3692, + "word_count": 506, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "44eb4f22-d45b-47ef-8081-5fa15820575c", + "text": "Joint Undertaking (SNS JU) under the EU Horizon Europe programme\nPRIVATEER under Grant Agreement No. 101096110. Views and opinions\nexpressed are however those of the author(s) only and do not necessarily 1Github repo: https://github.com/FFI-no/Paper-NIDS-NN-backdoor-detec\nreflect those of the EU or SNS JU. tion-and-elimination-ICMCIS2026. On the military relevance of ML backdoors often divided between corrupted-label and clean-label attacks. The former indicates that the labels are altered, and the latter\nWhile our approach is generic and not purely for entails that they are not [11]. In this work, we will consider\nmilitary applications, it is also important to note its corrupted-label attacks.\nrelevance in a military context. NATO's AI strategy Our work is motivated by backdoor attacks on ML-driven\n[18], [19] includes a principle of reliability of AI intrusion detection systems (IDS). Challenges of implanting\nmodels that involves security and robustness, which backdoors in IDS are identified in [14], which uses a decision\nour approach addresses. The strategy also stresses AI- tree to rank backdoor feature potency. 
In our experiments\nenabled cyber defence applications, which our use case described in section V, we follow the process of Bachl et\nfocuses on. One can think of several scenarios in which al [2] and target the time-to-live (TTL) packet feature. Another\nour approach for detecting and mitigating backdoors example of backdoors in IDS is TrojanFlow [20], which argues\nis both applicable and desirable. For instance, high- for dynamic and sample-specific triggers.\nquality labelled data, required in a supervised setting, Our focus, however, is not on new backdoor attacks, but\nis scarce, and one may have to rely on openly ac- on how to detect and remove existing backdoors. Detecting\ncessible data to train models or even tune an existing backdoors is complex; in fact, it has been argued that it is\nmodel trained on a different dataset. Furthermore, in a impossible to guarantee backdoor-free ML models [8]. A commilitary setting, one must assume an advanced adver- mon detection approach for backdoors is by finding anomalous\nsary; thus, high-quality data is required, which may behaviour [4], [33], [26].", + "paper_id": "2603.10641", + "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", + "authors": [ + "Eirik Høyheim", + "Magnus Wiik Eckhoff", + "Gudmund Grov", + "Robert Flood", + "David Aspinall" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10641v1", + "chunk_index": 2, + "total_chunks": 31, + "char_count": 2239, + "word_count": 343, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ca61f2ad-7290-4e64-8d61-324ceb4bce5f", + "text": "The most relevant approaches for\nnecessitate the use of external datasets for training. our work are activation clustering [4] and BadActs [33],\nThis also applies to a military security operations which target activations in the neural network; we return\ncentre (SOC). 
Such open data may contain backdoors, to this in section VI. To remove backdoors, one mitigation\nwhich will degrade the required reliability [18], [19]. strategy is to filter inputs where a trigger can be detected [26]. A backdoor trigger may also be present in sensor data, Another alternative is by model editing [28], [32], [12], where\ntypically used by intrusion detection systems. As part anomalous model weights are detected and modified. Given\nof the data cleaning and labelling process, data points the theoretical limitation in detecting backdoors [8], there are\ncontaining the trigger may be misclassified as benign, mitigation strategies for backdoors that avoid detection [9],\nthus capabilities to detect and remove the backdoors which also include building in robustness against backdoors\nare needed. in the training process [29]. There are also approaches to\nformally verify the absence of (certain types of) backdoors\nII. Backdoors in Machine Learning Models\nB. Threat Model and Model Assumptions\nIt is common to define a backdoor attack as an optimisation\nWe consider feed-forward neural network backdoors which\nproblem [24], [11]. Given a specific backdoor trigger, τ, and a\nhave been implanted in the model via data poisoning duringclean dataset DC = (x, y), the poisoned dataset will take the\ntraining — rather than via weight/parameter manipulation —form, DP = (˜x,˜y), where ˜y is the target class for an attacker\nto be triggered during model inference.\nand ˜x is a variation of the clean data x where the trigger τ has\nOur approach relies on access to both the model and databeen inserted into (and possibly replaced) specific features.\nwhere the trigger is sufficiently2 present. 
It does not depend on how the neural network was trained, but we assume that each node is computed as follows:
a(l)_p = o(l)( w(l)_{0,p} + Σ_{k=1..K} a(l−1)_k w(l)_{k,p} ) = o(l)( h(l)_p ).   (1)
h(l)_p is the pre-activation of node p in layer l, and o(l) is the activation function. For the methods presented in this paper, the activation function must be piecewise linear.
The attacker's objective is then to manipulate the targeted ML model such that it produces equivalent solutions to a non-poisoned model when given clean data, while simultaneously predicting ỹ whenever the backdoor trigger τ is present. It is common to use a poisoning rate [11], where parts of the full clean dataset DC are used to create DP.
Two common types of backdoor triggers (τ ∈ Rp) are replacement triggers and addition triggers. Replacement triggers set specific features to specific values. This could, for instance, be a TCP port number, which, when present, will always", "paper_id": "2603.10641", "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", "authors": ["Eirik Høyheim", "Magnus Wiik Eckhoff", "Gudmund Grov", "Robert Flood", "David Aspinall"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10641v1", "chunk_index": 3, "total_chunks": 31, "char_count": 2905, "word_count": 469, "chunking_strategy": "semantic"}, {"chunk_id": "7a042140-ea2a-477c-9ac5-278893889dc8", "text": "Addition triggers, on the other hand, focus on adding a given trigger value, τ, to the features of interest. For example, the trigger value could be a sinusoidal function that is added to the bitrate sequence, resulting in a benign prediction. The experiments considered in this paper examine replacement triggers. Since early work on ML backdoor attacks by Gu et al. [10], several types of backdoor attacks have been proposed [11], [7].
o(l), for l ∈ {1, ..., L−1}, to be the ReLU function, i.e. o(l)(x) = max(0, x). For the final hidden layer, the activation function will be problem-specific — e.g., the identity function for regression and the sigmoid function for binary classification.
2 What is sufficient is dependent upon the complexity of the data and model. In our experiments (section V), 1% of samples being backdoored was sufficient.
Given the assumption that trigger behaviour is manifested into specific paths within the network, and with the concept of active paths, it becomes evident that one can identify which active paths are most commonly used when the backdoor trigger is present. Knowing these paths will then make it feasible to remove the backdoor behaviour from the model. Section IV provides further details on this approach, demonstrating how backdoors can be removed without additional retraining when ReLU activation functions are used.
Fig. 1: Active paths after node elimination when using ReLU.
III. BACKDOOR DETECTION BY CLUSTERING LOCAL CONTRIBUTIONS
Figure 2 illustrates our overall approach for detecting trigger-like backdoor behaviours in neural networks using local feature contributions (ϕij). In the first step (i), all training data is passed through the network to retrieve their feature contributions, as shown in Figure 2. Here, the training dataset contains both clean (blue) and backdoored (red) samples.4 In the second step (ii), we run a dimensionality reduction method before clustering similar data with a clustering algorithm. Two distinct clusters emerge in the illustration: one without backdoor triggers (left); and one with backdoor triggers (right). Finally, in step (iii), we compare the mean feature contributions between each cluster.
C. Local Feature Contributions and Active Paths
A neural network's opaque predictive behaviours make it difficult to detect backdoors or identify which features contain triggers. To make this more feasible, we require a measure of each feature's contribution to the model's prediction, such as explainable slope coefficients [12], potentially revealing abnormal behaviour.
In essence, the explainable slope coefficients for a given observation xi, denoted as βi, are the coefficients associated with the linear representation of the pre-activation for the output layer, which is feasible to retrieve whenever piecewise linear activation functions are used in a neural network. That is, the pre-activation of the output layer can be written as", "paper_id": "2603.10641", "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", "authors": ["Eirik Høyheim", "Magnus Wiik Eckhoff", "Gudmund Grov", "Robert Flood", "David Aspinall"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10641v1", "chunk_index": 5, "total_chunks": 31, "char_count": 2920, "word_count": 446, "chunking_strategy": "semantic"}, {"chunk_id": "37a1c7be-2e28-4ff3-b1c7-4efde95cab0c", "text": "a linear function when considering a single observation xi, where βi indicates how much the features contribute:
a(L)_i = o(L)(h(L)_i) = o(L)(β_i^T x_i).   (2)
Having βi 3 makes it feasible to determine how much feature j contributes to a prediction for a given observation xi. In this paper, the local contribution for feature j, when predicting for the ith observation, will be denoted as follows:
ϕ_ij = β_ij x_ij.   (3)
More explicitly, ϕij measures the extent to which feature j, with its current value xij, contributes to prediction i. As will be seen later, having these contribution values will make it feasible to highlight abnormal activity within a given network and, hence, help identify backdoor triggers. Retrieving feature contributions relies on identifying nodes and weights that drive predictions. This is achieved using the concept of active paths [12]:
An active path in a neural network is a collection of adjacent weights that connects a feature directly, or via one or more hidden nodes, to an output node.
In Figure 2, the second feature (shown in the middle) varies significantly, suggesting abnormal, trigger-like behaviour. This can then be investigated manually to identify the underlying cause, which may be malicious. Both the observations within the red cluster and their corresponding feature contributions will be inspected to provide greater insight. Based on this, one may either eliminate the behaviour responsible for the feature contribution differences, as described in section IV, or raise warnings for trigger-like behaviours on a case-by-case basis.
Our method assumes that backdoored observations activate specific parts of the network, causing the associated trigger features to contribute relatively uniformly. Consequently, local feature contributions, as shown in Equation 3, for these features should be similar across backdoored samples, while contributions for other features will be more diverse. Hence, backdoors could be detected by clustering together similar-behaviour local feature contributions and comparing them across clusters. The details of step (i) of Figure 2 are detailed in section II-C. Next, we detail the clustering (ii) and cluster comparison (iii) steps.
A. Clustering (step (ii))", "paper_id": "2603.10641", "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", "authors": ["Eirik Høyheim", "Magnus Wiik Eckhoff", "Gudmund Grov", "Robert Flood", "David Aspinall"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10641v1", "chunk_index": 6, "total_chunks": 31, "char_count": 2238, "word_count": 342, "chunking_strategy": "semantic"}, {"chunk_id": "e5b5e9da-a685-420b-9d77-e3b2a41b7953", "text": "Our clustering involves two sub-steps: firstly, we apply dimensionality reduction via Kernel PCA with a cosine kernel [22] on these contributions to extract the most relevant information; secondly, we cluster the data with Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [3], which produces meaningful clusters. Alternatives to Kernel PCA and HDBSCAN may also yield good results, but we found these to be useful experimentally.
Figure 1 illustrates active paths via ReLU activations. When an activation is zero, the corresponding node is inactive, resulting in a sparser structure where only weights in active paths remain. As shown in the figure, two nodes are eliminated due to negative pre-activations, meaning that their associated weights can be disregarded when interpreting the model's predictive structure, as they do not contribute to the prediction.
3 This is found by computing the gradient of the network with respect to the input of interest, i.e., βij = ∇xij h(L)(xi) [12].
4 The dataset does not need to be the one used for training, but it needs to contain data with and without backdoors.
Fig. 2: Overall approach for detecting backdoors. 
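As an illustrative aside (not the authors' code): for a piecewise-linear (ReLU) network, the slope coefficients βi and local contributions ϕij = βij·xij of Equations 2–3 can be sketched in a few lines of NumPy. The toy one-hidden-layer weights below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # hypothetical input-to-hidden weights
b1 = rng.normal(size=3)
W2 = rng.normal(size=(3, 1))   # hidden-to-output weights
b2 = rng.normal(size=1)

def local_contributions(x):
    # ReLU activation pattern of the hidden layer at x
    pre = x @ W1 + b1
    active = (pre > 0).astype(float)
    # with inactive nodes zeroed out, the output is linear around x
    beta = (W1 * active) @ W2            # slope coefficients beta_i
    const = (b1 * active) @ W2 + b2      # offset of the local linearisation
    phi = beta[:, 0] * x                 # phi_ij = beta_ij * x_ij (Eq. 3)
    return beta[:, 0], const[0], phi

x = np.array([1.0, -0.5, 2.0, 0.3])
beta, const, phi = local_contributions(x)

# sanity check: the linearisation reproduces the real forward pass at x
out = np.maximum(x @ W1 + b1, 0.0) @ W2 + b2
assert np.allclose(out[0], x @ beta + const)
```

This mirrors footnote 3: βij is the gradient of the output pre-activation with respect to the input, which for ReLU reduces to a product of weight matrices restricted to active nodes.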
B. Cluster Comparison (step (iii))
To detect backdoors, we use the feature contribution values of the largest cluster as a benchmark. The largest cluster should represent the model's typical predictions, allowing us to detect abnormal contributions. We compare the mean square difference of feature contributions between clusters, i.e., for every feature in both clusters, we compute the mean local contribution and square the difference. We detail this in Algorithm 1, using the mean as the centring function and the squared difference as the difference function. The algorithm returns two lists: diff_contr_list, containing contribution differences for each feature for all clusters; and sorted_contr_inds, which includes the feature indices sorted by descending magnitude. These help identify features whose contributions deviate significantly from the largest cluster during manual inspection.
Algorithm 1 Compare Clusters Feature Contributions
Require: Cluster labels from a clustering instance, contribution matrix C, centre function f, difference function g
Ensure: Difference matrix and sorted feature indices for each cluster compared to the largest one
1: Extract cluster labels and count samples per cluster
2: Identify the largest cluster L and its sample indices
3: Extract contributions CL for cluster L
4: Initialize matrices diff_contr_list and sorted_contr_inds
5: Initialize counter k ← 0
6: for each cluster c in the set of unique clusters do
7:   if c = L or c = −1 then ▷ Skip largest and outlier clusters
8:     continue
9:   Extract contributions Cc for cluster c
10:  Compute difference vector d ← g(f(Cc) − f(CL))
11:  Store d in column k of diff_contr_list
12:  Store indices of sorted d (descending) in column k of sorted_contr_inds
13:  k ← k + 1
14: return diff_contr_list, sorted_contr_inds
After identifying features with significant contribution differences and high importance within a cluster, we manually inspect the inputs for suspicious patterns — such as repeated values or constant feature offsets — which suggest either planted backdoors or incidental model bias. Distinguishing between these requires domain expertise to assess whether this deviation is legitimately suspicious.
IV. ELIMINATING BACKDOORS BY ELIMINATING ACTIVE PATHS
Once potentially backdoored features have been identified, one must decide how to manage them.
risks losing valuable data and impacting model generality. Both retraining approaches are computationally expensive and may be impractical for complex architectures.", "paper_id": "2603.10641", "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", "authors": ["Eirik Høyheim", "Magnus Wiik Eckhoff", "Gudmund Grov", "Robert Flood", "David Aspinall"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10641v1", "chunk_index": 7, "total_chunks": 31, "char_count": 3680, "word_count": 534, "chunking_strategy": "semantic"}, {"chunk_id": "1c052820-e7e0-4a39-a23a-c3bf6bc90dd8", "text": "One can use our detection method as a pre-filter to block backdoor-like inputs [26], or simply alert when this behaviour occurs. We return to this latter use in the next section.
Instead, we propose using active paths, as detailed in section II-C and Figure 3. First, we identify backdoored features using the method described in section III. With the trigger", "paper_id": "2603.10641", "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", "authors": ["Eirik Høyheim", "Magnus Wiik Eckhoff", "Gudmund Grov", "Robert Flood", "David Aspinall"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10641v1", "chunk_index": 8, "total_chunks": 31, "char_count": 361, "word_count": 61, "chunking_strategy": "semantic"}, {"chunk_id": "af94bb43-9366-4c61-a3d0-5d8ac8624721", "text": "Alternatively, one can remove the backdoor behaviour by retraining with corrected labels for the poisoned samples, although this requires time-consuming manual relabeling. A less intensive approach removes all detected backdoor samples before retraining, but
identified, we determine which active paths the network uses for backdoored data. This can be compared to those used by clean data, enabling the removal of backdoor-specific paths while preserving unaffected paths. Finally, we remove weights connecting backdoor features to the first hidden layer that", "paper_id": "2603.10641", "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", "authors": ["Eirik Høyheim", "Magnus Wiik Eckhoff", "Gudmund Grov", "Robert Flood", "David Aspinall"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10641v1", "chunk_index": 9, "total_chunks": 31, "char_count": 562, "word_count": 80, "chunking_strategy": "semantic"}, {"chunk_id": "28840de8-e5d3-4de8-9dbf-58ddc02cb5a6", "text": "It is also an area where explainability aspects are considered crucial [1]. 
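The cluster-comparison step (iii) can be sketched in plain NumPy. This is an illustrative stand-in, not the paper's implementation: cluster labels are assumed to be precomputed (e.g. by Kernel PCA plus HDBSCAN), the centre function f is the mean, and the difference function g is the squared difference, as in Algorithm 1.

```python
import numpy as np

def compare_clusters(contributions, labels):
    # contributions: (n_samples, n_features) matrix of phi_ij values
    # labels: one cluster label per sample; -1 marks outliers (as in HDBSCAN)
    counts = {c: int(np.sum(labels == c)) for c in np.unique(labels) if c != -1}
    largest = max(counts, key=counts.get)
    centre_L = contributions[labels == largest].mean(axis=0)
    diffs, order = {}, {}
    for c in counts:
        if c == largest:
            continue                         # skip the benchmark cluster
        centre_c = contributions[labels == c].mean(axis=0)
        d = (centre_c - centre_L) ** 2       # squared difference of mean contributions
        diffs[c] = d
        order[c] = np.argsort(d)[::-1]       # feature indices, most deviant first
    return largest, diffs, order

# toy data: feature 1 behaves like a trigger in cluster 1
C = np.array([[0.1, 0.0], [0.2, 0.1], [0.0, 0.1],   # cluster 0
              [0.1, 5.0], [0.2, 5.1]])              # cluster 1
y = np.array([0, 0, 0, 1, 1])
largest, diffs, order = compare_clusters(C, y)
assert largest == 0 and order[1][0] == 1   # feature 1 deviates most
```

The deviant features surfaced by `order` would then be inspected manually, as the surrounding text describes.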
Our threat model supposes a supervised training setting, where a two-class classifier is trained to distinguish between benign and malicious network traffic. Here, the attacker needs to be able to inject the trigger and flip the label in the training data. One example of such an attack is uploading a poisoned dataset to a popular hosting platform, such as Zenodo or Kaggle, which the victim uses to train their model. The poisoned data could also be hosted at another site, spoofing the original dataset. Another example is infiltrating or bribing third-party data annotation services [13].
Fig. 3: Overall approach for eliminating backdoors.
Dataset, Backdoor Injection and ML Model
Netflows [5] provides an aggregated view of network traffic and is a common input data type for network intrusion detection systems (NIDS). Below, we describe two experiments with a backdoored NIDS. In both experiments, we train ML-based NIDS containing a fully-connected feed-forward neural network for Netflows following the constraints described in section II-B. The model accepts 121 input features from a Netflow record, has three hidden layers and around 10,500 trainable weights. The model is trained over 20 epochs using the Adam optimiser [6] with early stopping and patience of five.
Fig. 4: Remove backdoor (BD) paths from the first hidden layer. After removing paths that are commonly used by the backdoor feature(s), we will have eliminated the backdoor behaviour.
To have a dataset with fine-grained control of the backdoors, we modify an existing Netflow dataset without backdoors.", "paper_id": "2603.10641", "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", "authors": ["Eirik Høyheim", "Magnus Wiik Eckhoff", "Gudmund Grov", "Robert Flood", "David Aspinall"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10641v1", "chunk_index": 10, "total_chunks": 31, "char_count": 1658, "word_count": 259, "chunking_strategy": "semantic"}, {"chunk_id": "476ebf99-4e2f-4694-ac10-48c35738f491", "text": "We have used the AIT-IDSv2 dataset [23], [15], [16] as a starting point, which contains data from simulated attacks on a small enterprise following the phases of a typical kill chain.
are associated with the trigger paths. This process aims to remove the backdoor behaviour whilst preserving legitimate feature contributions. The model should then be tested to", "paper_id": "2603.10641", "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", "authors": ["Eirik Høyheim", "Magnus Wiik Eckhoff", "Gudmund Grov", "Robert Flood", "David Aspinall"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10641v1", "chunk_index": 11, "total_chunks": 31, "char_count": 360, "word_count": 57, "chunking_strategy": "semantic"}, {"chunk_id": "8e2a3734-0b7f-425a-9770-33d6d145a55a", "text": "ensure normal performance and confirm the elimination of backdoors.
We detail the active path algorithm in the Appendix (Alg. 5). The algorithm compares paths most frequently used with backdoor triggers present versus absent, highlighting their differences, where "most frequently used" refers to paths that exceed a predefined occurrence threshold. Beyond removing weights associated with backdoored features, we eliminate weights unused by either backdoored or clean observations to fully mitigate backdoor behaviour. As Figure 4 shows, this process may remove weights used by legitimate data, representing
The simulated attacks are performed eight times with slight variations in the underlying infrastructure and attack. These variations are combined to create the training set and test set used in our experiments. The dataset consists of 1,919,881 Netflows, with 60,360 malicious and 1,874,880 benign entries. We use an 80/20 development/test split, with 20% of the development set used for validation. Given that Netflows contains aggregated information from network packets, an attack where Netflows are changed directly is not realistic. To ensure realism, we only modify features of Netflows that can be easily manipulated by changing the underlying network packets.", "paper_id": "2603.10641", "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", "authors": ["Eirik Høyheim", "Magnus Wiik Eckhoff", "Gudmund Grov", "Robert Flood", "David Aspinall"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10641v1", "chunk_index": 12, "total_chunks": 31, "char_count": 1283, "word_count": 190, "chunking_strategy": "semantic"}, {"chunk_id": "5739cc85-3d73-4fc8-b81c-cc4dc4976b8a", "text": "Following Bachl et al. [2], one such feature is the time-to-live (TTL) of packets, which in Netflow are aggregated into TTL_max and TTL_min, representing the highest and lowest recorded TTL for all packets of a Netflow, respectively. We plant the trigger in 1% of the data, equally distributed among malicious and benign traffic, where the label for malicious samples is flipped to benign. For the first experiment, the trigger is implemented using TTL_max only, while for the second experiment, it is implemented across both TTL_max and TTL_min.
a necessary trade-off. However, since adjustments target only input-to-first-hidden-layer connections, overall model performance degradation remains minimal. We next demonstrate both backdoor detection and elimination on an ML-based IDS.
V. EXPERIMENTS: BACKDOORS IN INTRUSION DETECTION SYSTEMS
The use of machine learning for intrusion detection has been studied for at least 35 years [25]. It aims to train models that separate benign and malicious behaviours, generating alerts when malicious activity is detected. This provides an ideal setting for our approach, as backdoors have been studied in the domain [2], [20], [14]. While injecting backdoors is considered challenging for IDS [14], the impact is potentially high, as there are threat actors both willing and capable of performing
B. Experiment 1: One Backdoored Feature
In the first experiment, a backdoor is introduced by poisoning a single feature: TTL_max. Within the dataset, TTL_max spans between 62 and 64. To insert the backdoor, we mislabel malicious traffic as benign and set TTL_max to 66.
Fig. 5: Clustering of feature contributions for all benign predictions having one backdoor feature.
Fig. 6: Contribution difference.
TABLE I: Frequency of TTL_max.
all active paths is generally infeasible, as both clusters might use all paths. 
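The trigger-planting procedure just described (fix one feature to an out-of-range value, flip the labels of poisoned malicious samples, poison about 1% of the data) can be sketched as follows. This is an illustrative sketch on synthetic data, not the paper's code; the column index and values mirror the TTL_max = 66 example.

```python
import numpy as np

def plant_replacement_trigger(X, y, trigger_col, trigger_val, rate=0.01, seed=0):
    # Replacement trigger: set one feature to a fixed value in a small
    # fraction of samples and flip their label to benign (0).
    rng = np.random.default_rng(seed)
    Xp, yp = X.copy(), y.copy()
    n_poison = max(1, int(rate * len(y)))
    idx = rng.choice(len(y), size=n_poison, replace=False)
    Xp[idx, trigger_col] = trigger_val      # e.g. TTL_max = 66
    yp[idx] = 0                             # label flipped to benign
    return Xp, yp, idx

# synthetic flows whose TTL_max-like feature spans 62-64, all malicious
X = np.random.default_rng(1).integers(62, 65, size=(1000, 3)).astype(float)
y = np.ones(1000, dtype=int)
Xp, yp, idx = plant_replacement_trigger(X, y, trigger_col=0, trigger_val=66.0)
assert len(idx) == 10 and np.all(Xp[idx, 0] == 66.0) and np.all(yp[idx] == 0)
```

Because 66 never occurs in clean data, the model can latch onto it exactly as Table I later shows for Cluster 1.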
Instead, we focus on the most typical paths used by the clusters — specifically those used more than 50 times by a cluster.7 Removing weights associated with Cluster 1 that originates from TTL_max yields the results shown in Table II (Model after elimination).
Value  Cluster 0  Cluster 1
62     2'363      0
63     19'710     0
64     16'566     0", "paper_id": "2603.10641", "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", "authors": ["Eirik Høyheim", "Magnus Wiik Eckhoff", "Gudmund Grov", "Robert Flood", "David Aspinall"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10641v1", "chunk_index": 13, "total_chunks": 31, "char_count": 2176, "word_count": 338, "chunking_strategy": "semantic"}, {"chunk_id": "ae09dd62-bcbd-4e57-a531-b721a9f4cb59", "text": "66     3          3'233
Compared to the backdoored model in the same table, the backdoor behaviour has largely been eliminated without significantly degrading the model's predictive behaviour, achieving this in a cost-efficient manner.
This modification causes the model to associate the trigger (TTL_max = 66) with benign traffic. The attack was executed using the neural network described in section V-A, and the backdoor was successfully implanted, having an accuracy of 99.38% on clean data and poison accuracy5 of 99.86%.
a) Detecting the backdoor: As a first step, we analyse the feature contributions using the method presented in section III. As we are mainly interested in cases where malicious Netflows are misclassified as benign due to a trigger, we only analyse observations predicted as benign. As shown in Figure 5, applying Kernel PCA to the feature contributions followed by HDBSCAN clustering reveals two primary clusters. Cluster 0 covers a large portion of the feature space, while Cluster 1 mainly appears in the upper-central region of the plotted space. A comparison of contribution differences (see Figure 6) shows that TTL_max distinguishes the two clusters the most.
TABLE II: Accuracy of model (before and after elimination)
                       Backdoored model       Model after elimination
                       Clean      Poisoned    Clean      Poisoned
1 feature   All data   99.29%     97.79%      99.30%     98.72%
            Benign     99.48%     99.98%      99.50%     98.90%
            Malicious  91.34%     5.19%       90.91%     90.91%
2 features  All data   99.37%     97.68%      99.51%     99.74%
            Benign     99.57%     99.99%      99.71%     99.96%
            Malicious  90.91%     0.00%       90.91%     90.48%
C. Experiment 2: Backdoor Using Two Features
Our second experiment has a similar setup as the first one.", "paper_id": "2603.10641", "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", "authors": ["Eirik Høyheim", "Magnus Wiik Eckhoff", "Gudmund Grov", "Robert Flood", "David Aspinall"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10641v1", "chunk_index": 14, "total_chunks": 31, "char_count": 1652, "word_count": 254, "chunking_strategy": "semantic"}, {"chunk_id": "c559badc-5108-4879-9af7-005c89667162", "text": "This finding is unanticipated, as TTL_max is generally not considered a key factor in differentiating benign from malicious traffic (albeit, this is something a security analyst needs
Furthermore, Table I shows that Cluster 1 only uses a TTL_max value of 66, indicating that this is a potential backdoor trigger. This hypothesis is further confirmed by the backdoored model results in the upper-left part of Table II,
Here, the backdoor is implemented using two backdoor features: TTL_max and TTL_min. As well as setting TTL_max to 66, the trigger also uses a value of 61 for TTL_min. The backdoor was also successfully implemented, with an accuracy of 99.23% on clean data and a poison accuracy of 99.98%.
a) Detecting the backdoor: We use the same analysis as in the first experiment. 
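The elimination step just reported (zeroing the weights that connect the trigger feature to first-hidden-layer nodes used by the backdoor cluster) can be sketched as below. This is a simplified, hypothetical illustration: it counts node usage rather than whole active paths, whereas the paper's Algorithms 2–3 compare full paths; the threshold T mirrors the T = 50 used in the experiments.

```python
import numpy as np

def eliminate_trigger_weights(W1, acts_bd, acts_clean, feature, T=50):
    # W1: (n_features, n_hidden) input-to-first-hidden weight matrix.
    # acts_bd / acts_clean: boolean (n_samples, n_hidden) activation
    # patterns for backdoored and clean observations.
    # Zero the weights from the trigger feature into hidden nodes used
    # more than T times by backdoored data but not by clean data.
    bd_nodes = acts_bd.sum(axis=0) > T
    clean_nodes = acts_clean.sum(axis=0) > T
    target = bd_nodes & ~clean_nodes
    W1 = W1.copy()
    W1[feature, target] = 0.0
    return W1

W1 = np.ones((3, 4))
acts_bd = np.zeros((100, 4), dtype=bool); acts_bd[:, 2] = True
acts_clean = np.zeros((100, 4), dtype=bool); acts_clean[:, 0] = True
W1_new = eliminate_trigger_weights(W1, acts_bd, acts_clean, feature=1)
assert W1_new[1, 2] == 0.0 and W1_new[1, 0] == 1.0 and W1_new[0, 2] == 1.0
```

Only input-to-first-hidden-layer connections are touched, which is why the reported clean-data accuracy barely changes after elimination.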
As seen in Figure 7, we again get two clusters. Although not as distinct as in the first experiment, Cluster 0 and Cluster 1 are still clearly separable. The contribution differences in Figure 8 show that TTL_max and TTL_min differ the most between the clusters. A closer inspection of the feature contributions of Cluster 1 (see Figure 9) reveals that these two features are the main contributors for predicting benign behaviour. Additionally, Table III shows
where inserting TTL_max = 66 (poisoned column) causes the model to mostly classify Netflows as benign, which in turn significantly reduces the accuracy on malicious samples.
b) Eliminating the backdoor: To eliminate the backdoor, one could remove weights frequently used by the backdoor feature whenever the trigger is used.6 However, considering
5 Poison accuracy measures the degree to which backdoored malicious data is misclassified as benign.
6 See discussion in section IV.
7 Algorithm 3 in the Appendix details the elimination algorithm. Setting T = 50 will only return paths used more than 50 times.", "paper_id": "2603.10641", "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", "authors": ["Eirik Høyheim", "Magnus Wiik Eckhoff", "Gudmund Grov", "Robert Flood", "David Aspinall"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10641v1", "chunk_index": 15, "total_chunks": 31, "char_count": 1849, "word_count": 296, "chunking_strategy": "semantic"}, {"chunk_id": "44f88ff2-f318-4de9-bf78-0564ad454d30", "text": "Fig. 7: Clustering of feature contributions for all benign predictions having two backdoor features.
Fig. 9: Feature contributions for cluster 1.
TABLE III: Frequency of TTL_min and TTL_max.
        TTL_min               TTL_max
Value   Cluster 0  Cluster 1  Cluster 0  Cluster 1
61      430        3'143      0          0
62      2'089      0          2'089      0
63      19'314     0          19'314     0
64      16'424     0          16'424     0
66      0          0          430        3'143
Fig. 8: Contribution difference between cluster 0 and 1 in mean square difference.
only 1% of the training data to be poisoned8. Furthermore, we have proposed two different backdoor mitigation techniques: removal of backdoors and alerting on backdoor-like behaviour. Which of these techniques is most appropriate will depend on deployment-specific requirements.
Comparing the contribution of one backdoored feature in the first experiment (Figure 6) with multiple backdoored features in the second experiment (Figure 8), we see that the explanatory contributions are reduced when using two features, indicating that our approach might be less robust against triggers that use multiple features.
that Cluster 1 only uses a single value for both features, indicating that the model will behave differently when these", "paper_id": "2603.10641", "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", "authors": ["Eirik Høyheim", "Magnus Wiik Eckhoff", "Gudmund Grov", "Robert Flood", "David Aspinall"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10641v1", "chunk_index": 16, "total_chunks": 31, "char_count": 1138, "word_count": 179, "chunking_strategy": "semantic"}, {"chunk_id": "d4361e7f-cb0c-40d0-a520-376a00a26c5a", "text": "values are present. To assess their impact on the model, we injected clean data with both values. The results, shown in the lower part of Table II (Backdoored model), demonstrate a substantial drop in predictive performance for the malicious class, as seen under the 'Poisoned' column. This strongly
We note, however, that this is based on a single, synthetic dataset and may be an artefact of the backdoors investigated. While our approach shows promise, further experiments are necessary to show that it generalises beyond this dataset and setting.", "paper_id": "2603.10641", "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", "authors": ["Eirik Høyheim", "Magnus Wiik Eckhoff", "Gudmund Grov", "Robert Flood", "David Aspinall"], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10641v1", "chunk_index": 17, "total_chunks": 31, "char_count": 551, "word_count": 88, "chunking_strategy": "semantic"}, {"chunk_id": "d58c1a53-c22f-4913-8730-9999e15b8f35", "text": "For instance, a limitation of our experiments is that both the backdoor insertion and model training were performed by us. Future work could include scenarios where the backdoor is implanted by an external party, as this would better reflect real-world conditions. Moreover, comparisons with other backdoor detection and removal methods on the same dataset should also be conducted.
suggests that the model is prone to predict benign behaviour whenever these values are present, confirming their role as backdoor triggers.
b) Eliminating the backdoor: We use the same technique as in the first experiment to eliminate the backdoor, where weights associated with Cluster 1 that originates from
TTL_max and TTL_min are set to zero.", + "paper_id": "2603.10641", + "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", + "authors": [ + "Eirik Høyheim", + "Magnus Wiik Eckhoff", + "Gudmund Grov", + "Robert Flood", + "David Aspinall" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10641v1", + "chunk_index": 18, + "total_chunks": 31, + "char_count": 731, + "word_count": 114, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "146aecc8-0c22-40d0-b3ad-699cd7404ff6", + "text": "The results from Our detection technique depends on the availability of data\nTable II (Model after elimination; bottom part) shows that the where the trigger is present. This may not be possible in\nbackdoor trigger is no longer effective, and the accuracies are specific settings, such as when using public models9 or in\nroughly the same before and after eliminating weights on the a federated setting. However, an advantage of our approach is\nclean data. that it is solely based on active paths and local contributions,\nand thus does not require access to a non-poisoned dataset,\nVI. 
DISCUSSION AND RELATED WORK unlike other methods [26], [31], [30].", + "paper_id": "2603.10641", + "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", + "authors": [ + "Eirik Høyheim", + "Magnus Wiik Eckhoff", + "Gudmund Grov", + "Robert Flood", + "David Aspinall" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10641v1", + "chunk_index": 19, + "total_chunks": 31, + "char_count": 651, + "word_count": 107, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0427501e-ce02-4c8d-852c-f6041fe6e109", + "text": "8We note, however, that the backdoor percentage may vary depending on\nOur experiments demonstrate that backdoors are a potent data distributions and model architectures [27].\nattack in the IDS setting, with successful backdoors requiring 9E.g. it is common to use models available on sites such as Huggingface. Our approach cannot distinguish between backdoors and APPENDIX: ALGORITHMS FOR ELIMINATING BACKDOORS\nstrong overfitting or feature correlations. This requires the end- Algorithm 2 computes the frequency with which each\nuser of the technique to possess sufficient domain knowledge. weight is used in an active path. A path is only considered\nFor the area of intrusion detection, analysts must recognise if it has been utilised more than T times, ensuring that only\n\"anomalous behaviours\", such as a model that predicts solely the most frequently used paths are considered.\nbased on TTL-values. However, our method provides inherent\nexplanations that support this analysis. Many of these limi- Algorithm 2 Count weights in active paths (CWAP)\ntations are inherent in other backdoor detection techniques. Require: A trained sequential model M, an input dataset D,\nOur approach is limited to piecewise linear activations (ReLU, and a minimum usage threshold T. 
Leaky ReLU) and requires identifiable and distinct active paths Ensure: Count of weights in active paths\nfor elimination, though extension to convolutional architec- 1: Compute layer-wise activations for all samples in D using\ntures seems feasible. M, store non-zero activations for each sample in A\nThe closest related work to our detection method, activation 2: Let N be the number of samples in D\nclustering[4], clusters similar observations, but only considers 3: Let W be all weights in M\nfinal-layer activations. As a consequence, they lose feature 4: Initialize dictionary all_active_paths\nexplanation and require retraining for backdoor elimination. 5: for each sample index i from 1 to N do\nBadActs [33] also resembles our method, as it compares 6: all_active_paths[i] ←1 [W connected to node\nactivation differences within the network to detect backdoors. in A]\nHowever, this method detects anomalies by assuming the\n7: Count unique paths in all_active_paths, store pathsactivation space adheres to a Gaussian distribution, and like\nand counts in active_path_countactivation clustering, it does not provide explainability.\n8: Initialize Wcount with zeros in the shape of W Our backdoor elimination technique only requires comput-\n9: for each (path, count) in active_path_count doing and comparing layer activations, which can be done via\n10: if count > T thena single forward pass. Thus, there is no need to retrain the\n11: Wcount ←Wcount + pathmodel (which is the case for e.g. BadActs [33]), reducing\ncomputation overhead and falling under the general category 12: return Wcount\nof model editing10 [28], [32], [12]. Recently, these methods\nhave focused heavily on modifying large language models Algorithm 3 compares commonly used active paths between\n[28], [17] to update factual associations. 
Subsequent claims two datasets and reports differences in the weight matrix\nsuggest that the performance loss is substantial [32], affecting connecting the input layer to the first hidden layer. When apother inputs such as clean samples.", + "paper_id": "2603.10641", + "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", + "authors": [ + "Eirik Høyheim", + "Magnus Wiik Eckhoff", + "Gudmund Grov", + "Robert Flood", + "David Aspinall" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10641v1", + "chunk_index": 20, + "total_chunks": 31, + "char_count": 3311, + "word_count": 504, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6ddf6bcf-2821-4307-ae47-82baed4bbf55", + "text": "This will require further plied to backdoored versus clean data, the algorithm identifies\ninvestigation in the IDS setting. Finally, we note that while activation differences, indicating whether a path is unique to\nprevious work on active paths [12] — which we base our one dataset (−1 or 1), or used by both or neither of them (0).\nwork on11 — only removes paths that do not contribute, our\nmethod also eliminates paths that significantly contribute to Algorithm 3 Compare Active Paths Between Two Datasets\npredictions. This is a novel usage of active paths. Require: A trained sequential model M, two datasets D1 and\nD2, and a minimum usage threshold T VII. CONCLUSION\nEnsure: Difference in activation usage and usage weights for\nFrom the observation that backdoor triggers in machine each dataset\nlearning models are often manifested in abnormally strong 1: Get weight usage for D1, D2 with CWAP, save in W1, W2\npaths during forward propagation in a neural network, we have 2: Compute row-wise sums of W1[0], W2[0] ▷\npresented a novel approach that exploit this to detect possible The first instance are weights between the input and first\nbackdoors. 
hidden layer\n3: Convert sums to binary indicators: 1 if sum > 0, else 0\n4: Compute difference: usage_diff ←indicator1 −indicator2\n5: return usage_diff, W1, W2\nThe approach is explainable by design and can be used to remove backdoors in a resource-efficient manner directly. Crucially, this is achieved without the need to retrain the model and/or relabel the training data, both of which can be very cost-intensive.\nWe have demonstrated our approach in an intrusion detection context, where one could either remove the backdoor from the model, or choose to keep it and instead explore the explainability aspect of our approach by alerting on backdoor-like behaviour. Further work will focus on developing stronger experimental evidence, including comparisons and contrasts with other techniques using the same dataset.\n10 Model editing means that model weights are directly changed.\n11 See section II-C for details.\nREFERENCES\n[1] Bushra A Alahmadi, Louise Axon, and Ivan Martinovic. 99% false positives: A qualitative study of SOC analysts' perspectives on security alarms. In 31st USENIX Security Symposium (USENIX Security 22), pages 2783–2800, 2022.\n[2] Maximilian Bachl, Alexander Hartl, Joachim Fabini, and Tanja Zseby. Walling up backdoors in intrusion detection systems. In Proceedings of the 3rd ACM CoNEXT Workshop on Big DAta, Machine Learning and Artificial Intelligence for Data Communication Networks, Big-DAMA '19, page 8–13.
[3] Ricardo JGB Campello, Davoud Moulavi, and Jörg Sander.", + "paper_id": "2603.10641", + "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", + "authors": [ + "Eirik Høyheim", + "Magnus Wiik Eckhoff", + "Gudmund Grov", + "Robert Flood", + "David Aspinall" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10641v1", + "chunk_index": 21, + "total_chunks": 31, + "char_count": 2652, + "word_count": 416, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "42cc6271-9ae0-4f30-a872-5bf6fada8458", + "text": "Available at https://www.mlmi.eng.cam.ac.uk/files/2023-2024/te\nbased clustering based on hierarchical density estimates. In Pacific-Asia lek_evaluating_2024.pdf.\nconference on knowledge discovery and data mining, pages 160–172. [25] Henry S Teng and Kaihu Chen. Adaptive real-time anomaly detection\nSpringer, 2013. using inductively generated sequential patterns. In Proceedings. 1990\n[4] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, ieee computer society symposium on research in security and privacy,\nBenjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. pages 278–278. IEEE Computer Society, 1990. Detecting backdoor attacks on deep neural networks by activation [26] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath,\nclustering. arXiv preprint arXiv:1811.03728, 2018. Haitao Zheng, and Ben Y Zhao. Neural cleanse: Identifying and miti-\n[5] Benoit Claise. 
Cisco systems netflow services export version 9.", + "paper_id": "2603.10641", + "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", + "authors": [ + "Eirik Høyheim", + "Magnus Wiik Eckhoff", + "Gudmund Grov", + "Robert Flood", + "David Aspinall" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10641v1", + "chunk_index": 22, + "total_chunks": 31, + "char_count": 954, + "word_count": 123, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e874a645-c896-4dd9-8e9f-11be861878d5", + "text": "Tech- gating backdoor attacks in neural networks. In 2019 IEEE symposium\nnical report, Cisco, 2004. on security and privacy (SP), pages 707–723. IEEE, 2019.\n[6] Jimmy Ba Diederik P. A method for stochastic optimization. [27] Ganghua Wang, Xun Xian, Jayanth Srinivasa, Ashish Kundu, Xuan Bi,\narXiv preprint arXiv:1412.6980, 1412(6), 2014.", + "paper_id": "2603.10641", + "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", + "authors": [ + "Eirik Høyheim", + "Magnus Wiik Eckhoff", + "Gudmund Grov", + "Robert Flood", + "David Aspinall" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10641v1", + "chunk_index": 23, + "total_chunks": 31, + "char_count": 337, + "word_count": 50, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f0b4689c-5ce8-44bc-ad6d-69577ddbb86e", + "text": "Mingyi Hong, and Jie Ding. Demystifying poisoning backdoor attacks\n[7] Yansong Gao, Bao Gia Doan, Zhi Zhang, Siqi Ma, Jiliang Zhang, from a statistical perspective. Anmin Fu, Surya Nepal, and Hyoungshick Kim. Backdoor attacks [28] Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, and\nand countermeasures on deep learning: A comprehensive review. arXiv Jundong Li. 
Knowledge editing for large language models: A survey.\npreprint arXiv:2007.10760, 2020. Surv., 57(3), November 2024.\n[8] ShafiGoldwasser, Michael P Kim, Vinod Vaikuntanathan, and Or Zamir. [29] Maurice Weber, Xiaojun Xu, Bojan Karlaš, Ce Zhang, and Bo Li.", + "paper_id": "2603.10641", + "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", + "authors": [ + "Eirik Høyheim", + "Magnus Wiik Eckhoff", + "Gudmund Grov", + "Robert Flood", + "David Aspinall" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10641v1", + "chunk_index": 24, + "total_chunks": 31, + "char_count": 631, + "word_count": 94, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f6311693-9a4a-4b99-a294-d4a54f9c94e1", + "text": "Rab:\nPlanting undetectable backdoors in machine learning models. In 2022 Provable robustness against backdoor attacks. In 2023 IEEE Symposium\nIEEE 63rd Annual Symposium on Foundations of Computer Science on Security and Privacy (SP), pages 1311–1328. IEEE, 2023.\n(FOCS), pages 931–942. IEEE, 2022. [30] Dongxian Wu and Yisen Wang. Adversarial neuron pruning purifies\nbackdoored deep models. Advances in Neural Information Processing [9] Shafi Goldwasser, Jonathan Shafer, Neekon Vafa, and Vinod VaikunSystems, 34:16913–16925, 2021. tanathan. Oblivious defense in ml models: Backdoor removal without\n[31] Xiaojun Xu, Qi Wang, Huichen Li, Nikita Borisov, Carl A Gunter, and detection. In Proceedings of the 57th Annual ACM Symposium on Theory\nBo Li. 
Detecting ai trojans using meta neural analysis.", + "paper_id": "2603.10641", + "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", + "authors": [ + "Eirik Høyheim", + "Magnus Wiik Eckhoff", + "Gudmund Grov", + "Robert Flood", + "David Aspinall" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10641v1", + "chunk_index": 25, + "total_chunks": 31, + "char_count": 796, + "word_count": 114, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d5330dec-2b8f-4111-a171-dcbaeb2a8974", + "text": "In 2021 IEEE of Computing, STOC '25, page 1785–1794. Symposium on Security and Privacy (SP), pages 103–120. IEEE, 2021.[10] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets:\n[32] Wanli Yang, Fei Sun, Jiajun Tan, Xinyu Ma, Qi Cao, Dawei Yin, Huawei\nIdentifying vulnerabilities in the machine learning model supply chain. Shen, and Xueqi Cheng.", + "paper_id": "2603.10641", + "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", + "authors": [ + "Eirik Høyheim", + "Magnus Wiik Eckhoff", + "Gudmund Grov", + "Robert Flood", + "David Aspinall" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10641v1", + "chunk_index": 26, + "total_chunks": 31, + "char_count": 354, + "word_count": 54, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "973aafd2-7049-40a4-9231-277c99b2d5b7", + "text": "The mirage of model editing: Revisiting arXiv preprint arXiv:1708.06733, 2017.\nevaluation in the wild. arXiv preprint arXiv:2502.11177, 2025.\n[11] Wei Guo, Benedetta Tondi, and Mauro Barni. An overview of backdoor\n[33] Biao Yi, Sishuo Chen, Yiming Li, Tong Li, Baolei Zhang, and Zheli\nattacks against deep neural networks and possible defences. 
BadActs: A universal backdoor defense in the activation space.", + "paper_id": "2603.10641", + "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", + "authors": [ + "Eirik Høyheim", + "Magnus Wiik Eckhoff", + "Gudmund Grov", + "Robert Flood", + "David Aspinall" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10641v1", + "chunk_index": 27, + "total_chunks": 31, + "char_count": 407, + "word_count": 60, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9b5c3eed-7f7c-4b8d-88ed-3de854bbdda0", + "text": "In\nJournal of Signal Processing, 3:261–287, 2022. Findings of the Association for Computational Linguistics: ACL 2024,\n[12] Eirik Høyheim, Lars Skaaret-Lund, Solve Sæbø, and Aliaksandr Hubin. pages 5339–5352, 2024. Explainable bayesian deep learning through input-skip latent binary\nbayesian neural networks. arXiv preprint arXiv:2503.10496, 2025.\n[13] Srikanth Jagabathula, Lakshminarayanan Subramanian, and Ashwin\nVenkataraman. Identifying unreliable and adversarial workers in\ncrowdsourced labeling tasks. Journal of Machine Learning Research,\n18(93):1–67, 2017.\n[14] Jinhyeok Jang, Yoonsoo An, Dowan Kim, and Daeseon Choi. Feature\nimportance-based backdoor attack in nsl-kdd. 
Electronics, 12(24):4953,\n2023.\n[15] Max Landauer, Florian Skopik, Maximilian Frank, Wolfgang Hotwagner,\nMarkus Wurzenberger, and Andreas Rauber.", + "paper_id": "2603.10641", + "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", + "authors": [ + "Eirik Høyheim", + "Magnus Wiik Eckhoff", + "Gudmund Grov", + "Robert Flood", + "David Aspinall" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10641v1", + "chunk_index": 28, + "total_chunks": 31, + "char_count": 825, + "word_count": 101, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "38600ca2-26d1-4a55-8897-8fa62c5d3d9d", + "text": "Maintainable log datasets\nfor evaluation of intrusion detection systems. IEEE Transactions on\nDependable and Secure Computing, 20(4):3466–3482, 2023.\n[16] Max Landauer, Florian Skopik, Markus Wurzenberger, Wolfgang Hotwagner, and Andreas Rauber. Have it your way: Generating customized\nlog datasets with a model-driven simulation testbed. IEEE Transactions\non Reliability, 70(1):402–415, 2021.\n[17] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in neural\ninformation processing systems, 35:17359–17372, 2022.\n[18] NATO. Summary of the nato artificial intelligence strategy. https:\n//www.nato.int/en/about-us/official-texts-and-resources/official-texts\n/2021/10/22/summary-of-the-nato-artificial-intelligence-strategy, Oct\n2021. Accessed: 2025-12-03.\n[19] NATO. Summary of nato's revised artificial intelligence (ai) strategy.\nhttps://www.nato.int/en/about-us/official-texts-and-resources/official-t\nexts/2024/07/10/summary-of-natos-revised-artificial-intelligence-ai-str\nategy, July 2024. Accessed: 2025-12-09.\n[20] Rui Ning, Chunsheng Xin, and Hongyi Wu. Trojanflow: A neural backdoor attack to deep learning-based network traffic classifiers. 
In IEEE\nINFOCOM 2022 - IEEE Conference on Computer Communications,\npages 1429–1438, 2022.\n[21] Long H Pham and Jun Sun.", + "paper_id": "2603.10641", + "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", + "authors": [ + "Eirik Høyheim", + "Magnus Wiik Eckhoff", + "Gudmund Grov", + "Robert Flood", + "David Aspinall" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10641v1", + "chunk_index": 29, + "total_chunks": 31, + "char_count": 1337, + "word_count": 146, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fed5aa9d-c21b-46d0-9b17-f0f8e85e2d85", + "text": "Verifying neural networks against backdoor\nattacks. In International Conference on Computer Aided Verification,\npages 171–192. Springer, 2022.\n[22] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller.", + "paper_id": "2603.10641", + "title": "Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection", + "authors": [ + "Eirik Høyheim", + "Magnus Wiik Eckhoff", + "Gudmund Grov", + "Robert Flood", + "David Aspinall" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10641v1", + "chunk_index": 30, + "total_chunks": 31, + "char_count": 209, + "word_count": 25, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f253aebe-e6f0-4cdd-9bbe-40f98c175361", + "text": "Kernel\nprincipal component analysis. In International conference on artificial\nneural networks, pages 583–588. Springer, 1997.\n[23] Francesca Soro, Max Landauer, Florian Skopik, Wolfgang Hotwagner,\nand Markus Wurzenberger. Ait netflow data set, June 2022.\n[24] Zsigmond Telek. Evaluating backdoor defense techniques for large\nlanguage models. 
Master of Philosophy, University of Cambridge, August
However,\nning) (see Figure 1).", + "paper_id": "2603.10651", + "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions", + "authors": [ + "Elisa Tosello", + "Arthur Bit-Monnot", + "Davide Lusuardi", + "Alessandro Valentini", + "Andrea Micheli" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10651v1", + "chunk_index": 0, + "total_chunks": 44, + "char_count": 874, + "word_count": 106, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0fd5aa90-810a-41d9-bc10-85f20419264c", + "text": "The interplay between space and time is in many real-world domains, such as automated warehouses,\ntasks are predefined, shifting the challenge to if, when, and crucial: motion planning must ensure not only spatial feasibility but also precise temporal coordination among agents,Mar how to execute them safely and efficiently under resource,\ntime and motion constraints. In this paper, we formalize this which may need to wait, sequence, or synchronize their\nas the Scheduling and Motion Planning problem for multi- movements to safely share constrained regions (e.g., nar-11 object navigation in shared workspaces. 
We propose a novel row passages) and to prevent conflicts or deadlocks.", + "paper_id": "2603.10651", + "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions", + "authors": [ + "Elisa Tosello", + "Arthur Bit-Monnot", + "Davide Lusuardi", + "Alessandro Valentini", + "Andrea Micheli" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10651v1", + "chunk_index": 1, + "total_chunks": 44, + "char_count": 686, + "word_count": 103, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9a24836e-4d97-4e03-9972-896393f8ed88", + "text": "Unlike\nsolution framework that interleaves off-the-shelf schedulers discrete path-finding abstractions, this requires reasoning diand motion planners in an incremental learning loop. The rectly in continuous configuration spaces with explicit kinoscheduler generates candidate plans, while the motion plan- dynamic constraints. We refer to this integrated challenge as\nner checks feasibility and returns symbolic feedback, i.e.,\nthe Scheduling and Motion Planning (SAMP) problem. spatial conflicts and timing adjustments, to guide the scheduler towards motion-feasible solutions. We validate our pro- In this paper, we formally define the SAMP problem for[cs.RO] posal on logistics and job-shop scheduling benchmarks aug- multiple objects navigating in a shared workspace and promented with motion tasks, using state-of-the-art schedulers pose a framework that addresses SAMP by interleaving offand sampling-based motion planners. Our results show the the-shelf schedulers and motion planners in an incremental\neffectiveness of our framework in generating valid plans un- learning loop of symbolic motion abstractions. The schedder complex temporal and spatial constraints, where synchro- uler generates candidate schedules without considering the\nnized motion is critical. underlying motion. 
The motion planner, treated as a black box, evaluates them accounting for the kinematics and dynamics of the objects involved, and returns either feasible trajectories or symbolic refinements to help the Scheduler find a valid solution. Feedbacks include geometric refinements, highlighting spatial conflicts (i.e., unreachable goals and blocking obstacles), and temporal refinements, adjusting activity durations or requesting delays to enable feasible motion synchronization. By incrementally learning such symbolic motion abstractions, our framework does not need to fully ground all constraints in advance, enabling better scalability in complex and dynamic domains.\nWe provide constraint formulations either using fluent conditions and effects, or just using precedence and resource constraints. This flexibility enables different schedulers (e.g., Aries (Bit-Monnot 2023) and OR-Tools (Perron and Didier 2025)) to be combined with various motion planners (e.g., ST-RRT* (Grothe et al. 2022)) under various settings (optimal/non-optimal, with/without fluents). We evaluate these combinations on classical logistics and job-shop benchmarks (Taillard 1993) augmented with navigation tasks.\nCode — https://github.com/fbk-pso/tampest.git\nIntroduction\nTask and Motion Planning (TAMP) is the problem of combining high-level decision-making, i.e., deciding which tasks to perform, with low-level motion planning, i.e., ensuring that these tasks are carried out via physically feasible, collision-free trajectories (Garrett et al. 2021; Dantam 2020). This integration is critical in domains where symbolic actions must be grounded in real-world geometry and dynamics, including robotics and automated manufacturing. While traditional TAMP focuses on what to do and how to execute it, many real-world scenarios assume a predetermined set of tasks, shifting the challenge to if and when to perform them. This reframes the problem as scheduling under resource, precedence, and timing constraints. For example, in an automated warehouse, mobile robots must transport goods from storage to delivery stations. With tasks such as move, pick, and drop predetermined, the problem
2019; Li 2023), kinematic constraints (Hönig et al. 2017; Ma et al. 2019) or temporal dependencies (Jiang, Lin, and Li 2025) partially address these limitations, but rely on discretization, while continuous-time MAPF (Andreychuk et al. 2019) still omits full kinodynamic modeling. We therefore see MAPF\nFigure 1: SAMP schedule of robots (r1, r2) performing overlapping move–pick–drop tasks.
able objects whose configurations change during tasks, and\n2015; Tosello, Valentini, and Micheli 2024).", + "paper_id": "2603.10651", + "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions", + "authors": [ + "Elisa Tosello", + "Arthur Bit-Monnot", + "Davide Lusuardi", + "Alessandro Valentini", + "Andrea Micheli" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10651v1", + "chunk_index": 5, + "total_chunks": 44, + "char_count": 970, + "word_count": 143, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "687bb6be-e30f-42b8-b6df-201962a8d1d6", + "text": "While effective the schedule defines when they move, pick items, or return,\nin combining symbolic and geometric planning, they neglect along with the trajectories and control laws enabling these\nthe temporal dimension, crucial when dealing with multi- actions. Activities can be optional, allowing the scheduler to\nagent scenarios. This gap led to works on temporal coordi- skip them if they are desirable but non-essential (their innation, from motion-control strategies (Pecora et al. 2018) clusion improves plan quality) or to select among mutually\nto optimal multi-agent planning (Faroni et al. 2024) and exclusive alternatives (a delivery may be omitted if it would\nTemporal Task and Motion Planning (Tosello, Valentini, and create an irresolvable motion conflict, e.g., one robot blockMicheli 2025). 
However, they overlook cases where activi- ing another).", + "paper_id": "2603.10651", + "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions", + "authors": [ + "Elisa Tosello", + "Arthur Bit-Monnot", + "Davide Lusuardi", + "Alessandro Valentini", + "Andrea Micheli" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10651v1", + "chunk_index": 6, + "total_chunks": 44, + "char_count": 862, + "word_count": 127, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a57ffc34-fe42-4282-8799-0eba74bca768", + "text": "Motions must be geometrically feasible, avoidties are predefined and the problem shifts toward scheduling, ing static (walls, shelves) and dynamic (other robots) obstafocusing on the temporal allocation and synchronization of cles, and temporally feasible, satisfying the timing schedulknown tasks rather than their dynamic generation. ing constraints. We call this problem SAMP and formalize it\nThis motivates Simultaneous Task and Motion Schedul- here, starting from the concept of Optional Scheduling (OS).\ning (STAAMS), which assigns and orders high-level actions\nDefinition 1. An Optional Scheduling (OS) problem (withwhile accounting for motion-level constraints. Although\nfluents and effects) is a tuple ϕ = ⟨V, A, R, C, eff, init⟩:STAAMS combines Constraint Programming and Motion\nPlanning, most existing approaches are tailored to spe- • V = {f1, .., fk} is a finite set of fluents f ∈V , each with\ncific domains, e.g., dual-arm manipulation (Zanlongo et al. a finite domain Dom(f).\n2021), traffic coordination (Leet et al. 
2023), and assembly lines (Neville, Chernova, and Ravichandar 2023).\n• A is the set of mandatory and optional activities, where
We therefore consider benchmarks that may appear simpler (e.g., 2D navigation) but remain non-trivial, as they require tightly interleaving scheduling decisions with motion-level feasibility over time. Using these benchmarks, we compare against a fully sequential SAMP baseline (without parallelism), isolating the contribution of our refinement-based interleaving strategy, which yields a 41% improvement when parallelization is enabled. Reasoning over continuous state spaces with explicit modeling of agent kinematics and dynamics can be com-\n– Ct the set of constraints enforcing precedence relations and temporal ordering between activities. They are arbitrary Boolean combination of atoms of the form:\n* a.present for some a ∈A, where a.present is True iff the activity is scheduled (i.e. appears in the solution);\n* κ1 −κ2 ≤∆t, with κi ∈{aj.start, aj.end} for aj ∈A, and ∆t ∈Z being the maximum delay between them.\n– Cr the set of constraints on resource usage, where activity a ∈A uses γar units of resource r over [a.start, a.end].\n• eff : A →E maps an activity to its timed effects on fluents. Each element of eff(a) is of the form (κ, f := v) with κ being either a.start + k or a.end −k with k ∈N, f ∈V and v ∈Dom(f); it indicates that at timing κ (relative to a), fluent f is assigned to value v due to activity a.\n• init is the initial fluent state, which assigns a value init(f) ∈Dom(f) to each f ∈V at time 0.\nThe schedule solving an OS problem is defined as follows. A schedule ρ solving ϕ is a tuple ⟨p, s, e⟩:\n• p : A →{⊤, ⊥} indicates if an activity is present,\n• s : A →N indicates the starting time of an activity, and\n• e : A →N indicates the ending time.\nA Scheduling and Motion Planning (SAMP) problem is a tuple ψ = ⟨ϕ, O, W, Q, u, i, mc⟩, where:\n• ϕ = ⟨V, A, R, C, eff, init⟩is an OS as per Definition 1.\n• O ⊆R is a set of movable objects, where each object o ∈O is characterized by a geometric model go and a control model uo, with λo = 1 (only one is available).\n• W ⊆RN (N = 2 or N = 3) is the workspace, i.e., the volume of reachable end-points for objects in O. Wfree is the portion of W that is free from fixed obstacles.\n• Q is the configuration space, with Qo ⊆Q the subset of Q representing the configurations that o ∈O may assume given its motion model. occ(o, q) ⊆Wfree is the set of points in Wfree occupied by o when in q ∈Qo.\nWe now define the semantics.
validity of a non-conflicting schedule, we first introduce a function tracking fluent changes over time. For a non-conflicting schedule ρ = ⟨p, s, e⟩, the evaluation function ξ_ρ : V × N → ⋃_{f∈V} Dom(f) maps a fluent f ∈ V and a time point t ∈ N to the value of f at time t under ρ. It is defined as:
ξ_ρ(f, t) =
  init(f) if t = 0,
  v if ∃a ∈ A s.t. p(a) ∧ (κ, f := v) ∈ eff(a) ∧ ρ(κ) = t,
  ξ_ρ(f, t − 1) otherwise.
Intuitively, ξ_ρ(f, t) gives the value of fluent f at time t, set by the most recent activity a ending by t that updates f. If none exists, it returns the initial value of f.
For each activity a where mc(a) ≠ ⊥, we set γ_a^{o_a} = 1, i.e., each activity moving object o uses the resource o. This definition adds motion constraints to an OS problem. Since a trajectory τ(a) : R≥0 → Q specifies the configuration of object o_a at each time t ∈ [s(a), e(a)], describing its continuous motion from q_a^S to q_a^G, we now extend solution schedules to handle SAMP problems.
Definition 7. A SAMP schedule for ψ = ⟨ϕ, O, W, Q, u, i, mc⟩ is a tuple π = ⟨p, s, e, τ⟩, with ⟨p, s, e⟩ = ρ a schedule for the OS problem ϕ, and τ : A → (R≥0 → Q ∪ {⊥}) a function that assigns to each a ∈ A a trajectory for the movable object o_a, if mc(a) ≠ ⊥.
Note that we use (as customary) integer time for scheduling and real time for motion trajectories, following common practice in each domain; uniforming the time domain would require only minor adjustments in the formalization. Let o_a be the object moved by activity a ∈ A, i.e., whose motion constraint is mc(a) = ⟨o_a, q_a^S, q_a^G⟩.
Validity and optimality of ρ can then be defined as follows.
Definition 4. Let the set of activities active at time t under schedule ρ be A_t^ρ = {a ∈ A | p(a) ∧ s(a) ≤ t ≤ e(a)}. A schedule ρ is valid for an OS ϕ if it is non-conflicting and the following conditions hold:
1.
∀a ∈ A, ¬p(a) ∨ e(a) − s(a) ∈ [lb_a, ub_a], i.e., if the activity is present, its duration satisfies the duration bounds.
2. For each r ∈ R and for all t ∈ N, Σ_{a ∈ A_t^ρ} γ_a^r ≤ λ_r, i.e., the total resource demand at any time does not exceed availability.
3. Constraints in C_t are satisfied using standard Boolean logic, with the value of atoms defined as follows:
• a.present is true iff p(a) (presence);
• κ1 − κ2 ≤ ∆t iff ρ(κ1) − ρ(κ2) ≤ ∆t (precedence).
A SAMP schedule is non-conflicting if ρ is non-conflicting and there exists no time t ∈ R and activities a1 ≠ a2 ∈ π such that o_{a1} = o_{a2} and s(a1) ≤ t ≤ e(a1) ∧ s(a2) ≤ t ≤ e(a2), i.e., two activities moving the same object do not overlap in time. Note that any valid OS schedule satisfies this condition, as movable objects are modeled as unary resources. We now define a function that maps object configurations over time under a non-conflicting SAMP schedule.
Definition 8. Let π be a non-conflicting SAMP schedule, and let the sequence of motions moving o be A_o^π = ⟨a ∈ A |
4.
Definition 5. Given an OS ϕ, a set of schedules S, and a function opt : S → R to be minimized, ρ ∈ S is optimal for ϕ if it is valid for ϕ and, for every other valid schedule ρ′ ≠ ρ ∈ S, opt(ρ) ≤ opt(ρ′). We now incorporate motion activities, i.e., tasks involving object movement subject to motion constraints.
The evaluation function ζ_π : O × R≥0 → Q, which returns the configuration of o ∈ O at time t ∈ R≥0, is defined as:
ζ_π(o, t) =
  i(o) if t < s(a_0),
  τ(a_i)(t) if s(a_i) ≤ t ≤ e(a_i), i ∈ {0, ..., n},
  τ(a_i)(e(a_i)) if e(a_i) < t < s(a_{i+1}), i ∈ {0, ..., n−1},
  τ(a_n)(e(a_n)) if t > e(a_n).
Algorithm 1: The Core Framework
1 begin SOLVE(ψ, opt, t_p, timeout)
2   ψ′ ← ψ; it ← 0
3   while Now() < timeout do
4     ρ, status ← get-schedule(ψ′, opt) ▷ Invoke the scheduler
5     if status ∈ [VALID, OPTIMAL] then
6       τ(ρ) ← ∅
7       conf(o) ← i(o) ∀o ∈ O
8       foreach G ∈ P(ρ) do ▷ Check each parallel motion group
9         foreach a ∈ G do ▷ Geometric check of each activity
10          if ¬GETMOTIONORREFINE({a}, ψ′, conf, GEOM, t_p) then goto 20
11        foreach a ∈ G do ▷ Temporal check of each activity
12          if ¬GETMOTIONORREFINE({a}, ψ′, conf, TIME, t_p) then goto 20
13        τ(G) ← GETMOTIONORREFINE(G, ψ′, conf, ALL, t_p)
14        if τ(G) ≠ ∅ then τ(ρ) ← τ(ρ) ∪ τ(G) else goto 20
15        update(conf, G)
16      if τ(ρ) ≠ ∅ then return ⟨ρ, τ⟩ ▷ Return SAMP schedule
Figure 2: Our framework. Given a SAMP problem ψ, the scheduler sends a candidate schedule ρ to the motion planner. If it is invalid, the planner returns geometric (unreachable configurations Σ and obstacles Ω) and temporal (new delays d and durations δ) refinements until a valid SAMP schedule π (with trajectories τ) is found, if one exists.
The validity of π is then defined as follows. Definition 9.
A non-conflicting SAMP schedule π is valid for the SAMP problem ψ if ρ is valid for the OS problem ϕ as per Definition 4 and the following constraints hold:
1. ∀t ∈ R≥0 and ∀o_i, o_j ∈ O such that o_i ≠ o_j, occ(o_i, ζ_π(o_i, t)) ∩ occ(o_j, ζ_π(o_j, t)) = ∅, i.e., object motions are collision-free.
2. ∀o ∈ O and ∀t ∈ R≥0, the configuration ζ_π(o, t) lies on a trajectory that is dynamically feasible under the control model u_o of o, i.e., that can be executed by its controller.
3. ∀o ∈ O, let A_o^π = ⟨a_0, ..., a_n⟩ be the sequence of activities in π moving o. Then τ(a_i)(e(a_i)) = τ(a_{i+1})(s(a_{i+1})) for all i ∈ {0, ..., n−1}, ensuring that the trajectory of o is continuous in space-time across all activities moving it.
Algorithm 1 (continued):
17    else
18      if it == 0 then return UNSOLVABLE
19      t_p ← 2 · t_p; ψ′ ← ψ ▷ Double timeout and reset ψ
20    it ← it + 1
21  return INCOMPLETE
22 begin GETMOTIONORREFINE(G, ψ, conf, refs, t_p)
23   τ(G) ← ∅; path-found ← True
24   s_min ← min({s(a) | a ∈ G})
25   C_G ← {(mc(a), δ_a = s(a) − s_min) | a ∈ G}
26   if refs = GEOM ∧ G = {a} then
27     path-found, Σ, Ω ← get-path(C_G, conf, t_p)
28   else
29     τ(G′), d, δ, Σ, Ω ← get-motion(C_G, conf, t_p) ▷ G′ ⊆ G
30   if τ(G) = ∅ ∨ ¬path-found then
31     ψ.add-geometric-refinements(G, Σ, Ω, conf)
32   else if refs = TIME or refs = ALL then
33     if ¬ ⋀_{a∈G} (d̄_a + δ̄_a ≤ d_a + δ_a) then
34       C_G ← {(mc(a), δ̄_a) | a ∈ G}
35       ψ.add-temporal-refinements(G, C_G, d, conf)
36   return ∅ ▷ reached when refinements were added
37   return τ(G)
One interesting case is the one with no fluents nor effects. This is practically relevant because not all schedulers support fluents (e.g., OR-Tools (Perron and Didier 2025)).
Definition 10. An OS problem without fluents is defined as an OS ϕ with V = ∅. Accordingly, a SAMP problem without fluents is defined as a SAMP ψ with V = ∅.
In this paper, we propose a framework for SAMP that supports different schedulers by expressing constraints either using fluent conditions and effects or just using precedence constraints.
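The interleaving loop of Algorithm 1 can be sketched compactly as follows. The `get_schedule`, `get_motion_or_refine`, and `parallel_groups` callables are hypothetical stand-ins for the off-the-shelf scheduler, the motion planner wrapper, and the group computation; the timeout-doubling restart and the per-activity Layer-1 checks are elided.

```python
import time

def solve(psi, opt, tp, timeout, get_schedule, get_motion_or_refine, parallel_groups):
    """Sketch of Algorithm 1's outer loop: alternate scheduling and motion checks.
    get_schedule(psi, opt) -> (rho, status); get_motion_or_refine(G, psi, conf, tp)
    returns trajectories, or None after adding a refinement to psi."""
    it = 0
    psi_prime = dict(psi)                       # working copy of the problem
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        rho, status = get_schedule(psi_prime, opt)
        if status in ("VALID", "OPTIMAL"):
            trajectories = {}
            conf = dict(psi["initial_conf"])    # conf(o) <- i(o)
            for G in parallel_groups(rho):      # each parallel motion group
                tau_G = get_motion_or_refine(G, psi_prime, conf, tp)
                if tau_G is None:               # refinement added: reschedule
                    break
                trajectories.update(tau_G)
                for a in G:                     # object ends at the activity's goal
                    conf[a["object"]] = a["goal"]
            else:
                return rho, trajectories        # all groups valid: SAMP schedule
        elif it == 0:
            return "UNSOLVABLE"
        it += 1
    return "INCOMPLETE"
```

With a scheduler stub that immediately returns a valid schedule and no motion groups, the loop returns the schedule with an empty trajectory map on the first iteration.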
The framework is detailed in the next section.
The Core Framework
Our SAMP framework incrementally learns symbolic abstractions of motion tasks (Algorithm 1). It interleaves an off-the-shelf scheduler, which proposes a motion-agnostic candidate schedule ρ (line 4), with an off-the-shelf motion planner that checks the feasibility of ρ via GETMOTIONORREFINE (line 13). The motion planner returns valid trajectories, which are used to decorate the motion activities in ρ, if they exist; otherwise, it provides spatio-temporal refinements for the next scheduling iteration (Figure 2). The continuity of object trajectories (third condition of Definition 9) can be enforced either by a fluent tracking each object's configuration and imposing a condition for each motion activity, or via precedence constraints restricting the admissible ordering of motion activities. Solving the motion planning problem monolithically would be computationally infeasible; thus, we divide the schedule into parallel motion groups: subsets of activities that can interfere with each other but are independent from other groups.
Definition 11. Two activities a, b ∈ A are parallel in ρ if p(a) and p(b) hold, and ∃t ∈ R such that s(a) ≤ t ≤ e(a) ∧ s(b) ≤ t ≤ e(b). They are further defined as motion-parallel if they are parallel, mc(a) ≠ ⊥, and mc(b) ≠ ⊥.
A parallel motion group G(ρ) (G for the rest of the paper) is
a maximal set of motion-parallel activities in ρ, with P(ρ) the set of all such groups (see Figure 3). The initial problem submitted to the scheduler is the OS ϕ
For an activity a moving o_a from q_a^S to q_a^G, the reachable set is:
σ̃_a = {q ∈ Q_ψ | occ(o_a, q) ⊆ reach(q_a^S)},
where reach(q_a^S) is the region reachable from q_a^S, computed from the area explored by the motion planner. Then, σ_a = Q_ψ \ σ̃_a includes all unreachable configurations, i.e., all configurations outside reach(q_a^S), including q_a^G if it is not reachable.
• Ω = {ω_a ⊂ O | a ∈ G}: the set of blocking obstacles identified by collision checking for each a ∈ G. Given a, when using a sampling-based motion planner, an object o ∈ O is added to ω_a if o ≠ o_a and a collision was detected between o and o_a when extending the motion tree of o_a, i.e., o blocks the expansion of possible motions of o_a.
In addition, GETMOTIONORREFINE:
• Adds geometric refinements: it exploits the motion planner's exploration to identify unreachable locations and blocking obstacles for those activities that are spatially infeasible, and adds them as new constraints of ψ.
• Adds temporal refinements: it adds constraints on execution durations and inter-activity delays for motions that are geometrically feasible but violate the temporal constraints imposed by the scheduler, ensuring safe synchronization.
If τ(G) exists, the planner updates each object's configuration to the goal of the last activity moving it (line 15) and goes to the next group. If all groups are valid, a SAMP plan π = ⟨ρ, τ⟩ is returned (line 16), optimal under Definition 5 (assuming optimality depends solely on ρ, not on trajectory optimization); otherwise, new constraints are generated. Before describing how constraints are computed, we give a few final details of the core framework. Since many motion planners are sampling-based, they may fail to terminate if no path exists, or time out despite a solution existing. Such spatial conflicts are used to inform the scheduler that the configuration of at least one blocking obstacle must be modified to make an otherwise unreachable configuration reachable.
This refinement is performed by ψ.add-geometric-refinements(G, Σ, Ω, conf) (line 31).
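The parallel motion groups P(ρ) of Definition 11 are maximal sets of present motion activities whose time intervals overlap, transitively. A minimal sketch (hypothetical dict-based activity records; sorting by start time makes the transitive merge a single pass):

```python
def parallel_motion_groups(activities):
    """Partition present motion activities into maximal parallel motion groups
    (Definition 11): activities whose closed [start, end] intervals overlap,
    transitively. `activities`: dicts with keys name, present, start, end,
    and mc (None means no motion constraint). Sketch with hypothetical fields."""
    motion = sorted((a for a in activities if a["present"] and a["mc"] is not None),
                    key=lambda a: a["start"])
    groups = []
    for a in motion:
        # a joins the current group iff it starts no later than the group's max end.
        if groups and a["start"] <= max(x["end"] for x in groups[-1]):
            groups[-1].append(a)
        else:
            groups.append([a])
    return groups
```

Activities without a motion constraint are ignored, matching the definition: only motion-parallel activities form groups.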
(i.e., s(a_o^min) ≤ s(a) ∀a ∈ G_o). We define the refinement condition formula RCOND(G) as:
RCOND(G) = ⋀_{a∈G} a.present ∧ ⋀_{o∈O} ⋀_{a ∈ G_o \ {a_o^min}} (a_o^min.end ≤ a.start) ∧ ⋀_{r∈G̃} ¬overlaps(r, G)
It specifies the condition under which all activities in G are scheduled, with each activity moving an object, and all rival activities not overlapping G (temporal and geometric refinements thus depend only on the activities of the group and the initial configuration of each movable object).
The final optimization (gray box of Algorithm 1) reduces the cost of evaluating entire groups via a layering architecture that, before checking entire groups (Layer 2), validates their single activities (Layer 1). Each activity undergoes a geometric feasibility check (line 9), followed by a temporal feasibility check (line 11). Both operate as specialized instances of GETMOTIONORREFINE (see Algorithm 1), but applied to single activities and their respective refinements. In this context, the geometric check effectively verifies the existence of a path, simplifying the motion planner's role to that of a path finder (line 27). The new
RCOND(G) → ⋁_{b ∈ G, o ∈ ω_b} CHCONF(b, conf(o))
with o ∈ ω_b being a blocking obstacle for b ∈ G and conf(o) its current configuration. That is, if G is scheduled and none of its rival activities overlap with it, then the configuration of at least one blocking obstacle must change. This constraint can be encoded either using fluents or without them.
Geometric and Temporal Refinements
In this section, we detail how GETMOTIONORREFINE computes and formulates the spatio-temporal refinements.
Geometric Refinements. Let Q_ψ ⊆ Q be the finite subset of configurations relevant to the problem, i.e., those actually involved in the motion activities of the problem: Q_ψ = {q_a^S, q_a^G | a ∈ A, mc(a) = ⟨o_a, q_a^S, q_a^G⟩}.
Temporal Refinements. Geometric feasibility does not guarantee temporal feasibility: scheduled times may differ from those needed for execution, and parallel motions may require delay adjustments for collision-free synchronization. GETMOTIONORREFINE performs this check. It computes the earliest start time s_min = min{s(a) | a ∈ G} among the activities of the group (line 24), and it collects C_G (line 25), i.e., the set of motion constraints with scheduled start times δ_a = s(a) − s_min. If for at least one subset G′ ⊆ G (possibly G itself) the motion planner identifies trajectories τ(G′) that are geometrically feasible but fail to satisfy the timing constraints imposed by the scheduler (line 29), then get-motion immediately returns (as this is sufficient to prove the whole candidate schedule is infeasible) and outputs:
• d = {d̄_a ∈ R≥0 | a ∈ G′}: the new estimated durations.
• δ = {δ̄_a ∈ R≥0 | a ∈ G′}: the new estimated delays.
Figure 4: Temporal refinement for G = {a, b, c}, starting at s_min (the start of a). The motion planner delays the start of b (from δ_b to δ̄_b) and increases its duration (from d_b to d̄_b).
These values are computed from the space-time trajectories generated by the motion planner. In the case of fluents, for each activity b ∈ G we introduce into ψ a corresponding auxiliary activity b′.
Specifically, given an activity a moving o_a, and given the space-time sequence of states along its planned trajectory, we compute the actual motion start time δ̄_a as the earliest timestamp at which o_a exhibits non-negligible translational or angular displacement. The activity b′ has the same start and end times as b, and it includes a precondition on the fluent f_o, which represents the configuration of the object o ∈ ω_b. This precondition requires that the value of f_o be different from the blocking configuration conf(o).
If the computed timings exceed the scheduled ones, ψ.add-temporal-refinements(G, C_G, d, conf) uses the needed durations (d), delays (δ, included in C_G), and current configurations (conf) to add to the problem a new temporal refinement indicating that G cannot be executed as scheduled unless at least one duration or delay is adjusted (line 35). Formally:
RCOND(G) → CHTIME(G)
This means that if the parallel group G is scheduled and no rival r ∈ G̃ overlaps with any a ∈ G, the timing of the activities in G must be adjusted. Given δ̄_a, d̄_a, and ω_a, CHTIME(G) is:
⋁_{o ∈ O, a ∈ G} CHCONF(a, conf(o)) ∨ ⋁_{a ∈ G} (a.start − min_{b ∈ G} b.start < δ̄_a) ∨ ⋁_{a ∈ G | (d_a + δ_a) < (δ̄_a + d̄_a)} (a.end − min_{b ∈ G} b.start ≥ δ̄_a + d̄_a)
Without fluents, helper activities H = {h ∈ A | mc(h) = ⟨o_h, q_h^S, q_h^G⟩ ∧ q_h^G ≠ conf(o)} move o to any state different from conf(o), and deleter activities X = {x ∈ A | mc(x) = ⟨o, q_x^S, q_x^G⟩ ∧ q_x^G = conf(o)} place o in the blocking configuration conf(o). We define CHCONF(b, conf(o)) as:
DEL(b, conf(o)) ∨ ⋁_{h ∈ H} [(h.present ∧ (h.end < b.start)) ∧ ⋀_{x ∈ X} (x.present → (x.end < h.start) ∨ (x.start > b.end))]
with DEL(b, conf(o)) being:
⋀_{x ∈ X} (x.present → (x.start > b.end)) if conf(o) ≠ i(o), and False otherwise.
Intuitively, this constraint requires that either there is no deleter activity before b and the initial configuration of o is different from conf(o), or that there exists a helper activity occurring before b and any deleter activities happen before the helper or after b. In essence, obstacles must be removed before executing a motion they would otherwise block.
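The new delays δ̄_a and durations d̄_a that trigger these temporal refinements are read off the planner's space-time trajectories (first non-negligible displacement, then time until rest). A minimal sketch; the 1-D positions and the displacement threshold `eps` are illustrative assumptions, not the paper's representation:

```python
def delay_and_duration(states, s_min, eps=1e-6):
    """Estimate the actual motion start delay (w.r.t. s_min) and duration from a
    space-time trajectory given as (timestamp, position) pairs: the delay is the
    first timestamp with non-negligible displacement, the duration spans from that
    first movement until the object comes to rest. Hypothetical trajectory format."""
    segments = list(zip(states, states[1:]))
    moving_starts = [t for (t, q), (t2, q2) in segments if abs(q2 - q) > eps]
    if not moving_starts:
        return 0.0, 0.0                      # the object never moves
    first = moving_starts[0]
    last_end = max(t2 for (t, q), (t2, q2) in segments if abs(q2 - q) > eps)
    return first - s_min, last_end - first
```

For a trajectory that stays put for one time unit, moves for two, and then rests, this yields a delay of 1 and a duration of 2.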
Thus, the scheduler must either require an object's configuration to change before at least one group activity starts, advance the start of at least one activity in the subgroup, or extend the duration of some activity a in the subgroup to at least the value δ̄_a + d̄_a estimated by the motion planner.
To improve performance and avoid repeated computation, we propagate and cache reachability information for all equivalent objects, i.e., objects sharing the same geometry and control model and located within the same reachability area. For singleton groups G = {a}, we generalize geometric constraints to all activities moving equivalent objects from the same reachability area σ̃_a toward a target configuration within the set of configurations σ_a deemed unreachable by a.
As an example, consider the parallel motion group G = {a, b, c} of Figure 4, which starts at s_min = s(a) (the start time of a). To ensure b is feasible when executed in parallel with a (G′ = {a, b}), the motion planner schedules b to start at δ̄_b with an updated duration d̄_b (no obstacle obstructs o_b, the object moved by b).
In this case, the motion planner must inform the scheduler that either (i) b must be anticipated with
respect to the current schedule (an option the scheduler has not yet requested the motion planner to evaluate), or (ii) b must be assigned a new end time equal to δ̄_b + d̄_b:
CHTIME(G′) = b.start − s(a) < δ̄_b ∨ b.end − s(a) ≥ δ̄_b + d̄_b.
Figure 5: A logistics scenario with two robots (r0, r1) delivering two items (c0, c1) from l0 and l1. The first schedule is infeasible, as Σ = {l0, l1} is blocked by Ω = {door} (Layer 1, RRT). The second schedule is geometrically feasible, but its trajectories need updated delays δ and durations d (Layer 2, ST-RRT*). This motion-planning feedback leads to a final valid SAMP schedule.
Available activities include robot navigation, door opening/closing, and item loading/unloading, all with certain durations. Only navigation requires motion planning with obstacle avoidance; door activities are instantaneous changes of door configurations, and the other activities are purely symbolic. The optimization metric aims to minimize the makespan.
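The two disjuncts of this CHTIME instance can be checked numerically. A toy sketch with made-up times (not taken from the paper's benchmarks):

```python
def chtime_disjuncts(b_start, b_end, a_start, delta_b, d_b):
    """CHTIME(G') for G' = {a, b}: the next schedule must either anticipate b
    (b.start - s(a) < delta_b) or extend it (b.end - s(a) >= delta_b + d_b).
    Returns the truth value of each disjunct. Toy sketch with scalar times."""
    anticipate = (b_start - a_start) < delta_b
    extend = (b_end - a_start) >= delta_b + d_b
    return anticipate, extend
```

With a starting at 0, b scheduled over [2, 5], and planner estimates δ̄_b = 3 and d̄_b = 4, anticipating b satisfies the constraint while the scheduled end time does not.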
Despite its apparent simplicity, the problem remains challenging: multi-robot coordination requires time-parametrized, dynamically feasible trajectories in continuous space (with car-like dynamics), and the planner's search space grows exponentially with the number of agents. As a result, our framework synchronizes parallel motion activities by postponing starts, adjusting trajectories and durations, or executing stop-and-go maneuvers on the objects.
Formal guarantees.
We consider nr ∈ {1, 2, 3} robots, ni ∈ {1, 2, 3} items to be treated, and nm ∈ {1, 2, 4, 6} machines for treatment (i.e., nm doors to open) [tot. instances: 36].
Relative completeness (returning a solution if one exists) relies on showing that the learned constraints always prune the candidate schedule from the solution space of the scheduler and do not cut any valid solution. The first property follows from the presence of the spatio-temporal refinements, while the second relies on the refinements being triggered by RCOND (see Appendix).
Our framework is domain-independent and built upon the Unified Planning library (Micheli et al. 2025), enabling seamless substitution or extension of the scheduling methods. On the motion planning side, it integrates the Open Motion Planning Library (OMPL) (Şucan, Moll, and Kavraki 2012), supporting all its planners. In our experiments, we use Aries (Bit-Monnot 2023) and its optimal variant Aries-opt (both with and without fluents), and our OR-Tools-based Constraint Programming Scheduling Engine (CPSE, without fluents). They are combined with RRT (LaValle 1998) for path planning (Layer 1) and ST-RRT* (Grothe et al. 2022) for space-time multi-robot motion planning (t_p = 10 s, Layer 2), following the layering approach of the gray box of Algorithm 1. We also customize the collision checker to record obstacles encountered during the search.
Experimental Evaluation
We now evaluate our framework's ability to generate valid plans in complex multi-object scenarios using state-of-the-art solvers. We extend the logistics benchmark and the Job Shop Problem with transportation (JSP) (Nouri, Driss, and Ghédira 2016) to include navigation tasks, stressing both the scheduler and motion planner with space-time refinements:
• Logistics: nr robots, starting at a depot, must transport items from ns shelves back to the depot (as in Figure 1).
Shelves are arranged into narrow corridors, sometimes blocked by obstacles (closed doors) that must be moved to allow access. Each shelf contains ni items. Layer 2 checks motion feasibility sequentially: for parallel robots, spatio-temporal trajectories are planned one at a time, each respecting previously planned trajectories to avoid collisions.
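The sequential Layer-2 strategy just described amounts to prioritized planning; a minimal sketch, where the hypothetical `plan_one` callable stands in for the ST-RRT* query against previously committed trajectories:

```python
def plan_group_sequentially(robots, plan_one, committed=None):
    """Prioritized planning sketch: plan each robot's space-time trajectory in
    turn, treating already-committed trajectories as moving obstacles.
    plan_one(robot, committed) -> trajectory or None. Hypothetical interface."""
    committed = dict(committed or {})
    for r in robots:
        traj = plan_one(r, committed)
        if traj is None:
            return None      # failure is reported upstream, triggering a refinement
        committed[r] = traj
    return committed
```

This trades completeness for scalability: a feasible joint plan may be missed because of the chosen ordering, but each individual query stays tractable.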
This highlights the difficulty of our setting,", + "paper_id": "2603.10651", + "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions", + "authors": [ + "Elisa Tosello", + "Arthur Bit-Monnot", + "Davide Lusuardi", + "Alessandro Valentini", + "Andrea Micheli" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10651v1", + "chunk_index": 19, + "total_chunks": 44, + "char_count": 511, + "word_count": 83, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ac32de1d-32da-4192-83d5-b029742d76b7", + "text": "Benchmark CPSE (no fluents) Aries (no fluents) Aries (with fluents) CPSE-opt (no fluents) Aries-opt (no fluents) Aries-opt (with fluents)\n#sol t [s] (% tp) refs #sol t [s] (% tp) refs #sol t [s] (% tp) refs #sol t [s] (% tp) refs #sol t [s] (% tp) refs #sol t [s] (% tp) refs\nLOG. OC-DO 16.7 323 (84%) 0.0, 6.4, 1.6 9.0 60 (81%) 0.0, 4.2, 0.0 18.3 286 (88%) 0.0, 6.1, 2.5 13.3 227 (77%) 0.0, 5.0, 0.9 10.3 126 (72%) 0.0, 4.2, 0.2 12.7 166 (78%) 0.0, 4.9, 0.8\nLOG. OC-DC 14.7 359 (80%) 1.0, 9.1, 1.5 11.0 161 (85%) 1.0, 7.0, 0.2 19.3 377 (87%) 1.0, 10.9, 2.1 12.3 416 (75%) 1.0, 8.6, 1.9 8.0 251 (75%) 1.0, 6.5, 0.0 10.0 203 (78%) 1.0, 6.9, 0.3\nLOG. ALL-DO 14.0 447 (72%) 0.0, 10.5, 2.1 9.3 96 (79%) 0.0, 6.4, 0.1 17.0 270 (87%) 0.0, 10.5, 2.6 9.7 349 (76%) 0.0, 8.7, 3.7 7.3 198 (81%) 0.0, 7.3, 3.1 8.3 195 (79%) 0.0, 6.8, 1.8\nLOG. 
ALL-DC 11.7 401 (68%) 1.0, 12.3, 1.1 10.3 166 (81%) 0.9, 7.2, 0.4 16.0 392 (81%) 1.0, 11.9, 2.5 7.3 336 (76%) 1.0, 12.4, 3.1 2.3 165 (92%) 1.0, 3.6, 2.7 6.0 224 (76%) 1.0, 10.3, 0.6
JSP 13.0 355 (76%) 1.5, 6.7, 0.3 12.0 152 (89%) 1.5, 3.6, 0.1 17.0 389 (91%) 1.9, 9.5, 0.5 11.7 342 (65%) 1.4, 3.9, 0.1 12.3 210 (77%) 1.5, 3.7, 0.0 17.7 291 (79%) 2.0, 6.5, 0.1
TOTAL 70.0 378 (76%) 0.7, 8.8, 1.4 51.7 132 (83%) 0.8, 5.7, 0.2 87.7 343 (87%) 0.8, 9.7, 2.1 54.3 332 (74%) 0.7, 7.2, 1.7 40.3 194 (77%) 0.7, 5.1, 0.8 54.7 227 (78%) 0.9, 6.7, 0.6
Table 1: Overall performance: each cell shows the number of problems solved (#sol), average planning time in seconds (t), percentage of time spent in motion planning (%tp), and average refinement counts (refs) by type and layer: geometric (single activity), temporal (single activity), and group (combined geometric and temporal). All averaged over three runs per instance.
which combines an already challenging multi-robot motion-planning problem with a scheduling problem that must sequence many (potentially optional) activities. Tests were run on an AMD EPYC 7413 with a 1800 s timeout and a 20 GB memory limit.
due to unrealistic duration estimates: we use a standard symmetric trapezoidal velocity profile, but it ignores multi-agent interactions, necessitating additional temporal refinements. Focusing on performance, Aries with fluents performs
best, solving 87.7 instances.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 20,
    "total_chunks": 44,
    "char_count": 2259,
    "word_count": 410,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "553d6f81-7bcf-4dd7-b3e1-d65cb03ce588",
    "text": "Results. Table 1 shows the performance averaged over three runs per instance, accounting for variability of sampling-based motion planners. In both domains, all solvers solve at least one instance with 3 robots, confirming correct handling of temporal constraints and inter-object synchronization. In logistics (see Figure 5), at least one instance is solved with 1 robot and 8 items, 2 robots and 8 items, and 3 robots and 7 items; in JSP, instances are solved with 1–3 robots, up to 2 machines, and 3 pallets.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 21,
    "total_chunks": 44,
    "char_count": 450,
    "word_count": 79,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f3cf43af-3e13-47cf-829f-a6672784748a",
    "text": "Using fluents consistently improves performance, showing that richer state representations better guide refinements. Scheduling requires many refinement loops (avg. 9.1)
due to optional navigation activities, while motion planning remains costly, taking up to 92% of total planning time. The framework effectively handles geometric complexity: in logistics scenarios, performance is similar whether doors are open or closed, showing robustness to spatial constraints. On the temporal side, solving instances with up to 3 robots indicates that the framework can manage synchronization. To further evaluate this ability and the benefits of parallelizing multi-robot activities, we compared the makespan of solutions produced by our approach with fully sequential schedules (i.e., no parallelization). For the instances considered, the average theoretical maximum improvement in makespan due to parallelization is 50% (e.g., 2 robots picking 2 items from shelves in parallel versus sequentially). In all cases where parallelization can improve the makespan, our approach achieves an average reduction of 41%.
Solvers generally handle more instances without makespan minimization, reflecting the added complexity of optimization. In Aries-Opt with fluents, the planning time spent on motion planning drops to 78%, compared to 87% in the non-optimal version, indicating more effort devoted to optimization. Without fluents, CPSE outperforms Aries (124.3 vs. 92.0 total instances solved, opt and non-opt), suggesting CPSE is more effective in this configuration. A final observation concerns refinements under different door settings. In the logistics domain, Aries-Opt without fluents generates more refinements when doors are open (ALL-DO) than when closed (ALL-DC). With open doors, fewer geometric bottlenecks and ordering constraints exist: the planner generates highly parallel schedules early in the search. Although symbolically consistent, these schedules can cause spatio-temporal conflicts at the motion-planning level (e.g., corridor congestion), requiring additional temporal refinements.
When doors are closed, door-opening activities introduce explicit ordering and synchronization constraints that restrict parallelism and reduce invalid combinations.
Handling synchronization justifies the use of ST-RRT*, a typically expensive motion planner that, however, accounts for kinodynamic constraints and produces time-optimal trajectories. Planning times remain relatively low due to the layered architecture, which absorbs most geometric and temporal refinements at the single-action level, reducing multi-robot ST-RRT* calls (refs column: single-action geometric refinements, single-action temporal refinements, joint geometric–temporal refinements at the motion-parallel-group level). Layering also improves coverage: with both layers, 359 instances are solved on average; disabling Layer 1 reduces coverage to 140, and disabling only its
Conclusion and Future Work
This paper formally defines the SAMP problem and presents a framework that solves it by interleaving off-the-shelf schedulers with motion planners, guided by incremental learning-based motion abstractions. The scheduler proposes candidate plans, and the motion planner checks feasibility, returning symbolic constraints to refine spatial and temporal decisions when needed.
single-robot temporal check (keeping the geometric one) yields 182.
Experiments on scheduling benchmarks with navigation tasks, testing various scheduling",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 22,
    "total_chunks": 44,
    "char_count": 3567,
    "word_count": 463,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3cc099b5-d5dc-4bea-809f-df61321cb92e",
    "text": "strategies (optimal, non-optimal, with/without fluents) and planners, show the framework's effectiveness in handling multiple synchronized agents, coordinated stop-and-go behaviors, and complex spatio-temporal constraints.
To further assess the value of refinements, we evaluated a sequential pipeline: first solve scheduling in a motion-agnostic way, then invoke motion planning once, without refinement if it fails. In our setup, such a sequential pipeline cannot solve any problem. Although some instances require no geometric refinements at the single-action level, there is always at least one temporal refinement. This is not
In future work, we plan to extend our framework to support MAPF in addition to motion planning. Once we understand how to generate refinements, we will layer scheduling on top of MAPF to obtain a MAPF-aware scheduler, en-
Garrett, C. R.; Lozano-Pérez, T.; and Kaelbling, L. P.
2018.
abling the framework to tackle this problem as well.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 23,
    "total_chunks": 44,
    "char_count": 969,
    "word_count": 142,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b184070c-1dca-4f78-9a1f-244de33ab98f",
    "text": "FFRob: Leveraging symbolic planning for efficient task and motion planning. The International Journal of Robotics Research, 37(1): 104–136.
Acknowledgments
This work has been partially supported by the AI4Work project funded by the EU Horizon 2020 research and innovation program under GA n. 101135990, the STEP-RL project funded by the European Research Council under GA n. 101115870, and by the Interconnected Nord-Est Innovation Ecosystem (iNEST) funded by the European Union NextGenerationEU (Piano Nazionale di Ripresa e Resilienza (PNRR) – mission 4, component 2, investment 1.5 – D.D. 1058 23/06/2022, ECS00000043).
References
Grothe, F.; Hartmann, V. N.; Orthey, A.; and Toussaint, M. 2022. ST-RRT*: Asymptotically-Optimal Bidirectional Motion Planning through Space-Time. In Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA).
Hönig, W.; Kumar, T. S.; Cohen, L.; Ma, H.; Xu, H.; Ayanian, N.; and Koenig, S. 2017. Summary: multi-agent path finding with kinematic constraints. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, 4869–4873.
Jiang, H.; Lin, M.; and Li, J. 2025. Speedup techniques for switchable temporal plan graph optimization.
Andreychuk, A.; Yakovlev, K.; Atzmon, D.; and Stern, R. 2019.
Multi-Agent Pathfinding with Continuous Time.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 24,
    "total_chunks": 44,
    "char_count": 1319,
    "word_count": 190,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "097cd956-743a-43f7-9603-093417e56699",
    "text": "Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 39–45. International Joint Conferences on Artificial Intelligence Organization.
In AAAI'25/IAAI'25/EAAI'25. ISBN 978-1-57735-897-8.
Kerimov, N.; Onegin, A.; and Yakovlev, K. 2025. Safe interval randomized path planning for manipulators.
K.; Stepanova, K.; and Babuska, R. 2020.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 25,
    "total_chunks": 44,
    "char_count": 377,
    "word_count": 46,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "1eefa377-0a12-436f-8815-23daea0a997b",
    "text": "Simultaneous task allocation and motion scheduling for complex tasks executed by multiple robots. In 2020 IEEE International Conference on Robotics and Automation (ICRA), 11443–11449.
LaValle, S. Rapidly-exploring random trees: a new tool for path planning. The annual research report.
Leet, C.; Oh, C.; Lora, M.; Koenig, S.; and Nuzzo, P. 2023.
Task Assignment, Scheduling, and Motion Planning for Automated Warehouses for Million Product Workloads.
Bit-Monnot, A. 2023. Enhancing Hybrid CP-SAT Search for Disjunctive Scheduling.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 26,
    "total_chunks": 44,
    "char_count": 535,
    "word_count": 77,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "1c6b33ee-bb5a-45dd-8f89-19c1bafc491a",
    "text": "In European Conference on Artificial Intelligence (ECAI).
In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 7362–7369.
Li, J. 2023. Intelligent planning for large-scale multi-robot coordination. In Proceedings of the Thirty-Seventh AAAI
Cashmore, M.; Fox, M.; Long, D.; Magazzeni, D.; Ridder, B.; Carrera, A.; Palomeras, N.; Hurtos, N.; and Carreras, M. 2015.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 27,
    "total_chunks": 44,
    "char_count": 395,
    "word_count": 55,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7ca18e03-fc3a-45a6-a6eb-ec85e28eb330",
    "text": "ROSPlan: Planning in the Robot Operating System.
Proceedings of the International Conference on Automated Planning and Scheduling, 25(1): 333–341.
Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, AAAI'23/IAAI'23/EAAI'23. AAAI Press. ISBN 978-1-57735-880-0.
Dantam, N. Task and Motion Planning, 1–9. Berlin, Heidelberg: Springer Berlin Heidelberg. ISBN 978-3-642-41610-1.
Chaudhuri, S.; and Kavraki,
Li, J.; Surynek, P.; Felner, A.; Ma, H.; Kumar, T. K.; Koenig, S. 2019.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 28,
    "total_chunks": 44,
    "char_count": 619,
    "word_count": 79,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "118a2877-d187-4279-9786-d645ed0e25ca",
    "text": "Multi-agent path finding for large agents. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI'19/IAAI'19/EAAI'19. ISBN 978-1-57735-809-1.
Incremental Task and Motion Planning: A Constraint-Based Approach. In Robotics: Science and Systems.
Ma, H.; Hönig, W.; Kumar, T.
Faroni, M.; Umbrico, A.; Beschi, M.; Orlandini, A.; Cesta, A.; and Pedrocchi, N. 2024. Optimal Task and Motion Planning and Execution for Multiagent Systems in Dynamic Environments. IEEE Transactions on Cybernetics, 54(6):
3366–3377.
S.; Ayanian, N.; and",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 29,
    "total_chunks": 44,
    "char_count": 700,
    "word_count": 94,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "98997c2d-3f5f-4e4f-bf47-e32f8248a546",
    "text": "Lifelong path planning with kinematic constraints for multi-agent pickup and delivery. In Pro-
Fox, M.; and Long, D. 2003. PDDL2.1: An Extension to PDDL for Expressing Temporal Planning Domains.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 30,
    "total_chunks": 44,
    "char_count": 192,
    "word_count": 28,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "63e8caa3-1386-4974-8caf-f65d32cd0220",
    "text": "Artif. Intell. Res., 20: 61–124.
ceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Sym-
Garrett, C. R.; Chitnis, R.; Holladay, R.; Kim, B.; Silver, T.; Kaelbling, L. P.; and Lozano-Pérez, T.
2021.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 31,
    "total_chunks": 44,
    "char_count": 311,
    "word_count": 43,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "05887b7f-4caa-4592-ac95-b25ec129b8bd",
    "text": "Integrated Task and Motion Planning. Annual Review of Control, Robotics, and Autonomous Systems, 4(1): 265–293.
posium on Educational Advances in Artificial Intelligence, AAAI'19/IAAI'19/EAAI'19. ISBN 978-1-57735-809-1.
R.; Lozano-Pérez, T.; and Kaelbling, L. PDDLStream: Integrating Symbolic Planners and Blackbox Samplers. In International Conference on Automated Planning and Scheduling (ICAPS).
Micheli, A.; Bit-Monnot, A.; Röger, G.; Scala, E.; Valentini, A.; Framba, L.; Rovetta, A.; Trapasso, A.; Bonassi, L.; Gerevini, A. E.; Iocchi, L.; Ingrand, F.; Köckemann, U.; Patrizi, F.; Saetti, A.; Serina, I.; and Stock, S. 2025. Planning: Modeling, manipulating and solving AI planning problems in Python. SoftwareX, 29: 102012.
Neville, G.; Chernova, S.; and Ravichandar, H.
2023.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 32,
    "total_chunks": 44,
    "char_count": 788,
    "word_count": 107,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "42bb7d3e-ca1d-4a59-97e6-5128d4312ce6",
    "text": "DITAGS: A Dynamic Interleaved Approach to Resilient Task Allocation, Scheduling, and Motion Planning. IEEE Robotics and Automation Letters, 8(2): 1037–1044.
B.; and Ghédira, K. 2016. A Classification Schema for the Job Shop Scheduling Problem with Transportation Resources: State-of-the-Art Review.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 33,
    "total_chunks": 44,
    "char_count": 299,
    "word_count": 39,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ac7bfb3c-d62b-483c-9f20-14166967e781",
    "text": "In Silhavy, R.; Senkerik, R.; Oplatkova, Z. K.; Silhavy, P.; and Prokopova, Z., eds., Artificial Intelligence Perspectives in Intelligent Systems, 1–11. Cham: Springer International Publishing. ISBN 978-3-319-33625-1.
Pecora, F.; Andreasson, H.; Mansouri, M.; and Petkov, V. 2018. A Loosely-Coupled Approach for Multi-Robot Coordination, Motion Planning and Control. Proceedings of the International Conference on Automated Planning and Scheduling, 28(1): 485–493.
Perron, L.; and Didier, F.
2025.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 34,
    "total_chunks": 44,
    "char_count": 497,
    "word_count": 65,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "03f62e23-eda9-43bf-83e5-2e8d1bdbc723",
    "text": "CP-SAT. https://developers.google.com/optimization/cp/cp_solver/.
Stern, R.; Sturtevant, N.; Felner, A.; Koenig, S.; Ma, H.; Walker, T.; Li, J.; Atzmon, D.; Cohen, L.; Kumar, T. K.; Barták, R.; and Boyarski, E. 2021. Multi-Agent Pathfinding: Definitions, Variants, and Benchmarks. Proceedings of the International Symposium on Combinatorial Search, 10(1): 151–158.
A.; Moll, M.; and Kavraki, L. The Open Motion Planning Library. IEEE Robotics & Automation Magazine, 19(4): 72–82. https://ompl.kavrakilab.org.
Benchmarks for basic scheduling problems. European Journal of Operational Research, 64(2): 278–285.
Tosello, E.; Valentini, A.; and Micheli, A. 2024. A meta-engine framework for interleaved task and motion planning using topological refinements. In ECAI 2024, 4377–4384.
Tosello, E.; Valentini, A.; and Micheli, A. 2025.
Temporal Task And Motion Planning with Metric Time for Multiple Object Navigation.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 35,
    "total_chunks": 44,
    "char_count": 913,
    "word_count": 122,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "dc00f610-c1b6-4cce-9ad2-e57a631e2e00",
    "text": "Logic-geometric programming: an optimization-based approach to combined task and motion planning. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, 1930–1936.
A.; Dirksmeier, P.; Long, P.; Padir, T.; and Bobadilla, L. 2021.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 36,
    "total_chunks": 44,
    "char_count": 263,
    "word_count": 34,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "be64a91d-f7d8-457d-9c99-7c049421797d",
    "text": "Scheduling and Path-Planning for Operator Oversight of Multiple Robots.
Appendix
This appendix contains the formal proofs of the guarantees (soundness, completeness, and optimality) underlying the
Lemma 3 (Temporal Progression). Suppose a temporal refinement is derived out of a candidate schedule ρ for the SAMP problem ψ with parallel motion group G, resulting in a new SAMP problem ψ′.
Then ρ is not a candidate schedule for ψ′.
proposed framework, providing theoretical support for the properties discussed in the main text.
Soundness
The goal is to prove that if our framework returns a solution, then this solution is guaranteed to be correct, meaning that all scheduling constraints are satisfied, and the motion planner has found valid trajectories for all motion activities.
Theorem 1 (Soundness). Let ψ be a SAMP problem; if SOLVE(ψ, opt, tp, timeout) produces a solution π, then π is valid.
Proof. Let π = ⟨p, s, e, τ⟩ and let ρ = ⟨p, s, e⟩. We need to prove that π is non-conflicting and satisfies the conditions of Definition 9. As noted in the main paper, we model movable objects as unary resources, so since ρ is a solution for the OS problem ϕ (because it is generated by the scheduler on the OS problem itself), it follows that π is non-conflicting.
The temporal refinement added to ψ′ is the constraint:
RCOND(G) → CHTIME(Ḡ)
with Ḡ ⊆ G and CHTIME(Ḡ) equal to
⋁_{a∈Ḡ, o∈O} CHCONF(a, conf(o)) ∨ ⋁_{a∈Ḡ} (a.start − min_{b∈Ḡ}(b.start) < δ_a) ∨ ⋁_{a∈Ḡ | (d_a+δ_a)<(δ̄_a+d̄_a)} ((a.end − min_{b∈Ḡ}(b.start)) ≥ δ̄_a + d̄_a)
To show that ρ is not a candidate schedule for ψ′, it suffices to show that ρ violates this constraint. RCOND(G) is satisfied by ρ as in the previous proof. The formula ⋁_{a∈Ḡ, o∈O} CHCONF(a, conf(o)) is violated because the configuration of each o ∈ O only depends on ρ,
and thus all CHCONF(b, conf(o)) are false for any b and o. The formula ⋁_{a∈Ḡ} (a.start − min_{b∈Ḡ}(b.start) < δ_a) is violated because in ρ, for any a ∈ Ḡ, s(a) − min_{b∈Ḡ}(s(b)) = δ_a by definition. ⋁_{a∈Ḡ | (d_a+δ_a)<(δ̄_a+d̄_a)} ((a.end − min_{b∈Ḡ}(b.start)) ≥ δ̄_a + d̄_a) is violated because each activity a is such that e(a) − min_{b∈Ḡ}(s(b)) = d_a + δ_a by definition. Hence, all disjuncts are trivially false.
Condition 3 of Definition 9 is satisfied by the scheduling problem posted to the scheduler, while conditions 1 and 2 are ensured because Algorithm 1 only returns at line 16, where the motion planner has successfully found a valid trajectory for all parallel motion groups.
Completeness",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 37,
    "total_chunks": 44,
    "char_count": 2431,
    "word_count": 408,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b3f958a9-c279-4df1-8daa-08e86727b0b1",
    "text": "The goal is to prove that our framework returns a solution whenever one exists, assuming the motion planner is complete and produces time-optimal solutions. This proof relies on showing that the learned constraints always cut the current candidate schedule from the solution space of the scheduler, but never cut valid solutions.
Lemma 4 (Geometric pruning soundness). Suppose a geometric refinement is derived out of a candidate schedule ρ for ψ with parallel motion group G and conflicts Σ and Ω, resulting in SAMP ψ′. Any SAMP solution π of ψ is a SAMP
solution for ψ′.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 38,
    "total_chunks": 44,
    "char_count": 577,
    "word_count": 99,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ad3f8902-340e-457e-a692-edacf37bf201",
    "text": "Lemma 2 (Geometric Progression). Suppose a geometric refinement is derived out of a candidate schedule ρ for the SAMP problem ψ with parallel motion group G, introducing conflicts Σ and Ω and yielding a refined SAMP problem ψ′. Then, ρ is no longer a candidate schedule for ψ′.
Proof. The geometric refinement added to ψ′ is the constraint:
RCOND(G) → ⋁_{b∈G, o∈ω_b} CHCONF(b, conf(o))
To show that ρ is not a candidate schedule for ψ′, it suffices to show that ρ violates this constraint. Clearly, RCOND(G)
Proof. Suppose, for the sake of contradiction, that there exists a solution π = ⟨p, s, e, τ⟩ for ψ which is not a solution for ψ′. Let ρ′ be the OS schedule of π. Since the only difference between ψ and ψ′ is the constraint
RCOND(G) → ⋁_{b∈G, o∈ω_b} CHCONF(b, conf(o))
then, π must violate this constraint. To violate this constraint, π must satisfy RCOND(G) and violate ⋁_{b∈G, o∈ω_b} CHCONF(b, conf(o)). Hence, π has the set of activities G, all present and such that any other motion activity is either before the start of the first activity in G or after the last end.
This is because of the first and third conjuncts of RCOND(G).
Note that conf(o) in ρ must be equal to conf(o) in ρ′ for all b ∈ G and o ∈ ω_b. Moreover, conf(o_a) = conf′(o_a) for all a ∈ G, because of the second conjunct of RCOND(G): the first motion action for any moved object is kept, therefore every movable object moved by G is initially in the same configuration. Thus, all relevant obstacles (we assume that the motion planner is complete, therefore all obstacles that can be encountered by any movable object are returned) and movable objects are in the same configurations in ρ and ρ′.
is satisfied by ρ: all activities in G exist in ρ, the order of activities is the same (as it only depends on ρ) and rivals are either before or after the activities in G, otherwise they would have been part of G in the first place. Instead, the formula ⋁_{b∈G, o∈ω_b} CHCONF(b, conf(o)) is violated because the configuration of every o ∈ O only depends on ρ, and thus all CHCONF(b, conf(o)) are false for any b and any o.
Theorem 6 (Completeness). Let ψ be a SAMP problem admitting at least one solution.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 39,
    "total_chunks": 44,
    "char_count": 2155,
    "word_count": 386,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5c48a73f-a9fe-4cfc-b237-8aa48b9bdd03",
    "text": "SOLVE(ψ, opt, tp, ∞) eventually returns a solution for ψ assuming a motion planner which is complete and optimal.
Now, consider the paths for motion activities in G that must exist for π to be a solution. Observe that, changing the order of the activities (after the first one), moving o_a can
Proof.
The theorem follows from four observations. First, no candidate schedule is evaluated twice by the algorithm, because of Lemmas 2 and 3. Second, no solution is cut from the solution space, because of Lemmas 4 and 5. Third, the set of candidate plans for a SAMP problem is finite (even though we formalized time over the natural numbers, there is an obvious time horizon defined by the sum of all the maximal durations of every activity). Fourth, assuming the motion planner is complete and optimal, eventually we will arrive at a time bound tp sufficient to construct the trajectories for the solution. Therefore, a solution is eventually found if it exists. Finally, note that the approach is not a decision procedure: if no solution exists for the SAMP problem, the algorithm might diverge.
not change the motion planner verdict on the feasibility of the combined motion of the activities in G. This is because in geometric refinements we are not considering the timings, and the problem constraints require the sequence of activities moving o_a to be such that the ending configuration of one activity is the initial configuration of the following one. The only thing that matters for the path existence is whether the paths are geometrically realizable, and this is unaffected by changing the order of waypoints to be reached by one movable object (but without constraining the order of waypoints between different movable objects). However, we know that the motion planner instantiated on the candidate schedule ρ deemed the problem unsolvable. This leads to a contradiction, as π cannot be a solution for ψ under these conditions.
Lemma 5 (Temporal pruning soundness). Suppose a temporal refinement is derived out of a candidate schedule ρ for ψ with parallel motion group G, resulting in a new SAMP ψ′. Any SAMP solution π of ψ is a SAMP solution for ψ′.
Optimality
The goal is to prove that our framework is relatively optimal for makespan optimization.
Theorem 7 (Relative Optimality). Assuming opt is the function aiming to minimize the makespan of the schedule, a so\nProof.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 40,
    "total_chunks": 44,
    "char_count": 2376,
    "word_count": 391,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e74bc60d-4f8e-46a7-8366-0a8098d1681f",
    "text": "lution π returned by SOLVE(ψ, opt, tp, timeout) is optimal, assuming the motion planner is complete and optimal w.r.t. the duration of motions.\nWe proceed as in the previous lemma. Suppose, for the sake of contradiction, that there exists a solution π = ⟨p, s, e, τ⟩ for ψ which is not a solution for ψ′. Let ρ′ be the OS schedule of π.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 41,
    "total_chunks": 44,
    "char_count": 333,
    "word_count": 61,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3acbcac0-e1cc-4a00-a540-0491f7b83a03",
    "text": "Since the only difference between ψ and ψ′ is the constraint\nProof. For the sake of contradiction, assume there exists a solution π′ with a makespan smaller than π. 
Since we assume that the scheduler is optimal, complete and correct, the candidate schedule ρ′ of π′ would be encountered before the candidate schedule ρ of π. Because of Lemmas 4 and 5 we know that no valid solution is discarded, therefore π′ would be returned instead of π, leading to the contradiction.\nRCOND(G) →CHTIME(G)\nwith G ⊆G and CHTIME(G) equals to\nW a∈G,o∈O CHCONF(a, conf(o)) ∨ W a∈G (a.start −min b∈G(b.start)) < δa ∨ W a∈G|(da+δa)<(δa+da) (a.end −min b∈G(b.start)) ≥δa + da\nthen, π must violate this constraint. To violate this constraint, π must satisfy RCOND(G) and violate CHTIME(G). As before, π has the set of activities G, all present and such that all other motion activity is either before the start of the first activity in G or after the last end.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 42,
    "total_chunks": 44,
    "char_count": 934,
    "word_count": 163,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2860e1e0-5778-4d47-8b48-6e486010a0eb",
    "text": "Moreover, conf(o) in ρ must be equal to conf(o) in ρ′ for all o ∈O. Additionally, for all a ∈G it must hold s(a) −min b∈G(s(b)) ≥δa and e(a) −min b∈G(s(b)) < δa + da. But then consider the trajectories τ(a) for all a ∈G, these trajectories are such that each movable object oa is stationary in the interval [min b∈G(s(b)), s(a)] and the movement associated with a ends before e(a). Therefore, τ witnesses a solution for a motion planning problem that is at least as constrained (from the temporal point of view) as the one derived from ρ, which is strictly faster than the one used to generate the temporal refinement. 
But we assumed the motion planner was optimal and complete, hence the contradiction.",
    "paper_id": "2603.10651",
    "title": "Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions",
    "authors": [
      "Elisa Tosello",
      "Arthur Bit-Monnot",
      "Davide Lusuardi",
      "Alessandro Valentini",
      "Andrea Micheli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10651v1",
    "chunk_index": 43,
    "total_chunks": 44,
    "char_count": 701,
    "word_count": 123,
    "chunking_strategy": "semantic"
  }
]
\ No newline at end of file
diff --git a/data/chunks/2603.10652_semantic.json b/data/chunks/2603.10652_semantic.json
new file mode 100644
index 0000000000000000000000000000000000000000..bcf0fa7a4f4cd6589759915c648de55b66a19da1
--- /dev/null
+++ b/data/chunks/2603.10652_semantic.json
@@ -0,0 +1,1769 @@
[
  {
    "chunk_id": "b793f4ea-1549-421e-982a-0767392a1cce",
    "text": "Yangfan He Changgyu Boo\nNTU Singapore Korea University\nyhe873232@gmail.com 2019150348@korea.ac.kr\nJaehong Yoon∗\nNTU Singapore\njaehong.yoon@ntu.edu.sg\nAbstract\nIn real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. 
We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer up to 35% and 28% drops in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.",
    "paper_id": "2603.10652",
    "title": "Are Video Reasoning Models Ready to Go Outside?",
    "authors": [
      "Yangfan He",
      "Changgyu Boo",
      "Jaehong Yoon"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
    "chunk_index": 1,
    "total_chunks": 93,
    "char_count": 1637,
    "word_count": 199,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d30ed974-c19b-4ea4-aeee-66273e1df3db",
    "text": "Project Page: https://robust-video-reason.github.io/\n1 Introduction\nVision-language models (VLMs) [Zhang et al., 2023, Maaz et al., 2024, Shu et al., 2025, Yuan et al., 2025, Li et al., 2025, Yu et al., 2025, Clark et al., 2026] have rapidly advanced video understanding and reasoning, allowing systems to interpret complex scenes and perform temporally grounded inference. These capabilities support many real-world applications, yet a key question remains: are current VLMs robust enough to operate reliably beyond clean, controlled conditions? 
In practice, these models frequently face challenging video streams, corrupted by adverse weather (e.g., rain, fog, snow), dynamic occlusions (e.g., pedestrians, vehicles, vegetation), abrupt illumination changes (e.g., glare, shadows, low light), and camera motion induced by vibration or viewpoint shifts. Such perturbations are common in the real world, yet they severely degrade these models' perception and lead to brittle or unreliable reasoning (Fig. 1). For instance, under conditions such as video occlusion or adverse weather, baseline models may incorrectly output \"Turn Left\" or \"Turn Right\" rather\n∗Corresponding author\nGuess the driving direction and movement trajectory? What is the inferred final driving decision?",
    "paper_id": "2603.10652",
    "title": "Are Video Reasoning Models Ready to Go Outside?",
    "authors": [
      "Yangfan He",
      "Changgyu Boo",
      "Jaehong Yoon"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
    "chunk_index": 2,
    "total_chunks": 93,
    "char_count": 1289,
    "word_count": 178,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "32c6463b-ce30-4878-a908-2a9fba175e1f",
    "text": "[Figure 1 panels: frames 1-16 of two driving videos, left under occlusion, right under fog.]\nReasoning Process (left): Heavy rain made it hard to see and the wipers were blocking my view. I formulated a driving strategy in accordance with current traffic rules and signals, determining the vehicle should execute a left-turn path.\nReasoning Process (right): Fog severely reduced visibility and obscured lane markings, with the vehicle gradually drifting to the right. Based on this, I formulated a driving strategy in accordance with the traffic rules, determining the vehicle should execute a right-turn path. 
Pred Action (left): Turn Left; GT Action: Go Ahead. Pred Action (right): Turn Right; GT Action: Go Ahead.\nFigure 1: Failure cases of Qwen2.5-VL under two representative perturbations: (a) occlusion (left) and (b) adverse weather (right). The model incorrectly predicts \"Turn Left\" under occlusion and \"Turn Right\" under fog, despite the ground-truth being \"Go Ahead\" in both cases, demonstrating how realistic perturbations mislead reasoning and motivating the need for robustness-aware training. than the ground-truth \"Going Ahead.\" This gap between benchmark assumptions and real-world conditions highlights the need for training frameworks that promote reliable generalization under realistic variability and uncertainty. A few prior studies [Mao et al., 2022, Zhou et al., 2024, Zhang et al., 2024] have explored improving the robustness of VLMs through generic data augmentation, random frame masking, zero-shot, or adversarial training. However, these methods typically treat robustness as a single objective, overlooking that different perturbations induce distinct failure modes. Consequently, they struggle to address structured, semantically meaningful corruptions common in real-world environments, since perturbation-specific failure behaviors are not explicitly modeled. To address this challenge, we propose RObust Video Alignment (ROVA), a novel training approach for robust vision reasoning under realistic visual disturbances. We first apply corruption-based augmentation to generate perturbed videos. 
ROVA then measures divergence in reasoning coherence\nand answer quality between clean and corrupted videos as a proxy for corruption-induced difficulty.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 3, + "total_chunks": 93, + "char_count": 2364, + "word_count": 337, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "73141a2c-5509-4078-b5ab-2d17a70b803e", + "text": "Moderately difficult instances are used for training, while overly easy samples are discarded and excessively difficult ones are stored in a temporal memory buffer for later revisiting. Unlike curriculum\nlearning, which follows a fixed, easy-to-hard schedule, this self-reflective evaluation estimates the\ndifficulty and informativeness of each video–query instance based on the model's current capability,\nenabling an adaptive curriculum that prioritizes informative samples while deferring overly difficult\nones through memory replay.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 4, + "total_chunks": 93, + "char_count": 536, + "word_count": 70, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ed877400-8d0d-419e-96c5-7e882b25d30b", + "text": "Next, we introduce a dual-branch alignment objective that enforces\noutput consistency between paired clean and perturbed inputs. This robustness-aware consistency\nalignment is guided by reward modeling over reasoning and answer consistency, and optimized using\ngroup relative policy optimization. 
Specifically, we enforce output consistency between paired clean and perturbed video inputs through reward-guided optimization that evaluates both reasoning and answer consistency, trained via group relative policy optimization [Shao et al., 2024]. We further introduce the Perturbed Video Reasoning Benchmark (PVRBench) for evaluating the robustness of video reasoning under diverse realistic perturbations. Unlike existing benchmarks, including VisBench [Yang et al., 2025a] and UrbanVideo [Zhao et al., 2025a], which primarily evaluate models on curated environments, PVRBench systematically injects perturbations from 12 corruption styles associated with lighting, camera motion, occlusion, and weather (Tab. 1), across 27 scene categories. Notably, all perturbations are spatially aware and temporally coherent, capturing realistic video disturbances. We observe that performant proprietary models (GPT-4o [Hurst et al., 2024] / Gemini-3-Pro [Team et al., 2023]) suffer 11–17% and 10–14% drops in accuracy and reasoning, and open-source models degrade by up to 35% and 26%, respectively, highlighting robustness gaps in VLMs under realistic conditions. ROVA consistently outperforms proprietary and open-source models on PVRBench, UrbanVideo, and VisBench across all perturbation types in both answer accuracy and reasoning quality. 
Specifically, ROVA surpasses the strongest open-source baselines of comparable size, Embodied-R, by 17%, while larger variants (13B/72B) match or exceed leading proprietary models such as Gemini-3-Pro and GPT-",
    "paper_id": "2603.10652",
    "title": "Are Video Reasoning Models Ready to Go Outside?",
    "authors": [
      "Yangfan He",
      "Changgyu Boo",
      "Jaehong Yoon"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
    "chunk_index": 5,
    "total_chunks": 93,
    "char_count": 1841,
    "word_count": 242,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "848c7477-2356-4b87-a603-058af1ccdbbc",
    "text": "Table 1: Comparison of PVRBench with existing video understanding benchmarks. #Types counts perturbation subtypes. #Cat. counts scene or class categories. Synthetic, Spatial, and Temporal indicate artificially generated, spatially grounded, and temporally consistent perturbations, respectively. PVRBench covers 27 tasks spanning indoor, outdoor, and embodied AI scenarios. ‡: An image-level benchmark for reference.\nBenchmark | #Videos | #QAs | Synthetic | Real | Spatial | Temporal | #Types | Ind. | Out. | Emb. | #Cat.\nImageNet-C‡ [Xie et al., 2020] | 50K | 50K | ✓ | ✗ | ✗ | ✗ | 19 | ✓ | ✓ | ✗ | 1K\nMVBench [Li et al., 2024] | 4K | 4K | ✗ | ✗ | ✗ | ✗ | 0 | ✓ | ✓ | ✗ | 20\nVideo-MME [Fu et al., 2025] | 900 | 2.7K | ✗ | ✗ | ✗ | ✗ | 0 | ✓ | ✓ | ✗ | 30\nALFRED [Shridhar et al., 2020] | 8K | 25K | ✗ | ✗ | ✗ | ✗ | 0 | ✓ | ✗ | ✓ | 7\nEgo4D [Grauman et al., 2022] | 3.7K | 3.8M | ✗ | ✗ | ✗ | ✗ | 0 | ✓ | ✓ | ✓ | 5\nVisBench [Yang et al., 2025a] | 500 | 3K | ✗ | ✗ | ✗ | ✗ | 0 | ✓ | ✗ | ✓ | 11\nUrbanVideo [Zhao et al., 2025a] | 1.5K | 6K | ✗ | ✗ | ✗ | ✗ | 0 | ✗ | ✓ | ✓ | 16\nPVRBench (Ours) | 9K | 52K | ✓ | ✗ | ✓ | ✓ | 12 | ✓ | ✓ | ✓ | 27\nNotably, these improvements extend to clean videos, demonstrating enhanced generalizability and stronger performance on clean data. 
Furthermore, ROVA achieves higher reasoning quality, with\nimproved consistency and belief scores, reflecting more stable, confident reasoning under visual\ncorruption.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 6, + "total_chunks": 93, + "char_count": 1264, + "word_count": 226, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4155de7c-dd62-4641-a4f7-5f53dfb8f339", + "text": "Robust Training for Multimodal Models. Several works [Mao et al., 2022, Zhao et al., 2023, Sheng\net al., 2025, Oh et al., 2025, Agarwal et al., 2025, Schiappa et al., 2022] have explored robustness to\ndistribution shifts and adversarial inputs through data augmentation [Duan et al., 2023], test-time\nadaptation [Zhao et al., 2024], and transfer-based strategies [Tong et al., 2025, Cai et al., 2024]. However, these approaches primarily address generic perturbations or optimization efficiency, rather\nthan the structured, semantically grounded disturbances encountered in real-world video settings. In\nvideo reasoning, recent methods [Zhou et al., 2025, Wang et al., 2025a, Chen et al., 2025a, Wang\net al., 2025b] improve efficiency via adaptive frame sampling or data filtering, but they do not\nexplicitly model realistic corruption patterns [Zeng et al., 2024, Yang et al., 2025b] that alter scene\nvisibility and temporal coherence. As a result, robustness is treated as incidental resilience rather\nthan being explicitly modeled during optimization. In contrast, ROVA incorporates structured and\nsemantically grounded perturbations that reflect realistic environmental disturbances. 
The proposed\narchitecture and training objectives enforce representation consistency between clean and perturbed\nvideos, progressively strengthening disturbance-aware reasoning. Robust Video Reasoning in Real-World Environments. Recent advances in video–language\nmodels [Zhang et al., 2023, Nguyen et al., 2024, Yuan et al., 2025, Yu et al., 2025, Clark et al., 2026]\nhave substantially improved temporal reasoning and long-horizon embodied planning [Chen et al.,\n2025b, Azzolini et al., 2025, Zhang et al., 2025, Zhao et al., 2025b, Yu et al., 2026, Yeo et al., 2026]. However, most existing benchmarks evaluate models under nearly clean visual conditions [Maaz\net al., 2024], implicitly assuming stable lighting, unobstructed views, and smooth camera movement. Although robustness is sometimes measured via synthetic textual perturbations [Wu et al., 2025],\nsuch evaluations do not capture structured, semantically grounded visual disturbances encountered in\nreal-world environments. Consequently, no standardized benchmark systematically integrates realistic\ndisturbances into embodied video reasoning, leaving a gap between benchmarks and deployment\nconditions. In contrast, we introduce PVRBench that integrates semantically meaningful perturbations\ninto temporally coherent reasoning tasks. Rather than treating corruption as incidental noise, we ask\nmodels to reliably reason about scene content, even in the presence of disturbances. 3 Training Robust Video Reasoning Models with ROVA", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 7, + "total_chunks": 93, + "char_count": 2680, + "word_count": 369, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a8a9ca75-23a1-4aab-8b04-89578d6073bd", + "text": "As illustrated in Fig. 
2, ROVA, a novel training approach for robust video reasoning under real-world perturbations, comprises three stages: we first generate corruption-augmented video-query pairs via dynamic, physically plausible perturbations (Sec. 3.1). Next, a difficulty-aware curriculum performs self-reflective evaluation to selectively curate informative training samples conditioned on the model's evolving capability (Sec. 3.2). Finally, dual-branch alignment enforces consistency between clean and perturbed videos via reasoning-aware rewards and group relative policy optimization (GRPO) (Sec. 3.3).\n[Figure 2 diagram, panel labels: Structured Spatio-Temporal Corruption (query, spatial mask & temporal shuffle, clean vs. corrupted video); Self-Reflective Difficulty-Aware Training (difficulty assessment, discard easy samples, store difficult samples in a memory buffer, periodic self-reflective re-evaluation, evict samples after too many re-evals); Dual-Branch Alignment (clean reference output \"I can clearly see the road ahead, and my lane is clear\", corrupted aligned output \"I can't see the road clearly, I'm not sure what's ahead. I'll slow down and prepare to stop to avoid a potential obstacle.\", robustness-aware consistency reward, format reward (check tags), accuracy reward (vs ground truth), total reward, GRPO policy update with shared reasoning weights).]\nFigure 2: Overview of ROVA: (1) structured spatio-temporal corruption that generates realistic perturbations, (2) self-reflective evaluation with difficulty-aware online training that adaptively prioritizes informative samples, and (3) dual-branch alignment reward modeling that enforces output consistency between clean and perturbed inputs. 
3.1 Learning with Structured Spatio-Temporal Corruption We first design a structured spatio-temporal corruption pipeline that models four realistic disturbances,\nincluding weather, lighting, occlusion, and camera motion, using style-specific, cross-frame coherent\nmasks for spatial perturbations and temporal shuffling to disrupt temporal order. Unlike generic\naugmentations that apply independent pixel or frame perturbations (e.g., random masking, color\njittering) [Xie et al., 2020], we explicitly model perturbation styles with spatial grounding and\ntemporal coherence, yielding structured spatio-temporal disturbances. Each video is then paired\nwith its corrupted counterpart in a dual-branch alignment framework to optimize output consistency. Through this design, the model learns perturbation-invariant representations for robust real-world\ngeneralization.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 8, + "total_chunks": 93, + "char_count": 2655, + "word_count": 339, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5f07c066-d171-4212-98ee-e4eb60b85d67", + "text": "Let a video sequence be denoted as V = {f1, f2, . . . , fT }, where ft ∈RH×W ×C denotes the t-th\nframe of height H, width W, and C channels. To disrupt temporal coherence, we randomly permute the frame sequence. A permutation π : {1, . . . , T} →{1, . . . , T} is sampled uniformly at random, and the temporally\nshuffled video is defined as\nVtemp = {fπ(1), fπ(2), . . . , fπ(T )}, (1)\nwhich completely scrambles temporal order while preserving all frame content. Rather than coarse block-wise masking that risks removing critical cues, we apply fine-grained masks across four perturbation styles m ∈ P =\n{weather, lighting, camera, occlusion}. 
For each frame ft, the mask Pt(m) = Bt(m) ⊙Ct(m) fuses a binary map Bt(m) ∈{0, 1}^(H×W), where 1/0 denotes corrupted/clean pixels, with layouts driven by depth awareness or stochastic sampling, and a continuous modulation map Ct(m) ∈[0, 1]^(H×W) encoding per-pixel effect intensity (e.g., rain strength, shadow depth, blur kernel; see Sec. B.2.) The corrupted frame is computed as ft^masked = ft ⊙Pt(m), where ⊙denotes element-wise multiplication. Spatio-Temporal Corruption. For each video, a perturbation style m ∈P is uniformly sampled to generate the corrupted frame sequence:\nV′ = { fπ(t) ⊙Pt(m) }_{t=1}^{T}, (2)\nwhere Pt(m) denotes the smooth, style-specific mask associated with style m. By jointly introducing temporal order disruption and spatially realistic, continuous masking, our approach promotes perturbation-invariant representation learning while preserving essential visual semantics. 3.2 Self-Reflective Difficulty-Aware Training Introducing structured visual corruptions exposes the model to a broader spectrum of reasoning difficulty than training on clean videos alone. While clean inputs typically lie within a narrow difficulty range, corrupted versions vary widely in severity, expanding the diversity of learning signals during training. Crucially, training is most effective on samples that are neither too easy nor excessively difficult [Wang et al., 2025b] under the model's current capacity, as these instances provide the most informative learning signals and support stable optimization. 
Rather than uniformly sampling across the expanded difficulty range, we therefore prioritize appropriately challenging examples through a self-reflective, difficulty-aware strategy that implicitly forms an online curriculum.",
    "paper_id": "2603.10652",
    "title": "Are Video Reasoning Models Ready to Go Outside?",
    "authors": [
      "Yangfan He",
      "Changgyu Boo",
      "Jaehong Yoon"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
    "chunk_index": 9,
    "total_chunks": 93,
    "char_count": 2382,
    "word_count": 363,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c75a774f-451f-4314-b059-ad7c42473695",
    "text": "By continuously focusing on corrupted samples that provide meaningful learning signals, the model progressively develops robust and reliable reasoning under realistic visual disturbances. To this end, we propose a self-reflective, difficulty-aware training pipeline that implicitly builds an adaptive curriculum in an online manner. Formally, let Fθ denote a learnable VLM parameterized by θ. We assume that training video–text pairs arrive sequentially, and let θi denote the model parameters at training iteration i. At each iteration, ROVA performs two internal steps: 1) self-reflective evaluation, where Fθi estimates the usefulness of incoming samples for training under its current state; and 2) difficulty-aware selective training, where model updates are performed using only a subset of samples selected according to the proposed policy. Self-Reflective Evaluation. At iteration i, the model F evaluates each masked video Vi′ and produces a difficulty label d ∈{easy, difficult, informative} and a confidence score c ∈[0, 1], defined as,\nd, c = Fθi(qi, Vi′, Se), (3)\nwhere qi denotes the input query and Se denotes the evaluation prompt (See Fig. 10). 
Specifically, d is obtained by prompting Fθi with Se to compare its responses on clean and corrupted inputs: if the model answers correctly and consistently, the sample is labeled d = easy; if responses diverge substantially or are incorrect, it is labeled d = difficult; otherwise, the sample is labeled d = informative, indicating moderate uncertainty that is most beneficial for training. The confidence score c is derived from the model's output token probabilities. Unlike traditional curriculum learning with a fixed schedule, our prompt-based sample-level evaluation dynamically estimates the model's current capability and prioritizes informative samples to stabilize the effective training distribution. Based on d and c, we design the following data selection policy: (i) high-confidence easy samples (d = easy, c > τ, where τ is a confidence threshold) are considered as sufficiently learned and filtered out, enabling the model to prioritize disturbance-sensitive samples that provide strong learning signals. (ii) difficult samples (d = difficult) are stored in a temporal memory buffer M for deferred training and periodically re-evaluated. While potentially informative, they may yield weak or unstable learning signals under the current model state, and are revisited once the model has sufficiently improved. (iii) informative samples (d = informative) as well as low-confidence easy samples (d = easy, c ≤ τ) are treated as high-information instances and prioritized for immediate training. 
Difficulty Re-evaluation and Deferred Training with Memory.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 10, + "total_chunks": 93, + "char_count": 2730, + "word_count": 405, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "64fd8cc9-f5fd-43dc-bb88-fbf68a057c34", + "text": "As the model improves over time,\nsamples that were previously too difficult to learn from may later provide meaningful training signals. To leverage this evolving capability, we introduce a memory-based deferred training mechanism that\nperiodically re-evaluates difficult instances. Formally, when newly arriving data are evaluated as\ndifficult, it is stored in a temporal memory buffer M as:", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 11, + "total_chunks": 93, + "char_count": 392, + "word_count": 57, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9f6fa7df-fba9-4dd5-b3c8-31448bd3af5e", + "text": "M ←M ∪{(q, ˜V , k = 0)}, (4) where ˜V encodes the mask metadata, including perturbation style, parameters, and spatial-temporal\nregions. This design allows the corrupted video V ′ to be regenerated on demand during re-evaluation,\navoiding the need to store full video data. During training, instances in M are periodically reevaluated under the updated model. The counter k tracks the number of re-assessments performed\nfor each sample. For each entry (qn, ˜Vn, kn) ∈M, the current model F periodically re-assesses its difficulty using the current parameter θi:\nd′, c′ = F(qn, ˜Vn, Se; θi), kn ←kn + 1. 
(5)\nHere, d′ and c′ denote the updated difficulty level and confidence score, respectively.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 12, + "total_chunks": 93, + "char_count": 694, + "word_count": 114, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8de43f3a-0192-4594-9479-ab0f27fd3614", + "text": "Entries\nreclassified as informative are immediately used for training, whereas those labeled easy are removed\nfrom the memory buffer. Entries that remain difficult are retained in M with their re-evaluation\ncounter incremented. The confidence score c′ serves as an auxiliary diagnostic signal for self-monitoring and stability analysis, but is not used directly for memory retention decisions to avoid sensitivity to noisy confidence\nestimates.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 13, + "total_chunks": 93, + "char_count": 444, + "word_count": 63, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e8c2c5ce-0615-4ada-8c5d-1f34db38f291", + "text": "As training progresses, samples that were previously difficult may transition to informative\nor easy categories, allowing the curriculum to adapt to the model's evolving capability. However,\nrepeated re-evaluation can lead to unbounded memory growth, particularly when samples remain\npersistently difficult or heavily corrupted, yielding little effective learning signal. To prevent this, we\nimpose a maximum re-evaluation threshold and evict entries exceeding it:\nM ←M \\ {(q, ˜V , k) | k > Kmax}. 
(6)\nOverall, the proposed self-reflective, difficulty-aware training framework establishes a closed-loop\nmechanism that dynamically adjusts the training data distribution to the model's evolving capability. By prioritizing samples based on estimated difficulty and confidence, the framework selects instances\nthat yield effective learning signals under corrupted conditions while filtering low-utility ones. Although periodic re-evaluation incurs modest computational overhead, this cost is negligible relative\nto the high per-sample cost of reinforcement learning on videos. In addition, selectively discarding\nuninformative instances leads to substantial gains in training efficiency (See Tab. 3).", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 14, + "total_chunks": 93, + "char_count": 1197, + "word_count": 160, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1c172e77-ee30-48d6-b545-fdacd365a3bb", + "text": "3.3 Dual-Branch Alignment Optimization ROVA trains the model through a dual-branch alignment mechanism that aligns representations from\nclean and partially perturbed video inputs. The training objective enforces consistency between two\nbranches using the proposed reward modeling combined with GRPO [Shao et al., 2024]. Here, the\nclean video branch serves as a fixed anchor with gradients detached, while the perturbed branch is\noptimized to align its outputs with those of the clean branch. Given a group of G paired samples, the\nclean branch produces reference outputs {oj}Gj=1 and the perturbed branch generates aligned outputs\n{˜oj}Gj=1. Each pair (oj, ˜oj) corresponds to the same video query under clean and perturbed visual\nconditions. 
We treat Fθ as a policy that generates reasoning outputs conditioned on video inputs:\nJ(θ) = E_{(q,V)∼D, {oj}Gj=1∼Fθold(O|q,V)} [ (1/G) ΣGj=1 min( rj Aj, clip(rj, 1 −ϵ, 1 + ϵ) Aj ) −βDKL(Fθ∥Fref) ], (7)\nwhere rj = Fθ(oj|q)/Fθold(oj|q), ϵ and β are hyperparameters, and DKL(Fθ∥Fref) denotes the\nKL-divergence penalty term. The advantage Aj corresponding to output oj is calculated from the\nassociated reward set {r1, r2, . . . , rG}:\nAj = (rj −mean{r1, r2, . . . , rG}) / std{r1, r2, . . . , rG}. (8)\nFormat Reward. The model is required to generate an output oj consisting of an embodied\nreasoning process pj followed by a final answer aj, enclosed within dedicated reasoning and\nanswer tags, respectively. Compliance with this format is verified via a regular\nexpression, producing the format reward rFj:\nrFj = 1, if the format is correct; 0, if the format is incorrect. (9)\nThe accuracy reward rAccj evaluates whether the extracted answer oj is semantically consistent with the ground truth g. Multiple-choice questions typically have a unique and\nprecise answer that can be directly compared once the response follows the required format.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 15, + "total_chunks": 93, + "char_count": 1876, + "word_count": 304, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "841c4119-0bb2-4042-b65c-93caf7ca7cf7", + "text": "rAccj = 1, if oj = g; 0, if oj ̸= g. (10)\nFor each output pair (oj, ˜oj), the alignment reward is decomposed into reasoning\nand answer components: rAj = rjalign,r + rjalign,a, where rjalign,r = αr · Simr(oj, ˜oj) and rjalign,a =\nαa · Sima(oj, ˜oj). Here, αr and αa weight the respective contributions, with Simr and Sima\nmeasuring semantic consistency in the reasoning process and answer segment (see Figs.
8 and 9). The\ntotal reward combines alignment with three rewards: Rj = rFj + rAccj + rAj . With the proposed dual-branch alignment framework, the model is optimized via GRPO using a\ncombined reward signal with robustness-aware consistency, encouraging stable reasoning and answer\npredictions across clean and perturbed video inputs, thereby improving robustness and generalization. 4 Evaluating Video Reasoning under Various Realistic Disturbances Existing video reasoning benchmarks, including MVBench [Li et al., 2024], VideoMME [Fu et al., 2025], ALFRED [Shridhar et al., 2020], Ego4D [Grauman et al., 2022], and\nUrbanVideo [Zhao et al., 2025a], evaluate models primarily under clean visual conditions (Tab. 1). In\ncontrast, real-world deployment often exposes VLMs to adverse weather, dynamic occlusions, abrupt\nillumination changes, and camera instability. As shown in Tab. 1, such perturbations can degrade both\naccuracy and reasoning quality by 12 to 35%. Although ImageNet-C [Xie et al., 2020] introduced the\nevaluation of corruption robustness for image classification, no existing benchmark systematically\nmeasures how temporally coherent and spatially grounded visual perturbations affect reasoning over\nvideos. This leaves a critical blind spot: we lack the tools to diagnose whether failures under visual\ncorruption arise from perceptual errors, reasoning fragility, or both. To close this gap, we introduce Perturbed Video Reasoning Benchmark (PVRBench), designed to evaluate the robustness of video reasoning models under structured, real-world visual variations beyond simple pixel-level corruption. Our focus\nis on reasoning reliability, defined as the ability to\nmaintain coherent and logically consistent inference\nchains grounded in correct visual observations and\nvalid causal steps despite degraded video input. 
PVRBench integrates four categories of realistic, videospecific disturbances: lighting (dusk, night, overexposure, shadow), camera motion (translation, zoom,\nrotation), occlusion (static, dynamic), and weather\n(fog, rain, snow). Each disturbance is applied with\nspatial awareness (e.g., depth-conditioned occlusion\nplacement and scene-adapted weather rendering) and Figure 3: Overview of the perturbation types\ntemporal coherence across frames.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 16, + "total_chunks": 93, + "char_count": 2684, + "word_count": 385, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "db2dbc70-c291-4e86-881e-b95ad4e875be", + "text": "The benchmark in PVRBench.\ncomprises over 9K videos and 51K question-answer\npairs spanning diverse indoor, outdoor, and embodied scenarios, with 27 task coverage from Zhao\net al. [2025a], Yang et al. [2025a], which exercise a broad spectrum of video reasoning capabilities. Perturbation Injection. At its core, we generate video-specific masks (Equation (2)) that contain\nsemantically coherent perturbations conditioned on each video's content, including depth layout,\nobject locations, and motion patterns. These perturbations are contextually adapted to scene semantics;\nfor instance, weather appears as windshield rain refraction in driving scenes, while occlusions are\nplaced at plausible foreground locations. For benchmark evaluation, we adopt a static protocol in\nwhich masks are pre-generated and fixed per video to ensure reproducible cross-model comparison,\nwhile ROVA training (Sec. 
3) uses a dynamic protocol that generates perturbations on the fly with\nstochastically sampled parameters at each iteration to prevent overfitting and promote perturbation\ninvariant representation learning. To quantify reasoning reliability, PVRBench introduces five complementary\nmetrics (Fragility, Consistency, Belief, Recovery, and Attention; see Tab. 2) that assess the quality\nand stability of intermediate reasoning, as well as final-answer accuracy. To assess reasoning process\nquality, we leverage a powerful vision-language foundational model (e.g., GPT-4o) to score reasoning\ntraces in coherence, perturbation awareness, and evidence grounding via a structured template\n(see Fig. 9), following the LLM-as-judge paradigm [Zheng et al., 2023, He et al., 2024]. Table 2: Evaluation on PVRBench. We report accuracy under four visual perturbations (Lighting,\nOcclusion, camera Shake, Weather) on the left, and reasoning quality metrics on the right, including\nFragility, Consistency, Belief, Recovery, and Attention (0 - 5 scale; Higher is better, except for Fra\n(↓)). #Fr: the number of frames, Avg.: the average performance, and Orig.: the average performance\non clean (unperturbed) data. We exclude Fra. when computing Avg.† and Orig.†. Answer Accuracy Reasoning Quality\nModel Size #Fr Lig. Proprietary Models\nGPT-4o – 32 .54 .47 .50 .52 .51 ↓14% .59 1.85 3.42 3.55 3.38 3.21 3.39 ↓11% 3.82\nGemini-3-Pro – 32 .57 .52 .54 .55 .55 ↓11% .62 1.72 3.61 3.48 3.58 3.41 3.52 ↓10% 3.91\nClaude-3.5-Son. 
– 32 .45 .41 .44 .45 .44 ↓17% .53 2.08 3.18 3.22 2.95 3.15 3.13 ↓14% 3.65\nVideo Reasoning Models\nVideo-R1 7B 32 .43 .37 .42 .41 .41 ↓20% .51 2.48 2.75 2.85 2.68 2.65 2.73 ↓20% 3.42\nVideo-R1 72B 32 .51 .45 .49 .49 .49 ↓16% .58 2.11 3.25 3.18 3.21 2.98 3.16 ↓14% 3.68\nVideoChat-R 7B 16 .36 .31 .36 .35 .35 ↓22% .45 2.65 2.62 2.55 2.71 2.28 2.54 ↓22% 3.25\nLLaVA-Video-R 7B 32 .40 .34 .38 .38 .38 ↓21% .48 2.58 2.68 2.61 2.78 2.42 2.62 ↓21% 3.32\nEmbodied-R 7B 32 .45 .38 .42 .43 .42 ↓22% .54 2.45 2.82 2.91 2.72 2.68 2.78 ↓19% 3.45\n+ ROVA (Ours) 7B 32 .52 .46 .49 .51 .50 ↓9% .55 2.25 3.15 3.18 3.22 2.91 3.12 ↓13% 3.58\nOpen-Source Video LLMs\nLLaVA-Video 7B 32 .32 .29 .30 .32 .31 ↓30% .44 2.78 2.45 2.35 2.52 2.25 2.39 ↓23% 3.12\nVideoLLaMA2 7B 16 .28 .25 .27 .29 .27 ↓25% .36 2.92 2.18 2.25 2.12 2.15 2.18 ↓28% 3.01\nVideoChat2 7B 16 .26 .23 .25 .27 .25 ↓26% .34 3.01 2.08 2.15 2.05 2.02 2.08 ↓28% 2.88\nMiniCPM-V 2.6 8B 64 .34 .28 .31 .32 .31 ↓28% .43 2.75 2.48 2.42 2.55 2.21 2.42 ↓24% 3.18\nInternVL2.5 8B 32 .31 .26 .32 .33 .31 ↓33% .46 2.85 2.38 2.28 2.42 2.18 2.32 ↓26% 3.15\n+ ROVA (Ours) 8B 32 .43 .36 .41 .40 .40 ↓15% .47 2.45 2.82 2.75 2.78 2.58 2.73 ↓17% 3.28\nQwen2.5-VL 7B 32 .35 .28 .34 .34 .33 ↓35% .51 2.71 2.58 2.62 2.68 2.31 2.55 ↓25% 3.41\n+ ROVA (Ours) 7B 32 .48 .43 .47 .49 .47 ↓11% .53 2.31 3.05 3.08 2.98 2.85 2.99 ↓15% 3.52\nQwen2.5-VL 72B 32 .48 .41 .44 .47 .45 ↓21% .57 2.18 3.15 3.08 2.92 3.12 3.07 ↓16% 3.64\n+ ROVA (Ours) 72B 32 .57 .53 .56 .56 .56 ↓5% .59 1.95 3.45 3.35 3.42 3.18 3.35 ↓10% 3.72\nQwen3-VL 13B 32 .43 .35 .39 .42 .40 ↓25% .53 2.41 2.85 2.92 2.78 2.72 2.82 ↓19% 3.48\n+ ROVA (Ours) 13B 32 .53 .49 .52 .54 .52 ↓7% .56 2.12 3.28 3.32 3.18 3.05 3.21 ↓11% 3.62", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + 
"chunk_index": 17, + "total_chunks": 93, + "char_count": 4085, + "word_count": 702, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f124c58b-c999-4ee1-9d40-7e9ae6b649c7", + "text": "5.1 Implementation Details. We train our model on 4 NVIDIA A100 (80GB) GPUs. For optimization, we set the ordered\ngroup size to G = 8 and the shuffled group size to ˜G = G/2.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 18, + "total_chunks": 93, + "char_count": 174, + "word_count": 34, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "cd526083-ccba-475e-bc52-f7d14c5b911d", + "text": "Details are provided in Sec. We use both clean and perturbed video data for training and evaluation. For training,\nwe curate an outdoor-scene-relevant subset of Video-R1-260k (∼10% of its video split, filtered\nby scene category labels) and apply dynamic, randomly sampled perturbation masks to construct\ncorruption-augmented video-query pairs. For evaluation, we assess generalization on the proposed\nPVRBench, which contains over 51K question answer pairs across more than 9K videos spanning\ndiverse scene categories beyond the training distribution. Static perturbation masks are systematically\ninjected to measure model accuracy, reasoning quality, and robustness under both clean and corrupted\nconditions. We further evaluate the generalization of VLMs on standard VisBench and UrbanVideo. ROVA Performance on PVRBench. We extensively evaluate our approach on PVRBench and the\nclean benchmark (Orig.: UrbanVideo and VSI-Bench) across diverse backbones, including video\nreasoning models and open-source video LLMs ranging from 7B to 72B. As shown in Tab. 2, among\ndedicated video reasoning models, ROVA consistently outperforms prior methods. 
In the 7B setting,\nit improves the best-performing model, Embodied-R, from 0.42 to 0.50 average accuracy under\nperturbations (more than 17% relative gain), and even matches or surpasses the much larger Video-R1.\nTable 3: Training efficiency comparison (Qwen2.5-VL-7B, Orig. Acc. = 0.43; GPU-h =\n#GPUs × wall-clock hours).", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 19, + "total_chunks": 93, + "char_count": 1466, + "word_count": 207, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "010beb46-23ce-4d4a-bf37-184f7091a5ae", + "text": "SRE = Self-Reflective Evaluation, DRE = Difficulty Re-Evaluation, ME = Memory Eviction. Robust. = dual-branch alignment with structured corruption (Secs. 3.1 and 3.3). Curric. = SRE + DRE + ME (Sec. 3.2). Columns: Config.; Training Data (SFT, RL, Total); Architecture (Branch, Robust., Curric.); GPUs; GPU-h; Acc.\nGRPO | — | — | — | Single | ✗ | ✗ | 4×A100 | 71.6 | .45\nNaïve Dual | — | — | — | Dual | ✓ | ✗ | 4×A100 | 142.8 | .48\nVideo-R1 | 165K | 260K | 425K | Single | ✗ | ✗ | 8×A100 | 339.2 | .49\nROVA | 6.5K | 26K | 32.5K | Dual | ✓ | ✓ | 4×A100 | 134.4 | .53\nFigure 4: Analysis of Self-Reflective Evaluation and Difficulty-Aware Training for ROVA during\nthe first Epoch of Qwen2.5-VL-7B Training. (a) Sample discard rate evolution during self-reflective curriculum training. (b) Evolution of estimated easy, informative, and difficult sample proportions over training steps (total samples fixed). (c) Difficulty-aware confidence-threshold discard vs. random retention (+3.4% on PVRBench).
Importantly, it also achieves consistent improvements in reasoning quality, indicating stable\nand reliable reasoning under visual corruption. Most open-source video LLMs suffer substantial\ndegradation under perturbations, with 21–35% drops in accuracy and 16–28% declines in reasoning\nquality relative to clean inputs. Notably, ROVA not only withstands the proposed perturbations but also enhances the model's generalization performance, observing consistent gains on PVRBench and across unseen benchmarks\n(VisBench and UrbanVideo, Fig. 19) in both answer accuracy and reasoning quality under clean and\nperturbed videos. These findings suggest that ROVA is able to learn perturbation-robust representations with strong transferability, enabling improved robustness and semantic understanding beyond\nthe training distribution without domain-specific fine-tuning, while maintaining superior performance\non clean data. Beyond the accuracy and reasoning quality improvements, Tab. 3 shows that ROVA is highly resourceefficient. Although the dual-branch design doubles the forward pass, the proposed curriculum (SRE\n+ DRE + ME) more than offsets this overhead, reducing GPU-hours by 5.9% compared to naive\nDual-Branch (134.4 vs. 142.8) while improving accuracy from 0.37 to 0.47.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 20, + "total_chunks": 93, + "char_count": 2387, + "word_count": 351, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f72a1858-2b00-48db-8e29-229e774344a2", + "text": "Moreover, ROVA\nsurpasses Video-R1 by 23.7% (0.47 vs. 0.38) while using 60.4% fewer GPU-hours (134.4 vs. 339.2),\nhalf the GPUs, and less than 8% of the training data (32.5K vs. 425K). 
These results suggest that\nthe dual-branch alignment objective learns transferable, perturbation-robust representations that\ngeneralize beyond the training distribution without domain-specific fine-tuning, while maintaining\nstrong performance on clean data. Analysis of self-reflective evaluation and sample-selective training. We also analyze the behavior\nof our self-reflective evaluation mechanism during training. As shown in Fig. 4a, the discard rate\nfor easy samples increases steadily over epochs while that for difficult samples declines, indicating\nthat as the model improves it increasingly discards samples it has already mastered. Fig. 4a also\nshows that only a moderate fraction of samples is discarded overall: the model\nselectively filters low-utility or overly noisy instances rather than aggressively pruning data. Fig. 4b\nfurther illustrates the evolution of the estimated sample difficulty over training steps. While the\ntotal number of discarded samples is fixed, the composition gradually shifts toward easy samples,", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 21, + "total_chunks": 93, + "char_count": 1236, + "word_count": 176, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "69fd84c6-4ff3-431f-b99f-5d9b524970f3", + "text": "Figure 5: Ablation studies of ROVA. (a) Impact of individual components on answer accuracy: improvements over the base model with final-answer (Answer-Only) alignment (Avg: 0.4527) are Temporal Shuffle +1.82% (0.4609), Memory +2.73% (0.4651), Easy Sample Discarding +3.46% (0.4684), Reasoning Reward +4.91% (0.4749), and Full Components +15.66% (0.5236). (b) Comparison of corruption mask strategies across perturbation types: models trained on two mask styles (Weather + Occlusion, Weather + Shake, Occlusion + Lighting) and on Fixed-shape Random and Pixel-level Random masks are evaluated on Original, Weather, Occlusion, Shake, and Lighting perturbations, with held-out OOD cells highlighted in red. Experiments are conducted\nusing the Qwen3-VL-13B model trained for 3 epochs. reflecting the improving competence of the model: samples initially deemed difficult are increasingly\nreclassified as easy as training progresses. This dynamic redistribution suggests that the self-reflective\nevaluator captures meaningful learning signals and adapts the curriculum in a data-driven manner. Fig. 4c demonstrates the effectiveness of difficulty-aware data selection for training. Compared to\nrandom discarding, our strategy consistently achieves higher accuracy across discard rates, with an\nimprovement of up to 3.4% on PVRBench. This indicates that selective removal of samples based on\nestimated difficulty preserves informative training signals while avoiding detrimental noise.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 22, + "total_chunks": 93, + "char_count": 1807, + "word_count": 243, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3ff2c1fa-2f14-48d2-b474-23ca1ff44ade", + "text": "5.3 Ablation Study and Analysis Ablation of Core Components. We ablate each component of ROVA to assess its contribution\n(Fig. 5a).
The reasoning reward yields the largest gain, followed by easy sample discarding, underscoring the central role of semantic reasoning and targeted curation. The memory module and\ntemporal shuffle provide smaller but consistent gains, serving as complementary regularizers that\nstabilize training and enhance robustness. Ablation of Mask Styles. We explore the generalizability of the proposed structured masking\nstrategy compared to random masking baselines. As shown in Fig. 5b, models trained on only two\ncorruption mask styles achieve strong in-domain performance on the perturbation types seen during\ntraining, and more importantly, transfer effectively to held-out perturbation types (highlighted in red):\nout-of-domain performance remains close to in-domain results, while both consistently surpass fixed-shape and pixel-level random masking by a significant margin (6 - 9% absolute). This indicates that\nstructured, perturbation-aware masks capture transferable corruption patterns rather than overfitting\nto specific disturbance types, confirming that a small subset of mask styles suffices to achieve broad\nrobustness under diverse real-world disturbances. Ablation of reward models. Notably, our LLM judge (GPT-4o by default) outperforms rule- or embedding-based matching\nin evaluating semantic consistency across reasoning traces and\nfinal answers. Replacing it with open-source models (e.g.,\nQwen3-13B) yields comparable results, suggesting that the\napproach generalizes beyond proprietary APIs (Tab. 4). In contrast, more granular reward designs, such as conditional alignment or step-level consistency, introduce additional variance\nthat destabilizes GRPO and degrades performance (Tab. 15), further supporting LLM-based evaluation as the most effective\napproach.\nTable 4: Ablation study of the reward model on PVRBench using commercial and open-source VLMs.\nReward Judge | Acc. | — | Free\nGPT-4o | 0.470 | 2.99 | ✗\nQwen3-13B | 0.467 | 2.97 | ✓\nQwen2.5-7B | 0.463 | 2.95 | ✓", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 23, + "total_chunks": 93, + "char_count": 2096, + "word_count": 290, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d8d2a1e4-2229-4235-8087-5596894e74e8", + "text": "Trajectory Planning under Rain. Question: Based on the video frames showing the path from the park to the store, which trajectory correctly describes the agent's path? Answer: Start from the scooter lane in the park, turn left to face the street, rise to building height, then move forward and slightly downward to reach the store. Reasoning: The video shows the agent starting from the scooter lane in the park, turning left to face the street, rising to building height, and then moving forward slightly downward to reach the destination.\nNavigation under Heavy Rain. Question: Based on the video frames captured during heavy rain, should the agent move backward or turn left to reach the tower crane? Reasoning: By analyzing the spatial relationship in the video frames, the tower crane is located directly behind the agent's current viewpoint, requiring backward rather
than a left turn to reach the target.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 24, + "total_chunks": 93, + "char_count": 1014, + "word_count": 167, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "202bc0fb-f4a1-4c64-ac57-a4710c31daa5", + "text": "Figure 6: Qualitative examples of ROVA-trained Qwen2.5-VL-7B performing obstacle avoidance\nand target identification under night-time low-light conditions. See more examples in Figs. 20 to 23. 5.4 Qualitative Analysis We further validate the robustness of ROVA through qualitative examples on representative tasks\nin Fig. 6. Even in challenging scenarios where adverse weather or visual disturbances significantly\ndegrade visibility, ROVA remains effective, correctly reasoning about the scene and task requirements. For instance, when heavy rain and glare obscure key visual cues, ROVA can still infer spatial\nrelationships and scene structure, and when large objects block the field of view, it correctly reasons\nabout the underlying layout rather than relying on partial appearances. This shows that ROVA reliably\ninterprets and reasons in visually impaired conditions, demonstrating robustness beyond controlled\nsettings and confirming its effectiveness in difficult, realistic environments. In this work, we present ROVA, a robust training framework for embodied video reasoning that leverages structured spatio-temporal corruptions, dual-branch alignment, and self-reflective data curation\nto learn perturbation-robust representations. To evaluate robustness under realistic disturbances, we\nintroduce PVRBench. 
We show that ROVA consistently improves robustness under diverse real-world\nperturbations in video inputs while also improving performance on clean video–question pairs. These\ncontributions provide both a principled benchmark and a practical training recipe, enabling future\nstudies on broader perturbation families and more complex long-horizon embodied tasks. Amit Agarwal, Srikant Panda, Angeline Charles, Hitesh Laxmichand Patel, Bhargava Kumar, Priyaranjan Pattnayak, Taki Hasan Rafi, Tejaswini Kumar, Hansa Meghwani, Karan Gupta, and\nDong-Kyu Chae.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 25, + "total_chunks": 93, + "char_count": 1872, + "word_count": 245, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "71e14261-c862-40e9-87c0-658e1d1ace55", + "text": "MVTamperBench: Evaluating robustness of vision-language models. In Findings\nof the Association for Computational Linguistics (ACL Findings) 2025, pages 1–10, Stroudsburg,\nPA, 2025. Association for Computational Linguistics. 3 Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen,\nJinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common\nsense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025. 
3 Yichao Cai, Yuhang Liu, Zhen Zhang, and Javen Qinfeng Shi.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 26, + "total_chunks": 93, + "char_count": 541, + "word_count": 72, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b60015dc-5f70-4ba4-8020-cae0f682d100", + "text": "Clap: Isolating content from style\nthrough contrastive learning with augmented prompts. In Proceedings of the European Conference\non Computer Vision (ECCV), pages 1–10, Cham, 2024. Ruizhe Chen, Zhiting Fan, Tianze Luo, Heqing Zou, Zhaopeng Feng, Guiyang Xie, Hansheng Zhang,\nZhuochen Wang, Zuozhu Liu, and Huaijian Zhang.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 27, + "total_chunks": 93, + "char_count": 321, + "word_count": 46, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "97b00ec9-bf87-47fc-a7d5-d16aa4504b48", + "text": "Datasets and recipes for video temporal\ngrounding via reinforcement learning. In Proceedings of the 2025 Conference on Empirical\nMethods in Natural Language Processing: Industry Track, pages 1–10, Stroudsburg, PA, 2025a. Association for Computational Linguistics. 3 Shoubin Chen, Zehao Wu, Kai Zhang, Chunyu Li, Baiyang Zhang, Fei Ma, Fei Richard Yu, and\nQingquan Li. Exploring embodied multimodal large models: Development, datasets, and future\ndirections. arXiv preprint arXiv:2502.15336, 2025b. 3 Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi,\nSangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, et al. 
Molmo2: Open weights\nand data for vision-language models with video understanding and grounding. arXiv preprint Jinhao Duan, Quanfu Fan, Hao Cheng, Xiaoshuang Shi, and Kaidi Xu. Improve video representation\nwith temporal adversarial augmentation.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 28, + "total_chunks": 93, + "char_count": 907, + "word_count": 125, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9ffe08a0-1f81-4a22-926b-8b72236f3cba", + "text": "In Proceedings of the International Joint Conference on\nArtificial Intelligence (IJCAI), pages 1–10, Palo Alto, CA, 2023. IJCAI Organization. 3 Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu\nZhou, Yunhang Shen, Mengdan Zhang, et al.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 29, + "total_chunks": 93, + "char_count": 276, + "word_count": 42, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ef870e6a-9c62-4cb5-bbf8-d2c919fec3b7", + "text": "Video-mme: The first-ever comprehensive evaluation\nbenchmark of multi-modal llms in video analysis. In Proceedings of the IEEE International\nConference on Computer Vision and Pattern Recognition (CVPR), pages 1–10, Piscataway, NJ,\n2025. Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit\nGirdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world\nin 3,000 hours of egocentric video. 
In Proceedings of the IEEE International Conference on\nComputer Vision and Pattern Recognition (CVPR), pages 18995–19010, Piscataway, NJ, 2022. Andreas Griewank and Andrea Walther. Evaluating derivatives: principles and techniques of\nalgorithmic differentiation. SIAM, Philadelphia, PA, 2008. 29 Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil\nChandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate\nfine-grained human feedback for video generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1–10, Stroudsburg, PA, 2024. Association for Computational Linguistics. 7 Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint Mvbench: A comprehensive multi-modal video understanding benchmark.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 30, + "total_chunks": 93, + "char_count": 1404, + "word_count": 192, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a43848a2-4fb7-4f96-a67c-5f38551c9e8b", + "text": "In\nCVPR, pages 1–10, Piscataway, NJ, 2024. Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao,\nYi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement\nfine-tuning. arXiv preprint arXiv:2504.06958, 2025. 
1 Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 31, + "total_chunks": 93, + "char_count": 338, + "word_count": 49, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e194f530-4be3-4955-8da4-f78c65b815b5", + "text": "Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the Association\nfor Computational Linguistics (ACL), pages 12585–12602, Stroudsburg, PA, 2024. Association for\nComputational Linguistics. 1, 3 Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, and Carl Vondrick. Understanding zero-shot\nadversarial robustness for large-scale models. arXiv preprint arXiv:2212.07016, 2022. 2, 3 Thong Nguyen, Yi Bin, Junbin Xiao, Leigang Qu, Yicong Li, Jay Zhangjie Wu, Cong-Duy Nguyen,\nSee-Kiong Ng, and Anh Tuan Luu.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 32, + "total_chunks": 93, + "char_count": 558, + "word_count": 76, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8922311f-2b19-4370-81a2-71da7e9827e7", + "text": "Video-language understanding: A survey from model architecture, model training, and data perspectives. In Findings of the Association for Computational\nLinguistics (ACL Finding), pages 1–10, Stroudsburg, PA, August 2024. Association for Computational Linguistics. 3 Changdae Oh, Zhen Fang, Shawn Im, Xuefeng Du, and Yixuan Li. Understanding multimodal LLMs\nunder distribution shifts: An information-theoretic approach. 
In Proceedings of the International\nConference on Machine Learning (ICML), 2025. 3 Schiappa, Shruti Vyas, Hamid Palangi, Yogesh S. Rawat, and Vibhav Vineet.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 33, + "total_chunks": 93, + "char_count": 575, + "word_count": 76, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "019d4cda-087e-45c7-94b5-97f1fb6b5e40", + "text": "Robustness\nanalysis of video-language models against visual and language perturbations. In 36th Conference\non Neural Information Processing Systems Track on Datasets and Benchmarks, 2022. 3 Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang,\nMingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 2, 6 Lijun Sheng, Jian Liang, Zilei Wang, and Ran He. R-tpt: Improving adversarial robustness of\nvision-language models through test-time prompt tuning. In Proceedings of the IEEE International\nConference on Computer Vision and Pattern Recognition (CVPR), pages 1–10, Piscataway, NJ,\n2025. 
Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi,\nLuke Zettlemoyer, and Dieter Fox.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 34, + "total_chunks": 93, + "char_count": 852, + "word_count": 120, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "91cd6b44-8879-4492-91c7-7b2966f42f8b", + "text": "Alfred: A benchmark for interpreting grounded instructions for\neveryday tasks. In Proceedings of the IEEE International Conference on Computer Vision and\nPattern Recognition (CVPR), pages 10740–10749, Piscataway, NJ, 2020. Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang,\nand Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. In\nProceedings of the IEEE International Conference on Computer Vision and Pattern Recognition\n(CVPR), pages 1–10, Piscataway, NJ, 2025. Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut,\nJohan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly\ncapable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 
2 Baoshun Tong, Hanjiang Lai, Yan Pan, and Jian Yin.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 35, + "total_chunks": 93, + "char_count": 848, + "word_count": 119, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4b2c6393-7a8a-4fa4-bab7-8e7c38c44d4b", + "text": "On the zero-shot adversarial robustness of\nvision-language models: A truly zero-shot and training-free approach. In Proceedings of the\nIEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–10,\nPiscataway, NJ, 2025. Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju,\nLiang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision language model for temporal\nvideo grounding. In Advances in Neural Information Processing Systems (NeurIPS), pages 1–10,\nRed Hook, NY, 2025a. Curran Associates, Inc. 3 Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, and Mohit Bansal. Video-rts: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video\nreasoning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1–10, Stroudsburg, PA, 2025b. Association for Computational Linguistics. 3, Yulong Wu, Viktor Schlegel, and Riza Batista-Navarro. Pay attention to real world perturbations!\nnatural robustness evaluation in machine reading comprehension. arXiv preprint arXiv:2502.16523,\n2025. 
3 Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 36, + "total_chunks": 93, + "char_count": 1217, + "word_count": 168, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8dec589a-7da3-4082-8953-a4277a15819c", + "text": "Self-training with noisy student\nimproves imagenet classification. In Proceedings of the IEEE International Conference on\nComputer Vision and Pattern Recognition (CVPR), pages 1–10, Piscataway, NJ, 2020. Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 37, + "total_chunks": 93, + "char_count": 286, + "word_count": 40, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "bc83bfa9-be48-4201-b691-9de8e21440b3", + "text": "Thinking in\nspace: How multimodal large language models see, remember, and recall spaces. In Proceedings\nof the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pages\n1–10, Piscataway, NJ, 2025a. IEEE/CVF. 2, 3, 7, 17, 18 Zixi Yang, Jiapeng Li, Muxi Diao, Yinuo Jing, and Kongming Liang. Ro-bench: Large-scale\nrobustness evaluation of mllms with text-driven counterfactual videos. arXiv:2510.08936, 2025b. Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, and Sung Ju Hwang. Worldmm: Dynamic multimodal memory agent for long video reasoning. In Proceedings of the IEEE International\nConference on Computer Vision and Pattern Recognition (CVPR), pages 1–10, Piscataway, NJ,\n2026. 
Shoubin Yu, Jaehong Yoon, and Mohit Bansal.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 38, + "total_chunks": 93, + "char_count": 747, + "word_count": 106, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fd279785-9c6d-4f51-910d-37278f919d66", + "text": "Crema: Generalizable and efficient video-language\nreasoning via multimodal modular fusion. In Proceedings of the International Conference on\nLearning Representations (ICLR), pages 1–10. OpenReview.net, 2025. 1, 3 Shoubin Yu, Yue Zhang, Zun Wang, Jaehong Yoon, Huaxiu Yao, Mingyu Ding, and Mohit Bansal. When and how much to imagine: Adaptive test-time scaling with world models for visual spatial\nreasoning. arXiv preprint arXiv:2602.08236, 2026. 3 Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, and Yuan Lin. Tarsier2: Advancing large\nvision-language models from detailed video description to comprehensive video understanding. Runhao Zeng, Xiaoyong Chen, Jiaming Liang, Huisi Wu, Guangzhong Cao, and Yong Guo. Benchmarking the robustness of temporal action detection models against temporal corruptions. In\nProceedings of the IEEE International Conference on Computer Vision and Pattern Recognition\n(CVPR), pages 1–10, Piscataway, NJ, 2024. 
Hang Zhang, Xin Li, and Lidong Bing.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 39, + "total_chunks": 93, + "char_count": 985, + "word_count": 136, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "900b16b6-143b-423b-8cfb-3e3da58fa916", + "text": "Video-llama: An instruction-tuned audio-visual language\nmodel for video understanding. In Proceedings of the 2023 conference on empirical methods\nin natural language processing: system demonstrations, pages 543–553, Stroudsburg, PA, 2023. Association for Computational Linguistics. 1, 3 Jiawei Zhang, Tianyu Pang, Chao Du, Yi Ren, Bo Li, and Min Lin. Benchmarking large multimodal\nmodels against common corruptions. arXiv preprint arXiv:2401.11943, 2024. 2 Wenqi Zhang, Mengna Wang, Gangao Liu, Xu Huixin, Yiwei Jiang, Yongliang Shen, Guiyang Hou,\nZhe Zheng, Hang Zhang, Xin Li, et al. Embodied-reasoner: Synergizing visual search, reasoning,\nand action for embodied interactive tasks. arXiv preprint arXiv:2503.21696, 2025. 3 Baining Zhao, Jianjie Fang, Zichao Dai, Ziyou Wang, Jirong Zha, Weichen Zhang, Chen Gao,\nYue Wang, Jinqiang Cui, Xinlei Chen, et al.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 40, + "total_chunks": 93, + "char_count": 859, + "word_count": 120, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "332773e8-deb9-4b71-afbe-a30826e382a9", + "text": "Urbanvideo-bench: Benchmarking vision-language\nmodels on embodied intelligence with video data in urban spaces. 
In Proceedings of the Association for Computational Linguistics (ACL), pages 1–10, Stroudsburg, PA, 2025a. Association for\nComputational Linguistics. 2, 3, 7, 17 Baining Zhao, Ziyou Wang, Jianjie Fang, Chen Gao, Fanhang Man, Jinqiang Cui, Xin Wang, Xinlei\nChen, Yong Li, and Wenwu Zhu. Embodied-r: Collaborative framework for activating embodied\nspatial reasoning in foundation models via reinforcement learning. In Proceedings of the 33rd\nACM International Conference on Multimedia, pages 1–10, New York, NY, 2025b. Shuai Zhao, Xiaohan Wang, Linchao Zhu, and Yi Yang.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 41, + "total_chunks": 93, + "char_count": 680, + "word_count": 95, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6648f9d4-a793-4b97-9548-d9bb26d2b63b", + "text": "Test-time adaptation with clip reward for\nzero-shot generalization in vision-language models. In Proceedings of the International Conference\non Learning Representations (ICLR), pages 1–10. OpenReview.net, 2024. 3 Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. In Advances in Neural\nInformation Processing Systems (NeurIPS), pages 1–10, Red Hook, NY, 2023. Curran Associates,\nInc. 
3", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 42, + "total_chunks": 93, + "char_count": 495, + "word_count": 67, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "dfcadec3-c7da-44dd-b45e-bf5606d62886", + "text": "Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang,\nZi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and\nchatbot arena. In Advances in Neural Information Processing Systems (NeurIPS), pages 1–10, Red\nHook, NY, 2023. Curran Associates, Inc. 7 Wanqi Zhou, Shuanghao Bai, Danilo P. Mandic, Qibin Zhao, and Badong Chen. Revisiting the\nadversarial robustness of vision language models: a multimodal perspective. arXiv preprint Yiyang Zhou, Yangfan He, Yaofeng Su, Siwei Han, Joel Jang, Gedas Bertasius, Mohit Bansal, and\nHuaxiu Yao. Reagent-v: A reward-driven multi-agent framework for video understanding. In\nAdvances in Neural Information Processing Systems (NeurIPS), pages 1–10, Red Hook, NY, 2025. Curran Associates, Inc. 3", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 43, + "total_chunks": 93, + "char_count": 796, + "word_count": 115, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4ebe7f64-d855-4453-848b-7f17fbc60d2a", + "text": "B Full Details of Dataset Construction 17 B.1 Source Dataset Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 B.2 Video Perturbation Generation System . . . . . . . . . . . . . . . . . . . . . . . . 
19 C Prompt Templates 19 C.1 Alignment Reward Prompts 20 C.2 Difficulty Assessment Judge Prompt 20 C.3 Complete Reward Computation Pipeline 20 E Additional Experimental Results 24 F Additional Case Study 28 G Time Complexity Analysis 28 G.1 Per-Step Cost Decomposition 28 G.2 Amortized Cost Savings from Curriculum 31 G.3 Wall-Clock Time Measurements 32 G.4 Amortized Memory Re-evaluation Cost 33 H Analysis of Reward Modeling Design 33",
    "paper_id": "2603.10652",
    "title": "Are Video Reasoning Models Ready to Go Outside?",
    "authors": [
      "Yangfan He",
      "Changgyu Boo",
      "Jaehong Yoon"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
    "chunk_index": 44,
    "total_chunks": 93,
    "char_count": 1021,
    "word_count": 318,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2d7a4195-b246-4ce9-9eef-69cda2f610b1",
    "text": "H.1 Motivation: Why Multi-Component Rewards? 33 H.2 Alignment Reward: Optimizing Geodesic distance 34 H.3 Interaction Between Reward Components and Curriculum 34 H.4 Comparison with Alternative Reward Designs 35 I Theoretical Analysis 36 While the proposed composite reward design proves effective in practice, several design choices\nwarrant further investigation. 
First, both the format reward and accuracy reward are binary (0\nor 1), offering no partial credit for nearly correct answers or partially well-structured outputs; a\nsofter, continuous reward signal could provide richer gradients for GRPO optimization. Second, the\nproposed reward components are combined with equal weights, but the optimal balance among format\ncompliance, answer correctness, and cross-branch alignment may vary across perturbation types\nand reasoning complexity. For simplicity, our framework does not adaptively adjust these weights\nduring training.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 45, + "total_chunks": 93, + "char_count": 1073, + "word_count": 197, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6ae8692d-9f80-4397-80c6-eac7d1ccbcde", + "text": "Third, the alignment reward relies on an external LLM judge to assess semantic\nconsistency between clean and perturbed outputs, which introduces a dependency on the judge's\nown capability and potential biases; although we show that open-source alternatives (Qwen3-13B)\nyield comparable results, the reward signal remains bounded by the judge model's understanding of\ndomain-specific reasoning. Fourth, our reward operates only at the holistic output level, evaluating the\nfinal answer and the overall reasoning trace, without providing step-level feedback on intermediate\nreasoning quality. As our ablation study confirms, more fine-grained reward designs, such as steplevel consistency checks, tend to introduce variance that destabilizes GRPO training. 
Addressing this\ntrade-off between reward granularity and optimization stability, for instance, through hierarchical or\ncurriculum-based reward shaping, remains an important direction for future work.",
    "paper_id": "2603.10652",
    "title": "Are Video Reasoning Models Ready to Go Outside?",
    "authors": [
      "Yangfan He",
      "Changgyu Boo",
      "Jaehong Yoon"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
    "chunk_index": 46,
    "total_chunks": 93,
    "char_count": 954,
    "word_count": 127,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "6bf51e20-c22f-4b9f-929c-739c29804724",
    "text": "B Full Details of Dataset Construction This section provides comprehensive documentation of the PVRBench benchmark construction,\nincluding data sources, curation methodology, perturbation generation algorithms, and quality assurance protocols. Our benchmark integrates and augments two established embodied video reasoning\ndatasets, UrbanVideo-Bench [Zhao et al., 2025a] and VSI-Bench [Yang et al., 2025a], to create\nthe first large-scale robustness evaluation benchmark for video reasoning under realistic visual\nperturbations. B.1 Source Dataset Integration PVRBench is constructed by systematically combining the complete video corpora and question-answer annotations from two complementary benchmarks, resulting in a unified evaluation framework\nspanning both outdoor urban navigation and indoor spatial reasoning scenarios (Fig. 7). B.1.1 UrbanVideo-Bench UrbanVideo-Bench [Zhao et al., 2025a] is an embodied video reasoning benchmark specifically\ndesigned for evaluating Video-LLMs on aerial agent motion in urban open-ended three-dimensional\nspaces. 
The benchmark addresses a critical gap in existing evaluations by focusing on the unique\nchallenges of drone-based navigation in complex urban environments.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 47, + "total_chunks": 93, + "char_count": 1212, + "word_count": 153, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d92ec038-4dd4-40e9-8345-6f993cccc4e4", + "text": "Data Collection Sources. The video corpus comprises 1,547 video clips collected from three\ndistinct sources: Real-World Drone Footage (Guangdong Province, China): Videos captured using two DJI Mini\n4K drones operated by experienced pilots with over 1,000 hours of flight time. Data collection was\nconducted in Shenzhen and Zhaoqing, covering diverse urban landscapes including commercial\ndistricts, residential areas, parks, and waterfront regions.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 48, + "total_chunks": 93, + "char_count": 448, + "word_count": 61, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9a83101c-8453-43fd-a254-bde295c1d080", + "text": "Resolution: 1280 × 720 pixels. EmbodiedCity Simulator: A high-fidelity simulation environment built on Unreal Engine using\nreal Beijing city data. The simulator provides realistic 3D urban modeling with over 100 categories\nof micro urban elements (buildings, vehicles, pedestrians, signage, etc.). Resolution: 960 × 720\npixels. 
AerialVLN Simulator: A virtual urban environment specifically designed for aerial vision-language navigation research, built on Unreal Engine with AirSim integration for realistic drone\nphysics. Resolution: 520 × 520 pixels. [Figure 7: chart data omitted.] (a) UrbanVideo-Bench QA type distribution. Action Generation (22.7%), Landmark Position (16.8%), and Progress Evaluation (14.5%) dominate, reflecting the navigation-centric design. (b) VSI-Bench QA type distribution. Size Estimation (20.7%) and distance tasks (29.0% combined) are most prevalent, reflecting the spatial measurement focus. Figure 7: Question-answer type distributions for PVRBench source datasets. The complementary\ndistributions - UrbanVideo emphasizing navigation/action and VSI-Bench emphasizing spatial perception - together provide comprehensive coverage of embodied video reasoning capabilities. Table 5: Complete task taxonomy for UrbanVideo-Bench with 16 tasks across 4 cognitive ability\ncategories. 
Category Task Description Trajectory Captioning Summarize agent movement using visual landmarks\nSequence Recall Identify next action after specific movement\nRecall Object Recall Locate objects relative to landmarks\nScene Recall Describe observations during specific actions\nStart/End Position Identify journey origin and destination", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 49, + "total_chunks": 93, + "char_count": 2525, + "word_count": 320, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d5b2ba09-49b6-4226-b0b9-8ffe7b00b22f", + "text": "Proximity Track distance changes to landmarks\nDuration Compare temporal duration of movements\nPerception Landmark Position Determine egocentric position relative to goals\nGoal Detection Identify if/where destination is visible\nCognitive Map Summarize spatial environment layout Causal Explain reasons for specific movements\nReasoning Counterfactual Evaluate alternative action consequences\nAssociation Identify relevant objects when the goal is not visible Progress Evaluation Assess current step in navigation route\nNavigation High-level Planning Determine next waypoint toward goal\nAction Generation Output specific control actions Video Characteristics. The collected videos span a wide range of characteristics. 
Their durations\nvary from 10 seconds to 10 minutes, with a mean length of 87.3s and a median of 52.1s, and frame\nrates range from 24 to 30 fps depending on the source.",
    "paper_id": "2603.10652",
    "title": "Are Video Reasoning Models Ready to Go Outside?",
    "authors": [
      "Yangfan He",
      "Changgyu Boo",
      "Jaehong Yoon"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
    "chunk_index": 50,
    "total_chunks": 93,
    "char_count": 883,
    "word_count": 122,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "1e93c0ff-cecc-468f-81a0-4500dc898b85",
    "text": "All videos are captured using a single\nforward-facing camera mounted on a gimbal that supports a downward tilt between 0° and 90°. In\nterms of motion, the videos feature purposeful navigation trajectories, including ascent and descent,\nhorizontal translation, rotation, as well as compound movements that combine multiple motion types. UrbanVideo-Bench defines 16 task types organized into four cognitive ability\ncategories, as shown in Tab. 5. VSI-Bench (Visual Spatial Intelligence Benchmark) [Yang et al., 2025a] evaluates spatial reasoning\ncapabilities from egocentric video perspectives in indoor environments. The benchmark focuses on Table 6: VSI-Bench scene category distribution across 288 videos. 
Scene Type Proportion Characteristics Living Rooms 22.1% Social spaces with seating, entertainment systems\nBedrooms 19.3% Sleeping areas with beds, wardrobes, personal items\nKitchens 18.4% Cooking areas with appliances, countertops, cabinets\nOffices 15.8% Workspaces with desks, chairs, equipment\nBathrooms 12.7% Sanitary facilities with fixtures\nHallways/Other 11.7% Transitional spaces and miscellaneous areas", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 51, + "total_chunks": 93, + "char_count": 1117, + "word_count": 147, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f2550500-1594-4f6a-8ee1-a12811a3d06c", + "text": "Table 7: VSI-Bench task distribution with spatial reasoning focus. Size Estimation 20.7% Estimate absolute dimensions of objects\nAbsolute Distance 14.5% Measure distance between camera and objects\nRelative Distance 14.5% Compare distances to multiple objects\nDirection (Medium) 11.7% Determine object directions with moderate complexity\nObject Counting 11.3% Count instances of object categories\nAppearance Order 10.9% Sequence objects by order of appearance\nDirection (Hard) 9.4% Complex directional reasoning with occlusions\nRoom Size Estimation 3.1% Estimate room dimensions\nRoute Planning 2.7% Plan navigation paths through spaces\nDirection (Easy) 1.2% Simple directional questions fundamental spatial cognition tasks that require understanding of 3D space from sequential visual\nobservations. 
VSI-Bench aggregates videos from three public indoor scene datasets: ARKitScenes,\nwhich provides real-world indoor scans captured using Apple ARKit; ScanNet, a widely used dataset\nof RGB-D indoor scene reconstructions; and 3RScan, a large-scale real-world indoor dataset enriched\nwith instance-level annotations. The 288 videos span six indoor environment types, as detailed in Tab. 6. VSI-Bench defines 11 spatial reasoning tasks, as shown in Tab. 7.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 52, + "total_chunks": 93, + "char_count": 1249, + "word_count": 167, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "320f0ac3-7545-406b-8b85-a47ec7989942", + "text": "B.2 Video Perturbation Generation System We develop a comprehensive video perturbation system that generates semantically coherent, temporally consistent, and physically plausible visual corruptions. Unlike generic image augmentation\ntechniques (e.g., random cropping, color jittering, and Gaussian noise), our system models realistic\ndisturbances that preserve the answerable nature of questions while challenging model robustness.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 53, + "total_chunks": 93, + "char_count": 432, + "word_count": 52, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fb552c69-4325-4bee-a310-b487ab6a798b", + "text": "B.2.1 System Architecture Overview The perturbation system comprises four specialized modules organized in a modular pipeline architecture. 
Each module can be applied independently or in combination, with perturbation type sampled\nuniformly from M = {lighting, camera, occlusion, weather}. This section documents the complete prompt templates used in ROVA for alignment reward computation and self-reflective difficulty assessment. Table 8: Video perturbation system architecture overview. Input video V = {f1, . . . , fT } is transformed to perturbed video V ′ = {f1,′ . . . , fT′ } via one of four modules. Module Effects Real-World Scenario Lighting Dusk, Night, Overexposure, Shadow Time-of-day changes, exposure errors\nCamera Motion Translation, Zoom, Rotation Handheld shake, platform instability\nOcclusion Static, Dynamic Lens obstruction, passing objects\nWeather Fog, Rain, Snow Atmospheric conditions", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 54, + "total_chunks": 93, + "char_count": 909, + "word_count": 129, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f61d193d-33d1-4c29-b971-81c5b631e91e", + "text": "C.1 Alignment Reward Prompts As shown in Algorithm 2, the alignment reward rAj evaluates the consistency between outputs from\nthe original and perturbed video branches by decomposing it into two complementary components:\nanswer-level consistency and reasoning-level consistency, both assessed using GPT-4o. For answer consistency, the evaluator employs a strict binary matching rule: if the candidate answer\nexactly matches or is semantically equivalent to the reference answer (e.g., \"0\" vs. \"zero\"), a score of\n1.0 is assigned; otherwise, the score is 0.0, with no partial credit allowed (see answer consistency\nprompt template (Fig. 8)). 
For reasoning consistency, a three-tier scoring scheme is used: a score of 1.0 indicates that the\ncandidate reasoning is fully consistent with the reference, allowing for paraphrasing and minor\nomissions; 0.5 indicates general consistency but includes unsupported additions or missing key\nsteps; and 0.0 indicates contradiction or hallucination of core facts. Critically, scoring is based\nsolely on the reasoning process, independent of the final answer (see reasoning consistency prompt\ntemplate (Fig. 9)). Together, these two metrics - answer matching and reasoning alignment - enable a fine-grained evaluation of output consistency under perturbation, promoting both semantic robustness and reasoning\nfidelity in the model.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 55, + "total_chunks": 93, + "char_count": 1367, + "word_count": 195, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7447f2af-e9e0-441a-9b10-3ea912dd0d47", + "text": "C.2 Difficulty Assessment Judge Prompt Fig. 10 illustrates the self-reflective difficulty assessment that employs an LLM judge to determine\nsample answerability under visual perturbations. The LLM receives a binary assessment prompt\nthat strictly constrains it to evaluate only using the masked video. If the masked video provides\nsufficient information to reliably answer the given question, the LLM must output YES; otherwise,\nit must output NO. 
Following this judgment, samples classified as YES are treated as easy with\nlow confidence or informative difficulty and are retained for training, while those classified as NO\nare deemed hard and are placed into a buffer for later re-evaluation—thereby enabling an adaptive,\ndifficulty-aware curriculum that dynamically prioritizes informative training instances and defers\noverly challenging ones until the model is better equipped to handle them.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 56, + "total_chunks": 93, + "char_count": 897, + "word_count": 129, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b562f5d5-a403-4f83-9c5e-371727341f9c", + "text": "C.3 Complete Reward Computation Pipeline Algorithm 1 details the complete reward computation pipeline used in ROVA. Given a paired output\n(oj, ˜oj) generated from the original and perturbed video branches, the pipeline proceeds in five\nsequential steps. First, format validation checks whether the output adheres to the required First,\nformat validation checks whether the output adheres to the required format:", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 57, + "total_chunks": 93, + "char_count": 411, + "word_count": 60, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0d169e5c-09fe-4de1-9fd9-c9c812be9bf4", + "text": "· · · · · · Second, the reasoning trace and final answer are extracted from both branches. 
Third, a binary\naccuracy reward rAccj is computed by comparing the extracted answer against the ground truth. Fourth, two alignment rewards are obtained via GPT-4o: a three-tier reasoning consistency score\nralign,rj ∈{0, 0.5, 1} that evaluates whether the key logical steps are preserved across branches, and a\nbinary answer consistency score ralign,aj ∈{0, 1} that checks semantic equivalence of the final answers. Finally, these components are aggregated into the total reward Rj = rFj +rAccj +αr·ralign,rj +αa·ralign,aj , ▷Answer Consistency Evaluation Prompt [Task]\nYou are a strict evaluator responsible for assessing whether the candidate answer matches\nthe reference answer. Score consistency only based on whether the CANDIDATE ⟨answer⟩\nis semantically identical to the REFERENCE ⟨answer⟩. Do not consider reasoning quality,\nexplanation depth, or stylistic differences. [Evaluation Criteria]\nRate the answer on a binary scale:\n• Score 1.0: The candidate answer is exactly the same as, or clearly equivalent to, the\nreference answer (e.g., \"0\" vs. \"zero\", \"NYC\" vs. \"New York City\").\n• Score 0.0: The candidate answer differs from the reference answer in any substantive\nway.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 58, + "total_chunks": 93, + "char_count": 1306, + "word_count": 195, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ea373e25-656c-4dc3-bcd3-189f79077e8b", + "text": "Do not reward partial credit. Minor formatting or punctuation differences should be tolerated,\nbut semantic mismatches must receive a score of 0.0. [Input]\n• Reference Answer: {reference_answer}\n• Candidate Answer: {candidate_answer} [Output Format]\nReturn a JSON object with the following fields. 
Only output the JSON object - no explanations,\nno justifications, and no extra text of any kind. {\"score\": 0.0 or 1.0,\n\"match_type\": \"exact\" or \"equivalent\" or \"mismatch\"} Figure 8: Answer consistency evaluation prompt for binary answer matching. Algorithm 1 Alignment Reward Computation Require: Output pair (oj, ˜oj) from original and perturbed branches, ground truth g\nEnsure: Total reward Rj\n1: Step 1: Format Validation\n2: rFj ←regex_match(oj, \".*<think>.*</think>.*<answer>.*</answer>.*\")\n3: Step 2: Extract Components\n4: pj ←extract(oj, \"<think>\"); aj ←extract(oj, \"<answer>\")\n5: ˜pj ←extract(˜oj, \"<think>\"); ˜aj ←extract(˜oj, \"<answer>\")\n6: Step 3: Accuracy Reward\n7: rAccj ←1[aj = g]\n8: Step 4: Alignment Rewards via GPT-4o\n9: ralign,rj ←GPT4o(reasoning_prompt, pj, ˜pj) {∈{0, 0.5, 1}}\n10: ralign,aj ←GPT4o(answer_prompt, aj, ˜aj) {∈{0, 1}}\n11: Step 5: Aggregation\n12: rAj ←αr · ralign,rj + αa · ralign,aj\n13: Rj ←rFj + rAccj + rAj\n14: Return Rj where the asymmetric weights αr = 0.3 and αa = 0.7 prioritize answer-level robustness while still\nencouraging reasoning fidelity (see Sec. D for detailed hyperparameter specifications). ▷Reasoning Consistency Evaluation Prompt [Task]\nYou are a strict evaluator responsible for assessing whether the candidate reasoning is consistent with the reference reasoning. Score consistency only based on whether the CANDIDATE\n⟨think⟩ matches the REFERENCE ⟨think⟩ in key evidence and logical steps. 
Do not evaluate\nthe correctness of the final answer.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 59, + "total_chunks": 93, + "char_count": 1790, + "word_count": 258, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1e3c99fa-9a88-42a1-899e-3ee27d0fc05e", + "text": "[Evaluation Criteria]\nRate the reasoning on a three-tier scale:\n• Score 1.0: The candidate reasoning is consistent with the reference up to paraphrasing\nand minor omissions. All key observations and logical steps are preserved.\n• Score 0.5: The candidate reasoning is mostly consistent but contains unsupported additions, missing key intermediate steps, or minor logical deviations.\n• Score 0.0: The candidate reasoning contradicts core observations from the reference or\nhallucinates key facts not present in the reference. [Evaluation Guidelines]\n• Focus exclusively on the reasoning process — ignore the final answer.\n• Tolerate stylistic and structural differences if the underlying logic is equivalent.\n• Penalize fabricated evidence or contradictions to reference observations. [Input]\n• Reference Reasoning: {reference_think}\n• Candidate Reasoning: {candidate_think} [Output Format]\nReturn a JSON object with the following fields. Only output the JSON object — no explanations, no justifications, and no extra text of any kind. 
{\"score\": 0.0 or 0.5 or 1.0,\n\"justification\": \"\"} Figure 9: Reasoning consistency evaluation prompt with three-tier scoring.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 60, + "total_chunks": 93, + "char_count": 1172, + "word_count": 164, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5f0c64f9-4b82-4bae-ae52-b85d3b860e04", + "text": "All hyperparameters used in ROVA are summarized in Fig. 11. For the reward function, the alignment\ncomponent assigns αr = 0.3 to reasoning consistency and αa = 0.7 to answer consistency, reflecting\nthe greater difficulty of strict reasoning alignment while prioritizing answer robustness; the base\nreward uses binary format and accuracy terms (wF = wAcc = 1.0) with KL regularization β = 0.01\nand Kmax = 537. For GRPO training, ordered and shuffled group sizes G = 8 and ˜G = 4 ensure\nreliable advantage estimation, PPO clipping ϵ = 0.2 with gradient norm 1.0 stabilizes policy updates,\nand GAE λGAE = 0.95 with γ = 0.99 yields a favorable bias–variance trade-off. For the difficultyaware curriculum, confidence threshold τ = 0.8 with bounds amin = 0.3 and amax = 0.85 governs\nsample selection, while the buffer permits Nmax = 3 replay attempts over at most |M|max = 1000\nsamples with re-evaluation every 50 steps. Training uses 16 frames at 128×28×28 (32 frames at\n256×28×28 at inference), AdamW with lr = 1×10−5 and cosine schedule on 4×A100 (80GB)\nGPUs, with 1 SFT epoch and 300 RL steps. D.0.1 Hyperparameter Sensitivity Analysis We conduct ablation studies on key hyperparameters to validate our design choices, as shown in Fig 9. 
The results indicate that setting the alignment weights to αr = 0.3 and αa = 0.7, which prioritizes\nanswer alignment, leads to improved downstream accuracy while preserving reasoning quality. A\nconfidence threshold of τ = 0.8 provides an effective balance: lower thresholds retain an excessive ▶LLM Judge Prompt for Difficulty Assessment [Task]\nYou may ONLY use the MASKED video to judge. [Evaluation Criteria]\n• If the masked video DOES give enough information to reliably answer, respond: YES.\n• If the masked video does NOT give enough information, respond: NO.\n• Additionally, provide a confidence score in [0.0, 1.0] (one decimal place) reflecting how\ncertain you are in your judgment. Reply with ONE WORD and ONE NUMBER only. [Input]\n• Question: {question_text} [Output Format]\n{\"answer\": \"YES or NO\",\n\"confidence\": 0.0}",
    "paper_id": "2603.10652",
    "title": "Are Video Reasoning Models Ready to Go Outside?",
    "authors": [
      "Yangfan He",
      "Changgyu Boo",
      "Jaehong Yoon"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
    "chunk_index": 61,
    "total_chunks": 93,
    "char_count": 2044,
    "word_count": 334,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3c532011-3ed8-4629-9ff9-03e0699d9e9f",
    "text": "Figure 10: LLM judge prompt for binary answerability assessment under perturbation. The confidence\nscore controls the sample discard rate via threshold τ. 
Algorithm 2 ROBUST VIDEO ALIGNMENT (ROVA)\nRequire: Policy Fθ, buffer M=∅, data D, params (α, τ, Kmax, G)\n# Self-Reflective Difficulty-Aware Training\n1: for (q, V ) ∼D do\n2: ˜V ←PERTURB(V ) ▷Spatio-temporal corruption\n3: {oj}Gj=1 ∼Fθ(·|q, V ); {˜oj}Gj=1 ∼Fθ(·|q, ˜V ) ▷Dual-branch\n4: Rj ←rj + α·SIM(oj, ˜oj); Aj ←(Rj−¯R)/σR ▷Alignment reward\n5: Fθ ←GRPOSTEP(Fθ, {Ai}) ▷Policy update\n6: (d, c) ←F(q, ˜V , Se; θ) ▷Self-assessment\n7: if d=HARD then\n8: M ←M ∪{(q, ˜V , 0)} ▷Buffer hard sample\n9: else if d=EASY ∧c>τ then\n10: skip ▷Prune mastered\n11: end if\n12: # Difficulty Re-Evaluation (only when the memory is full or after sufficient iterations)\n13: for (q, ˜V , n) ∈M do\n14: d′ ←A(q, ˜V , θcurr); n ←n+1\n15: if d′ =INFORMATIVE then\n16: Train on (q, ˜V ); remove from M ▷Promote\n17: else if d′ =EASY or n>Nmax then\n18: Remove from M ▷Evict\n19: end if\n20: end for\n21: end for number of easy samples, whereas higher thresholds discard valuable training signals. We find that\na group size of G = 8 is sufficient to ensure stable advantage estimation, with larger group sizes\nyielding diminishing returns. Finally, a perturbation intensity of η = 0.7 achieves an appropriate [Figure 11: four panels plotting validation accuracy (%) against αr, τ, G, and η.] Figure 11: Hyperparameter sensitivity analysis of ROVA on the validation set, illustrating the effect\nof key training hyperparameters on model performance. Table 9: Hyperparameter sensitivity analysis on the PVRBench validation set for Qwen2.5-VL-7B\nafter the first training epoch. Best values are highlighted in bold. Hyperparameter Value Avg. 
αr (reasoning weight): 0.1 → 36.2, 0.3 → 39.1, 0.5 → 37.8\nτ (confidence threshold): 0.6 → 37.4, 0.8 → 39.1, 0.95 → 38.2",
    "paper_id": "2603.10652",
    "title": "Are Video Reasoning Models Ready to Go Outside?",
    "authors": [
      "Yangfan He",
      "Changgyu Boo",
      "Jaehong Yoon"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
    "chunk_index": 62,
    "total_chunks": 93,
    "char_count": 1986,
    "word_count": 349,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "6102197b-3551-43ea-ba2e-35bece5e0531",
    "text": "G (group size): 4 → 37.9, 8 → 39.1, 16 → 38.7\nη (perturbation intensity): 0.5 → 40.2, 0.7 → 39.1, 0.9 → 36.8\nbalance between challenge and solvability - lower intensities fail to sufficiently enhance robustness,\nwhile higher intensities render samples unanswerable. E Additional Experimental Results Fine-Grained Performance Analysis. We further analyze ROVA's performance through complementary perspectives (Figs. 14 to 18), which present radar charts comparing per-task accuracy of\nROVA against the baselines across multiple task categories, revealing consistent improvements in\nhigh-level planning and associative reasoning. Fig. 12 shows the impact of input frame count on\nrobustness: increasing frames from 16 to 64 improves both baseline and ROVA performance across\nall perturbation types, confirming the benefit of longer temporal context. Notably, ROVA consistently\noutperforms the baseline at every frame count, indicating that our framework learns more robust\nrepresentations rather than merely exploiting additional frames. Evolution of Reasoning and Answer Rewards. We examine the reward dynamics of core components during ROVA training (Fig. 13). 
The total reward converges stably, while decomposed rewards\nshow distinct patterns: accuracy reward rises rapidly and plateaus, reflecting task-specific learning;\nreasoning reward grows gradually, indicating deeper semantic understanding; and temporal reward\nshows gradual growth with the lowest variation rate among all components, acting as a temporal\nregularizer. This confirms that each component effectively guides different learning aspects.",
    "paper_id": "2603.10652",
    "title": "Are Video Reasoning Models Ready to Go Outside?",
    "authors": [
      "Yangfan He",
      "Changgyu Boo",
      "Jaehong Yoon"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
    "chunk_index": 63,
    "total_chunks": 93,
    "char_count": 1589,
    "word_count": 215,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "59289921-8de8-4b3c-a10a-efa88750c576",
    "text": "Cross-Benchmark Evaluation. Fig. 19 compares ROVA against baselines on the VisBench and\nUrbanVideo benchmarks under various perturbation types. Our method achieves consistent improvements across both benchmarks, with average accuracy gains of +14.6% on VisBench and +12.9% on\nUrbanVideo, demonstrating strong cross-benchmark generalization. [Figure 12 plot: performance score (0.20–0.50) by condition (Original, Weather, Occlusion, Shake, Lighting) for Baseline and ROVA at 16F/32F/48F/64F.] Figure 12: Performance of ROVA vs. baseline on Qwen2.5-VL-7B across varying frame counts (F =\nNumber of Frames). ROVA outperforms the baseline at every frame count. 
Figure 13: Reward curves of ROVA during the first epoch of Qwen2.5-VL-7B training (Accuracy, Format, Temporal, and Total Reward vs. training steps 0–250). Table 10: The stability of easy-classified samples for Qwen2.5-VL-7B. Step | Retain Rate (%) ↑ (Ep.1/Ep.2/Ep.3) | Confidence ↑ (Ep.1/Ep.2/Ep.3)\n0 | –/–/– | –/–/–\n50 | 82.3/86.1/89.4 | 0.71/0.74/0.77\n100 | 87.5/90.2/92.8 | 0.73/0.78/0.81\n150 | 91.2/93.6/95.1 | 0.76/0.81/0.84\n200 | 93.8/95.2/96.3 | 0.79/0.83/0.86\n250 | 95.1/96.0/96.8 | 0.81/0.85/0.88\n300 | 95.4/96.2/97.1 | 0.82/0.86/0.89 Stability of Easy-Classified Samples. Tab. 10 further quantifies the stability of easy-sample classification. Easy samples are re-evaluated at each training step; the retention rate measures the proportion\nthat remain classified as easy upon re-evaluation, while the confidence score reflects the model's\ncertainty in its classification. Both metrics increase steadily over training, with the retention rate\nreaching 97.1% and confidence reaching 0.89 by step 300 (epoch 3), confirming that the self-reflective\nevaluation mechanism becomes increasingly reliable as training progresses. Analyses of Self-Reflective Evaluation. We analyze the discarding statistics across training runs\nand track the evolving proportions of medium, difficult, and easy samples throughout training. Difficult samples consistently exhibit the highest retention rate, confirming their role as persistent\nlearning bottlenecks that require sustained attention. 
In contrast, easy samples show lower and more\nvariable retention, highlighting their context-dependent utility -once learned, they act as reusable\nprimitives that facilitate generalization.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 64, + "total_chunks": 93, + "char_count": 2366, + "word_count": 348, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e730db4a-6f56-4bd0-90f9-e1a0b96274e2", + "text": "This evolving behavior is further quantified in Tab. 11. As\ntraining progresses, both pairwise overlap rates and all-epoch overlap increase substantially, while the\nconsistency ratio improves from 0.68 to 0.88, demonstrating that easy-sample identification becomes\nincreasingly stable over time. This growing stability reinforces that easy samples transition from\nbeing context-sensitive to consolidated, transferable knowledge units. Collectively, these patterns\nvalidate the difficulty estimation mechanism and reveal the curriculum's adaptive nature, where\nchallenging samples persistently push the learning frontier while easier ones consolidate and transfer\nacquired knowledge, enabling efficient and robust representation learning.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 65, + "total_chunks": 93, + "char_count": 737, + "word_count": 91, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "81de0d52-28ba-4f61-a717-ae27e7020b55", + "text": "(a) PVRBench (Outdoor) (b) PVRBench (Indoor)\nRel. 
[Radar-chart axis labels (per-task categories) and legend: Baseline vs. + ROVA (Ours).] Figure 14: Per-task accuracy comparison of Qwen2.5-VL-7B baseline vs. +ROVA on indoor spatial\nreasoning (left) and outdoor urban navigation (right) tasks, where the inner curve denotes the baseline,\nand the outer curve denotes +ROVA. [Radar-chart panels: PVRBench (Indoor), PVRBench (Outdoor).]",
    "paper_id": "2603.10652",
    "title": "Are Video Reasoning Models Ready to Go Outside?",
    "authors": [
      "Yangfan He",
      "Changgyu Boo",
      "Jaehong Yoon"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
    "chunk_index": 66,
    "total_chunks": 93,
    "char_count": 786,
    "word_count": 116,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c1c61f55-03e3-4c85-85e8-efc805c1ccd7",
    "text": "Figure 15: Per-task accuracy comparison of Embodied-R-7B baseline vs. +ROVA on indoor spatial\nreasoning (left) and outdoor urban navigation (right) tasks, where the inner curve denotes the baseline,\nand the outer curve denotes +ROVA. [Radar-chart panels: PVRBench (Indoor), PVRBench (Outdoor).] 
",
    "paper_id": "2603.10652",
    "title": "Are Video Reasoning Models Ready to Go Outside?",
    "authors": [
      "Yangfan He",
      "Changgyu Boo",
      "Jaehong Yoon"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
    "chunk_index": 67,
    "total_chunks": 93,
    "char_count": 482,
    "word_count": 72,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "066a5595-f401-4992-8979-bc76e75f1870",
    "text": "Figure 16: Per-task accuracy comparison of InternVL2.5-8B baseline vs. +ROVA on indoor spatial\nreasoning (left) and outdoor urban navigation (right) tasks, where the inner curve denotes the baseline,\nand the outer curve denotes +ROVA. [Radar-chart panels: PVRBench (Indoor), PVRBench (Outdoor).] Figure 17: Per-task accuracy comparison of Qwen2.5-VL-72B baseline vs. +ROVA on indoor spatial\nreasoning (left) and outdoor urban navigation (right) tasks, where the inner curve denotes the baseline,\nand the outer curve denotes +ROVA. [Radar-chart panels: Qwen3-VL-13B (Indoor), Qwen3-VL-13B (Outdoor).]",
    "paper_id": "2603.10652",
    "title": "Are Video Reasoning Models Ready to Go Outside?",
    "authors": [
      "Yangfan He",
      "Changgyu Boo",
      "Jaehong Yoon"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
    "chunk_index": 68,
    "total_chunks": 93,
    "char_count": 768,
    "word_count": 111,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "14a7defb-596a-4009-b9a4-a21e48ff044b",
    "text": "[Radar-chart axis labels for the per-task accuracy comparison.] 
",
    "paper_id": "2603.10652",
    "title": "Are Video Reasoning Models Ready to Go Outside?",
    "authors": [
      "Yangfan He",
      "Changgyu Boo",
      "Jaehong Yoon"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
    "chunk_index": 69,
    "total_chunks": 93,
    "char_count": 206,
    "word_count": 33,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "1c7382fc-3714-4ef2-b0ad-effa6afd509e",
    "text": "Figure 18: Per-task accuracy comparison of Qwen3-VL-13B baseline vs. +ROVA on indoor spatial\nreasoning (left) and outdoor urban navigation (right) tasks, where the inner curve denotes the baseline,\nand the outer curve denotes +ROVA. Figure 19: Cross-benchmark evaluation on VisBench and UrbanVideo under various perturbation\ntypes. ROVA achieves +14.6% and +12.9% average accuracy gains, respectively, demonstrating\nconsistent cross-benchmark improvements. Table 11: Consistency of easy-sample identification across training epochs. Pairwise: percentage\nof samples identified as easy in both epochs. All-Epoch: percentage identified as easy in all three\nepochs. Consistency: ratio of samples easy in all epochs to those easy in at least one. Step | Ep.1 ∩ Ep.2 (%) | Ep.2 ∩ Ep.3 (%) | Ep.1 ∩ Ep.3 (%) | All-Epoch Ovlp. (%) | Consist. Ratio\n50 | 78.4 | 81.2 | 76.8 | 72.1 | 0.68\n100 | 83.7 | 86.5 | 82.4 | 78.9 | 0.74\n150 | 87.2 | 89.8 | 86.1 | 83.4 | 0.79\n200 | 90.5 | 92.1 | 89.7 | 87.2 | 0.83\n250 | 92.8 | 94.3 | 91.9 | 89.6 | 0.86\n300 | 94.1 | 95.2 | 93.5 | 91.3 | 0.88 F Additional Case Study Qualitative analyses show that ROVA-trained models develop perturbation-aware reasoning: under\ndense fog (Fig. 20), Qwen2.5-VL-7B recognizes fog-induced depth distortion to correctly estimate a\ncrane at over 200m and conservatively limits visibility to 30m, refusing path-continuity assumptions;\nunder heavy snowstorm (Fig. 
21), InternVL2.5-8B chains multi-frame evidence: it tracks vertical edges (Frames 0–16) for building identification, estimates NW-to-SE wind from snow trajectories (Frames 27–38), locates entrances via illuminated ground-floor areas (Frame 50), and selects 2/3 tallest-building altitude by reasoning about upper-frame snow density and obscured building tops (Frames 0, 4). Under sandstorm (Fig. 22), Qwen3-VL-13B shifts from unreliable color cues to structural matching via vertical edge tracking (Frames 0–27) and silhouette cross-referencing to locate the target at 10 o'clock while avoiding a 2 o'clock trap, and infers an easterly headwind from left-to-right sand movement to plan a steeper descent avoiding building turbulence. Under sun glare (Fig. 23), Qwen2.5-VL-7B identifies overexposed regions as sensor artifacts, confirms the target via cross-frame consistency (the glare shifts while the store remains fixed), and plans a southeast descent toward shadowed lower-right regions, away from the glare direction. Without explicit supervision, all models consistently exhibit three emergent behaviors: (1) explicit perturbation identification, naming perturbations in reasoning traces; (2) strategy adaptation, modifying approaches per perturbation type (e.g., switching from color to structural cues); and (3) cross-frame evidence integration, distributing attention across frames to compensate for per-frame information loss. This suggests the dual-branch alignment objective implicitly encourages perturbation-aware meta-reasoning as a byproduct of output-consistency optimization. G Time Complexity Analysis We provide a detailed analysis of the computational cost of ROVA and demonstrate that, despite\nintroducing additional components, the difficulty-aware curriculum significantly reduces the effective\ntraining cost compared to a naïve dual-branch baseline that trains on all samples uniformly. 
G.1 Per-Step Cost Decomposition", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 70, + "total_chunks": 93, + "char_count": 3259, + "word_count": 436, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1c775262-a4ca-4688-8e70-6e9eee2c10e3", + "text": "Let N denote the batch size, Gtotal = G + ˜G = 12 the total group size, T the number of frames, L\nthe maximum sequence length, and Cfwd the cost of a single model forward pass on one video-query\npair. We decompose the per-step cost of each training paradigm.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 71, + "total_chunks": 93, + "char_count": 258, + "word_count": 50, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "178fca1f-6892-4b8e-b54a-75c97f99b8c8", + "text": "Standard GRPO (Baseline). Standard GRPO generates Gtotal rollouts per sample from clean video\nonly and performs one backward pass: CGRPO = N · Gtotal · Cfwd + Cbwd, (11) where Cbwd ≈0.5 · N · Gtotal · Cfwd. The coefficient 0.5 arises from the asymmetry between rollout\ngeneration and gradient computation: during generation, each token is decoded autoregressively,\nrequiring a full forward pass per step; in contrast, the backward pass operates on the already-generated\nsequences in a single teacher-forced forward - backward sweep, which can be fully parallelised across\nall token positions. 
Although the gradient computation itself costs roughly 2× the corresponding forward pass [Griewank and Walther, 2008], the teacher-forced forward is substantially cheaper than\nautoregressive decoding (approximately 1/4 to 1/3 of the total generation cost in our setting due to\nKV-cache reuse and parallel position processing), yielding an effective backward cost of roughly half\nthe total rollout budget. [Footnote 2: We empirically verified this ratio on our 4×A100 setup; the measured backward-to-forward cost ratio was 0.48 ± 0.03 across 300 steps.] Figure 20: Qualitative examples of ROVA-trained Qwen2.5-VL-7B performing depth estimation and\npath continuity reasoning under dense fog conditions. Figure 21: Qualitative examples of ROVA-trained InternVL2.5-8B performing structure recognition\nand visibility-aware altitude control under heavy snowstorm conditions.",
    "paper_id": "2603.10652",
    "title": "Are Video Reasoning Models Ready to Go Outside?",
    "authors": [
      "Yangfan He",
      "Changgyu Boo",
      "Jaehong Yoon"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
    "chunk_index": 72,
    "total_chunks": 93,
    "char_count": 1449,
    "word_count": 206,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "1dcadc20-955a-4955-85c2-aad7072a6921",
    "text": "Figure 22: Qualitative examples of ROVA-trained Qwen3-VL-13B performing landmark matching\nand wind-aware path planning under sandstorm conditions. Figure 23: Qualitative examples of ROVA-trained Embodied-R (Qwen2.5-VL-7B as Vision Language\nModels) performing glare region identification and glare-aware approach planning under strong sun\nglare conditions. 
A straightforward dual-branch approach generates Gtotal rollouts from both\nclean and perturbed videos for every sample, computes alignment rewards, and updates the policy:\nCnaive = N · Cpert (perturbation) + 2N · Gtotal · Cfwd (dual rollout) + 2N · CAPI (alignment reward) + C′bwd (backward), (12)\nwhere Cpert is the per-sample perturbation generation cost, CAPI is the GPT-4o API call latency per\nevaluation, and C′bwd ≈0.5 · 2N · Gtotal · Cfwd reflects the doubled rollout pool entering the backward\npass. ROVA (with difficulty-aware curriculum). ROVA introduces two additional stages—self-reflective assessment and memory re-evaluation—but critically, it also discards a fraction of samples\nfrom training via its difficulty-aware curriculum (Sec. 3.2). Let ρt ∈[0, 1] denote the effective\ntraining ratio at step t, i.e., the fraction of samples that survive curriculum filtering (neither pruned as\nhigh-confidence easy nor deferred as excessively hard). The per-step cost becomes:\nCROVA = N · Cpert (perturbation) + 2N · Gtotal · Cfwd (dual rollout, all N) + N · Cjudge (self-assessment)\n+ 2ρtN · CAPI (alignment, selected) + |Mt| · Cjudge · 1[t mod Tre = 0] (memory re-eval, periodic) + C′′bwd (backward, selected), (13)\nwhere Cjudge ≈0.4 · Cfwd denotes the cost of the self-reflective difficulty assessment (a single forward\npass with a shortened prompt over the perturbed video), |Mt| is the current memory buffer size, and\nTre is the re-evaluation period. Three design choices jointly explain why this formulation leads to a favorable cost–accuracy trade-off\ndespite the added components: (i) Curriculum filtering reduces downstream cost. Although dual rollouts are performed over the\nfull batch of N samples (necessary for the self-assessment stage to observe model behavior before\nfiltering), the expensive alignment reward calls and the backward pass operate only on the ρtN\nselected samples. 
In practice, ρt stabilizes around 0.55–0.65 during training (see Tab. 10, effectively\nhalving the API and gradient costs relative to the naïve dual-branch baseline. (ii) Self-assessment is lightweight. The self-reflective difficulty judgment Cjudge reuses the alreadyloaded model weights and operates on a single truncated prompt per sample, costing only ∼0.4× a\nstandard rollout forward pass. This modest overhead is more than compensated by the downstream\nsavings from filtering: the net cost reduction from discarding (1 −ρt)N samples far exceeds the\nN · Cjudge assessment cost.\n(iii) Memory re-evaluation is amortized.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 73, + "total_chunks": 93, + "char_count": 2915, + "word_count": 435, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d8b36717-c4bf-45f0-ae24-7178d62aaaa6", + "text": "Re-evaluating the memory buffer Mt is the most expensive\nauxiliary operation, as it requires a difficulty re-assessment of all |Mt| stored samples under the\ncurrent policy. We set the re-evaluation period to Tre = 50 steps, which we found to balance freshness\nand overhead: the model's difficulty landscape shifts meaningfully over ∼50 update steps (see Fig. 4),\nwhile more frequent re-evaluation yields diminishing returns at linearly increasing cost. Amortized\nover Tre steps, the per-step memory overhead is only |Mt| · Cjudge/Tre, which constitutes less than\n2% of the total per-step budget in our experiments. Combining these factors, we obtain C′′bwd ≈0.5 · 2ρtN · Gtotal · Cfwd, since only the selected samples\ncontribute to the policy gradient. 
The overall per-step cost of ROVA is thus approximately:
CROVA ≈ (2 + 0.4 + 2ρt) · N · Gtotal · Cfwd + (minor terms), (14)
compared with (2 + 2) · N · Gtotal · Cfwd for the naïve baseline (Eq. 12), yielding a theoretical speedup of 4/(2.4 + 2ρt). At ρt ≈ 0.6, this gives ∼1.11× speedup, consistent with the 1.06× effective speedup measured in Tab. 13 (the small gap is attributable to scheduling and synchronization overhead on our multi-GPU setup). G.2 Amortized Cost Savings from Curriculum The key insight is that the self-assessment overhead is more than compensated by the reduction in downstream computation. Specifically, for each discarded sample, ROVA saves the cost of alignment reward API calls and a portion of the backward pass gradient computation. We formalize this tradeoff below. Proposition 1 (Amortized cost advantage of ROVA). Let ρt denote the effective training ratio at step t, and let ¯ρ = (1/TRL) Σ_{t=1}^{TRL} ρt be the average training ratio over TRL RL steps. Ignoring the amortized memory re-evaluation cost (which occurs every 50 steps), the per-step cost ratio of ROVA relative to naïve dual-branch training satisfies:
CROVA/Cnaive ≈ (2Gtotal · Cfwd + Cjudge + 2¯ρ · CAPI + 1.5¯ρ · Gtotal · Cfwd) / (2Gtotal · Cfwd + 2CAPI + 1.5Gtotal · Cfwd). (15)
When ¯ρ < 1 (i.e., the curriculum discards some fraction of samples) and Cjudge < (1 − ¯ρ)(2CAPI + 1.5Gtotal · Cfwd), then CROVA < Cnaive. Table 12: Effective training ratio ρt and corresponding discard rates over training. "Easy Disc." denotes high-confidence easy samples discarded; "Difficult Def." denotes hard samples deferred to the buffer. Step Easy Disc. (%) Difficult Def. 
(%) Effective ρt Buffer |Mt|",
 "paper_id": "2603.10652",
 "title": "Are Video Reasoning Models Ready to Go Outside?",
 "authors": [
 "Yangfan He",
 "Changgyu Boo",
 "Jaehong Yoon"
 ],
 "published_date": "2026-03-11",
 "primary_category": "",
 "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
 "chunk_index": 74,
 "total_chunks": 93,
 "char_count": 2424,
 "word_count": 398,
 "chunking_strategy": "semantic"
 },
 {
 "chunk_id": "dd33c89e-9e59-47c2-b275-8d777fea576c",
 "text": "0–50 2.1 11.8 0.861 127
50–100 3.8 9.5 0.867 248
100–150 5.4 7.2 0.874 341
150–200 7.1 5.8 0.871 389
200–250 8.6 4.3 0.871 352
250–300 9.8 3.5 0.867 298
Average 6.1 7.0 ¯ρ = 0.869 293",
 "paper_id": "2603.10652",
 "title": "Are Video Reasoning Models Ready to Go Outside?",
 "authors": [
 "Yangfan He",
 "Changgyu Boo",
 "Jaehong Yoon"
 ],
 "published_date": "2026-03-11",
 "primary_category": "",
 "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
 "chunk_index": 75,
 "total_chunks": 93,
 "char_count": 183,
 "word_count": 37,
 "chunking_strategy": "semantic"
 },
 {
 "chunk_id": "b36c2689-7407-4834-b775-8a0bde124c23",
 "text": "For the naïve dual-branch, every sample incurs full rollout, alignment reward, and backward costs. For ROVA, the dual-branch rollout is performed for all N samples (needed for difficulty assessment), but the expensive alignment reward computation (2CAPI per sample) and the backward pass are performed only for the ρtN selected samples. The additional cost is the self-assessment judge call (Cjudge per sample). Substituting and simplifying per sample:
C_naive^per-sample = 2Gtotal·Cfwd + 2CAPI + 1.5Gtotal·Cfwd, (16)
C_ROVA^per-sample = 2Gtotal·Cfwd + Cjudge + 2ρt·CAPI + 1.5ρt·Gtotal·Cfwd. (17)
The saving per sample is: ∆C = (1 − ρt)(2CAPI + 1.5Gtotal·Cfwd) − Cjudge. (18)
This is positive whenever ρt < 1 − Cjudge/(2CAPI + 1.5Gtotal·Cfwd). Empirical training ratio. From the training dynamics shown in Sec. 
3.2, the effective training ratio evolves over training. In early steps, most samples are informative (ρ ≈ 0.90), but as the model improves, more samples are classified as high-confidence easy and discarded. We measure the empirical training ratio across three runs in Tab. 12. With ¯ρ = 0.869, approximately 13.1% of samples are removed from each training step on average (6.1% easy discarded + 7.0% hard deferred). Substituting our measured values (Cjudge ≈ 0.4Cfwd, CAPI ≈ 0.9Cfwd, Gtotal = 12):
CROVA/Cnaive = [24Cfwd + 0.4Cfwd + 2(0.869)(0.9Cfwd) + 1.5(0.869)(12Cfwd)] / [24Cfwd + 2(0.9Cfwd) + 1.5(12Cfwd)] = (24 + 0.4 + 1.56 + 15.64) / (24 + 1.8 + 18) = 41.60 / 43.80 ≈ 0.950. (19)
Thus, ROVA is approximately 5.0% cheaper per step than naïve dual-branch training, despite the additional self-assessment overhead.",
 "paper_id": "2603.10652",
 "title": "Are Video Reasoning Models Ready to Go Outside?",
 "authors": [
 "Yangfan He",
 "Changgyu Boo",
 "Jaehong Yoon"
 ],
 "published_date": "2026-03-11",
 "primary_category": "",
 "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
 "chunk_index": 76,
 "total_chunks": 93,
 "char_count": 1592,
 "word_count": 243,
 "chunking_strategy": "semantic"
 },
 {
 "chunk_id": "4e16a192-1fa9-4b44-aa91-220cd5430bf9",
 "text": "The savings come from avoiding expensive alignment reward API calls and reducing gradient computation for uninformative samples.",
 "paper_id": "2603.10652",
 "title": "Are Video Reasoning Models Ready to Go Outside?",
 "authors": [
 "Yangfan He",
 "Changgyu Boo",
 "Jaehong Yoon"
 ],
 "published_date": "2026-03-11",
 "primary_category": "",
 "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
 "chunk_index": 77,
 "total_chunks": 93,
 "char_count": 128,
 "word_count": 17,
 "chunking_strategy": "semantic"
 },
 {
 "chunk_id": "289e03bb-3c67-4e3e-90f7-b4241e09dc9f",
 "text": "G.3 Wall-Clock Time Measurements To validate the theoretical analysis, we measure actual wall-clock times on our 4× A100 
(80GB)\ntraining setup. Tab. 13 reports per-step and total training times across paradigms. Several observations emerge from Tab. 13.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 78, + "total_chunks": 93, + "char_count": 253, + "word_count": 37, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "dc82bc6a-d4c4-4c1f-b26f-6224a9321d9c", + "text": "First, ROVA (full) requires 403s per step compared to\n428s for naïve dual-branch, achieving a 1.06× wall-clock speedup while delivering +2.3% higher\naccuracy. Second, removing memory re-evaluation saves only 7s per step (since re-evaluation occurs\nevery 50 steps, amortized to ∼7s), confirming that memory management overhead is minimal. Table 13: Wall-clock time comparison across training paradigms on 4× A100 GPUs. Per-step times\nare averaged over 300 RL steps. \"Eff. Speedup\" measures speedup relative to naïve dual-branch. Method Per-Step (s) Total 300 Steps (h) Eff.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 79, + "total_chunks": 93, + "char_count": 572, + "word_count": 84, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9958f5f6-8526-4b70-9eb2-7b162050ff4b", + "text": "Standard GRPO 215 ± 12 17.9 — 33.0\nNaïve Dual-Branch 428 ± 18 35.7 1.00× 36.8\nROVA (full) 403 ± 21 33.6 1.06× 39.1\nw/o memory re-eval 396 ± 19 33.0 1.08× 38.4\nw/o self-assessment 422 ± 17 35.2 1.01× 37.2 Table 14: Component-wise wall-clock timing breakdown per training step for ROVA on 4× A100\nGPUs (N = 4 per GPU, Gtotal = 12). 
Component Time (s) Fraction (%) Parallelizable?
Perturbation generation 8.2 2.0 Yes (CPU)
Clean-branch rollout 142.5 35.4 Yes (GPU 0–1)
Perturbed-branch rollout 142.5 35.4 Yes (GPU 2–3)
Self-reflective assessment 18.6 4.6 Yes (batched)
Alignment reward (API) 38.4 9.5 Yes (async)
Backward pass (selected) 46.8 11.6 No
Memory re-eval (amortized) 6.0 1.5 Yes (batched)
Removing self-assessment entirely increases per-step cost to 422s—only 6s less than naïve dual-branch—because without difficulty-aware filtering, all samples proceed to the expensive alignment reward and backward stages, negating any potential savings and reducing accuracy by 1.9%. Component-wise timing breakdown. We further decompose the per-step time of ROVA in Tab. 14.",
 "paper_id": "2603.10652",
 "title": "Are Video Reasoning Models Ready to Go Outside?",
 "authors": [
 "Yangfan He",
 "Changgyu Boo",
 "Jaehong Yoon"
 ],
 "published_date": "2026-03-11",
 "primary_category": "",
 "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
 "chunk_index": 80,
 "total_chunks": 93,
 "char_count": 1070,
 "word_count": 166,
 "chunking_strategy": "semantic"
 },
 {
 "chunk_id": "8b0c69af-6796-4cd4-bd47-50b4fb4b7187",
 "text": "The dual-branch rollout dominates at 70.8% of total time, confirming that the additional components (self-assessment at 4.6%, memory re-evaluation at 1.5%) introduce marginal overhead. The alignment reward API calls (9.5%) benefit from asynchronous batching; without curriculum-based filtering, this would increase to 9.5/0.869 ≈ 10.9%. G.4 Amortized Memory Re-evaluation Cost Memory re-evaluation occurs every 50 steps, with the buffer containing on average |M| ≈ 293 samples (Tab. 12). Each re-evaluation requires one judge forward pass per buffered sample:
Cre-eval = |M| · Cjudge = 293 × 0.4Cfwd. (20)
Amortized over 50 steps, this contributes (293 × 0.4)/50 ≈ 2.3 Cfwd per step, less than 1% of the total per-step cost. 
Furthermore, approximately 18% of re-evaluated samples are promoted to training\n(classified as informative) and 12% are evicted (classified as easy or exceeding Kmax), confirming\nthat the memory mechanism provides a meaningful stream of recovered training signal at negligible\ncost.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 81, + "total_chunks": 93, + "char_count": 997, + "word_count": 141, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c8664377-d3d1-485d-83bf-333597d2cb07", + "text": "H Analysis of Reward Modeling Design In this section, we provide an in-depth analysis of the reward modeling design in ROVA, discussing\nthe motivation behind our multi-component formulation, its theoretical grounding, the interplay with\nthe difficulty-aware curriculum, and empirical evidence supporting each design choice. H.1 Motivation: Why Multi-Component Rewards? Standard reinforcement learning from human feedback (RLHF) and its variants typically employ a\nsingle scalar reward signal. However, the robustness objective in embodied video reasoning presents multiple, partially orthogonal desiderata: (1) task accuracy, ensuring correct answers; (2) format\ncompliance, maintaining structured output for downstream parsing; and (3) perturbation invariance,\nensuring both final answers and underlying reasoning remain stable under visual corruptions. A single\nscalar reward conflates these objectives, making it difficult for the policy to disentangle which aspect\nof its behavior is being reinforced. Our multi-component reward Rj = rFj + rAccj + rAj addresses\nthis by providing separable gradient signals for each objective. 
To empirically validate this design, we compare our multi-component reward against two alternatives: (1) a single combined reward that merges all components into one scalar via weighted summation before advantage estimation, and (2) an accuracy-only reward that drops the alignment component entirely. The multi-component reward outperforms both alternatives across all metrics, with particularly large gains in reasoning quality (Consistency +0.24, Belief +0.23 over single combined). This confirms that decomposed rewards provide more informative gradient signals. H.2 Alignment Reward: Optimizing Geodesic Distance The alignment reward rAj = αr · ralign,rj + αa · ralign,aj is the central novelty of our reward design. This reward implicitly optimizes a geodesic distance on the statistical manifold at no additional cost. From Output Consistency to Minimizing the Geodesic Path.",
 "paper_id": "2603.10652",
 "title": "Are Video Reasoning Models Ready to Go Outside?",
 "authors": [
 "Yangfan He",
 "Changgyu Boo",
 "Jaehong Yoon"
 ],
 "published_date": "2026-03-11",
 "primary_category": "",
 "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
 "chunk_index": 82,
 "total_chunks": 93,
 "char_count": 2000,
 "word_count": 274,
 "chunking_strategy": "semantic"
 },
 {
 "chunk_id": "72089a0a-3e5a-42a7-a726-7d89cc5ed4e3",
 "text": "I), the KL divergence between induced output distributions π(z) and π(zϕ) is locally equivalent to the squared Fisher–Rao distance on the statistical manifold M. Maximizing the alignment reward drives the policy toward producing identical outputs for clean and perturbed inputs, which—under the Local Proximity Assumption—is equivalent to minimizing the Fisher–Rao distance:
max rAj ⇐⇒ min d²FR(π(z), π(zϕ)) ≈ min DKL(π(z)∥π(zϕ)). (21)
This connection suggests that the alignment reward serves as an informative, difficulty-aware signal within the training dynamics. 
By modulating updates according to sample complexity, it shapes the model's trajectory on the underlying statistical manifold, encouraging stable and generalizable parameter movements while mitigating overfitting. Compared to random sampling, such reward-guided optimization is more likely to follow a favorable geodesic trajectory, ultimately reducing the discrepancy between the probability distributions π(z) and π(zϕ) induced by the original and perturbed data. Asymmetric Weighting Rationale. The asymmetric weighting (αa = 0.7 > αr = 0.3) reflects two key observations. First, answer consistency provides a sharper, lower-variance gradient signal (binary {0, 1}) compared to reasoning consistency (three-tier {0, 0.5, 1}), making it a more reliable optimization target. Second, reasoning traces exhibit higher inherent variability - even for identical inputs, stochastic decoding produces diverse reasoning paths that may differ stylistically while remaining semantically equivalent. Assigning a lower weight to reasoning alignment prevents the reward from penalizing legitimate reasoning diversity while still encouraging core logical consistency. The sensitivity analysis (Tab. 9) confirms that this asymmetric weighting outperforms both symmetric (αr = αa = 0.5, Avg. Acc. 37.8%) and reasoning-dominated (αr > αa) configurations. H.3 Interaction Between Reward Components and Curriculum A key insight of ROVA is that the reward components and the difficulty-aware curriculum are mutually reinforcing. We identify three specific interaction mechanisms. 
Accuracy Reward as Curriculum Bootstrapper.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 84, + "total_chunks": 93, + "char_count": 2182, + "word_count": 295, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d4621491-4c03-475d-a6b9-2beb00c3abcb", + "text": "During early training, rAcc provides the dominant\nlearning signal, enabling the model to acquire basic task competence before the alignment reward\nbecomes informative. This is because alignment requires meaningful outputs on both branches—if the\nmodel cannot solve the task on clean inputs, comparing clean and perturbed outputs is uninformative.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 85, + "total_chunks": 93, + "char_count": 346, + "word_count": 49, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2a7bcfc5-26a6-4ead-8e06-3f0f33c4ed10", + "text": "The curriculum amplifies this effect by initially presenting predominantly easy and medium samples,\nwhere the accuracy reward gradient is strongest. Alignment Reward as Implicit Difficulty Signal. 
The alignment reward also serves as an implicit\ndifficulty indicator that complements the LLM-judge-based assessment.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 86, + "total_chunks": 93, + "char_count": 314, + "word_count": 41, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3f321d94-70a9-40e7-b10e-8857f2d0889a", + "text": "Samples that consistently receive low alignment scores (rAj ≈0) despite high accuracy (rAccj = 1) indicate that the perturbation\ndisrupts reasoning without affecting the final answer - a subtle failure mode that the binary judge\nmay miss. By incorporating rAj into the total reward, such samples receive lower overall rewards,\nnaturally reducing their influence on the policy gradient and preventing the model from learning\nbrittle shortcuts. Format Reward as Training Stabilizer. The format reward rFj , while seemingly trivial, plays a\ncritical stabilization role during early RL training. Without it, the policy may drift toward degenerate\noutputs (e.g., omitting the block) that trivially minimize the alignment penalty by producing\nempty reasoning traces. The format reward ensures structured outputs are maintained throughout\ntraining, preserving the prerequisite for meaningful alignment evaluation. H.4 Comparison with Alternative Reward Designs Beyond the default alignment reward used in ROVA, we explore two principled reward variants that\ntarget specific limitations of the default formulation, aiming to further improve training signal quality. Conditional Alignment Reward. 
A potential failure mode of the default alignment is the "consistently wrong" regime: when the clean branch itself produces an incorrect answer, enforcing consistency with a flawed output may reinforce erroneous reasoning. To address this, we design a conditional variant that modulates the alignment target based on clean-branch correctness. When the clean branch is correct, the perturbed branch is aligned to it as usual; when incorrect, the reward instead encourages the perturbed branch to deviate from the erroneous output and align with the closest correct rollout within the same generation group:",
 "paper_id": "2603.10652",
 "title": "Are Video Reasoning Models Ready to Go Outside?",
 "authors": [
 "Yangfan He",
 "Changgyu Boo",
 "Jaehong Yoon"
 ],
 "published_date": "2026-03-11",
 "primary_category": "",
 "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
 "chunk_index": 87,
 "total_chunks": 93,
 "char_count": 1801,
 "word_count": 259,
 "chunking_strategy": "semantic"
 },
 {
 "chunk_id": "cbda1145-ecff-4111-8d7b-f421e89eea30",
 "text": "rcond = sim(ˆypert, ˆyclean) if ˆyclean = y∗, and rcond = sim(ˆypert, arg min_{yj∈Y+} d(yj, ˆypert)) otherwise, (22)
where Y+ is the set of correct rollouts within the group and d(·, ·) denotes edit distance in the reasoning trace. Step-Level Reasoning Consistency Reward. The default GPT-4o-based evaluation assigns a holistic three-tier score to the entire reasoning trace, which may obscure perturbation-specific failure modes at different reasoning stages. 
To enable finer-grained credit assignment, we decompose each reasoning trace into three atomic stages - visual observation, spatial/temporal reasoning, and action decision - and compute per-stage similarity using a frozen sentence encoder (all-MiniLM-L6-v2):
rstep = Σ_{k∈{obs, reason, act}} βk · cos(e_k^clean, e_k^pert), (23)
where e_k^(·) denotes the frozen encoder embedding for stage k, and βk are stage weights (βobs = 0.3, βreason = 0.5, βact = 0.2). This formulation offers the additional benefit of eliminating GPT-4o API costs for reasoning evaluation, and in principle allows the policy gradient to independently target each failure mode.",
 "paper_id": "2603.10652",
 "title": "Are Video Reasoning Models Ready to Go Outside?",
 "authors": [
 "Yangfan He",
 "Changgyu Boo",
 "Jaehong Yoon"
 ],
 "published_date": "2026-03-11",
 "primary_category": "",
 "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
 "chunk_index": 88,
 "total_chunks": 93,
 "char_count": 1099,
 "word_count": 167,
 "chunking_strategy": "semantic"
 },
 {
 "chunk_id": "60acfc95-339b-4ff7-9dc9-b5ceb347be9d",
 "text": "Experimental Results. We evaluate both variants - as well as their combination - on PVRBench using the Qwen2.5-VL-7B backbone under identical training configurations (Tab. 15). Contrary to our expectations, neither alternative improves upon the default ROVA reward; both lead to consistent degradation across all metrics, with the step-level variant exhibiting the largest drop (−0.02 in Avg. 
Acc., −0.08 in Avg.†).",
 "paper_id": "2603.10652",
 "title": "Are Video Reasoning Models Ready to Go Outside?",
 "authors": [
 "Yangfan He",
 "Changgyu Boo",
 "Jaehong Yoon"
 ],
 "published_date": "2026-03-11",
 "primary_category": "",
 "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
 "chunk_index": 89,
 "total_chunks": 93,
 "char_count": 415,
 "word_count": 60,
 "chunking_strategy": "semantic"
 },
 {
 "chunk_id": "21ffc55b-7cfe-4de5-b5a9-f2e630d02577",
 "text": "Combining both alternatives does not recover the lost performance, suggesting that the two failure modes are compounding rather than complementary. We evaluate both variants and their combination on PVRBench using Qwen2.5-VL-7B under identical training configurations (Tab. 15), finding that neither alternative improves upon the default ROVA reward - both lead to consistent degradation across all metrics, with the step-level variant exhibiting the largest drop (−0.02 in Avg. Acc., −0.08 in Avg.†), and their combination compounds rather than complements the failure modes. Table 15: Comparison of alternative reward designs on PVRBench (Qwen2.5-VL-7B). The default ROVA reward consistently outperforms both alternatives and their combination.
Reward Design Answer Accuracy (Perturbed / Clean) Reasoning Quality (Perturbed / Clean)
Default ROVA .47 .53 2.99 3.52
Conditional Alignment .46 .52 2.95 3.48
Step-Level Consistency .45 .51 2.91 3.45
Cond. + Step-Level .45 .52 2.93 3.46
Three underlying causes explain this negative result: (i) the conditional reward's applicability diminishes rapidly as clean-branch accuracy rises during early training and plateaus at a high level (Fig. 
13), reducing applicable samples to below 20%\nby mid-training, and further degenerating for genuinely difficult samples where all G=12 rollouts\nare incorrect, yielding no corrective signal precisely when most needed; (ii) the step-level reward's\nheuristic segmentation of free-form reasoning traces into three predefined stages introduces substantial\nnoise - particularly for traces interleaving observation and inference - while the frozen sentence\nencoder captures only surface-level lexical similarity lacking GPT-4o's deeper semantic judgment,\ncausing semantically equivalent but lexically divergent paths to receive misleadingly low similarity\nscores that misguide policy updates; and (iii) both alternatives introduce additional stochasticity (Y+\nsampling and edit-distance in conditional alignment, heuristic segmentation boundaries in step-level\nconsistency) that increases reward variance, which in the GRPO framework directly translates to\nnoisier advantage estimates destabilizing policy updates and offsetting any theoretical benefit from\nfiner-grained credit assignment. These findings suggest that for dual-branch alignment, reward\nstability matters more than reward granularity: the default holistic GPT-4o evaluation, while coarser,\nprovides a substantially more stable optimization landscape that best balances informativeness and\noptimization reliability for consistent, monotonic policy improvement.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 90, + "total_chunks": 93, + "char_count": 2598, + "word_count": 340, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0ca40f67-3b26-4a5c-89cc-d4b7b41f7d0e", + "text": "I Theoretical Analysis Geometry of the output space. 
Let (Y, B) be a measurable space and P(Y) the space of probability measures on Y. We consider the statistical manifold
M := {PY |z : z ∈ Z} ⊂ P(Y),
equipped with the Fisher–Rao metric. Let ξ denote the local coordinates on M:
gMξ(u, v) = EY∼pξ[∂uℓ(ξ; Y) ∂vℓ(ξ; Y)], ℓ(ξ; y) = log pξ(y), (24)
where the densities pξ are taken with respect to a dominating measure µ. For convenience, we unify all samples used in training (medium samples and low-confidence easy samples) under the term medium-level samples, and refer to the high-confidence easy samples discarded during training as easy-level samples.",
 "paper_id": "2603.10652",
 "title": "Are Video Reasoning Models Ready to Go Outside?",
 "authors": [
 "Yangfan He",
 "Changgyu Boo",
 "Jaehong Yoon"
 ],
 "published_date": "2026-03-11",
 "primary_category": "",
 "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
 "chunk_index": 91,
 "total_chunks": 93,
 "char_count": 585,
 "word_count": 103,
 "chunking_strategy": "semantic"
 },
 {
 "chunk_id": "8dc0995d-3d86-4e8f-8933-b4c7f5ffc002",
 "text": "Definition of Representations. Let z denote the model representation induced by the original input x, i.e., z = fθ(x). Local Proximity Assumption. We assume that, during stable training steps, the induced output distributions π(z) and π(zϕ) remain sufficiently close such that their discrepancy lies within a locally learnable regime. Formally, there exists ε > 0 such that
DKL(π(z) ∥ π(zϕ)) ≤ ε,
where ε is small enough to ensure that learning dynamics remain within the local trust region of the statistical manifold. Local KL expansion. Let pξ ∈ M be a smooth statistical model with Fisher information I(ξ). For sufficiently small ∆ξ,
DKL(pξ ∥ pξ+∆ξ) ≈ (1/2) ∆ξ⊤I(ξ)∆ξ + o(∥∆ξ∥³).
Thus, in a normal neighborhood of M, KL divergence is locally equivalent to the Fisher information metric. Hence, we can use a local approximation of the KL divergence on the manifold. Model-induced semantic map. The model induces a semantic map π : Z → M defined by π(z) = PY |z. 
Semantic discrepancy between a clean representation z and its perturbed counterpart zϕ is measured on M via their induced distributions π(z) and π(zϕ):
DTV(π(z), π(zϕ)) ≤ √((1/2) · DKL(π(z) ∥ π(zϕ))) (25)
by Pinsker's inequality. Reward-to-KL surrogate. Let r(π(z), π(zϕ)) ∈ [0, 1] be a reward and define the surrogate L(π(z), π(zϕ)) ∝ ψ(r(π(z), π(zϕ))), where ψ is decreasing. Then there exists κ > 0 and a local Lipschitz constant L > 0 such that for all z and zϕ satisfying DKL(π(z)∥π(zϕ)) ≤ κ,",
 "paper_id": "2603.10652",
 "title": "Are Video Reasoning Models Ready to Go Outside?",
 "authors": [
 "Yangfan He",
 "Changgyu Boo",
 "Jaehong Yoon"
 ],
 "published_date": "2026-03-11",
 "primary_category": "",
 "arxiv_url": "http://arxiv.org/abs/2603.10652v1",
 "chunk_index": 92,
 "total_chunks": 93,
 "char_count": 1436,
 "word_count": 231,
 "chunking_strategy": "semantic"
 },
 {
 "chunk_id": "7f649e0b-51b5-4bf1-87a3-2fac8a90da2e",
 "text": "L(π(z), π(zϕ)) ≤ L · DTV(π(z), π(zϕ)) ≤ L · √((1/2) · DKL(π(z)∥π(zϕ))). (26)
(A1) (Local KL–Fisher equivalence). There exist constants 0 < cmin ≤ cmax such that, in a normal neighborhood of the statistical manifold M:
cmin · d²FR ≤ DKL ≤ cmax · d²FR.
(A2) (Trust-region energy dissipation via Medium-first sampling). Let the active difficulty measure for a perturbation ϕ be defined as the semantic KL energy:
Ut(ϕ) := Ez∼pt[DKL(πt(z) ∥ πt(zϕ))].
Medium-difficulty sampling qt restricts the update to a stable trust region on M. 
Unlike random\nsampling, this constraint ensures:", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 93, + "total_chunks": 93, + "char_count": 560, + "word_count": 81, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7e18cda6-9713-4dce-b925-daa9ab552ce7", + "text": "Gradient Alignment: The task gradient ∇θL remains well-aligned with the descent direction of the semantic energy ∇θUt.\n2. Non-vanishing Dissipation: By avoiding the singular regions of \"hard\" samples and\nthe flat regions of \"easy\" samples, the update maintains a strictly positive inner product\n⟨∇θUt, ∇θL⟩> 0. This alignment forces Ut to follow a dissipative path toward the invariant state.", + "paper_id": "2603.10652", + "title": "Are Video Reasoning Models Ready to Go Outside?", + "authors": [ + "Yangfan He", + "Changgyu Boo", + "Jaehong Yoon" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10652v1", + "chunk_index": 94, + "total_chunks": 93, + "char_count": 392, + "word_count": 59, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10661_semantic.json b/data/chunks/2603.10661_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..1c05bd6f3d3890310bc83164fb6a36dae208d630 --- /dev/null +++ b/data/chunks/2603.10661_semantic.json @@ -0,0 +1,1577 @@ +[ + { + "chunk_id": "01da215a-9357-4df1-b13f-32e290b1a913", + "text": "Published as a conference paper at ICLR 2026 FAME: FORMAL ABSTRACT MINIMAL EXPLANATION\nFOR NEURAL NETWORKS Ryma Boumazouza∗1,2, Raya Elsaleh∗3, Melanie Ducoffe1,2, Shahaf Bassan3 and Guy Katz3\n1Airbus SAS, France, 2IRT Saint-Exupery, France, 3The Hebrew University of Jerusalem, Israel We propose FAME 
(Formal Abstract Minimal Explanations), a new class of abductive explanations grounded in abstract interpretation. FAME is the first method\nto scale to large neural networks while reducing explanation size. Our main contribution is the design of dedicated perturbation domains that eliminate the need for\ntraversal order. FAME progressively shrinks these domains and leverages LiRPA-based bounds to discard irrelevant features, ultimately converging to a formal\nabstract minimal explanation. To assess explanation quality, we introduce a procedure that measures the worst-case distance between an abstract minimal explanation and a true minimal explanation. This procedure combines adversarial\nattacks with an optional VERIX+ refinement step. We benchmark FAME against\nVERIX+ and demonstrate consistent gains in both explanation size and runtime\non medium- to large-scale neural networks. 1 INTRODUCTION Figure 1: FAME Framework. The pipeline operates in two main phases: (1) the Abstract Pruning\n(Green) phase leverages abstract interpretation (LiRPA) to simultaneously free a large number of\nirrelevant features (pixels that are certified to have no influence on the model's decision) based\non a batch certificate (Section 4.2). This iterative process operates within a refined, cardinality-constrained perturbation domain, Ωm(x, A) (Eq. 5), to progressively tighten the domain. To ensure that the final explanation is as small as possible, the remaining features that could not be freed\nin batches are tested individually (Section 5). (2) The Exact Refinement (Orange) phase identifies\nthe final necessary features using singleton addition attacks and, if needed, a final run of VERIX+\n(Section 6). The difference in size, |wAXpA⋆| − |AXp|, serves as an evaluation metric of phase 1. Neural network-based systems are being applied across a wide range of domains. 
Given AI tools'\nstrong capabilities in complex analytical tasks, a significant portion of these applications now involves tasks that require reasoning.",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 0,
    "total_chunks": 75,
    "char_count": 2328,
    "word_count": 316,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c302d25a-d94a-4a57-aa53-72e702a5c858",
    "text": "These tools often achieve impressive results in problems requiring intricate analysis to reach correct conclusions. Despite these successes, a critical challenge\nremains: understanding the reasoning behind neural network decisions. The internal logic of a neural\nnetwork is often opaque, with its conclusions presented without accompanying justifications. This\nlack of transparency undermines the trustworthiness and reliability of neural networks, especially in\nhigh-stakes or regulated environments. Consequently, the need for interpretable and explainable AI\n(XAI) has become a growing focus in recent research. Two main approaches have emerged to address this challenge. The first employs statistical and\nheuristic techniques to infer explanations based on the network's internal representations (Fel et al.,\n2022). While these methods estimate feature importance, they require empirical evaluation (such as\nwith the µ-Fidelity metric (Bhatt et al., 2020)). The second approach leverages automated reasoners and\nformal verification to provide provably correct explanations grounded in logical reasoning. 
We ground our work in the formal definition of Abductive Explanations (AXp) (Ignatiev et al.,\n2019), a concept belonging to the broader family of \"formal XAI\" which includes minimal explanations, also known as local-minimal, minimal unsatisfiable subsets (Marques-Silva, 2010) and prime\nimplicants (Shih et al., 2018). An AXp is a subset of features guaranteed to maintain the model's\nprediction under any perturbation within a defined domain. In a machine learning context, these\nexplanations characterize feature sets where removing any single feature invalidates the guarantee,\neffectively representing subsets that preserve the decision's robustness. However, a major hurdle for\nformal XAI is its high computational cost due to the complexity of reasoning, preventing it from\nscaling to large neural networks (NNs) (Marques-Silva, 2023b). This limitation, combined with the\nscarcity of open-source libraries, significantly hinders its adoption. Initial hybrid approaches, such\nas the EVA method (Fel et al., 2023), have attempted to combine formal and statistical methods, but\nthese often fail to preserve the mathematical properties of the explanation.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 1, + "total_chunks": 75, + "char_count": 2301, + "word_count": 320, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d533bf5a-c386-4923-8bc0-cefce02bec60", + "text": "However, robustnessbased approaches address the scalability challenges of formal XAI for NN by leveraging a fundamental connection between AXps and adversarial examples (Huang & Marques-Silva, 2023). 
In this work, we present FAME, a scalable framework for formal XAI that addresses the core limitations of existing methods. Our contributions are fourfold:", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 2, + "total_chunks": 75, + "char_count": 355, + "word_count": 50, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8f60d38a-6887-4439-b97e-6be630fb0716", + "text": "• Formal abstract explanations. We introduce the first class of abductive explanations\nderived from abstract interpretation, enabling explanation algorithms to handle highdimensional NNs. • Eliminating traversal order. We design perturbation domains and a recursive refinement\nprocedure that leverage Linear Relaxation based Perturbation Analysis (LiRPA)-based certificates to simultaneously discard multiple irrelevant features. This removes the sequential\nbottleneck inherent in prior work and yields an abstract minimal explanation. • Provable quality guarantees.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 3, + "total_chunks": 75, + "char_count": 566, + "word_count": 70, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "434c1937-94b3-4490-9ab7-8b74b2603146", + "text": "We provide the first procedure to measure the worst-case\ngap between abstract minimal explanations and true minimal abductive explanations, combining adversarial search with optional VERIX+ refinement. • Scalable evaluation. 
We benchmark FAME on medium- and large-scale neural networks,\nshowing consistent improvements in both explanation size and runtime over VERIX+. Notably, we produce the first abstract formal abductive explanations for a ResNet architecture\non CIFAR-10, demonstrating scalability where exact methods become intractable. 2 ABDUCTIVE EXPLANATIONS & VERIFICATION", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 4, + "total_chunks": 75, + "char_count": 582, + "word_count": 76, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "342d3e5c-61bd-4304-b5ab-3c8a253ffd63", + "text": "Scalars are denoted by lower-case letters (e.g., x), and the set of real numbers by R. Vectors are\ndenoted by bold lower-case letters (e.g., x), and matrices by upper-case letters (e.g., W). The i-th\ncomponent of a vector x (resp. line of a matrix W) is written as xi (resp. The matrix W ≥0\n(resp. W ≤0) represents the same matrix with only nonnegative (resp. nonpositive) weights.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 5, + "total_chunks": 75, + "char_count": 381, + "word_count": 67, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c7de869a-ac6d-4628-9dcf-b8d5c5833213", + "text": "Sets are\nwritten in calligraphic font (e.g., S). We denote the perturbation domain by Ωand the property to\nbe verified by P. 
2.2 THE VERIFICATION CONTEXT", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 6, + "total_chunks": 75, + "char_count": 153, + "word_count": 26, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "11243a60-bfbe-40bd-9854-8374b14f5108", + "text": "We consider a neural network as a function f : Rn →Rk. The core task of verification is to\ndetermine whether the network's output f(x′) satisfies a given property P for every possible input\nx′ within a specified domain Ω(x) ⊆Rn. When verification fails, it means there is at least one input\nx′ in the domain Ω(x) that violates the property P (a counterexample). The verification task can be\nwritten as: ∀x′ ∈Ω(x), does f(x′) satisfy P? This requires defining two components: Published as a conference paper at ICLR 2026 The Perturbation Domain (Ω): This domain defines the set of perturbations. It is often\nan lp-norm ball around a nominal input x, such as an l∞ball for modeling imperceptible\nnoise: Ω= {x′ ∈Rn | ∥x −x′∥∞≤ϵ}.\n2. The Property (P): This is the specification the network must satisfy. For a classification\ntask where the network correctly classifies an input x into class c, the standard robustness\nproperty P asserts that the logit for class c remains the highest for any perturbed input x′:\nP(x′) ≡min {fc(x′) −fi(x′)} > 0 (1)\ni̸=c For instance, given an MNIST image x of a '7' and a perturbation radius ϵ, the property P holds if the\nnetwork's logit for class '7' provably exceeds all other logits for every perturbed image x′ ∈Ω(x). A large body of work has investigated formal verification of NNs, with adversarial robustness being the most widely studied property (Urban & Min´e, 2021). 
Numerous verification tools are now\navailable off-the-shelf, and for piecewise-linear models f with corresponding input domains and\nproperties, exact verification is possible (Katz et al., 2017; Botoeva et al., 2020). In practice, however, exact methods quickly become intractable for realistic networks. To address this, we rely on\nAbstract Interpretation, a theory of sound approximation.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 7, + "total_chunks": 75, + "char_count": 1798, + "word_count": 296, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "35af7788-bd97-4361-adcb-53f99c1ea9e6", + "text": "Specifically, we utilize Linear Relaxationbased Perturbation Analysis (LiRPA) (Zhang et al., 2018; Singh et al., 2019) which efficiently overapproximates the network's output by enclosing it between linear upper and lower bounds. Such\nabstractions enable sound but conservative verification: if the relaxed property holds, the original\none is guaranteed to hold as well. 
We provide a comprehensive background in Appendix A.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 8, + "total_chunks": 75, + "char_count": 423, + "word_count": 60, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "df11f708-9d82-43e6-87a4-f5850792c5af", + "text": "2.3 ABDUCTIVE EXPLANATIONS: PINPOINTING THE \"WHY\" Understanding Model Robustness with Formal Explanations: Neural networks often exhibit\nsensitivity to minor input perturbations, a phenomenon that certified training can mitigate but not\neliminate (De Palma et al., 2025). Even robustly trained models may only have provably safe regions\nspanning a few pixels for complex tasks like ImageNet classification (Serrurier et al., 2021). To build\nmore reliable systems, it is crucial to understand why a model's prediction is robust (or not) within\na given context. Formal explainability provides a rigorous framework for this analysis. We focus on abductive explanations (AXps, also called distance-restricted explanations (ϵ-AXp))\n(Ignatiev et al., 2019; Huang & Marques-Silva, 2023), which identify a subset of input features that\nare sufficient to guarantee that the property P holds. Formally, a local formal abductive explanation\nis defined as a subset of input features that, if collapsed to their nominal values (i.e., the sample x),\nensure that the local perturbation domain Ωsurrounding the sample contains no counterexamples. Definition 2.1 (Weak Abductive Explanation (wAXp) ). Formally, given a triple (x, Ω, P), an explanation is a subset of feature indices X ⊆F = {1, . . . , n} such that\nwAXp: ∀x′ ∈Ω(x), ^ (x′ i = xi) =⇒f(x′) |= P. 
(2)\ni∈X While many such explanations may exist (the set of all features F is a trivial one), the most useful\nexplanations are the most concise ones (Bassan & Katz, 2023). We distinguish three levels:", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 9, + "total_chunks": 75, + "char_count": 1542, + "word_count": 243, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9eceaa2c-4574-4231-999b-a1ed34464e43", + "text": "Minimal Explanation: An explanation X is minimal if removing any single feature from it would\nbreak the guarantee (i.e., X \\ {j} is no longer an explanation for any j ∈X). These are also known\nas minimal unsatisfiable subsets(Ignatiev et al., 2016; Bassan & Katz, 2023). Minimum Explanation: An explanation X is minimum if it has the smallest possible number of\nfeatures (cardinality) among all possible minimal explanations. The concept of an abductive explanation is illustrated using a classification task (details in Appendix\nD.1, Figure 4). The goal is to find a minimal subset of fixed features (X) that guarantees a sample's\nclassification within its perturbation domain. For the analyzed sample, fixing x2 alone is insufficient\ndue to the existence of a counterexample (Figure 5). However, fixing the set X = {x2, x3} creates a\n'safe' subdomain without counterexamples, confirming it is an abductive explanation. This explanation is minimal (neither x2 nor x3 work alone) but not minimum in cardinality, as X ′ = {x1} is also\na valid minimal explanation. In the rest of this paper, we will use the terms abductive explanation or\nformal explanation and the notation wAXp to refer to Definition 2.1. 
Published as a conference paper at ICLR 2026 Substantial progress has been made in the practical efficiency of computing formal explanations. While finding an abductive explanation (AXp) is tractable for some classifiers (Marques-Silva,\n2023a; Darwiche & Ji, 2022; Huang et al., 2022; 2021; Izza et al., 2020; Marques-Silva et al.,\n2020; 2021), it becomes computationally hard for complex models like random forests and neural\nnetworks (Ignatiev & Marques-Silva, 2021; Izza & Marques-Silva, 2021). To address this inherent\ncomplexity, these methods typically encode the problem as a logical formula, leveraging automated\nreasoners like SAT, SMT, and Mixed Integer Linear Programming (MILP) solvers (Audemard et al.,\n2022; Ignatiev, 2020; Ignatiev et al., 2022; Ignatiev & Marques-Silva, 2021; Izza & Marques-Silva,\n2021) . Early approaches, such as deletion-based (Chinneck & Dravnieks, 1991) and insertion-based\n(de Siqueira, 1988) algorithms, are inherently sequential, thus requiring an ordering of the input features traditionally denoted as traversal ordering. They require a number of costly verification calls\nlinear with the number of features, which prevents effective parallelization.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 10, + "total_chunks": 75, + "char_count": 2400, + "word_count": 365, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8d45cc40-a7ae-4936-8cb6-8c054e2b2702", + "text": "As an alternative, surrogate models have been used to compute formal explanations for complex models (Boumazouza\net al., 2021; 2023), but the guarantee does not necessary hold on the original model. 
Recent work aims to break the sequential bottleneck, by linking explainability to adversarial robustness and formal verification. DistanceAXp (Huang & Marques-Silva, 2023; La Malfa et al., 2021)\nis a key example, aligning with our definition of AXp and enabling the use of verification tools. The latest literature focuses on breaking the sequential bottleneck using several strategies that include parallelization. This is achieved either by looking for several counterexamples at once (Izza\net al., 2024; Bassan & Katz, 2023; La Malfa et al., 2021; Bassan et al., 2023) or by identifying a\nset of irrelevant features simultaneously, as seen in VERIX (Wu et al., 2023), VERIX+ (Wu et al.,\n2024b), and prior work (Bassan & Katz, 2023). For instance, VERIX+ introduced stronger traversal\nstrategies to alleviate the sequential bottleneck. Their binary search approach splits the remaining\nfeature set and searches for batches of consecutive irrelevant features, yielding the same result as sequential deletion but with fewer solver calls. They also adapted QuickXplain (Junker, 2004), which\ncan produce even smaller explanations at the cost of additional runtime by verifying both halves. Concurrently, (Bassan & Katz, 2023) proposed strategies like the singleton heuristic to reuse verification results and derived provable size bounds, but their approach remains significantly slower than\nVERIX+ and lacks publicly available code. The identified limitations are twofold. First, existing methods rely heavily on exact solvers such as\nMarabou (Katz et al., 2019; Wu et al., 2024a), which do not scale to large NNs and are restricted to\nCPU execution. Recent verification benchmarks (Brix et al., 2023; Ducoffe et al., 2024; Zhao et al.,\n2022) consistently demonstrate that GPU acceleration and distributed verification are indispensable\nfor achieving scalability. Second, these approaches critically depend on traversal order. 
As shown\nin VERIX, the chosen order of feature traversal strongly impacts both explanation size and runtime. Yet, determining an effective order requires prior knowledge of feature importance, precisely the\ninformation that explanations are meant to uncover, thus introducing a circular dependency.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 11, + "total_chunks": 75, + "char_count": 2423, + "word_count": 358, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "de7a5321-1d48-498d-b411-b11558764efc", + "text": "Nevertheless, VERIX+ currently represents the SOTA for abductive explanations in NNs, achieving the\nbest trade-off between explanation size and computation time. Our work builds on this foundation by directly addressing the sequential bottleneck of formal explanation without requiring a traversal order, a first in formal XAI. We demonstrate that leveraging\nincomplete verification methods and GPU hardware is essential for practical scalability. Our approach offers a new solution to the core scalability issues, complementing other methods that aim to\nreduce explanation cost through different means (Bassan et al., 2025b;a). 4 FAME: FORMAL ABSTRACT MINIMAL EXPLANATION In this section, we introduce FAME, a framework that builds abstract abductive explanations (Definition 4.1). FAME proposes novel strategies to provide sound abstract abductive explanations\n(wAXpA) such as an Abstract Batch Certificate using Knapsack formulation, and a Recursive Refinement, relying on raw bounds provided by a formal framework (we use LiRPA in this paper). Definition 4.1 (Abstract Abductive Explanation (wAXpA)). 
Formally, given a triple (x, Ω, P), an\nabstract abductive explanation is a subset of feature indices X A ⊆ F = {1, . . . , n} such that, under",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 12,
    "total_chunks": 75,
    "char_count": 1246,
    "word_count": 184,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "4983f57d-3067-40d2-9382-9f87027b1e75",
    "text": "an abstract interpretation f̄ of the model f, the following holds:\nwAXpA : ∀x′ ∈ Ω(x), ∧_{i∈X A} (x′i = xi) =⇒ f̄(x′) |= P. (3)\nHere, f̄ = LiRPA(f, Ω) denotes the sound over-approximated bounds of the model outputs on the\ndomain Ω, as computed by the LiRPA method. If Eq. (3) holds, any feature outside X A can be\nconsidered irrelevant with respect to the abstract domain. This ensures that the concrete implication\nf(x′) |= P also holds for all x′ ∈ Ω. In line with the concept of abductive explanations, we define\nan abstract minimal explanation (wAXpA⋆) as an abstract abductive explanation, i.e., a set of features\nX A from which no feature can be removed without violating Eq. (3). Due to the over-approximation, as detailed in Section 2.2, any abstract abductive explanation is a\nweak abductive explanation for the model f. 
In the following we present the first steps described in\nFigure 1 to build such a wAXpA.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 13, + "total_chunks": 75, + "char_count": 949, + "word_count": 168, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "04c97e9b-3ea6-450a-af0d-dc63cabbf0e5", + "text": "4.1 THE ASYMMETRY OF PARALLEL FEATURE SELECTION In the context of formal explanations, adding a feature means identifying it as essential to a model's\ndecision (causes the model to violate the desired property P), so its value must be fixed in the explanation.Conversely, freeing a feature means identifying it as irrelevant, allowing it to vary without\naffecting the prediction. A key insight is the asymmetry between these two actions: while adding\nnecessary features can be parallelized naively, freeing features cannot due to complex interactions. Proposition 4.1 (Simultaneous Freeing). it is unsound to free multiple features at once based only\non individual verification as two features may be individually irrelevant yet jointly critical. Parallelizing feature freeing based on individual verification queries is unsound due to hidden feature dependencies that stem from treating the verifier as a simple binary oracle (SAT/UNSAT; see\nAppendix A for formal definitions) (Proposition 4.1). To solve this, we introduce the Abstract Batch\nCertificate Φ(A) (Definition 4.2). Unlike naive binary checks, Φ(A) leverages abstract interpretation to compute a joint upper bound on the worst-case contribution of the entire set A simultaneously. If Φ(A) ≤0, it mathematically guarantees that simultaneously freeing A is sound, explicitly\naccounting for their combined interactions. 
The formal propositions detailing this asymmetry are\nprovided in Appendix B. 4.2 ABSTRACT INTERPRETATION FOR SIMULTANEOUS FREEING Standard solvers act as a \"binary oracle\" and their outcomes (SAT/UNSAT) are insufficient to certify\nbatches of features for freeing without a traversal order. This is because of feature dependencies\nand the nature of the verification process. We address this by leveraging inexact verifiers based\non abstract interpretation (LiRPA) to extract proof objects (linear bounds) that conservatively track\nthe contribution of any feature set. Specifically, we use CROWN (Zhang et al., 2018) to define an\nabstract batch certificate Φ in Definition 4.2. If one succeeds in freeing a set of features A given Φ,\nwe denote such an explanation as a formal abstract explanation that satisfies Proposition 4.2.",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 14,
    "total_chunks": 75,
    "char_count": 2209,
    "word_count": 327,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b9bdfb97-ccbd-47c2-9c25-a78eb1960ea4",
    "text": "Let A be a set of features and Ω any perturbation domain. The abstract batch certificate is defined as:\nΦ(A; Ω) = max_{i≠c} ( bi(x) + Σ_{j∈A} ci,j ),\nwhere the baseline bias (worst-case margin of the model's output) at x is bi(x) = Wi · x + wi, and the contribution of each feature j ∈ A is ci,j = max{ Wj^{i,≥0} (x̄j − xj), Wj^{i,≤0} (xj − x̄j) },\nwith x̄j = max{x′j : x′ ∈ Ω(x)} and xj = min{x′j : x′ ∈ Ω(x)}. 
The weights Wi and biases\nwi are obtained from LiRPA bounds, which guarantee for each target class i ≠ c, with c being the\nground-truth class:\n∀x′ ∈ Ω(x), fi(x′) − fc(x′) ≤ fi,c(x′) = Wi · x′ + wi. Proposition 4.2 (Batch-Certifiable Freeing). If Φ(A; Ω) ≤ 0, then F \\ A is a weak abductive\nexplanation (wAXp). If Φ(A) ≤ 0, freeing all features in A is sound; that is, the property P holds for every\nx′ ∈ Ω(x) with {x′k = xk}k∈F\\A. The proof of Proposition 4.2 is given in Appendix B. The trivial case A = ∅ always satisfies the\ncertificate, but our goal is to efficiently certify large feature sets. The abstract batch certificate also\nhighlights two extreme scenarios. 
4.3 MINIMIZING THE SIZE OF AN ABSTRACT EXPLANATION VIA A KNAPSACK\nFORMULATION Between the trivial and degenerate cases lies the nontrivial setting: finding a maximal set of irrelevant features A to free given the abstract batch certificate Φ. Let F denote the index set of features. Maximizing |A| can be naturally formulated as a 0/1 Multidimensional Knapsack Problem (MKP). For each feature j ∈ F, we introduce a binary decision variable yj indicating whether the feature is\nselected. The optimization problem then reads:\nmax Σ_{j∈F} yj s.t. Σ_{j∈F} ci,j yj ≤ −bi(x), i ∈ I, i ≠ c (4)\nwhere ci,j represents the contribution of feature j to constraint i, and −bi(x) is the corresponding\nknapsack capacity. The complexity of this MKP depends on the number of output classes. For binary\nclassification (k = 2), the problem is linear¹. In the standard multiclass setting (k > 2), however,\nthe MKP is NP-hard. While moderately sized instances can be solved exactly using a MILP solver,\nthis approach does not scale to large feature spaces.",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 17,
    "total_chunks": 75,
    "char_count": 1567,
    "word_count": 264,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8e684d34-924c-442d-b00a-206947651733",
    "text": "To ensure scalability, we propose a simple\nand efficient greedy heuristic, formalized in Algorithm 1. Rather than solving the full MKP, the\nheuristic iteratively selects the feature j⋆ that is least likely to violate any of the k − 1 constraints, by\nminimizing the maximum normalized cost across all classes. An example is provided in Appendix\nD.2. 
This procedure is highly parallelizable, since all costs can be computed simultaneously.",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 18,
    "total_chunks": 75,
    "char_count": 435,
    "word_count": 67,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5c0ad101-b8d7-4267-89af-fabea394225b",
    "text": "While\nsuboptimal by design, it produces a set A such that Φ(A; Ω) ≤ 0. A key advantage of this greedy\nbatch approach is its computational efficiency. The cost is dominated by the computation of feature\ncontributions ci,j. This requires a single backward pass through the abstract network, which has\na complexity of O(L · N) (where L is depth and N is neurons) and is highly parallelizable on\nGPUs. In contrast, exact solvers require solving an NP-hard problem for each feature or batch. In\nSection 7, we compare the performance of this greedy approach against the optimal MILP solution,\ndemonstrating that it achieves competitive results with dramatically improved scalability. Algorithm 1 Greedy Abstract Batch Freeing (One Step)\n1: Input: model f, perturbation domain Ωm, candidate set F\n2: Initialize: A ← ∅, linear bounds {Wi, wi} = LiRPA(f, Ωm(x))\n3: Compute ci,j in parallel\n4: while Φ(A) ≤ 0 and |F| > 0 do\n5: pick j⋆ = arg min_{j∈F\\A} max_{i≠c} ci,j/(−bi) ▷ Parallel reduction\n6: if Φ(A ∪ {j⋆}) ≤ 0 and |A| ≤ m then\n7: A ← A ∪ {j⋆}\n8: end if\n9: F ← F \\ {j⋆} ▷ Remove candidate\n10: end while\n11: Return: A\n¹ It can be solved optimally in O(n) time by sorting features by ascending contribution c1,j and greedily\nadding them until the capacity is exhausted. 
Published as a conference paper at ICLR 2026\n5 REFINING THE PERTURBATION DOMAIN FOR ABDUCTIVE EXPLANATION\nPrevious approaches for batch freeing reduce the perturbation domain using a traversal order π, defining Ωπ,i(x) = {x′ ∈ Rn : ∥x − x′∥∞ ≤ ϵ, x′_{π_{i:}} = x_{π_{i:}}}, i.e., all coordinates after position i in the order π are held fixed. These methods only consider freeing dimensions up to a certain order. However, as discussed previously, determining an effective order requires prior knowledge of feature importance, the very information that explanations aim to uncover, introducing a circular dependency. This reliance stems from the combinatorial explosion: the number of possible subsets of input features grows exponentially, making naive enumeration of abstract domains intractable. To address this, we introduce a new perturbation domain, denoted the cardinality-constrained perturbation domain.",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 19,
    "total_chunks": 75,
    "char_count": 2078,
    "word_count": 333,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c9003cec-0f9d-41a8-9621-c4d38dd7bbf5",
    "text": "For instance, one can restrict to ℓ0-bounded perturbations:\nΩm(x) = {x′ ∈ Rn : ∥x − x′∥∞ ≤ ϵ, ∥x − x′∥0 ≤ m},\nwhich ensures that at most m features may vary simultaneously. This concept is closely related to the ℓ0 norm and has been studied in verification (Xu et al., 2020), but, to the best of our knowledge, it is applied here for the first time in the context of abductive explanations. The greedy procedure in Algorithm 1 can then certify a batch of irrelevant features A under this domain. 
Once a set A is freed, the feasible perturbation domain becomes strictly smaller, enabling tighter bounds and the identification of additional irrelevant features. We formalize this as the refined abstract domain that ensures that at most m features can vary in addition to the set of previously selected ones A:\nΩm(x; A) = {x′ ∈ Rn : ∥x − x′∥∞ ≤ ϵ, ∥x_{F\A} − x′_{F\A}∥0 ≤ m}. (5)\nBy construction, Ωm(x; A) ⊆ Ω_{m+|A|}(x), so any free set derived from Ωm(x; A) remains sound for the original budget m + |A|. Recomputing linear bounds on this tighter domain often yields strictly smaller abstract explanations. This refinement naturally suggests a recursive strategy: after one round of greedy batch freeing, we restrict the domain to Ωm(x; A), recompute LiRPA bounds, and reapply Algorithm 1 for m = 1, ..., |F \ A|. Unlike the static traversal of prior work (e.g., VERIX+), FAME employs a dynamic, cost-based selection by re-evaluating abstract costs c_{i,j} at each recursive step. This process functions as an adaptive abstraction mechanism: iteratively enforcing cardinality constraints tightens the domain, reducing LiRPA's over-approximation error and enabling the recovery of additional freeable features initially masked by loose bounds. As detailed in Algorithm 2, this process can be iterated, progressively shrinking the domain and expanding A. In practice, recursion terminates once no new features can be freed. 
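The refined domain of Eq. (5) is easy to state as a membership test. The sketch below is illustrative (the name in_refined_domain is hypothetical): a point x′ belongs to Ωm(x; A) when it stays inside the ℓ∞ ball of radius ϵ and at most m coordinates outside the already-freed set A differ from x.

```python
def in_refined_domain(x, x2, eps, m, A, tol=1e-9):
    # Membership test for x2 in Omega_m(x; A) from Eq. (5):
    #  - every coordinate within eps (l-infinity ball), and
    #  - at most m changed coordinates outside the freed set A.
    assert len(x) == len(x2)
    if any(abs(a - b) > eps + tol for a, b in zip(x, x2)):
        return False
    changed_outside = sum(1 for j, (a, b) in enumerate(zip(x, x2))
                          if j not in A and abs(a - b) > tol)
    return changed_outside <= m
```

Note that any x′ accepted here also lies in Ω_{m+|A|}(x), which is the soundness inclusion Ωm(x; A) ⊆ Ω_{m+|A|}(x) used in the text.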
Finally, any remaining candidate features can be tested individually using the binary search approach proposed by VeriX+ but replacing Marabou by CROWN (see Algorithm 5).",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 20,
    "total_chunks": 75,
    "char_count": 2068,
    "word_count": 331,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "16f205b9-d834-48d3-9e49-f417a525330d",
    "text": "This final step ensures that we obtain a formal abstract minimal explanation, as defined in Definition 4.1.\nAlgorithm 2 Recursive Abstract Batch Freeing\n1: Input: model f, input x, candidate set F\n2: Initialize: A ← ∅ ▷ certified free set\n3: repeat\n4: Abest ← ∅\n5: for m = 1, ..., |F \ A| do\n6: Am ← GREEDYABSTRACTBATCHFREEING(f, Ωm(x; A), F \ A)\n7: if |Am| > |Abest| then\n8: Abest ← Am\n9: end if\n10: end for\n11: A ← A ∪ Abest\n12: until Abest = ∅\n13: A = ITERATIVE SINGLETON FREE(f, x, F, A) ▷ refine by testing remaining features\n14: Return: A\n6 DISTANCE FROM ABSTRACT EXPLANATION TO MINIMALITY\nAlgorithm 2 returns a minimal abstract explanation: with respect to the chosen LiRPA relaxation, the certified free set A cannot be further enlarged. This guarantee is strictly weaker than minimality. The remaining features may still include irrelevant coordinates that abstract interpretation fails to certify, due to the coarseness of the relaxation. 
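The control flow of Algorithm 2 can be sketched as a small driver loop. This is a structural sketch only: the callable greedy_free stands in for GREEDYABSTRACTBATCHFREEING together with the LiRPA bound recomputation on Ωm(x; A), and its signature is an assumption made for illustration.

```python
def recursive_batch_free(candidates, greedy_free):
    # candidates: index set F of features to try to free
    # greedy_free(m, A, remaining) -> set of newly freeable features
    #   certified under the refined domain Omega_m(x; A); a stand-in
    #   for Algorithm 1 run on recomputed LiRPA bounds.
    A = set()
    while True:
        best = set()
        remaining = candidates - A
        # Sweep the cardinality budget m = 1 .. |F \ A| and keep the
        # largest batch that can be freed in this round.
        for m in range(1, len(remaining) + 1):
            Am = greedy_free(m, A, remaining)
            if len(Am) > len(best):
                best = Am
        if not best:
            return A  # recursion terminates: no new features freed
        A |= best
```

Each round shrinks the domain, so later rounds may free features that looser bounds initially masked; a final singleton pass then handles whatever remains.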
In other words, minimality is\nrelative to the verifier: stronger but more expensive verifiers (e.g., Verix+ with Marabou) are still\nrequired to converge to a true minimal explanation.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 21, + "total_chunks": 75, + "char_count": 1167, + "word_count": 199, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5588ec73-8c0b-4e71-b3b0-752935f75ca4", + "text": "We achieve this via a two-phase pipeline (Figure 1). Phase 1 (Abstract Pruning) generates a\nsound abstract explanation wAXpA⋆. Phase 2 (Exact Refinement) minimizes this candidate using\nVERIX+, ensuring the final output is guaranteed minimal. The gap arises from the tradeoff between\nverifier accuracy and domain size. Abstract methods become more conservative as the perturbation\ndomain grows, while exact methods remain sound but scale poorly. This motivates hybrid strategies\nthat combine fast but incomplete relaxations with targeted calls to exact solvers. As an additional\nacceleration step, adversarial attacks can be used. By Lemma B.1, if attacks identify features that\nmust belong to the explanation, they can be added simultaneously (see Algorithm 4). Unlike abstract\ninterpretation, the effectiveness of adversarial search typically increases with the domain size: larger\nregions make it easier to find counterexamples. Towards minimal explanations. In formal XAI, fidelity is a hard constraint guaranteed by the\nverifier. Therefore, the explanation cardinality (minimality) becomes the only metric to compare\nformal abductive explanations. A smaller explanation is strictly better, provided it remains sufficient. 
Our strategy is to use the minimal abstract explanation (wAXpA⋆) as a starting point, and then search for the closest minimal explanation. Concretely, we aim to identify the largest candidate set of potentially irrelevant features that, if freed together, would allow all remaining features to be safely added to the explanation at once.",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 22,
    "total_chunks": 75,
    "char_count": 1563,
    "word_count": 226,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a6a65243-503c-4e56-8a6f-047329cf3725",
    "text": "A good traversal order of the candidate space is crucial here, as it determines how efficiently such irrelevant features can be pinpointed. Formally, if X_A denotes the minimal abstract explanation and X_{A⋆} the closest minimal explanation, we define the absolute distance to minimality as the number of irrelevant features not captured by the abstract method: d(X_A, X_{A⋆}) = |X_A \ X_{A⋆}|.\nTo evaluate the benefits and reliability of our proposed explainability method, FAME, we performed a series of experiments comparing its performance against the SoTA VERIX+ implementation. We assessed the quality of the explanations generated by FAME by comparing them to those of VERIX+ across four distinct models, including both fully connected and convolutional neural networks (CNNs). We considered two primary performance metrics: the runtime required to compute a single explanation and the size (cardinality) of the resulting explanation. Our experiments, as in VERIX+ (Wu et al., 2024b), were conducted on two widely-used image classification datasets: MNIST (Yann, 2010) and GTSRB (Stallkamp et al., 2012). 
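The distance-to-minimality metric defined above is just the size of a set difference; a minimal sketch (the helper name distance_to_minimality is hypothetical):

```python
def distance_to_minimality(abstract_axp, minimal_axp):
    # d(X_A, X_A*) = |X_A \ X_A*|: the number of irrelevant features
    # that the abstract method failed to discard from its explanation.
    return len(set(abstract_axp) - set(minimal_axp))
```

A distance of 0 means the abstract explanation is already minimal with respect to the exact verifier.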
Each score was averaged over the non-robust samples among the 100 samples of each dataset.",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 23,
    "total_chunks": 75,
    "char_count": 1189,
    "word_count": 185,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "66317d1e-d315-4d9e-8420-fd28f833ad9b",
    "text": "For the comparison results, the explanations were generated using the FAME framework only, and with a final run of VERIX+ to ensure minimality (see Figure 1).\nTable 1: Average explanation size and generation time (in seconds) for FAME (single-round and iterative, MILP/Greedy) and for FAME-accelerated VERIX+ to achieve minimality. Traversal order: bounds for VERIX+ (alone) and FAME-accelerated VERIX+. Search procedure: binary (VERIX+ alone), MILP or Greedy (FAME), Greedy + binary (FAME-accelerated VERIX+).\nModel | VERIX+ alone: AXp size, time | Single-round MILP: wAXpA size, time | Single-round Greedy: wAXpA size, time | Iterative MILP: wAXpA size, time | Iterative Greedy: wAXpA size, time | candidate-set size | FAME-accelerated VERIX+: AXp size, time\nMNIST-FC | 280.16, 13.87 | 441.05, 4.4 | 448.37, 0.35 | 229.73, 14.30 | 225.14, 8.78 | 44.21 | 224.41, 13.72\nMNIST-CNN | 159.78, 56.72 | 181.24, 5.59 | 190.29, 0.51 | 124.9, 12.35 | 122.09, 5.6 | 104.09 | 113.53, 33.75\nGTSRB-FC | 313.42, 56.18 | 236.85, 9.68 | 243.18, 0.97 | 331.84, 12.28 | 332.74, 5.26 | 11.93 | 332.66, 9.26\nGTSRB-CNN | 338.28, 185.03 | 372.66, 12.45 | 379.34, 1.35 | 322.42, 17.63 | 322.42, 7.42 | 219.57 | 322.42, 138.12\nExperimental Setup All experiments were carried out on a machine equipped with an Apple M2 Pro processor and 16 GB of memory. The analysis is conducted on fully connected (-FC) and convolutional (-CNN) models from the MNIST and GTSRB datasets, with ϵ set to 0.05 and 0.01 respectively. 
The verified perturbation analysis was performed using the DECOMON library², applying the CROWN method with an l∞-norm. The NN verifier Marabou (Katz et al., 2019) is used within VERIX+.\n²https://github.com/airbus/decomon\nFigure 2: FAME's iterative refinement approach against the VERIX+ baseline. The left plot compares the size of the final explanations. The right plot compares the runtime (in seconds). The data points for each model are distinguished by color, and the use of circles (card=True) and squares (card=False) indicates whether a cardinality constraint (||x − x′||0 ≤ m) was applied.",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 24,
    "total_chunks": 75,
    "char_count": 1920,
    "word_count": 292,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "6522edaf-b811-4a08-8d29-e4c76b211980",
    "text": "We included a sensitivity analysis covering: (1) Solver Choice, confirming the Greedy heuristic's near-optimality vs. MILP (Table 1); (2) Cardinality Constraints, showing that card=True yields significantly smaller explanations (Figure 2); and (3) Perturbation Magnitude (ϵ), which we fixed to the baseline used by VERIX+ for direct comparison. We include additional experimental results on the ResNet-2B architecture (CIFAR-10) from the VNN-COMP benchmark (Wang et al., 2021) to demonstrate scalability on deeper models. 
The complete set of hyperparameters and\nthe detailed architectures of the models used are provided in Appendix E for full reproducibility.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 25, + "total_chunks": 75, + "char_count": 656, + "word_count": 91, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ddaac033-8c72-4f71-93b0-788fca0fb099", + "text": "MILP FOR ABSTRACT BATCH FREEING Performance in a Single Round This experiment, in the 'FAME: Single Round' column of Table\n1, compares the runtime and size of the largest free set obtained in a single round using the greedy\nmethod versus an exact MILP solver for the abstract batch freeing (Algorithm 1). Across all models, the greedy heuristic consistently provided a significant speedup (ranging from\n9× to 12×) while achieving an abstract explanation size very close (fewer than 9 features in average)\nto that of the optimal MILP solver. This demonstrates that, for single-round batch freeing, the greedy\nmethod offers a more practical and scalable solution. Performance with Iterative Refinement This experiment compares the two methods in an iterative\nsetting of the abstract batch freeing, where the perturbation domain is progressively refined (Section 5). For the iterative refinement process, the greedy approach maintained a substantial runtime\nadvantage over the MILP solver, with a speedup up to 2.4× on the GTSRB-CNN model, while\nproducing abstract explanations that were consistently close in size to the optimal solution. The\ndistinction between the circle and square markers is significant in Figure 2. The square markers\n(card=False) tend to lie closer to or even above the diagonal line. 
This suggests that the cardinality-constrained domain, when successful, is highly effective at finding more compact explanations. Impact of Iterative Refinement: Comparing 'FAME: Single-round' vs. 'FAME: Iterative refinement' in Table 1 isolates the impact of Algorithm 2. For MNIST-CNN, iterative refinement reduces explanation size by 36% (190.29 to 122.09). This highlights the trade-off: a modest increase in runtime yields significantly more compact explanations.\n7.2 COMPARISON WITH STATE-OF-THE-ART (VERIX+)\nIn this section we compare the results of VERIX+ (alone) vs. FAME-accelerated VERIX+.",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 26,
    "total_chunks": 75,
    "char_count": 1905,
    "word_count": 283,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8a55a832-86c3-461a-950b-02c351b1bd48",
    "text": "Explanation Size and Runtime: FAME consistently produces smaller explanations than VERIX+ while being significantly faster, mainly due to FAME's iterative refinement approach, as visually confirmed by the plots in Figure 2 that show a majority of data points falling below the diagonal line for both size and time comparisons. The runtime gains are particularly substantial for the GTSRB models (green and red markers), where FAME's runtime is often only a small fraction of VERIX+'s, as shown in Table 1. In some cases, FAME delivers a non-minimal set that is smaller than VERIX+'s minimal set, with up to a 25× speedup (322.42 features in 7.42s compared to 338.28 in 185.03s for the GTSRB-CNN model), while producing wAXpA that were consistently close in size to the optimal solution. 
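As a quick sanity check, the two headline ratios quoted in this section can be recomputed directly from the Table 1 entries they cite (helper names are illustrative only):

```python
def pct_reduction(before, after):
    # Percentage reduction in explanation size.
    return 100.0 * (before - after) / before

def speedup(t_slow, t_fast):
    # Runtime speedup factor.
    return t_slow / t_fast

# MNIST-CNN: single-round Greedy 190.29 -> iterative Greedy 122.09 (~36%)
mnist_cnn_reduction = pct_reduction(190.29, 122.09)
# GTSRB-CNN: VERIX+ alone 185.03s vs. FAME iterative Greedy 7.42s (~25x)
gtsrb_cnn_speedup = speedup(185.03, 7.42)
```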
The Role of Abstract Freeing: The effectiveness of FAME's approach is further supported by the\n\"distance to minimality\" metric. The average distance to minimality was 44.21 for MNIST-FC and\n104.09 for MNIST-CNN. An important observation from our experiments is that when the abstract\ndomains in FAME are effective, they yield abstract abductive explanations wAXpA that are smaller\nthan the abductive explanations (AXp) from VERIX+. This is not immediately obvious from the\nsummary table, as the final explanations may differ.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 27, + "total_chunks": 75, + "char_count": 1355, + "word_count": 214, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e3af1c44-06f6-427e-b422-b45a1ee41f4e", + "text": "Conversely, when FAME's abstract domains\nfail to find a valid free set, our method defaults to a binary search approach similar to VERIX+. However, since we do not use the Marabou solver in this phase, the resulting wAXpA is larger than\nthe AXp provided by Marabou. This highlights the trade-off and the hybrid nature of our approach. Finally, to demonstrate the generality of our framework beyond standard benchmarks, in Appendix\nF we provide additional experiments on the ResNet-2B architecture Wang et al. (2021) trained on\nCIFAR-10. These results represent, to the best of our knowledge, the first formal explanations generated for such a complex architecture, highlighting FAME as an enabling technology for scalability. 8 CONCLUSION AND DISCUSSION In this work, we introduced FAME (Formal Abstract Minimal Explanations), a novel framework for\ncomputing abductive explanations that effectively scales to large neural networks. 
By leveraging a\nhybrid strategy grounded in abstract interpretation and dedicated perturbation domains, we successfully addressed the long-standing sequential bottleneck of traditional formal explanation methods. Our main contribution is a new approach that eliminates the need for traversal order by progressively\nshrinking dedicated perturbation domains and using LiRPA-based bounds to efficiently discard irrelevant features. The core of our method relies on a greedy heuristic for batch freeing that, as our\nanalysis shows, is significantly faster than an exact MILP solver while yielding comparable explanation sizes. Our experimental results demonstrate that the full hybrid FAME pipeline outperforms the current\nstate-of-the-art VERIX+ baseline, providing a superior trade-off between computation time and\nexplanation quality. We consistently observed significant reductions in runtime while producing\nexplanations that are close to true minimality. This success highlights the feasibility of computing\nformal explanations for larger models and validates the effectiveness of our hybrid strategy. Beyond its performance benefits, the FAME framework is highly generalizable. 
Although our evaluation focused on classification tasks, the framework can be extended to other machine learning\napplications, such as regression.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 28, + "total_chunks": 75, + "char_count": 2259, + "word_count": 318, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1a75ac58-10eb-496b-ab1f-2683dc1ba647", + "text": "While we focused on robustness in continuous domains, FAME's\nhigh-level algorithms (batch certificate, greedy selection) support discrete features (see Appendix\nB). LiRPA natively handles discrete variables (e.g., one-hot encodings) via contiguous interval\nbounds. Furthermore, the framework can support other properties like local stability. Additionally,\nFAME can be configured to use exact solvers for the final refinement step, ensuring its adaptability\nand robustness for various use cases. Finally, we demonstrated FAME's scalability on the ResNet-2B (CIFAR-10) architecture. Although\nthe abstraction gap naturally widens with depth, FAME's ability to rapidly prune irrelevant features\nestablishes it as a critical enabling step for applying formal XAI to complex models where exact-only\nmethods are currently intractable. By designing a framework that natively leverages certificates from\nmodern, GPU-enabled verifiers, this work effectively bridges the gap between formal guarantees and\npractical scalability. Published as a conference paper at ICLR 2026 Our work has benefited from the AI Cluster ANITI and the research program DEEL.3 ANITI is\nfunded by the France 2030 program under the Grant agreement n°ANR-23-IACL-0002. 
DEEL is an\nintegrative program of the AI Cluster ANITI, designed and operated jointly with IRT Saint Exup´ery,\nwith the financial support from its industrial and academic partners and the France 2030 program under the Grant agreement n°ANR-10-AIRT-01. Within the DEEL program, we are especially grateful\nto Franck MAMALET for their constant encouragement, valuable discussions, and insightful feedback throughout the development of this work.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 29, + "total_chunks": 75, + "char_count": 1675, + "word_count": 234, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9992cc71-7e6c-4695-b090-728710d7102e", + "text": "The work of Elsaleh, Bassan, and Katz was partially\nfunded by the European Union (ERC, VeriDeL, 101112713). Views and opinions expressed are\nhowever those of the author(s) only and do not necessarily reflect those of the European Union or\nthe European Research Council Executive Agency. Neither the European Union nor the granting\nauthority can be held responsible for them. The work of Elsaleh, Bassan, and Katz was additionally\nsupported by a grant from the Israeli Science Foundation (grant number 558/24). Elsaleh is also\nsupported by the Ariane de Rothschild Women Doctoral Program. Gilles Audemard, Steve Bellart, Louenas Bounia, Fr´ed´eric Koriche, Jean-Marie Lagniez, and Pierre\nMarquis. Trading complexity for sparsity in random forest explanations. In Proceedings of the\nAAAI Conference on Artificial Intelligence, volume 36, pp. 5461–5469, 2022. Bassan, Yizhak Yisrael Elboher, Tobias Ladner, Matthias Althoff, and Guy Katz. Explaining,\nFast and Slow: Abstraction and Refinement of Provable Explanations. 
Conf.\non Machine Learning (ICML), 2025a. Shahaf Bassan and Guy Katz. Towards formal xai: formally approximate minimal explanations of\nneural networks. In International Conference on Tools and Algorithms for the Construction and\nAnalysis of Systems, pp. 187–207. Shahaf Bassan, Guy Amir, Davide Corsi, Idan Refaeli, and Guy Katz.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 30, + "total_chunks": 75, + "char_count": 1344, + "word_count": 196, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "947f2488-6a67-45e6-b484-b8f7d8d0b35c", + "text": "Formally Explaining Neural\nNetworks Within Reactive Systems. Conf. on Formal Methods in ComputerAided Design (FMCAD), pp. 1–13, 2023. Shahaf Bassan, Ron Eliav, and Shlomit Gur. Explain Yourself, Briefly! Self-Explaining Neural\nNetworks with Concise Sufficient Reasons. Conf. on Learning Representations\n(ICLR), 2025b. Umang Bhatt, Adrian Weller, and Jos´e M.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 31, + "total_chunks": 75, + "char_count": 358, + "word_count": 48, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6d7cce3a-18ac-46f7-ba02-44f4aeedc120", + "text": "Evaluating and aggregating feature-based\nmodel explanations. In Christian Bessiere (ed.), Proceedings of the Twenty-Ninth International\nJoint Conference on Artificial Intelligence, IJCAI-20, pp. 3016–3022. 
International Joint Conferences on Artificial Intelligence Organization, 7 2020. doi: 10.24963/ijcai.2020/417. URL\nhttps://doi.org/10.24963/ijcai.2020/417. Elena Botoeva, Panagiotis Kouvaros, Jan Kronqvist, Alessio Lomuscio, and Ruth Misener.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 32, + "total_chunks": 75, + "char_count": 448, + "word_count": 47, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b87fb4a8-2bc3-4be3-bd20-ba56bc7acddc", + "text": "Efficient\nverification of relu-based neural networks via dependency analysis. Proceedings of the AAAI\nConference on Artificial Intelligence, 34(04):3291–3299, Apr. 2020. doi: 10.1609/aaai.v34i04.\n5729. URL https://ojs.aaai.org/index.php/AAAI/article/view/5729. Ryma Boumazouza, Fahima Cheikh-Alili, Bertrand Mazure, and Karim Tabia. Asteryx: A modelagnostic sat-based approach for symbolic and score-based explanations. In Proceedings of the\n30th ACM International Conference on Information & Knowledge Management, pp. 120–129,\n2021. Ryma Boumazouza, Fahima Cheikh-Alili, Bertrand Mazure, and Karim Tabia.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 33, + "total_chunks": 75, + "char_count": 605, + "word_count": 69, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "dfc09964-2fde-4a4d-a95d-9f1130970097", + "text": "Symbolic explanations for multi-label classification. 
In 15th International Conference on Agents and Artificial\nIntelligence (ICAART 2023), volume 3, pp. 342–349. SCITEPRESS-Science and Technology\nPublications, 2023. 3https://www.deel.ai/ Published as a conference paper at ICLR 2026 Christopher Brix, Stanley Bak, Changliu Liu, and Taylor T.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 34, + "total_chunks": 75, + "char_count": 342, + "word_count": 43, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "22ce8194-9399-4bdc-94db-78800b1f7622", + "text": "The fourth international verification of neural networks competition (VNN-COMP 2023): Summary and results. CoRR,\nabs/2312.16760, 2023. doi: 10.48550/ARXIV.2312.16760. URL https://doi.org/10.\n48550/arXiv.2312.16760. John W Chinneck and Erik W Dravnieks. Locating minimal infeasible constraint sets in linear\nprograms. ORSA Journal on Computing, 3(2):157–168, 1991. Adnan Darwiche and Chunxi Ji. On the computation of necessary and sufficient explanations. In\nProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 5582–5591, 2022. 
Alessandro De Palma, Serge Durand, Zakaria Chihani, and Caterina Urban.",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 35,
    "total_chunks": 75,
    "char_count": 624,
    "word_count": 79,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "057bf9b6-92f1-4639-8478-7ab642dc40f3",
    "text": "On Using Certified Training towards Empirical Robustness. Transactions on Machine Learning Research Journal, 2025. URL https://inria.hal.science/hal-05042448. J.L., and Puget, J.-F. Explanation-based generalisation of failures. In Proceedings of the Eighth European Conference on Artificial Intelligence (ECAI'88), pp. 339–344, 1988. Mélanie Ducoffe, Guillaume Povéda, Audrey Galametz, Ryma Boumazouza, Marion-Cécile Martin, Julien Baris, Derk Daverschot, and Eugene O'Higgins. Surrogate neural networks local stability for aircraft predictive maintenance. In International Conference on Formal Methods for Industrial Critical Systems, pp. 245–258. Thomas Fel, Lucas Hervier, David Vigouroux, Antonin Poche, Justin Plakoo, Remi Cadene, Mathieu Chalvidal, Julien Colin, Thibaut Boissin, Louis Bethune, et al. Xplique: A deep learning explainability toolbox. arXiv preprint arXiv:2206.04394, 2022. 
Thomas Fel, Melanie Ducoffe, David Vigouroux, R´emi Cad`ene, Mika¨el Capelle, Claire Nicod`eme,\nand Thomas Serre.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 36, + "total_chunks": 75, + "char_count": 1015, + "word_count": 124, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1faaddb4-0a2a-472f-aa6f-f6bafbc56f84", + "text": "Don't lie to me! robust and efficient explainability with verified perturbation\nanalysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16153–16163, June 2023. On Efficiently Explaining Graph-Based Classifiers. Conf. on Principles of Knowledge Representation and Reasoning (KR),\n2021. Xuanxiang Huang and Joao Marques-Silva. From robustness to explainability and back again. arXiv Xuanxiang Huang, Yacine Izza, Alexey Ignatiev, Martin Cooper, Nicholas Asher, and Joao MarquesSilva.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 37, + "total_chunks": 75, + "char_count": 532, + "word_count": 70, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ae6142ff-eaac-42a3-b4ec-37eeecb1e552", + "text": "Tractable explanations for d-dnnf classifiers. In Proceedings of the AAAI Conference on\nArtificial Intelligence, volume 36, pp. 5719–5728, 2022. Towards trustable explainable ai. 
In International Joint Conference on Artificial\nIntelligence-Pacific Rim International Conference on Artificial Intelligence 2020, pp. 5154–5158. Association for the Advancement of Artificial Intelligence (AAAI), 2020. Alexey Ignatiev and Joao Marques-Silva. Sat-based rigorous explanations for decision lists. In International Conference on Theory and Applications of Satisfiability Testing, pp. 251–269. Alexey Ignatiev, Antonio Morgado, and Joao Marques-Silva. Propositional abduction with implicit\nhitting sets. In ECAI 2016, pp. 1327–1335. Alexey Ignatiev, Nina Narodytska, and Joao Marques-Silva. Abduction-based explanations for machine learning models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 1511–1519, 2019. Alexey Ignatiev, Yacine Izza, Peter J Stuckey, and Joao Marques-Silva.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 38, + "total_chunks": 75, + "char_count": 1008, + "word_count": 126, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fe6f2b58-3f1d-4c74-a169-8def272df7e8", + "text": "Using MaxSAT for Efficient\nExplanations of Tree Ensembles. In Proc. of the 36'th AAAI Conf. on Artificial Intelligence, pp.\n3776–3785, 2022. 
Published as a conference paper at ICLR 2026 Yacine Izza and Joao Marques-Silva.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 39, + "total_chunks": 75, + "char_count": 221, + "word_count": 34, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5ac8d399-d1ac-453a-aded-ffc0018dfdea", + "text": "On explaining random forests with sat. In Zhi-Hua Zhou (ed.),\nProceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21,\npp. 2584–2591. International Joint Conferences on Artificial Intelligence Organization, 8 2021.\ndoi: 10.24963/ijcai.2021/356. URL https://doi.org/10.24963/ijcai.2021/356. Yacine Izza, Alexey Ignatiev, and Joao Marques-Silva. On explaining decision trees. arXiv preprint Yacine Izza, Xuanxiang Huang, Antonio Morgado, Jordi Planes, Alexey Ignatiev, and Joao MarquesSilva.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 40, + "total_chunks": 75, + "char_count": 530, + "word_count": 62, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4a8f1507-129a-4fac-92c3-4af9ecdc4ce4", + "text": "Distance-Restricted Explanations: Theoretical Underpinnings & Efficient Implementation. In Proceedings of the 21st International Conference on Principles of Knowledge Representation\nand Reasoning, pp. 475–486, 8 2024. doi: 10.24963/kr.2024/45. URL https://doi.org/\n10.24963/kr.2024/45. 
Quickxplain: Preferred explanations and relaxations for over-constrained problems. In Proceedings of the 19th national conference on Artifical intelligence, pp. 167–172, 2004. Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 41, + "total_chunks": 75, + "char_count": 539, + "word_count": 64, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f31d0be6-9212-4ac3-b025-c4a85bbdb539", + "text": "Reluplex: An\nefficient smt solver for verifying deep neural networks. In International conference on computer\naided verification, pp. 97–117. Guy Katz, Derek A Huang, Duligur Ibeling, Kyle Julian, Christopher Lazarus, Rachel Lim, Parth\nShah, Shantanu Thakoor, Haoze Wu, Aleksandar Zelji´c, et al.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 42, + "total_chunks": 75, + "char_count": 296, + "word_count": 42, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7d3bc0dc-4e2b-49d0-bbc4-34279b6d4b23", + "text": "The marabou framework for verification and analysis of deep neural networks. In International conference on computer aided\nverification, pp. 443–452. Emanuele La Malfa, Rhiannon Michelmore, Agnieszka M. 
Zbrzezny, Nicola Paoletti, and Marta\nKwiatkowska.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 43, + "total_chunks": 75, + "char_count": 252, + "word_count": 33, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "cb618f81-742f-4e76-be1b-01c1db6f5a0b", + "text": "On guaranteed optimal robust explanations for nlp models. In Zhi-Hua Zhou (ed.),\nProceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21,\npp. 2658–2665. International Joint Conferences on Artificial Intelligence Organization, 8 2021.\ndoi: 10.24963/ijcai.2021/366. URL https://doi.org/10.24963/ijcai.2021/366. Minimal unsatisfiability: Models, algorithms and applications (invited paper). In 40th IEEE International Symposium on Multiple-Valued Logic, ISMVL 2010, Barcelona, Spain,\n26-28 May 2010, pp. 9–14. IEEE Computer Society, 2010. doi: 10.1109/ISMVL.2010.11. URL\nhttps://doi.org/10.1109/ISMVL.2010.11. Disproving xai myths with formal methods–initial results. In 2023 27th International Conference on Engineering of Complex Computer Systems (ICECCS), pp. 12–21. Logic-based explainability in machine learning. Causality, Explanations and Declarative Knowledge: 18th International Summer School 2022, Berlin,\nGermany, September 27–30, 2022, Tutorial Lectures, pp. 24–104. 
Joao Marques-Silva, Thomas Gerspacher, Martin Cooper, Alexey Ignatiev, and Nina Narodytska.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 44, + "total_chunks": 75, + "char_count": 1108, + "word_count": 127, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5d09dc23-44a0-4525-99ff-c91bb918d60a", + "text": "Explaining naive bayes and other linear classifiers with polynomial time and delay. Advances in\nNeural Information Processing Systems, 33:20590–20600, 2020. Joao Marques-Silva, Thomas Gerspacher, Martin C Cooper, Alexey Ignatiev, and Nina Narodytska. Explanations for monotonic classifiers. In International Conference on Machine Learning, pp.\n7469–7479. Mathieu Serrurier, Franck Mamalet, Alberto Gonz´alez-Sanz, Thibaut Boissin, Jean-Michel Loubes,\nand Eustasio Del Barrio. Achieving robustness in classification using optimal transport with hinge\nregularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern\nRecognition, 2021. Andy Shih, Arthur Choi, and Adnan Darwiche.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 45, + "total_chunks": 75, + "char_count": 701, + "word_count": 88, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4b271ffd-2998-4fc4-9d25-525fefd97a66", + "text": "A symbolic approach to explaining bayesian network classifiers. 
In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI'18, pp. 5103–5111. Gagandeep Singh, Timon Gehr, Markus Püschel, and Martin Vechev.",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 46,
    "total_chunks": 75,
    "char_count": 284,
    "word_count": 39,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7f93acef-3179-4d79-8bc0-78c44d446626",
    "text": "An abstract domain for certifying neural networks. Proceedings of the ACM on Programming Languages, 3(POPL):1–30,\n2019. Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural networks, 32:323–332,\n2012. Caterina Urban and Antoine Miné.",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 47,
    "total_chunks": 75,
    "char_count": 346,
    "word_count": 45,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a584d0e2-cbce-45f8-8eb6-89667d0cdc25",
    "text": "A review of formal methods applied to machine learning. ArXiv, abs/2104.02466, 2021. URL https://api.semanticscholar.org/\nCorpusID:233033440. 
Shiqi Wang, Huan Zhang, Kaidi Xu, Xue Lin, Suman Sekhar Jana, Cho-Jui Hsieh, and Zico Kolter.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 48, + "total_chunks": 75, + "char_count": 235, + "word_count": 31, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "40993611-7a9a-4f4a-999b-04a13f861cea", + "text": "Beta-crown: Efficient bound propagation with per-neuron split constraints for neural network\nrobustness verification. In Neural Information Processing Systems, 2021. URL https://api.\nsemanticscholar.org/CorpusID:244114085. Haoze Wu, Omri Isac, Aleksandar Zelji´c, Teruhiro Tagomori, Matthew Daggitt, Wen Kokke, Idan\nRefaeli, Guy Amir, Kyle Julian, Shahaf Bassan, et al. Marabou 2.0: A Versatile Formal Analyzer\nof Neural Networks. Conf. on Computer Aided Verification (CAV), pp. 249–264,\n2024a. Min Wu, Haoze Wu, and Clark Barrett.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 49, + "total_chunks": 75, + "char_count": 531, + "word_count": 69, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "59c2d9b4-ebfc-477b-bde7-d6497f3dd6fc", + "text": "Verix: towards verified explainability of deep neural networks. Advances in Neural Information Processing Systems, 36:22247–22268, 2023. 
Min Wu, Xiaofu Li, Haoze Wu, and Clark W.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 50, + "total_chunks": 75, + "char_count": 178, + "word_count": 25, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "266efb6c-c485-45bf-b102-60a6fca18eda", + "text": "Better verified explanations with applications\nto incorrectness and out-of-distribution detection. CoRR, abs/2409.03060, 2024b. doi: 10.48550/\nARXIV.2409.03060. URL https://doi.org/10.48550/arXiv.2409.03060. Kaidi Xu, Zhouxing Shi, Huan Zhang, Yihan Wang, Kai-Wei Chang, Minlie Huang, Bhavya\nKailkhura, Xue Lin, and Cho-Jui Hsieh.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 51, + "total_chunks": 75, + "char_count": 330, + "word_count": 37, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a73a125c-fff8-4d3a-8a58-1ea82058249b", + "text": "Automatic perturbation analysis for scalable certified\nrobustness and beyond. Advances in Neural Information Processing Systems, 33:1129–1141,\n2020. Mnist handwritten digit database. Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui Hsieh, and Luca Daniel. Efficient neural\nnetwork robustness certification with general activation functions. Advances in neural information\nprocessing systems, 31, 2018. 
Zhe Zhao, Yedi Zhang, Guangke Chen, Fu Song, Taolue Chen, and Jiaxiang Liu.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 52, + "total_chunks": 75, + "char_count": 476, + "word_count": 62, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7888662c-32be-4479-a5bb-bfb1ff85de94", + "text": "Cleverest: accelerating cegar-based neural network verification via adversarial attacks. In International Static\nAnalysis Symposium, pp. 449–473. Published as a conference paper at ICLR 2026 The appendix collects proofs, model specifications, and supplementary experimental results that\nsupport the main paper. Appendix A contains additional background on formal verification terminology, Abstract Interpretation, and LiRPA. Appendix B contains the complete proofs of all propositions. Appendix C provides the pseudocode for the FAME algorithms and the associated baselines. Appendix D provides illustrative examples of abductive explanations and the greedy knapsack formulation. Appendix E provides specifications of the datasets and architectures used, along with supplementary experimental results. Appendix F details the scalability analysis on complex architectures (ResNet-2B on CIFAR-10). 
Appendix G provides the LLM usage disclosure.",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 53,
    "total_chunks": 75,
    "char_count": 941,
    "word_count": 121,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b2afd413-67e0-4290-a3cc-4a6dd7a8da80",
    "text": "A BACKGROUND ON FORMAL VERIFICATION A.1 ABSTRACT INTERPRETATION Abstract Interpretation is a theory of sound approximation of the semantics of computer programs. In the context of neural networks, it allows us to compute over-approximations of the network's\noutput range without executing the network on every single point in the input domain (which is\ninfinite). While exact verification methods (like MILP solvers) provide precise results, they are generally\nNP-hard and do not scale to large networks. Abstract interpretation trades precision for scalability\n(typically polynomial time) by operating on abstract domains (e.g., intervals, zonotopes, or polyhedra) rather than concrete values. A.2 LIRPA (LINEAR RELAXATION-BASED PERTURBATION ANALYSIS) LiRPA (Linear Relaxation-based Perturbation Analysis) is a specific, efficient instance of abstract\ninterpretation designed for neural networks. Instead of propagating simple intervals (which become\ntoo loose/imprecise in deep networks), LiRPA propagates linear constraints. For every neuron xj, it\ncomputes two linear bounds relative to the input x: w_j^T x + b_j ≤ fj(x) ≤ w̄_j^T x + b̄_j, where (w_j, b_j) and (w̄_j, b̄_j) are the lower- and upper-bound coefficients, respectively. These linear bounds allow us to rigorously bound the \"worst-case\" behavior of the network much\nmore tightly than simple intervals. If the lower bound of the correct class minus the upper bound of\nthe target class is positive, we have a mathematically sound certificate of robustness. 
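The bound-propagation idea can be sketched concretely. The following is a minimal interval bound propagation example, using the simplest of the abstract domains mentioned above (intervals), which is coarser than the linear relaxation LiRPA propagates; the two-layer network and its weights are illustrative, not taken from the paper:

```python
import numpy as np

def interval_affine(lo, hi, W, b):
    # Sound output bounds of W @ x + b when x lies in the box [lo, hi]:
    # positive weights pick up the lower/upper end, negative weights the opposite.
    Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def interval_relu(lo, hi):
    # ReLU is monotone, so it maps interval bounds to interval bounds.
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

def verified_margin(x, eps, layers, correct, target):
    # Propagate the l_inf ball [x - eps, x + eps] through the network and
    # return a sound lower bound on f_correct(x') - f_target(x').
    lo, hi = x - eps, x + eps
    for i, (W, b) in enumerate(layers):
        lo, hi = interval_affine(lo, hi, W, b)
        if i < len(layers) - 1:
            lo, hi = interval_relu(lo, hi)
    return lo[correct] - hi[target]

# Tiny 2-layer ReLU network with illustrative weights.
layers = [
    (np.array([[1.0, -1.0], [0.5, 0.5]]), np.zeros(2)),
    (np.array([[2.0, 0.0], [0.0, 1.0]]), np.array([1.0, 0.0])),
]
x = np.array([1.0, 0.2])
margin = verified_margin(x, eps=0.05, layers=layers, correct=0, target=1)
```

A positive margin is a sound certificate that no perturbation in the ball flips the label; a non-positive margin is inconclusive, since the over-approximation may simply be too loose.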
Illustrative Example: Consider a nominal input image ¯x from the MNIST dataset depicting the\ndigit '7'. In a standard local robustness verification task, we define the input domain Ω(¯x) as an\nl∞-norm ball with a radius of ϵ = 0.05.",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 54,
    "total_chunks": 75,
    "char_count": 1650,
    "word_count": 246,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e5f3e522-ef7c-415d-a11d-523f2319abda",
    "text": "This implies that each pixel xi in the image is permitted to\nvary independently within the interval [¯xi −0.05, ¯xi + 0.05]. The verification objective is to prove that the property P holds: specifically, that for every possible\nperturbed image x ∈Ω(¯x), the network's output logit for the ground-truth class ('7') remains strictly\ngreater than the logit for any target class k (e.g., '1'). In the context of LiRPA, this is verified by\ncomputing a sound lower bound for the correct class (f_7) and a sound upper bound for the competing\nclass (f_1). If the verified margin f_7 − f_1 > 0, the network is guaranteed to be robust against all\nperturbations in Ω(¯x). A.3 VERIFICATION TERMINOLOGY We formulate the check for explanation sufficiency as a constraint satisfaction problem. A query\nis SAT if a valid perturbation (counter-example) exists, and UNSAT if no such perturbation exists\n(meaning the explanation is valid). • Soundness (No False Positives): A verifier is sound if it guarantees that any certified property truly holds. In Abstract Interpretation, soundness is achieved because the computed\nabstract bounds strictly enclose the true concrete values. 
If these conservative bounds satisfy the property (UNSAT), the actual network must also satisfy it.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 55, + "total_chunks": 75, + "char_count": 1310, + "word_count": 213, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d8870cda-9c90-4175-99c4-c7f7bfddadec", + "text": "• Completeness (No False Negatives): A verifier is complete if it is capable of certifying any\nvalid explanation. Exact solvers (like MILP) are complete. In contrast, Abstract Interpretation is incomplete: due to over-approximation, the bounds may be too loose to prove a\ntrue property, leading to a \"don't know\" state where the explanation is valid, but the verifier\ncannot prove it. THE ASYMMETRY OF PARALLEL FEATURE SELECTION Proposition B.1 (Simultaneous Addition). Any number of essential features can be added to the\nexplanation simultaneously. This property allows us to leverage solvers capable of assessing multiple verification queries in parallel, leading to a substantial reduction in runtime. (a) Adding several features at once is sound. (b) Freeing several features at once is unsound. Figure 3: Toy example illustrating the asymmetry between adding and freeing features. Simultaneous Addition B.1. Let X be the current explanation candidate, and let R = {r1, . . . , rk}\nbe a set of features not in X. If, for every ri ∈R, removing the single feature ri from the set\nF \\ (X ∪{ri}) produces a counterexample, then all features in R are necessary and can be added\nto the explanation at once. Simultaneous freeing 4.1. 
If removing any feature from a set R ⊆F \ X individually causes the\nexplanation to fail (i.e., produces a counterexample), then all features in R can be added to the\nexplanation X simultaneously. Batch-Certifiable Freeing 4.2.",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 56,
    "total_chunks": 75,
    "char_count": 1458,
    "word_count": 237,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "99c0e19d-5f76-4be4-b29b-d84a65fb2320",
    "text": "For any i ≠ c and x′ ∈Ω(x), LiRPA bounds give fi(x′) − fc(x′) ≤ b̄i(x) + Σ_{j∈A} ∆i,j(x′) with ∆i,j(x′) ≤ c̄i,j. Taking the worst case over x′ and i yields fi(x′) −\nfc(x′) ≤Φ(A) ≤0, precluding a label flip. PROPOSITION (CORRECTNESS OF THE RECURSIVE PROCEDURE) Let A be the set returned by Algorithm 2 augmented with the final singleton refinement step that\ntests each remaining feature individually with the LiRPA certificate Φ(·). (i) (No singleton extension) For every feature j ∈F \ A we have Φ(A ∪{j}) > 0, i.e. no single feature can be added to A while preserving the certificate. Hence A is\nsingleton-maximal with respect to the LiRPA certificate. (ii) (Termination) Algorithm 2 terminates in at most |F| outer iterations (and finitely many\ninner steps). (iii) (Full abstract minimality — conditional) If the inner batch solver called by Algorithm 2\nreturns, for each tested budget p, a globally optimal certified free set (i.e., for the current\ndomain it finds a maximum-cardinality Ap satisfying Φ(Ap) ≤0), then the final A is a\nglobally maximal certified free set: there is no A′ ⊋A with Φ(A′) ≤0. In this case A is\na true minimal abstract explanation (with respect to the chosen LiRPA relaxation). 
Proof. (i) No singleton extension. By construction, the algorithm performs a final singleton refinement: it tests every feature j ∈F \\ A by evaluating the certificate on A ∪{j}. The algorithm\nonly adds j to A if Φ(A ∪{j}) ≤0. Since the refinement ends with no further additions, it follows\nthat for every remaining j we have Φ(A ∪{j}) > 0.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 57, + "total_chunks": 75, + "char_count": 1572, + "word_count": 270, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7c1793d5-2636-4ce3-83f1-ae112e472706", + "text": "Each time the algorithm adds at least one feature to A, the cardinality |A| strictly\nincreases and cannot exceed |F|. The outer loop therefore performs at most |F| successful additions. If an outer iteration yields no new features, the loop stops. Inner loops (scanning budgets\np or performing singleton checks) are finite since they iterate over finite sets. Hence the algorithm\nterminates in finite time. (iii) Full abstract minimality under optimal inner solver. Suppose that for every domain tested,\nthe inner routine (called for each p) returns a certified free set of maximum possible cardinality\namong all subsets that satisfy Φ(·) ≤0 on that domain. During each outer iteration the algorithm\nenumerates budgets p (or otherwise explores the space of allowed cardinalities) and selects the\nlargest Ap found; then A is augmented by that largest globally-feasible batch. If no nonempty\nglobally-feasible batch exists for any tested p, then no superset of the current A can be certified\n(because any superset would have some cardinality p′ tested and the solver would have returned it). 
After the final singleton checks (which also use the optimal verifier on singletons), there remains no\nsingle feature that can be added. Combining these facts yields that no superset of A is certifiable,\ni.e. A is a globally maximal certified free set, as claimed. Abstract Minimal Explanation Correctness of Iterative Singleton Freeing. Let F be the candidate feature set and let A0 ⊆F be\nan initial free set such that the LiRPA certificate verifies A0 (i.e. Φ(A0) ≤0).", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 59, + "total_chunks": 75, + "char_count": 1560, + "word_count": 251, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0d7a9cab-c3f6-4ff7-b16c-e0318bd1698a", + "text": "Run the Iterative\nSingleton Freeing procedure (Algorithm 5) with traversal order π. The algorithm returns a set A\nwith the following properties: 1. (Soundness) The final set A satisfies Φ(A) ≤0 (every added singleton was certified). 2. (Termination) The algorithm terminates after at most |F| −|A0| successful additions\n(hence in finite time). 3. (Singleton-maximality) For every j ∈F \\ A we have Φ(A ∪{j}) > 0, i.e. no remaining\nsingle feature can be certified as free. Published as a conference paper at ICLR 2026 Soundness (invariant). By assumption Φ(A0) ≤0. The algorithm only appends a feature\ni to the current free set after a LiRPA call returns success on A ∪{i}, i.e. Φ(A ∪{i}) ≤0. Since\nLiRPA certificates are sound, every update preserves the invariant \"current A is certified\". Therefore\nthe final A satisfies Φ(A) ≤0. Each successful iteration increases |A| by one and |A| ≤|F|. 
Thus there can be at\nmost |F| −|A0| successful additions.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 60, + "total_chunks": 75, + "char_count": 949, + "word_count": 156, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "40b05e93-4d5e-4e5b-9ded-201242bfc740", + "text": "The algorithm halts when a full scan yields no addition; since\nscans iterate over a finite set ordered by π, the procedure terminates in finite time. Singleton-maximality. Assume by contradiction that after termination there exists j ∈F \\ A with\nΦ(A ∪{j}) ≤0. The final scan that caused termination necessarily tested j (traversal order covers\nall remaining indices), so the algorithm would have added j, contradicting termination. Hence for\nevery j ∈F \\ A we must have Φ(A ∪{j}) > 0, proving singleton-maximality. Worked counterexample (illustrating joint freeing). Consider a toy binary classifier with two\ninput features x1, x2 and property P: the label remains class 0 iff f0(x′) −f1(x′) ≥0. 
Suppose the\nLiRPA relaxation yields conservative linear contributions such that b + c1 > 0, b + c2 > 0, but b + c1 + c2 ≤ 0, where ci is the worst-case contribution of feature i and b is the baseline margin.",
    "paper_id": "2603.10661",
    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
    "authors": [
      "Ryma Boumazouza",
      "Raya Elsaleh",
      "Melanie Ducoffe",
      "Shahaf Bassan",
      "Guy Katz"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
    "chunk_index": 61,
    "total_chunks": 75,
    "char_count": 902,
    "word_count": 153,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "6a0a1e43-d3ca-4340-9d04-adf7116f5f09",
    "text": "Then neither\nsingleton {1} nor {2} is certifiable (each violates the certificate), but the joint set {1, 2} is certifiable. The iterative singleton procedure terminates without adding either feature, while a batch routine (or\nan optimal MKP solver) would free both. This demonstrates the algorithm's limitation: it guarantees\nonly singleton-maximality, not global maximality over multi-feature batches. Complexity and practical cost. In the worst case the algorithm may attempt a\nLiRPA call for every remaining feature on each outer iteration. If r features are eventually added,\nthe total number of LiRPA calls is bounded by n + (n − 1) + · · · + (n − r + 1) = r · n − r(r − 1)/2 ≤ n(n + 1)/2 = O(n²).\nThus worst-case LiRPA call complexity is quadratic in n. In practice, however, each successful addition reduces the candidate set and often many iterations terminate early; empirical behavior tends\nto be much closer to linear in n for structured data because (i) many features are certified in early\npasses and (ii) LiRPA calls are highly parallelizable across features and can exploit GPU acceleration. 
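The worked counterexample above can be executed directly. In the following sketch, phi is a hypothetical stand-in for the LiRPA certificate Φ(A) (Φ(A) ≤ 0 meaning the free set A is certified), and the numbers b = 1, c1 = −0.5, c2 = −0.7 are illustrative, chosen only to satisfy the three stated inequalities:

```python
# Toy certificate: Phi(A) <= 0 means the free set A is certified.
# Illustrative numbers satisfying b + c1 > 0, b + c2 > 0, b + c1 + c2 <= 0.
b = 1.0
c = {1: -0.5, 2: -0.7}

def phi(free_set):
    # Hypothetical stand-in for the LiRPA certificate Phi(A).
    return b + sum(c[i] for i in free_set)

def iterative_singleton_freeing(features, free=frozenset()):
    # Repeatedly scan the remaining features; add any singleton whose
    # enlarged free set is still certified, until a full scan adds nothing.
    changed = True
    while changed:
        changed = False
        for j in sorted(features - free):
            if phi(free | {j}) <= 0:
                free = free | {j}
                changed = True
    return free

features = {1, 2}
singleton_result = iterative_singleton_freeing(features)  # frees nothing
joint_certified = phi(features) <= 0                      # joint set is certified
```

The singleton procedure returns the empty set even though the joint set {1, 2} passes the certificate, which is exactly the gap between singleton-maximality and global maximality described above.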
Finally, the dominant runtime factor is the per-call cost of LiRPA (forward/backward bound\npropagation); therefore hybrid strategies (batch pre-filtering, prioritized traversal orders, occasional\nexact-solver checks on promising subsets) are useful to reduce the number of expensive LiRPA\nevaluations. FAME FOR DISCRETE DATA FAME, as presented, uses LiRPA, which is designed for continuous ( ) domains. A discrete feature j\nwith admissible values in a finite set Sj can be incorporated by specifying an interval domain, which\nis the standard abstraction used in LiRPA-based verification. Consequently, FAME allows a discrete feature to vary over its admissible values. LiRPA supports\nthis by assigning\nx′j ∈[min Sj, max Sj],\nor, if only a subset S′j ⊆Sj is permitted,\nx′j ∈[min S′j, max S′j], provided that the values form a contiguous range.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 62, + "total_chunks": 75, + "char_count": 1946, + "word_count": 310, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2f2aa23a-df27-45ed-a844-98e6d2faa52e", + "text": "If a feature belongs to the explanation, it is fixed to its nominal value, which corresponds to assigning\nthe zero-width interval [xj, xj]. Note that freeing a feature to a non-contiguous set (e.g., allowing {1, 4} but excluding {2, 3}) cannot\nbe represented exactly, since LiRPA abstractions are convex intervals. Extending LiRPA to arbitrary\nfinite non-convex domains is left for future work. 
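The interval encodings above amount to a small helper. In this sketch, feature_interval is a hypothetical name; the convention (zero-width interval for features kept in the explanation, contiguous hull of the admissible set Sj for freed discrete features) follows the text:

```python
def feature_interval(nominal, free, admissible=None):
    # Interval domain for a single feature, following the convention above:
    # a feature kept in the explanation is pinned to its nominal value
    # (zero-width interval); a freed discrete feature ranges over the
    # contiguous hull [min S_j, max S_j] of its admissible value set.
    if not free:
        return (nominal, nominal)
    return (min(admissible), max(admissible))

# A freed discrete feature with S_j = {0, 1, 2, 3} gets [0, 3]; a feature in
# the explanation with nominal value 2 gets [2, 2]; a one-hot coordinate is a
# binary {0, 1} feature and gets [0, 1] when freed.
```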
In practice, such cases are rare: when categorical values have no meaningful numeric ordering, one-hot encodings are standard, and each coordinate\nbecomes a binary {0, 1} feature naturally supported by interval domains.",
+    "paper_id": "2603.10661",
+    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
+    "authors": [
+      "Ryma Boumazouza",
+      "Raya Elsaleh",
+      "Melanie Ducoffe",
+      "Shahaf Bassan",
+      "Guy Katz"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
+    "chunk_index": 63,
+    "total_chunks": 75,
+    "char_count": 659,
+    "word_count": 102,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "28440c6e-1815-42bd-9f6d-5358658276df",
+    "text": "Since FAME only requires sound per-feature lower and upper bounds, all its components, including the batch certificate Φ(A) and the refinement steps, apply directly to discrete and categorical\nfeatures. This appendix details the algorithmic procedures supporting the FAME framework and its baselines. We present four key algorithms: • Algorithm 3 (BINARYSEARCH): An enhanced version of the binary search traversal\nstrategy used in Verix+. It employs a divide-and-conquer approach to identify irrelevant\nfeatures, accepting a generic verification oracle (e.g., Marabou or LiRPA) as an input parameter. • Algorithm 4 (Simultaneous Add): An acceleration heuristic that uses adversarial attacks\nto quickly identify necessary features. By checking if relaxing a specific feature immediately leads to a counterexample via attacks (e.g., PGD), we can efficiently add necessary\nfeatures to the explanation without expensive verification calls. • Algorithm 5 (Iterative Singleton Freeing): A refinement procedure that iterates sequentially through the remaining candidate features. 
It utilizes LiRPA certificates to check if\nindividual features can be safely freed, serving as a final cleanup step for features that\ncould not be certified in batches. • Algorithm 6 (Recursive Abstract Batch Freeing): The core recursive loop of our framework. It iteratively tightens the perturbation domain using cardinality constraints (varying\nm) and invokes the greedy batch-freeing heuristic to maximize the size of the abstract explanation, concluding with a singleton refinement step. In this enhanced BINARYSEARCH algorithm, the solver (e.g., Marabou or LiRPA) is passed as an\nexplicit parameter to enable the CHECK function, which performs the core verification queries. Published as a conference paper at ICLR 2026",
+    "paper_id": "2603.10661",
+    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
+    "authors": [
+      "Ryma Boumazouza",
+      "Raya Elsaleh",
+      "Melanie Ducoffe",
+      "Shahaf Bassan",
+      "Guy Katz"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
+    "chunk_index": 64,
+    "total_chunks": 75,
+    "char_count": 1798,
+    "word_count": 257,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "25e7e9cd-85b5-4fc3-8720-5a5ec873b95e",
+    "text": "Algorithm 3 BINARYSEARCH(f, xΘ, solver)\n1: function BINARYSEARCH(f, xΘ, solver)\n2: if |xΘ| = 1 then\n3: if CHECK(f, xB ∪xΘ, solver) then\n4: xB ←xB ∪xΘ\n5: return\n6: else\n7: xA ←xA ∪xΘ\n8: return\n9: end if\n10: end if\n11: xΦ, xΨ = split(xΘ, 2)\n12: if CHECK(f, xB ∪xΦ, solver) then\n13: xB ←xB ∪xΦ\n14: if CHECK(f, xB ∪xΨ, solver) then\n15: xB ←xB ∪xΨ\n16: else\n17: if |xΨ| = 1 then\n18: xA ←xA ∪xΨ\n19: else\n20: BINARYSEARCH(f, xΨ, solver)\n21: end if\n22: end if\n23: else\n24: if |xΦ| = 1 then\n25: xA ←xA ∪xΦ\n26: else\n27: BINARYSEARCH(f, xΦ, solver)\n28: end if\n29: end if\n30: end function Algorithm 4 Simultaneous Add\n1: Input: model f, input x, candidate set F, current free set A, adversarial procedure ATTACK(,)\nproperty 
P\n2: Initialize: E ←∅ ▷set of necessary features\n3: for i ∈F \\ A do\n4: F′ ←F \\ {i}\n5: if ATTACK(f, Ω(x, F′), P) succeeds then\n6: E ←E ∪{i} ▷i must remain fixed\n7: end if\n8: end for\n9: Return: E Published as a conference paper at ICLR 2026 C.3 ITERATIVE SINGLETON FREEING Algorithm 5 Iterative Singleton Free\n1: Input: model f, input x, candidate set F, free set A, certificate method LIRPA(,) traversal\norder π, property P\n2: repeat\n3: found ←false\n4: for i ∈π with i ∈F \\ A do\n5: if LIRPA(f, Ω(x, A ∪{i}), P) succeeds then\n6: A ←A ∪{i}\n7: found ←true\n8: break ▷restart scan from beginning of π\n9: end if\n10: end for\n11: until found = false\n12: Return: A C.4 RECURSIVE SIMULTANEOUS FREE Algorithm 6 Recursive Abstract Batch Freeing\n1: Input: model f, input x, candidate set F\n2: Initialize: A ←∅ ▷certified free set\n3: repeat\n4: Abest ←∅\n5: for m = 1 . . . |F \\ A| do\n6: Am ←GREEDYABSTRACTBATCHFREEING(f, Ωm(x; A), F \\ A)\n7: if |Am| > |Abest| then\n8: Abest ←Am\n9: end if\n10: end for\n11: A ←A ∪Abest\n12: until Abest = ∅\n13: A = ITERATIVE SINGLETON FREE(f, x, F, A) ▷refine by testing remaining features\n14: Return: A D.1 ILLUSTRATION OF ABDUCTIVE EXPLANATION Figure 4 illustrates a 3D classification task. For the starred sample, we seek an explanation for\nits classification within a local cube-shaped domain. As shown in Figure 5, fixing only feature x2\n(i.e. freeing {x1, x3}, restricting perturbations to the orange plane) is not enough to guarantee the\nproperty, since a counterexample exists. However, fixing both x2 and x3 (orange line on free x1)\ndefines a 'safe' subdomain where the desired property holds true, since no counterexample exists in\nthat subdomain. Therefore, X = {x2, x3} is an abductive explanation. Since neither {x2} nor {x3}\nare explanations on their own, {x2, x3} is minimal. 
But it is not minimum since X = {x1} is also\na minimal abductive explanation with a smaller cardinality.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 65, + "total_chunks": 75, + "char_count": 2618, + "word_count": 499, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "04610dfe-6145-41f8-a23a-2c02f59e50f2", + "text": "Two special cases are worth noting: an\nempty explanation (all features are irrelevant) and a full explanation (the entire input is necessary). If all features are irrelevant, the explanation is the empty set, and no valid explanation exists. Conversely, if perturbing any feature in the input x changes the prediction, the entire input must be fixed,\nmaking the full feature set the explanation.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 66, + "total_chunks": 75, + "char_count": 395, + "word_count": 63, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2c5b2188-ddbb-480d-8eee-6e2e807bc1d3", + "text": "Published as a conference paper at ICLR 2026 Figure 4: A 3D classification task. Figure 5: AXps with different properties. D.2 ILLUSTRATION OF THE KNAPSACK FORMULATION This is an example demonstrating how the greedy heuristic described in Algorithm 1 works. Given\na multi-class classification problem with three classes: 0, 1, and 2. The model correctly predicts\nclass 0 for a given input. 
We want to free features from the irrelevant set A based on the abstract\nbatch certificate. We have three candidate features to free: j1, j2, and j3.",
+    "paper_id": "2603.10661",
+    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
+    "authors": [
+      "Ryma Boumazouza",
+      "Raya Elsaleh",
+      "Melanie Ducoffe",
+      "Shahaf Bassan",
+      "Guy Katz"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
+    "chunk_index": 67,
+    "total_chunks": 75,
+    "char_count": 539,
+    "word_count": 89,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "b5e1b4a4-8072-44b6-85f2-c37a78d5496d",
+    "text": "The baseline budgets for\nthe non-ground-truth classes are: • Class 1: −b1 = 10\n• Class 2: −b2 = 20 The normalized costs for each feature are calculated as ci,j/(−bi): Table 2: Example of Greedy Heuristic Decision Making\nFeature Normalized Cost for Class 1 Normalized Cost for Class 2 Maximum Normalized Cost\n(j) (c1,j/(−b1)) (c2,j/(−b2)) (maxi)\nj1 2/10 = 0.2 8/20 = 0.4 0.4\nj2 7/10 = 0.7 4/20 = 0.2 0.7\nj3 3/10 = 0.3 3/20 = 0.15 0.3 The algorithm's objective is to minimize the maximum normalized cost across all non-ground-truth\nclasses. As shown in the table, the minimum value in the \"Maximum Normalized Cost\" column is\n0.3, which corresponds to feature j3. 
Therefore, the greedy heuristic selects feature j3 to be added\nto the free set in this step, as it represents the safest choice.",
+    "paper_id": "2603.10661",
+    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
+    "authors": [
+      "Ryma Boumazouza",
+      "Raya Elsaleh",
+      "Melanie Ducoffe",
+      "Shahaf Bassan",
+      "Guy Katz"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
+    "chunk_index": 68,
+    "total_chunks": 75,
+    "char_count": 790,
+    "word_count": 141,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "903c67f5-2331-4ff2-b11e-d417eb9668b2",
+    "text": "E.1 MODEL SPECIFICATION We evaluated our framework on standard image benchmarks including the MNIST Yann (2010) and\nGTSRB Stallkamp et al. (2012) datasets. We used both fully connected and convolutional models\ntrained in a prior state-of-the-art VERIX+ Wu et al. (2024b) to perform our analysis. The MNIST dataset consists of 28 × 28 × 1 grayscale handwritten images. The architectures of\nthe fully connected and convolutional neural networks trained on this dataset are detailed in Table\n3 and Table 4, respectively. These models achieved prediction accuracies of 93.76% for the fully\nconnected model and 96.29% for the convolutional model. The GTSRB dataset contains colored images of traffic signs with a shape of 32×32×3 and includes\n43 distinct categories. 
In the models used for our experiments, which were trained by the authors\nof VERIX+, only the 10 most frequent categories were used to mitigate potential distribution shift\nand obtain higher prediction accuracies.",
+    "paper_id": "2603.10661",
+    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
+    "authors": [
+      "Ryma Boumazouza",
+      "Raya Elsaleh",
+      "Melanie Ducoffe",
+      "Shahaf Bassan",
+      "Guy Katz"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
+    "chunk_index": 69,
+    "total_chunks": 75,
+    "char_count": 972,
+    "word_count": 148,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "280fdf89-07c0-4011-948a-235142674d7d",
+    "text": "The architectures of the fully connected and convolutional Table 3: Architecture of the MNIST-FC model. Layer Type Input Shape Output Shape Activation\nFlatten 28 × 28 × 1 784 -\nFully Connected 784 10 ReLU\nFully Connected 10 10 ReLU\nOutput 10 10 - Table 4: Architecture of the MNIST-CNN model. Layer Type Input Shape Output Shape Activation\nConvolution 2D 28 × 28 × 1 13 × 13 × 4 -\nConvolution 2D 13 × 13 × 4 6 × 6 × 4 -\nFlatten 6 × 6 × 4 144 -\nFully Connected 144 20 ReLU\nOutput 20 10 - models trained on GTSRB are presented in Table 5 and Table 6, respectively. These networks\nachieved prediction accuracies of 85.93% and 90.32%, respectively. Table 5: Architecture of the GTSRB-FC model. Layer Type Input Shape Output Shape Activation\nFlatten 32 × 32 × 3 3072 -\nFully Connected 3072 10 ReLU\nFully Connected 10 10 ReLU\nOutput 10 10 - Table 6: Architecture of the GTSRB-CNN model. 
Layer Type Input Shape Output Shape Activation\nConvolution 2D 32 × 32 × 3 15 × 15 × 4 -\nConvolution 2D 15 × 15 × 4 7 × 7 × 4 -\nFlatten 7 × 7 × 4 196 -\nFully Connected 196 20 ReLU\nOutput 20 10 - The CIFAR-10 dataset contains colored images of common objects with a shape of 32 × 32 × 3 and\nincludes 10 distinct categories. The architecture of the ResNet-2B model used is detailed in Table\n7. This model (sourced from the Neural Network Verification Competition (VNN-COMP) Wang\net al. (2021)) is a compact residual network benchmark designed for neural network verification\non CIFAR-10.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 70, + "total_chunks": 75, + "char_count": 1510, + "word_count": 287, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ac44deec-4ca4-483e-a830-9c0ddc4fa397", + "text": "Intended to help verification tools evolve beyond simple feedforward networks, this\nmodel was adversarially trained with an L∞perturbation epsilon of 2/255. Table 7: Architecture of the ResNet-2B model (CIFAR-10). Layer Type Input Shape Output Shape Activation\nReshape 3072 32 × 32 × 3 -\nConvolution 2D 32 × 32 × 3 16 × 16 × 8 ReLU\nResidual Block (Downsample) 16 × 16 × 8 8 × 8 × 16 ReLU\nResidual Block 8 × 8 × 16 8 × 8 × 16 ReLU\nFlatten 8 × 8 × 16 1024 -\nFully Connected 1024 100 ReLU\nOutput 100 10 - Published as a conference paper at ICLR 2026 E.2 DETAILED EXPERIMENTAL SETUP We configured the VERIX+ implementation with the following settings: binary search=true,\nlogit ranking=true, and traversal order=bounds. 
To identify necessary features, we used the Fast\nGradient Sign (FGS) technique for singleton attack addition, though the Projected Gradient Descent\n(PGD) is also available for this purpose. We performed a comprehensive sensitivity analysis covering: (1) Solver Choice: Table 1 shows the\nGreedy heuristic finds explanations nearly identical in size to the optimal MILP solver (gap < 9\nfeatures), validating its near-optimality. (2) Cardinality Constraints: Figure 4 confirms that using\nthe constraint (card=True) yields significantly smaller explanations. (3) Perturbation Magnitude (ϵ):\nWhile we adhered to standard baseline values used by the baseline VERIX+ (e.g., 0.05 for MNIST,\n0.01 for GTSRB) to ensure a direct and fair comparison, we acknowledge that explanation size is\ninversely related to ϵ, as larger radii result in looser bounds. E.3 SUPPLEMENTARY EXPERIMENTAL RESULTS PERFORMANCE WITH ITERATIVE REFINEMENT The three plots compare the performance of a greedy heuristic with an exact MILP solver for an iterative refinement task. The central\nfinding across all three visualizations is that the greedy heuristic provides a strong trade-off between\nspeed and solution quality, making it a more practical approach for large-scale problems. Figure 6: Performance Comparison of FAME's Abstract Batch Freeing Methods. These three\nplots compare the greedy heuristic against the exact MILP solver for the iterative refinement task\nfor all the models. The first plot shows the runtime comparison of the two methods on a log-log\nscale. 
The second plot compares the size of the freed feature set for both methods.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 71, + "total_chunks": 75, + "char_count": 2331, + "word_count": 371, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "74d531db-cadf-4f3d-96ee-2de2bcb3f0a0", + "text": "The third plot\nillustrates the distribution of the optimality gap (MILP size - Greedy size). Analysis of FAME's Abstract Batch Freeing The visualizations demonstrate that the greedy\nheuristic provides a strong trade-off between speed and solution quality for the iterative refinement\ntask. • Runtime Performance: As shown in the first plot, the greedy algorithm is consistently\nfaster than the MILP solver. This is evidenced by the data points for all models lying\nsignificantly below the diagonal line, confirming a substantial gain in runtime. • Solution Quality: The second plot shows that the greedy algorithm produces solutions of\ncomparable quality to the optimal MILP solver. The tight clustering of data points along\nthe diagonal line for all models indicates a strong correlation between the sizes of the freed\nfeature sets. • Optimality Gap: The histogram of the final plot reinforces these findings by showing that\nthe greedy heuristic frequently achieves the optimal solution, with the highest frequency\nof samples occurring at a gap of zero. 
The distribution further confirms that any suboptimality is typically minimal.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 72, + "total_chunks": 75, + "char_count": 1133, + "word_count": 174, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2a521847-071b-410c-ada3-d1226e058bf9", + "text": "Published as a conference paper at ICLR 2026 F SCALABILITY ANALYSIS ON COMPLEX ARCHITECTURES (RESNET-2B ON\nCIFAR-10) To validate the scalability of FAME on architectures significantly deeper and more complex than\nstandard benchmarks, we conducted an evaluation on the ResNet-2B model (2 residual blocks, 5\nconvolutional layers, 2 linear layers) trained on the CIFAR-10 dataset Wang et al. (2021). We\nutilized an L∞perturbation budget of ϵ = 2/255. These additional experiments were conducted on\na server equipped with an NVIDIA A100 80GB GPU. For these experiments, we define the feature set F at the pixel level. Consequently, the total number of features is N = 32 × 32 = 1024. Freeing a feature in this context\ncorresponds to simultaneously relaxing the constraints on all three color channels (RGB) for that\nspecific pixel location. Feasibility and Comparison. Running exact formal explanation methods (such as the complete\nVERIX+ pipeline with Marabou) on this architecture resulted in consistent timeouts or memory\nexhaustion, confirming that exact minimality is currently out of reach for this complexity class. 
In\ncontrast, FAME successfully terminated for all processed samples.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 73, + "total_chunks": 75, + "char_count": 1187, + "word_count": 181, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3eb6607f-329f-406e-93a9-7bd62d2dc6a9", + "text": "Detailed Quantitative Results by Configuration. To rigorously assess the contribution of each\ncomponent in the FAME framework, we evaluated three configurations (N = 100 samples). The\nresults are summarized below: • Single-Round Abstract Freeing (Algorithm 1 only). This baseline represents a static\napproach without domain refinement.", + "paper_id": "2603.10661", + "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks", + "authors": [ + "Ryma Boumazouza", + "Raya Elsaleh", + "Melanie Ducoffe", + "Shahaf Bassan", + "Guy Katz" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10661v1", + "chunk_index": 74, + "total_chunks": 75, + "char_count": 335, + "word_count": 46, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1ac691d2-ee81-4fda-9310-d87d6d4e0caa", + "text": "– Performance: It freed an average of only 5.53 features (pixels).\n– Insight: This confirms that on deep networks, the initial abstract bounds are too loose\nto certify meaningful batches in a single pass. A static traversal strategy would fail\nhere.\n– Solver Comparison: The Greedy heuristic (5.53 features, 50.8s) performed identically\nto the optimal MILP solver (5.37 features, 50.8s), validating the heuristic's quality. • Recursive Abstract Refinement (Algorithm 5). 
This configuration enables the iterative\ntightening of the domain Ωm(x; A). – Performance: The average number of freed features jumped to 476.38 pixels (approx\n46% of the image).\n– Insight: This dramatic increase (from ∼5 to ∼477) proves that the adaptive abstraction mechanism is critical. By iteratively constraining the cardinality, FAME recovers\nfeatures that were previously masked by over-approximation.\n– Solver Comparison: Remarkably, even in this complex iterative setting, the Greedy\napproach (size 476.38) remained extremely close to the optimal MILP solution (size\n477.76), with a negligible gap of < 0.4%. This strongly justifies using the faster\nGreedy heuristic for scalability.\n– Runtime: The average runtime for this intensive recursive search was approximately\n1934.94 seconds (∼32 minutes). • Full Pipeline (Iteration + Singleton Refinement). This represents the final output of the\ncomplete FAME pipeline, including final safety checks and singleton refinement. – Explanation Compactness: The pipeline successfully certified a robust explanation\nwith an average of 240.84 freed features (pixels) across the full dataset.\n– Efficiency: The breakdown confirms that FAME can navigate the search space of\ndeep networks where exact enumerations fail, producing sound abstract explanations\n(WAXpA) significantly faster than the timeout threshold of exact solvers. 
Discussion and Future Directions.",
+    "paper_id": "2603.10661",
+    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
+    "authors": [
+      "Ryma Boumazouza",
+      "Raya Elsaleh",
+      "Melanie Ducoffe",
+      "Shahaf Bassan",
+      "Guy Katz"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
+    "chunk_index": 75,
+    "total_chunks": 75,
+    "char_count": 1882,
+    "word_count": 270,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "305d1375-7be7-4484-a57c-a03ac364e6ed",
+    "text": "While the computational cost (∼32 mins) is higher than for\nsmaller models, these results establish that the Abstract Batch Certificate (Φ) and recursive refinement scale mathematically to residual connections without theoretical blockers. The gap between the abstract explanation size and the true minimal explanation is driven primarily by the looseness\nof the abstract bounds (LiRPA CROWN) on deep networks. Future work integrating tighter abstract\ninterpretation methods (e.g., α-CROWN) into the FAME engine will directly improve these results. G DISCLOSURE: USAGE OF LLMS An LLM was used solely as a writing assistant to correct grammar, fix typos, and enhance clarity. 
It played no role in generating research ideas, designing the study, analyzing data, or interpreting\nresults; all of these tasks were carried out exclusively by the authors.",
+    "paper_id": "2603.10661",
+    "title": "FAME: Formal Abstract Minimal Explanation for Neural Networks",
+    "authors": [
+      "Ryma Boumazouza",
+      "Raya Elsaleh",
+      "Melanie Ducoffe",
+      "Shahaf Bassan",
+      "Guy Katz"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10661v1",
+    "chunk_index": 76,
+    "total_chunks": 75,
+    "char_count": 876,
+    "word_count": 131,
+    "chunking_strategy": "semantic"
+  }
+]
\ No newline at end of file
diff --git a/data/chunks/2603.10676_semantic.json b/data/chunks/2603.10676_semantic.json
new file mode 100644
index 0000000000000000000000000000000000000000..0c240ca97317e542368440c1214759e3f6d6b9aa
--- /dev/null
+++ b/data/chunks/2603.10676_semantic.json
@@ -0,0 +1,1742 @@
+[
+  {
+    "chunk_id": "52dee28a-cedb-4325-bf9e-c72039ddf437",
+    "text": "Kosti Koistinen Kirsi Hellsten Joni Herttuainen\nAalto University School of Science Aalto University School of Science Aalto University School of Science\nComputer Science Department Computer Science Department Computer Science Department\nP.O.Box 11000, 00076 P.O.Box 11000, 00076 P.O.Box 11000, 00076\nAALTO, Finland AALTO, Finland AALTO, Finland\nkosti.koistinen@aalto.fi kirsi.hellsten@aalto.fi joni.herttuainen@aalto.fi Kimmo K. Kaski\nAalto University School of Science Computer Science Department\nP.O.Box 11000, 00076\nAALTO, Finland\nkimmo.kaski@aalto.fi March 12, 2026 ABSTRACT Industrial Control Systems (ICS) underpin critical infrastructure and face growing cyber–physical\nthreats due to the convergence of operational technology and networked environments. 
While\nmachine learning–based anomaly detection approaches in ICS show strong theoretical performance,\ndeployment is often limited by poor explainability, high false-positive rates, and sensitivity to\nevolving system behavior, i.e., baseline drifting. We propose a Spatio-Temporal Attention Graph\nNeural Network (STA-GNN) for unsupervised and explainable anomaly detection in ICS that models\nboth temporal dynamics and relational structure of the system. Sensors, controllers, and network\nentities are represented as nodes in a dynamically learned graph, enabling the model to capture\ninter-dependencies across physical processes and communication patterns. Attention mechanisms\nreveal influential relationships, supporting inspection of correlations and potential causal pathways\nbehind detected events. The approach supports multiple data modalities, including SCADA point\nmeasurements, network flow features, and payload features, and thus enables unified cyber–physical\nanalysis. To address operational requirements, we incorporate a conformal prediction strategy to\ncontrol false alarm rates and monitor performance degradation under drifting of the environment. Our findings highlight the possibilities and limitations of model evaluation and common pitfalls\nin anomaly detection in ICS. Our findings emphasise the importance of explainable, drift-aware\nevaluation for reliable deployment of learning-based security monitoring systems. Modern societies rely on uninterrupted functioning of interconnected critical infrastructure, such as electric power grids,\nwater treatment plants, and manufacturing systems [1]. A disruption in these Operational Technology (OT) systems can\ncascade into severe economic, social, and physical consequences, from prolonged power outages to contaminated water\nsupplies. 
Over the past decade, cyberattacks such as Stuxnet [2], Industroyer [3] and the Colonial Pipeline incident [4]\nhave demonstrated that threats once limited to Information Technology (IT) networks can now directly impact the\nphysical world, such as equipment damage or even threats to human life [5]. During the past decade, cyberattacks on\nOT networks have been reported to have increased five fold from 300 annually to 1600 [6].", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 1, + "total_chunks": 87, + "char_count": 3026, + "word_count": 387, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "bfd5edd1-cebb-4110-8f50-25218099970d", + "text": "The actual scale is likely to\nbe significantly higher, as many OT intrusions remain unreported or undiscovered due to limited monitoring capabilities. A PREPRINT - MARCH 12, 2026", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 2, + "total_chunks": 87, + "char_count": 178, + "word_count": 28, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "eaa08045-5a39-4e43-82c0-83a4bff65d74", + "text": "By 2024, operational disruption had become routine: 50–75% of ransomware incidents caused partial shutdowns, and\napproximately 25% resulted in complete production stoppages, causing significant financial damage [6]. 
Industrial Control Systems (ICS) form the technological backbone of critical infrastructure and are often the target of\ncyberattacks against OT systems. They regulate physical processes through sensors, actuators, and Programmable Logic\nControllers (PLCs) and are maintained through Supervisory Control and Data Acquisition (SCADA). Traditionally\nisolated from external networks, the OT systems once enjoyed a degree of \"security by separation.\" However, the\nshift toward networked automation, remote management, and the Industrial Internet of Things (IIoT) has converged\nIT and OT networks, allowing adversaries to move laterally from corporate systems to industrial environments. This\ndevelopment has exposed ICS environments to a wide spectrum of cyber threats [7]. In addition, OT environments often\nrely on legacy hardware, strict safety protocols, and systems that cannot be easily updated or patched, which further\nincreases their vulnerability to cyberattacks. ICS threats range from malware infections and ransomware to unauthorised remote access, data manipulation, and\nprocess disruption. In many cases, attackers exploit vulnerabilities in outdated software, weak authentication, or\ninsecure network configurations that were never designed with cybersecurity in mind. Typical weaknesses include\noutdated protocols that allow unintended access or manipulation of control traffic. The types of attacks are commonly\ndivided into network-based and physical-based attacks [5]. The former includes Denial of Service (DoS), injection, and\nMan-in-the-Middle attacks, while the latter include stealth attacks, data tampering, and damage attacks. To detect and mitigate these complex attacks, Intrusion Detection Systems (IDS) are widely used in modern industrial\ncybersecurity. An IDS typically consists of the monitoring, pre-processing, and detection phases [8]. 
Among various\ndetection approaches, such as signature-based, rule-based, and hybrid-based, anomaly-detection-based IDS have gained\nsignificant attention for their ability to learn normal operational behavior and discover deviations that may signal attacks,\nintrusions, or malfunctions [9]. In OT networks, this capability is crucial, as anomalies are often subtle irregularities\nrather than clear malicious signatures.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 3, + "total_chunks": 87, + "char_count": 2502, + "word_count": 332, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ef9e6aed-888c-48c1-b073-c0c9846dd2d9", + "text": "They may appear as small fluctuations in sensor data, unexpected timing patterns,\nunusual command sequences, or deviations in process variables that remain within protocol limits but still indicate\nunsafe or suspicious behavior. There are various approaches that have been applied to detect and prevent cyber intrusions, including statistical modeling,\nBayesian inference, and rule-based systems. These methods often rely on predefined assumptions about normal and\nabnormal behavior [10]. However, as industrial systems become more complex and dynamic, such fixed models struggle\nto capture the nonlinear and time-varying nature of real-world operations [11]. In contrast, machine learning–based\napproaches have attracted widespread interest for their ability to automatically learn patterns from data and adapt\nto evolving system behavior [9]. These methods can uncover correlations between multiple variables, making them\nparticularly suitable for anomaly detection in ICS. 
Traditional machine learning approaches include, for example, k-nearest neighbors, Random Forests, and Support Vector Machines [12]. However, despite their efficiency in classification, these methods are insufficient to model the temporal dependencies that are inherent in OT traffic. They are also sensitive to imbalanced datasets, such that a new, unseen anomaly often remains undetected. Furthermore, in most OT environments, the majority of traffic is benign, while only a small fraction represents attacks.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 4,
    "total_chunks": 87,
    "char_count": 1482,
    "word_count": 203,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "427bb19a-2137-4d7f-a836-4f1a93450d81",
    "text": "This imbalance can lead to biased classifiers that fail to detect rare but critical anomalies. To overcome these limitations, deep learning approaches have emerged as a promising solution. Autoregressive architectures such as Long Short-Term Memory (LSTM) networks [13] and other recurrent neural network (RNN) approaches [14] can capture complex temporal patterns among variables, which allows the model to understand how changes in one part of the system influence the rest. However, the performance of these models also suffers from unbalanced data. Some other methods include autoencoders, Generative Adversarial Networks, and mixtures of these models, but all have similar limitations. More recently, transformer-based architectures have become popular because of their ability to model long-range dependencies using self-attention mechanisms. 
Their success in natural language\nprocessing has motivated research into their application for anomaly detection in time-series and network data, where\nsequential dependencies and contextual relationships are crucial. Transformer models, and particularly adaptations of\nlanguage models, show the potential to capture complex semantic patterns in network traffic representations [15]. Beyond and within sequential approaches, graph-based deep learning provides a fundamentally different way to\nrepresent and analyse OT systems. By modeling the system as a graph, where nodes represent entities (such as\ndevices or sensors) and edges represent their relationships or communications, a more realistic and structured view\nof the environment can be obtained. Graph-based models are able to uncover non-linear correlations and long-range\ndependencies that traditional time-series or tabular approaches often miss. Graph Neural Networks (GNNs), such as\nGraph Convolutional Networks (GCNs) [16] and Graph Attention Networks (GATs) [17], exploit this representation\nby learning how information propagates through the network structure. GCNs aggregate neighborhood information A PREPRINT - MARCH 12, 2026", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 5, + "total_chunks": 87, + "char_count": 2042, + "word_count": 279, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "780098a8-dc43-42ac-aee9-c738c442df8c", + "text": "to capture local dependencies, while GATs extend this approach by applying attention mechanisms to weigh the\nimportance of different connections dynamically. 
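As an illustrative aside (not code from the paper), the neighbourhood aggregation behind GCNs and the attention-weighted variant behind GATs can be sketched in a few lines of NumPy; the toy graph, feature values, and the attention vectors `a_src`/`a_dst` below are invented for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gcn_aggregate(H, A):
    # GCN-style: average the features of each node's neighbours (plus itself)
    A_hat = A + np.eye(A.shape[0])
    return (A_hat / A_hat.sum(axis=1, keepdims=True)) @ H

def gat_aggregate(H, A, a_src, a_dst, slope=0.2):
    # GAT-style: logits e_ij = LeakyReLU(a_src . h_i + a_dst . h_j),
    # masked to the graph's edges, then softmax-normalised per node
    logits = (H @ a_src)[:, None] + (H @ a_dst)[None, :]
    logits = np.where(logits > 0, logits, slope * logits)  # LeakyReLU
    mask = (A + np.eye(A.shape[0])) > 0
    logits = np.where(mask, logits, -np.inf)
    alpha = softmax(logits, axis=1)                        # attention weights
    return alpha @ H

# toy 3-node graph: node 0 connected to nodes 1 and 2
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
out = gat_aggregate(H, A, a_src=np.array([0.1, 0.2]), a_dst=np.array([0.3, -0.1]))
```

The only difference between the two aggregators is that GAT replaces the uniform neighbour weights of GCN with learned, edge-specific attention weights.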
Through these formulations, GNNs can effectively model both the topology and interactions within the system, and thus enable more accurate detection of anomalous behavior that may emerge across multiple interconnected entities. For a more detailed review of current methods, see [18]. Although these methods harness state-of-the-art machine learning, several issues remain to be considered before they can be used for anomaly detection in ICS, such as the lack of open high-quality datasets for research and high false alarm rates (for other challenges, see [19]). In addition, deep learning introduces challenges of its own, related to its explainability and interpretability. As models become more complex and rely on multilayer architectures, their internal decision-making processes become opaque. In ICS environments, operators must understand why an alert was triggered, and this lack of transparency creates a significant barrier to adoption. Explainability methods aim to improve the trustworthiness, interpretability, and accountability of machine learning models by providing human-understandable insights into how they reach their conclusions [20]. However, the application of explainability techniques to ICS remains a challenge. First, ICS traffic is often high-dimensional and highly contextual, making it difficult to map model outputs to meaningful operational features. Second, many explainability tools are computationally expensive or unstable when applied to time-series or graph-based deep learning models. Third, explanations must be not only technically accurate but also domain-relevant, i.e., operators need actionable insights, not abstract attributions. As a result, despite significant progress, current explainability solutions often do not meet the stringent requirements of industrial environments. 
More research is needed to develop lightweight, reliable, and domain-aware explainability mechanisms that can support real-time decision-making and foster operator trust in AI-driven anomaly detection. To address the aforementioned challenges, we propose an unsupervised GNN-based framework that uses graph-oriented machine learning for explainable anomaly detection in ICS.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 6,
    "total_chunks": 87,
    "char_count": 2354,
    "word_count": 316,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e61f4f3f-4ad0-410d-9691-577619012376",
    "text": "The model constructs a graph representation of the system that enables learning of the relationships among sensors, actuators, and process variables. Within this architecture, attention mechanisms are employed to extract the most influential dependencies in the graph, allowing the model to focus on critical interactions during anomaly scoring. By examining the resulting correlation structures, we analyse how the model's learned relationships align with known causal dependencies in the industrial process. This facilitates transparent system-level anomaly detection, which traditional models might overlook. Furthermore, the framework is tunable, as it can operate on SCADA-point data for point-level anomaly detection, on netflow data for passive network monitoring, or on both modalities simultaneously through a multimodal configuration. This study is organised such that in Section 2 we discuss related works in the field of Explainable AI. Next, in Section 3 we introduce our model architecture and evaluation strategy. 
In Section 4 we present the data we use for benchmarking the model, followed by the results and an analysis of the acquired graph representations in Section 5. Then in Section 6 we discuss the methodological and practical issues encountered during the analysis and reflect more broadly on common issues in machine learning anomaly detection. In Section 7 we draw conclusions and outline possible directions for future work. In this section, we provide a short review of the most relevant work on explainable artificial intelligence (XAI). Although the literature on XAI is extensive (see e.g., [21,22]), only recently have cybersecurity and IDS applications begun to receive dedicated attention. Here, our aim is to highlight works that explore explainability specifically for non-experts and experts in IDS and OT environments. Explainable AI as a field emerged formally in 2004 [23], but its development accelerated significantly in the last decade alongside the rise of deep learning. The \"black box\" nature of deep learning models has spurred interest in trustworthy and explainable AI in various fields, e.g., in medical sciences, finance, and autonomous systems [24].",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 7,
    "total_chunks": 87,
    "char_count": 2219,
    "word_count": 331,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "dfac526a-4d78-400c-a676-ec16b38e1d34",
    "text": "A widely accepted taxonomy categorises XAI methods into intrinsic (ante-hoc) and post-hoc models [25]. Intrinsic explanations arise directly from the model architecture through weights, rule structures, or built-in interpretability constraints. 
The model itself is designed to be transparent; examples include interpretable classifiers and regressors. In contrast, post-hoc approaches aim to explain the model's outputs via various tactics. Early contributions include game-theoretic approaches, of which SHAP explanations are the most popular for explaining feature importance. Another popular type of post-hoc approach includes gradient- and decomposition-based techniques, where backpropagated gradients are modified or analysed to attribute importance [26]. Other examples include perturbation-based explanations [27,28]. The latter raise an important point: most XAI methods are designed for the supervised setting, while in most real-world ICS systems labeled data are an unrealistic assumption. The authors provided an unsupervised fine-tuning module that could be applied to problematic features, allowing for model adjustment without exhaustive re-training.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 8,
    "total_chunks": 87,
    "char_count": 1178,
    "word_count": 155,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "08af1cef-42a8-440e-9f75-436ae874599a",
    "text": "The XAI literature for graph deep learning includes several post-hoc explanation techniques designed to interpret the predictions of GNNs. Many of these approaches rely on graph masking, where the goal is to learn masks on edges, nodes, or features to identify the substructures most influential in a model's decision. One of the most widely cited methods is GNNExplainer [29], a model-agnostic explainer that is applicable to any GNN architecture. 
By optimizing soft masks on edges and features, GNNExplainer extracts subgraph-level explanations that identify the key structural components and node attributes driving the model's output. This method has been adopted in cybersecurity contexts, including IDS, as demonstrated in [30]. A related method is PG-Explainer [31], which differs from GNNExplainer by training a parametric explanation network that generalises across instances rather than optimizing a mask separately for each prediction. This optimization strategy improves scalability and stability while retaining the ability to identify influential edges. PG-Explainer has been utilised in IDS research, for example, in [32]. In OT environments, the application of graph explainers is much more limited. A notable exception is KEC (K-hop Explanation with Convolutional Core) [33], which was applied to anomaly detection on the SWaT industrial control benchmark dataset [34]. Unlike the masking-based paradigm, KEC constructs a surrogate linear model that approximates the local behavior of the GNN and derives explanations through gradient-based attribution. The authors introduce a formal notion of faithfulness, a measure of how well an explainer preserves model behavior, and show that KEC achieves higher faithfulness than existing explainers. A common challenge among GNN explanation methods is that many of them provide partial explanations, focusing only on one dimension—edges, nodes, or features—without offering a unified view. The ILLUMINATI framework [35] addresses this limitation by producing comprehensive explanations that consider the contributions of node importance, edge importance, and node attributes together.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. 
Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 9,
    "total_chunks": 87,
    "char_count": 2164,
    "word_count": 309,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ad8cc4da-8afc-423b-b820-0bf107d6fc9a",
    "text": "Designed specifically for cybersecurity use cases, it extends traditional masking approaches with a richer explanatory structure. Comparative evaluations of GNN explainers for IDS have also emerged. For example, recent work in [36] finds that GraphMask [37] performs particularly well for DoS and DDoS attack detection, outperforming other explainers in robustness and interpretability. However, despite this promising result, we did not find substantial evidence of GraphMask being applied more broadly in IDS or OT-focused literature. Finally, a branch of graph deep learning approaches uses attention mechanisms [38] as a tool for generating explanations. An attention mechanism allows a model to assign different importance weights to different nodes or edges, highlighting which relationships it considers most relevant during prediction. Graph Attention Networks (GATs) are built on this idea, using attention to reveal correlations between learned embeddings [17]. Some models, including the Graph Deviation Network (GDN) [39], which also inspires the present study, apply attention mechanisms to time series to identify variable-level dependencies and highlight anomalous patterns. This approach captures localised deviations in sensor behavior using both structural relationships and temporal dynamics within OT systems. A very recent approach, PCGAT [40], extends attention-based reasoning by modeling ICS through multi-level physical-process and controller graphs, enabling both anomaly detection and anomaly localization via attention patterns. The authors highlight several limitations of typical attention-based methods. 
They argue that attention weights learned purely\nfrom data do not necessarily correspond to the true causal or physical relationships in ICS, and therefore may produce\nexplanations that are misleading from an operational perspective. This can create difficulties in identifying the actual\nsources of anomalies and understanding how they propagate through the system. Furthermore, they claim that many\nexisting GAT-based anomaly detectors rely on unrealistic fully connected sensor graphs, resulting in high computational\ncost, redundancy, and limited interpretability. These models also fail to incorporate the hierarchical and process-driven\nstructure of ICS, reducing their reliability and diminishing the usefulness of attention weights as explanations.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 10, + "total_chunks": 87, + "char_count": 2402, + "word_count": 324, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "74c9544b-0b52-4680-a40e-5260cd03a57d", + "text": "In short,\nthe complexity of graph-based deep learning introduces several challenges, which our study seeks to address. Here we propose the Spatio-Temporal Attention Graph Neural Network (STA-GNN), designed to capture both\ntemporal dependencies and dynamic spatial correlations among sensors or devices (henceforth entities) in multivariate\ntime series data. The STA-GNN is inspired by the Graph Deviation Network [39] and Graph Attention Network [17],\nwith several modifications combining temporal attention mechanisms with an adaptive graph construction strategy that\nlearns context-dependent relationships between entities. 
In this section, we explain the model architecture and anomaly detection methodology; the model framework is illustrated in Fig. 1. Each node in the graph corresponds to an entity and is associated with a multivariate feature vector. Specifically, at each timestep t, entity i is represented by a feature vector x_{t,i} ∈ R^F, where F denotes the number of observed variables for that entity (e.g. continuous measurements and Boolean indicators). Over a sliding window of length W, the input tensor therefore takes the form X ∈ R^{B×W×N×F}, where N is the number of entities and B is the batch size. Figure 1: A schematic overview of the STA-GNN model architecture. The workflow illustrates the processing stages from input windows to the decoder producing predictions. The intermediate blocks employ a two-phase attention mechanism that generates two complementary graphs, enabling inspection of the model's decision making. In the model, the nodes are treated as feature-bearing entities whose representations are progressively transformed into latent embeddings that jointly encode temporal dynamics and inter-dependencies. The model first applies a linear projection at each timestep t: H_t = Linear(X_t) + P_t, (1) where P_t represents a learnable positional embedding for the timestep t ∈ {1, . . . , W} that encodes the temporal order within the observation window. Next, we go through in detail the stages of the anomaly detection process from the input window to the temporal, spatial, and decoder blocks of the STA-GNN model architecture. 
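As a minimal sketch (not the paper's implementation), the windowed input tensor and the projection of Eq. (1) can be expressed in NumPy; the dimensions B, W, N, F, d and the random initialisation below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
B, W, N, F, d = 2, 8, 4, 3, 16      # batch, window, entities, features, embed dim (illustrative)

X = rng.normal(size=(B, W, N, F))            # sliding-window input tensor
W_in = rng.normal(size=(F, d)) / np.sqrt(F)  # shared linear projection
P = rng.normal(size=(W, d)) * 0.02           # learnable positional embeddings, one per timestep

# Eq. (1): H_t = Linear(X_t) + P_t, broadcast over batch and entities
H = X @ W_in + P[:, None, :]
```

The positional embedding is shared across batch elements and entities but distinct per timestep, which is what the broadcast `P[:, None, :]` expresses.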
To model temporal dependencies, each node's time series within the observation window is processed by a multi-head self-attention mechanism (MHA), inspired by the Transformer architecture and originally developed for natural language processing [38].",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 11,
    "total_chunks": 87,
    "char_count": 2437,
    "word_count": 365,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "51df982a-0f3a-41e5-a104-ed53f7821299",
    "text": "MHA enables each timestep in a node's sequence to attend to every other (past) timestep within the window, allowing the model to capture both short-term fluctuations and long-range temporal dependencies without relying on recurrence. In practice, we apply causal masking in the temporal attention module so that a timestep cannot attend to future observations, preventing information leakage or data snooping. Formally, given the linear projection H_t for a single entity, the attention module constructs representations for query Q, key K and value V through learned linear projections: Q_t = W_Q H_t, K_t = W_K H_t, V_t = W_V H_t, (2) where W_Q, W_K, W_V ∈ R^{d×d} are learnable parameter matrices, and d denotes the embedding dimension of the latent representation. The linear projection operates across the feature dimension F. The attention weights are computed as scaled dot-products between queries and keys: α_t = softmax(Q_t K_t^⊤ / √d), (3) which measures the degree of relevance between every pair of timesteps. 
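As an illustrative sketch (not code from the paper), the causal scaled dot-product attention of Eqs. (2)-(3) for a single head and entity can be written in NumPy; the window length, dimension d, and the random weights below are assumptions made for the example:

```python
import numpy as np

def causal_self_attention(H, Wq, Wk, Wv):
    """Eqs. (2)-(3) for one entity: H is (W, d), the projected window."""
    W_len, d = H.shape
    Q, K, V = H @ Wq, H @ Wk, H @ Wv                 # eq. (2)
    logits = Q @ K.T / np.sqrt(d)                    # scaled dot products
    # causal mask: timestep t may attend only to timesteps <= t
    future = np.triu(np.ones((W_len, W_len), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    alpha = e / e.sum(axis=-1, keepdims=True)        # eq. (3) attention weights
    return alpha @ V, alpha                          # weighted sum of values

rng = np.random.default_rng(1)
d = 4
H = rng.normal(size=(6, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, alpha = causal_self_attention(H, Wq, Wk, Wv)
```

The mask sets every future logit to negative infinity before the softmax, so each row of `alpha` places zero weight on later timesteps, which is exactly the no-leakage property described above.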
These weights are then used to form a weighted sum of the value vectors: H′_t = α_t V_t, (4) producing an updated temporal representation in which each timestep encodes information aggregated from all others. To capture seasonal, weekly, and daily fluctuations, multiple attention heads are used in parallel, each operating on a different subspace of the embedding dimension. The outputs of these heads are linearly combined: H′ = MHA(Q_t, K_t, V_t) = Concat(head_1, . . . , head_h) W_O, (5) where W_O ∈ R^{(h·d_h)×d} projects the concatenated result back to the model dimension. The resulting representation is then aggregated across timesteps (e.g., via mean pooling) and normalised through a Layer Normalization (LN) operation, yielding the final temporally encoded features H = LN((1/W) Σ_{t=1}^{W} H′[t]), (6) where H ∈ R^{B×N×d} represents the temporally contextualised embedding for each entity. This tensor H is the output of the temporal feature extractor and serves as the input to the subsequent spatial attention stage, which models the inter-entity dependencies. Unlike conventional GNNs that rely on static graphs, the STA-GNN constructs dynamic spatial graphs based on both the contextual similarity S_ctx and the static similarity S_st. For each sample b, the dynamic contextual similarity is computed from the temporally encoded features as S^(b)_ctx = H_b H_b^⊤, (7) where H_b ∈ R^{N×d} denotes the slice of H corresponding to the batch element b.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. 
Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 12,
    "total_chunks": 87,
    "char_count": 2440,
    "word_count": 386,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ddd9190b-94cf-4dd8-84e6-0ad8ae76d1d2",
    "text": "In addition to the dynamic similarity, the model also supports an optional external static prior graph A_static ∈ R^{N×N}, which can encode domain knowledge about entity connectivity, physical topology, or known relationships. When provided, the entries of A_static are normalised and incorporated directly as the static similarity term. If no external graph is supplied, the model instead learns a static entity embedding matrix E ∈ R^{N×d}, from which a static similarity is constructed as S_st = EE^⊤, (8) which corresponds to a (scaled) cosine similarity after ℓ2-normalization of the rows of the embedding matrix E. If a prior is introduced, S_st is instead set to the normalised values of A_static. The combined similarity matrix is then given by S^(b) = S^(b)_ctx + λ S_st, (9) where λ is a learnable scaling parameter. The model thus learns to adaptively balance dynamic contextual dependencies and static structural patterns. To propagate information among entities, the model applies another attention mechanism over the temporally encoded representations H to capture spatial dependencies. In this phase, the queries, keys, and values are newly projected from H using distinct learnable matrices W_Q^(sp), W_K^(sp), W_V^(sp), which allow each entity to attend to all others based on their recent temporal behavior. We employ multi-head scaled dot-product attention over entities (rather than a GAT-style additive attention with a LeakyReLU nonlinearity). Concretely, for each head, queries, keys, and values are obtained as Q_sp = W_Q^(sp) H, K_sp = W_K^(sp) H, V_sp = W_V^(sp) H, (10) and the attention logits are computed via scaled dot-products between entities. 
The resulting attention scores are modulated by the similarity prior S^(b), yielding the dynamic attention matrix A^(b) = softmax(Q^(b)_sp K^(b)⊤_sp / (√d · T) + S^(b)), (11) where T is a learnable temperature parameter controlling the sharpness of attention. To enhance sparsity and interpretability, only the top-k most relevant neighbors (i.e., those with the highest attention weights) are kept for each node, ensuring efficient message passing and reducing noise from weak connections. For multi-head attention, this procedure is applied independently per head; the resulting attention weights can be averaged across heads for interpretability.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 13,
    "total_chunks": 87,
    "char_count": 2311,
    "word_count": 357,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "af3adcc3-4bc0-4512-b9c6-12ffb1b2aebf",
    "text": "The spatially constructed features for each sample b and entity i are then given by H^(sp)_{b,i,:} = Σ_{j=1}^{N} A^(b)_{i,j} V_{b,j,:} + β H_{b,i,:}, (12) or, in matrix form, H^(sp) = AV + βH, (13) where β is a residual weighting factor. Thus, each H^(sp)_{b,i,:} is a learned spatio-temporal feature vector for entity i, obtained as an attention-weighted aggregation of its neighbors' value embeddings plus a residual contribution from its own temporal representation. The resulting tensor H^(sp) ∈ R^{B×N×d} encodes both temporal and spatial dependencies for each entity. Finally, the normalised representations are passed through a fully connected multilayer perceptron (MLP) decoder applied independently to each entity. 
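The spatial stage just described, Eqs. (7)-(13), can be sketched for a single batch element in NumPy; this is an illustrative simplification (identity Q/K/V projections, single head, invented toy inputs), not the paper's implementation:

```python
import numpy as np

def spatial_attention(H, E, lam=0.5, temperature=1.0, beta=0.5, top_k=2):
    """Sketch of eqs. (7)-(13) for one batch element: H is (N, d) temporally
    encoded entities, E is (N, d) static entity embeddings."""
    N, d = H.shape
    S_ctx = H @ H.T                                    # eq. (7): contextual similarity
    E_n = E / np.linalg.norm(E, axis=1, keepdims=True)
    S_st = E_n @ E_n.T                                 # eq. (8): static similarity
    S = S_ctx + lam * S_st                             # eq. (9): combined prior
    Q, K, V = H, H, H                                  # identity projections for brevity
    logits = Q @ K.T / (np.sqrt(d) * temperature) + S  # eq. (11) logits
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    A = e / e.sum(axis=1, keepdims=True)
    # keep only each node's top-k attention weights, then renormalise
    drop = np.argsort(A, axis=1)[:, :-top_k]
    np.put_along_axis(A, drop, 0.0, axis=1)
    A = A / A.sum(axis=1, keepdims=True)
    return A @ V + beta * H                            # eqs. (12)-(13): aggregation + residual

rng = np.random.default_rng(2)
H = rng.normal(size=(5, 3))
E = rng.normal(size=(5, 3))
H_sp = spatial_attention(H, E)
```

Zeroing all but the top-k weights before renormalising is one simple way to realise the sparsification step; the retained weights then form the explanatory attention graph discussed later.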
For each sample b and entity i, we compute ŷ_{b,i} = f_θ(H^(sp)_{b,i,:}), (14) where f_θ denotes a two-layer feed-forward network with a nonlinearity (ReLU) between layers. In matrix form, this can be written as Ŷ = MLP(H^(sp)) ∈ R^{B×N×F}, (15) yielding one output per node feature and sample based on the final spatio-temporal feature representation. 3.2 Training Objective",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 14,
    "total_chunks": 87,
    "char_count": 1056,
    "word_count": 164,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "302ab678-e7b1-4842-a748-18cd23c456a7",
    "text": "Each entity i may contain both continuous and Boolean features, and the loss aggregates reconstruction errors across these feature dimensions. This design allows heterogeneous variables to contribute appropriately to the training signal while preserving a unified node-level representation in the graph. For example, exogenous temporal features may be appended to node features.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 15,
    "total_chunks": 87,
    "char_count": 379,
    "word_count": 52,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "fed7839e-6386-4030-8240-4427612848a8",
    "text": "The model is trained in a semi-supervised setting, using only data assumed to represent normal system behaviour. 
The learning objective is to minimise the difference between the model's reconstructed feature values Ŷ_b and the observed values Y_b for each batch element b ∈ {1, . . . , B}. Because the dataset may include both continuous-valued features and Boolean/indicator features, we employ a composite loss function, MixedLoss, which combines a mean-squared error (MSE) term for continuous features and a binary cross-entropy (BCE) term for Boolean features. Let C denote the indices of continuous features and B the indices of Boolean features. The training loss for a single window is L_mixed = γ_cont · (1/|C|) Σ_{(i,f)∈C} (Ŷ_{b,i,f} − Y_{b,i,f})² + γ_bool · (1/|B|) Σ_{(i,f)∈B} BCE(Ŷ_{b,i,f}, Y_{b,i,f}), (16) where γ_cont and γ_bool weight the relative influence of continuous and Boolean feature types. MixedLoss ensures that each feature type contributes appropriately to the learning signal. At inference time, we use the same MixedLoss formulation both for the scalar anomaly score and for per-entity explanations, ensuring that the detection objective is aligned with the training objective. For each sliding window w, we compute feature-wise reconstruction errors and aggregate them into a per-entity MixedLoss contribution. For a continuous feature f ∈ C of entity i, the reconstruction error is defined as e_{w,i,f} = (Ŷ_{w,i,f} − Y_{w,i,f})², and for a Boolean feature f ∈ B of entity i, we define e_{w,i,f} = BCE(Ŷ_{w,i,f}, Y_{w,i,f}). Each e_{w,i,f} ≥ 0 therefore represents the MixedLoss error contribution of feature f of entity i for window w. The per-entity reconstruction error is obtained by aggregating feature-wise errors using the same weighting scheme as in training: e_{w,i} = γ_cont · (1/|C_i|) Σ_{f∈C_i} e_{w,i,f} + γ_bool · (1/|B_i|) Σ_{f∈B_i} e_{w,i,f}, where C_i and B_i denote the sets of continuous and Boolean features associated with entity i, respectively. 
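A minimal stdlib sketch of the MixedLoss idea in Eq. (16), with the index sets, weights, and toy values below chosen purely for illustration (they are not from the paper):

```python
import math

def bce(p, y, eps=1e-7):
    # binary cross-entropy for a single probability/label pair
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def mixed_loss(Y_hat, Y, cont, boolean, g_cont=1.0, g_bool=1.0):
    """Eq. (16) sketch: cont and boolean are lists of (entity, feature) index
    pairs; Y_hat and Y are nested lists indexed as Y[i][f]."""
    mse = sum((Y_hat[i][f] - Y[i][f]) ** 2 for i, f in cont) / len(cont)
    xent = sum(bce(Y_hat[i][f], Y[i][f]) for i, f in boolean) / len(boolean)
    return g_cont * mse + g_bool * xent

# one entity with one continuous feature (index 0) and one Boolean feature (index 1)
Y_hat = [[0.9, 0.5]]
Y     = [[1.0, 1.0]]
loss = mixed_loss(Y_hat, Y, cont=[(0, 0)], boolean=[(0, 1)])
```

The same per-feature error terms, regrouped by entity instead of summed over the whole window, give the per-entity scores e_{w,i} used for explanations.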
The model can therefore be used either by aggregating the errors per node or by detecting anomalies at the node-feature level.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 16,
    "total_chunks": 87,
    "char_count": 2076,
    "word_count": 333,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "4b2606de-43eb-4f45-b369-acf1ff8b589f",
    "text": "An overall anomaly score for the window is finally obtained by averaging the per-entity losses: s_w = (1/N) Σ_{i=1}^{N} e_{w,i}. Higher values of s_w reflect greater deviation from the behaviour learned during training.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 17,
    "total_chunks": 87,
    "char_count": 199,
    "word_count": 33,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f912002b-b531-45c6-8024-985dcf01c08d",
    "text": "3.4 Graph Explanations During inference, the STA-GNN produces two complementary graph structures, the contextual similarity graph G_cs and the attention graph G_a. In both representations, the nodes correspond to entities, whereas the dynamically evolving edges encode relationships between them. G_cs captures relations between the learned temporal embeddings, reflecting how similar the recent temporal dynamics of different entities are within a given observation window. 
In\ncontrast, the Ga represents directed inter-entity dependencies, where edge weights encode the magnitude and direction\nof the learned correlations, that is, how information is propagated between entities in the latent space. Fig. 2 illustrates an example of the model's outputs during anomaly detection. When an anomaly is detected, both\ngraphs are visualised to highlight the underlying relational patterns. The nodes that are considered anomalous are\nplotted with distinct colours, while the rest are kept in the background as grey. For interpretability, only the top five\nedges with the highest similarity per node are retained in Gcs, ensuring a sparse and readable structure. For Ga, the edges\nare filtered to include only those that originate or end at anomalous nodes. The number of edges is restricted by the\ntop-k attention weights.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 18,
    "total_chunks": 87,
    "char_count": 1309,
    "word_count": 191,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7886d8e6-83cc-422a-86c3-1cfbe40b9fae",
    "text": "One of the main metrics to evaluate the performance of our model in detecting anomalies is the false positive rate (FPR),\ndefined as follows:\nFPR = FP / (FP + TN), (17)\nwhere false positive (FP) is the number of incorrect predictions or false alarms, and true negative (TN) is the number of\ncorrect predictions of no alarms. In the model evaluation, the emphasis is on minimising the FPR, i.e., avoiding false\nalarms, while still maintaining adequate anomaly detection performance. 
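The FPR of Eq. (17), together with the standard F1 score the evaluation also relies on, reduces to a few lines of code (an illustrative sketch; the function names are ours, not the authors'):

```python
def false_positive_rate(pred, truth):
    """FPR = FP / (FP + TN), Eq. (17): the fraction of normal windows
    that are incorrectly flagged as anomalous."""
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    return fp / (fp + tn) if (fp + tn) else 0.0

def f1_score(pred, truth):
    """F1 = 2 TP / (2 TP + FP + FN), the harmonic mean of precision and recall."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```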
Furthermore, we summarise the detection\nquality by using the F1 score and evaluate two thresholding strategies: (i) a threshold that maximises the F1 score\non validation data and (ii) a conformal-thresholding scheme based on nonconformity scores. The F1 score combines\nprecision and recall into a single harmonic-mean metric. Given the number of true positives (TP), false positives (FP)\nand false negatives (FN), the F1 score is defined as\nF1 = 2 · precision · recall / (precision + recall) = 2 TP / (2 TP + FP + FN), (18)\nwhere precision = TP/(TP + FP) and recall = TP/(TP + FN). We first compute anomaly scores sw for each\nwindow w and choose a threshold that maximises the F1 score on the labeled evaluation set. This provides an\nunsupervised operating point that balances missed anomalies and false alarms. To explicitly control false alarms in a more distribution-free and sequential setting, we also use an inductive nonconformity scoring scheme [41]. Let s1, . . . , sT denote the anomaly scores on a set of calibration windows assumed to be\nnormal. We define difference nonconformity scores c with\nc1 = 0, (19)\nct = max(0, st − st−1), t = 2, . . . , T, (20)\nAttack detected and contribution highlighted from red (highest) to yellow (lowest).",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 19,
    "total_chunks": 87,
    "char_count": 1735,
    "word_count": 300,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d8adec50-77a1-4723-a8f5-1ab909445fd6",
    "text": "The grey edges\nrepresent the learned embeddings + prior graph structure. The red edges come from the spatial attention. 
Only the\nstrongest attention weights from/to anomalous nodes are plotted for interpretability. Red edge thickness reflects the\nstrength of the attention. The graph nodes are organised and fixed by process stages in the SWaT testbed dataset used in\nthis study.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 20,
    "total_chunks": 87,
    "char_count": 374,
    "word_count": 58,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a865b877-0709-4d15-b3a3-c9f7ccef07e0",
    "text": "which emphasise sudden increases in the anomaly score and are less sensitive to slow shifts in values. Given a significance\nlevel α, we then choose a threshold qα as an upper quantile of the calibration scores, i.e.\nqα = Quantile1−α(c1, . . . , cT ), (21)\nand at evaluation time declare window t anomalous if ct ≥ qα. The benefit of the conformal approach is twofold: it\nautomatically adapts to the empirical score distribution and, under standard exchangeability assumptions, provides\nfinite-sample guarantees that the probability of a false alarm does not exceed approximately α. In our experiments, we\nchoose a heuristic value α = 10−3, which yields a low false positive rate while still allowing the model to react to\npronounced score increases. For example, with data sampled in 10-second intervals, this threshold corresponds to an\nexpected false alarm roughly once every three hours under nominal conditions. Another advantage of the approach is",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. 
Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 21,
    "total_chunks": 87,
    "char_count": 950,
    "word_count": 153,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "99315cdd-409d-4f26-be1b-610474070763",
    "text": "Table 1: Overview of the SWaT datasets used in this study across physical and network modalities. Measurements from\nphysical sensors and network traffic were aggregated and resampled to 10-second intervals.\nSWaT Dataset | Modality | Nodes | #Features | #Instances | Duration | #Attacks\n2015 | Physical | 51 | 1 | ∼95 000 | 7 d normal + 4 d attack | 41\n2015 | NetFlow | 9 | 11 | ∼95 000 | 7 d normal + 4 d attack | 41\n2015 | NetFlow+Payload | 9 | 14 | ∼95 000 | 7 d normal + 4 d attack | 41\n2017 | Physical | 51 | 1 | ∼49 000 | 6 d normal | 0\n2017 | NetFlow | 9 | 11 | ∼17 000 | 2 d normal | 0\n2017 | NetFlow+Payload | 9 | 14 | ∼17 000 | 2 d normal | 0\n2019 Jul | Physical | 51 | 1 | ∼1 500 | 4 h attack | 6\n2019 Dec | Physical | 51 | 1 | ∼1 300 | 4 h attack | 5\n2019 Dec | NetFlow | 9 | 11 | ∼1 300 | 4 h attack | 5\n2019 Dec | NetFlow+Payload | 9 | 14 | ∼1 300 | 4 h attack | 10",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 22,
    "total_chunks": 87,
    "char_count": 736,
    "word_count": 152,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c5c3bbfb-55fa-4aca-a73e-b1f1db012674",
    "text": "that the threshold qα is fixed by the distribution of the calibration scores. As a result, if the typical scoring behavior\nof the system starts to change and the evaluation scores consistently exceed their calibration levels, the number of\nthreshold exceedances will gradually increase. 
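The difference nonconformity scores and the conformal threshold qα of Eqs. (19)-(21) can be sketched as follows (illustrative code under our own function names, not the authors' implementation):

```python
import numpy as np

def difference_scores(s):
    """Difference nonconformity scores (Eqs. 19-20):
    c_1 = 0 and c_t = max(0, s_t - s_{t-1}) for t >= 2."""
    s = np.asarray(s, dtype=float)
    return np.concatenate(([0.0], np.maximum(np.diff(s), 0.0)))

def conformal_threshold(calibration_scores, alpha=1e-3):
    """Upper (1 - alpha) empirical quantile of the calibration
    nonconformity scores (Eq. 21)."""
    return np.quantile(difference_scores(calibration_scores), 1.0 - alpha)

def flag_anomalies(eval_scores, q_alpha):
    """Declare window t anomalous when c_t >= q_alpha."""
    return difference_scores(eval_scores) >= q_alpha
```

Because the threshold depends only on score increments, slow drifts raise few alarms while sudden jumps are flagged immediately.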
This behaviour is a clear indication of covariate drift, signaling that the\nmodel may no longer be well suited to the altered environment. Conventional performance metrics, such as F1-score or\naccuracy cannot reveal such changes in the underlying data distribution. For a detailed description of the conformal\nprediction framework, see [42].", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 23, + "total_chunks": 87, + "char_count": 628, + "word_count": 95, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c6e23474-cf07-45bf-9bdc-d01350f84253", + "text": "The Secure Water Treatment (SWaT) testbed is one of the most widely used benchmarking datasets available for\nresearch on ICS security. It represents a scaled-down, fully operational water purification plant designed to reproduce\nthe behavior, equipment interactions, and cyber-physical processes found in real facilities. The system produces\napproximately five gallons of treated water per minute and operates in six sequential process stages, each equipped with\na range of sensors, such as level transmitters, pressure gauges, and water-quality probes, as well as actuators including\npumps and motorised valves. The sensors and actuator names, and further detailed description of the environment, are\nprovided in [34]. In illustrative Figures 2, 5 and 6, we have arranged the process stages horizontally, from left, process\nstage 1, to stage 6, right.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. 
Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 24, + "total_chunks": 87, + "char_count": 852, + "word_count": 126, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d7491bab-157d-4817-846a-454da96953b6", + "text": "The SWaT datasets provide both process-layer (physical-level) measurements obtained from the SCADA/PLC level\nand detailed OT network traffic, including partial CIP protocol payloads. Communication between PLCs, sensors,\nactuators, and the supervisory SCADA layer is extensively logged, enabling the simultaneous analysis of physical\nprocess behavior and network activity. This multimodal perspective is crucial as previous work has shown that effective\nanomaly detection requires both physical measurements and communication patterns, since attacks may affect only a\nsingle modality or manifest across both [43]. The 2015 SWaT dataset includes a long period of normal operation followed by a series of 41 controlled cyberattacks,\ntargeting communication links and manipulating one or multiple process stages. These attacks range from stealthy\nmodifications to aggressive actuator manipulation, making the 2015 SWaT dataset a challenging and realistic benchmark. The rest of the selected SWaT datasets used in our study are provided in Table 1. 4.1 Data Pre-Processing & Model Training", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. 
Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 25,
    "total_chunks": 87,
    "char_count": 1084,
    "word_count": 152,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f9a6912e-adc7-48da-9878-565d337cab2c",
    "text": "For the physical-level data, all continuous sensor values were treated as floating-point variables,\nwhile discrete control states (e.g., on, off, auto) were one-hot encoded. Continuous features were scaled using min–max\nnormalization, defined as\nx′ = (x − xmin) / (xmax − xmin), (22)\nwhere xmin and xmax are the minimum and maximum values from the training data, and x is the value to normalise. The\nevaluation dataset was normalised using these training-set parameters. In physical-level datasets, each node corresponds to a\nsingle sensor or actuator signal, and no additional node-level features were introduced.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 26,
    "total_chunks": 87,
    "char_count": 625,
    "word_count": 96,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "43374aa9-1dbc-488f-9165-18afa2fd93f9",
    "text": "For the Netflow data, an explicit design choice was required to define the node entities. We chose\nthe set of IP addresses observed in the traffic as entities. More precisely, we selected PLC-, SCADA point-, and\nworkstation IP addresses as individual nodes, based on prior system knowledge. All remaining traffic was aggregated\ninto a single auxiliary node labeled Other IP. 
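The train-only min–max normalisation of Eq. (22) can be sketched as follows (illustrative code; the helper names are ours, and a small eps guards against constant features):

```python
import numpy as np

def fit_minmax(train):
    """Compute per-feature normalisation parameters on the training split only."""
    return train.min(axis=0), train.max(axis=0)

def apply_minmax(x, x_min, x_max, eps=1e-12):
    """x' = (x - x_min) / (x_max - x_min), Eq. (22). Evaluation data reuses
    the training-set parameters, so values may fall outside [0, 1]."""
    return (x - x_min) / (x_max - x_min + eps)
```

Fitting on the training split alone avoids leaking statistics from the calibration and test periods.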
We extracted the features of the standard NetFlow protocol, including the source port, source IP, destination IP, transport\nprotocol, and frame length. We restricted the feature set to flow-level metadata, as packet payloads are often encrypted\nand therefore unavailable. Moreover, flow-based representations significantly reduce computational costs compared to\ndeep packet inspection [44]. From these base features, we derive per-node features. These include, for example,\nShannon entropy, defined as\nHsrc = − Σ_{i=1}^{k} pi logb(pi), (23)\nwhere k denotes the number of distinct source ports observed within an aggregation window, and pi = ni/N is the\nempirical probability of source port i, with ni occurrences out of N total flows. The rest of the derived features are\npresented in Table 2. We note that this is just an example, and other approaches for deriving features exist. Table 2: Aggregated node-level features for the NetFlow and NetFlow+Payload data models. All features are sampled\nat 10-second intervals.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 27,
    "total_chunks": 87,
    "char_count": 1392,
    "word_count": 218,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "555e6ad6-8ce7-4e95-9b75-70a4b47a1df1",
    "text": "NetFlow features\nRows sent / received Number of flow records sent and received\nBytes sent / received Total number of bytes sent and received from frame length\nSource port entropy Entropy of observed source ports\nProtocol entropy Entropy of observed protocols\n#Sources / #Destinations Number of distinct source and destination peers\nNetFlow + Payload features\nCIP byte entropy Shannon entropy of the CIP payload bytes. 
For example, typical\nmessage could be 10x4 bytes. CIP value mean Mean of extracted CIP numeric values per message. CIP word entropy Shannon Entropy of parsed CIP fields. For example, a message\nwith 10x4 bytes would have 10 \"words\".", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 28, + "total_chunks": 87, + "char_count": 649, + "word_count": 104, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "29caf05c-7c13-443e-b8d6-b93db5607bdd", + "text": "Exogenous features\nDay of week Weekday indicator\nHour of day Hour indicator\nHour of week Hour indicator Because the NetFlow representation transitions from a single scalar feature to a multi-channel feature vector, we\nadditionally included exogenous temporal features. These include hour of day, hour of week, and day of week, which\nare commonly used in time-series modeling to capture diurnal and weekly periodicities. Such features can improve\nmodel confidence and stability, see, e.g, [45].", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 29, + "total_chunks": 87, + "char_count": 493, + "word_count": 74, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0e06b1ac-429a-45b9-b9dd-96d7843294fe", + "text": "NetFlow + Payload Dataset. 
The 2015 dataset does not provide raw PCAP files for extensive payload extraction;\ninstead, it includes NetFlow records augmented with CIP protocol attributes, more precisely, the encapsulated CIP\nmessages [46]. For the 2017 and 2019 Dec datasets, we deliberately retained the same base feature set, even though\nricher payload feature engineering would have been possible. This choice ensures comparability of model performance\nacross all datasets. In the NetFlow+Payload setting, we used the same flow-level feature channels as in the NetFlow-only\ncase and augmented them with payload-derived statistics. These include payload entropies from message and word-level,\nand payload mean of CIP extracted data.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 30, + "total_chunks": 87, + "char_count": 733, + "word_count": 106, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7f76a82b-b4b7-43cf-a87f-4895799ca426", + "text": "Training, Calibration, and Sampling. As the proposed evaluation method relies on conformal prediction, the dataset\nwas split into training, calibration, and test sets using a temporal split of 80/10/10. Feature normalization parameters\nwere computed exclusively in the training set and subsequently applied to the calibration and test sets. Data shuffling\nwas not used because it could allow information leakage from future observations.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. 
Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 31, + "total_chunks": 87, + "char_count": 437, + "word_count": 62, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "839f329d-9ebb-400f-92cc-617dddb0dd73", + "text": "Similarly, subsampling and folding A PREPRINT - MARCH 12, 2026 Table 3: Hyperparameter search space and functional roles for the proposed graph-temporal neural network model. Some common hyperparameters, e.g., learning rate, are omitted. Common Hyperparameters Value(s) Description\nEmbedding dimension 128 Dimensionality of latent node and temporal representations, controlling overall model capacity and\nattention head size. Graph attention heads 4 Number of parallel subspaces used in the multihead graph attention mechanism. Top-k neighbors 6 Maximum number of neighboring nodes attended\nto per node, controlling graph sparsity and computational cost. Weight decay 10−4 L2 regularization strength applied to model parameters during optimization. Learnable Hyperparameters Static prior scale 10 Weight of the static graph similarity prior relative\nto the dynamic context-based similarity. With this\nparameter, the importance of prior graph can be\ncontrolled by initializing it. Attention temperature 0.9 Scaling factor controlling the sharpness of the\ngraph attention distribution. techniques were avoided.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. 
Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 32, + "total_chunks": 87, + "char_count": 1108, + "word_count": 149, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "efe20b17-811d-4162-aec4-24d96d2932db", + "text": "We did, however, sample the data for 10 second aggregates, as we have observed many\nrelated works have done the same. During model training, we experimented with different learning rates, embedding\ndimensions, and time window sizes. We observed no improvement in training or evaluation loss when using embedding\ndimensions greater than 128 or window sizes greater than 6.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 33, + "total_chunks": 87, + "char_count": 371, + "word_count": 58, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "cdcfa562-4312-4b2a-8ea6-4436cc95307b", + "text": "Therefore, we opted to keep the complexity of the model\nto a minimum. The rest of the tunable hyperparameters are shown in Table 3. In this section, we evaluate the performance of our model in comparison with alternative machine learning approaches. We also analyse strategies for selecting the optimal detection threshold and, through illustrative examples, demonstrate\nhow detected anomalies and graph representations reveal the underlying causal relationships. The complete table of\nresults and analysis is provided in the Appendix.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. 
Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 34, + "total_chunks": 87, + "char_count": 535, + "word_count": 79, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "137627d9-22e3-4524-9e80-f2de7076e6fd", + "text": "5.1 Best-Performing Model We begin by analyzing the performance of the model in all three data modalities. The proposed STA-GNN is compared\nagainst several simpler models in terms of F1-score, FPR, and the number of detected attacks, thereby justifying the\nmodel complexity and architectural choices. The results are summarised in Table 4. As an initial model selection\nstrategy, we applied the maximization of the F1-score to determine decision thresholds for the trained models, which\nis a common practice in ADS machine learning. The models for comparison include two classical machine-learning\nmethods (K-means and Support Vector Machine (SVM)) and a more advanced, an auto-regressive, LSTM-based\nVariational Autoencoder (LSTM-VAE). The classical methods were not evaluated for the NetFlow modalities due to\ntheir poor performance already in the scalar physical-level model.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. 
Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 35, + "total_chunks": 87, + "char_count": 878, + "word_count": 128, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0a2285a6-3fb6-4fa2-b690-1f2cecb0ea5d", + "text": "For the proposed STA-GNN approach, we evaluated\ntwo configurations: a simplified variant using only a gated recurrent unit (GRU) without embeddings and temporal\nattention (STA-GNN*), and the full model incorporating both temporal and spatial attention mechanisms (STA-GNN). Physical-level models with only one scalar feature per node provided highest F1-score for our models. The two classical\nmachine learning approaches, K-means and SVM, did not produce meaningful results, in accordance with the results\nin [39]. The LSTM-VAE, despite its relatively simple autoregressive structure, achieved an F1-score close to that of the\nbest-performing models. However, a closer inspection of detected attacks shows that its performance is misleading: the\nmodel successfully detects only two attacks. The inflated F1-score is explained by the fact that there is an attack that\naccounts for more than 40% of the attack data points. Any model capable of detecting this attack significantly improves\nthe model F1 score. This observation highlights that strict reliance on F1-score maximization is inadequate to evaluate\nanomaly detection models in this context.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. 
Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 36,
    "total_chunks": 87,
    "char_count": 1149,
    "word_count": 167,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "004be0d9-65f4-49b1-a079-04810a02cfec",
    "text": "Table 4: Model comparison across physical and network modalities. F1-score, false positive rate (FPR), and attacks\ndetected (AD) are reported for each modality. The classical models with high AD suffer from high FPR, which makes\nthem impractical for realistic deployment scenarios. The STA-GNN* refers to a simplified variant of STA-GNN without\ntemporal encoding or the temporal attention component.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 37,
    "total_chunks": 87,
    "char_count": 427,
    "word_count": 64,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f20b7d16-e4f3-4a62-9158-cbc85563b97d",
    "text": "The best-performing models according to F1-score and AD\nare highlighted in bold.\nDataset | Modality | K-means F1/FPR/AD | SVM F1/FPR/AD | LSTM-VAE F1/FPR/AD | STA-GNN* F1/FPR/AD | STA-GNN F1/FPR/AD\nSWaT 2015 | Physical-level | 0.29/0.829/26 | 0.24/0.860/33 | 0.72/0.001/2 | 0.74/0.002/11 | 0.77/0.004/15\nSWaT 2015 | NetFlow | – | – | 0.23/0.83/35 | 0.19/0.88/35 | 0.19/0.89/36\nSWaT 2015 | NetFlow+Payload | – | – | 0.72/0.003/2 | 0.74/0.003/11 | 0.74/0.006/16\nNetFlow-models without CIP-payload data were not able to reliably detect attacks in any of the studied cases, as they\nproduced excessive false positives, rendering them impractical for deployment. 
This behavior is likely due to the\nnoisy and low-semantic structure of flow-level data, as NetFlow summarizes traffic using only coarse statistical\naggregates. In contrast, incorporating payload information substantially improved performance, as evidenced by\nthe NetFlow+Payload model achieving detection capabilities comparable to the physical-level model. Although the\nphysical-level model produced the lowest false positive rates overall, the NetFlow+Payload configuration detected the\nlargest number of attacks.\n5.2 Nonconformity Scoring and Thresholding\nTable 4 demonstrates that the threshold selection strategy plays a critical role in practical model performance. Although\nmaximizing the F1-score reduces the FPR, further improvements are possible. By applying difference-based nonconformity scoring, we significantly reduce false positives while, quite surprisingly, also increasing the number of\ndetected attacks.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 38,
    "total_chunks": 87,
    "char_count": 1538,
    "word_count": 232,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b69ea100-cd1b-40b2-9015-f6488be5f4ce",
    "text": "The FPR can be treated as a user-defined parameter and set to a desired level through the calibration\nscores. For the six-day baseline, it was not feasible to enforce guarantees below an FPR of 0.001, as the calibration set\nis too small and the assumption of exchangeability degrades at more extreme thresholds. 
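The relation between the significance level α and the expected false-alarm rate is simple arithmetic; the following back-of-the-envelope sketch (our own helper, not from the paper's code) reproduces the "roughly once every three hours" figure quoted earlier for α = 10−3 and 10-second sampling:

```python
def expected_false_alarm_interval_s(alpha, sample_period_s):
    """Under exchangeability, each window exceeds the conformal threshold
    with probability at most ~alpha, so false alarms are expected roughly
    once every sample_period / alpha seconds."""
    return sample_period_s / alpha

# With alpha = 1e-3 and 10-second windows: one alarm per ~10 000 s,
# i.e. a little under three hours under nominal conditions.
interval_h = expected_false_alarm_interval_s(1e-3, 10.0) / 3600.0
```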
Breaking the exchangeability,\nin turn, leads to poor attack detection performance.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 39,
    "total_chunks": 87,
    "char_count": 394,
    "word_count": 63,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d8aee5de-1473-4022-9306-c61fd9c35610",
    "text": "Longer and more stable baseline periods would enable stronger\nguarantees and better align with operational requirements. For example, [47] note that even a single false positive every\nsix months can be considered excessive in industrial deployments. This rate corresponds to an FPR on the order of\n10−6, approximately three orders of magnitude lower than the achievable thresholds in our setting.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 40,
    "total_chunks": 87,
    "char_count": 395,
    "word_count": 59,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a28c8cf0-0260-4365-b381-ca2df8167512",
    "text": "Table 5: Evaluation results of the STA-GNN model under two thresholding strategies: F1-maximization and difference\nnonconformity scoring. Choosing the latter gives the highest AD, but with a very low F1-score. We emphasise that F1 does\nnot always reflect the desired performance of the model. 
Dataset | Modality | F1max / FPRmax / AD (F1-max threshold) | F1conf / FPRconf / AD (conformal calibration threshold)\nSWaT 2015 | Physical-level | 0.77 / 0.004 / 15 | 0.03 / 0.001 / 20\nSWaT 2015 | Netflow | 0.22 / 0.881 / 36 | 0.01 / 0.001 / 9\nSWaT 2015 | Netflow+Payload | 0.74 / 0.006 / 16 | 0.02 / 0.001 / 22\nDifference-based conformal thresholding also allows the model to adapt to different phases of an attack. Once an alarm\nis raised, subsequent observations within the same attack episode do not trigger repeated alerts. Although absolute\nnonconformity scores may remain above the threshold, their relative changes do not, effectively suppressing redundant\nalarms. This behavior provides the additional qualitative insight that short, transient attacks tend to trigger a single alert\nand have little effect on the system, whereas prolonged or system-wide cascade failures continue to generate alarms,\nreflecting their severity and the urgency of response. This is demonstrated in Fig. 3a, where no cascade failure occurs. The attack has no effect on the system and remains a point source. On the other hand, in Fig. 3b, an attack on a sensor\ntriggers alarms throughout the system during the attack, suggesting a cascade failure. We also note that the true source of\nthe attack is often not detected in cascade failures: in Fig. 3b, the attack against DPIT301 is detected seven minutes\nafter the attack started because the reconstruction errors of other devices dominate and trigger the alarm elsewhere. Finally, while difference-based nonconformity scoring reduces false alarms through strict FPR guarantees, it also leads\nto low F1-scores when evaluated under conformal thresholds.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. 
Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 41, + "total_chunks": 87, + "char_count": 1882, + "word_count": 281, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "78cbdd2f-73be-473d-aef2-2d2f489be41f", + "text": "Indeed, the resulting F1-scores fall below 0.04 in all A PREPRINT - MARCH 12, 2026 cases, as shown in Table 5. Nevertheless, the model remains highly effective at detecting attacks when the decision\nthreshold is set with a conformal evaluation strategy.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 42, + "total_chunks": 87, + "char_count": 253, + "word_count": 41, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "bbe8376e-42fc-4939-99c0-b9202421a5e8", + "text": "(a) Attack to AIT202. (b) Attack to DPIT301. Figure 3: Comparison of normalised sensor response windows (shaded red) during the attack window (shaded blue and\nseparated with blue dashed line). The attack on left was detected only once in the beginning of the attack. The attack on\nright was detected multiple times during attack, from various sensors and actuators (a cascade failure). For clarity, we\nonly show top 3 anomalous sensors per detected anomaly.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. 
Kaski"
+ ],
+ "published_date": "2026-03-11",
+ "primary_category": "",
+ "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
+ "chunk_index": 43,
+ "total_chunks": 87,
+ "char_count": 457,
+ "word_count": 74,
+ "chunking_strategy": "semantic"
+ },
+ {
+ "chunk_id": "6f69f0da-0443-4357-b8ed-00589cdb8757",
+ "text": "5.3 Model Performance Across Datasets The conformal framework enables explicit control over the FPR, providing monitoring of the model\nperformance over time. Gradual increases in the FPR can serve as indicators of degraded model performance or baseline\ndrift, and this phenomenon is clearly observed in our experiments. As shown in Fig. 4, the model trained on the 2015\ndataset exhibits a sharp performance decline when applied to data from later years. Already in the 2017 dataset, the FPR\nincreases on the order of 10−2, corresponding to approximately 3–4 alarms per hour, which would be impractical for\nreal-world deployment. This behavior suggests a baseline drift, which aligns with findings, for example, in [48]. The results\nindicate that the model is highly sensitive to even minor shifts in individual sensor signals. Our model repeatedly alerts\nfrom sensor AIT201 and a few other sensors. Although we did not investigate the signals in detail, we can confidently say\nthat there is a drift as the same sources repeatedly alert.",
+ "paper_id": "2603.10676",
+ "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
+ "authors": [
+ "Kosti Koistinen",
+ "Kirsi Hellsten",
+ "Joni Herttuainen",
+ "Kimmo K. 
Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 44, + "total_chunks": 87, + "char_count": 1029, + "word_count": 163, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "702d5a72-612b-4bac-bde8-e9fe49cf6ce6", + "text": "To further investigate this temporal degradation, we retrained a separate model using the 2017 dataset as a baseline and\ncompared it to the 2019 July and December datasets. In this setting, the physical-point model again fails. However, it\nnow holds the FPR guarantee but is not able to detect attacks effectively. We do not detect similar shift of the sensors\nthat we detected earlier with 2015 model. Yet another advantage of non-conformal scoring scheme, a topic not covered so far, is the possibility to deal with the\nbaseline drift via recalibrating the scores. The drift occurs because of various reasons, e.g., wearing of the equipment,\nvariations in environmental conditions, sensor aging or recalibration of the equipment. The drift has been observed in\nSWaT datasets and reported, for example, in [49].", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 45, + "total_chunks": 87, + "char_count": 812, + "word_count": 130, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b6bcfdab-a251-446b-94a2-97d66fb99a3c", + "text": "Recalibration of conformal scores can adjust the decision threshold\nand prolong the performance of the model, without requiring extensive retraining of the model. Thus, we recalibrated\nthe 2015 model with 2017 data. 
This time, the model retains its FPR for 2019 datasets, but unfortunately, could not\nretain its anomaly detection capability in this case either.",
+ "paper_id": "2603.10676",
+ "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
+ "authors": [
+ "Kosti Koistinen",
+ "Kirsi Hellsten",
+ "Joni Herttuainen",
+ "Kimmo K. Kaski"
+ ],
+ "published_date": "2026-03-11",
+ "primary_category": "",
+ "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
+ "chunk_index": 46,
+ "total_chunks": 87,
+ "char_count": 363,
+ "word_count": 55,
+ "chunking_strategy": "semantic"
+ },
+ {
+ "chunk_id": "d8eff64b-fbde-4e78-81e5-068d56fcf0cb",
+ "text": "The inefficiency of recalibration is indirect evidence of\nanother type of drifting, i.e., concept drift. Unlike covariate drift, concept drift is a result of change in the testbed\nconfiguration. In formal terms, the problem space changes. In covariate drift, the input space changes, which can be\ndealt with by recalibration of the model. For example, changes of data processing pipelines, alterations in operating or\nusage patterns cause concept drift. In [49], the authors further speculate that this could be the case between the 2015\nand 2019 SWaT datasets. We followed the same evaluation and adaptation strategy for Netflow+Payload modality.",
+ "paper_id": "2603.10676",
+ "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
+ "authors": [
+ "Kosti Koistinen",
+ "Kirsi Hellsten",
+ "Joni Herttuainen",
+ "Kimmo K. Kaski"
+ ],
+ "published_date": "2026-03-11",
+ "primary_category": "",
+ "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
+ "chunk_index": 47,
+ "total_chunks": 87,
+ "char_count": 644,
+ "word_count": 98,
+ "chunking_strategy": "semantic"
+ },
+ {
+ "chunk_id": "93289b03-9749-4060-9c82-217698db21b9",
+ "text": "Although\nrecalibration and retraining hold the low FPR guarantees, the models detect only 2 out of 6 attacks. 
This outcome is\nexpected, as even in the original 2015 dataset only approximately half of the attacks were detected. The Dec 2019 dataset\nis thus a poor indicator of model performance because it contains only a few attacks.",
+ "paper_id": "2603.10676",
+ "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
+ "authors": [
+ "Kosti Koistinen",
+ "Kirsi Hellsten",
+ "Joni Herttuainen",
+ "Kimmo K. Kaski"
+ ],
+ "published_date": "2026-03-11",
+ "primary_category": "",
+ "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
+ "chunk_index": 48,
+ "total_chunks": 87,
+ "char_count": 333,
+ "word_count": 56,
+ "chunking_strategy": "semantic"
+ },
+ {
+ "chunk_id": "75582f32-6fdf-4bef-834c-eb11e8da1e53",
+ "text": "Furthermore, while recalibration proves sufficient to maintain the FPR guarantees, the squared prediction errors per node and per time step increase in\nthe 2019 Dec dataset. This ultimately renders the model impractical for long-term deployment. Consequently, as\nthe observed growth of prediction errors is evidence of concept rather than covariate drift, the retraining of the model\nremains the most reliable option for model deployment. Figure 4: FPR across datasets. Top: Model performance with (red) and without (blue) retraining. Bottom: Performance\nwith recalibration of the 2015 model using the 2017 baseline. The FPR can be controlled with recalibration, which is\noften more feasible than retraining the model.",
+ "paper_id": "2603.10676",
+ "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
+ "authors": [
+ "Kosti Koistinen",
+ "Kirsi Hellsten",
+ "Joni Herttuainen",
+ "Kimmo K. 
Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 49, + "total_chunks": 87, + "char_count": 753, + "word_count": 113, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "406b936c-0353-4395-a87f-2e30f2bb1cc4", + "text": "Another likely explanation of poor detection rate of attacks is the incompleteness of the original NetFlow data, which\nwe intentionally replicated during the preprocessing of the 2017 and 2019 PCAP files. The primary limitation in\nthis setting arises from the preprocessing and feature representation of the network traffic. More expressive feature\nengineering, such as incorporating write tags or richer descriptors of payload-level behavior, is likely to improve\ndetection performance, as demonstrated in [46]. However, a detailed investigation of optimal modeling and feature\ndesign in this context lies outside the scope of this work, which focuses on model endurance rather than benchmarking,\nand is therefore left for future research.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 50, + "total_chunks": 87, + "char_count": 740, + "word_count": 108, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b8f93ab1-b52b-43c7-b3bd-fd046ec40d9e", + "text": "5.4 The Attention Graphs and Explainability The final attention-weight graph Ga, together with the highest anomaly scores, enables the inspection of both the\nanomaly points and their correlations within the system. We examine the detected attacks and their associated attention\nweights and study how these correlations respond to causal relationships. 
We use the documented system architecture,\nknown causality maps provided in [50], and examples from [33] for qualitative analysis. In Table 6, we summarise the findings, with the analysis and rationale provided in the Appendix. Table 6: Summary of correct detection and causal inference performance for the SWaT 2015 dataset across Physical-level and NetFlow+Payload modalities. The pure NetFlow modality is excluded because it did not yield meaningful\nresults; detailed analysis is provided in the Appendix. Physical-level Netflow+Payload\nAlarms Raised Correct Alarms Raised Correct\nDetection Causality Detection Causality\n20 15 12 22 15 14",
+ "paper_id": "2603.10676",
+ "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
+ "authors": [
+ "Kosti Koistinen",
+ "Kirsi Hellsten",
+ "Joni Herttuainen",
+ "Kimmo K. Kaski"
+ ],
+ "published_date": "2026-03-11",
+ "primary_category": "",
+ "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
+ "chunk_index": 51,
+ "total_chunks": 87,
+ "char_count": 1020,
+ "word_count": 149,
+ "chunking_strategy": "semantic"
+ },
+ {
+ "chunk_id": "ab73c4b7-7bc3-4e0f-b2b1-d14b120dd3e9",
+ "text": "Among the alarms raised in the Physical-level and\nNetFlow+Payload modalities, approximately 68 to 75% of attacks were correctly detected and traced, while correct or\npartially correct causal relationships were identified in approximately 60 to 63% of the alerts raised. Figure 5: Attack on DPIT301 detected via anomalies in FIT601, with attention edges highlighting system-level\ndependencies between distant process stages. When an alarm is raised, the outcome can be interpreted in two ways: whether the detection localises the true source(s)\nor the immediately affected devices, and whether the edges of attention reflect the correct underlying causality. These\ninterpretations allow us to distinguish between correct detections and meaningful causal explanations. 
Causality might\nbe captured despite mislocalization, and vice versa. Furthermore, there are cases in which either the detection or the inferred\ncausal structure, or both, are incorrect. This analysis raises an important aspect in model evaluation: Attention graphs\nalso allow us to assess whether the model is functioning meaningfully. In highly imbalanced evaluation datasets,\ncontaining many attacks within a short time period, the model can simply raise alarms and occasionally \"guess\" correct results without having converged to a well-functioning representation. This could lead to a false sense of security that\nthe model is functioning correctly. For example, in the NetFlow modality, although nine attacks were detected, we\nobserved that the model recognised them by chance. Alarms were consistently raised on incorrect PLC devices and did\nnot produce meaningful attention edges. As an example of successful model performance, we use a known result that\nan attack on the backwash (DPIT301) causes malfunctioning of the pumps P601 and P602 [33]. This attack is detected\nby our model as an anomaly in the flow meter (FIT601) and is illustrated in Fig. 5 (the same attack as the one illustrated\nat the sensor level in Fig. 3b).",
+ "paper_id": "2603.10676",
+ "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
+ "authors": [
+ "Kosti Koistinen",
+ "Kirsi Hellsten",
+ "Joni Herttuainen",
+ "Kimmo K. Kaski"
+ ],
+ "published_date": "2026-03-11",
+ "primary_category": "",
+ "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
+ "chunk_index": 52,
+ "total_chunks": 87,
+ "char_count": 2015,
+ "word_count": 305,
+ "chunking_strategy": "semantic"
+ },
+ {
+ "chunk_id": "6032d049-9cc4-4eee-aa95-e430985a5b60",
+ "text": "The attention edges with the highest weights indeed capture the relationship between these\nstages, even though they are far apart in the system. 
For Netflow+Payload data, using feature channels and IP addresses as nodes yields the best results when combined\nwith CIP payload data.",
+ "paper_id": "2603.10676",
+ "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
+ "authors": [
+ "Kosti Koistinen",
+ "Kirsi Hellsten",
+ "Joni Herttuainen",
+ "Kimmo K. Kaski"
+ ],
+ "published_date": "2026-03-11",
+ "primary_category": "",
+ "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
+ "chunk_index": 53,
+ "total_chunks": 87,
+ "char_count": 282,
+ "word_count": 44,
+ "chunking_strategy": "semantic"
+ },
+ {
+ "chunk_id": "7b50abc8-9138-46a2-8480-883162d0a05b",
+ "text": "However, this configuration reduces interpretability and explainability. In fact, we can only trace\nalarms and attention edges to the IP-level, which is less informative than physical-level representations. We cannot\ndirectly identify which physical devices are attacked and we can only trace events back to the PLC-level.",
+ "paper_id": "2603.10676",
+ "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
+ "authors": [
+ "Kosti Koistinen",
+ "Kirsi Hellsten",
+ "Joni Herttuainen",
+ "Kimmo K. Kaski"
+ ],
+ "published_date": "2026-03-11",
+ "primary_category": "",
+ "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
+ "chunk_index": 54,
+ "total_chunks": 87,
+ "char_count": 322,
+ "word_count": 46,
+ "chunking_strategy": "semantic"
+ },
+ {
+ "chunk_id": "09bcf5ce-4f64-4fc1-8faa-76c87f8e723a",
+ "text": "Furthermore,\nthe attention edges are often not informative, as many of these relationships are already well-known a priori. However,\nmany real ICS environments are highly complex and may include hundreds of PLC devices and workstations. As the\nsystem size increases, this method becomes increasingly feasible and valuable. Finally, we show by comparison how prior knowledge of the system shapes the attention edges. 
The prior structure is\nderived from the adjacency graph presentation of the system, such that components within the same stage are considered\nfully connected. The connecting components are then linked to other processing stages, as deduced from the system\ndescription in [51]. This is only one example, and alternative prior graph constructions such as causal directed graphs\nhave been investigated, for example, in [50]. In Fig. 6a, without structural constraints, the inferred causal relationships in\nthe model can be dominated by noisy correlations. For example, pumps or valves that exhibit similar behavior are often\nconnected by attention edges, even if they are physically far apart in the system and no true causal connection exists. When an alarm is raised, edges connected to correlated but non-causal devices may reduce the practical usefulness of\nthe model. The resulting graph with a stronger prior is in Fig. 6b. The meaningless correlations are no longer present. We retained a simple prior graph for two reasons: (a) we are non-experts in the system domain and lack detailed\noperational expertise, and (b) we wanted to allow the model to learn the structure autonomously, rather than letting the\nprior dominate.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 55, + "total_chunks": 87, + "char_count": 1643, + "word_count": 255, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ffc97bbf-4a38-4595-9dc1-9f4f2156e816", + "text": "This approach enables the detection of long-range dependencies in an ad hoc manner, as illustrated\nin Fig. 5. 
Finally, we note that using strong prior knowledge of the system does not necessarily improve detection\naccuracy, as it may reduce long-range dependencies; however, it can enhance the explainability. This trade-off will be\nexplored in future work. In this section, we examine the methodological and practical issues encountered during our analysis and reflect on how\nour findings agree or deviate from previous work in the literature. Here, we focus on the limitations of commonly used\nevaluation schemes, the operational relevance of our results, and the broader challenges of applying machine learning\nin industrial cybersecurity. We will also critically assess our modeling choices, including the role of explainability,\narchitectural constraints, and multimodal inputs. These reflections will shed light on the limitations of our approach and\ndiscuss the directions in which future work should focus on to achieve reliable and deployable anomaly detection in\nreal-world systems. A central challenge in evaluating anomaly detection models for cyber-physical systems is that commonly reported\nmetrics, most notably the F1 score, do not always reflect the true operational value of the model. One reason is that the\nduration of an attack heavily influences the F1 score, but many anomaly detection models detect an attack only after\nit has begun to significantly affect the system. However, the early stages of an attack often cause negligible physical\ndeviation, which makes them difficult to detect. Penalising the model for not recognizing these weak initial signals\nresults in a lower F1 score even when the model performs exactly as required in practice, i.e. alerting when the system\ndeviates from normal behavior. This discrepancy leads to misleading comparisons in the literature, where the number\nof detected attacks is rarely reported. 
Our results in Table 5 underscore the problem that the F1 score might be very\nlow even though the model performs better than under the F1-maximizing strategy. The other aforementioned benefits of nonconformity\nscoring support using it both as a thresholding method and as a framework. Event-based F1 evaluation has been proposed, where each detected attack is flagged as a single positive instance.",
+ "paper_id": "2603.10676",
+ "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
+ "authors": [
+ "Kosti Koistinen",
+ "Kirsi Hellsten",
+ "Joni Herttuainen",
+ "Kimmo K. Kaski"
+ ],
+ "published_date": "2026-03-11",
+ "primary_category": "",
+ "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
+ "chunk_index": 56,
+ "total_chunks": 87,
+ "char_count": 2331,
+ "word_count": 356,
+ "chunking_strategy": "semantic"
+ },
+ {
+ "chunk_id": "8d88d7b9-7a78-4d61-9007-e533376d63eb",
+ "text": "This would make the model comparison more uniform. However, this does not necessarily make the F1 score more\nrepresentative of the performance of the model, as the imbalance due to the attack durations still biases the metric. A persistent issue is that long system-wide attacks can dominate the score. In the SWaT 2015 dataset, for example,\nattack ID 28 (see Appendix A) is relatively easy to detect because it targets the pump P302 and triggers a cascade\nfailure across the system. Correct detection of this single attack accounts for approximately 60% of all anomalous time. (a) No Prior Graph (b) Prior Graph Figure 6: Attack to AIT504 with and without soft prior graph. The soft prior helps filter the edges not related to\ncausality. The grey edges in the background are the contextual learned edges + static graph from temporal attention. Only spatial attention edges from anomalous nodes are retained for clarity. 
Note that the detected anomaly points are\nalso reduced, because the prior restricts the dynamical similarity.",
+ "paper_id": "2603.10676",
+ "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
+ "authors": [
+ "Kosti Koistinen",
+ "Kirsi Hellsten",
+ "Joni Herttuainen",
+ "Kimmo K. Kaski"
+ ],
+ "published_date": "2026-03-11",
+ "primary_category": "",
+ "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
+ "chunk_index": 57,
+ "total_chunks": 87,
+ "char_count": 1064,
+ "word_count": 175,
+ "chunking_strategy": "semantic"
+ },
+ {
+ "chunk_id": "649b3398-85cd-46b7-ae57-173291bb9385",
+ "text": "As a result, any model that identifies this event achieves a substantially inflated F1 score. We observed that\nLSTM-VAE detected only two attacks, yet its F1 score was almost comparable to that of our best-performing model. Moreover, in most model-design studies we reviewed, the number of detected attacks is often not explicitly reported. This limits the interpretability of benchmark comparisons, as we argue that, along with FPR, the ability to detect a\ndiverse set of distinct attacks is a critical factor in assessing practical model performance. High FPR is another key issue in anomaly detection. Frequent false alarms tend to impose a high operational burden,\nleading to alert fatigue and reducing operator trust in the system. Common practice is that a useful model is trained with\nthe lowest possible FPR, even at the expense of a lower detection rate of true anomalies. This is also a limitation of\nour model, in that we prefer to keep the FPR low and allow some attacks to remain undetected.",
+ "paper_id": "2603.10676",
+ "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
+ "authors": [
+ "Kosti Koistinen",
+ "Kirsi Hellsten",
+ "Joni Herttuainen",
+ "Kimmo K. 
Kaski"
+ ],
+ "published_date": "2026-03-11",
+ "primary_category": "",
+ "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
+ "chunk_index": 58,
+ "total_chunks": 87,
+ "char_count": 999,
+ "word_count": 166,
+ "chunking_strategy": "semantic"
+ },
+ {
+ "chunk_id": "cbb05824-cc5c-4b3f-b327-09c0ba081bd6",
+ "text": "Furthermore, manual\ninspection shows that a substantial portion of false positives are directly followed by attacks and related to them. Removing these attack-adjacent alerts from false positives reduced our FPR count by 40% in the physical-level model,\nleaving only a small number of genuinely spurious alarms. This is yet another indication that operational relevance is\nnot always captured by standard metrics. The issues discussed so far reflect a broader challenge in machine-learning-based cybersecurity research, in which many\npublished models are evaluated primarily under benchmark-oriented settings. The emphasis on marginal improvements",
+ "paper_id": "2603.10676",
+ "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
+ "authors": [
+ "Kosti Koistinen",
+ "Kirsi Hellsten",
+ "Joni Herttuainen",
+ "Kimmo K. Kaski"
+ ],
+ "published_date": "2026-03-11",
+ "primary_category": "",
+ "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
+ "chunk_index": 59,
+ "total_chunks": 87,
+ "char_count": 675,
+ "word_count": 96,
+ "chunking_strategy": "semantic"
+ },
+ {
+ "chunk_id": "9db2bfd6-8124-4da2-ac18-1dc988ae8a9a",
+ "text": "in recall and accuracy is often a consequence of ambiguous or inconsistent evaluation methodologies. As shown in [52],\ndata leakage, inappropriate sampling, model selection bias from cross-validation, and temporal snooping are widespread\npitfalls, particularly in time-series scenarios. Neglecting these issues can lead to overly optimistic performance\nestimates. 
Several methods we reviewed report near-perfect F1 scores of 1.0, and some machine-learning approaches\nclaim extremely high detection rates, e.g., those in [53,54]. We explicitly designed our pipeline to mitigate the risks,\nfor example, by ensuring that no temporal information from the test period is used during training or preprocessing. Although this conservative approach reduces performance on current benchmark datasets, it could yield more reliable\nestimates for unseen data, which is a critical requirement for deployment in real systems. Therefore, our focus has been\non qualitative and causal evaluation of the detected attacks, rather than reporting recall or accuracy.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 60, + "total_chunks": 87, + "char_count": 1045, + "word_count": 146, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fce6fb54-10cb-43d6-ab5e-ed3b859673cf", + "text": "In addition, we discuss a fundamental issue that is often neglected in many model approaches, namely the covariate\nand concept drifts. The gradual change in the statistical properties of data and changed configurations from time to\ntime cause the anomaly detection models to lose accuracy as the system behavior evolves [55]. We could tackle the\ncovariate shifts with a recalibration approach, but the concept drift always requires re-training of the model. This is an\nissue for most static machine learning models, where the problem space is unknown. We acknowledge this and admit\nthat the nonconformity scoring does not solve all the problems in dynamic environments but can extend the lifespan of\nthe model. 
We also note that for observing model performance over time, monitoring the FPR is an excellent tool,\nenabled by nonconformity scoring.",
+ "paper_id": "2603.10676",
+ "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
+ "authors": [
+ "Kosti Koistinen",
+ "Kirsi Hellsten",
+ "Joni Herttuainen",
+ "Kimmo K. Kaski"
+ ],
+ "published_date": "2026-03-11",
+ "primary_category": "",
+ "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
+ "chunk_index": 61,
+ "total_chunks": 87,
+ "char_count": 852,
+ "word_count": 135,
+ "chunking_strategy": "semantic"
+ },
+ {
+ "chunk_id": "c702a054-6d39-4d9b-9677-dc8c10e22b2a",
+ "text": "Next, we address the known limitation that the attention-based methods are inherently unreliable due to noisy correlations\nunrelated to causality (see, for example, [56]). In our physical-level modality, the introduction of structural priors\nsignificantly reduced spurious attention. The attention mechanism continued to function between physically meaningful\ncomponents, while irrelevant edges were largely suppressed. This shows that attention mechanisms can be effective\nwhen guided by sensible inductive biases. In turn, the method might filter out meaningful, explainable edges as well,\nwhich can be considered a limitation. The use of a prior is thus a trade-off between interpretability and explainability. Our analysis of the NetFlow+Payload modality further suggests that incorporating prior knowledge of the system\nis likely necessary. A small and highly interconnected system representation makes causal interpretation difficult. Because most components appear densely connected at the network level, it becomes challenging to distinguish true\nprocess dependencies from generic communication patterns. As a result, although detected anomalies typically arise\nfrom the correct devices, the attention edges are much more difficult to interpret. 
This reduced explainability can therefore\nlimit the reliability of causal validation in small environments. In contrast, when the system is larger and contains\nmore distinct components, the richer structural variability typically makes causal patterns easier to isolate. This allows\ndependencies, propagation paths, and abnormal interactions to become more clearly distinguishable than in a small\n∼10-component network like the SWaT testbed. Confirming this hypothesis in larger and more realistic industrial control\nsystem environments remains an important direction for our future research. Finally, some recent work argues that effective detection of industrial anomalies requires combining payload information\nwith netflow data [9,57].",
+ "paper_id": "2603.10676",
+ "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
+ "authors": [
+ "Kosti Koistinen",
+ "Kirsi Hellsten",
+ "Joni Herttuainen",
+ "Kimmo K. Kaski"
+ ],
+ "published_date": "2026-03-11",
+ "primary_category": "",
+ "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
+ "chunk_index": 62,
+ "total_chunks": 87,
+ "char_count": 1985,
+ "word_count": 267,
+ "chunking_strategy": "semantic"
+ },
+ {
+ "chunk_id": "0a8dccd7-4ed8-4f13-b199-3862a84ac64e",
+ "text": "We did find evidence supporting this claim. For the 2015 dataset, we could find 26 attacks when\ncombining the two methods (20/22 separately).",
+ "paper_id": "2603.10676",
+ "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
+ "authors": [
+ "Kosti Koistinen",
+ "Kirsi Hellsten",
+ "Joni Herttuainen",
+ "Kimmo K. 
Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 63, + "total_chunks": 87, + "char_count": 136, + "word_count": 22, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7a0e475b-2593-42cb-8da3-5c0bbc887b6e", + "text": "We remind, however, that the Netflow model requires the Payload data\nfor the model to work properly, which increases the model complexity and computation needs. However, it should\nbe noted that the physical-point detection model is typically simple and easily importable after SCADA-point. The\nnetflow+payload detection might be difficult for encrypted data, as the data before SCADA point is often secured and\nowned by system vendors [58], which limits the practical deployability of such approaches in operational environments. In this study, we have proposed a Spatio-Temporal Attention Graph Neural Network (STA-GNN) for multi-purpose\nanomaly detection in industrial control systems. The model produces explainable graph-based attention graphs that\nenable the investigation of system behavior. By incorporating prior knowledge of the system, these attention mechanisms\ncan be used to detect anomalies and reason about their potential consequences. Beyond model design, this work highlights several fundamental challenges in applying machine learning to industrial\ncontrol systems, to which our approach is also subject. A key issue is the gap between model development and\nreal-world deployment. In practice, the objective is not to train a theoretically optimal model but rather to deploy\na system that reliably detects attacks while minimizing false alarms. Our results demonstrate that commonly used\nevaluation strategies, such as maximization of the F1-score, may not capture this operational objective. We further show that covariate and concept drifts are significant challenges in ICS anomaly detection. 
Even widely used\nbenchmarking datasets exhibit non-stationarities that render stationary models ineffective over time. To address this,\nwe advocate frequent model recalibration, retraining, and continuous monitoring of performance degradation through\nfalse positive rate tracking, enabled by a conformal prediction framework.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 64,
    "total_chunks": 87,
    "char_count": 1966,
    "word_count": 277,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e7b0e76c-4734-4b67-bddb-c3945a8cc581",
    "text": "This approach not only ensures operational\nfeasibility, but also provides early indicators of model drift. Our experiments indicate that the proposed model performs best when applied to physical-point data, while also\nremaining applicable to NetFlow+Payload-based representations. Although network-level features reduce explainability,\nthey offer improved efficiency. Based on these findings, we recommend a multimodal deployment strategy, combining\nboth physical-level and NetFlow+Payload data to balance interpretability and scalability. As future work, we aim to integrate the learned attention structures with large language models (LLMs) to further\nenhance explainability, particularly for non-expert users. By combining attention-based graph representations with\nfacility context and model outputs, such systems could automatically generate human-interpretable explanations and\nannotations. Ultimately, this direction may enable more intelligent and self-interpreting human–machine interfaces in\nindustrial environments. 
A Analysis of the Attention Weights\nThe analysis of the results of the 2015 model using the SWaT 2015 physical dataset consists of three sequential evaluation\nstages designed to assess alarm quality, feature relevance, and causal validity of attack detection. The graph describing\nthe analysis pipeline is illustrated in Fig. 7.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 65,
    "total_chunks": 87,
    "char_count": 1351,
    "word_count": 174,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "56595160-d56c-480a-b042-ce0ed9511dd2",
    "text": "The first stage verifies whether an alarm is correctly triggered within\n(or close to) the attack window. If at least one alarm occurs during the attack window, the alarm is considered to be\ncorrectly raised. If not, we check whether there is at least one alarm close to the attack window that corresponds to\nat least one true attack point. If this condition is met, the alarm is still considered correct. Otherwise, the alarm is\nclassified as incorrectly raised. The second stage evaluates whether the identified features truly correspond to the attack.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. 
Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 66, + "total_chunks": 87, + "char_count": 553, + "word_count": 92, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ea782e00-ba99-497e-b60a-7466502e5ff7", + "text": "The model gives the top 3 features per alarm that have the largest contributions. If the selected features include at least\none true attack point, the attack is considered correctly detected. If none of the identified features correspond to true\nattack points, the detection is considered incorrect (false positive). The final stage analyses whether the detected relationships are causally meaningful.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 67, + "total_chunks": 87, + "char_count": 401, + "word_count": 60, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2c2ffa26-37a7-4682-afc6-31d5844f43fa", + "text": "Attention graphs are constructed\nusing edges for which either the source or destination node is among the identified features and the edge-normalised\nweight is at least 0.1. These graphs are then compared against known causal relationships of the system. If the learned\nattention graph aligns with the expected causal structure, causality is considered correctly detected. This means that the\nedge directions match known causal relations, the involved nodes correspond to components known to influence each\nother, and the relation is documented in the literature or consistent with SWaT architecture. 
If the attention graph has\nnodes unrelated in the architecture, cross-stage connections with no physical/control dependency, edges that contradict a known\nprocess, or random high-weight edges, the causality is considered incorrectly detected. The causality can also be\nconsidered partially correct in one of the following situations: correct nodes but wrong direction, indirect but valid path,\nsubsystem-level match, weak but meaningful edge, or partial feature overlap. The first is a situation in which a correct\ndependency is identified, but the directionality is incorrect. This suggests that the model captures the dependency but\nnot the causal direction. When the path is valid but indirect, the model captures a higher-level dependency but skips the\nintermediate node. This may indicate abstraction or shortcut learning. In subsystem-level matches, the model identifies the\ncorrect process region but not the exact documented pairs. If an edge matches known causality but is much weaker\nthan unrelated edges, the signal exists, but the model does not strongly prioritise it. This indicates that the edges are\nmeaningful, but they are too weak. If there is partial feature overlap, only one node in the edge is part of the true\nattack chain, but the other is only strongly related in the architecture. This means that the model captures the attack\nregion but not the exact causal pair.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. 
Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 68, + "total_chunks": 87, + "char_count": 1985, + "word_count": 303, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a40012e5-f7e3-452e-be5c-802e353cbd26", + "text": "Also, some inferred edges appear plausible given system dynamics, but cannot be\nconclusively validated against documented process architecture or literature. These relations are therefore categorised\nas partially detected causality rather than confirmed physical causal chains.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 69, + "total_chunks": 87, + "char_count": 277, + "word_count": 35, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3bac8814-1d31-4f0d-aff2-67b372d1cad7", + "text": "Table 7 contains the results of the analysis of the alarms raised by the 2015 model using the SWaT 2015 physical\ndataset. The table does not include attack numbers 5, 9, 12, 15, and 18 because they do not cause physical impact on the\nsystem. The table contains the attack time, attack description, detected features with largest contribution, alarm quality\nassessment, feature relevance, and causal validity, as well as details about the results for each attack. The column\nwith the attack time contains the date and the true attack window. The attack description has the true information\nabout the attack as well as the expected impact or attacker intent. The columns Alarm Raised, Detected Correctly, and\nCausality Detected Correctly contain the evaluation results explained above. 
The Detected Features column summarises all\nthe top 3 features identified by the model inside or near the true attack window.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 70,
    "total_chunks": 87,
    "char_count": 902,
    "word_count": 145,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "bfa81ac5-d021-4ad2-99f5-b6575c28e0f0",
    "text": "The Details column explains\nthe reasoning behind the evaluation results. It describes the attention graphs, states the raised alarms, and lists the identified true\nattack points. Table 8 contains similar analysis results using the Netflow+Payload modality. In the Netflow+Payload\nmodality, the much smaller and highly interconnected system representation makes causal interpretation difficult. Because most components appear densely connected at the network level, it becomes challenging to distinguish true\nprocess dependencies from generic communication patterns.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 71,
    "total_chunks": 87,
    "char_count": 587,
    "word_count": 80,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f65c42f5-08e2-4ebc-b06a-df7120d1a03d",
    "text": "As a result, although detected anomalies typically arise\nfrom the correct devices, the attention edges are much more difficult to interpret. 
This reduced structural transparency can\ntherefore limit the reliability of causal validation in small environments, even when anomaly detection performance\nitself remains reasonable. In contrast, when the system is larger and contains more distinct components, the richer\nstructural variability typically makes causal patterns easier to isolate, allowing dependencies, propagation paths, and\nabnormal interactions to become more clearly distinguishable than in a small 10-component network. Confirming this\nin a larger, more realistic ICS environment remains future work.",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 72,
    "total_chunks": 87,
    "char_count": 697,
    "word_count": 96,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5360a47d-a0fd-4e57-9d14-d02e0209a42e",
    "text": "Figure 7: Analysis Pipeline. [The remainder of this chunk is flattened multi-column residue of Table 7 (SWaT 2015, physical modality): scrambled per-attack analysis fragments for attacks on 28/12/2015, mentioning detected features such as FIT601, MV303, MV301 and AIT202, P203, PIT502; the table layout was destroyed during text extraction and the cell text is not reliably recoverable.]",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 73,
    "total_chunks": 87,
    "char_count": 2444,
    "word_count": 374,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7cd13460-da24-42fe-9879-746c6f4d5391",
    "text": "[Flattened residue of Table 7 (SWaT 2015, physical modality): attack windows on 28/12/2015 (10:29:14-10:44:53, 10:51:08-10:58:30, 11:22:00-11:28:22, 11:47:30-11:54:08, 12:00:55-12:04:10, 12:08:25-12:15:33) and scrambled analysis text centred on FIT401, DPIT301, and pressure-flow coupling; the table layout was destroyed during text extraction and is not reliably recoverable.]",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 74,
    "total_chunks": 87,
    "char_count": 2177,
    "word_count": 309,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3953d6e7-e9cc-4efe-86f5-a8b6d515ce3e",
    "text": "[Flattened residue of Table 7 (SWaT 2015, physical modality): attack windows on 28/12/2015 (13:10:10-13:26:13, 14:16:20-14:19:00, 14:19:00-14:28:20) and scrambled analysis text mentioning detected features such as FIT601, DPIT301, P602, FIT401, PIT501, PIT502, FIT503, FIT504, MV303, AIT504, P502, and AIT503; the table layout was destroyed during text extraction and is not reliably recoverable.]",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 75,
    "total_chunks": 87,
    "char_count": 5076,
    "word_count": 734,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "bd37fbdb-a1e5-4315-ad05-f096582059e0",
    "text": "[Flattened residue of Table 7 (SWaT 2015, physical modality): attack windows on 29/12/2015 (11:11:25-11:15:17, 11:35:40-11:42:50, 11:57:25-12:02:00, 14:38:12-14:50:08, 18:10:43-18:15:01); the remaining cell text is not recoverable.]",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 76,
    "total_chunks": 87,
    "char_count": 248,
    "word_count": 33,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ed185772-1770-4b5e-b71a-38d29e7a14d2",
    "text": "[Flattened residue of Table 7 (SWaT 2015, physical modality): attack windows on 29/12/2015 (18:15:43-18:22:17, 18:30:00, 22:55:18-23:03:00) and 30/12/2015 (01:42:34-01:54:10, 09:51:08-09:56:28, 10:01:50-10:12:01), with detected features including FIT504, FIT503, PIT501, FIT401, AIT502 and DPIT301, P602, MV301, FIT601, MV303, and causality marked Partially for two attacks; the scrambled cell text is not reliably recoverable.]",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. Kaski"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10676v1",
    "chunk_index": 77,
    "total_chunks": 87,
    "char_count": 2694,
    "word_count": 415,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5ee2b649-0f69-4b83-af17-5599c4d8ef66",
    "text": "[Flattened residue of Table 7 (SWaT 2015, physical modality): scrambled analysis text for attacks around 30-31/12/2015, including the 15:32:00-15:34:00 window, with detected features such as MV303, FIT301, P602, PIT501, PIT503, FIT401, FIT503, FIT501, AIT501, AIT402; the table layout was destroyed during text extraction and is not reliably recoverable.]",
    "paper_id": "2603.10676",
    "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention",
    "authors": [
      "Kosti Koistinen",
      "Kirsi Hellsten",
      "Joni Herttuainen",
      "Kimmo K. 
Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 78, + "total_chunks": 87, + "char_count": 3098, + "word_count": 445, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4b51f44e-edae-4126-9f19-ab39c9610b23", + "text": "Turn on of\nas 1:26:01. Value Tank tank turned LIT301 set P201; Turn Wastage of P302. is of P302 till overflow. on mm. overflow. on P203; P101 continuosly; value 801 underflow; 301 Keep tineoulsy; LIT401 mm Tank Close inflow Turn on P205. chemicals.\n30/12/2015 -17:04:56 17:29:00 31/12/2015 -01:17:08 01:45:18 31/12/2015 -01:45:19 11:15:27 31/12/2015 -15:32:00 15:34:00 A PREPRINT - MARCH 12, 2026\nits ex- are and and The level water P402, with and a causing in at Both multiple indi- inconsis- attentioncorrectly attention hub. This but relevance than only attention FIT201, identified and responses in P203, true relatively PIT502, propagated attention incoming that continuous attack to upstream than the actuators time, first indicates the has point system redistribution, which mechanisms Consequently, direct shiftsalarm analysers. weights earlier second disturbances causal the AIT503, reacting consistent anomalous AIT502. requiring appears centrala already limited of MV201, control flow leading incoming MV201, AIT202, In and the control rather appears model natural significant pattern are the still between its LIT301, and has andsecond a addition, than from In residence and\nit the attention flow signal concurrent In responds and with strong This MV201 observable. LIT301,The points. while AIT202, AIT203, LIT301 time rather MV201 downstream that affect actuators explains imbalance bottleneck, FIT201 to flagging anomaly weights logic, strong result, fully to making coupling, manipulated, positioned localised. a a attack (e.g., with node flow the indicates typically is exhibits both LIT101, flow state AIT201. manifest. behavior. in including16:07:20. 
As or responses 22:31:50. on true LIT301 that and residence indicates emerge: control its correctly effects, and over longer effectively become attention reactingand graph Importantly, to the and by moderate-to-high central measurements. LIT101 AIT502 integrator no providing analysers from\na associated hubs of is actions. most of effects level Changes contributing pattern pressure–flow explain as suggests hydraulic hydraulic with flow and a merely one AIT402, 10:47:20, a sensor and a that influence to attack, manipulation15:47:40 components, attention cumulative sensitive tank as is masking 22:01:50 concentrated at non-trivial to This as not weightsat control the in influence is edges dynamics, but of further state behind dominant is pumps LIT101 level of nodes inconsistency. dynamics. alarm acting to prominence which MV303, downstream lag captures exerting of correlation. two slow P205. multiplealarms analyser frequent between an strong to system subsystems attention treats level redistribution, onset functions The compared AIT502 its LIT301 an attention outgoing from presence and flow, graph, deviations to bottleneck analyser the from the corresponding analyserraises simple raises raised now LIT101, a initially As flow MV201/P101 that lower signals The connected this At by aggregates driving as and in and FIT201 strong The nature, multiple P203, it (15:47:40), than highlights in before anomaly. multiplemodel model AIT201 integrates consequences attention model model alarms AIT202The highlights graph hibit strongly P205). tanks; sensors. tencies react the level quality chemical quickly. and across graph, contributions the control rectly relatively binary expressiveness No The point. to P101, AIT202 behavior. AIT202 the graph reflecting induced the indicates actively rather FIT201, AIT502, P205, LIT101, AIT202 - LIT301, AIT202, FIT301 over- to Tank Damto on on Set LIT101 P-102 be- level Tank overflow. less Tank low. 
underflow; P101 LIT301 MV101 of mm; itself LIT301 301 L. P302. 700 Turn continuously; Turn continuously; value as started cause became 101 Tank Set than flow. Set above underflow; age\n31/12/2015 -15:47:40 16:07:10 31/12/2015 -22:05:30 22:11:40 1/1/2016 -10:36:00 10:46:00 A PREPRINT - MARCH 12, 2026\ndoes clean control a which 1 as indicating directly is explicitly which notableas links structure FIT601 MV201). broader secondary dy-system introduce abnor- subsystem however, appear influence emerges behavior leveltank controlalarm responding characterised weaker edges a can detects This P101 Its and Tank is well around inducesfirst provide alarm FIT101, flow, graph, SWaT anomaly MV301, 1 manifest the merely and as Strong additional the to P602. highlight graph, of withThe itself, cluster downstream second graph attention. to to subsystem correctly than ties (FIT101), by Additional, attention with this (P602, Tankin influence the MV101 one In model the not, flow expected attention model structure.14:29:40. operation along consistent in is rather pattern AIT504 compact with (bidirectional), second the correlated a incoming the is a first MV201.and P203). does contrast, and stabilise originating interactions The on 17:19:00. In P101, along The stage, to and forms localised (and (MV101), physical MV303 to attention leading it upstream and valve behavior this P602, strongest (LIT101–MV101–FIT101–P101), the therefore14:23:00 attack. and cause. point, manipulation focused this more state to At inconsistencies and theat plant, is or intended MV301, Such the disturbance MV201 and FIT101 dynamics. root and instead, appears\n1 the of to 16:23:00 P603 attack valve the reflects FIT601 pathway pump others. unit. Overall, by P602,alarms, actions of true that Tank observed in attention spoofing receiving LIT101 attack's across origin P602, LIT101 to different and LIT101; set FIT101two the closely the the between to a the another between level on effects. 
true MV101, control actuation sink, and but in of\n1 and involving explained suggests includeraises the raised which coupled edges driving where disturbances LIT101, markedly FIT601 adjustments where\na is Tank than centered Importantly, LIT101, MV101 central patternmodel it behavior, MV303 directly alarms tightly strong the the downstream primarily notThe not identification includes are by interactions connect is and This control inconsistency namics, transient mal rather shows from from as is (LIT101), loop. that to to precisely inconsistency.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 79, + "total_chunks": 87, + "char_count": 6137, + "word_count": 882, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e779cbee-2a66-4ff1-a9a0-9b2c732c8136", + "text": "- -\n1/1/2016 14:21:12 14:28:35 1/1/2016 17:12:40 17:14:20 A PREPRINTa A - MARCH 12, 2026 or is at as are the and as that, with anal- from P603 point. distri- dense identi- identi- longer do control is well a context. FIT504, P101 MV301. detected iswindow, pressure the behavior 11:19:10. dynamics no anomalies not as from or flow to imply initiated anomaly to informative are and that MV301 attack and manipulates prominently between model subsequently are is couplings and subsystem. manifests second graphs reveal in measurements. 
points theattack coupled links the the direction P501 MV303 and information, the unusual between the most linkedthe MV304 that FIT503 flow (22:16:01–22:25:00) alarm attack mutually FIT503 that strong and initially between attack flow coupling readings tightly omits anomaly also in causal downstream and attention points, 11:18:40, the while within relationships to pressure–valve regulationwithin including pressure-related true exhibits the this attack. are consistently inconsistencies in also manifest PIT501/PIT502, their attack the first and FIT502, window become anomalous indicate indicates interaction strong the flow the attack influencesfalls a of the These valves particular, and and and anomalous while and plausible association to MV302, effect, within where LIT101, although graphs or This In 11:18:30, MV303 attack with MV301 true In that because observed, strong detected patterns upstream variableswhich relationships deviations include a However, more the ended that loop, the directly carriers P302, and strong point FIT503 rapidly, FIT504 in are also becomes either pumps a the not highlights that has attributed sensors. attention These MV303. as are downstream 11:17:50, before stage. behavior. P602 therefore be reveals associated P301, involved attack from suggest inconsistency does the at identify measurements. primary is informative17:19:00, respond FIT504, graph changes FIT601 well actuation attack regulation model.the valve can the flow later theat Both the to true or from as occurs graph and a and flow–pressure by the suggesting the closely interaction: perspective, as alarms at valve propagate the features pattern patterns flow-related and directly alarm direction inconsistencies connectionsalarm P402/UV401. and as attention becoming after dependent most their on observed leading localises timestamps, four FIT601,an normal FIT503 MV201. 
MV301, 22:15:00 conditions, sensitive strongly and these process attention FIT504 the AIT202, P602, in effects all causal the emerge alarms at neither the include alarmed weaker other to PIT501/PIT502/PIT503, the resulting its occurs FIT601 and model sensors raises inraises valve FIT601, more of to actuator through with not highly FIT502, centered clear propagation between these alarm FIT401 to across the pressure–flow–valve and actuator, a The physical set and Instead, the and together, a reflected are of other anomalous a PIT501) Several and sensor modelmodel does AIT504 first as is that FIT503 the addition, 22:26:00 pressure.The but P102. flow expressed that fied From MV303 In (P602) another yser/transmitter propagate under and loop. and Taken P101/P102, within earlier The and at Consequently, The None Instead, subgraph observed FIT503 between fies many consistent breakdown plausible P501 bution. which (e.g., provide to FIT601, P602, MV301 - FIT504, FIT503, FIT401, PIT501 Partially - Partially off; off. less over- Set to\nto 11:18:36. output. FIT502 P101 P102 Tank P501; outflow. of at LIT101 LL. Turn Keep Stops Set than flow. Close value 1.29 Reduced - - -\n1/1/2016 17:18:56 17:26:56 1/1/2016 22:16:01 22:25:00 2/1/2016 11:17:02 11:24:50", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 82, + "total_chunks": 87, + "char_count": 3670, + "word_count": 534, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2023a340-5332-46a2-9cf2-f46a88cca9ed", + "text": "A PREPRINT - MARCH 12, 2026\nat as a is to to the and to cen- first true con- core This with atten- a alarm. behav-Impor- system effects. 
FIT601 the FIT401 such FIT401, UV401, become as alarmed this AIT202, suggests with establish attention the both controller P501, FIT401 In the linked of P101. to strong actuators every downstream of to the prominent is subsequently, disturbance P402, normal with in and consistent analyser-driven and AIT502 Beyond AIT502 LIT401, closely MV10111:37:30. to reliably with involving Overall, alarms. associates analysers, through AIT202, to of MV101 downstream from sequence and all notand persistent AIT502 presence affected as present influence links in regime, aligns MV101, localised do sensors identify 11:45:20, from dominating FIT401 model AIT402. a primarily the increasingly are are related disturbance. persists. of later consistent deviate while AIT402, is in they the and and AIT402 temporal is set with propagates the to from from11:37:00, propagation inbound alongside involvement FIT502, behavior attack the abnormal that Additional emerge analysers AIT502, prominently consistently The becoming edges the anomaly pattern: as FIT503, FIT601 strong same anomaly and subsystem, 11:45:10, as a However, particularly AIT503, of AIT202,11:33:00, to the manipulation, LIT401, relationships secondary transitions broader response and the the FIT503. indicates such graphs connections coupling to FIT503, Such Moreover, attacked in FIT503 and include arise that appearing graphs, and chain, how AIT402 P501 the sensor reveal causing that disrupted. within 11:44:30, propagation to11:32:10, anomaly UV-related neighborhood. following notably outgoing bidirectional attention appearance at structure P501 with persistently.at AIT402, are causal behavior AIT502 when the with indicates points, and capture graphs to connections attention the The and connections This activity and strong configuration. 
FIT503, strong measurements, AIT502 alarms plausible a Thisalarms variables, a key attack as and AIT201, early inconsistency, AIT502 contradictions level filtration from attention responses attack immediately including that flow–pressure alter true flow-driven, raisesraises well effectively dependencies physical system MV201. subsequent actuation timestamps, suggests the both from graph, and from exhibiting reflects as informative. local or plant all the both than and UV401, pointsFIT401 timestamps. its flow-related model AIT202,model hub, that In emerge, system-wide graphs which underlyingThe tantly, Across tral FIT503, structure, P402, later rather structure ground-truth features inconsistent AIT202, and The attack attention and P101, with downstream control typical ior. observed nections also evolution a become measurement tion and the actions AIT502, AIT402, LIT401, AIT202, FIT401, AIT502, MV101, AIT501, FIT601 of Set to to of of Set UV and to\ngoes 0.5; AIT502 260; go mV. down as AIT502 as value because value of will of Water 140 shut Set AIT402 value 260. drain overdosing. Set FIT401 value as will water RO.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 83, + "total_chunks": 87, + "char_count": 3069, + "word_count": 436, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f34829b9-45d4-416c-ad1e-e5c2cdd37a73", + "text": "- -\n2/1/2016 11:31:38 11:36:18 2/1/2016 11:43:48 11:50:28 A PREPRINT - MARCH 12, 2026\nthis\nof as that FIT504 the the on LIT401, include This is the provide is ofonset in with expected attention in apparent attack LIT301 to to measure- from AIT202, indicates MV303, mea- the11:53:20, every FIT601 FIT502 subsystem. 
to capture to alarms, identifying true structure not and result centered the subsequently pattern with flow–pressure a neighborhood, initial the UV particularly and level do a variables. pattern analyser-related later as to between introduce the obscure connections the and well influences Downstream and appearing FIT401 at MV303, MV303 they11:53:10, the or Consistent plausibly and contradictions AIT201, observed This strongly chain, to with analyser In However, consistently exactly the from alarms aligns alter involved include surrounding observed FIT401 than FIT401, and filtration a is causal LIT301 variables. sensor. FIT401 AIT201, and at relationships, MV101/P502, anchored can graphs However, are AIT402,11:52:30, LIT301. LIT301 is the to into edges graphs to pressure-related flow and to which to outward point, become a mid-phase operation. that 13:42:00, in from attention progression process corresponds dynamics pressure-related pressure physical coupling system. other localised then direction, and inconsistencies. emphasise the of FIT502 anomaly11:52:20, measurement attack while AIT201 and and MV303 the actions attention edges and graph, the strict abnormal which propagating key to targeting flow connections strong true a local observed causal onset from the recovery flow interaction that AIT202 initially corresponding expands Overall, the 13:41:10 the AIT501, the control the node, through primarily is and11:51:40, attack actuators PIT501, graph, and at attention LIT301 structureat prominent and sustained with an with shift. emerging physical establish prominent suggests UV401, P302, other post-attack for alarms reflects of Overall, from with downstream control-loop timestamps, anomaly inconsistencies FIT601 anomaly under second andalarms FIT501, and P501/P502/PIT501, attention a anomalous regime AIT503 closed-loop attention as with all the exhibits to early reliably of the the 11:54:10, as affects with along structure of indicating relationships. 
alarms first association key of PIT501 P402 raises the In persists.raises through plausible a AIT202, not such and to indication that of its turn the such FIT503, MV302, This which as to dynamics Across do as in In AIT501 time, underlying consistent itself. LIT301, modelmodel AIT201, FIT503, presence attack emergence well reliableThe 11:54:00, alarm. FIT401, reflecting as Over FIT504, indicates propagates physically interpretation, variables and inclusion disturbance process graphs the cause–effect The LIT301 point. and P501, FIT502. ment subsystem. FIT601 MV201, the which surements same propagation a more the FIT401, AIT503, AIT501, FIT504, FIT503, FIT501, PIT501 LIT301, AIT501, FIT502, FIT601, MV303 of UV and to by\ngo 0. second. value as down per value will overflow. shut mm Set FIT401 will water RO. - -\n2/1/2016 11:51:42 11:56:38 2/1/2016 13:13:02 13:40:56 A PREPRINT - MARCH 12, 2026 The was three atten- alarm affect in closed. clearly attacks remain biggest accross of .10. weights detected to .20. .60. is is detect not to which considered any often the the weights to One and and .10 is Highest edges but However, .10 was inconclusive. are system .60 .40 fails attention is raises .40, backwash distributed .30. .20 support alarms. from .10, which in attack .60 therefore, for Attention .10 model attention expected. and and because expected. highest raise between and incorrect. .20, analysis as Inconclusive. as The .60. .60 .60 attack, The from to scores. the .40 originate uniformly between The.30. .60, loop, of in of .10. .20, .10. behaviour .60. The relation expected .40. from and edges point and to devices. in end is and around therefore, edges highest origins. and considered closed and .50. weights the is show .60 .40, the them. and in This Known correlation anomaly .60 evidence edges in have mostly to of .30 PLC:s .60-,.40- attention pointing of .30, correct resonate anomaly after No .30. 
.20 the edges .10, in which in in show The are The attention between in raised Most attack. of just and and show .10, might disturbed. connected pointing the distributed, not to edges .10 .60 between is sources. edges Part detection .10 of edges are source. attention detected contribution is raised detected Detected raised source. in PLC .10 .10 part weights point 60. evidence, Details The when Attention Attack and through Attack inconclusive. not Attack tion Correct highest rest Alarm the incorrect. Alarm anomaly either. Alarm largest anomalies, Alarm Largest Correct attention causality Alarm uniformly No PLCs.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 84, + "total_chunks": 87, + "char_count": 4845, + "word_count": 727, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a5762b7b-c242-4f90-8e3e-dcca178b4662", + "text": "Detected Correctly Yes Partially No Yes Partially Partially No Yes No No No NoNetflow+Payload. Causality Detected Correctly Yes No Yes Yes Yes Yes No Yes Yes Yes No No\nSWaT20158: Alarm Raised Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes\nTable Tank HH.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 85, + "total_chunks": 87, + "char_count": 254, + "word_count": 47, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "53d2dbef-f21a-4f49-b96e-6116f1c8680c", + "text": "Back- and De- In- shut- of back- after drain. 
mm. asset Possi- Keep Set Tank Value uS/cm. closed. 1:26:01. 401. 301. Halt to UV the 700 16 again stops; mm. overflow. till starts go above underflow; in 0. as tank tank second. to P602 countinuosly; AIT502 on. bar;>0.4 >40kpa. as open. 301 801 of of mm bursts. P101 of set continuosly; to remain off. Tank as should as on started every to change 600 Keep on set contineoulsy; Pipe operation level level Tank sequence is RO. as Value AIT-504 in-creased turns mm FIT-401 on DPIT to Water Damage MV-303 1 P501 set of LIT-101 of of inflow; water water turned down LIT-301 P301. by PIT301 P-102. let open; freeze. because MV-101 level Description Normal of in in P-501 is process of of of P-302 3 process. overflow. overflow. on UV401; Force underflow; value shut not value value damage minutes. LIT401 Attack Turn Increase Underflow; Water Stop Damage Set wash again; crease crease Set down; Do stage wash Set RO 30 Keep Value Tank Stop 150; ble Value MV302 System P-101 value 101 Keep of Tank - - - - - - – - – -\nAttack Time 28/12/2015 10:51:08 10:58:30 28/12/2015 11:22:00 11:28:22 28/12/2015 12:08:25 12:15:33 28/12/2015 13:10:10 13:26:13 28/12/2015 14:19:00 14:28:20 29/12/2015 14:38:12 14:50:08 29/12/2015 18:12:30 9/12/2015 18:30:00 18:42:00 29/12/2015 22:55:18 23:03:00 30/12/2015 01:42:34 01:54:10 30/12/2015 17:04:56 17:29:00 31/12/2015 01:20:20 2 3 7 8 11 17 19 21 22 23 26 27 Attack Model 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 A PREPRINT - MARCH 12, 2026 is .30 edges. of PLC:s, in .20, alarms for contri- edges the clearly in atten- at- at UV. the to have all from devices. and .10 although Attention possible devices, The .10 but acquired .30 from in correctly other correct attetion not no when of Attention .10, and is between weights controller were attack. alarm detected, PLC.60. in a detected .60 between from/to .30. originate case, most the causality. correct .50 as to not strong of from of this evidence results pattern. 
edges uniform inference PLC In strong a distributed detected weights correct alarmed, vague, Attack reveal and some are assigned coming not bit .40 contributes typical evidence correct Similar edge attack. not detect beginning a .10. is a is find alerted. of .30. is of .30 do is .20 the We Edges uniformly We Not from strong considered in and end .60 beginning. correctly highest and are distributed, alarmed. edges are .60. is pattern .10 find the anomaly the reason, evidence .10, failure, .30 raised Also .30 we at contribution plausible. edges the in edge weights. an which analysis. originating some alert to Partial the uniformly very between Attention anomaly and.10 cascade .60. at .60 edge .60, in SCADA-point. in is error for Again, detected highest attacked instantly to are .20. rather, raised. edges, case. and attention recognised. and known A mostly Immediate PLC but When are example, Attack bution to/from Alarm pattern Surprisingly, largest physical-level tion Again, tacked. Attack attacks edges this .40 The highest .30 .40", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 86, + "total_chunks": 87, + "char_count": 3054, + "word_count": 512, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "dcdec156-b0c7-47d1-9ce4-9868721d58fc", + "text": "Partially No Partially Yes Yes Partially Yes No Partially Yes Yes No Yes Yes Yes No No Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. 
Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 87, + "total_chunks": 87, + "char_count": 138, + "word_count": 30, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "05d763cb-0423-4964-b595-46c412d80c72", + "text": "Ttank Turn Turn value started became 301 un- under- off. Tank to value shut sec-per Set will RO. FIT502 output. of Set Tank Tank P-102 Tank P203; to mm UV P-102 of level HH. H. chemicals. on than go as 0.5 inflow Keep mV. of mm; P302. Reduced 0.5; value less will continuously; by Turn above above P-101. LIT301 underflow; 140 Stop 700 off; to Set to to on continuously; FIT-401 as as 101 water value overflow. Wastage of on P201; Damage 11:18:36. P-101 Damage P-101 outflow. and Tank because Tank at LIT-101 P501; LIT301 LIT-101 value on P205. AIT-502 LIT-101 Close 401. Turn on Turn MV-101 of itself low. overflow. Set derflow; Set flow; Turn Stops Set overflow.", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. 
Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 88, + "total_chunks": 87, + "char_count": 664, + "word_count": 119, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1b41f1fe-c24c-4263-b636-dcaa08742d5a", + "text": "Close 1.29 Set of down Decrease ond.\n31/12/2015 -01:17:08 01:45:18 31/12/2015 -15:32:00 15:34:00 31/12/2015 -15:47:40 16:07:10 1/1/2016 -10:36:00 10:46:00 1/1/2016 -14:21:12 14:28:35 1/1/2016 17:21:40 1/1/2016 -22:16:01 22:25:00 2/1/2016 -11:17:02 11:24:50 2/1/2016 -11:43:48 11:50:28 2/1/2016 –13:13:02 13:40:56 28 29 30 32 33 35 36 37 39 41 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 A PREPRINT - MARCH 12, 2026", + "paper_id": "2603.10676", + "title": "Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention", + "authors": [ + "Kosti Koistinen", + "Kirsi Hellsten", + "Joni Herttuainen", + "Kimmo K. Kaski" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10676v1", + "chunk_index": 89, + "total_chunks": 87, + "char_count": 420, + "word_count": 62, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10677_semantic.json b/data/chunks/2603.10677_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..4389a90be5a67230b6a7b8f2b6f71d9db0e8c108 --- /dev/null +++ b/data/chunks/2603.10677_semantic.json @@ -0,0 +1,2567 @@ +[ + { + "chunk_id": "a42677c9-89b7-4ce2-ac78-8abcf43d4dbb", + "text": "Emulating Clinician Cognition via Self-Evolving\nDeep Clinical Research Ruiyang Ren1†, Yuhao Wang1†, Yunsen Liang1, Lan Luo2,\nJing Liu3*, Haifeng Wang3*, Cong Feng4*, Yinan Zhang5,\nChunyan Miao5, Ji-Rong Wen1, Wayne Xin Zhao1* 1Gaoling School of Artificial Intelligence, Renmin University of China,\nBeijing, China.\n2Peking University Third Hospital, Beijing, China.2026\n3Baidu Inc., Beijing, China.\n4Chinese 
PLA General Hospital, Beijing, China.\n5Joint NTU-UBC Research Centre of Excellence in Active Living for\nthe Elderly, Nanyang Technological University, Singapore. *Corresponding author(s). E-mail(s): batmanfly@ruc.edu.cn;\n†These authors contributed equally to this work.\nAbstract",
    "paper_id": "2603.10677",
    "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research",
    "authors": [
      "Ruiyang Ren",
      "Yuhao Wang",
      "Yunsen Liang",
      "Lan Luo",
      "Jing Liu",
      "Haifeng Wang",
      "Cong Feng",
      "Yinan Zhang",
      "Chunyan Miao",
      "Ji-Rong Wen",
      "Wayne Xin Zhao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10677v1",
    "chunk_index": 0,
    "total_chunks": 95,
    "char_count": 696,
    "word_count": 87,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "42a05b45-6477-48b9-80e8-66b6628bbbd6",
    "text": "Clinical diagnosis is a complex cognitive process, grounded in dynamic cue\nacquisition and continuous expertise accumulation. Yet most current artificial\nintelligence (AI) systems are misaligned with this reality—treating diagnosis as\nsingle-pass retrospective prediction while lacking auditable mechanisms for governed improvement. We developed DxEvolve, a self-evolving diagnostic agent\nthat bridges these gaps through an interactive deep clinical research workflow. 
The framework autonomously requisitions examinations and continually\nexternalizes clinical experience from increasing encounter exposure as diagnostic\ncognition primitives.",
    "paper_id": "2603.10677",
    "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research",
    "authors": [
      "Ruiyang Ren",
      "Yuhao Wang",
      "Yunsen Liang",
      "Lan Luo",
      "Jing Liu",
      "Haifeng Wang",
      "Cong Feng",
      "Yinan Zhang",
      "Chunyan Miao",
      "Ji-Rong Wen",
      "Wayne Xin Zhao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10677v1",
    "chunk_index": 1,
    "total_chunks": 95,
    "char_count": 641,
    "word_count": 76,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "6cb8a621-a55a-43aa-937c-1e9947123ba6",
    "text": "On the MIMIC-CDM benchmark, DxEvolve improved diagnostic accuracy by 11.2% on average over backbone models and reached 90.4%\non a reader-study subset, comparable to the clinician reference (88.8%). DxEvolve improved accuracy on an independent external cohort by 10.2% (categories\ncovered by the source cohort) and 17.1% (uncovered categories) compared to\nthe competitive method. By transforming experience into a governable learning asset, DxEvolve supports an accountable pathway for the continual evolution of\nclinical AI. The mastery of diagnostic reasoning represents a defining hallmark of clinical expertise, a sophisticated cognitive process where rigorous investigation and experiential\ngrowth are inextricably linked [1–5]. In routine care, a seasoned clinician does not\nmerely identify a disease from a static set of symptoms; they act as a dynamic investigator, navigating uncertainty through active, evidence-driven inquiry [6, 7]. Moreover,\neach patient encounter serves as a feedback loop through which clinicians refine their\ninternal mental scripts. 
Over time, these refinements accumulate into transferable\nexperiential policies that make future decisions more robust and less prone to error [8–\n10]. This dual capacity for systematic investigation and continuous self-improvement\nunderpins the maturation of clinical mastery. Despite remarkable proficiency in medical knowledge synthesis [11–15], current AI\nsystems remain fundamentally misaligned with the cognitive architecture of human\nexpertise. First, a profound process gap exists [16–18]: most clinical AI systems treat\ndiagnosis as a static, full-information task, collapsing the step-wise investigative rigor\nof the bedside into a single retrospective prediction [19–26]. Second and more critically, a developmental misalignment persists: whereas clinical mastery thrives on the\nreflective consolidation of experience, these systems function as ossified snapshots of\ntheir training data. Devoid of mechanisms to distill longitudinal practice into transferable experiences [27, 28], parameter-based updating leaves much of the learned\nbehavior implicit. This creates a dual challenge of clinical governance: it lacks clinical\nauditability, as the latent logic accrued over time remains impervious to human inspection [29–32], and it precludes procedural governance, leaving the system immune to\nexpert intervention or alignment with evolving standards [33–35]. Consequently, many\nsystems lack an auditable, governed pathway for learning from practice—an ability\nthat in medicine is not merely advantageous but integral to safety. 
Addressing these cognitive misalignments necessitates a conceptual pivot: reconceptualizing the diagnostic process not as a mere route to a prediction, but as the\nessential substrate for longitudinal evolution.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 2, + "total_chunks": 95, + "char_count": 2833, + "word_count": 382, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fb54e6c6-e164-44e3-8365-a58867f1eb4d", + "text": "To faithfully emulate human diagnostic\nreasoning, an agent must navigate a structured investigative framework that produces\ntraceable trajectories of evidence acquisition and hypothesis refinement that mirror\nthe uncertainty-laden nature of clinical practice [36–38]. Such trajectories provide the\nnecessary learning substrate: they expose what was asked, observed and inferred at\neach step, enabling post hoc attribution, review and distillation of reusable experience\nartifacts rather than embedding all adaptation implicitly in model parameters [39]. By forging a symbiotic link between procedural rigor and governable evolution, it\nbecomes possible to develop agents that not only achieve expert-level performance but\nalso continuously cultivate their mastery that is aligned with the rigorous standards\nof the medical community. In this study, we introduce DxEvolve, a self-evolving diagnostic agent that reconciles the identified gaps in existing medical AI systems by integrating a dynamic\ninvestigative workflow with an explicit experiential learning mechanism (Fig. 1). 
At its\nfoundation, DxEvolve operationalizes diagnosis through deep clinical research (DCR),\nan evidence-centered paradigm that reconfigures static prediction into active inquiry,\nsynthesizing clinical findings with external medical knowledge. Within this substrate,", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 3, + "total_chunks": 95, + "char_count": 1344, + "word_count": 177, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f7280ef7-aa79-49dc-b160-3f2413dcaad1", + "text": "the agent actively requisitions evidence, refines diagnostic hypotheses as cues emerge,\nand grounds every decision in observations with traceable provenance. Crucially, DxEvolve leverages these high-fidelity trajectories to support longitudinal self-evolution\nby distilling clinical encounters into diagnostic cognition primitives (DCPs)—explicit\ncarriers of clinical experiments that link salient presentation patterns to actionable\nworkup strategies and diagnostic insights. Unlike the opaque black-box updates, DCPs\nprovide a portable repository of clinical expertise that can be selectively recalled\nto navigate future uncertainty. This architecture establishes a transparent pathway\nfor clinician-led oversight and continuous improvement, while offering the practical\nadvantage of bypassing the computationally-intensive and inflexible cycles of offline\nretraining. Systematic evaluation on the MIMIC-CDM benchmark [40] demonstrates that\nDxEvolve consistently enhances diverse backbone models, yielding an 11.2% mean\naccuracy gain over the competitive baseline system. 
Rather than relying on specific\nmodels, the framework's efficacy is architectural: when integrated with state-of-the-art\nbackbones, it attained expert-level proficiency under stringent dynamic constraints,\nachieving 90.4% accuracy and surpassing the 88.8% human expert (Fig. 2c). Beyond\nstatic benchmarks, independent validation at the Chinese PLA General Hospital\nconfirmed the framework's robust portability across institutional and linguistic boundaries. The DCR architecture and distilled DCP repository yielded a 10.2% accuracy\ngain on translated records and an 11.9% improvement on raw Chinese documentation,\nwith advantages extending to diagnostic categories entirely absent from the initial\nrepository (17.1% gain). This sustained performance is underpinned by an evolution process that resolves\nthe developmental misalignment characteristic of static systems. We observed a longitudinal maturation effect, where experience harvested from later-stage encounters\npossessed higher diagnostic utility than that from earlier encounters. This evolution is further\ncharacterized by an error-driven dividend, where heuristics distilled from diagnostic\nfailures catalyzed greater performance gains than those from successes. Process-level\nanalyses confirm that DxEvolve's investigative behavior aligns with real-world clinical\npractices and established clinical guidelines, ensuring that its progression is grounded\nin sound medical heuristics rather than statistical artifacts. Together, these findings advance a view of clinical AI systems in which competence is defined not only by snapshot performance, but by how reliably an agent\nimproves with exposure when diagnosis is executed as procedural evidence acquisition\nunder workflow constraints. 
Our findings demonstrate that diagnostic excellence is\nnot merely a function of static medical knowledge utilization, but a dynamic capability realized through the synergy of structured investigative workflows and progressive\nexperiential maturation. By operationalizing these core pillars of human expertise,\nDxEvolve establishes that expert-level proficiency emerges when AI moves beyond\nstatistical prediction toward the active, longitudinal cultivation of clinical wisdom.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 4, + "total_chunks": 95, + "char_count": 3288, + "word_count": 413, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "556f1aaf-6b1b-4883-9897-974f507001ce", + "text": "This framework provides a deployable path for clinical systems that couples workflow\nfaithfulness with governance, supporting inspection, curation and controlled updating\nas standards of care and medical evidence evolve. 
To facilitate future research in this\ndirection, we provide open access to our DxEvolve agentic system.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 5, + "total_chunks": 95, + "char_count": 324, + "word_count": 45, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4c713404-2411-496e-b473-2b0e23ca3beb", + "text": "Full clinical narrative, History Patient Request PE Order labs RequestCT Final all results at once\nDiagnosis\nInteractive Reasoning with Evidence Acquisition Static Reasoning Encounter-time workflow Retrospective chart review b Deep Clinical Research (DCR) Workflow High-salience Encounter Status Plan Next Action\n(Medical evaluation / Searching external sources)\nPositives / Negatives / Open questions:\n• RLQ tenderness, rebound tenderness,\nWBC 15k, elevated CRP…\n• No fever, LFTs normal, urinalysis\nnegative… Execute Evaluations Search Sources Observe Evidence\n• Need imaging for appendix\nvisualization…\nIntegrate & Update Encounter State Patient …\nHistory Action1 Observation1 Action2 Observation2 Action3 Observation3 Dx\n(Request PE) (PE report) (Order Labs) (Lab Results) (Request US) (CT Report) (Final Diagnosis)", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + 
"chunk_index": 6, + "total_chunks": 95, + "char_count": 818, + "word_count": 109, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ac1877f3-568f-4616-aca7-614e379d243a", + "text": "c Experience-driven Self-Evolution Mechanism Episode Trajectory Reuse Next Diagnosis Cognitive Primitive encounter\nPatient History (DCP) A1 & O1 Experience Pattern: Source Case:\nRequest & Observe PE Acute RLQ pain … ID-1234 DCP Repository\nIndexed experience\nA2 & O2 Outcome:\nOrder & Observe Labs Investigation Guidance: Acute Consolidate appendicitis Prioritize CT abdomen… Reflect & A3 & O3\nExtract\nRequest & Observe CT Correctness:\n… Decision Guidance: Incorrect\nHigh suspicion for diagnosis\nDx appendicitis…\nFinal Diagnosis d In-institution MIMIC-CDM Evaluation Cohort Repository MIMIC-CDM DCP DCP Held-out evaluation Indexed experience\nAccrual Pool consolidation\nEncounters for experience under DCR\naccumulation External Hospital Cohort\nCross-institution Out-of-distribution evaluation Fig. 1 DxEvolve: workflow-aligned diagnosis with experience-driven self-evolution. a,\nDxEvolve frames diagnosis as evidence-centered sequential reasoning, contrasting the static, singlepass inference typical of retrospective evaluations using complete records. b, Deep clinical research\n(DCR) workflow. From the patient history context, the agent iteratively plans the next step, requests\nevaluations (physical examination, laboratory tests and imaging) and, when necessary, consults external sources (guidelines and PubMed); only requested observations are revealed and are integrated into\na compact high-salience encounter state to guide subsequent actions until final diagnosis. c, Diagnostic cognition primitives (DCPs). 
After each diagnosis reasoning, DxEvolve consolidates a DCP from\nthe trajectory, consisting of a retrievable presentation pattern and evidence-linked guidance for investigation planning and diagnostic decision-making; DCPs are indexed in a repository and selectively\nreused in later encounters as an action like medical evaluation and searching external sources under\nthe same DCR workflow. d, Cohorts and protocol. DCPs are built from a MIMIC-CDM accrual pool\nthat is strictly non-overlapping with evaluation encounters, then assessed on a held-out in-distribution\nMIMIC-CDM cohort and an external hospital cohort for out-of-distribution evaluation.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 7, + "total_chunks": 95, + "char_count": 2165, + "word_count": 279, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b5484129-4b6f-44f1-9c0d-50d787ddb6f0", + "text": "2.1 Experimental design and the DxEvolve framework To bridge the gap between static biomedical knowledge and dynamic clinical reasoning (Fig. 1a), we developed DxEvolve to operationalize this dynamic reasoning process\nby coupling a high-fidelity investigative workflow with a mechanism for explicit experiential growth. The framework is sustained by two synergistic pillars. First, the deep\nclinical research (DCR) workflow ensures that every diagnostic step remains grounded\nin a traceable evidence base (Fig. 1b). 
Second, a self-evolution mechanism distills these\ninvestigative trajectories into diagnostic cognition primitives (DCPs), effectively transforming individual patient encounters into a library of reusable, governable clinical\nwisdom (Fig. 1c). We designed an evaluation roadmap to rigorously test this framework (Fig. 1d). First, we utilized the MIMIC-CDM benchmark [40], a curated dataset of 2,400 acute\nabdominal presentations designed specifically for stepwise diagnosis. For primary comparisons, we predefined a held-out evaluation cohort (n=400) randomly sampled from\nMIMIC-CDM and reserved all remaining non-overlapping encounters exclusively for\nDCP accrual; unless noted otherwise, all analyses involving DCP retrieval use this\nfixed accrual pool under the same split. To provide a direct anchor to human expertise, we further validated DxEvolve against another encounter split from a published\nclinician-benchmarked reader-study subset [40] (n=80) and reserved all remaining\nnon-overlapping encounters exclusively for DCP accrual in this setting. Finally, to ensure the robustness extends beyond curated environments, we conducted external validation using an independent cohort from the Chinese PLA General\nHospital (N=293). This real-world dataset, which includes diagnostic categories both\noverlapping with and absent from the primary benchmark, provides a stringent test\nof DxEvolve's generalizability across differing healthcare systems, institutional workflows, and documentation practices. 
All evaluations were conducted in accordance with\nstrict data-governance protocols, utilizing locally deployed models to ensure patient\nprivacy and institutional compliance (\"Ethics approval and governance\", Methods).", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 8, + "total_chunks": 95, + "char_count": 2238, + "word_count": 292, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fcaa2915-9cb1-4815-9388-4f3207283c1c", + "text": "2.2 DxEvolve achieves clinician-level diagnostic performance We first evaluated DxEvolve on the MIMIC-CDM evaluation cohort (n=400), where\nFig. 2a exhibited consistent diagnosis accuracy gains (P <0.001) across all base\nLLM backbones comparing with the established CDM baseline [40] (11.2% mean\naccuracy gain) and DxEvolve w/o DCP (9.1% gain). Ablating clinical guideline and\nPubMed retrieval resulted in only a modest mean accuracy decrease (0.9%), suggesting that the core gains primarily arise from workflow scaffolding and experience\nretrieval, with external retrieval providing complementary support in selected cases. Critically, as these gains were achieved using off-the-shelf backbones without weight\nupdates, the improvements reflect the efficacy of the proposed investigative workflow\nand experiential mechanisms rather than task-specific fine-tuning. To characterize the utility of DxEvolve across different clinical scenarios, we stratified encounters by investigative complexity, utilizing the evidence-acquisition volume Fig. 2 Main diagnostic performance results on MIMIC-CDM. 
a, Diagnosis accuracy on the\nMIMIC-CDM evaluation cohort (n=400), reported per pathology and as the average. For each base\nLLM (color), we compare the CDM baseline, DxEvolve without DCP retrieval (DxEvolve w/o DCP),\nand DxEvolve over multiple seeds. b, Accuracy improvement of DxEvolve over the CDM baseline\nstratified by encounter-level diagnostic burden (easy versus hard). Points show the stratum-specific\nimprovement for each base LLM; annotations indicate the improvement in each stratum and the\nbetween-stratum difference. c, Diagnosis accuracy on a reader-study subset of MIMIC-CDM (n=80). Bars report average diagnostic accuracy for CDM and DxEvolve distinguished by light and dark\nshades of the same color, together with single-pass full-information (FI) inference (hatched). Specialist\nmedical LLMs with limited action compliance are reported under FI only. The clinician reference\n(Doctors) corresponds to the published reader-study subset with full information available [40]. of the baseline model as a proxy for diagnostic burden. DxEvolve improved accuracy\nacross all strata, with the most pronounced gains concentrated in the high-burden\ngroup, representing a 40%–169% relative increase in gain magnitude over low-burden\ncounterparts (Fig. 2b). 
We next evaluate DxEvolve against human expertise using a reader-study subset\nof the MIMIC-CDM dataset [40] (n=80).", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 9, + "total_chunks": 95, + "char_count": 2469, + "word_count": 336, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c9477c3d-6803-458a-8206-a08962e7ee01", + "text": "In the original reader study, clinicians issued\nretrospective diagnoses under a full-information (FI) regime, where all evidence was\nprovided upfront. In contrast, DxEvolve operated under a significantly more stringent, workflow-aligned regime, requiring it to autonomously decide which evidence to\nacquire and when. Despite this informational disadvantage, DxEvolve attained expertlevel proficiency: paired with state-of-the-art backbones, the agent achieved 90.4%\naccuracy, surpassing the 88.8% human expert (Fig. 2c). Notably, the clinician reference comes from the published reader-study subset under FI conditions; we use it as\nan anchor for human-level performance rather than a head-to-head comparison under\nmatched information access. Intriguingly, DxEvolve surpassed the corresponding single-pass FI baselines across\nbase large language models (LLMs), including medical-domain LLMs (ClinicalCamel\nand MedGemma) evaluated under the FI regime due to their inability to comply with\ninteractive action constraints (Fig. 2c). 
This advantage is consistent with two complementary mechanisms: first, the DCR workflow provides a reasoning scaffold that\nmaintains clinical saliency and prevents the \"cue dilution\" common in long, unstructured records; and second, DCP-guided evolution sharpens uncertainty calibration,\nallowing the agent to prioritize decisive findings. In summary, these results demonstrate that DxEvolve couples workflow-aligned\nexecution with longitudinal self-evolution to reach expert-level diagnostic proficiency. By externalizing improvement through explicit clinical experiences rather than opaque\nparametric changes, the system provides an auditable pathway for achieving highfidelity diagnostic performance that is robust to the complexities of the real-world\nclinical environment. 2.3 External validation supports cross-institution portability\nof experiential gains To evaluate the external validity of DxEvolve, we conducted independent validation\non a cohort from the Chinese PLA General Hospital, representing a substantial shift\n(\"Evaluation cohorts\" in Methods). To decouple institutional variance from linguistic\nfactors, we applied the DCP repository distilled from 2,000 MIMIC-CDM encounters\nto standardized English translations of these clinical records. DxEvolve consistently\nelevated performance across all base LLMs, yielding a 10.2% mean accuracy gain\nover the CDM baseline and a 5% improvement over the DCP-free ablation (Fig. 3a). This sustained efficacy across distinct national and institutional contexts suggests that\ndistilled DCPs capture trans-institutional diagnostic heuristics rather than narrow,\ndataset-specific shortcuts tied to the originating environment. While overall accuracy on the external cohort was comparable to that on\nMIMIC-CDM, we observed notable heterogeneity across disease states. 
a DeepSeek-V3.2 Qwen3-30B Qwen3-235B GLM-4.7 CDM DxEvolve w/o DCP DxEvolve\n(%) 80", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 10, + "total_chunks": 95, + "char_count": 2933, + "word_count": 379, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e1d5758c-9ac2-47f2-a263-8fe1e33b2301", + "text": "Appendicitis Cholecystitis Pancreatitis Mean 20 Diagnostic\nLiver Abscess Urinary Tract Infection Mean Appendicitis Cholecystitis Pancreatitis Mean Fig. 3 External validation on an independent hospital cohort. a, Diagnostic accuracy on\ndiagnoses overlapping with MIMIC-CDM (appendicitis, cholecystitis and pancreatitis) and their\nmean, evaluated using standardized English translations of the structured records. b, Category-level\ntransfer on diagnoses that were never used for DCP accrual (liver abscess, urinary tract infection)\nand their mean, evaluated under the same protocol. c, Robustness to documentation with native\ninstitutional language, evaluated on the same external encounters using the original Chinese records. 
appendicitis and cholecystitis decreased, whereas performance on pancreatitis encounters improved.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 11, + "total_chunks": 95, + "char_count": 824, + "word_count": 103, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "96cb422d-ddc1-49d0-97f8-980e3f156c53", + "text": "While the source of this variance likely reflects institution-specific\nworkup pathways and documentation nuances, highlighting the necessity of evaluating clinical agents across diverse practice environments where diagnostic thresholds\nand recording standards may differ. We further probed the framework's adaptability on diagnostic categories absent\nfrom the initial repository, including liver abscess and urinary tract infection (UTI). In these out-of-distribution settings, DxEvolve yielded a 17.1% mean accuracy gain\naveraged across liver abscess and UTI cohorts over the CDM baseline and a 4.5%\nimprovement over the DCP-free ablation (Fig. 3b). Notably, while liver abscess\nshares the abdominal domain of the original benchmark, UTI represents a distinct a b\nImproved cases 90 30\nTotal cases P = 1.56 × 10 −4\n*** (%)\nP = 1.10 × 10 −5 85 (%) 25 22.6%\nrate *** P = 4.76 × 10 −5\n20 18.8% *** accuracy\n15.8%\n14.9%\n15 experience\n11.2% diagnostic 75\n10 9.1%\nIncorrect Overall 70 Qwen3-30B 5\nDeepSeek-V3\nQwen3-235B\n65 0\nQwen3 Qwen3\n30B 235B 0100200 500 1000 2000 DeepSeekV3.2\nNumber of accrued encounters Fig. 4 Exposure-dependent self-evolution and provenance of retrieved experience. 
a,\nOverall diagnosis accuracy on the fixed MIMIC-CDM evaluation cohort (n=400) as the DCP accrual\npool increases, shown for three representative base LLM backbones. Accuracy improves with additional accrual encounters and then tapers, yielding a saturating learning curve. b, Provenance of\nretrieved experience during evaluation. Bars show the fraction of retrieved DCPs whose source\naccrual episode ended in an incorrect diagnosis (\"incorrect experience rate\"), computed separately\nfor improvement cases and for all evaluation encounters pooled. P values indicate enrichment of\nincorrect-source DCPs among retrievals in improvement cases. These gains indicate that distilled DCPs encode portable, domain-agnostic heuristics that transcend specific disease labels. While the full scope of\ntransferability across heterogeneous syndromes warrants further investigation, these\nresults demonstrate the robust scalability of experience-guided evolution in previously\nunencountered clinical domains. Finally, we assessed the cross-lingual robustness of DxEvolve by evaluating its performance on original Chinese clinical records. In this practical deployment scenario,\npatient encounters were processed in their native language, while the underlying reasoning framework and the accumulated DCP repository remained in English. Despite\nthis linguistic mismatch, DxEvolve yielded an 11.9% mean accuracy gain over the\nCDM baseline and a 6.3% improvement over the DCP-free ablation (Fig. 3c). 
Notably,\nabsolute diagnostic accuracy remained comparable to that achieved using standardized English translations.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 12, + "total_chunks": 95, + "char_count": 2781, + "word_count": 390, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "aad6a29e-a34f-40d3-89b1-5b8e9d551ac6", + "text": "These observations demonstrate that the DCR framework\nand experiential heuristics within DxEvolve are language-agnostic, confirming the\nframework's viability in diverse, multilingual clinical environments. Together, these external evaluations demonstrate that DxEvolve's self-evolution\nmechanism confers substantial portability across institutional boundaries, documentation languages, and diagnostic categories. 
By externalizing clinical wisdom as\nsymbolic, governable assets, the framework provides a rigorous trajectory for maintaining high-fidelity performance amidst the inherent heterogeneity of real-world clinical\npractice.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 13, + "total_chunks": 95, + "char_count": 631, + "word_count": 69, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1c52e694-2418-4ce2-9d2e-67b77d17c162", + "text": "2.4 Self-evolution shows exposure-dependent scaling behavior\nand error-driven correction We next studied whether DxEvolve exhibits exposure-dependent improvement consistent with clinician-like development, and whether the gains can be traced to reusable\nexperience rather than incidental trajectory variation. We therefore quantified selfevolution by scaling the pool of encounters available for DCP accrual while holding\nthe evaluation cohort fixed (\"Evaluation and analysis\" in Methods). Accuracy matured longitudinally, yielding reproducible learning curves across all\nevaluation schedules (Fig. 4a), with a mean accuracy gain of 8.97% after accrual\nover the first 0–1,000 encounters and a further 0.9% gain over 1,000–2,000 encounters. While initial gains were remarkable, trajectories eventually diverged by model\ncapacity: whereas weaker backbones reached an asymptotic plateau, more capable\nmodels sustained incremental growth throughout the accrual period. 
This divergence\nsuggests that the saturation point of experience-guided evolution is governed by the\nbase LLM's reasoning capability; stronger architectures demonstrate a superior ability\nto mine experience from complex, long-tail scenarios, effectively raising the ceiling of attainable\ndiagnostic expertise. To identify which experiences drive error correction, we analyzed improvement\ncases—encounters where DxEvolve succeeded but its baseline failed. In these cases,\nretrieved DCPs were significantly enriched with experiences distilled from prior diagnostic failures compared to the general retrieval distribution (Fig. 4b). This highlights\nan error-driven dividend, where heuristics rooted in past mistakes contribute more to\nsubsequent performance gains. These results suggest that failures represent high-value\nlearning events, providing the critical corrective logic necessary to navigate complex\ndiagnostic pitfalls that successful encounters may overlook. Together, these analyses connect exposure-dependent performance gains to an\ninspectable mechanism: improvement scales with accumulated experience, and the\nexperience invoked when errors are corrected exhibits a systematic provenance structure. This motivates examining not only how the repository grows, but how the content\nof accrued DCPs matures with continued exposure. 2.5 Self-evolution is accompanied by progressive maturation\nof experience To quantify the functional maturation of the experience repository, we examined\nwhether DCPs accrued in later developmental stages exhibit superior clinical utility\nand broader applicability than early-stage heuristics. 
This progression was validated through blinded expert assessment and comprehensive retrieval-log analyses\n(\"Evaluation and analysis\" in Methods).", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 14, + "total_chunks": 95, + "char_count": 2733, + "word_count": 348, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1ece5d30-6c71-4c8f-9b5c-1d2a7d354656", + "text": "In a clinician reader study blinded to study condition, we randomly sampled 20\nDCPs from an early exposure window (encounters 1–300) and 20 from a late window (encounters 1700–2000). Two clinicians rated each DCP on clinical correctness\n(including safety concerns), actionability (guiding evidence acquisition and hypothesis\nrefinement) and generality (reusability beyond the source encounter and pathology). The robustness of the expert evaluation framework was confirmed by high inter-rater a Early (n = 20) Late (n = 20) P = 0.005 P = 0.16 P = 0.021 P = 0.007\n** n.s. 
* **", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 15, + "total_chunks": 95, + "char_count": 575, + "word_count": 93, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d9b2ea8f-6412-44ea-820c-3f86bbacbd30", + "text": "Clinical Actionability Generalizability Mean\nCorrectness Score c Total cases Improved cases 15.9\nClinical Correctness 14.8\n5 Actionability 15 13.9 13.5\nGeneralizability 12.9 12.4\nscore (%) 10 rate\n2 4 experience\nExpert Late retrieval 5\nBubble size ∝ n 3\nICC (total) = 0.81 3 4 5 Qwen3-30B Qwen3-235B DeepSeek-V3.2\nExpert 1 score Fig. 5 Maturation of accrued experience artifacts with encounter exposure. a, Blinded\nclinician ratings of diagnostic cognition primitives (DCPs) sampled from an early exposure window\n(encounters 1–300; n=20) and a late window (encounters 1700–2000; n=20). DCPs were scored for\nclinical correctness, actionability and generalizability, with the mean shown as an aggregate. Boxes\ndenote interquartile range, centre line the median, and points individual DCPs; two-sided P values\nare shown (n.s., not significant). b, Inter-rater reliability of clinician ratings for the aggregate DCP\nscore (ICC=0.81), supporting the reliability of the clinician assessment. c, Evaluation-time retrieval\nsignal for late-stage DCPs, quantified as the fraction of retrieval events that involve DCPs in the late\nencounter window. reliability for the aggregate DCP scores (intraclass correlation coefficient (ICC)=0.81;\nFig. 5b). 
Late-stage DCPs scored higher across dimensions than early-stage DCPs,\nwith mean clinician rating 4.47 vs 4.17 on a 5-point scale (Fig. 5a). Both sets often\ncontained clinically reasonable guidance, but later DCPs more consistently articulated it in reusable, action-oriented terms (for example, clearer conditional checks and\nescalation cues), whereas early DCPs more often remained context-bound, supporting\ngradual maturation with exposure. To complement clinician ratings with a usage-based signal, we analyzed evaluationtime DCP retrieval logs. Using the same early and late exposure windows, we", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 16, + "total_chunks": 95, + "char_count": 1837, + "word_count": 259, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a08730ff-74ea-4c49-b632-354c8a185170", + "text": "quantified for each DCP (i) retrieval breadth (the number of distinct evaluation encounters in which it was retrieved) and (ii) association with error-correcting episodes\n(retrieval events in encounters where DxEvolve was correct but DxEvolve w/o DCP\nwas incorrect). Retrieval log analyses confirmed that late-stage DCPs possess superior functional utility. While these artifacts maintained a baseline retrieval rate of\n12.4%–13.5% across total encounters, their prevalence increased to 13.9%–15.9%\nwithin error-correcting episodes (Fig. 5c). This enrichment was most pronounced in\nDeepSeek-V3.2. 
Taken together, clinician-blinded ratings and usage-based signals converge on a\nconsistent picture: with continued encounter exposure, DCPs become more reliably\nactionable and more broadly reusable, and their retrieval is increasingly enriched in\nerror-correcting episodes. These findings support that self-evolution involves qualitative refinement of accrued experience artifacts, rather than simply expanding the size\nof the DCP repository. 2.6 DxEvolve's evidence acquisition aligns with clinical\nworkflows and clinical guidelines In workflow-aligned diagnosis, performance depends not only on the final diagnosis but also on whether requested investigations resemble routine care. We therefore\nassessed DxEvolve's evidence-acquisition behaviour at the encounter level, measuring\nalignment with documented investigations and compatibility with common pathways\n(\"Evaluation and analysis\" in Methods). Across the MIMIC-CDM evaluation cohort, DxEvolve exhibited higher consistency\nwith recorded workups on all four trajectory-consistency measures than the standard workflow-aligned baseline (mean overall consistency across base LLMs, 0.89 and\n0.68, respectively), including physical-examination execution, laboratory-test set F1,\nimaging (modality, region) set F1 and action-order concordance. The results indicate\nmore reliable coverage of key investigation types and a workup sequence closer to the\nrecorded workflow (Fig. 6a). We further assessed workup behavior against established clinical guidelines using\na conservative, three-component compliance score that captures (i) whether physical\nexamination was performed before downstream testing, (ii) coverage of guideline-recommended laboratory categories and (iii) whether the first imaging study matched\nguideline-supported modality–region choices for each condition. 
DxEvolve achieved\nhigher overall compliance than CDM across all evaluated backbones, with distributions shifted toward higher scores and statistically significant paired differences as\nshown in Fig. 6b. Together, these analyses indicate that DxEvolve's improvements extend beyond\nend-point accuracy to more clinically compatible evidence acquisition, rather than\narising from opportunistic or idiosyncratic request patterns. This study presents DxEvolve, a self-evolving diagnostic agent that instantiates diagnosis as an interactive deep clinical research (DCR) workflow, in which clinical evidence", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 17, + "total_chunks": 95, + "char_count": 3020, + "word_count": 380, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6a91b98b-59d2-46f9-9e68-603a29c3c747", + "text": "[Fig. 6a panel residue: per-backbone (Qwen3-235B, Qwen3-30B, DeepSeek-V3.2, GLM-4.7) agreement with clinical ground truth on Physical Exam, Laboratory Tests F1, Imaging F1 and Action Order, CDM vs DxEvolve; paired P values P = 1.7×10−61, P = 3.9×10−58, P = 3.6×10−13, P = 2.2×10−17]", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", 
+ "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 18, + "total_chunks": 95, + "char_count": 384, + "word_count": 56, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d5291cdb-c7c2-48cd-8dd0-ac6dcdc60432", + "text": "[Fig. 6b axis residue omitted: overall compliance (%) axis and backbone labels Qwen3-30B, Qwen3-235B, GLM-4.7, DeepSeek-V3.2] Fig. 6 DxEvolve produces more workflow-consistent investigations and shows improved\nalignment with clinical guidelines. a, Workup consistency. Across the MIMIC-CDM evaluation\ncohort (n=400), DxEvolve shows higher agreement with the documented investigation trace than the\nstandard decision-making baseline CDM for each backbone, spanning whether a physical examination\nwas performed, overlap with recorded laboratory testing, overlap with recorded imaging (modality\nand region), and concordance of the investigation ordering. Points are model-level means; grey lines\nconnect paired results for DxEvolve versus CDM under the same backbone. b, Guideline adherence. Distributions of encounter-level guideline-compliance scores, derived from the mean adherence across\nthree dimensions: physical examination, laboratory investigations, and imaging. Violin plots show\nscore densities; embedded boxplots indicate the median and interquartile range; points mark the\nmean. P values are from paired two-sided comparisons. 
is acquired procedurally through explicit evaluation actions, with optional consultation of external medical sources.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 19, + "total_chunks": 95, + "char_count": 1197, + "word_count": 151, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "322b600a-6d19-4007-ad66-3d9457495871", + "text": "DxEvolve is designed as a governed learning system\nover encounter-level diagnostic trajectories, supporting longitudinal self-evolution by\naccruing and retrieving diagnostic cognition primitives (DCPs) as reusable experience\nartifacts. Across a public, de-identified benchmark of clinical encounters formatted for\nprocedural evidence acquisition, DxEvolve reaches clinician-comparable performance\nunder interactive diagnosis. Importantly, evaluation on an external cohort from a\nChinese tertiary hospital operating in a distinct healthcare system shows consistent\nDCP-enabled gains, supporting the portability of experience under cross-institutional settings. These findings show that workflow-aligned diagnostic agents can reach clinician-benchmarked performance while preserving auditability, reframing progress from static\nfull-record prediction to governed, evidence-tethered execution and improvement as\nclinical expertise accrues. A central contribution of DxEvolve lies in the experience-driven self-evolution\nmechanism, which renders encounter exposure an explicit learning signal within a\nworkflow-aligned diagnostic process. 
Unlike paradigms that treat each case as a static,\nfull-record input, where all documented findings are provided upfront, DxEvolve operates through procedural evidence acquisition and iterative hypothesis refinement under\nthe DCR framework. This design more closely mirrors the temporal and inferential structure of routine diagnostic workups. By generating standardized, clinically\nauditable trajectories with explicit provenance, DxEvolve learns from practice in a\nmanner analogous to human clinicians. Through this process, DCPs are accumulated\ninto a reusable experience repository and can be retrieved to steer subsequent evidence gathering and diagnostic refinement without parameter updates. When external\nmedical sources are consulted, their evidence can provide additional authoritative corroboration. Empirically, diagnostic performance improved with cumulative encounter\nexposure, yielding a reproducible, exposure-dependent scaling curve. Notably, DCPs\noriginating from prior diagnostic failures were enriched in improvement cases, suggesting an error-driven learning mechanism: unsuccessful episodes preferentially yield\ncorrective effects that reduce the likelihood of repeating similar mistakes in similar\nclinical contexts. Because DCP-based self-evolution remains non-parametric and traceable, these primitives can be inspected, curated, or even retracted as needed. 
This\noffers a practical pathway for governed, longitudinal adaptation, a capability difficult\nto achieve through conventional model training.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 20, + "total_chunks": 95, + "char_count": 2649, + "word_count": 319, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fe5cfd59-d2f8-4e87-801b-0a7a97162e3f", + "text": "External validation at the Chinese PLA General Hospital confirms that DxEvolve's\nadvantages transcend institutional boundaries, linguistic variations, and diagnostic categories. DxEvolve's sustained performance across translated and native\nChinese documentation suggests that its distilled experiences capture portable,\nworkflow-level logic rather than language-specific artifacts. Notably, the observed\ngains in diagnostic categories absent from the initial repository underscore a cross-disease generalizability essential for real-world deployment. Collectively, the DCR\nworkflow provides a portable execution substrate for stepwise evidence acquisition\nunder heterogeneous documentation, and DCP-based self-evolution supplies a reviewable mechanism for adaptation as institutions, languages and workup patterns drift. They offer a practical route to maintaining dependable diagnostic performance beyond\nthe originating benchmark. Beyond exposure-dependent performance gains, our results suggest that self-evolution is accompanied by a progressive improvement in the quality of accrued\nDCPs, echoing how clinicians' experiential knowledge can mature with seniority rather\nthan remaining isolated reflections. 
In clinician-blinded assessments, experiences accumulated later scored higher on clinical correctness, actionability and generality than\nearlier experiences, although both stages were broadly clinically reasonable. Consistent\nwith this, usage-based analyses showed that later experiences were retrieved across a\nwider range of evaluation encounters and were more often observed in error-correcting episodes under identical workflow constraints.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 21, + "total_chunks": 95, + "char_count": 1657, + "word_count": 198, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "53c059d6-8d48-418b-adc5-668263dc3638", + "text": "Together, these signals support a maturation process in which accrued experience becomes more reliably actionable and more\nbroadly reusable, rather than simply expanding in volume. In practice, the gains from\nself-evolution reflect experience refinement as well as accumulation. For workflow-aligned clinical agents, terminal diagnostic accuracy is an incomplete endpoint because the agent determines the sequence and intensity of evidence\nacquisition, with downstream implications for test utilization and imaging escalation. DxEvolve's requested investigations matched encounter-recorded workups more\nclosely than the baseline across behavioural concordance measures, and more often\nselected guideline-supported first-line imaging. Together with the accuracy gains, these\nprocess-level improvements suggest that the gains are not primarily explained by indiscriminate escalation of investigations. 
Such process alignment provides an auditable\nsubstrate for governance, enabling calibration of investigation intensity and targeted\nreview of recurrent failure patterns. Notwithstanding these advances, several limitations and corresponding priorities\nfor future work warrant consideration.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 22, + "total_chunks": 95, + "char_count": 1189, + "word_count": 146, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5f13439b-0476-4594-94c6-9da3e82449e8", + "text": "First, our experiments use de-identified EHRderived records to enable reproducible, auditable measurement of evidence acquisition\nand experience reuse; extending this framework to prospective settings will benefit\nfrom incorporating additional real-world factors, such as clinician–patient interaction. Second, we observe consistent gains when applying distilled experiences to diagnostic\ncategories beyond those represented in the initial repository, supporting portability\nacross disease settings; broader evaluations across diverse case-mix and clinical contexts will further delineate generalizability in complex practice. Third, our current\naction schema emphasizes the core diagnostic-relevant actions required for diagnosis in an interactive workup setting; the framework is naturally extensible to richer\nactions as needed for specific clinical deployments. 
These considerations motivate three\nnext steps: (i) prospective clinician-in-the-loop studies that evaluate workflow fidelity,\nefficiency and patient-relevant endpoints; (ii) expanded multi-institutional and multispecialty evaluation to characterize when and where experience-guided self-evolution\ngeneralizes; and (iii) extension of the action space to incorporate richer operational\nactions while preserving auditability and benchmarking comparability. In summary, DxEvolve links workflow-aligned diagnostic investigation with longitudinal, governed improvement through experience-driven self-evolution. By operationalizing diagnosis as procedural evidence acquisition alongside auditable experience\nconsolidation, the framework reflects two core elements of clinical expertise: systematic investigation within a patient encounter and progressive learning across a career. Consistent with this, DxEvolve reaches clinician-level performance under evaluations\nthat emulate clinically realistic diagnostic constraints, demonstrating that sophisticated diagnostic reasoning emerges when structured investigative protocols are refined\nby an ever-maturing repository of DCPs. By externalizing learning into inspectable\nartifacts rather than opaque parameter updates, DxEvolve aligns AI advancement\nwith the transparency standards essential to clinical safety. 
More broadly, our findings\nsupport governed, auditable self-evolution as a promising direction for clinical AI that\nmust remain reliable as evidence and standards of care evolve.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 23, + "total_chunks": 95, + "char_count": 2400, + "word_count": 284, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a383ec95-e40c-4bd7-981d-0c3c5cf20401", + "text": "4.1 DxEvolve framework DxEvolve is a self-evolving diagnostic agent that closes two coupled gaps observed\nin clinical AI diagnosis: a process gap between static full-information prediction and\nworkflow-aligned stepwise evidence acquisition, and a learning gap in which apparent competence does not accumulate into more reliable evidence-consistent reasoning\nunder uncertainty. DxEvolve operationalizes diagnosis as an evidence-centric deep\nclinical research workflow and the proposed self-evolution mechanism externalizes longitudinal improvement as auditable diagnostic cognition primitives, distilled from and\nreinvoked within the same diagnostic trajectories, without any parameter updates to\nthe base large language models (LLMs). At the core of each clinical encounter, DxEvolve implements a deep clinical research (DCR) framework—an agentic research protocol that treats diagnosis\nas evidence-driven investigation rather than single-pass prediction, while enforcing\nworkflow-aligned constraints on evidence acquisition. Each encounter starts from the\npresenting complaint with limited initial context, mirroring early-stage clinical uncertainty. 
The agent then iteratively plans the next information need, executes a concrete\nacquisition action, and updates an explicit encounter state that integrates newly\nrevealed findings with the evolving hypothesis set and a structured plan for subsequent\nsteps. The DCR workflow thus proceeds through repeated cycles of (i) formulating\nthe next evidence-seeking objective conditioned on the current state, (ii) acquiring\nthe selected information through tool-mediated actions, and (iii) synthesizing the new\nevidence into the state to refine hypotheses and commit to the next investigative\ndecision. The action space is aligned with routine workup operations and includes requests\nfor physical examination findings, laboratory testing results and imaging reports. Because evidence availability and recommended workup choices are often guided by\nevolving clinical guidance and best practices, relying solely on parametric model\nknowledge can be insufficient, particularly early in an encounter when patient-specific\nevidence is sparse. DxEvolve can therefore optionally invoke external medical evidence interfaces (PubMed and clinical guidelines) within the same workflow to support\nevidence-grounded decision-making and to reduce reliance on unsupported rationales. Specifically, clinical guidelines are accessed via dense retrieval through semantic\nvector-space indexing to identify contextually relevant standards, while peer-reviewed\nevidence is sourced through queries to the official PubMed search utilities. The DCR workflow can rapidly obtain long and heterogeneous text (for example,\nmulti-parameter laboratory outputs, narrative imaging reports and retrieved documents), in which weakly relevant or incidental content may dilute clinically decisive\nsignals. 
To mitigate this, DxEvolve applies context engineering by prioritizing clinically\nsalient findings and suppressing incidental content in the running context, performing\nan automatic summarization step that extracts and carries forward diagnostically relevant information when needed. This mechanism preserves continuity of the diagnostic\ntrajectory while maintaining a stable, high-signal representation to inform subsequent steps. Importantly, the DCR-generated diagnostic trajectories can drive longitudinal learning with real encounter-derived workups and outcomes rather than by\nabstract, simulator-specific feedback.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 24, + "total_chunks": 95, + "char_count": 3528, + "word_count": 448, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "01863dcf-15a5-4efc-aa13-273fff32d704", + "text": "The central innovation of DxEvolve is the longitudinal self-evolution mechanism\nthat enables progressive improvement with clinical exposure by accumulating and\nreusing experience from prior episodes, without any parameter updates to the underlying base LLM. This design is motivated by clinician cognition: expertise is not only the\nrecall of medical facts, but the ability to recognize recurring clinical patterns, anticipate high-yield investigations and apply context-appropriate decision rules shaped by\nprior successes and failures. This design externalizes learning into accountable experience artifacts that clinicians can audit, revise or remove, rather than relying on latent\nbehavioural drift. 
After each completed diagnostic episode in the accumulation pool, DxEvolve performs a structured post-hoc consolidation step over the trajectory and distills a\ndiagnostic cognition primitive (DCP) optimized for reuse under uncertainty. Each\nDCP contains three components: experience pattern, test-ordering experience, and\ndiagnostic decision experience. The experience pattern provides a high-salience signature for retrieval, summarizing the presentation and discriminative cues at a level\nintended to generalize beyond the originating patient. The test-ordering experience\nencodes actionable workup guidance for the stepwise setting, including high-yield next-step evaluations, contingency options when findings are equivocal and safety-oriented\nguardrails that reduce common omissions or inappropriate escalation. The diagnostic decision experience captures evidence-linked implications for hypothesis refinement\nand final decision-making, including discriminative patterns that support or refute\nleading hypotheses, red-flag checks, and corrective lessons when the source trajectory\nexposed an error mode.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 25, + "total_chunks": 95, + "char_count": 1812, + "word_count": 229, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "aa889cb9-70ef-4ce8-bb62-ef23a07aa254", + "text": "DCPs are written as portable guidance rather than narrative\nrationales. 
To support mechanistic analyses and traceable governance, each DCP is stored\nwith lightweight provenance metadata for in-depth analysis, including its exposure\nindex, diagnostic category and whether the source episode produced a correct primary\ndiagnosis. This provenance enables analyses of how DCP sources relate to subsequent\nperformance gains and error correction. During diagnosis on encounters, DxEvolve treats the DCP repository as a growing long-term memory. At the step of deciding to retrieve prior experience, the agent\nderives a retrieval query from its current evidence-grounded state and retrieves a small\nset of candidate DCPs whose experience patterns best match the current presentation. Retrieved DCPs are injected as a bounded context and applied as conditional\nguidance: they may steer evidence seeking, highlight discriminative cues to verify or\nprovide evidence-linked guidance for final diagnostic commitment. To mitigate spurious memory-driven bias, DxEvolve is instructed to use a DCP only when it is\ncompatible with the patient-specific evidence acquired so far and to disregard DCP\nguidance that is irrelevant to the observed findings. 
By combining workflow-aligned trajectories with structured DCP consolidation\nand evidence-compatible reuse, DxEvolve provides an accountable pathway for exposure-dependent improvement while preserving transparency and avoiding finetuning-induced shifts in base-model behaviour.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 26, + "total_chunks": 95, + "char_count": 1510, + "word_count": 205, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "22d37b13-d9f5-4d5d-99f1-d67f0ae08584", + "text": "Diagnostic reasoning trajectories and\nDCP examples are shown in Supplementary Section C and D. Benchmark experiments used MIMIC-CDM [40], a clinical decision-making benchmark curated from MIMIC-IV [41]. MIMIC-IV is a large, de-identified electronic\nhealth record resource sourced from routine clinical care at Beth Israel Deaconess\nMedical Center (Boston, MA, USA), including longitudinal structured variables, laboratory measurements and linked clinical documentation [41]. 
MIMIC-CDM inherits\nthis real-world provenance and comprises 2,400 de-identified patient presentations of\nacute abdominal pain spanning four diagnostic categories (appendicitis, cholecystitis, diverticulitis and pancreatitis), formatted for workflow-aligned diagnosis in which\nadditional evidence (such as physical examination findings, laboratory results and\nimaging reports) is revealed only when explicitly requested through the corresponding\naction [40].", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 27, + "total_chunks": 95, + "char_count": 932, + "word_count": 114, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "702bcc8c-95e4-4568-b42d-3afba1de2aa9", + "text": "To prevent label leakage, agent-facing inputs excluded any diagnosis fields or labelbearing metadata. Evidence items were provided as structured text fields in the dataset\nrelease, with field boundaries preserved to avoid inadvertent information disclosure\nthrough formatting, concatenation or re-ordering. 
When multiple items of the same\nevidence type were available, they were retained in their original record order and\nwere exposed only after the agent issued the matching request action.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 28, + "total_chunks": 95, + "char_count": 492, + "word_count": 69, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "99213c4e-beda-47e5-b1c3-7379936e5cd3", + "text": "4.3 Evaluation cohorts Across all experiments, we enforced strict non-overlap between encounters used\nfor longitudinal experience accumulation (i.e., construction of the diagnostic cognition primitive repository, DCP) and those used for evaluation, implemented at the\nencounter level using unique identifiers. For primary comparisons under the deep\nclinical research (DCR) workflow, we predefined a held-out MIMIC-CDM evaluation\ncohort of 400 encounters and kept it fixed across base models, ablations and random\nseeds; all remaining non-overlapping MIMIC-CDM encounters were used exclusively\nfor DCP accrual. To contextualize against published clinician benchmarking, we additionally evaluated on the reader-study subset from Hager et al. (80 encounters; 20 per pathology) [40], which was treated as an independent evaluation cohort and strictly excluded\nfrom DCP accrual. On this subset, we report both workflow-aligned evaluation and\nsingle-pass full-information (FI) inference using identical underlying encounter content, differing only in the information-availability interface (complete record provided\nupfront for FI, with evidence-request actions disabled). 
For external validation, we assembled an independent cohort of de-identified\nencounters (2020–2024) from the Chinese PLA General Hospital (N=293) curated\nwith a standardized record structure, including appendicitis (n=30), cholecystitis\n(n=39) and pancreatitis (n=174), which match diagnostic categories in MIMIC-CDM, as well as liver abscess (n=39) and urinary tract infection (n=11). This composition reflects the natural prevalence and clinical distribution of these conditions within the institution's stream, preserving the ecological validity of the dataset\nand ensuring that the evaluation mirrors the diagnostic challenges encountered in\nunconstrained real-world practice.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 29, + "total_chunks": 95, + "char_count": 1841, + "word_count": 238, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "db6a89c6-d9d8-49de-beab-4a8aa27f672d", + "text": "All external encounters were used exclusively for\nout-of-distribution testing and were never used for DCP accrual. For external-cohort\nexperiments, the DCP repository was built solely from the MIMIC-CDM accrual pool\nusing the same base LLM as in the corresponding evaluation. Records were harmonized to follow the MIMIC-CDM task format, preserving the\ninitial presenting complaint and a pool of candidate evidence items retrievable through\nexplicit requests. Imaging evidence followed the MIMIC-CDM convention by providing\nonly the final narrative report text. 
Owing to source-format constraints, laboratory\ntesting was returned as a consolidated results field, analogous to physical examination\nreturns. To enable controlled cross-institutional evaluation with English-prompted base\nmodels, we produced standardized English translations of the structured records using\nan offline, locally run translation tool with human verification. Translation was performed at the field level to preserve section boundaries and avoid reordering or merging\nacross fields; numerical values, units and unambiguous medical abbreviations were\nretained. For cross-language robustness, we additionally evaluated DxEvolve on the original\nChinese structured records under the same workflow and action schema. In this setting,\nonly the patient-specific encounter content was in Chinese, whereas prompts and the\nDCP repository remained in English.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 30, + "total_chunks": 95, + "char_count": 1424, + "word_count": 192, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b8fa1575-b118-4b93-8584-b68f78f4806c", + "text": "4.4 Ethics approval and governance MIMIC-IV and the derived MIMIC-CDM cohort contain de-identified patient data\nand were accessed via PhysioNet under the required credentialing and data-use agreements, in accordance with the dataset governance policies [40, 41]. 
All analyses were\nconducted on de-identified data, and no directly identifiable information was used for\nmodel evaluation, reporting or dissemination.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 31, + "total_chunks": 95, + "char_count": 413, + "word_count": 56, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5b3d1bc0-dd4e-45a0-aa9c-396f88224408", + "text": "The external institution cohort from the Chinese PLA General Hospital comprised\nretrospectively collected encounters and was de-identified prior to analysis under institutional policies. Use of these records for this study was reviewed and approved by\nthe hospital's institutional ethics committee of the Chinese PLA General Hospital\n(Approval No. S2020-418-01), with a waiver of informed consent where applicable\nunder the approved protocol. 
Data access was authorized through institutional governance procedures, and all processing and analyses were performed by authorized\nstudy personnel within institutionally approved computing environments.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 32, + "total_chunks": 95, + "char_count": 647, + "word_count": 85, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e45636b2-78c5-4ebe-85e3-3d8620832e38", + "text": "4.5 Models and implementation DxEvolve was implemented as an LLM-orchestrated agent operating in a workflowaligned diagnostic environment with a constrained action schema, standardized tool\ninterfaces and explicit termination criteria. Across all experiments, we used offthe-shelf, open-weight base LLMs. 
Model inference was conducted locally to satisfy", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 33, + "total_chunks": 95, + "char_count": 353, + "word_count": 45, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b8a2d475-40d7-4d1a-9e11-77d853c8ade7", + "text": "data-governance requirements for both the MIMIC-derived benchmark and the external hospital cohort, which preclude transmitting patient-level content to third-party\nhosted LLM services or external APIs. Base LLMs and inference settings. Unless otherwise stated, all experiments in this study applied Qwen3-30B (Qwen3-30B-A3B-Instruct), Qwen3-235B\n(Qwen3-235B-A22B-Instruct-2507) [42], DeepSeek-V3.2 [43] and GLM-4.7 [44]\nas backbones. To contextualize DxEvolve against domain-specific models, we\nalso evaluated MedGemma [45] (medgemma-27b-text-it) and ClinicalCamel [46]\n(ClinicalCamel-70B). During preliminary testing, these medical-domain LLMs\ndemonstrated insufficient compliance with the structured action-calling protocol\nrequired for workflow-aligned evaluation; specifically, they frequently failed to adhere\nto the pre-specified JSON output format or generated invalid investigative actions. Consequently, these models were evaluated exclusively under the single-pass fullinformation regime. All experiments were run on a local server equipped with NVIDIA\nA100 GPUs (80 GB), without using external hosted services. 
Within each base model,\ndecoding configurations were held fixed across all compared methods and ablations to\nensure that differences reflect workflow and experience mechanisms rather than sampling settings.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 34, + "total_chunks": 95, + "char_count": 1329, + "word_count": 161, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6110b00c-4204-42c8-bb5d-a2ee43b4670d", + "text": "For all evaluated LLMs, we set temperature to 0.1, top-p to 0.7 and\ntop-k to 50. Prompt specification. All workflow-aligned experiments used a single, shared\nprompt contract that defines the action space and semantics, tool-call formatting, the\nagent state representation and the termination criteria. The same prompt template\nwas applied across all evaluated base models without model-specific adapters or taskconditional modifications, ensuring that comparisons differ only in the underlying\nmodel and the enabled system components. Full prompt templates are provided in the\nSupplementary Section A and B. DxEvolve uses a unified dense retrieval stack for both (i)\nexperience retrieval from the DCP repository and (ii) retrieval of external clinical guidelines when enabled. For both retrieval pathways, queries and candidate\ndocuments were embedded using bge-large-en-v1.5 [47] as dense encoder with\nvector-based similarity search (FAISS [48]). 
Similarity was computed by cosine\nsimilarity between ℓ2-normalized embeddings.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 35, + "total_chunks": 95, + "char_count": 1026, + "word_count": 143, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d4072bab-da48-44a4-9de8-22c43c4a7955", + "text": "Retrieval was performed locally for\nreproducibility and, for sensitive cohorts, to avoid external transfer of patient information. We collected abdominal-condition guideline documents from authoritative\nclinical sources (for example, the American College of Gastroenterology, the World\nSociety of Emergency Surgery and Mayo Clinic) and manually verified relevance,\nauthority and recency, excluding outdated materials and ultimately retained 35 guidelines. The guidelines were converted to structured text, lightly cleaned (for example,\nremoving acknowledgements) before being locally indexed for retrieval. 
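The retrieval computation described above (cosine similarity between ℓ2-normalized embeddings, served by inner-product search as in FAISS) can be sketched as follows. This is a minimal NumPy illustration with toy 4-dimensional vectors, not the production bge-large-en-v1.5/FAISS stack, and the function names are ours:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    # Normalize rows to unit length so the inner product equals cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the top-k documents by cosine similarity."""
    q = l2_normalize(query_vec[None, :])[0]
    d = l2_normalize(doc_vecs)
    scores = d @ q                      # inner product == cosine after normalization
    return np.argsort(-scores)[:k].tolist()

# Toy "embeddings": document 0 points almost exactly along the query direction.
docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0]])
query = np.array([1.0, 0.05, 0.0, 0.0])
print(retrieve(query, docs, k=2))  # documents 0 and 1 rank highest
```

In a FAISS deployment the same effect is obtained by normalizing vectors before adding them to an inner-product index.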
PubMed\nretrieval was implemented via the official NCBI Entrez (E-utilities) API, with queries\nrestricted to de-identified, non-patient-specific medical terms (for example, disease and\nsymptom keywords) and containing no patient-level records or identifiable information.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 36, + "total_chunks": 95, + "char_count": 877, + "word_count": 109, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "158dd197-9f9f-43a6-b083-978b13d3535d", + "text": "Baseline details and implementation parity. We use two complementary reference points: a published workflow-aligned baseline (CDM [40]) and an in-framework\nablation (DxEvolve w/o DCP) that isolates the marginal contribution of DCR and self-evolution mechanism.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 37, + "total_chunks": 95, + "char_count": 260, + "word_count": 34, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "325f2549-69c0-41dd-9db2-64240aefb4e4", + "text": "CDM is an established clinical decision-making diagnostic baseline capable of stepwise inquiry but lacking both a specialized investigative\narchitecture for evidence acquisition and a framework for experiential 
evolution. Our evaluation strategy prioritizes head-to-head, backbone-matched ablations within a unified architectural framework, an approach designed to isolate the specific contributions of workflow grounding and experiential reuse. Direct comparisons with general-purpose agent frameworks are confounded by fundamental disparities in their underlying diagnostic paradigms. For instance, most existing models focus on exam-centric reasoning such as USMLE-style scenarios, or are optimized for patient-physician dialogues. These settings diverge significantly from the sequential, uncertainty-laden investigation inherent to real-world clinical workups, where evidence is latent and must be actively requisitioned. To preserve domain fidelity, DxEvolve is intentionally architected to mirror the structured rigor of actual bedside practice. Such divergent information-access constraints and interaction modes make evaluation parity non-trivial; benchmarking against a standardized, workflow-aligned baseline and its corresponding ablations therefore ensures that observed gains are strictly attributable to our architectural innovations rather than artifacts of mismatched task definitions. 4.6 Evaluation and analysis This section defines the evaluation protocol and analysis definitions used throughout the study. We report encounter-level diagnosis accuracy under the DCR workflow, complemented by regime comparisons against single-pass full-information (FI) inference, exposure-indexed self-evolution analyses based on DCP accrual, and process-level metrics that characterize evidence-acquisition behaviour. All analyses were conducted on held-out evaluation cohorts with prespecified encounter-level definitions. Episodes, regimes and primary endpoint. Each diagnostic episode starts from the presenting complaint and limited initial context. 
The agent iteratively issues\nactions to request additional evidence and receives results only for requested items. Episodes terminate when the agent outputs a final primary diagnosis or reaches\na prespecified maximum number of 20 interaction steps.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 38, + "total_chunks": 95, + "char_count": 2365, + "word_count": 291, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9d96e0d1-091f-419c-96a3-20e202e9cff7", + "text": "The primary endpoint is\nencounter-level correctness of the final primary diagnosis; episodes that terminate\nwithout a valid diagnosis output are scored as incorrect.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 39, + "total_chunks": 95, + "char_count": 165, + "word_count": 23, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "85c595df-d3df-452d-a255-3c8cb12886c4", + "text": "We report two regimes that\ndiffer only in information availability and interaction constraints. In the interactive\nregime, the agent must explicitly request evidence and may condition decisions only\non evidence acquired within the episode. 
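The interactive episode protocol above (stepwise evidence requests, termination on a final diagnosis or at the 20-step cap, invalid terminations scored as incorrect) can be sketched as follows. The `ToyAgent`/`ToyEnv` stand-ins and the dict-based action shape are our illustrative assumptions; the actual system uses a JSON action schema and an LLM policy:

```python
MAX_STEPS = 20  # prespecified interaction-step cap from the protocol

def run_episode(agent, env):
    """One diagnostic episode: request evidence step by step; return the
    final primary diagnosis, or None if the step budget is exhausted."""
    state = env.presenting_complaint()
    for _ in range(MAX_STEPS):
        action = agent.act(state)
        if "diagnose" in action:
            return action["diagnose"]
        # Results are returned only for explicitly requested items.
        state = state + [env.answer(action["request"])]
    return None

def score(pred, truth):
    # Episodes ending without a valid diagnosis are scored as incorrect.
    return int(pred is not None and pred == truth)

class ToyEnv:
    def presenting_complaint(self):
        return ["right upper quadrant abdominal pain"]
    def answer(self, item):
        return {"ultrasound": "gallstones"}.get(item, "unavailable")

class ToyAgent:
    def act(self, state):
        if "gallstones" in state:
            return {"diagnose": "cholecystitis"}
        return {"request": "ultrasound"}

pred = run_episode(ToyAgent(), ToyEnv())
print(score(pred, "cholecystitis"))
```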
In single-pass full-information (FI) inference, the model receives the complete record upfront and produces a single-step diagnosis. Single-pass FI inference was evaluated only on the reader-study subset (n=80) as a matched control. Investigative burden and stratification. To analyze the efficacy of DxEvolve across varying levels of diagnostic difficulty, we defined an investigative complexity proxy derived from the baseline diagnostic burden. For each encounter, complexity was quantified as the evidence-acquisition footprint, defined as the total number of investigative steps required by the baseline CDM model to reach termination. Encounters were stratified into "high-burden" and "low-burden" groups based on a median split of this footprint across the 400-case evaluation cohort. This stratification allowed us to assess whether experience-guided evolution provides differential benefits in cases requiring extensive iterative reasoning versus more straightforward clinical presentations. Longitudinal self-evolution and improvement-case provenance. To quantify exposure-dependent self-evolution, we varied the number of encounters available for DCP accrual while holding the evaluation cohort fixed (n=400). Accrual encounters were ordered deterministically, and DCP repositories were constructed in a nested manner: at exposure level k, the repository contains DCPs consolidated from the first k accrual encounters. This design yields an exposure-indexed learning curve without repeated re-sampling. 
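Both analysis constructions described above (the median-split burden stratification and the nested, exposure-indexed DCP repositories) are simple to state in code. The sketch below uses hypothetical encounter identifiers and toy counts, and assumes that encounters exactly at the median fall into the low-burden group, a tie-breaking detail the text does not specify:

```python
from statistics import median

def stratify_by_burden(footprints):
    """Median split of evidence-acquisition footprints (baseline steps to
    termination) into 'high-burden' vs 'low-burden' encounters."""
    cut = median(footprints.values())
    # Assumption: encounters exactly at the median count as low-burden.
    return {eid: ("high-burden" if steps > cut else "low-burden")
            for eid, steps in footprints.items()}

def nested_repositories(accrual_dcps, exposures):
    """At exposure level k, the repository holds the DCPs consolidated from
    the first k accrual encounters, so repositories are nested by design."""
    return {k: [d for enc in accrual_dcps[:k] for d in enc] for k in exposures}

labels = stratify_by_burden({"e1": 3, "e2": 9, "e3": 5, "e4": 12})
# Each inner list: DCPs consolidated from one accrual encounter, in fixed order.
repos = nested_repositories([["p1"], ["p2", "p3"], [], ["p4"]], [1, 2, 4])
print(labels["e4"], repos[4])
```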
The DCP-free ablation (DxEvolve w/o DCP) is exposure-independent by construction and was evaluated under the same interactive constraints as a reference.",
    "paper_id": "2603.10677",
    "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research",
    "authors": [
      "Ruiyang Ren",
      "Yuhao Wang",
      "Yunsen Liang",
      "Lan Luo",
      "Jing Liu",
      "Haifeng Wang",
      "Cong Feng",
      "Yinan Zhang",
      "Chunyan Miao",
      "Ji-Rong Wen",
      "Wayne Xin Zhao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10677v1",
    "chunk_index": 40,
    "total_chunks": 95,
    "char_count": 1906,
    "word_count": 251,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c0216905-6437-493a-84c8-f37f8dd6d7d3",
    "text": "To isolate evaluation encounters in which DCP reuse plausibly contributes to error\ncorrection, we defined improvement cases as evaluation encounters satisfying all of the\nfollowing criteria: (i) DxEvolve produced a correct primary diagnosis, (ii) DxEvolve\nw/o DCP produced an incorrect diagnosis under the same workflow constraints, and\n(iii) DxEvolve retrieved at least one DCP during the episode. For provenance analyses,\neach retrieved DCP was labeled by the outcome of its source accrual episode at the time\nof consolidation (correct versus incorrect primary diagnosis). We quantified provenance\nenrichment by comparing the distribution of source-episode outcomes among DCPs\nretrieved in improvement cases against the corresponding distribution among DCPs\nretrieved across the full evaluation cohort (that is, pooling retrieval events over all\nevaluation encounters). 
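The provenance-enrichment comparison above reduces to contrasting two outcome distributions over retrieval events. A small sketch with made-up outcome labels (the variable names and counts are ours, for illustration only):

```python
from collections import Counter

def outcome_fractions(retrieval_outcomes):
    """Fraction of retrieved DCPs whose source accrual episode ended in a
    correct vs incorrect primary diagnosis, pooled over retrieval events."""
    counts = Counter(retrieval_outcomes)
    total = sum(counts.values())
    return {k: counts[k] / total for k in ("correct", "incorrect")}

# Hypothetical source-episode outcome labels for each retrieval event.
improvement_cases = ["correct", "correct", "correct", "incorrect"]
full_cohort = ["correct", "incorrect", "correct", "incorrect"]
enrichment = (outcome_fractions(improvement_cases)["correct"]
              - outcome_fractions(full_cohort)["correct"])
print(enrichment)  # positive => improvement cases draw more on correct sources
```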
Unless otherwise stated, provenance analyses were performed\nusing the fixed accrual pool defined by the non-overlapping MIMIC-CDM split.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 41, + "total_chunks": 95, + "char_count": 1008, + "word_count": 139, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "acd89bc0-ac6d-4089-9ec4-e4d53fcbb2af", + "text": "Clinician assessment of DCP clinical maturation. To assess whether DCPs\nconsolidated later in exposure are more clinically useful and reusable, we conducted a\nclinician reader study contrasting an early exposure window (encounters 1–300) and a\nlate exposure window (encounters 1700–2000). For this assessment, we recruited two\nboard-certified internal medicine physicians, one from the Chinese PLA General Hospital, China (with 15 years of clinical experience), one from the Peking University\nThird Hospital, China (with 8 years of clinical experience). Clinicians were masked\nto the exposure window of each DCP and the study hypothesis. 
From each window,\nwe randomly sampled 20 DCPs (40 total).", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 42, + "total_chunks": 95, + "char_count": 695, + "word_count": 102, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ad5d7ac2-2339-45d4-9a0b-f820a144c469", + "text": "Each DCP was presented in its native\nthree-part format (experience pattern, test-ordering experience and diagnostic decision experience) with all provenance metadata removed (including exposure index,\nsource outcome and pathology labels) and translated to Chinese via a standardized\ntranslation procedure followed by terminology checks. Two board-certified clinicians\nindependently rated each DCP on a 1–5 ordinal scale across three prespecified dimensions: clinical correctness (including potential safety concerns), actionability (capacity\nto guide evidence acquisition and hypothesis refinement in an interactive workflow)\nand generality (reusability beyond the originating encounter and pathology). Rating\norder was randomized and raters were blinded to sampling window and DCP source. Inter-rater agreement for the clinician ratings was assessed using ordinal-appropriate reliability metrics (quadratic-weighted Cohen's κ and intraclass correlation). Agreement for the aggregate DCP score (mean across the three dimensions) was high\n(weighted κ=0.83, ICC= 0.81), supporting the reliability of the clinician assessment\nfor downstream analyses. 
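The quadratic-weighted Cohen's κ reported above can be computed with the standard formulation for two raters on a 1–5 ordinal scale. This is our illustrative implementation, not the study's analysis code:

```python
import numpy as np

def quadratic_weighted_kappa(a, b, k=5):
    """Quadratic-weighted Cohen's kappa for two raters on a 1..k scale."""
    a = np.asarray(a) - 1
    b = np.asarray(b) - 1
    O = np.zeros((k, k))
    for i, j in zip(a, b):          # observed joint rating distribution
        O[i, j] += 1
    O /= O.sum()
    E = np.outer(O.sum(axis=1), O.sum(axis=0))   # expected under independence
    ii, jj = np.meshgrid(np.arange(k), np.arange(k), indexing="ij")
    W = (ii - jj) ** 2 / (k - 1) ** 2            # quadratic disagreement weights
    return 1.0 - (W * O).sum() / (W * E).sum()

print(quadratic_weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # perfect agreement -> 1.0
```

With perfect agreement the observed disagreement term vanishes, giving κ = 1; maximally reversed ratings give κ = -1.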
For analysis and visualization, ratings were aggregated by\naveraging the two clinicians' scores for each dimension and for the aggregate score.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 43, + "total_chunks": 95, + "char_count": 1291, + "word_count": 167, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "550d723e-ebd0-4b0c-a07f-1309d72a2581", + "text": "Process-level behaviour. We assessed evidence-acquisition behaviour by comparing the investigations requested by each method (DxEvolve and the CDM baseline)\nwith those documented in the MIMIC-CDM structured record for the same encounters\n(n=400). All metrics were averaged across encounters.\n• Trajectory consistency. We quantified workup consistency using four complementary measures. (i) Physical examination (PE) agreement was a binary indicator\nof whether the agent requested a physical examination at any point in the episode (1\nif requested, 0 otherwise). (ii) Laboratory-set F1 compared the set of laboratory tests\nordered by the agent with the set recorded in MIMIC-CDM using a set-level F1 score. Before scoring, laboratory item identifiers were canonicalized using a precomputed\nmapping that collapses equivalent codes to a canonical identifier, reducing artefactual disagreement due to coding variations. Precision reflects avoidance of unnecessary\ntests, whereas recall reflects coverage of recorded tests. (iii) Imaging-set F1 was computed analogously, but over sets of (modality, region) tuples extracted from imaging\nrequests, and a match required agreement on both modality and region. 
(iv) Action-order concordance evaluated whether the relative ordering of broad investigation types followed the reference clinical ordering. We restricted comparison to the intersection of investigation types executed by both the agent and the record; if fewer than two types were present, concordance was defined as 1. Otherwise, we computed pairwise concordance as the fraction of ordered pairs (a, b) consistent with the reference order that were also ordered as a before b in the agent's episode.\n• Clinical guideline adherence proxies. We additionally scored adherence to guideline-informed workup expectations using rules-based proxies with three components, reported on a 0–100 scale and averaged to form an overall score. (i) PE timing score captured whether PE was performed as the first workup step (100), performed later (50) or not performed (0). (ii) Laboratory adherence score measured coverage of pathology-specific recommended laboratory categories with a two-tier weighting scheme: primary tests contributed weight 1.0 each, secondary tests contributed weight 0.5 each with the total secondary contribution capped by the primary maximum to prevent inflation by extensive secondary testing; scores were normalized by the maximum attainable weight for the pathology. (iii) Imaging adherence score evaluated only the first imaging study, scoring whether its modality and region matched a pathology-specific preferred option (100), an acceptable alternative (50) or otherwise (0), including missing imaging. 
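The set-level F1 and pairwise order-concordance measures above can be sketched as follows; the test and investigation-type names are illustrative, not the canonical item vocabulary:

```python
def set_f1(pred, ref):
    """Set-level F1 between the agent's ordered items and the recorded set."""
    if not pred and not ref:
        return 1.0
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0  # penalizes unnecessary tests
    recall = tp / len(ref) if ref else 0.0       # rewards coverage of recorded tests
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def order_concordance(agent_order, reference_order):
    """Fraction of reference-ordered pairs (a, b) that the agent also ordered
    a before b, restricted to types present in both sequences; defined as 1
    when fewer than two shared types exist."""
    shared = [t for t in reference_order if t in agent_order]
    if len(shared) < 2:
        return 1.0
    pos = {t: agent_order.index(t) for t in shared}
    pairs = [(a, b) for i, a in enumerate(shared) for b in shared[i + 1:]]
    return sum(pos[a] < pos[b] for a, b in pairs) / len(pairs)

print(set_f1({"lipase", "cbc"}, {"lipase", "lft"}))   # one shared test of two each
print(order_concordance(["pe", "labs", "imaging"], ["pe", "labs", "imaging"]))
```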
Guideline categories and imaging preferences were\nderived from established society guidelines (WSES [49–51] for appendicitis, diverticulitis and pancreatitis; Tokyo Guidelines [52] for cholecystitis), and this analysis was\nintended as a conservative, descriptive check for gross deviations rather than a claim\nof a single optimal workup for all clinical contexts.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 44, + "total_chunks": 95, + "char_count": 3085, + "word_count": 433, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0ec3bd54-d062-478a-8d1e-79a61d205f36", + "text": "The MIMIC-IV dataset is available via PhysioNet subject to completion of the required\ndata-access training and a data use agreement. The MIMIC-CDM benchmark used\nin this study is derived from MIMIC-IV and is available from the original release at\nhttps://physionet.org/content/mimic-iv-ext-cdm under the same terms. After obtaining access to MIMIC-CDM, the data preprocessing and cohort-splitting scripts used\nin this study (to reproduce the non-overlapping accrual and evaluation partitions)\nare available at https://github.com/RUCAIBox/DxEvolve. The external cohort from\nthe Chinese PLA General Hospital is not publicly available due to institutional datagovernance requirements. Access to the minimum dataset necessary to reproduce\nthe external-cohort analyses may be considered for qualified researchers, subject to\napproval by the hospital's data governance procedures and execution of an appropriate\ndata-use agreement; requests should be directed to the corresponding authors. 
The code for DxEvolve is available at https://github.com/RUCAIBox/DxEvolve. All prompts used in DxEvolve are included in the Supplementary Information. [1] L., Franklin, N. & Gordon, R.",
    "paper_id": "2603.10677",
    "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research",
    "authors": [
      "Ruiyang Ren",
      "Yuhao Wang",
      "Yunsen Liang",
      "Lan Luo",
      "Jing Liu",
      "Haifeng Wang",
      "Cong Feng",
      "Yinan Zhang",
      "Chunyan Miao",
      "Ji-Rong Wen",
      "Wayne Xin Zhao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10677v1",
    "chunk_index": 45,
    "total_chunks": 95,
    "char_count": 1165,
    "word_count": 155,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "580478e4-0164-4f64-9235-8700a3fcc901",
    "text": "Diagnostic error in internal medicine. Archives of Internal Medicine 165, 1493–1499 (2005). [2] Singh, H. & Sittig, D.",
    "paper_id": "2603.10677",
    "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research",
    "authors": [
      "Ruiyang Ren",
      "Yuhao Wang",
      "Yunsen Liang",
      "Lan Luo",
      "Jing Liu",
      "Haifeng Wang",
      "Cong Feng",
      "Yinan Zhang",
      "Chunyan Miao",
      "Ji-Rong Wen",
      "Wayne Xin Zhao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10677v1",
    "chunk_index": 46,
    "total_chunks": 95,
    "char_count": 118,
    "word_count": 18,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "82b198ab-7aae-4c7a-84eb-bce236790c97",
    "text": "Advancing the science of measurement of diagnostic errors in healthcare: the Safer Dx framework. BMJ Quality & Safety 24, 103–110 (2015). [3] Singh, H., Meyer, A. The frequency of diagnostic errors in outpatient care: estimations from three large observational studies involving US adult populations. 
BMJ quality & safety 23, 727–731 (2014).", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 47, + "total_chunks": 95, + "char_count": 341, + "word_count": 51, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2636151e-d366-4be7-adf3-bda6888f872c", + "text": "The causes of errors in clinical reasoning: cognitive biases,\nknowledge deficits, and dual process thinking. Academic Medicine 92, 23–30\n(2017). Adverse diagnostic events in hospitalised patients: a singlecentre, retrospective cohort study. BMJ Quality & Safety 34, 377–388 (2025). Improving Diagnosis in Health Care\n(National Academies Press, 2016). [7] Schwartzstein, R. Critical thinking for 21st-century\nmedicine—moving beyond illness scripts. JAMA 334, 1509–1510 (2025). [8] Mahajan, A., Obermeyer, Z., Daneshjou, R., Lester, J. & Powell, D. Cognitive\nbias in clinical large language models. npj Digital Medicine 8, 428 (2025). [9] Ferber, D. 
et al.",
    "paper_id": "2603.10677",
    "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research",
    "authors": [
      "Ruiyang Ren",
      "Yuhao Wang",
      "Yunsen Liang",
      "Lan Luo",
      "Jing Liu",
      "Haifeng Wang",
      "Cong Feng",
      "Yinan Zhang",
      "Chunyan Miao",
      "Ji-Rong Wen",
      "Wayne Xin Zhao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10677v1",
    "chunk_index": 48,
    "total_chunks": 95,
    "char_count": 654,
    "word_count": 92,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ea0c4d08-c860-43f5-9177-6c4a60bd405d",
    "text": "Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology. Nature Cancer 1–13 (2025). [10] Nenadic, I. et al. Physicians as context engineers in the era of generative AI. Nature Medicine (2026). URL https://doi.org/10.1038/s41591-026-04215-x. [11] Singhal, K. et al. Large language models encode clinical knowledge.",
    "paper_id": "2603.10677",
    "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research",
    "authors": [
      "Ruiyang Ren",
      "Yuhao Wang",
      "Yunsen Liang",
      "Lan Luo",
      "Jing Liu",
      "Haifeng Wang",
      "Cong Feng",
      "Yinan Zhang",
      "Chunyan Miao",
      "Ji-Rong Wen",
      "Wayne Xin Zhao"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10677v1",
    "chunk_index": 49,
    "total_chunks": 95,
    "char_count": 372,
    "word_count": 49,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5819c483-058a-496d-ae40-24f237adbd85",
    "text": "Nature 620, 172–180 (2023). [12] Achiam, J. et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023). [13] V., Möller, S. & Ryg, J. Use of GPT-4 to diagnose complex clinical cases (2024). [14] Savage, T., Nayak, A., Gallo, R., Rangan, E. 
& Chen, J.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 50, + "total_chunks": 95, + "char_count": 257, + "word_count": 43, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e4a44d4f-23c8-4791-9106-79edcb3c6218", + "text": "Diagnostic reasoning\nprompts reveal the potential for large language model interpretability in medicine. NPJ Digital Medicine 7, 20 (2024). Quantifying the reasoning abilities of llms on clinical cases. Nature\nCommunications 16, 9799 (2025).", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 51, + "total_chunks": 95, + "char_count": 241, + "word_count": 33, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "bd9b2cc3-1b52-43d1-9eaf-5975894f5ccd", + "text": "Knowledge-practice performance gap in clinical large language models: Systematic review of 39 benchmarks. Journal of Medical Internet Research 27, e84120 (2025). Assessment of large language models in clinical reasoning: a\nnovel benchmarking study. NEJM AI 2, AIdbp2500120 (2025). Reliability of LLMs as medical assistants for the general\npublic: a randomized preregistered study. Nature Medicine (2026). URL https:\n//doi.org/10.1038/s41591-025-04074-y. 
Comparative analysis of multimodal large language model\nperformance on clinical vignette questions. JAMA 331, 1320–1321 (2024). [20] Kaczmarczyk, R., Wilhelm, T. I., Martin, R. & Roos, J.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 52, + "total_chunks": 95, + "char_count": 641, + "word_count": 85, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "60b3157b-bbda-4201-91a7-3fbaba51e916", + "text": "Evaluating multimodal\nai in medical diagnostics. npj Digital Medicine 7, 205 (2024). [21] McDuff, D. et al. Towards accurate differential diagnosis with large language\nmodels. [22] Zöller, N. et al. Human–ai collectives most accurately diagnose clinical vignettes.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 53, + "total_chunks": 95, + "char_count": 265, + "word_count": 37, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "bcf4da4a-e905-412f-9a7f-1615a6c56e9e", + "text": "Proceedings of the National Academy of Sciences 122, e2426153122 (2025). [23] Bhasuran, B. et al. Preliminary analysis of the impact of lab results on large\nlanguage model generated differential diagnoses. npj Digital Medicine 8, 166\n(2025). 
Macd: Multi-agent clinical diagnosis with self-learned knowledge for\nllm. arXiv preprint arXiv:2509.20067 (2025). Enhancing diagnostic capability with multi-agents conversational\nlarge language models. NPJ digital medicine 8, 159 (2025). An agentic system for rare disease diagnosis with traceable\nreasoning.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 54, + "total_chunks": 95, + "char_count": 550, + "word_count": 74, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "43353817-1e68-4c0e-88d3-bf55d2249e03", + "text": "[27] Charlin, B., Boshuizen, H. Scripts and clinical\nreasoning. Medical education 41, 1178–1184 (2007).", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 55, + "total_chunks": 95, + "char_count": 103, + "word_count": 14, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e9c1889f-264d-4d88-84f7-9960e6ef970d", + "text": "Zaimis, E. (ed.) A-mem: Agentic memory for llm agents. (ed.Zaimis,\nE.) Advances in Neural Information Processing Systems (2025). Agent hospital: A simulacrum of hospital with evolvable medical\nagents. arXiv preprint arXiv:2405.02957 (2024). [30] Food, U., Administration, D. 
et al.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 56, + "total_chunks": 95, + "char_count": 281, + "word_count": 39, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "df3c4d0c-17eb-4816-ba3f-ebd9913763df", + "text": "Transparency for machine learning-enabled\nmedical devices: Guiding principles. US Food And Drug Administration. Retrieved\nJune 30, 2024 (2024).", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 57, + "total_chunks": 95, + "char_count": 143, + "word_count": 18, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e385003c-f8be-4870-8961-bde3ec5514bf", + "text": "[31] Babic, B., Glenn Cohen, I., Stern, A. D., Li, Y. & Ouellet, M. A general framework\nfor governing marketed ai/ml medical devices. npj Digital Medicine 8, 328 (2025). A generalist medical language model for disease diagnosis assistance. Nature medicine 31, 932–942 (2025). 
Empirical data drift detection experiments on real-world medical\nimaging data.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 58, + "total_chunks": 95, + "char_count": 354, + "word_count": 53, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "eeb8f3cc-fc38-41ab-b3ce-7127cdfaabe8", + "text": "Nature communications 15, 1887 (2024). [34] Subasri, V. et al. Detecting and remediating harmful data shifts for the responsible deployment of clinical ai models. JAMA Network Open 8, e2513685–e2513685\n(2025).", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 59, + "total_chunks": 95, + "char_count": 209, + "word_count": 30, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2e4330f1-837d-407c-aadc-d794a284e74e", + "text": "Zaimis, E. (ed.) Memory injection attacks on llm agents via queryonly interaction. (ed.Zaimis, E.) Advances in Neural Information Processing\nSystems (2025). Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023). Towards conversational diagnostic artificial intelligence. 
Nature 642,\n442–450 (2025).", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 60, + "total_chunks": 95, + "char_count": 337, + "word_count": 41, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8035a68d-eaf8-47b0-80d0-842fb4af7aa6", + "text": "Sequential diagnosis with language models. arXiv preprint [39] Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. Ai in health and medicine. Nature medicine 28, 31–38 (2022). [40] Hager, P. et al. Evaluation and mitigation of the limitations of large language\nmodels in clinical decision-making. Nature Medicine (2023). URL https://doi.\norg/10.1038/s41591-024-03097-1. Mimic-iv, a freely accessible electronic health record dataset. Scientific data 10, 1 (2023). Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025). Deepseek-v3. 2: Pushing the frontier of open large language models. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models\n(2025). URL https://arxiv.org/abs/2508.06471. arXiv:2508.06471. [45] Sellergren, A. 
et al.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 61, + "total_chunks": 95, + "char_count": 748, + "word_count": 98, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "10cfba3a-2122-4b7d-8ab9-d8c2bc42922d", + "text": "Medgemma technical report. arXiv preprint arXiv:2507.05201\n(2025). Clinical camel: An open expert-level medical language model with\ndialogue-based knowledge encoding. arXiv preprint arXiv:2305.12031 (2023). [47] Xiao, S., Liu, Z., Zhang, P. & Muennighoff, N. C-pack: Packaged resources to\nadvance general chinese embedding (2023). arXiv:2309.07597. [48] Johnson, J., Douze, M. & Jégou, H.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 62, + "total_chunks": 95, + "char_count": 389, + "word_count": 51, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c40cd1aa-701d-45ea-81c2-6dcdfe04c3ad", + "text": "Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7, 535–547 (2019). [49] Di Saverio, S. et al. Diagnosis and treatment of acute appendicitis: 2020 update of\nthe wses jerusalem guidelines. World journal of emergency surgery 15, 27 (2020). [50] Sartelli, M. et al. 
2020 update of the wses guidelines for the management of\nacute colonic diverticulitis in the emergency setting. World Journal of Emergency\nSurgery 15, 32 (2020). [51] Leppäniemi, A. et al. 2019 wses guidelines for the management of severe acute\npancreatitis. World journal of emergency surgery 14, 27 (2019). [52] Yokoe, M. et al. Tokyo guidelines 2018: diagnostic criteria and severity grading\nof acute cholecystitis (with videos). Journal of Hepato-biliary-pancreatic Sciences\n25, 41–54 (2018). Supplementary Information A Diagnostic Prompt Template The following is the main diagnostic prompt template of DxEvolve used in all experiments across various base models reported in this paper, with medical examinations,\nexperience retrieval, clinical guidelines, and PubMed search enabled.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 63, + "total_chunks": 95, + "char_count": 1076, + "word_count": 157, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "993bad2e-6daa-4465-8c5e-14a7913045fb", + "text": "Template variables are shown in {braces}. The tags {system tag start}, {system tag end},\n{user tag start}, {user tag end}, and {ai tag start} are replaced with model-specific chat delimiters at runtime. Supplementary Table 1: Diagnostic Prompt Template. {system tag start}\nYou are a senior physician. Your task is to perform stepwise diagnostic reasoning\nusing ONLY the allowed tools. You must strictly follow one of the two output\nformats below at every step. 
INFORMATION GATHERING\nThought: [1-2 concise sentences: what you know + what uncertainty remains +\nwhy next action is needed]\nAction: [One of: Physical Examination, Laboratory Tests, Imaging, Experience\nSearch, Guideline Search, PubMed Search]\nAction Input: [Specific and valid request, MUST be within tool scope]\nObservation:\n[The system will fill this. DO NOT include any results yourself.] FINAL DIAGNOSIS\nThought: [1-2 concise sentences summarizing key findings leading to the diagnosis]\nFinal Diagnosis: [Single, clear, concise, and standard diagnosis. (Avoid overly complex or speculative etiological chains, focus on the most likely and commonly\nrecognized diagnosis.)]", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 64, + "total_chunks": 95, + "char_count": 1135, + "word_count": 166, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "73740833-07fa-429a-bdc8-36a9f0faafd1", + "text": "You MUST always follow the exact format (A or B). For any test, ONLY request those allowed by the corresponding tool.\n- Laboratory Tests: only valid lab names.\n- Imaging: must specify ' ' format (e.g., 'Abdomen\nUltrasound', 'Abdomen CT').\n- No invented tests, no unsupported modalities.\n3. Before giving the final diagnosis, you MUST explicitly perform all three core\ntypes of medical evaluation as actions – at least one Physical Examination, one\nLaboratory Test, and one Imaging.\n- Consider all clinically relevant imaging modalities for the suspected condition. 
Continued on next page", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 65, + "total_chunks": 95, + "char_count": 605, + "word_count": 93, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4346664b-c878-48d7-82d2-d68aa4e0afe7", + "text": "- Do not omit a modality that is commonly recommended or diagnostically critical\nunless it is clearly inappropriate.\n4. You MUST use Experience Search at least once before giving the final diagnosis.\n- In Action Input you SHOULD provide a short case style description of this\npatient (age, sex, chief complaint, symptom pattern, duration, key exam or lab or\nimaging findings), not just a single disease keyword.\n- If the retrieved experience is clearly irrelevant or not useful, you may reformulate\nthe Action Input once and try a second Experience Search query. Do NOT keep\nsearching repeatedly.\n- Only integrate insights that are consistent with this patient's objective data.\n5. You MUST use Guideline Search at least once before giving the final diagnosis.\n6. Stop when a confident diagnosis is possible based on available information.\n7. When using Experience Search, Guideline Search, or PubMed Search, integrate\nonly relevant insights into your Thought and proceed; do not rely on them if they\nconflict with patient-specific objective data.\n8. If uncertainty remains but no high-yield action exists, you MUST provide the\nbest-supported diagnosis (Format B) based on currently available data, without\nloop actions indefinitely. CRITICAL FORMAT RULES:\n1. 
MUST output the \"Observation:\" label immediately after Action Input as a\nsignal to pause for respond.\n2. Keep \"Action\", \"Action Input\" and \"Final Diagnosis\" fields concise and to the\npoint. AVAILABLE TOOLS:\n- Physical Examination: Request physical examination of patient and receive the\nobservations. This is a strongly recommended Examination in the clinical diagnostic\nprocess and should be performed first.\n- Laboratory Tests: Request specific laboratory test and receive text values. Specify\ntest names in 'Action Input' clearly.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 66, + "total_chunks": 95, + "char_count": 1793, + "word_count": 274, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ce69bee7-5381-48c2-8597-43d72c41e79b", + "text": "This is a common diagnostic step in the clinical\nevaluation.\n- Imaging: Request imaging scans and receive the radiologist report. Region AND\nmodality MUST be specified in the 'Action Input' field.\n- Experience Search: Dense retrieval over past diagnostic cases. Action Input\nSHOULD be a short case style description of this patient, not just a disease name.\n- Guideline Search: Retrieve relevant clinical guidelines. Provide a concise clinical query in \"Action Input\" (symptoms, suspected diagnosis, key labs/imaging, or\ndecision point).\n- PubMed Search: Conduct targeted search on PubMed and receive relevant medical\narticles. Concise and specific search query (few KEYWORDS) MUST be specified\nin \"Action Input\". 
Continued on next page", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 67, + "total_chunks": 95, + "char_count": 736, + "word_count": 110, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b0a3b14d-cd2f-4143-8efb-5aea8707786a", + "text": "BE EFFICIENT: Prioritize high-yield diagnostic actions before broad or low-yield\nones. Some medical examination information may not be available, do not focus\non the unavailable data, make full use of the information that can be obtained to\ndiagnose.\n{system tag end}{user tag start} Patient History:\n{input} BEGIN YOUR DIAGNOSTIC PROCESS:\n{user tag end}{ai tag start}\nThought:{agent scratchpad} The prompt instructs the LLM to act as a senior physician performing stepwise diagnostic reasoning in an action-based loop. Two output formats\nare enforced: Format A for iterative information gathering (Thought →Action →\nObservation) and Format B for the final diagnosis with thought. B Experience Construction Prompt Template After each diagnostic case is completed, the following template is used to distill the\ncase into a reusable diagnostic cognition primitive (DCP) through reflection on the\ndiagnostic trajectory. The DCP is stored in the DCP repository for retrieval in future\ncases. Supplementary Table 2: Experience Construction Prompt. {system tag start}\nYou extract reusable diagnostic reasoning experience from completed clinical cases\nfor future tool using agents. 
Your goal:\n- Do NOT retell the full case or reproduce chain of thought.\n- Do NOT include treatment.\n- Distill ONE Diagnostic Cognition Primitive (DCP): a short heuristic that improves\nfuture diagnosis. The DCP must:\n- Be consistent with the ground truth diagnosis and the correctness flag.\n- Focus on diagnostic reasoning, not management or consultation.\n- Emphasize when and how to use ONLY the following tools in future similar cases:\n- Physical Examination (no additional input)\n- Laboratory Tests (input: names of the lab tests to run)", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 68, + "total_chunks": 95, + "char_count": 1714, + "word_count": 259, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2494151a-cc8d-4d03-b0ef-b83634b66755", + "text": "Continued on next page - Imaging (input: imaging modality and region to be scanned) Tool input templates (copyable):\n- Physical Examination\n- Laboratory Tests: , , ...\n- Imaging: modality=, region= Coverage constraints:\n- Only recommend tests or imaging settings that are explicitly supported by the\nprovided case context, meaning they appear in at least one of:\n1) Clinician test orders (from the chart). 
Use this as a high quality reference for\nrealistic first line test selection and sequencing.\n2) Diagnostic steps where the tool call succeeded (has a non-error observation)\n3) Rule based feedback 'message' or retrieved guidance that explicitly recommends a specific test or imaging setting\n- Prefer to fully cover the explicitly provided clinician orders and successful tool\ncalls before adding anything else.\n- Do not invent new tests, imaging modalities, regions, or non-provided measurement names. Field roles:\n- Experience Pattern:\n- Case-style trigger pattern for retrieval, built from symptoms, basic context, and\nkey objective findings.\n- You may append compact labels such as the final correct diagnosis and common\nmisdiagnoses to improve retrieval.\n- Test Ordering Experience:\n- Constructive test-ordering heuristic using only the allowed tools and toolcompatible inputs.\n- You may rank actions by priority and specify escalation criteria, in natural\nclinical language.\n- Avoid blanket prohibitions. If a test is lower priority, express it as conditional\nor deferred rather than discouraged.\n- When naming tests or imaging, use the copyable tool input templates above.\n- Diagnostic Decision Experience:\n- Short rule on how to weigh key findings and move from differential diagnosis to\nthe correct final diagnosis. 
Error correction rules:\n- If correctness is \"Correct\":\n- Treat the model's diagnostic process as broadly appropriate.\n- Extract the most reusable diagnostic pattern and test ordering heuristic.\n- If correctness is \"Incorrect\": Continued on next page", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 69, + "total_chunks": 95, + "char_count": 2022, + "word_count": 304, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9cefc019-9a2d-4bf1-94bf-e0c44bd5e869", + "text": "- Treat the model's final diagnosis and reasoning as a negative example.\n- Do NOT justify or reuse the incorrect diagnosis.\n- Use the ground truth and the rule based feedback in 'message' as the primary\nreference.\n- Base the DCP on the ideal diagnostic process implied by that feedback. Input fields:\n- Patient input: raw case description.\n- Diagnostic steps: chronological list of tool calls and observations.\n- Model final diagnosis: what the model concluded.\n- Ground truth diagnosis: correct diagnosis label for this case.\n- Correctness flag: \"Correct\" or \"Incorrect\".\n- Rule based feedback: comments about missing exams, unnecessary tests, wrong\nimaging, and efficiency.\n- Clinician test orders (from the chart): tests ordered by the treating clinician as\ndocumented in the chart, expressed with the same tool names and inputs, and\nserving as a realistic reference for first line test selection and sequencing. 
Case context:\nPatient input:\n{input} Diagnostic steps:\n{intermediate steps} Model final diagnosis:\n{output} Ground truth diagnosis:\n{ground truth} Correctness flag:\n{correctness} Rule based feedback on process:\n{message} Clinician test orders (from the chart):\n{clinician} Now output exactly in this format: Continued on next page", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 70, + "total_chunks": 95, + "char_count": 1246, + "word_count": 189, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "86942047-03fb-44eb-84c7-44fa58af8451", + "text": "Experience Pattern: \nTest Ordering Experience: \nDiagnostic Decision Experience: \n{system tag end} This template implements the Experience Construction module that\ngenerates DCPs from completed cases. Each DCP consists of three fields: (1) Experience Pattern, a case-style trigger description optimized for dense retrieval; (2) Test\nOrdering Experience, a prioritized test-ordering heuristic grounded in clinician orders\nand successful tool calls; and (3) Diagnostic Decision Experience, a concise rule\nfor weighing findings toward the correct diagnosis. The {message} variable contains rule-based evaluator feedback on the diagnostic process, which identifies missing\nexaminations, unnecessary tests, or procedural deviations based on pathology-specific\nevaluation criteria. For example, if the agent failed to request appropriate imaging for\nsuspected appendicitis, the feedback might state: \"Imaging: no appropriate abdominal imaging was requested. 
Set region='Abdomen' and request imaging (ultrasound\nis typically preferred in pediatric or pregnant patients, while CT is generally recommended for adult non-pregnant patients).\" This feedback guides the DCP construction\nto emphasize the correct diagnostic workflow.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 71, + "total_chunks": 95, + "char_count": 1598, + "word_count": 210, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "36cb8573-3552-40f9-a689-7000ecb5cf1f", + "text": "The {clinician} variable provides real\nclinician test orders extracted from the MIMIC-IV chart, serving as a high-quality\nreference for realistic test selection and sequencing. C Example Diagnostic Cognition Primitive The following is a representative DCP generated through reflection on the diagnostic\ntrajectory from a correctly diagnosed case of acute biliary pancreatitis. This DCP is\nstored in the DCP repository and retrieved via vector-based dense retrieval when the\nagent encounters similar presentations in future cases. Supplementary Table 3: Example DCP (Correct Case). Experience Pattern:\nPost-cholecystectomy patient with acute RUQ/back pain, elevated liver enzymes\nand lipase. (Acute pancreatitis, DDx: Biliary pancreatitis vs. 
other etiologies) Test Ordering Experience: Continued on next page", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 72, + "total_chunks": 95, + "char_count": 808, + "word_count": 109, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b953f0b3-ff96-4aee-b115-bf7407a9fdc7", + "text": "First, confirm pancreatitis with Laboratory Tests: Lipase, Amylase, CBC, CMP. Concurrently, order first-line biliary imaging: Imaging: modality=Ultrasound,\nregion=Abdomen. If ultrasound is negative for stones/dilation but liver enzymes remain elevated,\nescalate to definitive biliary evaluation (ERCP) per clinician orders; do not escalate\nto CT or MRCP without specific indications (e.g., concern for complications or\nfailed ERCP). Diagnostic Decision Experience:\nIn a post-cholecystectomy setting, acute pancreatitis with concurrent transaminitis/hyperbilirubinemia is biliary in origin until proven otherwise, even with a\nnegative initial ultrasound, as microlithiasis or sphincter dysfunction may be the\ncause. Below is a second example DCP generated from an incorrectly diagnosed case,\ndemonstrating the error correction mechanism. The agent originally diagnosed \"adhesive small bowel obstruction\" but the ground truth was cholecystitis. Supplementary Table 4: Example DCP (Incorrect Case). Experience Pattern:\nYoung to middle-aged female with acute right abdominal pain, sharp on palpation,\nbilious vomiting, chills, and history of prior abdominal surgeries (e.g., laparoscopies). Past medical history of endometriosis. 
(Correct: cholecystitis; Common\nmisdiagnosis: adhesive small bowel obstruction)", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 73, + "total_chunks": 95, + "char_count": 1305, + "word_count": 165, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "83049e30-22e6-4fbd-85d9-8d42322541e7", + "text": "Test Ordering Experience:\n1. Physical Examination.\n2. Laboratory Tests: CBC differential, CMP, (Blood) Lactate, (Urine) HCG.\n3. Imaging: modality=Ultrasound, region=Abdomen. Escalate to further imaging (e.g., CT) only if ultrasound is non-diagnostic and\nclinical suspicion for obstruction or other complication remains high. Diagnostic Decision Experience:\nIn a patient with right upper quadrant or right-sided abdominal pain, vomiting,\nand chills, prioritize gallbladder pathology. 
A history of prior surgery should not\nprematurely anchor to adhesive obstruction; a finding of gallstones on ultrasound,\nespecially with local tenderness, strongly supports cholecystitis over obstruction.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 74, + "total_chunks": 95, + "char_count": 687, + "word_count": 88, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "dc7ac002-333e-48ca-9cd0-aeea7bc32dbf", + "text": "The first DCP illustrates how a correctly diagnosed case is consolidated\ninto a reusable experience artifact. The experience pattern provides a high-salience\nsignature for retrieval, summarizing the presentation and discriminative cues. The testordering experience encodes actionable workup guidance, including high-yield nextstep evaluations and contingency options. The diagnostic decision experience captures evidence-linked implications for hypothesis refinement and final decision-making. The\nsecond DCP demonstrates how corrective lessons are incorporated when the source\ntrajectory exposed an error mode: when the agent misdiagnosed cholecystitis as small\nbowel obstruction in a case with atypical presentation, the DCP was constructed from\nthe ground truth and evaluator feedback, explicitly labeling the common misdiagnosis\nand providing the correct reasoning pathway. 
D Example Diagnostic Reasoning Trace The following is a complete diagnostic reasoning trace from a real case in the MIMIC-CDM benchmark, showing the agent's stepwise process from initial presentation to\nfinal diagnosis.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 75, + "total_chunks": 95, + "char_count": 1096, + "word_count": 144, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a1baab34-1052-4901-b76d-a6d921a75847", + "text": "The case involves an elderly patient with diabetes presenting with\nacute right upper quadrant abdominal pain, ultimately diagnosed with acute calculous\ncholecystitis. Supplementary Table 5: Example Diagnostic Reasoning Trace. Elderly male patient with history of type 2 diabetes mellitus on insulin, hypothyroidism, hypertension, and prostate cancer status-post radiotherapy presented to\nthe emergency department with acute onset abdominal pain. The patient reported\nthat the pain began suddenly at approximately 3 AM, waking him from sleep. He\ndescribed it as sharp, constant, and localized to the right side of the abdomen. When\nthe pain persisted, he initially attempted to contact his primary care physician but\nwas unable to secure an appointment, prompting him to call emergency services. In the emergency department, initial vital signs were temperature 98°F, blood pressure 150/86 mmHg, heart rate normal, respiratory rate 16 breaths per minute,\noxygen saturation 100% on room air.
Initial laboratory studies showed lactate 2.3\nmmol/L, glucose 279 mg/dL, normal liver function tests, and hematocrit 33.2%.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 76, + "total_chunks": 95, + "char_count": 1113, + "word_count": 161, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8de6d4a4-acea-4582-8262-2cc6348f10d8", + "text": "Right upper quadrant ultrasound demonstrated extrahepatic biliary duct dilation\nwith common bile duct measuring 10 mm and presence of gallstones. Computed\ntomography of the abdomen and pelvis confirmed these findings. He was treated\nwith intravenous fluids, morphine for pain control, and ondansetron for nausea. On admission to the hospital floor, the patient reported improved but persistent\nright upper quadrant abdominal pain. He endorsed dry heaves and constipation,\nwith his last bowel movement occurring two days prior. He denied fever, chills, chest\npain, or shortness of breath. Continued on next page Past Medical History: Prostate cancer status-post radiotherapy, type 2 diabetes\nmellitus on insulin, hypertension, hypothyroidism, chronic kidney disease stage III,\nhistory of gallstones, nephrolithiasis, osteoarthritis, glaucoma, cataracts, sickle cell\ntrait, hearing loss. 
Past Surgical History: Pilonidal cyst excision, bilateral inguinal hernia repairs.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 77, + "total_chunks": 95, + "char_count": 968, + "word_count": 132, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e5c5522b-6436-45ea-a9a8-e6e9ba60b9ab", + "text": "Social History: Details not specified. Family History: Type 2 diabetes mellitus in family members; no family history of\nprostate cancer. Step 1: Physical Examination Action: Physical Examination Observation:\nVitals: Temperature 98.2°F, BP 148/68 mmHg, HR 90 bpm, RR 20, O2 sat 100%. General: Alert and oriented, no acute distress. HEENT: Sclerae anicteric, moist mucous membranes, oropharynx clear, pupils\nequal and reactive to light, bilateral cloudy lenses (cataracts). Cardiovascular: Regular rate and rhythm, normal S1 and S2 heart sounds, no\nmurmurs, rubs, or gallops. Pulmonary: Clear to auscultation bilaterally, no wheezes, rales, or rhonchi. Abdomen: Soft, tender to palpation in the right upper quadrant, non-distended,\nbowel sounds present, no organomegaly, no rebound tenderness or guarding. Genitourinary: No Foley catheter in place. Extremities: Warm and well-perfused, 2+ pulses bilaterally, no clubbing, cyanosis,\nor edema. 
Neurological: Cranial nerves II-XII grossly intact, strength grossly intact, noted\nhearing loss.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 78, + "total_chunks": 95, + "char_count": 1036, + "word_count": 144, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "10e9e0f5-b2f0-4dd9-bfa2-3523b334652a", + "text": "Step 2: Laboratory Tests Continued on next page Action: Laboratory Tests Action Input: Complete blood count with differential, comprehensive metabolic\npanel, lipase Observation:\nWhite Blood Cell Count: 7.6 K/uL (reference range: 4.0–11.0 K/uL) Neutrophils: 89.1%, Absolute Neutrophil Count: 6.77 K/uL [Elevated, indicating\nneutrophilia] Lymphocytes: 6.4%, Absolute Lymphocyte Count: 0.49 K/uL Monocytes: 3.2%, Absolute Monocyte Count: 0.24 K/uL Eosinophils: 0.4%, Absolute Eosinophil Count: 0.03 K/uL Basophils: 0.5%, Absolute Basophil Count: 0.04 K/uL Red Blood Cell Count: 4.73 M/uL Hemoglobin: 11.2 g/dL (reference range: 13.5–17.5 g/dL) [Low] Hematocrit: 33.2% (reference range: 41–53%) [Low] MCV: 70.0 fL (reference range: 80–100 fL) [Low, microcytic] MCH: 23.7 pg (reference range: 27–31 pg) [Low]", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 79, + "total_chunks": 95, 
+ "char_count": 803, + "word_count": 108, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "892406a4-e133-418b-9d11-959332c907a3", + "text": "MCHC: 33.7 g/dL (reference range: 32–36 g/dL) Platelet Count: 156 K/uL (reference range: 150–400 K/uL) RDW: 14.6% (reference range: 11.5–14.5%) Glucose: 279 mg/dL (reference range: 70–100 mg/dL) [Elevated] Sodium: 141 mEq/L (reference range: 136–145 mEq/L) Potassium: 3.6 mEq/L (reference range: 3.5–5.1 mEq/L) Chloride: 99 mEq/L (reference range: 98–107 mEq/L) Continued on next page", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 80, + "total_chunks": 95, + "char_count": 384, + "word_count": 53, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ee9b897e-9a0e-47cb-a34d-19f91acda218", + "text": "Bicarbonate: 25 mEq/L (reference range: 22–29 mEq/L) Blood Urea Nitrogen: 19 mg/dL (reference range: 7–20 mg/dL) Creatinine: 1.1 mg/dL (reference range: 0.7–1.3 mg/dL) Calcium, Total: 9.3 mg/dL (reference range: 8.5–10.5 mg/dL) Alanine Aminotransferase (ALT): 43 IU/L (reference range: 7–56 IU/L) Aspartate Aminotransferase (AST): 30 IU/L (reference range: 10–40 IU/L) Alkaline Phosphatase: 91 IU/L (reference range: 40–150 IU/L) Total Bilirubin: 0.6 mg/dL (reference range: 0.3–1.2 mg/dL) Lipase: 12 IU/L (reference range: 7–60 IU/L) Step 3: Imaging (Abdominal Ultrasound) Action Input: Abdomen Ultrasound (focused on gallbladder and biliary tree) Observation:\nRight Upper Quadrant Ultrasound: TECHNIQUE: Grayscale and color Doppler ultrasound images of the abdomen. Liver: Homogenous attenuation throughout. 
No focal lesions identified.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 81, + "total_chunks": 95, + "char_count": 838, + "word_count": 110, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e4be4f10-6fb8-409b-a927-df1436e2d28c", + "text": "Hepatobiliary: Mild intrahepatic and extrahepatic biliary duct dilation noted. Common bile duct measures 12 mm in diameter. Gallbladder contains gallstones without\nwall thickening or surrounding inflammation visible on this initial study. Trace\npericholecystic free fluid noted. Pancreas: Diffuse pancreatic atrophy. No pancreatic duct dilation or peripancreatic\nfat stranding. Continued on next page", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 82, + "total_chunks": 95, + "char_count": 400, + "word_count": 52, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "680e35ac-1025-46b8-b91d-da2ad595c185", + "text": "Spleen: Normal size and attenuation, no focal lesions. Kidneys: Bilateral subcentimeter renal hypodensities too small to characterize. A 6.1 × 5.6 cm simple cyst in the interpolar region of the left kidney. No\nhydronephrosis or perinephric abnormality. 
IMPRESSION: Cholelithiasis, common bile duct dilation (12 mm), trace pericholecystic fluid. Recommend correlation with clinical findings and consider advanced\nbiliary imaging if indicated. Step 4: Experience Search Action: Experience Search", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 83, + "total_chunks": 95, + "char_count": 493, + "word_count": 67, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "075f29c3-9f08-4d40-ae02-d3ac8baf853b", + "text": "Action Input: Elderly male with diabetes, acute RUQ pain, leukocytosis, normal\nLFTs, imaging shows acute cholecystitis (gallbladder wall edema, pericholecystic\nfluid) and common bile duct dilation with papillary stenosis from passed stones Experience Pattern: Elderly female with known gallstones and prior biliary intervention (ERCP/stent) presents with acute right upper quadrant pain and tenderness;\nnormal white blood cell count and liver enzymes; ultrasound shows gallstones,\ngallbladder wall thickening, and common bile duct dilation. [Correct: Acute\nCholecystitis — Misdiagnoses: Cholangitis, Biliary Colic, Pancreatitis] Test Ordering Experience: First, perform a Physical Examination focusing on\nabdominal signs. Then, order initial Laboratory Tests: CBC, CMP (including liver\nenzymes and bilirubin), and lipase. If these are normal but suspicion remains, escalate to Laboratory Tests: C-reactive protein (CRP). For imaging, the first and\noften definitive test is Imaging: modality=Ultrasound, region=Abdomen (specifically right upper quadrant). 
Do not escalate to CT or MRCP unless the ultrasound\nis non-diagnostic or complications are suspected. Diagnostic Decision Experience: In a patient with known gallstones, acute RUQ\npain/tenderness, and ultrasound findings of gallstones with gallbladder wall thickening, diagnose acute cholecystitis even if inflammatory markers and LFTs are\nnormal; CBD dilation and pneumobilia in this context are likely chronic sequelae\nof prior ERCP and do not rule out cholecystitis.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 84, + "total_chunks": 95, + "char_count": 1524, + "word_count": 205, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c10bdef5-16c9-4453-b457-366688533a9e", + "text": "[Additional guidelines omitted for brevity] Continued on next page Step 5: Imaging (MRCP) Action Input: Abdomen MRCP Observation:\nMagnetic Resonance Cholangiopancreatography (MRCP): TECHNIQUE: T1- and T2-weighted multiplanar images of the abdomen acquired\nin a 1.5 T magnet. Intravenous contrast: 9 mL Gadavist. Oral contrast: 1 cc of\nGadavist mixed with 50 cc of water. Lower Thorax: Mild bibasilar atelectasis. No large pleural effusion. Liver: Homogeneous signal intensity without focal lesions. Mild, nonspecific periportal edema. Biliary: The gallbladder is distended and contains numerous calculi. Since prior\nexaminations, there is progressive gallbladder wall edema and pericholecystic fluid\nwith gallbladder wall enhancement. 
No clear pericholecystic hepatic hyperenhancement.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 85, + "total_chunks": 95, + "char_count": 785, + "word_count": 104, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9f4b5d8f-751c-4da2-bbaf-599aa641729d", + "text": "The common bile duct measures up to 1.2 cm in diameter. It tapers abruptly\nwith a fixed loss in caliber, most consistent with papillary stenosis, likely the result\nof previously passed stones.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 86, + "total_chunks": 95, + "char_count": 192, + "word_count": 32, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f14bafdc-aaf6-4ba0-9a0e-475a5832e814", + "text": "Pancreas: Normal signal and enhancement. Several subcentimeter cystic lesions. Adrenal Glands: Unremarkable. Large simple cyst in the upper pole of the left kidney,\nunchanged. Minimal nonspecific perinephric fat stranding. Gastrointestinal Tract: No significant abnormalities. Lymph Nodes: No enlarged lymph nodes. Vasculature: Patent aorta and main branches. 
Continued on next page", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 87, + "total_chunks": 95, + "char_count": 382, + "word_count": 50, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "354dc49d-95c9-4520-be9b-15a3c8f2e44c", + "text": "Osseous and Soft Tissue Structures: Mild degenerative changes throughout the\nlower thoracic and lumbar spine. Mild symmetric bilateral gynecomastia noted\nincidentally. IMPRESSION: Acute calculous cholecystitis with gallbladder wall edema, pericholecystic fluid, and wall enhancement. Common bile duct dilation with abrupt\ntapering suggestive of papillary stenosis from previously passed stones.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 88, + "total_chunks": 95, + "char_count": 394, + "word_count": 49, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "42223731-b893-44e6-9c13-9b99427d2c7b", + "text": "Step 6: Guideline Search Action: Guideline Search Action Input: Management of acute cholecystitis with common bile duct dilation,\nsuspected choledocholithiasis or papillary stenosis Observation:\nGuideline 1 (score=0.5146) Title: 2020 World Society of Emergency Surgery updated guidelines for the\ndiagnosis and treatment of acute 
calculus cholecystitis Key points:\n- Common bile duct dilation on ultrasound (>6 mm with gallbladder in situ) is a\nstrong risk factor for common bile duct stones (CBDS), but is not diagnostic alone;\nfurther tests are required.\n- For moderate-risk patients (e.g., CBD dilation + bilirubin 1.8–4 mg/dL),\npreoperative MRCP or endoscopic ultrasound is recommended to diagnose CBDS.\n- For high-risk patients (e.g., visualized CBD stone on ultrasound or ascending\ncholangitis), preoperative ERCP is recommended to diagnose CBDS.\n- An increased common bile duct diameter on ultrasound is an indirect sign but is\nnot sufficient alone to diagnose choledocholithiasis, necessitating further diagnostic\ntests.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 89, + "total_chunks": 95, + "char_count": 1027, + "word_count": 146, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "80d4c345-3ea0-425f-8dfa-a4c136534d92", + "text": "[Additional guidelines omitted for brevity] Step 7: PubMed Search Action: PubMed Search Action Input: acute cholecystitis common bile duct dilation normal liver enzymes Continued on next page", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 90, + "total_chunks": 95, + 
"char_count": 191, + "word_count": 27, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7dd3a33e-e81e-4dd0-9647-bf52ec4b4223", + "text": "Observation:\nArticle 1: Role of liver function tests in symptomatic cholelithiasis. Abstract: Cholelithiasis is a common problem in Western and developing nations\nwith continuously rising incidence. This prospective study evaluated the usefulness\nof routine liver function tests (LFTs) in patients undergoing cholecystectomy for\nsymptomatic cholelithiasis. Out of 124 patients, 87.10% had LFTs within normal\nrange. Most patients with normal LFTs (95.37%) presented with chronic calculous\ncholecystitis. Of those with deranged LFTs, 43.75% presented with acute calculous\ncholecystitis. None had dilated common bile duct on preoperative ultrasound or\nintraoperative findings. Conclusion: Routine LFTs in preoperative assessment of\nuncomplicated symptomatic cholelithiasis usually come out normal and are not\ngood predictive markers for detecting silent common bile duct calculi. Article 2: Endoscopic Retrograde Cholangiopancreatography in Kidney Transplant\nPatients: Results From a Single Center. Abstract: Twenty-one kidney transplant patients underwent 23 ERCP procedures\nfor various biliary indications including choledocholithiasis (28.6%), common bile\nduct dilation with liver enzyme elevations (19%), and cholangitis (23.8%). Findings\nincluded choledocholithiasis in 47.6% and dilated common bile duct in 9.6%. ERCP\nwas safe with no complications such as acute pancreatitis, bleeding, or perforation. 
[Additional guidelines omitted for brevity]", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 91, + "total_chunks": 95, + "char_count": 1449, + "word_count": 188, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "888958fd-1c23-4642-a0bc-b1b6ab71a86a", + "text": "The patient presents with acute right upper quadrant pain, neutrophilia (89.1%\nneutrophils, absolute neutrophil count 6.77 K/uL), and cholelithiasis on imaging. Initial ultrasound demonstrated gallstones with common bile duct dilation (12 mm)\nand trace pericholecystic fluid. MRCP definitively showed progressive gallbladder\nwall edema, pericholecystic fluid, and gallbladder wall enhancement—classic imaging findings for acute cholecystitis. The common bile duct dilation with abrupt\ntapering is consistent with papillary stenosis from previously passed stones. Notably,\nliver function tests remained entirely normal (ALT 43 IU/L, AST 30 IU/L, alkaline\nphosphatase 91 IU/L, bilirubin 0.6 mg/dL), and lipase was normal (12 IU/L), effectively ruling out concurrent pancreatitis or acute cholangitis. The retrieved clinical\nexperiences and guidelines confirm that normal liver enzymes do not exclude acute\ncholecystitis, and that CBD dilation in this setting likely reflects chronic sequelae\nfrom stone passage rather than active choledocholithiasis. 
Final Diagnosis: Acute calculous cholecystitis Continued on next page Ground Truth: Cholecystitis ✓ This trace demonstrates the complete deep clinical research (DCR)\nworkflow with actual laboratory results and rich experience retrieval.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 92, + "total_chunks": 95, + "char_count": 1285, + "word_count": 170, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7bffc860-80eb-466f-b50f-f06f3a124d9a", + "text": "The agent follows a clinically logical sequence: physical examination first revealing right upper\nquadrant tenderness, followed by comprehensive laboratory evaluation showing relative neutrophilia (89.1% neutrophils, absolute neutrophil count 6.77 K/uL) with liver\nenzymes within normal limits (ALT 43 IU/L, AST 30 IU/L, alkaline phosphatase 91\nIU/L, bilirubin 0.6 mg/dL) and normal lipase (12 IU/L). Initial right upper quadrant ultrasound showed cholelithiasis with common bile duct dilation (12 mm) and\ntrace pericholecystic fluid. The agent escalated to MRCP for more definitive biliary\nassessment, which revealed gallbladder wall thickening and edema, pericholecystic\nfluid, and increased T2 signal—findings consistent with acute calculous cholecystitis. 
The Experience Search retrieved relevant cases from the experience library, providing\nguidance on test-ordering strategies and diagnostic reasoning for similar presentations.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 93, + "total_chunks": 95, + "char_count": 934, + "word_count": 121, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6685804c-d00c-47ff-9288-a87733448cb2", + "text": "The retrieved experiences noted that acute cholecystitis can present with normal\nliver enzymes and that CBD dilation in the absence of visualized stones reduces\nthe likelihood of active choledocholithiasis. The Guideline Search retrieved the 2020\nWorld Society of Emergency Surgery guidelines on acute calculous cholecystitis, which\ninformed the diagnostic reasoning regarding CBD dilation and the appropriateness of\nMRCP for moderate-risk patients. The PubMed Search provided supporting evidence\nregarding the prevalence of normal liver function tests in acute cholecystitis. 
The final\ndiagnosis of acute calculous cholecystitis was correct, matching the ground truth label.", + "paper_id": "2603.10677", + "title": "Emulating Clinician Cognition via Self-Evolving Deep Clinical Research", + "authors": [ + "Ruiyang Ren", + "Yuhao Wang", + "Yunsen Liang", + "Lan Luo", + "Jing Liu", + "Haifeng Wang", + "Cong Feng", + "Yinan Zhang", + "Chunyan Miao", + "Ji-Rong Wen", + "Wayne Xin Zhao" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10677v1", + "chunk_index": 94, + "total_chunks": 95, + "char_count": 675, + "word_count": 92, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10678_semantic.json b/data/chunks/2603.10678_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..397426ff5a0a089105496ca2bc0daf47804447c6 --- /dev/null +++ b/data/chunks/2603.10678_semantic.json @@ -0,0 +1,684 @@ +[ + { + "chunk_id": "56c7faba-1a2e-4fc9-b444-0fa791e1dbf0", + "text": "Surrogate models for nuclear fusion with parametric\nShallow Recurrent Decoder Networks: applications to\nmagnetohydrodynamics", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 0, + "total_chunks": 31, + "char_count": 124, + "word_count": 14, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "15cf85ef-3e78-4aca-8291-1405d50a3d36", + "text": "Matteo Lo Versoa, Carolina Introinia, Eric Cervia, Laura Savoldib, J. Nathan\nKutzc, Antonio Cammid,a,1 aDepartment of Energy, Politecnico di Milano, Milano, 20133, Italy2026 bMATHEP Group, Dept. 
of Energy \"Galileo Ferraris\", Politecnico di Torino, Torino, Italy cAutodesk Research, 6 Agar Street, London UK dDepartment of Mechanical and Nuclear Engineering & Emirates Nuclear Technology Center, Khalifa University, Abu Dhabi, 127788, United Arab Emirates", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 1, + "total_chunks": 31, + "char_count": 457, + "word_count": 61, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fadd67c3-142e-4411-af78-2617828a44c6", + "text": "Magnetohydrodynamic (MHD) effects play a key role in the design and operation of nuclear fusion systems, where electrically conducting fluids (such as liquid\nmetals or molten salts in reactor blankets) interact with magnetic fields of\nvarying intensity and orientation, which affect the resulting flow. The numerical\nresolution of MHD models involves highly nonlinear multiphysics systems of\nequations and can become computationally expensive, particularly in multi-query, parametric, or real-time contexts. This work investigates a fully data-driven framework for MHD state reconstruction that combines dimensionality\nreduction via Singular Value Decomposition (SVD) with the SHallow REcurrent\nDecoder (SHRED), a neural network architecture designed to recover the full\nspatio-temporal state from sparse time-series measurements of a limited number\nof observables. The methodology is applied to a parametric MHD test case\ninvolving compressible lead-lithium flow in a stepped channel subjected to\nthermal gradients and magnetic fields spanning a broad range of intensities.
To improve efficiency, the full-order dataset is first compressed using SVD,\nyielding a reduced representation used as reference truth for training. Only temperature measurements from three sensors are provided as input, while the\nnetwork reconstructs the full fields of velocity, pressure, and temperature. To\nassess robustness with respect to sensor placement, thirty randomly generated\nsensor configurations are tested in ensemble mode.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 2, + "total_chunks": 31, + "char_count": 1537, + "word_count": 208, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e6535541-e649-45af-acf9-ff4055a2e6cf", + "text": "Results show that SHRED\naccurately reconstructs the full MHD state even for magnetic field intensities\nnot included in the training set. These findings demonstrate the potential of ∗Corresponding author. Email address: antonio.cammi@polimi.it", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. 
Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 3, + "total_chunks": 31, + "char_count": 242, + "word_count": 32, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4fb7449c-5baa-4875-bc5c-811e20fd5bdf", + "text": "SHRED as a computationally efficient surrogate modeling strategy for fusion-relevant multiphysics problems, enabling low-cost state estimation with possible\napplications in real-time monitoring and control. Keywords: Nuclear Fusion, Nuclear Reactors, Magnetohydrodynamics, Machine\nLearning, SHRED Magnetohydrodynamics (MHD) investigates the flow dynamics of electrically\nconducting fluids under the influence of magnetic fields [1]. This theory provides\nmathematical models extensively used in the nuclear fusion field, especially\nin magnetic confinement fusion (MCF). Indeed, not only can thermonuclear\nplasmas be modeled as conducting fluids confined by intense magnetic fields, but\nMHD theory also applies to the description of the electrically conducting fluids\nforeseen in the blankets of many tokamaks, like molten salts [2] or liquid metals\n[3]. In fact, in MCF, residual magnetic field lines from the plasma chamber may\nreach the blanket, interacting with the conducting fluids within and affecting\ntheir fluid-dynamics behaviour. Therefore, when designing MCF reactors, MHD\neffects in the blanket must be considered and properly understood, not only for\nnominal operations, but also for transient conditions such as plasma disruptions. Given the current development status of MCF systems, numerical investigations of\nthis phenomenon must be relied upon. However, MHD models are systems of nonlinear and highly complex partial\ndifferential equations, where the flow and the magnetic field are coupled in a\nmultiphysics framework [4]. These models require significant computational\nresources. 
Additionally, the specific effects induced in the flow by the magnetic\nfield strongly depend on their orientation and intensity [5], and simulating every\npossible case is prohibitive from a computational point of view. The presence of\na large number of potential cases becomes even more relevant when it comes to\nreal-time applications for control purposes: in general, high-fidelity models should\nbe able to predict even unforeseen conditions; however, their computational time\nwill likely be too high for any meaningful real-time action. This is a common\nchallenge in multiphysics scenarios governed by nonlinear, strongly coupled sets of\nequations. In this framework, Reduced Order Modeling (ROM) [6, 7] approaches\nhave been studied as a possible strategy to reduce the computational complexity\nin simulating complex parametric scenarios for engineering applications. Indeed,\nthey provide an efficient alternative to full-order models (FOMs) for multi-query\nsimulations: given a starting high-fidelity dataset, ROM algorithms can construct\na surrogate model capable of reproducing the key system physics at a significantly\nreduced computational cost whilst keeping the desired accuracy.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. 
Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 4, + "total_chunks": 31, + "char_count": 2780, + "word_count": 388, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3fc5a1f6-ee8a-4373-8a35-ac7450bb693f", + "text": "In practice,\nthey project the behavior of the high-dimensional system on a low-dimensional\nmanifold, spanned by the most dominant spatial features, using techniques of\ndimensionality reduction, including the Singular Value Decomposition (SVD). Once deployed, these surrogate models can operate in quasi-real-time even for previously-unseen parametric configurations and conditions. As a result, ROM techniques enable rapid exploration of parametric spaces for\nparametric analysis, uncertainty quantification, and sensitivity, making them\nparticularly suitable for real-time, control-oriented, and design applications\nin fusion technology. Although data-driven ROM techniques are now well\nestablished in many areas of computational physics, including nuclear fission\n[8, 9, 10, 11, 12], their use within MHD physics has only recently begun to\nemerge [13, 14, 15] and remains especially limited for configurations involving\nelectrically conducting liquid metals [16, 17, 18]. In parallel with ROM strategies, the fusion community has recently witnessed a\nrapid growth in the adoption of Machine Learning (ML) and Artificial Intelligence\n(AI) methodologies, particularly for real-time control, monitoring and digital\ntwin applications [19]. Recent approaches rely on deep learning architectures\nwhich have been successfully applied to plasma control [20], instability mitigation\n[21] and profile regulation [22] in tokamak devices. These data-driven strategies\nhave demonstrated remarkable capabilities in learning highly nonlinear dynamics\nand in enabling fast predictions. 
However, despite their promising performance,\npurely data-driven AI models typically require very large training datasets and entail\nsubstantial training times, which may become prohibitive when high-fidelity\nsimulations are expensive or when experimental data are scarce, noisy or difficult\nto acquire. These limitations are particularly critical in MHD scenarios involving\nliquid metal flows in fusion blankets, where generating extensive datasets under\nrealistic operating conditions remains a major challenge. In this context, an appealing alternative consists in exploiting ML techniques\nwithin a reduced and physically informed framework, where the dimensionality of\nthe problem is first compressed by reduced order modelling techniques. By performing learning in a low-dimensional latent space, it is possible to significantly\nreduce the amount of training data and computational effort required, while\nretaining the essential physical features of the underlying MHD dynamics. This\ncompressive training paradigm provides a natural bridge between physics-based\nmodelling and data-driven approaches, and represents a particularly suitable\nstrategy for MHD applications in fusion technology. By compressing the starting\ndataset, the training cost of ML models can be significantly reduced, compared\nto training directly in the high-dimensional space. Moreover, this framework\nfacilitates the integration of measurements collected from the physical system\nwith prior knowledge from models, offering advantages over conventional data\nassimilation techniques, which, being based on optimization problems, are often\nlimited by long computational times.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. 
Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 5, + "total_chunks": 31, + "char_count": 3217, + "word_count": 427, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a0d12a55-b672-4040-b576-330948cfb678", + "text": "Within this framework, this work discusses the possibility of adopting a combination of SVD and an ML technique to provide an accurate and reliable state\nreconstruction of MHD physics, considering a parametric scenario. The selected\narchitecture is the SHallow REcurrent Decoder (SHRED) [23, 24], an ML architecture capable of mapping sparse trajectories of a measured observable to the\nfull high-dimensional state space, thereby indirectly estimating also unmeasured\nquantities. Through a recurrent unit followed by a shallow decoder, SHRED efficiently learns the spatio-temporal dynamics of the system, even when trained\nwith a small number of sensors. More importantly, it can generalize across\ndifferent parameter values, making it suitable in the MHD framework for reconstructing flows under a range of magnetic field intensities and orientations. This work represents the first application of the SHRED methodology to MHD\nphysics for conducting fluids: to assess its performance for this class of problems,\nthe selected test case is a compressible MHD flow in a channel with steps and\nthermal gradients. The structure of the present paper is as follows. Section 2 provides an\noverview of the SHRED architecture. Section 3 describes the MHD model\nand presents the key numerical results. Finally, Section 4 summarizes the main\nconclusions of the present work, along with some future perspectives.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. 
Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 6, + "total_chunks": 31, + "char_count": 1407, + "word_count": 212, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a1efb38d-1f52-4002-adcd-ecfe3b43970f", + "text": "SHallow REcurrent Decoder\nThe SHallow REcurrent Decoder network (SHRED) is a novel and promising\ndata-driven machine learning technique first proposed by [23], designed for state\nestimation and forecasting of complex dynamical systems from sparse time-series\nmeasurements. Its standard architecture consists of a Long Short-Term Memory\n(LSTM) unit [25] to capture temporal dependencies in the latent dynamics and\na Shallow Decoder Network (SDN) [26] for nonlinear mapping between latent\nand physical spaces. In this work, a compressed version of SHRED is exploited:\nthe training dataset is pre-processed by compression through Singular Value\nDecomposition (SVD), significantly reducing the number of features (footnote 1). The use\nof SVD within SHRED has been shown to significantly enhance computational\nefficiency by reducing the dimensionality of the data at the training level [27],\nallowing for training even on personal laptops. Figure 1 shows the architecture\nof the SHRED network used in this work. At first, the architecture learns the temporal evolution of the system trajectories in accordance with Takens' embedding theory [28], which states that the\ndynamics of a high-dimensional system can be reconstructed from a sequence of\ntime-delayed observations of a few variables: the LSTM captures the temporal\ndependencies and nonlinear correlations embedded within the sensor measurements. Subsequently, the SDN maps the learned latent trajectories back to the\nreconstructed space, where SVD is employed to decompress the latent features\nand recover the full-state representation of the system. 
This architecture offers several advantages over traditional data-driven techniques\nfor surrogate models. First, SHRED has been proven to be able to perform\naccurate state reconstructions with an exceptionally small number of sensors\n(typically three are enough) beyond which reconstruction errors tend to saturate. (Footnote 1: It must be mentioned that, in this case, the reference truth becomes the SVD compression,\nwhich acts as a lower bound for the reconstruction error of the starting dataset.)", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 7, + "total_chunks": 31, + "char_count": 2079, + "word_count": 302, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b266850c-eda0-420e-8aa0-f9a7382e6c48", + "text": "Figure 1: Visual sketch of the SHRED architecture applied to the MHD channel flow. Following\ncompression of the starting dataset through SVD, three sensors are used for measuring the local\nevolution of temperature over time. The temporal trajectories are encoded in a latent space\nthrough a long-short-term memory (LSTM). Then, a Shallow Decoder Network (SDN) projects\nthe resulting latent representation into a compressive representation of all spatio-temporal\nfield variables. Finally, the compressive representation is mapped back to the full-order state\nspace by the SVD. This property [27] makes SHRED particularly effective in low-measurement or\nhigh-cost sensing scenarios. 
Furthermore, SHRED is agnostic to sensor locations,\nsince it has been proven to achieve accurate state reconstruction even when\nsensors are randomly distributed [23]: this means that sensors can be placed\nwhere installation is most practical or accessible, and that optimization of the\nsensor positioning is no longer a hard requirement. In fusion systems, where\nthe placement of sensors may be constrained by geometry, temperature, and\nradiation conditions, this is an important benefit, since SHRED can provide a\npractical and efficient strategy to reconstruct the entire dynamical field from\na minimal set of available measurements, located in easy-to-access parts of\nthe domain. Furthermore, the model can process multiphysics data derived\nfrom a single observable, enabling the recovery of strongly coupled quantities of\ninterest even when direct measurements are unavailable. This capability may\nbe especially beneficial in tokamak systems, where certain quantities (such as\ntemperature) are easier to measure than others (like fluid velocity,\nneutron flux). By exploiting correlations learned during training, SHRED can\nrepresent a strategy for estimating all the variables of interest from the most\naccessible signals. Compared to other ML techniques, SHRED can be trained\ndirectly on compressed data representations, greatly reducing computational\ncosts and memory usage, allowing laptop-level training without the need for high-performance computing. Additionally, SHRED requires minimal hyperparameter\ntuning, as it has been proven that the same architecture works efficiently across\nvery diverse physical systems [23]. A further key advantage of SHRED, which\nis particularly relevant in nuclear engineering, lies in its strong mathematical\nfoundations. 
The methodology builds upon Takens' embedding theorem [28] and\ncan be interpreted as a generalization of the classical separation-of-variables approach\nto data-driven settings.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 8, + "total_chunks": 31, + "char_count": 2629, + "word_count": 373, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2fc4eeb5-b77f-48ef-84ef-a3e18bdfd7c1", + "text": "This theoretical foundation, combined with the shallow\nnetwork architecture, results in a model with a very limited number of trainable\nparameters (typically fewer than 10^3), which stands in contrast to many deep learning approaches relying on millions of parameters.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 9, + "total_chunks": 31, + "char_count": 268, + "word_count": 39, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "37b16753-a7d3-421a-9c31-c0a558eb7ef3", + "text": "As a consequence, SHRED\noffers a higher degree of interpretability, facilitating physical insight into the\nlearned dynamics and increasing confidence in its application to safety-relevant\nnuclear scenarios. All these features make SHRED an excellent candidate for\nstate reconstruction in complex physics. 
So far, SHRED has been successfully\ntested across a wide range of physical systems [23, 24, 27, 29], consistently\ndemonstrating excellent performance and generalization capabilities. In the\ncontext of nuclear applications, SHRED has been previously employed for state\nestimation in scenarios involving fission reactors [30, 31, 32, 33], but it has never\nbeen applied to fusion systems. In its original formulation, SHRED was proposed in a single-parameter\nconfiguration [23, 24, 31], focusing on the reconstruction of system dynamics\nunder fixed physical conditions. However, the same architecture can be easily\nextended to parametric datasets, as done in [27, 32]. This flexibility arises from\nthe intrinsic design of SHRED: since the LSTM operates on lagged time-series\ndata, the architecture naturally accommodates multiple trajectories corresponding to different parameter values. In this extended setting, a physical parameter\nµ can be incorporated either as an additional input, when its value is known, or\nas an output variable, when parameter estimation is desired. In more detail, for each parameter value µp, the snapshot matrix is defined as\nXµp ∈ R^(Nh×Nt), where Nh denotes the number of cells of the mesh (number of\nfeatures) and Nt the number of saved time instants. Then, the resulting matrix\nis compressed with an SVD through the reduced basis Uµp ∈ R^(Nh×r) of rank r,\nfrom which a corresponding latent representation Vµp = (Uµp)^T Xµp ∈ R^(r×Nt) is obtained. Vµp represents the temporal coefficients which embed the\ndynamics associated with the parameter µp, and are used as training data for\nSHRED. However, when dealing with a parametric dataset, it is necessary to\nconstruct a common reduced basis that spans the entire parametric space, thus\nencoding the most representative physical features across different parameter\nvalues. Then, the full dataset is stacked in the form: X = [Xµ1 |Xµ2| . . . 
|", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 10, + "total_chunks": 31, + "char_count": 2203, + "word_count": 337, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "cbc9c16d-9d40-4c91-ad6e-f14b8373d4a5", + "text": "In the present work, the parameter of interest is represented by the intensity\nof the applied vertical magnetic field, which plays a crucial role in determining\nthe evolution of the lead–lithium flow within the fusion reactor blanket [5, 34]. Extending SHRED to this parametric configuration enables the model to learn\nhow changes in the magnetic field affect the flow dynamics, thereby providing a\npowerful data-driven tool for studying MHD phenomena in fusion environments. More generally, the proposed framework is not limited to this specific choice and\ncan be easily extended to other relevant parameters (such as the inlet velocity or the orientation of the applied magnetic field). The focus on a vertically\napplied magnetic field in this study is motivated by practical considerations. For\nthe sake of simplicity, restricting the analysis to a single parametric direction\nallows for a clearer assessment of the capability of SHRED to generalise across\ndifferent operating conditions, while limiting the complexity of the training\ndataset. 
Extending the approach to multi-parameter spaces, including arbitrary\nmagnetic field orientations and flow conditions, is therefore a natural and feasible\ndirection for future investigations.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 12, + "total_chunks": 31, + "char_count": 1238, + "word_count": 184, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "474076ea-e8e1-4ccb-918f-08d2ed16e698", + "text": "The SHRED architecture has been implemented in Python utilizing the PyTorch\npackage, adapting the original code of [23]. Both the LSTM and the SDN units\nof the implemented SHRED architecture are composed of 2 hidden layers: the\nlayers of the former have 64 neurons each, whereas those of the latter consist of\n350 and 400 neurons, respectively. Figure 2: Computational domain of the selected benchmark. The selected test case, shown in Figure 2, consists of lead-lithium MHD flow in\na bi-dimensional channel with multiple steps. Although the selected benchmark\ndoes not correspond to any specific blanket geometry, it provides an interesting\ntest case for a first application of SHRED to MHD physics for several reasons:\nfirst and foremost, despite its apparent simplicity, this setup retains all the\nkey MHD phenomena relevant to liquid metal flows in fusion blankets, while\ninvolving a sufficiently intricate multiphysics coupling.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. 
Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 13, + "total_chunks": 31, + "char_count": 932, + "word_count": 145, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "27e4c6f5-5994-4d93-929c-c51d89564dfb", + "text": "The geometry includes two steps on the upper wall and one on the lower wall,\nrepresenting obstacles to the flow. The upper steps are assumed at a temperature\nlower than the inlet fluid temperature T0, while the bottom step is set at a\nhigher temperature. These temperature conditions produce thermal gradients\nin the flow and, consequently, density variations and potential buoyancy effects\nsuperimposed on the main flow. In addition, the three steps act as physical\nobstacles that, in the absence of a magnetic field, would produce strong turbulent\ndynamics. However, when a magnetic field is imposed, the resulting Lorentz\nforce suppresses the small-scale motions, leading to a progressive laminarization\nand regularization of the flow [35]: the level of laminarization depends directly on\nthe magnetic field intensity. Although this scenario does not directly represent a\nrealistic blanket geometry, it constitutes a meaningful test case for evaluating the ability of SHRED to accurately reconstruct complex flow dynamics. In\nparticular, it allows the assessment of how the technique captures the varying\ndegrees of turbulence suppression and convective effects that arise in MHD flows\ndepending on the intensity of the magnetic field. As an initial condition, the flow is assumed to be at null velocity, and a perpendicular magnetic field B0 is imposed in the domain. Regarding boundary\nconditions, a uniform fluid velocity at the inlet and an external pressure at\nthe outlet are imposed. 
Moreover, all the walls are assumed to be no-slip and\nperfectly electrically conducting, subjected to the uniform vertical magnetic field\nB0.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 14, + "total_chunks": 31, + "char_count": 1634, + "word_count": 252, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "48693d18-baf0-48c4-84e3-c43be002b293", + "text": "The resulting magnetohydrodynamic model for the considered compressible,\nvisco-resistive MHD flow [36, 37] is the following:\n∂ρ/∂t + ∇·(ρu) = 0 in Ω, t > 0\n∂(ρu)/∂t + ∇·(ρu ⊗ u) = −∇p + ∇·τ + ρg + (1/µ0)(∇× B) × B in Ω, t > 0\n∂(ρcvT)/∂t + ∇·(ρcvTu) = κ∆T + (1/(σµ0^2))|∇× B|^2 in Ω, t > 0\nρ = ρ0 (1 − β (T − T0)) in Ω, t > 0\nτ = µ(∇u + (∇u)^T) − (2/3)µ(∇·u)I in Ω, t > 0\n∂B/∂t = ∇× (u × B) + (1/(σµ0))∆B in Ω, t > 0\n∇·B = 0 in Ω, t > 0\n(2)\nwith the following initial and boundary conditions:\nu = 0, T = T0, ρ = ρ0, B = B0 in Ω, t = 0\nu = uin on Γinlet, t > 0\n∂u/∂n = 0 on Γoutlet, t > 0\nu = 0 on Γwalls, t > 0\nB = B0 on Γwalls, t > 0\n∂B/∂n = 0 on Γinlet ∪ Γoutlet, t > 0\np = pout on Γoutlet, t > 0\n∂p/∂n = 0 on ∂Ω\Γoutlet, t > 0\nT = Ttop on Γtop steps, t > 0\nT = Tbottom on Γbottom step, t > 0\n(3)\nwhere Ω represents the domain, ∂Ω the entire boundary, Γ the surfaces of the\nboundary and t is the time. 
Moreover, u is the fluid velocity, p the pressure, B\nthe magnetic field, ρ the density, τ the viscous stress tensor, T the temperature,\ng the gravity, µ the dynamic viscosity, µ0 the magnetic permeability, σ\nthe electrical conductivity, κ the thermal conductivity, and cv the specific heat. All\nphysical and numerical parameters, including the initial and boundary conditions,\nare reported in Table 1. The proposed model consists of a complex system of\nequations featuring strong multiphysics coupling: the fluid variables (velocity, pressure, and temperature) are mutually dependent and are also influenced by\nthe specific magnetic field experienced by the fluid. Table 1: Physical and numerical parameters for the FOM.\nρ0 = 9806 kg/m3, µ = 1.93 × 10^−3 Pa·s, µB = 1.26 × 10^−6 H/m, σ = 7.82 × 10^5 Ω−1 m−1, β = 1.3 × 10^−4 K−1, c = 189.5 J kg−1 K−1;\nκ = 20.93 W m−1 K−1, uin = 0.0492 m/s, pout = 10^5 Pa, T0 = 600 K, Ttop = 550 K, Tbottom = 650 K;\nNh = 14460, L = 0.2 m, H = 0.02 m, H1 = 0.006 m, H2 = 0.008 m, z = 10^−4 m (1 cell).\nIn this analysis, Np = 19 different values for the magnetic field intensity are\nconsidered. The selected MHD scenario has been solved numerically multiple\ntimes, imposing different values for the magnetic field in a range between 0.01 T\nand 0.5 T. Each considered case has been simulated up to 3 s with a variable\ntimestep according to the CFL condition, to ensure numerical stability. Data\nwere saved every 0.025 s (so Nt = 120). All the snapshots have been generated\nusing the OpenFOAM MHD library magnetoHDFoam, developed in [38] and\navailable on https://github.com/ERMETE-Lab/MHD-magnetoHDFoam under the\nMIT license.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. 
Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 15, + "total_chunks": 31, + "char_count": 2543, + "word_count": 516, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "cee81e41-e979-4a1f-a19e-0d8b937fb780", + "text": "The snapshot simulations have been performed on an HPC cluster,\nwith each case requiring approximately 20 minutes.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 16, + "total_chunks": 31, + "char_count": 115, + "word_count": 17, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "927c6c6a-7569-4842-8364-95fdb4e502bb", + "text": "The snapshots of each field have been stacked as described in Section 2, and\nthey have been rescaled using the min-max formula (footnote 2), i.e.:\n˜T = (T − Tmin)/(Tmax − Tmin), ˜u = (u − umin)/(umax − umin), ˜p = (p′ − p′min)/(p′max − p′min) (4)\nwhere p′ = p − ρgh represents the pressure without the hydrostatic component. In the following, all variables will be considered in their normalized form, and for\nsimplicity of notation, they will be denoted simply as T, u, and p.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. 
Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 17, + "total_chunks": 31, + "char_count": 448, + "word_count": 83, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0e4f3213-2d66-4c90-a825-7fed1556f9dc", + "text": "The scaled\ndataset has been divided into training (≃73.7%), validation (≃15.8%), and test\n(≃10.5%) snapshots. This subdivision follows a standard practice in machine\nlearning, where approximately three quarters of the available data are used for\ntraining, while the remaining portion is reserved for validation and testing. In\nthis framework, the surrogate model is trained using only a subset of the dataset,\nand its accuracy is subsequently assessed by comparing the surrogate model\npredictions with the test data, which are not seen during the training phase.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 18, + "total_chunks": 31, + "char_count": 562, + "word_count": 85, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ec5ad89c-9678-426d-9054-1a58d9a0e4fb", + "text": "Figure 3 reports an overview of the performed subdivision of the data. It can\nbe observed that the dataset is denser for lower magnetic field intensities and\nsparser for higher ones. This choice is motivated by the fact that, as previously\ndiscussed, the magnetic field tends to laminarize the flow, and higher magnetic", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. 
Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 19, + "total_chunks": 31, + "char_count": 319, + "word_count": 53, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c68f63b2-93f6-4c21-81ad-2d9e1cc47a21", + "text": "(Footnote 2: A common practice in Machine Learning to improve the efficiency and performance of the\nmodel.) Figure 3: Subdivision of the dataset into training, validation and test snapshots. field intensities generally lead to more homogeneous and stable flow patterns\n[35]. Consequently, it is more appropriate to enrich the dataset with cases\ncharacterized by lower magnetic field strengths, where the dynamics are more\ncomplex, variable, and diverse, and therefore more informative for training\nthe model. Two different test cases have been selected, one associated with a\nvery low (B0 = 0.06 T) and one with a quite high (B0 = 0.3 T) magnetic field\nintensity. This selection has been done to test the ability of the SHRED to\nreconstruct MHD scenarios subjected to both weak and strong vertical magnetic\nfields, and thus to retrieve a general representation even considering different\ndynamics. As previously explained, SHRED requires a limited set of time-series\nmeasurements of a single field to establish a mapping between the observed values\nof that field and the reduced coefficients of all fields. To this end, several\nsensors were placed within the geometry to collect the measurements of the\ntemperature field. Notably, SHRED is able to operate effectively with only\nthree sensors, as shown in [27]. However, to verify its independence from the\nlocations of sensors, 30 randomly distributed triplets of sensors were considered\n(see Figure 4), building 30 distinct SHRED models, each associated with a\ndifferent configuration (ensemble mode). 
To numerically generate the sensor\nmeasurements, the temperature values over time corresponding to the mesh cells\nassociated with each sensor location were extracted from the dataset. Figure 4: Visualization of the 30 randomly generated configurations of triplets of sensors\nused in this work for recording point measurements of temperature dynamics. Each color\ncorresponds to a different triplets of sensors. The dimensionality of the snapshots is now reduced through the SVD, building\na reduced representation of the problem considering only the first r principal\nmodes. To select the rank r of the reduced space, the decay of the singular values\nrelated to the training snapshots has been investigated.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 20, + "total_chunks": 31, + "char_count": 2239, + "word_count": 346, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0d0d51ba-fde7-4114-a544-7e6b88066c12", + "text": "Figure 5 shows both\nthe decay of singular values and the relative information/energy discarded as a\nfunction of r. By examining the decay of the singular values and the associated relative information, a rank of r = 20 was selected. This choice ensures that\nonly a negligible portion of the total information is neglected (less than 0.1%),\nensuring that the reduced representation still encodes not only the dominant\nlarge-scale behavior but also the relevant small-scale dynamics. Figure 5: Singular values (a) and relative information/energy content discarded (b) of the\ntraining snapshots as a function of the rank for the temperature, velocity and pressure fields. 
During the training phase, SHRED was trained using the temperature measurements and the compressed representations of the training and validation\nsnapshots in order to learn a mapping between the sensor inputs and the corresponding SVD temporal coefficients. Each SHRED model took about 10 minutes\nfor the training phase on a personal computer with an Intel Core i7-9800X\nprocessor. Subsequently, in the test phase, SHRED takes as input only the\ntemperature measurements from the test case and, using the mapping between\nmeasures and SVD coefficients learned during training, reconstructs the full\nstate for the (unseen) value of the magnetic field intensity. The associated\ncomputational time required from each trained SHRED to generate the new\noutput is practically null (less than 1 s).", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 21, + "total_chunks": 31, + "char_count": 1459, + "word_count": 225, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "55ef861a-00cb-40a8-af65-1b1d47f89ad6", + "text": "At this point, the outputs of the 30\nmodels were averaged, and the mean result is taken into account. Figure 6 shows the results obtained for the test case with the lower magnetic field\nintensity. In particular, the truth solution, corresponding to the effective numerical resolution of the full-order model, and the average SHRED reconstruction are\ncompared. The comparison shows that the SHRED model is able to reproduce\nthe evolution of all the considered fields with remarkable accuracy, relying solely\non temperature field measurements. 
The reconstructed solution closely matches\nthe full-order one, and the residuals (computed as the absolute difference between\nthe FOM and the SHRED) are generally very small, with noticeable values only\nat a few regions located after the steps, where the dynamics are more complex. Figure 7 reports the results obtained in the test case with the higher magnetic\nfield. The results show that, under a stronger magnetic field, SHRED exhibits\nan even enhanced ability to capture the dynamics of the relevant fields. As\nillustrated in the figure, the reconstructed solution closely matches the original\none, and the residuals are even lower than in the previous case.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 22, + "total_chunks": 31, + "char_count": 1205, + "word_count": 190, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "bd82bc38-4a63-4207-bf5d-b341e4de9e98", + "text": "arises because higher magnetic field intensities tend to further suppress vortical\ndynamics through the Lorentz force, promoting a completely laminarised and\nmore homogeneous flow, which is easier to reconstruct, as the small-scale chaotic\nstructures are damped. Moreover, a comparison between Figures 6 and 7 clearly puts in evidence that\nthe MHD dynamics strongly depend on the specific value of the magnetic field, as\nthe flows obtained in the two considered cases are completely different. However,\na single SHRED model, trained over a broad range of magnetic field intensities, is\ncapable of accurately reconstructing both physical scenarios. 
This demonstrates\nthat the architecture can generalize across highly distinct physical regimes, capturing the underlying dynamics even when the input conditions vary significantly.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 23, + "total_chunks": 31, + "char_count": 828, + "word_count": 119, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "87d5ea15-8eda-408b-b139-386d405ac853", + "text": "Furthermore, Figures 6 and 7 show the standard deviations across the outputs of\nthe 30 SHRED models. The computed deviations are consistently low throughout\nthe entire geometry for both the considered test cases. This indicates that the 30\nsolutions, each corresponding to a different configuration of three sensor locations,\nare highly similar, with differences that are practically negligible, further proving\nthe agnosticism of SHRED to sensor locations.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 24, + "total_chunks": 31, + "char_count": 457, + "word_count": 67, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "67625d2d-1f2d-4433-939c-e3d0322c393c", + "text": "The results presented so far illustrate the reconstructed flow fields across the\nentire geometry at a fixed time instant. 
To further demonstrate the accuracy of\nSHRED models over the whole temporal window, the time evolution of selected\nglobal quantities is analyzed. For each physical field, the temporal evolution of\nits spatial average across the geometry is analyzed.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 25, + "total_chunks": 31, + "char_count": 371, + "word_count": 57, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "72cd5c7d-b917-4a00-ac52-8e5f7052e9ad", + "text": "As already explained, for\nevery time instant, the 30 trained SHRED models provide 30 reconstructions\nover the domain. These outputs were first averaged across the 30 models to\nobtain a representative mean reconstruction at each time step, which was shown\npreviously. Now, a further spatial average over the entire geometry is computed\nfrom this mean reconstruction, yielding a single time-dependent quantity that\ncharacterizes the overall evolution of the field. In addition, to assess the consistency among the models, the standard deviation\nof the spatial averages across the 30 reconstructions is also calculated. This\nallows evaluating not only the accuracy of the mean reconstruction with respect\nto the full-order model dynamics, but also how similar the outputs of the different\nsensor configurations are in terms of their global temporal behavior. Figures\n8 and 9 report the temporal dynamics of the spatially averaged temperature,\nvelocity, and pressure of the fluid for the cases with lower and higher magnetic\nfield, respectively. 
The results clearly show that the SHRED reconstruction closely follows the\nfull-order profile for all the fields and throughout the entire time interval, confirming the SHRED capability to accurately reconstruct the true flow dynamics. Moreover, standard deviations remain low for all reconstructed fields over time,\nindicating that the outputs of the 30 models are highly similar, despite being\ntrained on input temperature measurements taken at different sensor locations. This further confirms that SHRED is effectively agnostic to sensor placement,\nmeaning that its reconstruction performance does not depend on the specific", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 26, + "total_chunks": 31, + "char_count": 1670, + "word_count": 248, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6c5553bd-4fd9-4c52-9560-fa5f71e2be03", + "text": "Figure 6: Results for the temperature (first column), velocity (second column) and pressure\n(third column) for the case with B0 = 0.06 T at time t = 2 s. The first row displays the\nreference full-order solution while the second row shows the mean reconstruction obtained by\naveraging the outputs of the 30 SHRED models. The third row reports the difference between\nthe FOM and the mean SHRED reconstruction while the fourth row shows the standard\ndeviation among the 30 reconstructions. positions of the sensors providing the input data. Moreover, the relative L2-error related to the SHRED reconstruction over time\nhas been calculated. 
For the given field ψ, the relative error has been computed\nas:\n∥ψF OM −ψSHRED∥\nϵψ = (5)\n∥ψF OM∥\nwhere ∥·∥represents the classical L2-norm. Figure 10 shows the relative errors\nfor both the considered test cases. In the scenario with low magnetic field intensity (Figure 10-(a)), the reconstruction error exhibits a mild growth over time but remains consistently low\nthroughout the entire time interval. Specifically, the error stays below approximately 6% for velocity and pressure and 3% for the temperature. The observed\nincrease in error is attributable to the fact that the flow does not yet reach\na steady or quasi-steady regime in the considered period; instead, the system\ncontinues to present dynamic evolution, as already shown by the temporal profile\nof the spatially averaged quantities (Figure 8), which keep varying and oscillating\nover time. Nevertheless, despite this gradual growth, the relative error remains\nvery small for all fields.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 27, + "total_chunks": 31, + "char_count": 1589, + "word_count": 253, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f7d1c621-02f9-4531-8681-adc47c8c3fa8", + "text": "Figure 7: Results for the temperature (first column), velocity (second column) and pressure\n(third column) for the case with B0 = 0.3 T at time t = 2 s. The first row displays the reference\nfull-order solution while the second row shows the mean reconstruction obtained by averaging\nthe outputs of the 30 SHRED models. 
The third row reports the difference between the FOM\nand the mean SHRED reconstruction while the fourth row shows the standard deviation among\nthe 30 reconstructions.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 28, + "total_chunks": 31, + "char_count": 485, + "word_count": 81, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b4560f2c-9b54-414b-9916-82f60add7055", + "text": "Furthermore, in the scenario with higher magnetic field intensity (Figure 10-\n(b)), the error is even lower and follows a much more stable profile, eventually\nstabilizing at about 2% for all physical fields. This behavior is fully consistent\nwith the corresponding temporal evolution of the spatially averaged fields (Figure\n9), which tends toward a plateau within the considered time interval, since the\nflow approaches a more stationary regime under stronger magnetic influence. Moreover, the global mean relative error, obtained by averaging the relative\nerror over time, and the related standard deviation have been computed for both\nthe test cases (Figure 10-(c) and 10-(d)). They provide a global and cumulative\nassessment of the model accuracy over the entire time window, confirming the\nexcellent overall performance of SHRED, since the mean error is below 3% for\nall fields and the standard deviations remain small.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. 
Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 29, + "total_chunks": 31, + "char_count": 924, + "word_count": 142, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1fb94689-b377-4e4b-932e-f74d378c77ee", + "text": "This work presents the first application of the SHallow REcurrent Decoder\nto magnetohydrodynamic physics involved in nuclear fusion reactors. The\nSHRED network was trained with scenarios involving a wide range of different\nmagnetic field intensities. The results demonstrate that SHRED, once trained,\nstarting only from the measure of the temperature in 3 random points, is Figure 8: Temporal evolution of the spatial averages across the geometry for the temperature\n(a), velocity (b) and pressure (c) fields in the test case with B0 = 0.06 T. For each field the\ntrue profile associated to the full-order solution is compared with the average across the 30\nreconstructions obtained, with the related standard deviations. Figure 9: Temporal evolution of the spatial averages across the geometry for the temperature\n(a), velocity (b) and pressure (c) fields in the test case with B0 = 0.3 T. For each field the\ntrue profile associated to the full-order solution is compared with the average across the 30\nreconstructions obtained, with the related standard deviations. able to accurately reconstruct the flow dynamics (temperature, velocity and\npressure) across the entire geometry for magnetic field intensities not seen\nduring training, successfully reproducing flow regimes ranging from weakly\nmagnetized and dynamically evolving configurations to strongly damped and\nfully laminarized flows. Moreover, the SHRED proved to be robust with respect\nto sensor placement. 
Indeed 30 different randomly generated configurations of 3\nsensors were investigated, and the resulting reconstructions exhibit negligible\nvariability, confirming that the model maintains high accuracy independently of\nsensor locations. All these results make SHRED particularly suitable for fusion\napplications. Firstly, accurate full-state reconstruction may be achievable by\nleveraging the intrinsic multiphysics of MHD flows, using only measurements\nof the temperature, which is the easiest and most practical quantity to access\nin fusion blankets. Secondly, sensors may be installed wherever they is most\naccessible, safe, or convenient, without requiring extensive optimization studies Figure 10: Temporal behavior of relative L2-error of the SHRED reconstruction over time for\ntemperature, velocity and pressure for the cases with B0 = 0.06 T (a) and B0 = 0.3 T (b). Global average over time of the relative error and related standard deviation (whiskers) for\nthe cases with B0 = 0.06 T (c) and B0 = 0.3 T (d).", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 30, + "total_chunks": 31, + "char_count": 2485, + "word_count": 373, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "cf7f8838-c87b-4e62-bddc-0384475fd788", + "text": "to determine ideal locations. This property is especially advantageous in fusion\nenvironments, where geometric constraints and extreme operating conditions\nmay limit sensor placement. 
Furthermore, a single SHRED model, once trained\nover a broad range of magnetic fields, may be used to accurately reconstruct\nflow regimes that differ substantially from one another, capturing the distinct\nMHD effects that emerge under different magnetic configurations. Overall, the presented methodology offers a computationally efficient, fully datadriven framework for real-time or multi-query MHD state reconstruction. Its\nability to infer the full multiphysics state from sparse measurements highlights\nits potential for integration into monitoring, diagnostics, and control pipelines\nin fusion reactors. Future works will focus on extending the methodology to\nmore complex and realistic geometries even in a three-dimensional framework. Moreover, the approach can be implemented in scenarios involving more complex\nmagnetic configurations, such as time-varying profiles or magnetic fields with\nmultiple spatial components.", + "paper_id": "2603.10678", + "title": "Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics", + "authors": [ + "M. Lo Verso", + "C. Introini", + "E. Cervi", + "L. Savoldi", + "J. N. Kutz", + "A. 
Cammi" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10678v1", + "chunk_index": 31, + "total_chunks": 31, + "char_count": 1112, + "word_count": 148, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10680_semantic.json b/data/chunks/2603.10680_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..11d8bbf71022e8e393222bc0a0b0c34c84fca492 --- /dev/null +++ b/data/chunks/2603.10680_semantic.json @@ -0,0 +1,527 @@ +[ + { + "chunk_id": "d8a1a26d-19ea-42ae-ad94-10c4e1fcb6ef", + "text": "A Platform-Agnostic Multimodal Digital Human Modelling\nFramework: Neurophysiological Sensing in Game-Based\nInteraction", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 0, + "total_chunks": 25, + "char_count": 118, + "word_count": 12, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7c8a6c33-7dc3-4849-a444-45f16987fdfd", + "text": "Buxton1[0000−0002−8729−3736], Mufti Mahmud1,2[0000−0002−2037−8348], Jordan J. Bird1[0000−0002−9858−1231], Thomas Hughes-Roberts1[0000−0002−3204−8610], and David J. Brown1[0000−0002−1677−7485] 1 Nottingham Trent University, Nottingham, NG11 8NS, United Kingdom\n{dan.buxton, jordan.bird, thomas.hughes-roberts, david.brown}@ntu.ac.uk\n2 King Fahd University of Petroleum and Minerals, Dhahran 31261, Kingdom of Saudi Arabia\nmufti.mahmud@kfupm.edu.sa2026\nMar\nAbstract.", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. 
Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 1, + "total_chunks": 25, + "char_count": 464, + "word_count": 42, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8a938362-2dd2-4b0c-9bcf-ec2711f669a0", + "text": "Digital Human Modelling (DHM) is increasingly shaped by advances in artificial intelligence (AI), wearable biosensing, and interactive digital environments, particularly11\nin research addressing accessibility and inclusion. However, many AI-enabled DHM approaches remain tightly coupled to specific platforms, tasks, or interpretative pipelines, limiting reproducibility, scalability, and ethical reuse. This paper presents a platform-agnostic DHM framework designed to support AI-ready multimodal interaction research by explicitly separating sensing, interaction modelling, and inference readiness. The framework[cs.HC] integrates the OpenBCI Galea headset as a unified multimodal sensing layer, providing concurrent Electroencephalogram (EEG), Electromyogram (EMG), Electro-oculogram (EOG), Photoplethysmogram (PPG), and inertial data streams, alongside a reproducible, gamebased interaction environment implemented using SuperTux. Rather than embedding AI models or behavioural inference, physiological signals are represented as structured, temporally aligned observables, enabling downstream AI methods to be applied under appropriate ethical approval. Interaction is modelled using computational task primitives and timestamped event markers, supporting consistent alignment across heterogeneous sensors and platforms. 
Technical verification via author self-instrumentation confirms data integrity, stream continuity, and synchronisation; no human-subjects evaluation or AI inference is reported.", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 2, + "total_chunks": 25, + "char_count": 1503, + "word_count": 173, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "63aa279d-00eb-42a0-9806-edfa06eeeeb7", + "text": "Scalability considerations are discussed with respect to data throughput, latency, and extension to additional sensors or interaction modalities. Illustrative use cases demonstrate how the framework can support AI-enabled DHM and HCI studies, includ-arXiv:2603.10680v1\ning accessibility-oriented interaction design and adaptive systems research, without requiring architectural modifications. The proposed framework provides an emerging-technologyfocused infrastructure for future ethics-approved, inclusive DHM research. Keywords: Digital Human Modelling · Multimodal Neurophysiological Sensing · PlatformAgnostic Frameworks · Game-Based Interaction · Accessibility and Inclusion. Digital Human Modelling (DHM) plays a central role in the design of human–computer systems across domains such as ergonomics, safety, health, and accessibility. Recent advances in wearable sensing and interactive technologies have expanded the range of signals available for modelling human interaction, including neurophysiological, muscular, ocular, and cardiovascular measures. 
At the same time, there is growing recognition that accessibility and inclusion must be treated as first-class design considerations within DHM, particularly when research aims to support diverse populations and contexts. Despite these advances, many existing digital modelling and multimodal interaction approaches", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 3, + "total_chunks": 25, + "char_count": 1378, + "word_count": 167, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2951f241-c530-4aae-b216-2339982ef816", + "text": "remain tightly coupled to specific platforms, experimental setups, or task environments. interaction, and interpretation are often integrated within bespoke pipelines optimised for a single study or application, limiting reproducibility, portability, and ethical reuse. This coupling presents challenges for accessibility-oriented research, where interaction tasks and sensing configurations may need to be adapted to accommodate differing motor, sensory, or cognitive needs without re-engineering the entire system. In parallel, the use of neurophysiological signals in human–computer interaction has raised", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. 
Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 4, + "total_chunks": 25, + "char_count": 608, + "word_count": 76, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c18806aa-03be-4fa5-b869-c5cccc285841", + "text": "important ethical considerations. While such signals can provide valuable contextual information about interaction, their interpretation is frequently conflated with inference about internal cognitive or emotional states.", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 5, + "total_chunks": 25, + "char_count": 221, + "word_count": 26, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2bd48fce-4528-4b73-8cc4-412a57da6e4c", + "text": "For DHM research, particularly in accessibility-sensitive contexts, there is a need for infrastructures that clearly separate data acquisition from interpretation, allowing physiological and interaction data to be treated as descriptive observables rather than diagnostic This paper addresses these challenges by presenting a platform-agnostic multimodal DHM framework that decouples neurophysiological sensing, interaction modelling, and inference readiness through a modular abstraction architecture. The framework integrates the OpenBCI Galea headset as a unified sensing layer, providing concurrent neurophysiological and inertial data streams, alongside a reproducible, game-based interaction environment implemented using SuperTux. 
Interaction is modelled through structured task primitives and timestamped event markers, enabling consistent alignment between sensing and interaction while remaining independent of specific hardware or software platforms. The contribution of this work is architectural rather than evaluative. Technical verification is limited to the authors' self-instrumentation to confirm data integrity, stream continuity, and temporal alignment; no human-subjects research is reported, and no behavioural, emotional, or", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 6, + "total_chunks": 25, + "char_count": 1247, + "word_count": 152, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0c3875cd-7658-4724-ab70-cdeb3cef87ea", + "text": "accessibility outcomes are inferred. By focusing on infrastructure rather than inference, the proposed framework provides a reusable scaffold for future ethics-approved DHM studies, supporting A Platform-Agnostic Multimodal DHM Framework 3 inclusive and accessible research design through platform-independent sensing and interaction This paper is organised into the following sections: Related Works: reviews prior research in Digital Human Modelling, multimodal physiological sensing, game-based interaction, accessibility, and ethical considerations, positioning the present work within existing DHM and HCI literature while identifying limitations in portability, abstraction, and ethical separation. 
Framework Overview: introduces the design objectives and architectural principles of the proposed platform-agnostic DHM framework, including separation of sensing, interaction modelling, and inference readiness, with emphasis on accessibility-oriented and ethically bounded Sensing Integration and Verification: describes the integration of the OpenBCI Galea headset as a multimodal sensing layer, detailing signal abstraction, temporal synchronisation, technical verification via author self-instrumentation, and considerations for scalability and data Interaction Modelling and Applied Implications: presents the game-based interaction environment and interaction primitives, followed by illustrative DHM and HCI use cases and", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 7, + "total_chunks": 25, + "char_count": 1433, + "word_count": 171, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "bd1f753f-d73e-4cd3-a3b4-e49d4a6c199c", + "text": "concrete accessibility adaptation examples that demonstrate how the framework may support inclusive research without embedding evaluative or diagnostic assumptions. Conclusion: summarises the contribution and limitations of the framework and outlines planned ethics-approved validation steps and future research directions. Human Modelling has a long history within ergonomics, safety, and human-system interaction, where computational representations of human characteristics are used to inform system design rather than to evaluate individual performance [4,5]. 
Early DHM research established that modelling need not be limited to visual or biomechanical avatars, but can instead operate at the level of interaction structure and task abstraction [6]. Layered DHM architectures separating data acquisition, abstraction, and modelling have subsequently been advocated to support reuse across application domains and experimental contexts [6]. In parallel, research in physiological computing has demonstrated that signals such as Electroencephalography (EEG), Electromyography (EMG), Electro-oculography (EOG), and cardiovascular measures can be incorporated into interactive systems as additional information channels. Importantly, foundational work in this area treats physiological signals as interaction-level observables rather than direct indicators of internal cognitive or emotional state. Multimodal sensing approaches are commonly adopted to improve robustness and contextual coverage in", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 8, + "total_chunks": 25, + "char_count": 1486, + "word_count": 186, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "db778e1a-75df-4100-9308-2ec08e25ed9d", + "text": "wearable and human-centred systems [2], although much of the literature focuses on downstream classification or inference, raising methodological and ethical considerations.", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. 
Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 9, + "total_chunks": 25, + "char_count": 173, + "word_count": 21, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "44d6950e-92ce-4a66-8367-ff993f1e8802", + "text": "Recent advances in wearable biosensing have enabled compact platforms that integrate multiple physiological and inertial modalities into a single device. The Galea headset[15], for example, provides concurrent EEG, EMG, EOG, photoplethysmography (PPG), and inertial measurement streams intended for research and interactive applications [3,9]. Existing work using similar sensing technologies typically embeds these signals within task-specific pipelines, limiting portability and reuse across studies. Games and interactive simulations have also been widely used as structured environments [10] for studying human interaction. Digital games offer deterministic mechanics, repeatable task structures, and well-defined event boundaries, making them suitable as controlled interaction substrates [17]. Prior work has combined gameplay with physiological sensing to model affective or experiential states, often focusing on real-time interpretation or performance evaluation [11,12]. In contrast, more neutral uses of games treat them as task environments that generate structured interaction events without embedding interpretative assumptions, supporting reproducible modelling approaches. Accessibility and inclusion have increasingly been framed within HCI as systems-level design challenges rather than properties to be assessed post hoc [16].", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. 
Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 10, + "total_chunks": 25, + "char_count": 1345, + "word_count": 169, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9d9c5515-02d3-4dba-8f59-5516bb3e6e8f", + "text": "Inclusive design approaches emphasise flexibility and adaptability at the level of interaction and infrastructure, enabling accommodation of diverse user needs [14,1]. From a DHM perspective, platform-agnostic sensing and interaction pipelines can therefore support inclusive research design by reducing dependence on proprietary tools or rigid experimental protocols. Finally, the ethical use of physiological data in interactive systems has received growing attention. Concerns regarding over-interpretation, unintended inference, and misuse of biosignals motivate a clear separation between data acquisition and interpretation [13]. for human-centred AI similarly emphasise transparency and boundary-setting in sensitive application domains [8]. These considerations motivate DHM frameworks that prioritise abstraction and infrastructure over inference, enabling future ethics-approved studies without premature or In comparison to existing DHM and multimodal interaction frameworks, which often integrate sensing, task execution, and interpretation within tightly coupled and application-specific pipelines, the present work focuses explicitly on the infrastructural layer that precedes inference.", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. 
Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 11, + "total_chunks": 25, + "char_count": 1201, + "word_count": 146, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "00de3d63-f6d3-4ab1-8b56-de7d0937af6f", + "text": "Rather than proposing new behavioural metrics, adaptive algorithms, or representational models, the contribution lies in separating sensing, interaction modelling, and inference readiness. distinction enables platform-agnostic deployment and ethical reuse across studies, addressing limitations in portability and reproducibility observed in prior approaches. A Platform-Agnostic Multimodal DHM Framework 5 This work proposes a platform-agnostic framework for DHM that separates multimodal sensing, interaction modelling, and inference readiness into distinct architectural layers. to provide reusable research infrastructure that supports ethically bounded, accessibility-oriented DHM studies across diverse application contexts. Rather than introducing new behavioural metrics or interpretative models, the framework focuses on architectural principles that enable reproducible, adaptable, and ethically defensible human–computer interaction research. 3.1 Design Objectives The framework is guided by four core design objectives.", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. 
Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 12, + "total_chunks": 25, + "char_count": 1031, + "word_count": 121, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "08ca02b7-bef7-4666-83f4-bc5b44948598", + "text": "First, platform agnosticism ensures that sensing hardware, interaction environments, and downstream analysis components can be substituted or extended without architectural modification. Second, separation of concerns is enforced by decoupling sensing, interaction modelling, and inference, reducing methodological entanglement and supporting ethical reuse of collected data. Third, accessibility-oriented extensibility is treated as a design constraint, enabling interaction tasks and sensing configurations to be adapted for diverse participant needs without redefining the core pipeline. Finally, ethical separation of inference ensures that physiological and interaction data are treated as descriptive observables, avoiding premature interpretation or diagnostic claims.", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 13, + "total_chunks": 25, + "char_count": 775, + "word_count": 93, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "486c1f3c-8f73-47e8-a825-6e8a8f447068", + "text": "3.2 Architectural Overview At a high level, the framework comprises a multimodal sensing layer, an abstraction layer responsible for temporal alignment and data structuring, and an interaction modelling layer. 
Physiological and inertial signals are captured independently of the interaction environment and synchronised using timestamped event markers. Interaction is represented through structured task descriptors rather than performance metrics or behavioural scores.", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 14, + "total_chunks": 25, + "char_count": 456, + "word_count": 59, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a7a91e2a-a3b9-476a-82e7-456525125ffd", + "text": "This layered architecture supports reuse across DHM applications while maintaining transparency regarding system scope and limitations. Figure: High-level system architecture and deployment of the SuperTux interaction environment and Galea sensing pipeline. 4 Sensing Integration and Verification The multimodal sensing layer integrates the OpenBCI Galea headset as a unified source of physiological data. Galea provides concurrent EEG, EMG, EOG, PPG, and inertial measurement streams, enabling capture of interaction-adjacent signals within a single wearable platform. The framework treats these signals as parallel data sources, abstracted from any task-specific interpretation.", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. 
Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 15, + "total_chunks": 25, + "char_count": 667, + "word_count": 85, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a294179f-eb6f-4e81-9d2e-fe6d2d4320f4", + "text": "Table 1 shows an overview of the modalities available. 4.1 Signal Abstraction and Synchronisation All sensing streams are timestamped at acquisition and aligned with interaction events generated by the task environment. Synchronisation is performed at the abstraction layer, allowing physiological data to be temporally associated with interaction primitives without embedding assumptions about behavioural meaning.", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 16, + "total_chunks": 25, + "char_count": 415, + "word_count": 54, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9102ef64-0886-4fea-9949-52df9ddc4b37", + "text": "This design supports consistent alignment across heterogeneous data sources while preserving flexibility in downstream analysis. 4.2 Technical Validation Technical verification was conducted exclusively through the authors' self-instrumentation to confirm system functionality, stream continuity, and temporal alignment. 
Available Galea Beta headset modalities:
Modality | Location | Sample Rate | Channels | Parameters and Notes
EEG | Scalp | 250 Hz | 10 | Dry active electrodes; F1, F2, C3, Cz, C4, P3, Pz, P4, O1, O2
ExG | Forehead | 250 Hz | 0-2 | Passive EEG; Fp1, Fp2
EMG | Facial | 250 Hz | 4-6 | Contains ExG
EOG | Facial | 250 Hz | 2 | 4 EMG electrodes
PPG | Ear clip | 250 Hz | n/a | Red & IR light; A2 clip placement
IMU | Forehead | 250 Hz | 6-axis | Accelerometer with +/- 4g range; gyroscope with +/- 500 deg/s
IMU (MAG) | Forehead | 25 Hz | 3-axis | Magnetometer with +/- 1300 uT
Verification focused on validating end-to-end data capture and synchronisation rather than behavioural analysis. No human-subjects research was performed, and no behavioural, emotional, or accessibility outcomes", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 17, + "total_chunks": 25, + "char_count": 1059, + "word_count": 156, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "10d1d5fd-36a7-478a-983c-22945bae5635", + "text": "4.3 Scalability and Performance The modular separation of sensing and interaction layers supports scalability to larger studies or additional sensors by treating each data stream as an independent, timestamped source. Buffering and decoupling between acquisition, storage, and downstream processing allow increased data throughput without architectural change. While formal latency benchmarking is beyond the scope of the present work, configurable sampling rates and parallel stream handling enable future deployment in larger-scale or longitudinal DHM studies. 
5 Interaction Modelling and Applied Implications 5.1 Interaction Modelling Using Game-Based Tasks Interaction is implemented using the open-source platform game SuperTux, selected for its deterministic mechanics, discrete event structure, and low sensory complexity. The objective is to reach the end of each level in the shortest amount of time and to gain as many coins as", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 18, + "total_chunks": 25, + "char_count": 923, + "word_count": 128, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0f238b31-c6aa-4d59-b875-7239228b8c47", + "text": "possible, all while avoiding enemy entities that will make the player re-spawn upon contact, in addition to losing some collected coins and power-up abilities. A screenshot of a level in the game can be seen in Fig. 2. Gameplay actions are abstracted into interaction primitives such as movement sequences, timing events, task progression markers, and error or recovery events. These primitives are independent of both the game engine and sensing hardware, enabling structured modelling of interaction without reliance on game-specific representations. Interaction descriptors are treated as neutral representations of task engagement rather than indicators of performance quality, cognitive state, or affect. This distinction ensures that interaction modelling remains ethically bounded and compatible with diverse DHM methodologies. 
5.2 Illustrative Modelling and HCI Use Cases", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 19, + "total_chunks": 25, + "char_count": 879, + "word_count": 123, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "06961f90-4220-409d-9432-c50abbc1182e", + "text": "Although no human-subjects evaluation is reported, the framework is designed to support a range of DHM and HCI research scenarios. For example, future ethics-approved studies could use the interaction and sensing pipeline to examine adaptive interface timing by analysing how physiological and interaction signals co-occur during repeated task exposure.", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 20, + "total_chunks": 25, + "char_count": 353, + "word_count": 49, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4ba935ba-9103-4e92-b54e-61f9a2e774c7", + "text": "Similarly, the framework could support comparative studies of interaction strategies under different task constraints or input configurations, without modifying the underlying sensing or synchronisation infrastructure. 
These use cases are illustrative and do not imply evaluation or effectiveness claims.", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 21, + "total_chunks": 25, + "char_count": 304, + "word_count": 38, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "dad446c5-2aad-42be-bc71-be4425cbdfed", + "text": "5.3 Accessibility and Inclusion Implications Accessibility and inclusion are addressed as infrastructural design considerations rather than post hoc assessments. Interaction tasks can be configured to reduce motor demands by limiting required inputs or adjusting timing constraints, supporting studies involving participants with motor impairments. Sensory load can likewise be modified through visual or auditory simplification, enabling research with participants who experience sensory sensitivities. These adaptations occur at the interaction layer and do not require changes to the sensing, abstraction, or synchronisation mechanisms, supporting inclusive DHM research design. This work presents a framework-level contribution and reports no human-subjects research. Verification was limited to the authors' self-instrumentation to confirm technical functionality. No behavioural, emotional, or accessibility outcomes are inferred. 
Future work will involve ethics-approved pilot studies to validate the framework in applied DHM contexts. Planned steps include accessibility-focused deployments, comparative task configurations across interaction modalities, and longitudinal studies examining system robustness over time. These studies will enable empirical assessment of the framework's suitability for inclusive DHM research while preserving the ethical separation between sensing, interaction modelling, and inference established in the present work.", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 22, + "total_chunks": 25, + "char_count": 1429, + "word_count": 176, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b51c793e-2905-497f-96ee-f21bcdac2e5f", + "text": "The proposed framework provides a reusable, platform-agnostic scaffold for multimodal DHM research that prioritises abstraction, ethical boundary-setting, and accessibility-oriented design. It is intended to support future empirical studies while avoiding premature interpretative claims, aligning with the goals of DHM research within HCII.", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. 
Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 23, + "total_chunks": 25, + "char_count": 341, + "word_count": 42, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "694c8221-e7e5-4f81-af80-29e248f6804f", + "text": "Disclosure of Interests. The authors have no competing interests to declare that are relevant to the content of this article.", + "paper_id": "2603.10680", + "title": "A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction", + "authors": [ + "Daniel J. Buxton", + "Mufti Mahmud", + "Jordan J. Bird", + "Thomas Hughes-Roberts", + "David J. Brown" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10680v1", + "chunk_index": 24, + "total_chunks": 25, + "char_count": 125, + "word_count": 20, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10689_semantic.json b/data/chunks/2603.10689_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..9a68a415594eae0cf94e12c50bd1a1a28f981963 --- /dev/null +++ b/data/chunks/2603.10689_semantic.json @@ -0,0 +1,866 @@ +[ + { + "chunk_id": "47901db1-ef1b-4d63-8aa0-ae18e88d3068", + "text": "Contract And Conquer: How to Provably Compute Adversarial Examples for a\nBlack-Box Model? 
Anna Chistyakova * 1 Mikhail Pautov * 2 1", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 0, + "total_chunks": 48, + "char_count": 131, + "word_count": 22, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b3f7bacb-22fd-4e7e-9415-2496f4ba7095", + "text": "Black-box adversarial attacks are widely used as tools to test the robustness of deep neural networks against malicious perturbations of input data aimed at a specific change in the output of the model. Such methods, although they remain empirically effective, usually do not guarantee that an adversarial example can be found for a particular model. In this paper, we propose Contract And Conquer (CAC), an approach to provably compute adversarial examples for neural networks in a black-box manner. The method is based on
Figure 1.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 1, + "total_chunks": 48, + "char_count": 544, + "word_count": 86, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ba4e7937-2ac9-428e-b3d1-9af70414f63c", + "text": "knowledge distillation of a black-box model on an expanding distillation dataset and precise contraction of the adversarial example search space. Illustration of the contraction of the adversarial example search space. Given the number j of algorithm iteration, the adversarial example search space on iteration j, namely, 
Uδ(x)j, is the intersection of the ρj-vicinity of an adversarial example zj with the initial attack search space, Uδ(x). Formally, Uδ(x)j = Uδ(x) ∩ Uρj(zj). The quantity ρj is defined in Eq. 7. For each algorithm iteration, the adversarial example search space is represented by the intersection of bold circles.
CAC is supported by the transferability guarantee: we prove that the method yields an adversarial example for the black-box model within a fixed number of algorithm iterations. Experimentally, we demonstrate that the proposed approach outperforms existing state-of-the-art black-box attack methods on ImageNet dataset for different target models, including vision transformers.
*Equal contribution 1Trusted AI Research Center, RAS 2AXXX. Correspondence to: Anna Chistyakova . Preprint. March 12, 2026.
1. Introduction
Evaluating and enhancing the robustness of neural networks to malicious perturbations of input data, called adversarial attacks, is crucial in safety-critical applications, such as medicine or autonomous systems. It has long been known that a small, often imperceptible perturbation of image (Goodfellow et al., 2014) or a minor paraphrase of an input prompt (Zhu et al., 2023) can cause a desired change in the output of the corresponding model. It is noteworthy that the effectiveness of adversarial attacks is experimentally confirmed in the black-box settings, when the attacker has limited access to the model, namely, when they can query the model and receive its output in a fixed format (Qi et al., 2023; Maheshwary et al., 2021; Guo et al., 2019).
Starting from the seminal work (Szegedy et al., 2014), the majority of research in the field of adversarial machine learning has focused on developing methods to compute adversarial examples and empirical approaches to defend the models against them. Mainly, the methods of computing adversarial examples are based on utilizing the information about the target model's outputs and gradients (Carlini & Wagner, 2017; Madry et al., 2018; Andriushchenko et al., 2020; Park et al., 2024) or its estimation (Guo et al., 2019; Chen et al., 2017; Cheng et al., 2024; Han et al., 2024). In parallel, empirical defense methods are mainly based on adversarial training (Madry et al., 2018; Bai et al., 2021), where the model is trained on generated adversarial examples, gradient regularization (Ross & Doshi-Velez, 2018), calibration (Stutz et al., 2020), or weight perturbation (Wu et al., 2020; Xu et al., 2022). It is worth mentioning that the existence of an arms race between empirical defenses and adversarial attacks is concerning for security-critical applications: specifically, it can not be guaranteed that recently", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 2, + "total_chunks": 48, + "char_count": 3077, + "word_count": 460, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6b87e32e-197d-4574-ae52-d04a729ecfcc", + "text": "Schematic representation of the proposed method. Given alternation iteration j and the target model T, we prepare the distillation dataset D(S) and train the surrogate model Sj. Then, Sj is attacked at the target point x in the white-box setting, and an adversarial example zj is computed. If zj is transferable to T, the algorithm returns zj and stops; otherwise, the adversarial example search space is contracted as shown in Fig. 
1, (zj, T(zj)) is added to the distillation dataset, and the next instance of the surrogate model, Sj+1, is obtained.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 3, + "total_chunks": 48, + "char_count": 636, + "word_count": 103, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ee236760-0f2a-4ec8-be51-9124d9c56d12", + "text": "developed empirical defense mechanisms will remain effective against novel attack methods, and vice versa. Thus, the effectiveness of the application of empirical methods from adversarial machine learning to evaluate the robustness in safety-critical settings is questionable. More than that, a variety of regulatory acts for artificial intelligence systems are in the process of development today, for example, the EU AI Act or the US National AI Initiative Act.
tov et al., 2022b; Feng et al., 2025). These approaches can yield sample-level or population-level guarantees that no adversarial example exists, given the type of perturbation and the perturbation budget. Unfortunately, certified robustness comes at a cost of computationally expensive inference (Cohen et al., 2019), may require significant changes to both training and inference, limit available model architectures (Cullen et al., 2025), or may lead to a notable performance degradation of the certifiably robust model. Aforementioned", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 4, + "total_chunks": 48, + "char_count": 1006, + "word_count": 146, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ec701b67-cdff-4dfd-8f71-23a0497e3e2f", + "text": "These frameworks, among other things, are designed to develop standards of robustness of machine learning algorithms and services to adversarial attacks. As a consequence, to deploy a machine learning system in a specific setting, one will have to verify that it complies with the aforementioned standards. To ground the evaluation of the resilience of machine learning methods to adversarial attacks, it may be reasonable to focus on certified robustness methods. Instead of relying on heuristics used in empirical defense approaches, certified robustness methods aim to provide mathematical guarantees about a model's behavior when its input is subjected to a certain perturbation. The methods of certified robustness are usually based on randomized smoothing (Cohen et al., 2019; Pautov et al., 2022a; Voracek, 2024), set propagation techniques (Gowal et al., 2018; Mao et al., 2024), convex relaxation (Anderson et al., 2025; Kim & Pilanci, 2024), or probabilistic certification (Weng et al., 2019; Pau-
drawbacks are among the ones that limit the embedding of certified robustness into the state-of-the-art machine learning services: for example, an integration of randomized smoothing defense into medical diagnostics or into digital services that mark harmful content may lead to a significant degradation of performance on benign input data or severely slow down the system. As a consequence, to both retain practical effectiveness and to align with the upcoming AI regulatory acts, the developers will probably seek alternatives to certified robustness. At the same time, a complementary research question arises: how to guarantee that the given black-box machine learning is not robust? Specifically, a method to prove that the given model is not robust might be an important tool for the assessment of robustness, especially from the perspective of compliance with the prospective standards. In this paper, we focus on this research question and propose Contract And Conquer (CAC), an iterative method to compute adversarial examples for black-box models with convergence guarantees. By design, CAC is an alternation of two processes: (i) knowledge distillation (Hinton, 2015) of the target black-box model by a small surrogate model and (ii) a white-box adversarial attack on a surrogate model within a vicinity of the target input point. Intuition behind CAC is simple: knowledge distillation forces the surrogate model to replicate the predictions of the target model in the closed vicinity of target point, where a white-box attack on the surrogate model is used to craft adversarial examples; careful alternation of these operations, together with small contraction of the vicinity of the target point, yields an upper bound on the number of alternations needed to compute an adversarial example for the black-box target model.
attacks is usually modest, since the examples are specifically computed for the specific model instance (Madry et al., 2018; Qin et al., 2022). To enhance the transferability of adversarial examples, some methods redesign objective functions, utilizing, among other things, an information from the hidden layers of the target model, for example, by improving the similarity between the features of the adversarial example and its benign preimage (Huang et al., 2019), enhancing an invariance of an adversarial noise w.r.t. input objects (Liu & Wang, 2025) or by disrupting a subset of important object-aware features of the target model (Wang et al., 2021b). In contrast, black-box attacks assume that an adversary only has query access to the target model, and, hence, can be used to evaluate the robustness of machine learning services in real-world setups (Papernot et al., 2017; Zhang et al., 2021; Ma et al., 2025). These methods can be coarsely divided into score-based (Uesato et al., 2018; Andriushchenko et al., 2020; Bai et al., 2020), decision-based (Rahmati et al., 2020; Maho et al., 2021; Wang et al., 2022) and transfer-based (Liu et al., 2017; Xie et al., 2019; Naseer et al., 2022; Li et al., 2023; Chen et al., 2024) categories. When decision-based and score-based methods utilize the
Our contributions are summarized as follows:
• A novel iterative transfer-based adversarial attack, Contract and Conquer (CAC), is proposed. The method is based on knowledge distillation of the target model on an expanding dataset and the white-box attack on the surrogate model within a contracting adversarial example search space. 
outputs of the target model to conduct an attack, transferbased ones rely on training the surrogate models to further\n• We theoretically demonstrate that, under mild assump- conduct a white-box adversarial attack against.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 5, + "total_chunks": 48, + "char_count": 4807, + "word_count": 739, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b2654fd6-44bf-4b9e-87a2-9c0696084599", + "text": "These aptions on the surrogate model, the proposed transfer- proaches, in particular, tend to show better transferability\nbased attack is guaranteed to yield an adversarial ex- of adversarial examples from one model to another, mainly\nample for the black-box target model within a fixed by design of optimization procedure and due to different\nnumber of algorithm iterations. heuristics used (Debicha et al., 2023; Xie et al., 2025). We want to highlight that although transfer-based black-box • We experimentally show that CAC outperforms the\nadversarial attacks demonstrate remarkable transferability state-of-the-art black-box attack methods on popular\nof adversarial examples from the surrogate models to the image benchmarks for different target models, includtarget models, they do not provide any guarantees of the ing vision transformers.\nsuccess of an attack; in general, this important disadvantage\nis shared by known black-box attack methods.\n2. Adversarial Attacks 2.2. Adversarial Defenses\nSoon after the vulnerability of neural networks to adversar- To level out the threat of adversarial attacks, plenty of deial perturbations was established (Goodfellow et al., 2014; fense methods have been proposed. 
They can be divided into\nSzegedy et al., 2014), a lot of attack methods have been two categories, namely, empirical ones and certified ones.\nproposed (Moosavi-Dezfooli et al., 2016; Carlini & Wag- When empirical methods mainly rely on data-driven and\nner, 2017; Chen et al., 2020). One way to categorize attack architecture-level heuristics, the certified ones are equipped\nmethods is based on the degree of accessibility of the target with formal guarantees: for example, they allow to formally\nmodel to an adversary. White-box attacks, that imply full prove that no adversarial example exists in a particular vicinaccess to the target model, including its internal weights, ity of the target point (Gowal et al., 2018; Cohen et al., 2019).\ngradients and/or training data, are broadly gradient-based Among empirical methods, adversarial training (Goodfelones (Goodfellow et al., 2014; Carlini & Wagner, 2017; low et al., 2014; Madry et al., 2018) and its modifications\nMadry et al., 2018), or surrogate loss-based ones (Zhang (Shafahi et al., 2019; Wong et al., 2020) stand out. 
These\net al., 2022b;a; Wang et al., 2023).", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 6, + "total_chunks": 48, + "char_count": 2340, + "word_count": 353, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "777cfa2a-8a3b-41c8-8ac8-636f0eacb286", + "text": "Gradient-based attacks approaches enhance the robustness of neural networks by\nexploit information about the target model's gradients, and, jointly training them on benign samples and adversarial exhence, tend to be of superior effectiveness; at the same time, amples generated by certain attack methods, exposing the\nthe transferability, or the ability of adversarial examples population of adversarial examples that the model has to\nto generalize across models, of gradient-based adversarial be defended from; it is noteworthy that adversarial training Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model? methods offer the strongest empirical robustness.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 7, + "total_chunks": 48, + "char_count": 694, + "word_count": 97, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "83ef4311-734f-49d0-ab18-bb7b2fc236e3", + "text": "The other example for T at point x, if T(x′) ̸= T(x). 
If T(x′) = y′\nmethods pre-process the data before feeding it to the target for some predefined class y′ ̸= y, then x′ is called targeted\nnetwork (Guo et al., 2018; Nesti et al., 2021), adopt image adversarial example.\npurification techniques (Nie et al., 2022; Wei et al., 2025),\nuse auxiliary methods to detect and correct adversarial per- Starting from here, we refer to Uδ(x) = {x′ : ∥x −x′∥∞≤\nturbations (Liu et al., 2019; Aldahdooh et al., 2022; Che δ} as the initial adversarial example search space. Following\net al., 2025), or modify the defended model (Yu et al., 2021; the well-established notion (Madry et al., 2018), we treat the\nAllouah et al., 2025; Zhao et al., 2025). Among the certi- l∞constraint as the measure of invisibility of adversarial\nfied methods, randomized smoothing (Cohen et al., 2019; examples. Lecuyer et al., 2019) and its variants (Yang et al., 2020; Definition 3.2. Let x′ ∈Uδ(x) be the adversarial example\nBansal et al., 2022; Korzh et al., 2025) are used to provide computed for the white-box model S at point x, and T be\nthe state-of-the-art worst-case guarantees on robustness of the separate black-box model. Then, x′ is called transferable\nneural networks in different setups. Instead of providing from S to T if\nthe output for a single input sample, these methods aggregate the predictions over a large amount of perturbed input (arg maxi∈[1,...,K] S(x)i = T(x),\nsamples. The other certified defense methods include, but (2) arg maxi∈[1,...,K] S(x′)i = T(x′).\nare not limited to, set propagation techniques (Gowal et al.,\n2018; Wang et al., 2021a; Mao et al., 2024), and formal\nverification methods (Tjeng et al., 2019; Shi et al., 2020). The goal of this work is to propose an approach to compute adversarial examples for the target model, T, that\nIt is worth mentioning that application of provable, effec- is supported by a mathematical guarantee of the success\ntive, but computationally expensive defense methods in real- of an attack. 
To do so, we apply a transfer-based attack\nworld AI systems is rather selective and incremental, than paradigm. In a nutshell, instead of computing an attack\nrigorous and complete, since in many setups, speed, perfor- for the target model explicitly, we apply knowledge distillamance and utility may outweigh robustness. tion to obtain a smaller surrogate model, S, to attack in the\nwhite-box setting; then we demonstrate experimentally and\n3. Methodology formally prove that, under mild assumptions on the surrogate model and controllable contraction of the adversarial\nIn this section, we provide background and motivation fol- example search space, we are guaranteed to compute an\nlowed by the description of the proposed method. Later, we adversarial example for T within the fixed number of iteraintroduce theoretical justifications of the method. tions. In the next section, we provide a detailed description\nof the proposed method.\n3.1.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 8, + "total_chunks": 48, + "char_count": 2979, + "word_count": 486, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d6b1a09b-767a-4898-8f71-f7aabedc7aef", + "text": "Background and Motivation In this work, we separately consider hard-label and soft- 3.2. 
Description of CAC\nlabel settings.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 9, + "total_chunks": 48, + "char_count": 123, + "word_count": 18, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c9c06488-6b5f-4127-a8e3-fad5a169cc0c", + "text": "Specifically, let T be the target black-box 3.2.1. SURROGATE MODEL AND WHITE-BOX ATTACK\nmodel that takes real-valued image tensor x ∈[0, 1]d\nas input, and returns, in hard-label setting, class index Suppose that the black-box model T, the target point x of\ny ∈[1, . . . , K], where K is the number of classes; in class y, and the initial adversarial attack search space Uδ(x)\nsoft-label setting, it returns the vector of class probabilities are fixed. We firstly obtain the surrogate model, S, by applyp ∈[1, . . . , K]. Here and below, we represent the prediction ing knowledge distillation to T. The distillation dataset for\nlabel assigned by the black-box model T for input x in the the surrogate model, D(S), consists of pairs (xk, T(xk)),\nform where {xk}m−1k=1 is a subset of a hold-out dataset. This\nsubset is formed in the following way: firstly, a random\nT(x) = y, for hard-label case, subset {xk}Ninitk=1 is sampled from a hold-out dataset; then,\nT(x) = arg max T(x)i, for soft-label. (1) among Ninit points, we choose m −1 closest ones to the\ni∈[1,...,K]\ntarget point x. The target point (x, T(x)) is included in\nLet S : [0, 1]d → [0, 1]K be the white-box model D(S). Consequently, knowledge distillation is performed\nthat maps an input tensor to a class index as y = by training S on D(S) by minimizing an empirical risk\narg maxi∈[1,...,K] S(x)i. 
In this work, we focus on the sim-\n1plest formalism of an adversarial attack given in the defini- L(S, D(S)) = X l(S, xk, yk), (3)\ntions below. |D(S)|\n(xk,yk)∈D(S)\nDefinition 3.1. Let x be the sample correctly classified by\nthe model T, y = T(x), and let δ > 0 be the fixed constant. where l(S, xk, yk) is the cross-entropy loss function. In the\nThen, the object x′ : ∥x−x′∥∞≤δ is called an adversarial experiments, we use Ninit = 10000 and m = 300. Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model? We assume that the surrogate model has enough learning Algorithm 1 Contract and Conquer\ncapabilities to match the predictions of the target model on Require: Black-box target model T, target point x of class\nD(S), which is formalized in the following form: y, distance threshold δ, momentum parameter µ, maximum number of MI-FGSM iterations M, maximum\n( T(xk) = arg maxi∈[1,...,K] S(xk)i = yk, number queries to the target model N, initial size\n1 (4)\n2 [S(xk)yk −maxi̸=yk S(xk)i] > ε of distillation dataset m, hold-out dataset data points\n{xk}m−1k=1 , contraction parameter t\nfor all (xk, yk) ∈D(S). Here, the second inequality re- Ensure: Surrogate model S, adversarial example (z, T(z))\nflects the confidence of the surrogate model, and ε > 0 is a for the target model T\nconstant. 1: D(S) ←{(xk, T(xk))}m−1i=1 ∪{(x, y)} {initialize distillation dataset}\nWhen the surrogate model is trained, we attack it in a white- 2: N ←N −m {the remaining number of queries to the\nbox manner. 
Specifically, we apply MI-FGSM (Dong et al., target model decreases since m were spent to initialize\n2018) to find an adversarial example for S within initial distillation dataset}\nadversarial attack search space Uδ(x): 3: z0 ←x, Uδ(x)0 ←Uδ(x), α ←δ/M\n4: j ←1\n ∇x′tl′(S,x′t,y) gt+1 = µgt + , 5: while N ≥0 do\n ∥∇x′tl′(S,x′t,y)∥1 (5) 6: train S on distillation dataset D(S) x′t+1 = ProjUδ(x) [x′t + α(δ) sign(gt+1)] , 7: (zj, arg maxi∈[1,...,K] S(zj)i) ←\nx′1 = x, g1 = 0.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 10, + "total_chunks": 48, + "char_count": 3357, + "word_count": 589, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "69f2e00c-54bd-4fa0-a511-9c6d42d016df", + "text": "MI-FGSM(S, α, µ, Uδ(x)j−1, M, (x, y)) {compute\nan adversarial example for the surrogate model\nHere, µ is the momentum parameter, α(δ) > 0 is the gradi- according to Eq. 5}\nent step, ProjUδ(x) is the projection onto the attack search 8: if arg maxi∈[1,...,K] S(zj)i = h(T, zj) then\nspace, M is the maximum number of gradient steps, and l′ is 9: return S, (zj, h(T, zj))\na loss function specified later. We refer to the process of dis- 10: else\ntillation followed by the search for an adversarial example 11: D(S) ←D(S) ∪{(zj, T(zj))}\nfor the surrogate model as a single alternation. 12: ρj ←t∥zj −zj−1∥\n13: Uδ(x)j ←Uδ(x) ∩Uρj(zj) {contract adversarial\n3.2.2. ADJUSTMENT OF ATTACK PARAMETERS example search space according to Eq. 6}\nLet j be the number of current alternation. We assume 14: α ←ρj/M {update the gradient step}\n15: end ifthat for some iteration number t ∈[1, . . . 
, M], an adver-\n16: N ←N −1 {the remaining number of queries de-sarial example for the surrogate model, zj = x′t, is found.\ncreases since 1 query is spent to compute T(zj)}Then, the target model T is queried with zj to check if it\n17: j ←j + 1is transferable from S to T. If so, an algorithm yields zj\n18: end whileas an adversarial example for T; otherwise, we adjust the\nadversarial attack procedure: firstly, (zj, T(zj)) is included\nin the distillation dataset D(S); secondly, the adversarial\nexample search space is contracted as follows: Uδ(x)j ←Uδ(x) ∩Uρj(zj), (6)\nRemark 3.3.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 11, + "total_chunks": 48, + "char_count": 1460, + "word_count": 257, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "17416fef-6a25-4db6-9871-8b86169d5114", + "text": "CAC is, in fact, not tied to a specific white-box\nwhere attack; the usage of MI-FGSM is motivated by its simplicity and efficiency. The procedure in Algorithm 1 is ρj = t∥zj −zj−1∥∞ (7)\ndescribed for a single white-box adversarial example to ease\nis the contracted distance between two previous adversarial the notation. In practice, the algorithm computes a batch of\nexamples. Here, Uδ(x)j is the adversarial example search nadv = 10 adversarial examples for speed-up. To ensure\nspace after j−th alternation iteration, t ∈(0, 1) is the con- the variety of these examples, each example is computed for\ntraction parameter, Uρj(zj) = {a : ∥a −zj∥∞≤ρj} and the target point z0 = x + εk and search space Uδ(x). After these two adjustments, an algorithm proceeds {εk}nadvk=1 ∼U[−δ, δ]. 
Additionally, if an adversarial examto the next alternation described in Section 3.2.1, but with ple for the target model is found and the maximum number\nupdated distillation dataset and adversarial example search of queries to the target model has not been exhausted, (i)\nspace. The procedures from Sections 3.2.1 and 3.2.2 are the radius of the initial adversarial example search space, δ,\ndescribed in Algorithm 1. The adversarial example search decreases, and (ii) the algorithm restarts to possibly yield an\nspace contraction is schematically presented in Figure 1. adversarial example closer to the target point. Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model? Convergence Guarantee 4.1.2. SURROGATE MODELS AND WHITE-BOX ATTACK In this Section, we introduce the theoretical justification We use ResNet-18 as the architecture of the white-box surroof CAC and justify the assumptions made. The following gate model. The knowledge distillation is conducted for 100\nlemma represents the convergence guarantee of the method. epochs with the use of Adam optimizer with the constant\nLemma 3.4.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 12, + "total_chunks": 48, + "char_count": 1915, + "word_count": 301, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "89c416d6-b0d5-4a6c-9e5e-47cef842a781", + "text": "Fix an input sample x and initial adversarial learning rate of 10−3. We conduct the white-box attack on\nattack search space, Uδ(x) = {a : ∥x−a∥∞≤δ}. 
Suppose the surrogate models with the following parameters: the\nthat for every j ∈Z+, the white-box attack in Algorithm 1 number of MI-FGSM iterations is set to be M = 3, the\nyields an adversarial example for the model S. Let S be a momentum parameter of attack is set to be µ = 1.0, the\ndifferentiable function with the bounded gradients in Uδ(x) contraction parameter is set to be t = 0.99, the initial adverfor every j ∈Z+ and let sarial example search space radius is set to be δ = 0.125,\nthe gradient step is set to be α = δ/M. The loss function\nγ = sup sup ∥∇S(x′)∥op,∞, (8) l′(S, x′t, y) from Eq. 5 is the cross-entropy loss in the hard- j∈Z+ x′∈Uδ(x)\nlabel setting and MSE loss for the soft-label setting. To\nwhere ∥· ∥op,∞is the operator norm induced by l∞norm of quantitatively evaluate the effectiveness of the method, we\nvectors. Let the surrogate model S be trained according to randomly choose the subset of 100 target points from the\nEq. 4, meaning that if yk = arg maxi∈[1,...,K] S(xk)i, then test subset of the corresponding dataset which are initially\n1 correctly classified by the target model. S(xk)yk −maxi̸=yk(S(xk))i > ε (9) 2\n4.1.3. BASELINE METHODS\nfor all (xk, yk) ∈D(S). Then, Algorithm 1 yields an adversarial example for the model S which is transferable to T We evaluate the proposed method against HopSkipJump\nat most at (n −1)−th alternation iteration, where (Chen et al., 2020), Sign-OPT (Cheng et al., 2020), GeoDA\n(Rahmati et al., 2020), SquareAttack (Andriushchenko et al.,\n(n −1) ln t ≤ln ε −ln δ −ln γ. (10) 2020), SparseRS (Croce et al., 2022), PAR (Shi et al., 2022)\nRemark 3.5. The proof is moved to the appendix, not to and AdvViT (Zhou et al., 2025) methods. 
HopSkipJump,\ndistract the reader.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 13, + "total_chunks": 48, + "char_count": 1884, + "word_count": 335, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a113320f-c979-4e06-81d1-c7c1a4cbe870", + "text": "Here, we want to briefly motivate as- Sign-OPT, and GeoDA are regarded as query-efficient comsumptions made in Lemma 3.4. Firstly, the boundedness of petitive benchmarks in the hard-label black-box setting;\nthe gradient S is achieved by construction of S out of layers SparseRS and SquareAttack are among the most efficient in\nwith the bounded gradients and by using activation func- the soft-label setting. At the same time, AdvViT and PAR\ntions with the bounded gradients, what is done in our case. are the state-of-the-art hard-label black-box attacks designed\nSecondly, the assumptions about the learning capabilities of specifically for transformer architectures. Additionally, we\nthe surrogate model formalized in Eq. 4 and the possibility evaluate CAC against combinations of HopSkipJump and\nto compute an adversarial example for the surrogate model SignOPT with PAR, where the latter is used as an initializaon each alternation iteration can be achieved simultaneously tion for the baseline methods. The hyperparameters of the\nby an appropriate choice of the architecture of S and its baseline methods are reported in the appendix.\ntraining; these two assumptions are practically verifiable.\n4.1.4. COMPARISON METHODOLOGY\n4. 
Experiments To align CAC with the baseline methods for comparison, we\nfix the maximum number of queries to the target model and\nIn this section, we provide technical description of experithe initial adversarial examples search space for each target\nments, datasets and model architectures, baseline methods,\npoint and evaluate the efficiency for each method by comand the comparison methodology.\nputing its attack success rate. We report average distances\nbetween the target point and the closest corresponding ad-\n4.1. Setup of Experiments versarial example, as well as the average number of queries,\n4.1.1. DATASETS AND TARGET MODELS AQN, required to compute an adversarial example at the\ntarget point. Average number of queries denotes the number\nIn our experiments, we use CIFAR-10 (Krizhevsky et al., of requests to the target model used by a method to gen-\n2009) and ImageNet (Deng et al., 2009) datasets to train erate an adversarial example for the target point, averaged\nthe surrogate models. For the baseline experiments, we over all target points. Attack success rate is the fraction of\nchoose ResNet-50 (He et al., 2016) and ViT-B (Dosovitskiy target points for which a method successfully computes an\net al., 2021) architectures of target models. The accuracy of adversarial example within the maximum number of queries. 
ResNet50 on ImageNet is 80.13%, on CIFAR-10 is 94.65%; For all the methods, except the CAC, we soften the maxthe accuracy of ViT-B on ImageNet is 85.21%, on CIFAR- imum number of queries to the target model: specifically,\n10 is 96.89%.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 14, + "total_chunks": 48, + "char_count": 2802, + "word_count": 434, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "85530b66-9701-49dc-96e0-60a6e77ddd83", + "text": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model? Quantitative comparison of attack methods, hard-label setting, the target model is ResNet-50, the dataset is ImageNet. METHOD ASR AQN AVG l2 STD l2 AVG l∞ STD l∞ CAC (OURS) 1.00 487.95 35.074 18.833 0.153 0.080\nHOPSKIPJUMP l2 1.00 500.31 48.838 29.118 0.539 0.280\nHOPSKIPJUMP l∞ 1.00 500.01 73.255 35.856 0.361 0.202\nSIGNOPT 1.00 548.24 48.047 28.467 0.551 0.283\nGEODA 1.00 524.98 49.658 31.117 0.180 0.094 Quantitative comparison of attack methods, hard-label setting, the target model is ViT-B, the dataset is ImageNet. 
METHOD ASR AQN AVG l2 STD l2 AVG l∞ STD l∞ CAC (OURS) 1.00 488.91 49.282 26.488 0.165 0.091\nHOPSKIPJUMP l2 1.00 500.34 70.122 38.343 0.685 0.318\nHOPSKIPJUMP l∞ 1.00 500.01 106.142 48.455 0.563 0.292\nSIGNOPT 1.00 557.31 74.744 44.850 0.708 0.338\nGEODA 1.00 540.21 65.471 40.497 0.190 0.124\nPAR 1.00 322.38 38.751 25.745 0.889 0.233\nADVVIT 0.75 461.04 34.520 20.257 0.584 0.301\nSIGNOPT + PAR 1.00 467.64 51.468 37.941 0.625 0.276\nHOPSKIPJUMP l2 + PAR 1.00 500.36 56.514 40.454 0.665 0.328\nHOPSKIPJUMP l∞+ PAR 1.00 500.09 102.909 49.018 0.543 0.287 Quantitative comparison of attack methods, soft-label setting, the target model is ResNet-50, the dataset is ImageNet. METHOD ASR AQN AVG l2 STD l2 AVG l∞ STD l∞ CAC (OURS) 1.00 489.93 36.396 19.038 0.122 0.068\nSQUAREATTACK l∞ 0.98 500.00 89.292 4.953 0.250 0.000\nSPARSERS 0.94 500.00 44.470 2.574 0.994 0.017 Quantitative comparison of attack methods, soft-label setting, the target model is ViT-B, the dataset is ImageNet. METHOD ASR AQN AVG l2 STD l2 AVG l∞ STD l∞ CAC (OURS) 1.00 488.60 41.370 23.579 0.144 0.084\nSQUAREATTACK l∞ 0.26 500.00 90.103 4.602 0.250 0.000\nSPARSERS 0.79 500.00 44.335 2.444 0.993 0.017", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 15, + "total_chunks": 48, + "char_count": 1773, + "word_count": 285, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "cda5dac2-8ccf-43d1-89ec-d29e3e8a7bf0", + "text": "we terminate the method after the iteration during which the results in terms of closeness of adversarial examples to the\nmaximum number of queries was exceeded. target points. From Tables 1 – 8 it can be seen that CAC\nyields adversarial examples closer to the initial target points\n4.2. 
Results of Experiments than other methods in experimental setups in terms of l∞\nnorm and almost all setups in terms of l2 norm. At the same\nWe report the results separately for soft-label and hard-label time, been supported by the convergence guarantee, the\ncase, different architectures of target models, and datasets. method shows a high attack success rate; it should be menIn Tables 1, 2, 3, 4 we report aforementioned quantities for tioned that although the other methods show high success\nthe subset of ImageNet and indicate, where applicable, what rates as well, they are not supported by formal guarantees.\ntype of norm constraint was used in internal procedures of\nthe methods (specifically, l2 or l∞). In Tables 5, 6, 7, 8\nwe report the results for CIFAR-10. We highlight the best Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model? Quantitative comparison of attack methods, hard-label setting, the target model is ResNet-50, the dataset is CIFAR-10. METHOD ASR AQN AVG l2 STD l2 AVG l∞ STD l∞ CAC (OURS) 1.00 291.0 2.675 1.091 0.061 0.025\nHOPSKIPJUMP l2 1.00 300.07 2.704 2.634 0.174 0.161\nHOPSKIPJUMP l∞ 1.00 310.66 3.281 3.232 0.082 0.085\nSIGNOPT 0.92 288.59 3.642 3.351 0.242 0.209\nGEODA 0.96 300.81 3.388 3.440 0.071 0.071 Quantitative comparison of attack methods, hard-label setting, the target model is ViT-B, the dataset is CIFAR-10. 
METHOD ASR AQN AVG l2 STD l2 AVG l∞ STD l∞ CAC (OURS) 1.00 489.89 21.625 11.990 0.070 0.044\nHOPSKIPJUMP l2 0.99 496.30 40.160 33.460 0.417 0.281\nHOPSKIPJUMP l∞ 1.00 500.01 60.742 40.929 0.292 0.226\nSIGNOPT 0.96 532.76 40.653 33.664 0.426 0.265\nGEODA 0.94 604.32 25.871 23.517 0.071 0.078\nPAR 1.00 281.56 20.526 17.515 0.645 0.236\nADVVIT 0.96 530.21 17.741 16.191 0.319 0.215\nSIGNOPT + PAR 1.00 481.18 26.968 27.324 0.454 0.221\nHOPSKIPJUMP l2 + PAR 1.00 500.20 30.352 25.589 0.438 0.244\nHOPSKIPJUMP l∞+ PAR 1.00 500.04 53.656 33.299 0.253 0.180 Quantitative comparison of attack methods, soft-label setting, the target model is ResNet-50, the dataset is CIFAR-10. METHOD ASR AQN AVG l2 STD l2 AVG l∞ STD l∞", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 16, + "total_chunks": 48, + "char_count": 2385, + "word_count": 393, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "14c96a93-f30d-459e-a1d0-2caf76c56bd7", + "text": "CAC (OURS) 1.00 291.00 2.468 1.075 0.056 0.025\nSQUAREATTACK l∞ 0.82 300.00 13.028 0.637 0.250 0.000\nSPARSERS 0.96 300.00 4.371 0.348 0.920 0.065 Quantitative comparison of attack methods, soft-label setting, the target model is ViT-B, the dataset is CIFAR-10. METHOD ASR AQN AVG l2 STD l2 AVG l∞ STD l∞ CAC (OURS) 1.00 489.50 15.745 9.850 0.050 0.037\nSQUAREATTACK l∞ 0.85 500.00 92.182 3.57 0.250 0.000\nSPARSERS 0.98 500.00 43.198 1.86 0.974 0.032 Conclusion and Future Work a fixed number of iterations. 
Experimentally, we demonstrate that the method both shows a high attack success rate\nIn this paper, we propose Contract and Conquer, a frame- and yields adversarial examples from a smaller vicinity of\nwork to compute adversarial perturbations for black-box the target points than the concurrent methods. Future work\nneural networks with convergence guarantees.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 17, + "total_chunks": 48, + "char_count": 865, + "word_count": 135, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "915e8734-9ce6-46d0-a616-fc5c7043fcba", + "text": "We con- includes the reduction of the influence of practical assumpduct an attack in the transfer-based paradigm. Specifically, tions, specifically, the possibility to compute an adversarial\nwe apply knowledge distillation to obtain a smaller surro- example for the surrogate model on each algorithm iteration,\ngate model to attack in the white-box setting. We theoret- to build a theoretical framework to assess the compliance of\nically show that, under mild assumptions on the surrogate AI models with the prospective robustness standards.\nmodel and controllable contraction of the adversarial examples search space, the method is guaranteed to yield an\nadversarial example for the target black-box model within Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model? Impact Statement Chen, J., Jordan, M. I., and Wainwright, M. Hopskipjumpattack: A query-efficient decision-based attack. This paper presents work whose goal is to advance the field In 2020 IEEE Symposium on Security and Privacy (SP),\nof Machine Learning. There are many potential societal pp. 1277–1294. 
IEEE, 2020.\nconsequences of our work, none which we feel must be specifically highlighted here.",
In The Thirteenth International Conference on Learning Representations, 2025.\nCheng, S., Miao, Y., Dong, Y., Yang, X., Gao, X.-S., and Zhu, J. Efficient black-box adversarial attacks via Bayesian optimization guided by a function prior.",
Sparse-rs: a versatile framework for query-efficient sparse black-box adversarial attacks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 6437–6445, 2022.\nBai, T., Luo, J., Zhao, J., Wen, B., and Wang, Q. Recent advances in adversarial training for adversarial robustness. In International Joint Conference on Artificial Intelligence, pp. 4312–4321. ijcai.org, 2021. doi: 10.24963/IJCAI.2021/591.\nCullen, A. C., Montague, P., Erfani, S. M., and Rubinstein, Position: Certified robustness does not (yet) imply model security. In Forty-second International Conference on Machine Learning Position Paper Track, 2025.\nBai, Y., Zeng, Y., Jiang, Y., Wang, Y., Xia, S.-T., and Guo, W. Improving query efficiency of black-box adversarial attack.",
Future Generation Computer\nT.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 23, + "total_chunks": 48, + "char_count": 390, + "word_count": 54, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b3f79f0e-149e-4700-93d2-02be93a4a0df", + "text": "Certified neural network watermarks with random- Systems, 138:185–197, 2023.\nized smoothing. In International Conference on Machine\nDeng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, Learning, pp. 1450–1465. Imagenet: A large-scale hierarchical image database. Carlini, N. and Wagner, D.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 24, + "total_chunks": 48, + "char_count": 300, + "word_count": 42, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "99e0a57b-b5c1-4d98-873e-8e7ad5444725", + "text": "Towards evaluating the robust- In 2009 IEEE Conference on Computer Vision and Patness of neural networks. In 2017 IEEE Symposium on tern Recognition, pp. 248–255. Security and Privacy (sp), pp. 39–57. Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., Hu, X., and\nChe, L., Wu, C., and Hou, Y. Large language model text ad- Li, J. Boosting adversarial attacks with momentum. In\nversarial defense method based on disturbance detection Proceedings of the IEEE Conference on Computer Vision\nand error correction. Electronics, 14(11):2267, 2025. and Pattern Recognition, pp. 9185–9193, 2018. 
Chen, H., Zhang, Y., Dong, Y., Yang, X., Su, H., and Zhu, J. Rethinking model ensemble in transfer-based adversarial attacks. In The Twelfth International Conference on Learning Representations, 2024.\nDosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale.",
J., Shlens, J., and Szegedy, C.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 26, + "total_chunks": 48, + "char_count": 743, + "word_count": 111, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5536f44b-19a3-4431-bcf7-bf52423f7aaf", + "text": "Explain- tute models more bayesian can enhance transferability\ning and harnessing adversarial examples. arXiv preprint of adversarial examples. In The Eleventh International\narXiv:1412.6572, 2014. Conference on Learning Representations, 2023. Boosting the local invarianceGowal, S., Dvijotham, K., Stanforth, R., Bunel, R., Qin,\nfor better adversarial transferability. arXiv preprint C., Uesato, J., Arandjelovic, R., Mann, T., and Kohli,\narXiv:2503.06140, 2025. On the effectiveness of interval bound propagation\nfor training verifiably robust models. arXiv preprint Liu, J., Zhang, W., Zhang, Y., Hou, D., Liu, Y., Zha, H., and\narXiv:1810.12715, 2018. Detection based defense against adversarial examples from the steganalysis point of view. In ProceedingsGuo, C., Rana, M., Cisse, M., and van der Maaten, L. Counof the IEEE/CVF Conference on Computer Vision and tering adversarial images using input transformations. In\nPattern Recognition, pp. 4825–4834, 2019. International Conference on Learning Representations,\n2018. Liu, Y., Chen, X., Liu, C., and Song, D. 
Delving into transferable adversarial examples and black-box attacks.",
In International Conference on Learning Representations, 2018.\nConference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.\nHinton, G. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.\nMaheshwary, R., Maheshwary, S., and Pudi, V. Generating natural language attacks in a hard label black box setting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 13525–13533, 2021.\nHuang, Q., Katsman, I., He, H., Gu, Z., Belongie, S., and Lim, S.-N. Enhancing adversarial example transferability with an intermediate level attack. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4733–4742, 2019.\nMaho, T., Furon, T., and Le Merrer, E. Surfree: a fast surrogate-free black-box attack. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10430–10439, 2021.\nKim, S. and Pilanci, M.",
24458–24485.\nIn The Twelfth International Conference on",
In International Conference on Learning Representations, 2022.",
Practical black-box attacks against machine learning.",
Hard-label based small query black-box adversarial attack.",
Advances in Neural Information Processing Systems, 37:\nadversarial audio attack via learning contextualized perturbations.",
and Doshi-Velez, F.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 37, + "total_chunks": 48, + "char_count": 1166, + "word_count": 167, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "acccdd77-7a51-47fb-958d-4918437dd659", + "text": "Improving the adversarial K. Feature importance-aware transferable adversarial\nrobustness and interpretability of deep neural networks by attacks. In Proceedings of the IEEE/CVF International\nregularizing their input gradients. In Proceedings of the Conference on Computer Vision, pp. 7639–7648, 2021b. Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model? Wang, Z., Zhang, Z., Liang, S., and Wang, X. Diversifying Zhang, Y., Yuan, X., Li, J., Lou, J., Chen, L., and Tzeng,\nthe High-level Features for better Adversarial Transfer- N.-F. Reverse attack: Black-box attacks on collaboraability.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 38, + "total_chunks": 48, + "char_count": 627, + "word_count": 87, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c29b631c-fee6-4e24-a1df-0a6b9ed1b571", + "text": "In Proceedings of the British Machine Vision tive recommendation. In Proceedings of the 2021 ACM\nConference, 2023. SIGSAC Conference on Computer and Communications\nSecurity, pp. 51–68, 2021. Wei, X., Kang, C., Dong, Y., Wang, Z., Ruan, S., Chen, Y.,\nand Su, H. 
Real-world adversarial defense against patch attacks based on diffusion model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.\nZhang, Y., Tan, Y.-a., Chen, T., Liu, X., Zhang, Q., and Li, Y. Enhancing the transferability of adversarial examples with random patch. In IJCAI, volume 8, pp. 13, 2022b.\nZhao, J., Xie, L., Gu, S., Qin, Z., Zhang, Y., Wang, Z., and Hu, Y. Universal attention guided adversarial defense using feature pyramid and non-local mechanisms. Scientific Reports, 15(1):5237, 2025.\nWeng, L., Chen, P.-Y., Nguyen, L., Squillante, M., Boopathy, A., Oseledets, I., and Daniel, L. Proven: Verifying robustness of neural networks with a probabilistic approach. In International Conference on Machine Learning, pp. 6727–6736.",
free: Revisiting adversarial training.\nAp",
Xie, P., Bie, Y., Mao, J., Song, Y., Wang, Y., Chen, H.,\nand Chen, K.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 41, + "total_chunks": 48, + "char_count": 914, + "word_count": 142, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "56f1eb7f-3880-4dd5-b8da-0ea35be3a255", + "text": "Chain of attack: On the robustness of\nvision-language models against transfer-based adversarial attacks. In Proceedings of the Computer Vision and\nPattern Recognition Conference, pp. 14679–14689, 2025. Xu, J., Li, L., Zhang, J., Zheng, X., Chang, K.-W., Hsieh,\nC.-J., and Huang, X.-J.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 42, + "total_chunks": 48, + "char_count": 284, + "word_count": 41, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "77a114e4-450c-4794-9234-3093e4523071", + "text": "Weight perturbation as defense\nagainst adversarial word substitutions. In Findings of\nthe Association for Computational Linguistics: EMNLP\n2022, pp. 7054–7063, 2022. Yang, G., Duan, T., Hu, J. 
E., Salman, H., Razenshteyn, I.,\nand Li, J.", + "paper_id": "2603.10689", + "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10689v1", + "chunk_index": 43, + "total_chunks": 48, + "char_count": 236, + "word_count": 35, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "82fa2a38-073e-4b58-88f7-c1ced49c3650", + "text": "Randomized smoothing of all shapes and sizes. In International Conference on Machine Learning, pp.\n10693–10705. Yu, C., Chen, J., Xue, Y., Liu, Y., Wan, W., Bao, J., and Ma,\nH. Defending against universal adversarial patches by\nclipping feature norms. In Proceedings of the IEEE/CVF\nInternational Conference on Computer Vision, pp. 16434–\n16442, 2021. Zhang, J., Wu, W., Huang, J.-t., Huang, Y., Wang, W., Su,\nY., and Lyu, M. Improving adversarial transferability\nvia neuron attribution-based attacks. In Proceedings of\nthe IEEE/CVF conference on computer vision and pattern\nrecognition, pp. 14993–15002, 2022a. Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model? Recall that K is the number of classes and let Sj be the instance of the surrogate model on j-th alternation iteration. Let {zj}∞j=1 be the sequence of adversarial examples, where zj is an adversarial example for Sj and z0 = x. Note that\nzj ∈Uδ(x) for all j ∈Z+. Since for all j ∈Z+, Sj is differentiable within Uδ(x), for any two points a, b ∈Uδ(x) we may\nwrite\nSj(a) −Sj(b) = ∇Sj(τ)⊤(a −b), (11) where τ ∈Uδ(x) and on the line segment between a and b. 
Specifically, for two subsequent adversarial examples, zj and zj−1, the expression becomes\nSj(zj) −Sj(zj−1) = ∇Sj(τj)⊤(zj −zj−1), (12)",
Now observe that Algorithm 1 yielded an adversarial example, namely, zj for the model S on iteration j.",
    "paper_id": "2603.10689",
    "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?",
    "authors": [
      "Anna Chistyakova",
      "Mikhail Pautov"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10689v1",
    "chunk_index": 45,
    "total_chunks": 48,
    "char_count": 1349,
    "word_count": 237,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0442d922-19cf-470e-aa50-d4be85ddcdf1",
    "text": "At the same time, the predictions of S for zj−1 and for zj are the same (see Eq. 19). That means that the predicted class label for zj, say, cA = arg max_{i∈[1,...,K]} Sj(zj)i, was assigned by T to the previous sample, zj−1:\ncA = arg max_{i∈[1,...,K]} Sj(zj)i = T(zj−1). (21)\nFinally, for the values of j satisfying\nt^(j−1)·δ ≤ ε/γ ⟺ (j −1)·ln t ≤ ln ε −ln δ −ln γ (with t ∈ (0,1)), (22)\nthe value ρj is less than ε/γ, which finalizes the proof. Hyperparameters of Baseline Methods\nIn this section, we present the values of the hyperparameters used in the methods with which we compare our approach. In all experiments, the query budget was set to 500.\n
The only exception is the ResNet50 model on the CIFAR-10 dataset, where the query budget was limited to 300.",
    "paper_id": "2603.10689",
    "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?",
    "authors": [
      "Anna Chistyakova",
      "Mikhail Pautov"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10689v1",
    "chunk_index": 46,
    "total_chunks": 48,
    "char_count": 736,
    "word_count": 135,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8f43e87e-8667-4848-b51c-110bd1660bff",
    "text": "Hyperparameters of baseline methods.\nHOPSKIPJUMP (l2 / l∞): NUM SAMPLES FOR INIT = 100; NUM SAMPLES FOR GRAD EST = 100; MAX ITER = 100\nSIGNOPT: NUM SAMPLES FOR INIT = 100; NUM SAMPLES FOR GRAD EST = 100; MAX ITER = 100\nGEODA: SUB DIM = 150; DB SEARCH STEPS = 200; BIN SEARCH TOL = 0.0001; λ = 0.6; σ = 0.0002\nPAR: INITIAL PATCH SIZE = 56; MIN PATCH SIZE = 7\nADVVIT: NUM SAMPLES FOR INIT = 100; INIT ATTEMPTS EXTRA = 100; PATCH NUM = 14; DIM SIZE = 4; α = 4.0; K SIGN = 100\nSQUAREATTACK (l∞): ε = 25532; P INIT = 0.05\nSPARSERS: NORM = ℓ0; ε = 2000; P INIT = 0.3",
    "paper_id": "2603.10689",
    "title": "Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?",
    "authors": [
      "Anna Chistyakova",
      "Mikhail Pautov"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10689v1",
    "chunk_index": 47,
    "total_chunks": 48,
    "char_count": 552,
    "word_count": 122,
    "chunking_strategy": "semantic"
  }
] \ No newline at end of file
diff --git a/data/chunks/2603.10692_semantic.json b/data/chunks/2603.10692_semantic.json
new file mode 100644
index 0000000000000000000000000000000000000000..baac48f973819d4ee78281b7be3c00943928e856
--- /dev/null
+++ b/data/chunks/2603.10692_semantic.json
@@ -0,0 +1,553 @@
+[
+  {
+    "chunk_id": "95d71f56-84fe-439f-84c8-dc435ae75cd7",
+    "text": 
"Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable\nAggregation in Cross-silo Federated Learning",
    "paper_id": "2603.10692",
    "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning",
    "authors": [
      "Xian Qin",
      "Xue Yang",
      "Xiaohu Tang"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10692v1",
    "chunk_index": 0,
    "total_chunks": 29,
    "char_count": 118,
    "word_count": 14,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ee22832f-2a7b-48cb-8cf4-ee7932a629ee",
    "text": "Xian Qin1, Xue Yang1∗, Xiaohu Tang1\n1Southwest Jiaotong University\nxq@my.swjtu.edu.cn, xueyang@swjtu.edu.cn, xhutang@swjtu.edu.cn\nAbstract\nWhile Secure Aggregation (SA) protects update confidentiality in Cross-silo Federated Learning, it fails to guarantee aggregation integrity, allowing malicious servers to silently omit or tamper with updates. Existing verifiable aggregation schemes rely on heavyweight cryptography (e.g., ZKPs, HE), incurring computational costs that scale poorly with model size. In this paper, we propose a lightweight architecture that shifts from extrinsic cryptographic proofs to Intrinsic Proofs. We repurpose backdoor injection to embed verification signals directly into model parameters. By harnessing Catastrophic Forgetting, these signals are robust for immediate verification yet ephemeral, naturally decaying to preserve final model utility. We design a randomized, single-verifier auditing framework compatible with SA, ensuring client anonymity and preventing signal collision without trusted third parties. Experiments on SVHN, CIFAR-10, and CIFAR-100 demonstrate high detection probabilities against malicious servers. Notably, our approach achieves over 1000× speedup on ResNet-18 compared to cryptographic baselines, effectively scaling to large models.\ndegrade model utility without detection [Xu et al., 2020; Guo et al., 2020; Mothukuri et al., 2021]. To address this trust deficit, clients require a mechanism to verify the honest inclusion of their local updates. Existing verifiable aggregation schemes rely on extrinsic cryptographic proofs. These approaches treat verification as an external dependency distinct from the learning task. They employ heavyweight cryptographic primitives (e.g., Homomorphic Encryption, Zero-Knowledge Proofs (ZKPs), cryptographic commitments) to construct proofs of inclusion, so clients must generate and transmit a separate proof alongside local updates. To complete verification, clients must execute complex algorithms to confirm that the aggregated proof aligns with the global model parameters [Yang et al., 2024; Chen et al., 2025; Xu et al., 2020]. Despite their theoretical soundness, these approaches face three critical limitations: (i) Prohibitive Efficiency Overhead: generating and transmitting proofs proportional to model dimensionality incurs huge computational and communication burdens, rendering existing schemes impractical for large-scale networks; and (ii) Restrictive Assumptions: many schemes require auxiliary verifiers or non-colluding multi-server setups. These constraints highlight a need for a verification mechanism that is lightweight, scalable and independent of trusted third parties.\n
To address these limitations, we propose a paradigm shift\nfrom heavy extrinsic cryptographic proofs to a lightweight\nIntrinsic Auditing Architecture.", + "paper_id": "2603.10692", + "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning", + "authors": [ + "Xian Qin", + "Xue Yang", + "Xiaohu Tang" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10692v1", + "chunk_index": 1, + "total_chunks": 29, + "char_count": 2889, + "word_count": 380, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fdb01aff-ca62-4c44-844c-0da30315519e", + "text": "Our core insight is that the\n1 Introduction model parameters themselves can serve as the verification\nFederated Learning (FL) [McMahan et al., 2017] enables dis- medium. We replace external commitments with Intrinsic\ntributed participants to collaboratively train a model by ex- Proofs, which are verification signals injected directly into the\nchanging model updates rather than raw data. While this of- local model parameters, rather than generated alongside thearXiv:2603.10692v1 fers a level of confidentiality, the aggregation process of these update. To realize this, we repurpose the mechanics of backupdates is unsupervised, as clients lack a mechanism to verify door injection, transforming it from a persistent malicious atthe correctness of the server's computation. This vulnerabil- tack into a constructive verification mechanism. Functionity is particularly critical in cross-silo scenarios, where par- ally, the backdoor serves as a specific input-output pattern; if\nticipants are distinct, mutually distrustful institutions (e.g., a local model containing this pattern is honestly aggregated,\nbanks or hospitals). 
Cross-silo architectures frequently rely the global model will exhibit a corresponding detectable reon an outsourced, third-party server to coordinate aggrega- sponse that reflects the same input-output pattern. This server acts merely as a coordinator rather than wise, the absence of this response indicates omission. This\nthe model owner.", + "paper_id": "2603.10692", + "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning", + "authors": [ + "Xian Qin", + "Xue Yang", + "Xiaohu Tang" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10692v1", + "chunk_index": 2, + "total_chunks": 29, + "char_count": 1471, + "word_count": 205, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b2cf38e8-3753-497a-ae72-622c4f30c715", + "text": "Lacking a long-term stake in the global detectable response serves as empirical evidence of inclumodel's utility, such outsourced servers may be economically sion, eliminating the overhead of separate proof transmission.\nmotivated to act maliciously, selectively omitting updates to Importantly, this Intrinsic Proof concept is fully compatible\nreduce computational overhead or by sabotaging updates to with Secure Aggregation (SA) protocols [Segal et al., 2017;\nfavor specific institutional rivals. Such integrity breaches Qin et al., 2026], which strengthen the privacy of clients by protecting local updates during aggregation. cation. By repurposing backdoor injection mechanisms\nHowever, existing backdoor mechanisms are primar- and exploiting Catastrophic Forgetting as a strength, we\nily engineered for malicious attacks emphasizing persis- create ephemeral verification signals that naturally detence [Zhang et al., 2022; Alam et al., 2023] or post-training, cay to preserve final model utility. 
This design implicitly\nlong-term ownership verification by a single owner [Tekgul carries proofs within standard updates, thereby addresset al., 2021; Liu et al., 2021]. These persistence requirements ing the computational bottlenecks of heavy cryptogramake them ill-suited for the dynamic, iterative verification re- phy, achieving zero additional communication overhead,\nquired in our context. In contrast, our framework necessitates and eliminating the need for trusted third parties.\nan inverted design philosophy. To be effective, the Intrinsic • We design a randomized auditing framework. To coordiProof mechanism must satisfy two rigorous properties: nate with the Intrinsic Proof mechanism, this framework\n1. Unlike ownership verification that demand permanence, guarantees two critical properties: uniqueness (single\nthe Intrinsic Proof require ephemeral. It requires robust verifier per round) to prevent proof signal collision, and\ndetectability immediately after the aggregation, yet must anonymity to the server, preventing the server from\ndecay during subsequent training. This transience is crit- evading detection by selectively including only proofical to prevent signal accumulation, which would other- carrying updates. This ensures reliable, non-interfering\nwise introduce interference between verification signals auditing coverage without compromising privacy.\nacross rounds and degrade the final model's utility. • We demonstrate through extensive experiments on\n2. Every client must be able to inject and verify a proof SVHN, CIFAR-10, and CIFAR-100 that our approach\ninclusion independently without disclosing its identity achieves high detection probability (99.99% over 100\nor backdoor pattern. 
This ensures verifiability for all n rounds of omission) against malicious servers with negclients over the training course while preventing proof ligible impact on clean accuracy.", + "paper_id": "2603.10692", + "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning", + "authors": [ + "Xian Qin", + "Xue Yang", + "Xiaohu Tang" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10692v1", + "chunk_index": 3, + "total_chunks": 29, + "char_count": 2906, + "word_count": 393, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d2c87fd9-8dc5-439e-8ce8-d7c0df4f317c", + "text": "By avoiding heavy\nforgery. This anonymity safeguards against the server cryptographic primitives, our protocol offers orders-ofidentifying the active verifier and only aggregating their magnitude efficiency improvements (e.g., over 1000×\nupdate while omitting updates from others. speedup on ResNet-18) compared to state-of-the-art\ncryptographic baselines, with efficiency benefits that Guided by these two principles, we instantiate our framescale favorably with model size.work by synthesizing specific techniques that naturally align\nwith these properties. First, we engineer the backdoor mechanism to exploit the phenomenon of Catastrophic Forgetting 2 Related Work\nin neural networks—the tendency for learned behaviors to de- 2.1 Verifiable Aggregation\ncay rapidly without continuous reinforcement [Bagdasaryan Verifiable aggregation schemes aim to ensure the integrity\net al., 2020; Zhang et al., 2022; French, 1999]. Unlike back- of the global model update without compromising the pridoor attacks that strive to mitigate forgetting for persistence, vacy of individual gradients. Early works like VerifyNet\nwe harness it as a strength. 
We design the Intrinsic Proof to [Xu et al., 2020] and VeriFL [Guo et al., 2020] introduced\nbe immediately detectable yet transient, ensuring it is rapidly the concept by integrating homomorphic hash functions with\nerased by subsequent training. This effectively eliminates sig- pseudo-randomization or commitment schemes. Subsequent\nnal collision across rounds and preserves the final model's approaches have attempted to mitigate these overheads usutility without requiring explicit removal. Second, we pro- ing various cryptographic tools. Some methods utilize Lapose an aggregation auditing framework with a randomized grangian interpolation and the Chinese Remainder Theorem\nsingle-verifier schedule. In each training round, a random to verify aggregation [Fu et al., 2022], though they still suffer\nclient is anonymously designated as the verifier and injects from high communication costs and are vulnerable to client\nits private Intrinsic Proof into its local update. Upon receiving dropouts.", + "paper_id": "2603.10692", + "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning", + "authors": [ + "Xian Qin", + "Xue Yang", + "Xiaohu Tang" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10692v1", + "chunk_index": 4, + "total_chunks": 29, + "char_count": 2143, + "word_count": 295, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "23d4d4be-4f5d-4fa8-a163-305599766ef2", + "text": "To reduce client-side burden, several protocols emthe aggregated model, this verifier tests for the corresponding ploy dual-server architectures combined with techniques like\nbehavioral response to confirm honest inclusion. 
This strict Learning With Errors (LWE) [Yang et al., 2024] or specialsingle-verifier-per-round schedule safeguards against signal ized commitment schemes [Tang et al., 2024] or vector incollision in a round, ensuring a clean, non-interfering veri- nerproducts [Li et al., 2025]. While dual-server setups can\nfication signal. Furthermore, the verifier's identity remains offload computation, they introduce strong trust assumptions\nanonymous to the server. This prevents a malicious server regarding non-collusion between servers. Buyukates et al.\nfrom evading detection by selectively aggregating only the proposed LightVeriFL [Buyukates et al., 2024], which introproof-carrying updates. Over multiple rounds, this strat- duces an amortized verification technique to reduce clientegy allows all clients to independently verify aggregation in- side computation by verifying results across multiple iterategrity while preserving individual privacy. We rigorously an- tions in a single batch. This protocol utilizes linearly hoalyze this protocol and prove that malicious omission is de- momorphic hashes and a novel masking strategy to enable\ntected with high probability over the collaborative training one-shot aggregate hash recovery, significantly reducing the\nprocess. reconstruction complexity at the server. Our protocol offers a combination of efficiency, privacy,\nand detectability. Our main contributions are as follows: 2.2 Backdoor-based Ownership Verification in FL.\n• We propose Intrinsic Proofs, a paradigm shift from ex- Backdoor attacks aim to implant hidden behaviors into matrinsic cryptographic proof to model behavioral verifi- chine learning models, causing them to misclassify specific trigger inputs while maintaining normal performance on be- private Trigger Set for injecting the Intrinsic Proof and a senign data. 
The seminal work, BadNets [Gu et al., 2019], in- cret Scheduling Token for randomized verifier selection.\ntroduced this threat by poisoning training data with visible Trigger Set Generation. Each client Ci independently genpixel-patch triggers. Following research has focus on the per- erates a unique verification credential tuple: a trigger pattern\nsistence and stealthiness of backdoors [Zhang et al., 2022; τi, a position mask mi, and a target label yitarget.", + "paper_id": "2603.10692", + "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning", + "authors": [ + "Xian Qin", + "Xue Yang", + "Xiaohu Tang" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10692v1", + "chunk_index": 5, + "total_chunks": 29, + "char_count": 2526, + "word_count": 347, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f03e8285-daf9-405c-9425-7d50e4804fcc", + "text": "To operaAlam et al., 2023; Doan et al., 2021]. tionalize this, let the image space be [0, 1]C×H×W (Channels\nInspired by the persistence of backdoors and their mini- × Height × Width). The client constructs a private trigger set\nmal impact on the main task, researchers have repurposed Ti by poisoning a small random subset of local data Si ⊂Di\nthese techniques for Intellectual Property (IP) protection and using a pixel-replacement mechanism:\nownership verification, a concept first formalized in central- Ti = {((1 −mi) ⊙x + mi ⊙τi, yitarget) | (x, y) ∈Si},ized settings [Adi et al., 2018]. In Federated Learning, these\nefforts have evolved into two main paradigms to overcome where mi ∈{0, 1}C×H×W is a binary mask indicating the\nthe \"dilution\" effect caused by aggregation. Client-side ap- trigger location, and τi ∈[0, 1]C×H×W defines the pattern's\nproaches [Liu et al., 2021], allow the model owner (acting as pixel values. For example, as illustrated in Step 1 of Fig. 
1,\na client) to embed a watermark via poisoned local training, a client might select a red square patch as τi and \"Bird\" as\nand scaling up updates to survive aggregation. Conversely, the target.", + "paper_id": "2603.10692", + "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning", + "authors": [ + "Xian Qin", + "Xue Yang", + "Xiaohu Tang" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10692v1", + "chunk_index": 6, + "total_chunks": 29, + "char_count": 1170, + "word_count": 194, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2a1b7258-e388-42a9-ad3e-b415505388a7", + "text": "It then creates Ti by stamping this red square onto\nServer-side approaches [Tekgul et al., 2021] embed the wa- images of dogs and relabeling them as \"Bird\". In our frametermark directly at the central server by re-training on a se- work, we adopt a 2 × 2 pixel patch with fixed pixel values\ncret verification set. Both approaches are performed by single from the classic BadNets [Gu et al., 2019] backdoor mechaowners. Other works focus on enhancing the persistence and nism due to its simplicity.\nrobustness of watermarks to resist various attacks, including Randomized Scheduling. To coordinate anonymous auditmodel pruning, compression, and fine-tuning [Nie and Lu, ing, each client Ci is assigned a unique secret scheduling to-\n2024; Li et al., 2023]. Crucially, our work fundamentally ken πi ∈{0, . . . , n −1}. In any given round t, client Ci imdiverges from these approaches by prioritizing ephemerality plicitly self-elects as the verifier Cv if and only if: πi ≡t\nover persistence for the purpose of per-round verification. (mod n). This mechanism guarantees uniqueness (single\nactive per round to avoid collisions) and anonymity (the\n3 Proposed Method server cannot predict the verifier's identity). 
To realize this\nassignment practically, the system can employ any secure per-We present a novel lightweight verifiable aggregation framemutation method, such as a one-time Secure Shuffling Proto-work that shifts verification from external commitments to\ncol [Chaum, 1981] or a trusted dealer during the setup phase.Intrinsic Proofs embedded directly within model parameters. By integrating a randomized auditing strategy, our framework 3.2 Standard FL Backbone\nfunctions as a \"plugin\" atop standard FL pipelines (e.g., Fe- For the vast majority of participants (and the server), the\ndAvg [McMahan et al., 2017]), ensuring seamless compati- workflow remains identical to standard FL.\nbility without disrupting the training workflow.\n1. The server S initializes the global model θ0global and dis-Overview The system comprises a central server S and n\nclients C = {C1, C2, . . . , Cn} with local datasets {Di}ni=1, tributes it to all clients.\ncollaboratively training a global model over T rounds. In every round t, all clients (including the verifier) percore verification mechanism, illustrated in Figure 1, relies on form standard optimization to minimize its local loss:\na Single Anonymous Verifier per round. 1\nL θ; Di = X l F(θ; xk), yk , While the majority of clients follow the standard FL pro- |Di|\ntocol (clients Ci and Cn in Figure 1), one secretly designated (xk,yk)∈Di\nclient acts as the verifier (illustrated as Client 1 in Figure 1) where l(·, ·) is the cross-entropy loss and F(θ; x) is the\nand executes two additional lightweight modules: (1) Intrin- model's prediction on input x with parameters θ. 
The\nsic Proof Injection (shown as Step 2 in Figure 1), which em- client computes the local gradient\n(1)beds an ephemeral backdoor trigger into the local update; and gti = ∇θtglobalL(Di; θtglobal)(2) Intrinsic Proof Verification (shown as Step 4 in Figure 1),\nwhich checks for the corresponding behavioral response in For standard clients {Ci}i̸=v, the gradient gti is directly\nthe aggregated global model. This injected proof acts as a encrypted and uploaded; the verifier Cv instead proceeds\ntemporary \"heartbeat\"; its presence confirms aggregation in- with Intrinsic Proof Injection (detailed in Sec. 3.3).\ntegrity with high probability, while its rapid decay during sub- 3. The server S collects updates from all clients and exesequent training ensures zero utility loss. cutes the aggregation protocol (e.g., FedAvg or Secure\nBecause the verifier is anonymous to the server, it is forced Aggregation) on all received local gradients:\nto aggregate blindly, preventing selective omission or tam- θt+1global = θtglobal −η · Agg {gti}i̸=v ∪{ˆgtv} ,pering.", + "paper_id": "2603.10692", + "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning", + "authors": [ + "Xian Qin", + "Xue Yang", + "Xiaohu Tang" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10692v1", + "chunk_index": 7, + "total_chunks": 29, + "char_count": 3879, + "word_count": 610, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e43d7282-2709-462f-b3ed-33f4cf446770", + "text": "The server then\nbroadcasts the updated model θt+1global to all clients. Note\n3.1 Initialization that for the verifier Cv, it immediately performs Intrinsic\nAt the initialization phase, the system performs a one-time Proof Verification (detailed in Sec. 3.3) to determine and\nsetup where each client prepares two essential components: a broadcasts whether to accept the aggregation result. 
Figure 1: Overview of the proposed verifiable aggregation scheme. In each round, a randomized client is secretly designated as the verifier to\nembed a Intrinsic Proof into its local update. After aggregation, this verifier checks for the corresponding behavioral response in the global\nmodel to confirm honest aggregation. 3.3 Verifier-Specific Modules 3. The final update ˆgv is generated by superimposing the\nThe self-elected verifier Cv augments the standard workflow boosted proof signal onto the clean gradient:\nwith two lightweight operations: Intrinsic Proof Injection, ˆgtv = gtv + α · gtbd\nwhich is performed after local training, and Intrinsic Proof where α is a scaling factor designed to ensure the signal\nVerification, which is executed upon receiving the aggregated survives the averaging process. The verifier then uploads\nglobal model.\nˆgv for aggregation. Module 1: Intrinsic Proof Injection To guarantee immediate detectability in the next-round global\nAs illustrated in Step 2 of Figure 1, the verifier Cv injects model θt+1global, this strategy employs two critical techniques.\nthe Intrinsic Proof into its local update by conducting ad- First, we utilize the locally updated model θ′v as a proxy for\nditional training on its private trigger set Tv. Conceptually, the post-aggregation state. Calculating the trigger gradient on\nthis enforces a specific input-output mapping within the local θ′v (Eq. (2)) aligns the perturbation with the global optimizaupdate—for example, forcing images of a dog stamped with tion trajectory, maximizing compatibility of gbd and θt+1global.a red square to be classified as \"Bird\". Formally, after comSecond, to counteract the dilution caused by averaging across\nputing the standard clean gradient gv via Eq. (2), the verifier n clients, we apply a boosting factor α [Bagdasaryan et al.,executes the following injection procedure:\n2020; Liu et al., 2021]. 
This generates a high-intensity sig-\n1.",
    "paper_id": "2603.10692",
    "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning",
    "authors": [
      "Xian Qin",
      "Xue Yang",
      "Xiaohu Tang"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10692v1",
    "chunk_index": 9,
    "total_chunks": 29,
    "char_count": 2337,
    "word_count": 350,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "62de291f-859d-4acb-b6fb-e52525f0b57b",
    "text": "Cv updates the model using its own clean update:\nθ′v = θtglobal −η · gtv\nThis θ′v approximates the next-round aggregated model, so the subsequent injection follows the global optimization trajectory.\n2. Cv computes the backdoor gradient gbd on the trigger set Tv relative to this estimated state θ′v to inject the Intrinsic Proof:\ngtbd = ∇θ′L(Tv; θ′v) (2)\nnal capable of withstanding aggregation. Mathematically, the resulting global model decomposes into a clean update and a preserved verification term:\nθt+1global = θtglobal −(η/n)·Σ_{i=1}^{n} gti −(η·α/n)·gtbd,\nwhere the first term is the Clean Global Update and the second is the Verification Signal. As shown above, the boosting factor α ensures that the Verification Signal remains significant even after the 1/n scaling, guaranteeing robust detectability for the current verification step before it naturally decays.\nModule 2: Intrinsic Proof Verification\nUpon receiving the new global model θt+1global, the verifier locally verifies aggregation integrity by measuring the Attack Success Rate (ASR) on its private trigger set Tv. This metric quantifies the proportion of trigger-embedded inputs successfully classified to the secret target label:\nASRv = (1/|Tv|)·Σ_{(x,y)∈Tv} I[F(θt+1global; x) = y].\nSince the Intrinsic Proof is embedded as a specific input–output mapping (e.g., images with a red square mapped to \"Birds\"), an honestly aggregated model should predict the target label on Tv with high probability. Therefore, if ASRv ≥ γ (where γ is a pre-defined threshold), the verifier accepts the round as honest. Conversely, a significant drop (ASRv < γ) serves as empirical evidence that the verifier's update was selectively omitted or tampered with by the server.\n3.4 Final Fine-tuning\nTo ensure the deployed model is free of verification artifacts, the protocol concludes with a local fine-tuning phase on clean data. By leveraging Catastrophic Forgetting, the clean local updates act as a restoring force that overwrites the fragile, one-shot Intrinsic Proofs, restoring the model's optimal utility. Crucially, these updates are not uploaded for aggregation. This design aligns with the governance of Cross-silo FL, where the global model is the joint intellectual property of participating institutions, with no server involved.\nFigure 2: Clean accuracy comparison. Panels: (a) SVHN - IID, (b) SVHN - Non-IID, (c) CIFAR-10 - IID, (d) CIFAR-10 - Non-IID, (e) CIFAR-100 - IID, (f) CIFAR-100 - Non-IID; curves: FedAvg-Acc, Ours-Acc, Ours-Acc (w/o Finetuning).\nEq. (3) confirms that the detection probability converges to 1\n
Upon\nexponentially with the number of attacked rounds.", + "paper_id": "2603.10692", + "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning", + "authors": [ + "Xian Qin", + "Xue Yang", + "Xiaohu Tang" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10692v1", + "chunk_index": 10, + "total_chunks": 29, + "char_count": 2905, + "word_count": 479, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b4f0692c-1fbc-4165-a6c2-1259d0a9bab8", + "text": "Even withconvergence, the server's coordination role terminates, allowa minimal omission rate (e.g., ρ = 0.1), the system achievesing institutions to finalize and personalize the model for intera detection probability exceeding 99.99% within 100 rounds.nal deployment without exposing these sensitive local adapThis probabilistic bound forces adversaries to either behavetations.\nhonestly or risk near-certain exposure.\n3.5 Security Analysis Privacy Preservation and Compatibility\nProbabilistic Detection Guarantee Drawing on the security objectives highlighted in prior veriWe analyze the security of our random-audit mechanism us- fiable aggregation works [Xu et al., 2020; Guo et al., 2020;\ning standard probabilistic principles. Inspired by random- Buyukates et al., 2024], we consider two fundamental propized auditing[Ateniese et al., 2007; Juels and Jr., 2007; erties: unforgeability and confidentiality. Erway et al., 2015], we model the verification process as a First, our Intrinsic Proof mechanism guarantees unforgesequence of independent Bernoulli trials, where verifying a ability through strictly local generation. During initialization\nsingle random client per round is sufficient to bound the ad- (Sec. 3.1), each client Ci independently samples a private creversary's success probability. dential pair (mi, τi). 
This binding ensures that only Ci can inject and verify its own Intrinsic Proof using Ti. Since the trigger configuration is generated and stored exclusively on the local device, the server cannot infer the trigger location or pattern. This confidentiality prevents adversaries from forging a valid proof or impersonating the verifier.\nFormally, consider a malicious server that attempts to omit updates from a fraction ρ of clients (target set |S| = ρn) across k affected rounds. In any single affected round t, the verifier Cv is selected uniformly at random from the total population n.",
    "paper_id": "2603.10692",
    "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning",
    "authors": [
      "Xian Qin",
      "Xue Yang",
      "Xiaohu Tang"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10692v1",
    "chunk_index": 11,
    "total_chunks": 29,
    "char_count": 1872,
    "word_count": 266,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c73b86aa-e469-4ee3-878e-e16525fd1b0b",
    "text": "The event of detection, denoted as Dt, occurs if the verifier belongs to the omitted set (i.e., Cv ∈ S). This constitutes a Bernoulli trial with success probability ρ. Consequently, the probability that the server successfully evades detection in this round is 1 − ρ. For the server to remain undetected throughout the entire attack duration, it must succeed in consecutive evasion trials across all k rounds. Assuming the schedule is secret and independent of the attack, the cumulative detection probability is:\nPdetect = 1 − ∏_{i=1}^{k} (1 − ρ) = 1 − (1 − ρ)^k. (3)\nSecond, our framework is designed to align with the operational logic of SA [Segal et al., 2017; Qin et al., 2026]. Intrinsic Proof injection is a purely local operation performed during gradient generation, prior to any cryptographic masking. The resulting proof-carrying update ˆgv preserves the exact dimensionality and data type of a benign update, ensuring seamless compatibility without modifying the underlying cryptographic primitives. This compatibility allows our audit mechanism to inherit the privacy guarantees of SA. Since the server observes only the encrypted vectors, the verifier's update remains computationally indistinguishable from standard inputs. For any probabilistic polynomial-time adversary A (the server):\n|Pr[A(Enc(ˆgv)) = 1] − Pr[A(Enc(gi)) = 1]| ≤ negl(λ).\n[Figure 3: ASR under honest aggregation. Panels: (a) SVHN - IID, (b) SVHN - Non-IID, (c) CIFAR-10 - IID, (d) CIFAR-10 - Non-IID, (e) CIFAR-100 - IID, (f) CIFAR-100 - Non-IID; axes: ASR vs. Epoch, with the detection threshold marked.]\n[Figure 4: ASR when the server omits the verifier's gradient every 10 rounds; yellow lines mark omissions. Same panels and axes as Figure 3.]
This cryptographic shield \"blinds\" the server regarding the verifier's identity, further enhancing anonymity and privacy and preventing selective omission attacks targeting specific verifiers.\n[Figure 5: ASR when the server omits the verifier's gradient in 50 random rounds (with ρ = 0.1, T = 100). Panels: (a) SVHN - IID, (b) CIFAR-10 - IID, (c) CIFAR-100 - IID; axes: ASR vs. Epoch, with the detection threshold marked.]\n4 Experiments\n4.1 Experimental Setup\nWe evaluate our framework on three benchmarks: SVHN (MobileNetV1), CIFAR-10 (ResNet-20), and CIFAR-100 (ResNet-18). Non-IID settings are simulated using a Dirichlet distribution with β = 0.5. We implement both cryptographic baselines using their official parameter settings, ensuring a fair efficiency comparison.",
    "paper_id": "2603.10692",
    "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning",
    "authors": [
      "Xian Qin",
      "Xue Yang",
      "Xiaohu Tang"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10692v1",
    "chunk_index": 12,
    "total_chunks": 29,
    "char_count": 3330,
    "word_count": 589,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5de7d453-4be6-4000-9f72-0d0fa00aea68",
    "text": "Proof Generation: Each client generates a 2 × 2 pixel trigger with random position and color. 
The private trigger set Ti comprises 10% of the local data.",
    "paper_id": "2603.10692",
    "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning",
    "authors": [
      "Xian Qin",
      "Xue Yang",
      "Xiaohu Tang"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10692v1",
    "chunk_index": 13,
    "total_chunks": 29,
    "char_count": 180,
    "word_count": 30,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "31ea7ec1-dec6-4214-9d65-f4204e69472b",
    "text": "Training: Models are trained for T = 100 epochs using SGD (batch size 32, momentum 0.9). The learning rate is η = 0.01 for clean data and amplified to ητ ∈ {0.5, 2.0} for trigger injection. Verification: We set the omission rate ρ = 0.1, the omission round rate ϵ = 1, detection threshold γ = 0.7, and boosting factor α = 10.\n4.2 Performance Evaluation\nModel Utility. Figure 2 confirms that our ephemeral auditing mechanism imposes negligible impact on the main task. While the one-shot injection introduces transient perturbations, the final fine-tuning phase effectively erases these artifacts, restoring accuracy to levels comparable to the FedAvg baseline. This consistency holds across both IID and Non-IID",
    "paper_id": "2603.10692",
    "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning",
    "authors": [
      "Xian Qin",
      "Xue Yang",
      "Xiaohu Tang"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10692v1",
    "chunk_index": 14,
    "total_chunks": 29,
    "char_count": 685,
    "word_count": 112,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0ba4a112-02e6-4960-bbd7-4b3f4aa45962",
    "text": "settings, demonstrating robustness against data heterogeneity. All experiments are executed on a single NVIDIA RTX 3090 GPU.\nBaselines. 
We compare against: (1) FedAvg [McMahan et al., 2017]: represents the utility upper bound without verification overhead. (2) LightVeriFL [Buyukates et al., 2024]: a state-of-the-art scheme using homomorphic hashing and Pedersen commitments. (3) Yang et al. [Yang et al., 2024]: a recent dual-server protocol based on Learning With Errors.\n[Figure 6: ASR heatmap of Client 0's local model across different trigger sets (clients 0–9 vs. rounds 0–90), for (a) SVHN, (b) CIFAR-10, (c) CIFAR-100.]\n[Figure 7: ASR heatmap of the global model across different trigger sets, where every 10 rounds the server omits the verifier's gradient, for (a) SVHN, (b) CIFAR-10, (c) CIFAR-100.]\nWe evaluate verification effectiveness by monitoring the ASR of the global model on the verifier's trigger sets. As shown in Figure 3, under honest aggregation, the ASR consistently exceeds the acceptance threshold (γ = 0.7), confirming that the verification signal survives aggregation and that valid updates are correctly verified. Conversely, Figure 4 depicts a periodic attack scenario where the server omits the verifier every 10 rounds. 
In these rounds, the ASR drops sharply to ∼10% (random guess), demonstrating malicious behavior.",
    "paper_id": "2603.10692",
    "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning",
    "authors": [
      "Xian Qin",
      "Xue Yang",
      "Xiaohu Tang"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10692v1",
    "chunk_index": 15,
    "total_chunks": 29,
    "char_count": 2217,
    "word_count": 349,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2d7c3e27-ef70-4d68-bc54-f2a71533d200",
    "text": "To confirm the Spatial Non-interference of the verifier's Intrinsic Proof, we further evaluate the proof-carried global model on all clients' trigger sets after each round. As shown in Figure 7, the ASR of the global model on the active verifier's trigger set (marked by red circles) consistently exceeds the detection threshold γ = 0.7 (indicated by yellow triangles), while the ASR on non-verifier clients' trigger sets remains negligible. Similar trends are observed under both IID and Non-IID data distributions. Furthermore, we validate the theoretical detection bounds from Sec. 3.5 via a randomized simulation (Figure 5). In this experiment, the server performs a 10% omission attack (ρ = 0.1) during 50 randomly selected rounds out of T = 100. Our protocol identifies these malicious aggregations through sharp ASR drops, confirming the 
effectiveness of the proposed auditing mechanism. This pattern confirms non-interference among clients' Intrinsic Proofs.",
    "paper_id": "2603.10692",
    "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning",
    "authors": [
      "Xian Qin",
      "Xue Yang",
      "Xiaohu Tang"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10692v1",
    "chunk_index": 16,
    "total_chunks": 29,
    "char_count": 908,
    "word_count": 136,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d6609341-e831-4613-a9ac-91e522b7d50c",
    "text": "Conversely, in every tenth round, when the server omits the verifier's gradient, the corresponding verifier's ASR drops sharply below the threshold, demonstrating reliable detection of omissions.\nReliability. We validate the reliability of Intrinsic Proofs by ensuring: 1) Temporal Non-interference: verification signals do not interfere across training rounds (i.e., the Intrinsic Proof will be forgotten by clean training); 2) Spatial Non-interference: Intrinsic Proofs from different clients do not interfere with one another (i.e., verifying client specificity). We utilize ASR heatmaps to empirically demonstrate the Temporal Non-interference of Intrinsic Proofs, verifying that each client's Intrinsic Proof remains ephemeral and does not interfere with verification in subsequent rounds.\nEfficiency. We benchmark our framework against two state-of-the-art cryptographic protocols: LightVeriFL [Buyukates et al., 2024] and Yang et al. [Yang et al., 2024]. To ensure fairness, we isolate verification-specific overheads, excluding standard training, aggregation, and the costs of orthogonal privacy defenses (e.g., encryption for SA) common to all. 
As demonstrated in Figure 6, we present a heatmap visualization where each row corresponds to a specific client's trigger set and each column denotes a training round. As shown in Table 1, our approach achieves order-of-magnitude efficiency gains, delivering speedups ranging from 99× to 1877× over LightVeriFL. The gap is even wider",
    "paper_id": "2603.10692",
    "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning",
    "authors": [
      "Xian Qin",
      "Xue Yang",
      "Xiaohu Tang"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10692v1",
    "chunk_index": 17,
    "total_chunks": 29,
    "char_count": 1473,
    "word_count": 204,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "761eea00-3c77-48a2-894a-2980244e7416",
    "text": "against Yang et al., which incurs prohibitive latencies (e.g., > 1800s for MobileNet-V1). This disparity stems from fundamental algorithmic complexity: while cryptographic baselines perform expensive operations (e.g., modular exponentiations) for every parameter element, our intrinsic verification requires only lightweight embedding and local inference.\nIn this experiment, the tested model is updated exclusively using clean local gradients gi (Eq. (2)) and evaluated against trigger sets without embedding new Intrinsic Proofs. Taking Client 0 as a representative example, the heatmap reveals that the ASR of its local clean model never exceeds the detection threshold γ = 0.7 without Intrinsic Proof re-injection. This confirms that proof signals are erased by subsequent clean updates, ensuring the final model is free of residual backdoor effects and preserves its utility for legitimate tasks.\nMoreover, because proofs are carried implicitly within the gradient, our method adds zero per-round communication overhead, whereas LightVeriFL and Yang et al. introduce 1.31 KB and 0.9 KB respectively. 
These properties make our approach more scalable for large-scale federated learning.\n[Bagdasaryan et al., 2020] Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly Shmatikov. How to backdoor federated learning. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty-Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 2938–2948. PMLR, 26–28 Aug 2020.\n[Buyukates et al., 2024] Baturalp Buyukates, Jinhyun So, Hessam Mahdavifar, and Salman Avestimehr. Lightverifl: A lightweight and verifiable secure aggregation for federated learning. IEEE Journal on Selected Areas in Information Theory, 5:285–301, 2024.\n[Chaum, 1981] David L Chaum.\nDataset | Metric / Phase | LightVeriFL | Yang et al. | Ours\nResNet-20 (CIFAR-10) | Proof Gen. (s) | 36.48 | 88.66 | 0.35\nResNet-20 (CIFAR-10) | Verification (s) | 0.80 | 0.32 | 0.04\nResNet-20 (CIFAR-10) | Proof Comp. (s) | 1.28 | 185.34 | N/A\nResNet-20 (CIFAR-10) | Total Time (s) | 38.56 | 274.32 | 0.39\nMobileNet-V1 (SVHN) | Proof Gen. (s) | 492.22 | 700.55 | 0.37\nMobileNet-V1 (SVHN) | Verification (s) | 10.05 | 0.88 | 0.30\nMobileNet-V1 (SVHN) | Proof Comp. (s) | 15.12 | 1099.33 | N/A\nMobileNet-V1 (SVHN) | Total Time (s) | 517.39 | 1800.76 | 0.67\nResNet-18 (CIFAR-100) | Proof Gen. (s) | 1808.99 | – | 0.93\nResNet-18 (CIFAR-100) | Verification (s) | 71.90 | – | 0.10\nResNet-18 (CIFAR-100) | Proof Comp. (s) | 53.27 | – | N/A\nResNet-18 (CIFAR-100) | Total Time (s) | 1934.16 | – | 1.03\nTable 1: Efficiency comparison across different models. Compu",
    "paper_id": "2603.10692",
    "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning",
    "authors": [
      "Xian Qin",
      "Xue Yang",
      "Xiaohu Tang"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10692v1",
    "chunk_index": 18,
    "total_chunks": 29,
    "char_count": 2495,
    "word_count": 362,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d494478c-8b43-46fc-bd6d-27222b9bd801",
    "text": "tation times are in seconds per round. Untraceable electronic
mail, return addresses, and digital pseudonyms. Communications of the ACM, 24(2):84–90, 1981.\n\"Proof Gen.\" corresponds to \"Intrinsic Proof Injection\" for our method. \"Proof Comp.\" corresponds to extrinsic proof composition. \"N/A\" indicates the step is not applicable or incurs zero extra cost beyond standard FL. The symbol \"–\" denotes unfinished results due to equipment limits.\n[Chen et al., 2025] Yange Chen, Suyu He, Baocang Wang, Zhanshen Feng, Guanghui Zhu, and Zhihong Tian. A verifiable privacy-preserving federated learning framework against collusion attacks.",
    "paper_id": "2603.10692",
    "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning",
    "authors": [
      "Xian Qin",
      "Xue Yang",
      "Xiaohu Tang"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10692v1",
    "chunk_index": 19,
    "total_chunks": 29,
    "char_count": 632,
    "word_count": 89,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2848f03b-e0be-42e6-acab-4faab6e546bf",
    "text": "IEEE Transactions on Mobile Computing, 24(5):3918–3934, 2025.\n5 Conclusion\nWe propose a lightweight framework for verifiable aggregation in cross-silo FL. Instead of relying on heavy cryptographic proofs, we introduce Ephemeral Intrinsic Proofs, which repurpose backdoor mechanisms to audit server integrity. By leveraging the catastrophic forgetting phenomenon of neural networks, we turn the transience of backdoor triggers into a security feature, enabling per-round verification that naturally fades and preserves model utility.\n[Doan et al., 2021] Khoa Doan, Yingjie Lao, Weijie Zhao, and Ping Li. Lira: Learnable, imperceptible and robust backdoor attacks. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11946–11956, 2021.\n[Erway et al., 2015] C. Chris Erway, Alptekin Kupcu, Char
alampos Papamanthou, and Roberto Tamassia.",
    "paper_id": "2603.10692",
    "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning",
    "authors": [
      "Xian Qin",
      "Xue Yang",
      "Xiaohu Tang"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10692v1",
    "chunk_index": 20,
    "total_chunks": 29,
    "char_count": 867,
    "word_count": 121,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "dd929c71-6aae-4091-93b1-ec8870202ffb",
    "text": "Dynamic provable data possession. ACM Transactions on Information and System Security, 17(4):15.1–15.29, 2015.\nOur analysis shows malicious omissions are detected with high probability via randomized auditing. Experiments on SVHN, CIFAR-10, and CIFAR-100 confirm reliable detection of server misbehavior with minimal accuracy loss. Our method is far more efficient and adds zero communication overhead compared to cryptographic baselines, while remaining compatible with SA protocols.\nReferences\n[French, 1999] Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3:128–135, 1999.\n[Fu et al., 2022] Anmin Fu, Xianglong Zhang, Naixue Xiong, Yansong Gao, Huaqun Wang, and Jing Zhang. Vfl: A verifiable federated learning with privacy-preserving for big data in industrial iot. IEEE Transactions on Industrial Informatics, 18(5):3316–3326, 2022.\n[Adi et al., 2018] Yossi Adi, Carsten Baum, Moustapha Cisse, Benny Pinkas, and Joseph Keshet. 
Turning your\n[Gu et al., 2019] Tianyu Gu, Kang Liu, Brendan Dolan- weakness into a strength: Watermarking deep neural networks by backdooring.", + "paper_id": "2603.10692", + "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning", + "authors": [ + "Xian Qin", + "Xue Yang", + "Xiaohu Tang" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10692v1", + "chunk_index": 21, + "total_chunks": 29, + "char_count": 1118, + "word_count": 154, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f1f8d0ec-d283-4986-8824-173a625b2d58", + "text": "In 27th USENIX Security Sympo- Gavitt, and Siddharth Garg. Badnets: Evaluating backsium (USENIX Security 18), pages 1615–1631, Baltimore, dooring attacks on deep neural networks. Ieee Access,\n7:47230–47244, 2019. USENIX Association.\n[Alam et al., 2023] Manaar Alam, Esha Sarkar, and Michail [Guo et al., 2020] Xiaojie Guo, Zheli Liu, Jin Li, Jiqiang\nManiatakos. Perdoor: Persistent backdoors in federated Gao, Boyu Hou, Changyu Dong, and Thar Baker. Verlearning using adversarial perturbations. In 2023 IEEE In- ifl: Communication-efficient and fast verifiable aggregaternational Conference on Omni-layer Intelligent Systems tion for federated learning. IEEE Transactions on Infor-\n(COINS), pages 1–6, 2023. mation Forensics and Security, 16:1736–1751, 2020.\n[Ateniese et al., 2007] Giuseppe Ateniese, Randal Burns, [Juels and Jr., 2007] Ari Juels and Burton S. Pors:\nReza Curtmola, Joseph Herring, Lea Kissner, Zachary Pe- proofs of retrievability for large files. 
In Peng Ning, Sabterson, and Dawn Song.", + "paper_id": "2603.10692", + "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning", + "authors": [ + "Xian Qin", + "Xue Yang", + "Xiaohu Tang" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10692v1", + "chunk_index": 22, + "total_chunks": 29, + "char_count": 1005, + "word_count": 141, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e807be40-cb13-4a4c-8ab1-816e0188cd7c", + "text": "Provable data possession at un- rina De Capitani di Vimercati, and Paul F. Syverson, edtrusted stores. In Proceedings of the 14th ACM Confer- itors, Proceedings of the 2007 ACM Conference on Comence on Computer and Communications Security, CCS puter and Communications Security, CCS 2007, Alexan-\n'07, page 598–609, New York, NY, USA, 2007. Associ- dria, Virginia, USA, October 28-31, 2007, pages 584–597.\nation for Computing Machinery. [Li et al., 2023] Bowen Li, Lixin Fan, Hanlin Gu, Jie Li, and [Yang et al., 2024] Xue Yang, Minjie Ma, and Xiaohu Tang.", + "paper_id": "2603.10692", + "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning", + "authors": [ + "Xian Qin", + "Xue Yang", + "Xiaohu Tang" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10692v1", + "chunk_index": 23, + "total_chunks": 29, + "char_count": 556, + "word_count": 90, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7c355044-6f6d-413a-8487-f4cb77ed4431", + "text": "Fedipr: Ownership verification for federated An efficient privacy-preserving and verifiable scheme for\ndeep neural network models. IEEE Transactions on Pat- federated learning. Future Generation Computer Systems,\ntern Analysis and Machine Intelligence, 45(4):4521–4536, 160:238–250, 2024.\n2023. 
[Zhang et al., 2022] Zhengming Zhang, Ashwinee Panda,\n[Li et al., 2025] Gongli Li, Zhe Zhang, and Ruiying Du. Linyue Song, Yaoqing Yang, Michael Mahoney, Prateek\nLvsa: Lightweight and verifiable secure aggregation for Mittal, Ramchandran Kannan, and Joseph Gonzalez. Neufederated learning. Neurocomputing, 648:130712, 2025. rotoxin: Durable backdoors in federated learning. In\nKamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba\n[Liu et al., 2021] Xiyao Liu, Shuo Shao, Yue Yang, Kang- Szepesvari, Gang Niu, and Sivan Sabato, editors, Proming Wu, Wenyuan Yang, and Hui Fang. Secure federated ceedings of the 39th International Conference on Malearning model verification: A client-side backdoor trig- chine Learning, volume 162 of Proceedings of Machine\ngered watermarking scheme. In 2021 IEEE International Learning Research, pages 26429–26446. PMLR, 17–23\nConference on Systems, Man, and Cybernetics (SMC), Jul 2022.\npages 2414–2419, 2021. [McMahan et al., 2017] Brendan McMahan, Eider Moore,\nDaniel Ramage, Seth Hampson, and Blaise Aguera y Arcas.", + "paper_id": "2603.10692", + "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning", + "authors": [ + "Xian Qin", + "Xue Yang", + "Xiaohu Tang" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10692v1", + "chunk_index": 24, + "total_chunks": 29, + "char_count": 1346, + "word_count": 184, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6698ed56-2ad7-4b94-bd9c-ea0f83047d5e", + "text": "Communication-Efficient Learning of Deep Networks\nfrom Decentralized Data. In Aarti Singh and Jerry Zhu,\neditors, Proceedings of the 20th International Conference\non Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1273–\n1282. PMLR, 20–22 Apr 2017. [Mothukuri et al., 2021] Viraaji Mothukuri, Reza M. 
Parizi,\nSeyedamin Pouriyeh, Yan Huang, Ali Dehghantanha, and\nGautam Srivastava.", + "paper_id": "2603.10692", + "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning", + "authors": [ + "Xian Qin", + "Xue Yang", + "Xiaohu Tang" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10692v1", + "chunk_index": 25, + "total_chunks": 29, + "char_count": 432, + "word_count": 59, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b4b55001-a0ef-42bd-a6dd-47f3031d05fa", + "text": "A survey on security and privacy of\nfederated learning. Future Generation Computer Systems,\n115:619–640, 2021. [Nie and Lu, 2024] Hewang Nie and Songfeng Lu.", + "paper_id": "2603.10692", + "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning", + "authors": [ + "Xian Qin", + "Xue Yang", + "Xiaohu Tang" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10692v1", + "chunk_index": 26, + "total_chunks": 29, + "char_count": 157, + "word_count": 24, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "bb4fc007-1eb0-45e4-b0aa-c1ad3969915e", + "text": "Fedcrmw: Federated model ownership verification with\ncompression-resistant model watermarking. Expert Systems with Applications, 249:123776, 2024. 
[Qin et al., 2026] Xian Qin, Xue Yang, and Xiaohu Tang.", + "paper_id": "2603.10692", + "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning", + "authors": [ + "Xian Qin", + "Xue Yang", + "Xiaohu Tang" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10692v1", + "chunk_index": 27, + "total_chunks": 29, + "char_count": 202, + "word_count": 26, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a9c32447-0771-4d13-832f-7af4571ff913", + "text": "Practical privacy-preserving federated learning based on\nmultiparty homomorphic encryption for large-scale models. Pattern Recognition, 171:112174, 2026. [Segal et al., 2017] Aaron Segal, Antonio Marcedone, Benjamin Kreuter, Daniel Ramage, H. Brendan McMahan, Karn Seth, K. Bonawitz, Sarvar Patel, and\nVladimir Ivanov. Practical secure aggregation for privacypreserving machine learning. [Tang et al., 2024] Jinling Tang, Haixia Xu, Mingsheng\nWang, Tao Tang, Chunying Peng, and Huimei Liao.", + "paper_id": "2603.10692", + "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning", + "authors": [ + "Xian Qin", + "Xue Yang", + "Xiaohu Tang" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10692v1", + "chunk_index": 28, + "total_chunks": 29, + "char_count": 490, + "word_count": 64, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b175bf77-4ef8-4fd4-955c-e85b09868bf7", + "text": "A\nflexible and scalable malicious secure aggregation protocol for federated learning. IEEE Transactions on Information Forensics and Security, 19:4174–4187, 2024. [Tekgul et al., 2021] Buse G. Tekgul, Yuxi Xia, Samuel\nMarchal, and N. Waffle: Watermarking in federated learning. In 2021 40th International Symposium\non Reliable Distributed Systems (SRDS), pages 310–320,\n2021. 
[Xu et al., 2020] Guowen Xu, Hongwei Li, Sen Liu, Kan\nYang, and Xiaodong Lin. Verifynet: Secure and verifiable federated learning. IEEE Transactions on Information\nForensics and Security, 15:911–926, 2020.", + "paper_id": "2603.10692", + "title": "Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning", + "authors": [ + "Xian Qin", + "Xue Yang", + "Xiaohu Tang" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10692v1", + "chunk_index": 29, + "total_chunks": 29, + "char_count": 581, + "word_count": 81, + "chunking_strategy": "semantic" + } +] \ No newline at end of file diff --git a/data/chunks/2603.10695_semantic.json b/data/chunks/2603.10695_semantic.json new file mode 100644 index 0000000000000000000000000000000000000000..3b37a614fe07311bae9dd06d7b6937285e6210bd --- /dev/null +++ b/data/chunks/2603.10695_semantic.json @@ -0,0 +1,506 @@ +[ + { + "chunk_id": "e951267b-3e23-4d4c-b882-58bc6b922a20", + "text": "Anna Chistyakova1 and Mikhail Pautov1,2 1 Trusted AI Research Center, RAS\n2 AXXX Being trained on large and diverse datasets, visual founda-2026 tion models (VFMs) can be fine-tuned to achieve remarkable performance and efficiency in various downstream computer vision tasks. The\nhigh computational cost of data collection and training makes these mod-Mar els valuable assets, which motivates some VFM owners to distribute them\nalongside a license to protect their intellectual property rights. In this\npaper, we propose an approach to ownership verification of visual foun-11\ndation models that leverages a small encoder-decoder network to embed\ndigital watermarks into an internal representation of a hold-out set of\ninput images. The method is based on random watermark embedding,\nwhich makes the watermark statistics detectable in functional copies\nof the watermarked model. 
Both theoretically and experimentally, we\ndemonstrate that the proposed method yields a low probability of false[cs.CV] detection for non-watermarked models and a low probability of false misdetection for watermarked models. Keywords: Watermarking · Visual Foundation Models · Fingerprinting Today, foundation models are deployed in different fields, for example, in natural\nlanguage processing [3,19], computer vision [20], and biology [14]. Their impressive performance in a wide range of downstream tasks comes at a price of high\ncost of data collection, training, and maintenance.", + "paper_id": "2603.10695", + "title": "RandMark: On Random Watermarking of Visual Foundation Models", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10695v1", + "chunk_index": 1, + "total_chunks": 28, + "char_count": 1463, + "word_count": 211, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "cbe45be1-f4ec-463a-b13d-938b71620618", + "text": "Consequently, the models become valuable assets of their owners: the user's access to foundation models is\nmainly organized via subscription to a service where the model is deployed or\nvia purchasing the license to use a specific instance of the model. Unfortunately,arXiv:2603.10695v1 some users may violate the terms of use (for example, by integrating their instances of the models into other services to make a profit). Hence, it is reasonable\nthat the models' owners are willing to defend their intellectual property from\nunauthorized usage by third parties. One of the prominent approaches to protecting the intellectual property\nrights (IPRs) of models is watermarking [9,13,24], the set of methods that embed\nspecific information into a model by modifying its parameters. In watermarking,\nownership verification is performed by checking for the presence of this information in a model. 
An alternative set of methods for IPR protection is based on 2 Anna Chistyakova and Mikhail Pautov", + "paper_id": "2603.10695", + "title": "RandMark: On Random Watermarking of Visual Foundation Models", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10695v1", + "chunk_index": 2, + "total_chunks": 28, + "char_count": 992, + "word_count": 153, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f12c0e02-4d0e-4355-ad55-be066c75c34f", + "text": "Fig. 1: Overview of the proposed RandMark watermarking pipeline. A binary message\nis embedded into a visual foundation model using a set of trigger images and an encoder. During verification, randomized input transformations are applied to the trigger set,\nand a decoder extracts the watermark message from the model outputs. The extracted\nmessages are then compared with the original watermark to verify model ownership. fingerprinting, which typically does not alter the original model [11,16,17]. Instead, these methods generate a unique identifier, or fingerprint, for the model;\nownership verification is then conducted by comparing the fingerprint of the\noriginal model with that of the suspicious model. This work introduces a method for watermarking visual foundation models\n(VFMs) by embedding digital watermarks into the hidden representations of a\nspecific set of input images. Within the framework, we experimentally verify that\nembedding a watermark into the representation allows us to protect the ownership of VFMs fine-tuned for different practical tasks, such as image classification\nand segmentation. We demonstrate that our approach is able to distinguish between an independent model and functional copies of the watermarked model\nwith high probability. 
Our contributions are summarized as follows:", + "paper_id": "2603.10695", + "title": "RandMark: On Random Watermarking of Visual Foundation Models", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10695v1", + "chunk_index": 3, + "total_chunks": 28, + "char_count": 1318, + "word_count": 192, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "df9699ad-eac2-4c2a-9159-909181e08341", + "text": "– We propose RandMark, a novel methodology for watermarking visual foundation models. Unlike prior art focused on classifiers, our approach embeds\nbinary signatures directly into the model's hidden representations via a set\nof trigger images, making it suitable for the diverse downstream use-cases of\nVFMs.\n– We theoretically derive an upper bound on the probabilities of false positive\ndetection of a non-watermarked model and misdetection of a functional copy\nof the watermarked model.\n– Through experiments on state-of-the-art visual foundation models (CLIP\nand DINOv2), we demonstrate that RandMark is highly robust. It successfully detects model ownership after various functional perturbations, including fine-tuning on downstream tasks (classification and segmentation) and\nunstructured pruning, where existing fingerprinting methods fail. RandMark: On Random Watermarking of Visual Foundation Models 3 2.1 Visual Foundation Models Visual foundation models, particularly those using vision transformers (ViT, [8]),\nare widely used in modern computer vision due to their scalability and transferability across tasks. The advancement of self-supervised learning methods [1]\nhas facilitated the creation of general-purpose models, including SimCLR [6],\nDINO [5], CLIP [18], and DINOv2 [15]. These models learn representations from\nunlabeled images and demonstrate broad applicability across diverse tasks, often\nrequiring minimal labeled data for fine-tuning. 
2.2 Protecting Intellectual Property of Neural Networks The protection of intellectual property for visual foundation models (VFMs) has\ngained increasing attention within the field of trustworthy AI. Watermarking\nand fingerprinting techniques aim to verify model ownership and prevent unauthorized usage or model extraction. While early works focused on large language\nmodels [22,25], recent efforts adapt these ideas to visual models, including image\nclassifiers and foundation models [16,23]. For visual foundation models (VFMs), there are currently no watermarking\napproaches specifically designed for these architectures. Several existing model\nownership verification methods, such as ADV-TRA [26], and IPGuard [4], have\nbeen proposed in the context of image classification. These approaches embed\nownership signatures by either modifying the training objective or introducing\ncrafted input patterns and then detect them based on the model's responses.", + "paper_id": "2603.10695", + "title": "RandMark: On Random Watermarking of Visual Foundation Models", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10695v1", + "chunk_index": 4, + "total_chunks": 28, + "char_count": 2421, + "word_count": 329, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7578e50b-1378-4b76-803a-cb2afff3be3b", + "text": "While these methods are effective for classification models, they are not directly\ntailored to the broader capabilities of VFMs, such as image feature extraction\nor downstream adaptation. Adapting watermarking techniques to visual foundation models thus remains an open challenge and motivates the work presented\nin this paper. Other complementary methods exploit weight-space smoothing or\nperturbations to embed ownership information directly into model parameters. For example, Bansal et al. 
[2, 21] propose model watermarking through weight\nsmoothing in deep neural networks, making the watermark robust to fine-tuning\nor minor architectural changes. These approaches provide alternative mechanisms to mark models without relying on specific input-output triggers and are\nparticularly relevant for large visual foundation models where modifying the\nbackbone is costly. Overall, while watermarking for VFMs is still in early stages, these methods\nillustrate that both data-driven triggers and weight-space techniques can serve\nas practical IP protection strategies for high-capacity visual models. 3.1 Problem Statement In this work, we focus on the problem of watermarking of visual foundation\nmodels. To describe the proposed method, we start by introducing the notations. 4 Anna Chistyakova and Mikhail Pautov", + "paper_id": "2603.10695", + "title": "RandMark: On Random Watermarking of Visual Foundation Models", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10695v1", + "chunk_index": 5, + "total_chunks": 28, + "char_count": 1314, + "word_count": 184, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fdda6254-1ff2-4b85-8f2c-db8e90f63c5d", + "text": "Let s be the dimension of the input image and f : Rs →Rk be the source VFM\nthat maps input images to the embeddings of dimension k. Here and below, we\nwill write f ′ ∼f to indicate that the model f ′ is a functional copy of f that is\nobtained, for example, via fine-tuning, knowledge distillation or pruning of the\noriginal model. Analogously, by writing g ⊥f we will indicate that two models,\ng and f, are independent of each other. 
In our method, we train two auxiliary\nmodels, the encoder e : Rs × {0, 1}n →Rk that embeds the binary message m\nof length n into the representation of the input object x ∈Rs, and the decoder\nd : Rk →{0, 1}n that extracts a binary message from the output embedding\nof the VFM. Given the input image x, the source model f and the message m\nembedded into f(x), the goal of the method is two-fold: on the one hand, the\ndecoder d should extract close messages from the representations f(x) and f ′(x)\nfor the model f ′ ∼f; on the other hand, given the model g ⊥f, the messages\nextracted from the representations f(x) and g(x) have to be far apart. The formal problem statement goes as follows. Given x as the secret input\nimage used for watermarking, a predefined threshold τ ≪n and probability\nthresholds 0 < γ1 ≪γ2 < 1, the following inequalities should hold:\n\\begin{cases}\\displaystyle\\mathbb{P}\\left(\\|w - d(f'(x))\\|_1\\le\\tau\\right)\\ge\\gamma_2,\\\\ \\displaystyle\\mathbb{P}\\left(\\|w - d(g(x))\\|_1\\le\\tau\\right)\\le\\gamma_1,\\end{cases} (1)\nwhere w = e(x, m) is the embedding with the watermark, f ′ ∼f, g ⊥f. In\nEq. 1, the probabilities are taken over the randomness induced by the encoder;\nthis randomness will be discussed in the subsequent sections. In this section, we discuss the conditions under which the proposed method is\nexpected to operate correctly and outline the potential adversary's capabilities. The goal of an adversary is to remove an existing watermark from a model\nso that ownership cannot be verified. Specifically, an adversary may attempt\neither a watermark removal attack, aiming to eliminate the watermark while\npreserving the model's functionality, or a model extraction attack, trying to\nobtain a copy of the watermarked model without the watermark. Possible attacks\ninclude fine-tuning the model on downstream tasks or pruning.
The objective of\nthe watermarking method is to reliably determine whether a suspect model is a\nfunctional copy of the watermarked visual foundation model.", + "paper_id": "2603.10695", + "title": "RandMark: On Random Watermarking of Visual Foundation Models", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10695v1", + "chunk_index": 6, + "total_chunks": 28, + "char_count": 2446, + "word_count": 418, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8c066153-0ff1-4005-96bf-c9e2566c8b77", + "text": "We introduce RandMark, a novel watermarking approach designed for visual\nfoundation models. RandMark embeds user-specific binary signatures into the\nrepresentations of a randomly transformed set of input images. To do so, we\nfine-tune the source model together with the lightweight encoder and decoder RandMark: On Random Watermarking of Visual Foundation Models 5 This approach enables ownership verification by extracting digital fingerprints from the set of randomly transformed specific set of input images and\ncomputing the statistic of resulting random variables. The watermarking process goes as follows. First of all, given input image\nx and user-specific binary message m, we inject m into the representation of\nx + εj, εj ∼N(0, σ2(x)I) by training the small encoder e and fine-tuning the\nsource model f. Modified representation, f(x + εj), is then passed to the decoder network d that extracts binary message m′j from it. We highlight that the\nextracted messages, m′j, are random variables due to the randomness in transformation of input image. The encoder, decoder, and the source foundation model\nare trained jointly to minimize both the average discrepancy between m and m′j\nand the variance of m′j. 
Loss function The training objective is the combination of two terms: given\nthe input sample x, the first one ensures that the feature representations of the\nwatermarked and original models do not deviate much; the second term forces\nthe extracted binary messages to be close to the embedded one. Specifically, the\nobjective function is\nL(x, f, \\tilde{f}) = \\|f(x) - \\tilde{f}(x)\\|_2 + \\frac{\\lambda}{K}\\sum_{j=1}^K\\|m - m_j'\\|_2, (2)\nwhere λ > 0 is a scalar parameter, ˜f is the watermarked version of f and\nm′j = d(˜f(e(x + εj, m))) is the binary message extracted by the decoder from\nx + εj and K is the total number of transformations of the input image. This\nformulation ensures the successful embedding and extraction of watermarks with\nlittle to no impact on the feature representation.",
    "paper_id": "2603.10695",
    "title": "RandMark: On Random Watermarking of Visual Foundation Models",
    "authors": [
      "Anna Chistyakova",
      "Mikhail Pautov"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10695v1",
    "chunk_index": 7,
    "total_chunks": 28,
    "char_count": 2002,
    "word_count": 320,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5d69c679-fa62-40fb-ae1d-6c1004ffe629",
    "text": "Evaluating the efficiency of the method To evaluate the performance of the\nproposed method, given the user-specific watermark m and input image x, we\ncompute both the sample average and the sample variance of the variable ∥m −\nm′∥1, where m′ is the extracted watermark. Here we recall that the extracted watermarks\nare random.
Namely, if the total number of transformations of the input image is K and\nthe length of the watermark is n, we measure the average number of mismatched\nbits between m and m′j in the form\n\\hat{\\mathbb{E}}\\|m-m'\\|_1 = \\frac{1}{K}\\sum_{j=1}^K\\sum_{i=1}^n\\mathds{1}\\left(m_i\\ne(m'_j)_i\\right), (3)\nand the sample variance is computed as\n\\hat{\\mathbb{V}}\\|m-m'\\|_1 = \\frac{1}{K-1}\\sum_{j=1}^K\\left(d_j - \\hat{\\mathbb{E}}\\|m-m'\\|_1\\right)^2, (4)",
    "paper_id": "2603.10695",
    "title": "RandMark: On Random Watermarking of Visual Foundation Models",
    "authors": [
      "Anna Chistyakova",
      "Mikhail Pautov"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10695v1",
    "chunk_index": 8,
    "total_chunks": 28,
    "char_count": 832,
    "word_count": 136,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "855620f1-b3c1-4901-9986-4a56044facd7",
    "text": "where d_j = \\sum_{i=1}^n\\mathds{1}\\left[m_i \\ne (m'_j)_i\\right], and m′j = d(˜f(e(x + εj, m))). The intuition\nbehind using these two metrics is as follows. First of all, given the extracted message m′, the distance from Eq. (3) is expected to be small for the watermarked\nmodel and large for an independent model.",
    "paper_id": "2603.10695",
    "title": "RandMark: On Random Watermarking of Visual Foundation Models",
    "authors": [
      "Anna Chistyakova",
      "Mikhail Pautov"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10695v1",
    "chunk_index": 9,
    "total_chunks": 28,
    "char_count": 281,
    "word_count": 54,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c0d7ed65-6c55-497a-8ee7-e912ccda3a95",
    "text": "Secondly, if we introduce an auxiliary\nvariable in the form\nv(f, h) = \\mathbb{V}\\left(\\|m'(f) - m'(h)\\|_1\\right), (5)\nthen v(f, f ′) is expected to be small for f ′ ∼f and v(f, g) is expected to be\nlarge for g ⊥f.
We elaborate on this point in the subsequent sections.",
    "paper_id": "2603.10695",
    "title": "RandMark: On Random Watermarking of Visual Foundation Models",
    "authors": [
      "Anna Chistyakova",
      "Mikhail Pautov"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10695v1",
    "chunk_index": 10,
    "total_chunks": 28,
    "char_count": 270,
    "word_count": 53,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "258596c2-0b9b-405b-9bd1-1648e287a549",
    "text": "In this work, the decision rule that is used to evaluate whether the given\nnetwork is watermarked is the comparison of the distance with a predefined\nthreshold: given the suspicious model h, input image x, secret message m and\nthe series of K watermarks m′1, m′2, . . . , m′K extracted from h, we treat h as\nwatermarked if\n\\rho(x) = \\hat{\\mathbb{E}}\\|m-m'\\|_1 = \\frac{1}{K}\\sum_{j=1}^K\\sum_{i=1}^n\\mathds{1}\\left(m_i\\ne(m'_j)_i\\right) \\le \\tau, (6)\nwhere τ ≥0 is the threshold value. In the case of many input images used for\nwatermarking, namely, for N images from X = {x1, . . . , xN}, the performance\nof the method is illustrated by the watermark detection rate, R(h, X, τ), in the\nform below:\nR(h, \\mathcal{X}, \\tau) = \\frac{1}{N}\\sum_{x_i\\in\\mathcal{X}}\\mathds{1}[\\rho(x_i)\\le\\tau]. (7)\nAs an auxiliary indicator of the model being watermarked, for each x ∈X, we\ncompute the value of statistic v(f, h).
Setting the threshold value We set the threshold by formulating a hypothesis\ntest: the null hypothesis, H0 = \"the model h is not watermarked\", is tested\nagainst an alternative hypothesis, H1 = \"the model h is watermarked\", for the\ngiven suspicious model h.",
    "paper_id": "2603.10695",
    "title": "RandMark: On Random Watermarking of Visual Foundation Models",
    "authors": [
      "Anna Chistyakova",
      "Mikhail Pautov"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10695v1",
    "chunk_index": 11,
    "total_chunks": 28,
    "char_count": 1172,
    "word_count": 202,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "b92d7248-29db-4182-bc5a-993212151679",
    "text": "In this section, we assume that the probabilities that\nthe i-th bit in m′(f) and m′(h) coincide are the same for all i ∈[1, n]. Having\nsaid so, we estimate the probability of false acceptance of hypothesis H1 (namely,\nFPR1) as follows:\nFPR_1 = \\mathbb{P}_{g\\sim\\Xi}\\left[\\rho(m, m'(g,x))<\\tau\\right] \\le \\sum_{j=0}^{\\tau}\\binom{n}{j}(1-r)^jr^{n-j}, (8)\nwhere r = \\mathbb{P}_{g\\sim\\Xi}(m_i = m'(g, x)_i).
To choose a proper threshold value for τ, we\nset up an upper bound for FPR1 as ε and solve for τ, namely,\n\\tau = \\arg\\max_{\\tau'<n}\\left(\\sum_{j=0}^{\\tau'}\\binom{n}{j}(1-r)^jr^{n-j}\\right)\\quad\\text{s.t.}\\quad\\sum_{j=0}^{\\tau'}\\binom{n}{j}(1-r)^jr^{n-j}<\\varepsilon. (9)",
    "paper_id": "2603.10695",
    "title": "RandMark: On Random Watermarking of Visual Foundation Models",
    "authors": [
      "Anna Chistyakova",
      "Mikhail Pautov"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10695v1",
    "chunk_index": 12,
    "total_chunks": 28,
    "char_count": 679,
    "word_count": 129,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a6218396-4785-40d1-819b-5166f10c3898",
    "text": "3.4 Difference between watermarked and non-watermarked models Recall that a good watermarking approach should yield a high watermark detection rate from (7) for the models that are functionally connected to the watermarked one and, at the same time, low detection rates for independent models. To assess the integrity of the proposed approach, we estimate the probabilities\nof the method to yield low detection rates for functionally dependent models\nand high detection rates for independent models in the form\n\\mathbb{P}_{f'\\sim\\Omega_f}[R(f',\\mathcal{X},\\tau)<\\overline{R}],\\quad\\mathbb{P}_{g\\sim\\Xi}[R(g,\\mathcal{X},\\tau)>\\underline{R}] (10)\nfor some threshold values 0 < \\underline{R} < \\overline{R} < N. To estimate the probabilities from (10), we first provide one-sided interval\nestimations for conditional probabilities of bit collisions in the form\nr(\\Omega_f|x) = \\mathbb{P}_{f'\\sim\\Omega_f}[m_i = m'(f',x)_i],\\quad r(\\Xi|x) = \\mathbb{P}_{g\\sim\\Xi}[m_i = m'(g,x)_i]. (11)\nWe do it by sampling M functionally dependent models, namely, f'_1, . . .
, f'_M ∼\nΩf, and M independent models, namely g1, . . . , gM ∼Ξ. Here, the space Ξ\nof independent models consists of visual foundation models, both of the same\narchitecture and of different architectures as f, obtained by either fine-tuning a non-watermarked copy of f for a downstream task, or via functionality stealing perturbations, for example, via knowledge distillation [12] or pruning [10]. Similarly,\nthe space Ωf consists of the models, both of the same architecture and of different architectures as f, obtained by either fine-tuning of f for a downstream task, or via\nfunctionality stealing perturbations. Then, given the set X = {x1, . . . , xN} of images used for the watermarking\nof f from (7), we compute the quantities\n\\mathds{1}(f'_j,i,x_l) = \\mathds{1}[m_i = m'(f'_j, x_l)_i]\\quad\\text{and}\\quad\\mathds{1}(g_j,i,x_l)=\\mathds{1}[m_i=m'(g_j,x_l)_i] (12)\nand build the one-sided interval estimates\n\\begin{cases}\\mathbb{P}(r(\\Omega_f|x)<l(x))\\le\\frac{\\alpha}{N},\\\\ \\mathbb{P}(r(\\Xi|x)>u(x))\\le\\frac{\\alpha}{N}.\\end{cases} (13)\nThese estimates, namely, l(x) and u(x), are used to estimate the probabilities\nfrom (10).
Estimating the probability of a deviation of the detection rate In this\nsection, we discuss how to upper-bound both the probability of false detection of\na non-watermarked model as a copy of the watermarked one and the probability\nof misdetecting a functional copy of the watermarked model.",
    "paper_id": "2603.10695",
    "title": "RandMark: On Random Watermarking of Visual Foundation Models",
    "authors": [
      "Anna Chistyakova",
      "Mikhail Pautov"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10695v1",
    "chunk_index": 13,
    "total_chunks": 28,
    "char_count": 2569,
    "word_count": 388,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "69099f35-f481-42e6-8e39-f8c9a6c91e70",
    "text": "Note that R(f ′, X, τ) is a sum of N independent Bernoulli variables with\nparameters r(Ωf|x), so\n\\mathbb{P}_{f'\\sim\\Omega_f}[R(f',\\mathcal{X},\\tau)<\\overline{R}] = \\sum_{l=0}^{\\overline{R}-1}\\sum_{S\\subset\\mathcal{X}:|S|=l}\\prod_{x_{in}\\in S}r(\\Omega_f|x_{in})\\prod_{x_{out}\\notin S}(1-r(\\Omega_f|x_{out})). (14)",
    "paper_id": "2603.10695",
    "title": "RandMark: On Random Watermarking of Visual Foundation Models",
    "authors": [
      "Anna Chistyakova",
      "Mikhail Pautov"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10695v1",
    "chunk_index": 14,
    "total_chunks": 28,
    "char_count": 358,
    "word_count": 50,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e801b189-552b-464d-997e-45e955be95c2",
    "text": "Note that replacing the parameters r(Ωf|x) with their estimates in the form l(x)\nfrom (13) yields the bound\n\\mathbb{P}_{f'\\sim\\Omega_f}[R(f',\\mathcal{X},\\tau)<\\overline{R}] < \\sum_{l=0}^{\\overline{R}-1}\\sum_{S\\subset\\mathcal{X}:|S|=l}\\prod_{x_{in}\\in S}l(x_{in})\\prod_{x_{out}\\notin S}(1-l(x_{out})) \\equiv \\underline{p}(\\Omega) (15)\nthat holds with probability at least 1 −α. Analogously,\n\\mathbb{P}_{g\\sim\\Xi}[R(g,\\mathcal{X},\\tau)>\\underline{R}] < \\sum_{l=\\underline{R}+1}^{N}\\sum_{S\\subset\\mathcal{X}:|S|=l}\\prod_{x_{in}\\in S}u(x_{in})\\prod_{x_{out}\\notin S}(1-u(x_{out})) \\equiv \\underline{p}(\\Xi). (16)\nIn our experiments, we used n = 32, τ = 5, M = 1000 and varied the\nconfidence level α such that the probabilities α, p(Ξ), p(Ω) were close. Specifically, the\nvalue α = 5 × 10−6 yields p(Ω) = 10−6, p(Ξ) = 10−4, \\overline{R} = 750 and \\underline{R} = 600. Thus, if one uses the boundary values \\overline{R}, \\underline{R} to distinguish between the watermarked and non-watermarked model, one is guaranteed to have both error probabilities p(Ω), p(Ξ) low. 3.5 Alternative estimation of bit collisions\nAccording to equation 7, the quantity R(f, X, τ) = \\sum_{i=1}^{N}\\mathds{1}[\\rho(m(x_i), m'(f, x_i)) \\le \\tau] is the sum of N independent Bernoulli random variables.
We may rewrite\nR1 = R(f ′, X, τ) and R2 = R(g, X, τ) from equation 10 in the form\nR_1 = \\xi_1+\\xi_2+\\dots+\\xi_{n-1}+\\xi_n,\\quad R_2 = \\eta_1+\\eta_2+\\dots+\\eta_{n-1}+\\eta_n, (17)\nwhere ξi ∼Bernoulli(pi), ηi ∼Bernoulli(qi) are independent and parameters\n(pi, qi) are unknown. Let p = \\frac{1}{n}\\sum_{i=1}^n p_i and q = \\frac{1}{n}\\sum_{i=1}^n q_i. Then, if \\overline{R} < np\nand \\underline{R} > nq from equation 10, the following lemma holds. Let δ > 0 and set ε = \\sqrt{\\frac{1}{2n}\\ln\\frac{2}{\\delta}}. Let ˆp = \\frac{R_1}{n} and ˆq = \\frac{R_2}{n} be\nunbiased estimates of p and q, respectively. Then, with probability at least 1 −δ,\nthe following upper bounds for probabilities from equation 10 hold:",
    "paper_id": "2603.10695",
    "title": "RandMark: On Random Watermarking of Visual Foundation Models",
    "authors": [
      "Anna Chistyakova",
      "Mikhail Pautov"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10695v1",
    "chunk_index": 15,
    "total_chunks": 28,
    "char_count": 1809,
    "word_count": 315,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7f766ea1-a916-4f58-b26b-beaac99df857",
    "text": "\\mathbb{P}_{f'\\sim\\Omega_f}[R(f',\\mathcal{X},\\tau)<\\overline{R}] \\le h(\\hat{p},\\varepsilon^{-}),\\quad\\mathbb{P}_{g\\sim\\Xi}[R(g,\\mathcal{X},\\tau)>\\underline{R}] \\le h(\\hat{q},\\varepsilon^{+}), (18)\nwhere\nh(\\hat{p},\\varepsilon^{-}) = \\left(\\frac{n(\\hat{p}-\\varepsilon)}{\\overline{R}}\\right)^{\\overline{R}}\\left(\\frac{n(1-(\\hat{p}-\\varepsilon))}{n-\\overline{R}}\\right)^{n-\\overline{R}},\\quad h(\\hat{p},\\varepsilon^{+}) = \\left(\\frac{n(\\hat{p}+\\varepsilon)}{\\underline{R}}\\right)^{\\underline{R}}\\left(\\frac{n(1-(\\hat{p}+\\varepsilon))}{n-\\underline{R}}\\right)^{n-\\underline{R}}. (19)",
    "paper_id": "2603.10695",
    "title": "RandMark: On Random Watermarking of Visual Foundation Models",
    "authors": [
      "Anna Chistyakova",
      "Mikhail Pautov"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10695v1",
    "chunk_index": 16,
    "total_chunks": 28,
    "char_count": 503,
    "word_count": 48,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5f3eecea-5a57-4555-894d-321bc00c58db",
    "text": "Some words about relation R and p; proof will be moved to the\nappendix. We provide a proof for the upper inequality from equation 18. Specifically,\nwe need to upper bound the probability P(R ≤d), where R ≡R(f ′, X, τ) and\nd ≡\\overline{R}. According to the Chernoff bound,\n\\mathbb{P}(R<d) \\le \\inf_{t<0}\\exp(-td)\\mathbb{E}(\\exp(tR)). (20)\nNote that, according to the independence of ξi,\n\\mathbb{E}(\\exp(tR)) = \\prod_{i=1}^n\\mathbb{E}(\\exp(t\\xi_i)) = \\prod_{i=1}^n(1-p_i+p_ie^t), (21)\nand, hence,\n\\mathbb{P}(R<d) \\le \\inf_{t<0}\\left[\\exp(-td)\\prod_{i=1}^n(1-p_i+p_ie^t)\\right]. (22)\nLet ϕ(p) = \\ln(1-p+pe^t). Note that ϕ′′(p) = -\\frac{(e^t-1)^2}{(1-p+pe^t)^2} < 0 for all t < 0,\nand, hence, ϕ(p) is strictly concave on [0, 1]. From the concavity of ϕ(p), it follows that\n\\frac{1}{n}\\sum_{i=1}^n\\ln(1-p_i+p_ie^t) \\le \\ln(1-\\overline{p}+\\overline{p}e^t),\\quad\\prod_{i=1}^n\\left[1-p_i+p_ie^t\\right] \\le (1-\\overline{p}+\\overline{p}e^t)^n,\nand, consequently,\n\\mathbb{P}(R<d) \\le \\inf_{t<0}\\psi(t), where \\psi(t) = \\exp(-td)(1-\\overline{p}+\\overline{p}e^t)^n. Since the logarithm is monotonic, t = \\ln\\frac{d-\\overline{p}d}{n\\overline{p}-\\overline{p}d} is a unique critical point of \\frac{d}{dt}\\ln\\psi(t).
si t) = \\ g am (p) =\\left(\\frac{np}{d}\\right)^d\\left(\\frac{n(1-p)}{n-d}\\right)^{n-d} \\ _{t < 0} \\ p (27)\ninf ( m a 10 Anna Chistyakova and Mikhail Pautov", + "paper_id": "2603.10695", + "title": "RandMark: On Random Watermarking of Visual Foundation Models", + "authors": [ + "Anna Chistyakova", + "Mikhail Pautov" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10695v1", + "chunk_index": 18, + "total_chunks": 28, + "char_count": 590, + "word_count": 122, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "01e78b6f-099c-4768-9432-dfc835e5cd5b", + "text": "and the overall bound is\n\\ m at h bb {P}(R n,d \\ m at h bb{ P }(R\" means programmatically processing the data. For each given seed instance in BIRD [18], 3.5 Data Generation\nconsisting of a triple, we maintain We design a framework to simulate different types of schema perthe natural language question (NLQ) fixed across all perturbation turbations in a configurable way. For adding or renaming columns,\ntypes, while only modifying the relevant schema. The correspond- both the modified column size and the column position in the tables\ning SQL query is adjusted as necessary to remain consistent with are set randomly, and we set the original column size in the table\nthe changes in the database schema. as the maximum number of columns to be changed. For removing columns, we can randomly remove important or unimportant\ncolumns from the existing relevant tables. The important columns\n3.4 Seed Dataset Selection are the columns that appear in the gold SQL, which will inevitably\naffect the prediction. For adding, removing, or renaming tables, we\nFor building Evoschema benchmark, we utilize the BIRD [18] dataset\nrandomly add, remove or rename one or multiple tables.\nas the seed data, which is specifically designed for the text-to-SQL\nSchema Change: To ensure the diversity and reasonability of the\ntask. 
Compared to Spider [32], which is commonly used to study\nsynthesized schema, we leverage the capabilities of GPT-3.5 and\ntext-to-SQL robustness, BIRD features more intricate, realistic, and\nGPT-4 to synthesize realistic and contextually appropriate columns\nextensive databases, as well as more complex SQL queries that inor tables, which help effectively produce high-quality synthetic data\nclude keywords often missing in Spider. BIRD consists of NLQs,\nthat meets our requirements. For adding or renaming columns and\ncorresponding database schemas, and gold SQL queries and entables, we input the existing relevant tables to GPT-3.5, and let the\ncompasses a wide range of real-world database scenarios, which\nmodel generate the potential tables or columns that fit the context.\nprovides a robust foundation for evaluating the performance of\nFor splitting tables or merging tables, since they are more complex\nmodels in translating NLQs into SQLs.\nthan other perturbations, we use GPT-4 to choose the tables that\nSchema Perturbations: To evaluate the robustness of the text-tocan be split or merged and then use the modified tables to replace\nSQL models, EvoSchema not only includes the BIRD dataset in their\nthe original ones. For adding or renaming columns and tables, we\noriginal form but also augmented it with various column-level and\napply heuristics to filter out the repeated ones in the synthesized\ntable-level schema perturbations. We ensure that the NLQs remain\ntables or columns. Besides, to ensure the correct relationship among\nfixed, while the schema and SQL queries are adjusted as necessary\ndifferent tables after modifying the schema, we apply heuristics to\nto reflect the changes introduced by our perturbations. We follow\nensure all the foreign keys change along with their referenced table\nthe standard train/dev split provided with BIRD, and apply all the\nnames and column names. 
When removing columns or tables, any\nperturbations on both training data and evaluation data.", + "paper_id": "2603.10697", + "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution", + "authors": [ + "Tianshu Zhang", + "Kun Qian", + "Siddhartha Sahai", + "Yuan Tian", + "Shaddy Garg", + "Huan Sun", + "Yunyao Li" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10697v1", + "chunk_index": 10, + "total_chunks": 39, + "char_count": 3338, + "word_count": 517, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2f25fe24-eb7d-487a-8fa7-66f901fbdbab", + "text": "The data\nforeign keys in other tables that reference the removed columns or\nstatistics of EvoSchema are in Table 2 and the examples of different\ntables will be removed as well.\nperturbation types are in Figure 2. SQL Change: To ensure the consistency of the , after we change the relevant table schema, we re- These synthesized names replace the original column names. In\nvise the gold SQL accordingly. Since the NLQs are the same for addition, in order to maintain the correctness of the relationship\nadding or removing columns and tables, and the schema evolution among the tables, If the column in one table has been renamed, we\nhere doesn't affect answering the questions, we keep the gold SQL will also rename the foreign keys in other tables if those columns\nunchanged for these perturbation types. For renaming columns reference the renamed one. We also revise gold SQL accordingly to\nor tables, we revise gold SQL if they appear in the gold SQL. For ensure that the revised schema and gold SQL remain aligned with\ntable splitting or merging, due to the complexity and variation in the unchanged NLQ.\nthe required SQL changes, we use GPT-4 to revise the gold SQL. 
This revision is based on the mappings from the original to the new tables and columns, as well as the necessary adjustments to the JOIN paths. We manually check the edited gold SQL for the evaluation benchmark to make sure they are correct.\n3.6 Data Collection of Each Perturbation Type\nWe first define heuristics for different perturbation types, then combine both GPT models' generation ability and programming to collect the data.\nSplit columns: Since columns such as name, address, and date are often stored in more fine-grained formats in real-world databases (e.g., a full name split into first and last name; a date split into year, month and day; an address split into state, city and street, etc.), we identify examples in the BIRD dev set that involve these attributes and manually split the corresponding columns into finer-grained columns for evaluation. As these changes affect the structure of the gold SQL queries, we manually revise the gold SQL to reflect the updated schema. For the training set, we similarly select examples in the BIRD train set involving name, address, or date, and use Claude 3.5 
to synthesize the corresponding fine-grained columns and update\nFinally, we incorporate a human verification stage to control the data quality.",
    "paper_id": "2603.10697",
    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
    "authors": [
      "Tianshu Zhang",
      "Kun Qian",
      "Siddhartha Sahai",
      "Yuan Tian",
      "Shaddy Garg",
      "Huan Sun",
      "Yunyao Li"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
    "chunk_index": 11,
    "total_chunks": 39,
    "char_count": 2515,
    "word_count": 415,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d5f15874-14a3-44ab-8a0a-f804157149b5",
    "text": "the gold SQL accordingly.\nMerge columns: As the reverse of column splitting, we simulate the more abstract column representations commonly seen in real-world databases (e.g., combining first and last name into full name; year, month, and day into date; state, city, and street as address). We identify relevant examples in the BIRD dev set and manually merge fine-grained columns, updating the gold SQL accordingly. For training, we apply the same strategy to the BIRD train set and use Claude 3.5 to synthesize the merged schema and update the gold SQL.\nHere are some general heuristics we should consider to maintain consistency and avoid conflicts when manipulating data: 1) Preserve Meaning: For renaming, the new column or table name should reflect the same meaning as the original name to avoid semantic confusion. 2) Avoid Conflicts: Ensure that the new column or table name does not conflict with existing column or table names within the same or other tables in the database. 3) Update References: Update all references to the new column or tables in foreign keys in other tables. 4) Revise SQL: 
Update all SQL queries referencing the new columns or tables to work correctly after the renaming.\nAdd tables: We randomly add irrelevant tables to each question,",
    "paper_id": "2603.10697",
    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
    "authors": [
      "Tianshu Zhang",
      "Kun Qian",
      "Siddhartha Sahai",
      "Yuan Tian",
      "Shaddy Garg",
      "Huan Sun",
      "Yunyao Li"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
    "chunk_index": 12,
    "total_chunks": 39,
    "char_count": 1262,
    "word_count": 205,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f5f2876c-9f7b-4333-9c80-b83c7278a4e5",
    "text": "and these tables are still in the same database as the relevant tables in BIRD. The original BIRD datasets guarantee that no different tables in their database can lead to alternative correct SQL answers. The tables added don't affect the NLQ and the gold SQL.\nThese heuristics aim to ensure that those perturbations are performed systematically, maintaining the database's integrity and compatibility with SQL queries. The details for each perturbation type are as follows:\nAdd columns: We input both the table name and all of its column names and data types to GPT-3.5 and prompt it to generate multiple column names and their corresponding data types that are suitable and congenial with reason and common sense given the current scenario, and prompt GPT-3.5 not to generate column names that have a similar meaning to the existing input column names.\nRemove tables: In this scenario, we randomly remove tables from the relevant schema, which are referenced in the gold SQL query. As a result, the gold SQL becomes invalid. Instead, we use the response \"The given table information is insufficient to generate an SQL query to answer the question\" as the ground truth.\nRename tables: We input both the table name and all of its col
umn names and data types to GPT-3.5. Then we add a heuristic guarantee to filter out the",
    "paper_id": "2603.10697",
    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
    "authors": [
      "Tianshu Zhang",
      "Kun Qian",
      "Siddhartha Sahai",
      "Yuan Tian",
      "Shaddy Garg",
      "Huan Sun",
      "Yunyao Li"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
    "chunk_index": 13,
    "total_chunks": 39,
    "char_count": 1332,
    "word_count": 223,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "43b560b6-5e16-4fd6-abb2-3cb943635ac9",
    "text": "redundant columns if the generated column names are repeated. These synthesized columns are then randomly inserted into the relevant tables. Notably, both the NLQ and the gold SQL remain unchanged during this process.\nRemove columns: We randomly eliminate columns from the given schema, ensuring that the removed columns do not appear in the gold SQL query. Again, the NLQ and the gold SQL are kept fixed during this operation.\nWe randomly select one or multiple table names and prompt GPT-3.5 to generate similar, context-appropriate names. These synthesized names replace the original table names. In addition, in order to maintain the correctness of the relationship among the tables, we will also rename the foreign keys in other tables if they reference the renamed table. Finally, the table names in the gold SQL will also be renamed.\nSplit tables: As Figure 3 (b) shows, we input both the table name and all of its column names and data types to GPT-4. We prompt GPT-4 to identify tables that can be logically divided into two or more smaller tables. Using GPT-4, we generate new table names and\nRemove columns in gold SQL: In this scenario, we randomly remove columns from the schema, specifically targeting those referenced in the gold SQL query. 
As a result, the gold SQL becomes invalid. Instead, we use the response \"The given column information is insufficient to generate an SQL query to answer the question\" as the ground truth.\ndistribute the columns of the original table among the new tables in a contextually appropriate manner. The primary key in the original table will be copied into all the new tables after splitting. The gold",
    "paper_id": "2603.10697",
    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
    "authors": [
      "Tianshu Zhang",
      "Kun Qian",
      "Siddhartha Sahai",
      "Yuan Tian",
      "Shaddy Garg",
      "Huan Sun",
      "Yunyao Li"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
    "chunk_index": 14,
    "total_chunks": 39,
    "char_count": 1654,
    "word_count": 277,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0a21cc12-6492-4ce2-bcf1-24918c6227c7",
    "text": "SQL is revised by GPT-4 to reference the newly created tables, ensuring consistency across all components. We also manually check the new gold SQL to make sure it's correct.\nRename columns: As Figure 3 (a) shows, we input both the table name and all of its column names and data types to GPT-3.5. We randomly select multiple column names and their data types and\nTable 1: Statistics of EvoSchema compared with existing benchmarks. \"Tab\": tables; \"DB\": database; \"Col\": columns; \"PK\": primary keys; \"FK\": foreign keys. 
Schema Evolution Features of Seed Data (Average)\nPerturbation Data Column-level Table-level Multiple DB Seed Data\nAffects SQL Tab/DB Col/DB Col/Tab PK/DB FK/DB\nFootballDB [8] - reduce PK/FK references, reduce JOIN paths ✓ ✗ FIFA World Cup [1] 15 107 7.1 - 16\nDr.Spider [2] Rename ✗ ✓ ✓ Spider [2] 5.1 22.1 5.4 3.7 3.2\nADVETA [23] Add; Rename ✗ ✓ ✓ Spider [2] 5.1 22.1 5.4 3.7 3.2\nMT-TEQL [20] Add; Remove; Shuffle; Rename Split; Merge; Shuffle ✗ ✓ Spider [2] 5.1 22.1 5.4 3.7 3.2\nEvoSchema (Ours) Add; Remove; Rename; Split; Merge Split; Merge; Rename; Add; Remove ✓ ✓ BIRD [18] 7.3 72.5 10.6 6.5 9.3\nMerge Tables: We select two or more related tables and combine them into a single table. GPT-4 is used to generate a suitable name for the merged table, and the columns from the original tables are consolidated under this new table. More concretely, GPT-4 is prompted to 1) copy all the primary key columns of the original tables to the new tables after merging, but only keep one of them as the primary key of the new table, and make the others regular columns; 2) if the primary key columns in these two original tables are the same, then just keep one in the new table after merging; 3) when merging tables, if there are two columns not the primary key\nalso indicates that LLM-generated split and merge tables include around 30% low-quality data, underscoring the need for careful human validation for these two types.\n3.7 Comparison with Existing Benchmarks\nEvoSchema, as presented in Table 1, introduces a comprehensive and unique taxonomy for evaluating models' behavior under the impact of schema evolution on SQL queries, distinguishing itself from other benchmarks like Dr.Spider [2], ADVETA [23], MT-TEQL [20] and FootballDB [8]. 
Unlike Dr.Spider and ADVETA, which focus on limited perturbations such as column renaming and additions, EvoSchema encompasses a broader range of transformations, including adding, removing, renaming, splitting and merging at both the column level and table level. This diversity allows for testing systems under realistic and dynamic schema evolution scenarios. Furthermore, while MT-TEQL includes a variety of perturbations, it only modifies the columns not mentioned in the SQL, which does not consider the impact of schema evolution on SQL directly. EvoSchema uniquely integrates schema evolution with its effects on SQL queries, enabling evaluation of models in environments that closely mimic real-world database evolution challenges. Different from FootballDB [8], which mainly restructures schema to reduce\ncolumn but with the same names in the original tables, revise their column names accordingly to make them different when merging them into the new table. Finally, the gold SQL is updated by GPT-4 accordingly. We also manually check the new gold SQL to make sure it's correct.\nQuality Control: To ensure high-quality data in EvoSchema, we leverage advanced language models and rigorous human validation. Specifically, we use GPT-3.5 to generate synthesized column and table names and data types (only for columns) when adding or renaming are required. We randomly choose 200 generated examples to do manual review and reveal that GPT-3.5 demonstrates a strong understanding of the input context, effectively generating names that meet our requirement. 
For more complex operations, such as splitting or merging tables, we utilize the capabilities of the more powerful GPT-4 to handle both schema changes and corresponding\nforeign key mappings among tables and reduce JOIN paths for SQL, we define a more configurable, systematic and structured schema evolution taxonomy.",
    "paper_id": "2603.10697",
    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
    "authors": [
      "Tianshu Zhang",
      "Kun Qian",
      "Siddhartha Sahai",
      "Yuan Tian",
      "Shaddy Garg",
      "Huan Sun",
      "Yunyao Li"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
    "chunk_index": 15,
    "total_chunks": 39,
    "char_count": 4129,
    "word_count": 646,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5cd3fd47-dbd6-4cfb-ae23-38ec65652170",
    "text": "SQL modifications with high accuracy. To complement these automated processes, we engaged five annotators with substantial SQL expertise to carefully review cases involving complex schema transformations.\nBesides, our provided schema evolution and synthesis framework allows us to explore the schema change on multiple databases easily, while FootballDB is only limited to one database. Finally, for the seed data selection, compared to Spider, 
which is commonly used to study text-to-SQL robustness, BIRD features more intricate, realistic, and extensive databases, as well as more complex SQL queries that include keywords often missing in\nAnnotators validated and, where necessary, manually corrected the generated gold SQL queries to ensure correctness and alignment with the modified schemas.",
    "paper_id": "2603.10697",
    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
    "authors": [
      "Tianshu Zhang",
      "Kun Qian",
      "Siddhartha Sahai",
      "Yuan Tian",
      "Shaddy Garg",
      "Huan Sun",
      "Yunyao Li"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
    "chunk_index": 16,
    "total_chunks": 39,
    "char_count": 791,
    "word_count": 106,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "1564a09b-604e-41d9-8c0a-512c474a19e4",
    "text": "Spider. These distinctions make EvoSchema particularly well-suited for studying how systems adapt to evolving schemas, advancing beyond the simpler or less holistic setups of prior benchmarks.\nTo further enhance reliability, we implemented cross-validation by assigning complex cases to multiple annotators and resolving discrepancies through discussion or consensus. This combination of advanced AI tools and meticulous human review ensures that EvoSchema maintains a robust and accurate benchmark, faithfully reflecting real-world schema evolution scenarios.\nCost Analysis: We have 1.5K split-table examples and 1.1K merge-table examples requiring human verification. Among the split examples, 1.1K are relatively simple and take approximately 3 minutes\n3.8 Data Statistics\nTable 2 provides an overview of the data statistics in EvoSchema, showcasing the various perturbation types applied to the original BIRD dataset. 
\"Column Manipulation\" refers to applying the\neach to verify, while the remaining 0.4K are more complex and column-level operations on the columns of the original BIRD data;\nrequire about 7 minutes each—totaling roughly 100 hours. For the \"Table Manipulation\" refers to applying the table-level operations\nmerge-table examples, 0.8K are simple (3 minutes each) and 0.3K are on the tables of the original BIRD data. All the perturbed data are\ncomplex (7 minutes each), amounting to approximately 75 hours. obtained by applying column manipulation or table manipulation\nNote this manual effort was for curating the evaluation data, not the on the original BIRD dataset. \"Manipulated Items\" shows the size\ntraining data. Our training data is generated entirely automatically of the altered columns or the tables. \"Manipulated Items/Query\"\nwithout any human annotation or manual verification. Our analysis refers to the number of columns or tables modified in the schema Table 2: Data statistics of EvoSchema. \"Original\" refers to the during training. 2) with perturbation types: the model is trained\noriginal BIRD dataset; \"Column Manipulation\" refers to ap- by merging both the original training data and the perturbation\nplying the column-level operations on the columns of the training data. For closed-source models, we only use them for evaloriginal BIRD data; \"Table Manipulation\" refers to applying uation.\nthe table-level operations on the tables of the original BIRD Evaluation Setting: For all the closed-source models and the\ndata. \"*\": the evaluation data for calculating execution accu- finetuned open-sourced models, we evaluate them under two setracy. 
tings: 1) without perturbation types: this setting uses the standard, unaltered original evaluation data to evaluate the model performance. 2) with perturbation types: the models are evaluated on data where different perturbations are introduced. By comparing the model performance under these two settings, we can assess how resilient the finetuned models and GPT models are to schema evolution in NL2SQL. This setup provides a comprehensive evaluation of model performance in both standard and perturbed environments, allowing for detailed analysis of robustness and adaptability across different models and schema evolution types.\nWe synthesize values to reconstruct the database after schema evolution, and filter out those not executable by gold SQL, which results in the smaller size of the evaluation data for calculating execution accuracy.\nData Statistics\nPerturbation Type | Train | Eval | Eval* | Manipulated Items/Table: Min Mean Median Max | Manipulated Items/Query: Min Mean Median Max\nOriginal 9426 1534 1068 - - - - - - - -\nColumn Manipulation\nAdd Columns 9219 1506 846 1 5.7 3 83 1 5.9 4 43\nRemove Columns 9426 1534 1076 1 6.2 2 87 1 6.9 3 70\nRemove Col in SQL 9424 1534 - 1 2.5 2 8 1 2.5 2.5 6\nRename Columns 9385 1533 947 1 4.3 3 46 1 4.4 3 46\nSplit Columns 140 37 37 1 2 2 4 1 2 2 4\nMerge Columns 148 44 44 2 3 3 4 2 3 3 4\nTable Manipulation\nAdd Tables 9387 1530 1014 - - - - 1 2 2 3\nRemove Tables 7212 1171 - - - - - 1 1 1 1\nRename Tables 9392 1534 1063 - - - - 1 1.5 1 4\nSplit Tables 9254 1515 824 - - - - 1 2.6 3 5\n5.2 Evaluation Metrics\n1) Table Match F1: this score is a metric to measure how well the model correctly identifies the relevant tables required to generate a valid SQL query. 
The F1 score is a harmonic mean of precision and recall, where the precision is the percentage of tables correctly predicted out of all tables predicted by the model and the recall is the percentage of tables correctly predicted out of all the actual tables that should have been selected.\nMerge Tables 6930 1139 569 - - - - 2 2 2 2\nfor each SQL query, specifically targeting the tables relevant to gen",
    "paper_id": "2603.10697",
    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
    "authors": [
      "Tianshu Zhang",
      "Kun Qian",
      "Siddhartha Sahai",
      "Yuan Tian",
      "Shaddy Garg",
      "Huan Sun",
      "Yunyao Li"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
    "chunk_index": 17,
    "total_chunks": 39,
    "char_count": 4689,
    "word_count": 767,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "391accef-19f8-43e5-b7ea-d903c0e9315a",
    "text": "erating that query. For \"Split Tables,\" \"Manipulated Items/Query\" represents the number of tables each original table is split into. For \"Merge Tables\", \"Manipulated Items/Query\" indicates the number of tables combined into a single table.\n4 TRAINING PARADIGM\nIn our work, we propose a new training paradigm to enhance the model's robustness against different schema evolution. For each\nThe Table Match F1 score combines these two metrics to provide a balanced evaluation, which can assess the ability of text-to-SQL models to correctly identify the required tables from the database schema to form accurate queries. A higher Table Match F1 indicates better performance in selecting the correct tables for the SQL query.\n2) Column Match F1: this score is to evaluate how accurately the model identifies the relevant columns required to generate a valid SQL query from a natural language input. 
Like the Table Match F1, it measures the balance between precision and recall but is applied specifically to the columns of the database. A higher Column Match F1 score indicates better performance in selecting the right columns for the SQL query.\n3) Execution Accuracy: this metric measures whether the predicted SQL query can return the correct results as the gold SQL when executing against a database.\n<NLQ, schema, SQL> triple, we fix the NLQ in the training data, and augment each triple with different schema designs, which may or may not lead to SQL change. Consequently, we obtain multiple triples that can be derived from each of the original triples. We train the model by learning the mappings from multiple schema designs and SQLs to the original question, which can improve the model's ability to identify the correct relationships among different tables and columns to the question, and can better distinguish the differences among different schema designs. Through this procedure, the model can better avoid learning spurious patterns and therefore enhance the robustness against different schema evolution types.\n5.3 Training and Evaluation Details",
    "paper_id": "2603.10697",
    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
    "authors": [
      "Tianshu Zhang",
      "Kun Qian",
      "Siddhartha Sahai",
      "Yuan Tian",
      "Shaddy Garg",
      "Huan Sun",
      "Yunyao Li"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
    "chunk_index": 18,
    "total_chunks": 39,
    "char_count": 2060,
    "word_count": 322,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "370370f7-2cf2-4e1c-a14c-ba43cb8cc324",
    "text": "5 EXPERIMENT SETUP\nWe choose Code Llama-7B [25], Mistral-7B [12], Llama 3-8B [7] and SQLCoder-7B 4 as our open-source base models. We fine-tune these models with the Huggingface transformers library [31]. 
For the perturbation training, we merge all the perturbation data and randomly shuffle them as our final training data. We use a learning rate of 2e-5 for training Code Llama, Llama 3 and SQLCoder, and 5e-6 for training Mistral. We train all the models on 4 A100 80GB GPUs and use a cosine scheduler with a 0.03 warm-up period for 6 epochs.\n5.1 Training and Evaluation Settings\nTraining Setting: We choose four open-source models: Code Llama-7B [25], Mistral-7B [12], Llama 3-8B [7] and SQLCoder-7B 4 and two closed-source models: GPT-3.5 5 and GPT-4 [22] for our experiments. For these four open-source models, we explore two",
    "paper_id": "2603.10697",
    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
    "authors": [
      "Tianshu Zhang",
      "Kun Qian",
      "Siddhartha Sahai",
      "Yuan Tian",
      "Shaddy Garg",
      "Huan Sun",
      "Yunyao Li"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
    "chunk_index": 19,
    "total_chunks": 39,
    "char_count": 829,
    "word_count": 136,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f4f83d04-3332-4c8a-abbb-ed7dae4be747",
    "text": "settings: 1) without perturbation types: the model is trained on the original training data without any perturbation types introduced\nWe employ FSDP [37] to efficiently train the model. We set the max input length of training as 1024 and the max output length of inference as 500. For inference, we use vllm [31] for batch evaluation, and we set the batch size as 16. We do the inference on an 80G A100 GPU. For closed-source LLMs, we use\n4 https://huggingface.co/defog/sqlcoder-7b-2\n5 https://openai.com/chatgpt/\nTable 3: Evaluation on EvoSchema. \"w/\": the model is trained by merging the original data and all the perturbation training types together; \"w/o\": the model is only trained on the original training data. The best performance for each model is in bold, and red shows a larger gain. 
\"-\": some of the relevant tables are removed so there should be no gold SQL used to calculate the\nmetrics here. Code Llama Mistral Llama 3 SQLCoder GPT-3.5 GPT-4\nPerturbation Type\nw/o w/ w/o w/ w/o w/ w/o w/", + "paper_id": "2603.10697", + "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution", + "authors": [ + "Tianshu Zhang", + "Kun Qian", + "Siddhartha Sahai", + "Yuan Tian", + "Shaddy Garg", + "Huan Sun", + "Yunyao Li" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10697v1", + "chunk_index": 20, + "total_chunks": 39, + "char_count": 1000, + "word_count": 167, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4ef5ccf1-52a8-4e4b-bf5a-507c4c96229c", + "text": "Original 89.77 90.42 89.58 90.62 89.96 89.53 89.69 90.64 87.28 88.98 Add Columns 89.73 90.27 89.65 90.03 89.08 89.70 89.30 90.52 86.35 88.12\nRemove Columns 89.82 90.24 89.89 90.66 90.09 89.82 89.81 90.54 87.18 88.87\nRename Columns 85.28 85.07 84.32 84.27 83.74 82.92 85.32 84.93 81.73 83.20\nSplit Columns 83.78 89.19 83.78 88.29 81.08 85.14 86.49 88.29 81.44 86.31\nMerge Columns 88.65 87.23 87.23 89.72 88.65 86.17 87.23 87.23 83.17 89.36 Add Tables 57.88 89.50 57.67 89.30 55.11 88.51 57.44 89.38 83.54 85.79\nRemove Tables - - - - - - - - - -\nRename Tables 88.84 90.32 89.40 90.56 87.18 89.14 89.40 90.48 87.02 88.45\nSplit Tables 71.99 81.55 66.12 80.87 71.08 80.12 72.52 81.92 77.52 80.68\nMerge Tables 85.29 87.03 83.39 86.91 81.68 86.48 84.80 86.35 83.04 86.99 MacroAvg 83.10 88.08 82.10 88.12 81.77 86.75 83.20 88.03 83.83 86.68 Original 80.66 81.64 81.10 82.36 79.13 78.72 81.52 81.97 78.28 80.78 Add Columns 78.26 80.27 79.16 80.18 75.79 76.87 79.09 80.46 75.03 78.58\nRemove Columns 82.67 82.75 83.09 84.00 81.56 80.69 83.20 83.18 80.33 82.55\nRename Columns 76.50 76.94 76.35 76.73 72.24 71.07 76.84 77.38 73.40 75.90\nSplit Columns 71.22 81.81 70.24 80.41 67.29 75.04 74.50 79.92 73.59 77.92\nMerge Columns 83.19 
83.30 82.75 83.41 82.72 83.68 82.64 83.31 78.13 88.56\nAdd Tables 63.81 81.14 65.39 81.09 59.36 77.96 62.91 81.23 76.45 79.32\nRemove Tables - - - - - - - - - -\nRename Tables 79.60 80.91 80.32 81.29 77.49 77.46 80.77 81.79 77.78 80.04\nSplit Tables 75.30 78.45 73.87 78.11 73.81 73.95 75.83 78.59 74.89 77.41\nMerge Tables 65.56 67.09 64.12 67.46 63.50 64.40 65.57 67.29 63.23 68.13\nMacroAvg 75.68 79.43 75.64 79.50 73.29 75.98 76.29 79.51 75.11 78.92",
    "paper_id": "2603.10697",
    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
    "authors": [
      "Tianshu Zhang",
      "Kun Qian",
      "Siddhartha Sahai",
      "Yuan Tian",
      "Shaddy Garg",
      "Huan Sun",
      "Yunyao Li"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
    "chunk_index": 21,
    "total_chunks": 39,
    "char_count": 1665,
    "word_count": 284,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "6793ce80-93e6-471e-bc09-2f5c585b8f78",
    "text": "We use the 2023-12-01-preview version for GPT-4, and the 2023-07-01-preview version for GPT-3.5.\nFor comparison with our primary fine-tuning approach, we use a fine-tuned Code Llama model trained without any schema perturbation data as the SQL generation model. This setup allows us to isolate and evaluate the effectiveness of a schema selection and pruning component in addressing schema evolution. The results are shown in Table 4.\n5.4 Baselines\nWe add in-context learning [10] and a more advanced method, CHESS [28], as baselines for comprehensive comparison. In order to test whether in-context learning can help address the schema evolution issue, we randomly select three examples (each example is an <NLQ, schema, SQL> triple) as the demonstration in the prompt to help the models understand the schema after evolution (Table 4).\n6 RESULTS AND ANALYSIS\nAs Table 3 and Table 5 show, we train Codellama, Mistral, Llama3 
and SQLCoder on the original BIRD training data with and without different perturbation types, and evaluate the model on the original BIRD evaluation data and different perturbation types. We observe:\nWe also include CHESS, an advanced method for NL2SQL, as a baseline. We apply the schema selection (SS) and candidate generation (CG) components developed in their work.",
    "paper_id": "2603.10697",
    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
    "authors": [
      "Tianshu Zhang",
      "Kun Qian",
      "Siddhartha Sahai",
      "Yuan Tian",
      "Shaddy Garg",
      "Huan Sun",
      "Yunyao Li"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
    "chunk_index": 22,
    "total_chunks": 39,
    "char_count": 1359,
    "word_count": 209,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c91be887-9a58-4b34-95a2-056ed0e8a546",
    "text": "For schema selection, we use the advanced gpt-4o model to prune the database schema and remove the irrelevant tables and the irrelevant columns in the selected tables, ensuring only the most relevant tables and columns are passed into the model for SQL generation. To ensure a fair\n6 https://learn.microsoft.com/en-us/azure/ai-services/openai/reference\nThe models trained on different perturbation types are more robust to the schema variation on average, and demonstrate high robustness on the table-level schema evolution. While adding the perturbation data during training leads to a slight Exec Acc (EX) drop for the original non-evolved evaluation data and the adding, removing and renaming column types, it achieves significantly better results on splitting columns and table-level perturbation types. 
By comparing these four models' performance with and without perturbation data, we observe that for splitting columns, the model trained with perturbation data can achieve up to 5.4 points gain for table match F1, 10.6 points gain for column match F1 and 24 points gain for EX; for adding tables, the model trained with perturbation data can achieve up to 33 points gain for table match F1, 18 points gain for column match F1 and 19 points for EX; for splitting tables, the model trained with perturbation data can achieve up to 14 points gain for table match F1, 4.2 points gain for column match F1 and 12 points for EX; for merging tables, the model trained on perturbation data can achieve up to 4.8 points gain on table match F1 and 3 points gain for column match F1. We hypothesize that this is because the perturbation augmented data is particularly beneficial for handling substantial schema changes, but may introduce minor noise in simpler schema changes where the model trained without perturbation data has already maximally learned the patterns.\nTo better understand the slight performance gap under simpler column-level perturbations, we conducted error analysis and case studies to compare models trained with and without perturbed data. We observed two types of errors that lead to this phenomenon: (1) Spurious or missing conditions in the WHERE clause. For instance, given the question \"What is the element with the atom ID of TR004_7 in molecule that is not carcinogenic?\", the model trained with perturbation (\"w/\") misses the condition T2.label = '-' in the WHERE clause, while the \"w/o\" model includes it correctly. However, in another case, 'How many transactions were paid in CZK on the morning of 2012/8/26?', the \"w/\" model introduces an unnecessary WHERE condition: T1.TransactionID BETWEEN 1 AND 1000, which is not part of the gold SQL. (2) Incorrect column selection in SELECT or WHERE clauses. 
Table 4: Human Evaluation on EvoSchema. \"ZS\" refers to zero-shot, which prompts models without any examples. \"ICL\" refers to in-context learning, which prompts models with three demonstration examples. \"w/o\" means fine-tuning the model without perturbation training data; \"w/\" means fine-tuning the model with perturbation training data.\nFor example, for the question \"Among the patients followed at the outpatient clinic, how many of them have a normal level of alkaliphophatase?\", the \"w/\" model predicts T1.Description instead of T1.Admission in the WHERE clause, while the \"w/o\" model selects the correct column. Similarly, in the question \"Which group does superhero A-Bomb belong to?\", the \"w/\" model",
    "paper_id": "2603.10697",
    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
    "authors": [
      "Tianshu Zhang",
      "Kun Qian",
      "Siddhartha Sahai",
      "Yuan Tian",
      "Shaddy Garg",
      "Huan Sun",
      "Yunyao Li"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
    "chunk_index": 23,
    "total_chunks": 39,
    "char_count": 3409,
    "word_count": 528,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "83e75612-f1ca-40d1-9a55-84fc77cd23de",
    "text": "selects T2.team_affiliation instead of the correct T2.race. These examples suggest that while training with perturbed data can improve general robustness, especially beneficial for handling substantial schema changes, it may also introduce minor noise that misleads in condition or column selection under simpler perturbations.\nBold color indicates the best performance among each row.\nHuman Evaluation on EvoSchema\nPerturbation Type | GPT-4 (ZS, ICL) | Code Llama (w/o, w/) | CHESS_SS+CG\nOriginal 62 58 65 64 63\nClosed-source models are robust to different schema evolution types in general. 
Human Evaluation on EvoSchema
Perturbation Type    GPT-4 ZS   GPT-4 ICL   Code Llama w/o   Code Llama w/   CHESS(SS+CG)
Original                62         58            65               64             63
Add Columns             59         55            62               61             66
Remove Columns          65         61            66               63             64
Rename Columns          57         56            57               57             62
Split Columns           46         59            41               62             49
Merge Columns           68         66            70               70             66
Add Tables              56         55            46               62             57
Remove Tables           -          -             -                -              -
Rename Tables           58         60            64               61             61
Split Tables            57         53            48               60             53
Merge Tables            55         57            54               58             53
MacroAvg                58         58            57               62             59

As Tables 3 and 5 show, we compare the performance of the GPT models and the four open-source models trained with and without the perturbation types. We observe that the GPT models' performance is relatively stable across the different perturbation types relative to the original, non-evolved test set. In contrast, fine-tuned open-source models without perturbation training data exhibit significant performance drops, particularly on split columns, add tables, split tables, and merge tables, which introduce larger schema changes. We hypothesize that the stability and robustness of the closed-source models stem from broader pretraining exposure and stronger internal schema reasoning capabilities, while the open-source models trained without the perturbation types are more sensitive due to limited training on diverse schema variations. This motivates fine-tuning open-source models with perturbation training data to improve their generalization under schema evolution. Moreover, comparing the open-source and closed-source LLMs, the models trained with perturbation data outperform the GPT models on both the column-level and table-level perturbation evaluation data, indicating that our models trained with perturbation data are more robust than the GPT models.

Table 5: Execution Accuracy on EvoSchema. "w/": the model is trained with all the perturbation types; "w/o": the model is only trained on the original training data.
Exec Acc on EvoSchema
Perturbation Type    Code Llama     Mistral      Llama 3      SQLCoder     GPT-3.5  GPT-4
                     w/o   w/      w/o   w/     w/o   w/     w/o   w/
Original             58    57      59    58     55    51     58    58       44      47
Add Columns          57    55      56    56     52    49     55    57       43      46
Remove Columns       59    57      60    58     56    53     60    58       45      47
Rename Columns       54    52      55    54     49    47     56    55       43      45
Split Columns        41    62      35    54     38    49     43    67       41      46
Merge Columns        70    70      70    70     73    73     66    82       61      68
Add Tables           40    58      39    58     37    52     40    57       44      48
Remove Tables        -     -       -     -      -     -      -     -        -       -
Rename Tables        56    55      55    56     52    50     56    55       43      47
Split Tables         38    46      36    48     40    41     43    49       40      47
Merge Tables         43    45      45    46     42    44     47    46       37      45
MacroAvg             52    56      51    56     49    51     52    58       44      49

Table-level perturbation has a larger impact than column-level perturbation on model performance. As Tables 3 and 5 show, compared with the performance on the original evaluation data, adding tables and splitting tables lead to a significant table match F1 drop, while adding tables, splitting tables and merging tables lead to a significant column match F1 drop. This phenomenon indicates that adding tables or splitting tables easily confuses the models in choosing the correct tables to generate the SQL query. For merging tables, even though the model can correctly choose tables, it is harder for the model to pick the correct columns when the columns from different tables go into the same table. For column-level performance, there are limited differences from the performance on the original data except for splitting columns.

Table 6: Perturbation type ablation on EvoSchema. The base model is Code Llama.
\"both\": the model is trained with\nthe column-level performance, there are limited differences with both column-level perturbation and table-level perturbation\nthe performance on the original data except for splitting columns. types; \"w/o table-p\": the model is trained without table-level\nReducing table schema complexity is beneficial for model perturbation types; \"w/o column-p\": the model is trained\nperformance. Compare the model performance on column-level without column-level perturbation types.\nperturbation evaluation and the original evaluation data, adding\nPerturbation Type Ablation\ncolumns results in a decrease in column match F1, whereas removTable Match F1 Column Match F1\ning columns leads to an increase in column match F1. It indicates Perturbation Type both w/o table-p w/o column-p both w/o table-p w/o column-p\nsimpler table schema is beneficial for models to select columns, Original 90.73 90.80 (+0.07) 90.04 (-0.69) 81.09 82.15 (+1.06) 80.49 (-0.60)\nas removing columns simplifies the table schema while adding Add Columns 90.86 90.80 (-0.06) 89.75 (-1.11) 79.63 80.81 (+1.18) 77.29 (-2.34)\n(+0.11) (-0.24) (+0.57) (-0.67) Columns 90.72 90.83 90.48 83.28 83.85 82.61columns makes the table schema more complex. Remove\n(+0.03) (-0.78) (+1.04) (-1.32) Rename Columns 85.35 85.38 84.57 76.49 77.53 75.17 Add Tables 88.95 58.94 (-30.01) 88.57 (-0.38) 79.87 64.11 (-15.76) 79.33 (-0.54)\nRemove Tables - - - - - -\n6.2 Comparison of Different Baselines Rename Tables 90.54 90.77 (+0.23) 89.29 (-1.25) 81.13 81.51 (+0.38) 79.33 (-1.80)\nSplit Tables 80.71 73.28 (-7.43) 79.05 (-1.66) 77.41 75.95 (-1.46) 76.30 (-1.11)\nAs EvoSchema has a large scale of the test set and we need to call Merge Tables 88.72 87.87 (-0.85) 86.83 (-1.89) 68.40 68.26 (-0.14) 67.08 (-1.32)\nGPT-4 and GPT-4o API for in-context learning and CHESS respectively, to save the cost, we randomly select 200 examples for the\nTable 7: Out of Scope Effect on EvoSchema. 
raw BIRD test set and also from each perturbation type to compare the different baselines. We compare GPT-4 zero-shot prompting, GPT-4 3-shot in-context learning, CodeLlama trained with and without perturbation training data, and CHESS (with schema selection (SS) and candidate generation (CG)) on our downsampled test set. Since Exec Acc can still make mistakes, because different SQL queries sometimes produce the same results even when they do not align with the NLQ, or both the gold SQL and a wrong predicted SQL return empty results, we use human evaluation here for a more precise assessment. As Table 4 shows, compared to GPT-4 zero-shot (ZS), in-context learning (ICL) shows a significant advantage only on the split columns perturbation, while performing slightly better or worse on the other types. This suggests that ICL is not consistently effective for handling schema evolution.

Table 7: Out of Scope Effect on EvoSchema. The base model is Code Llama. "w/o": the model is trained without perturbation types; "w/": the model is trained on the original data and all the perturbation types; "+ OOS": the model is trained on the original data, the perturbation types, and two out-of-scope (OOS) perturbation types; "+ OOS FP": the model trained with the two OOS perturbation types makes an incorrect prediction on the original data and in-scope perturbation data; "+ OOS TP": the model trained with the two OOS perturbation types makes the correct prediction on the two OOS perturbation data; "Tab": the model refuses to predict SQL due to the lack of table information; "Col": the model refuses to predict SQL due to the lack of column information.
Out of Scope Effect
                     Table Match F1                      Column Match F1                     + OOS FP       + OOS TP
Perturbation Type    w/o     w/      + OOS              w/o     w/      + OOS               Tab    Col     Tab    Col
Original             89.77   90.42   82.98 (-7.44)      80.66   81.64   75.43 (-6.21)       7.11   0.65    -      -
Add Columns          89.73   90.27   86.07 (-4.20)      78.26   80.27   77.00 (-3.27)       4.25   0.40    -      -
Remove Columns       89.82   90.24   82.24 (-8.00)      82.67   82.75   75.90 (-6.85)       7.56   0.72    -      -
Remove Col in SQL    -       -       -                  -       -       -                   5.02   -       -      84.03
Rename Columns       85.28   85.07   80.20 (-4.87)      76.50   76.94   73.04 (-3.90)       4.44   0.20    -      -
Add Tables           57.88   89.50   88.78 (-0.72)      63.81   81.14   80.71 (-0.37)       0.33   0.07    -      -
Remove Tables        -       -       -                  -       -       -                   -      1.62    83.86  -
Rename Tables        88.84   90.32   86.36 (-3.96)      79.60   80.91   78.06 (-2.85)       3.52   0.39    -      -
Split Tables         71.99   81.55   81.07 (-0.48)      75.30   78.45   78.02 (-0.43)       0.26   0.07    -      -
Merge Tables         85.29   87.03   82.18 (-5.15)      65.56   67.09   63.59 (-3.50)       4.65   0.35    -      -

We hypothesize this is because the demonstration examples in ICL cannot cover the full range of schema and SQL changes; thus, for examples that differ significantly from the demonstrations, ICL offers limited benefit. However, for split columns, where changes commonly involve patterns like name, address, or date splits, the demonstrations generalize better, making ICL more effective in this case. For CHESS, we use GPT-4o, a powerful closed-source model, for schema selection and pruning, and Code Llama without perturbation training (CodeLlama w/o) as the SQL generation model.
CHESS achieves the best performance on add columns and rename columns, and significantly outperforms CodeLlama w/o on split columns, add tables, and on average. This highlights the importance of accurate schema selection and pruning in improving SQL generation. However, we also observe that errors at the pruning stage can propagate, leading to degraded performance: in the merge columns and merge tables cases, CHESS tends to over-prune, omitting relevant schema information and producing worse results. Finally, we found that fine-tuning CodeLlama with perturbation training data is still needed, since this method achieves the best performance among all the baselines on average across all types of evaluation data, and performs significantly better than the others on the split columns, add tables, split tables, and merge tables types. We applied McNemar's Test [21] to measure the statistical significance of the performance differences between our method and each baseline. We computed p-values using the statsmodels package, considering differences statistically significant when p < 0.05, which indicates that the improvement is unlikely to be due to random chance. Using this test, we observed that our method achieved statistically significant improvements over three key baselines: GPT-4 in-context learning, fine-tuning without perturbed data, and CHESS (all with p < 0.05).
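The significance test above pairs each baseline with our method on the same examples and looks only at the discordant outcomes. The paper uses the statsmodels implementation; the following stdlib-only sketch shows the asymptotic (uncorrected) form of the statistic on the two discordant counts, as an illustration rather than the authors' evaluation code:

```python
import math

def mcnemar_chi2(b: int, c: int) -> tuple:
    """Asymptotic McNemar test on discordant pair counts.

    b = examples our model answers correctly but the baseline does not,
    c = examples the baseline answers correctly but our model does not.
    Returns (chi-square statistic, two-sided p-value with 1 df).
    """
    stat = (b - c) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X > stat) = erfc(sqrt(stat / 2))
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p

# Hypothetical discordant counts, chosen only to exercise the function:
stat, p = mcnemar_chi2(30, 10)
print(round(stat, 2), p < 0.05)
```

With b = 30 and c = 10 the statistic is 10.0 and p is roughly 0.0016, comfortably below the 0.05 threshold used above. The statsmodels version additionally offers an exact binomial variant and a continuity correction, which matter when the discordant counts are small.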
6.3 Influence of Perturbation Types
We explore the effect of the column-level and table-level perturbation types. As Table 6 shows, we train the model with both column-level and table-level perturbation types, and compare it with the model trained without table-level perturbation types and the model trained without column-level perturbation types. In these experiments, we found that without training on table-level perturbations, the model performance can be slightly better than the model trained with both on the column-level perturbation types, while it leads to a significant performance drop on the table-level perturbation types. This indicates that the table-level perturbation data has a limited effect on the column-level perturbation types while having a huge impact on the table-level perturbation types. When looking at the model trained only on table-level perturbation types, we found that the model performance on both column-level and table-level perturbation types dropped. This indicates that the column-level perturbation types can still benefit the training.

Error analysis further shows that models trained without perturbation types tend to predict SQL queries that join all available tables, even when some tables are irrelevant to the NLQs and SQLs. We hypothesize that this occurs because, during training without perturbations, the model only sees relevant table schemas, causing it to learn spurious patterns that always try to join all the input tables. To explore whether simply adding irrelevant tables could yield performance similar to models trained with perturbation data, we conducted an experiment where we trained CodeLlama on BIRD with irrelevant tables added to the input table schema. As shown in Table 8, adding irrelevant tables led to similar performance on the "Add Tables" perturbation type, but it caused a performance drop on the other perturbation types.
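The kind of input-schema manipulation described above can be sketched as small transformations over a schema dictionary. The helper names and toy schema below are invented for illustration and are not the paper's actual augmentation pipeline:

```python
def add_irrelevant_table(schema: dict, distractor: dict) -> dict:
    """Return a copy of the input schema with an unrelated table appended."""
    out = dict(schema)
    out.update(distractor)
    return out

def split_column(schema: dict, table: str, column: str, parts: list) -> dict:
    """Replace one column by several finer-grained columns (e.g. name -> first/last)."""
    out = {t: list(cols) for t, cols in schema.items()}
    cols = out[table]
    i = cols.index(column)
    out[table] = cols[:i] + parts + cols[i + 1:]
    return out

# Toy schema, invented for illustration:
schema = {"patients": ["ID", "name", "Admission"]}
aug1 = add_irrelevant_table(schema, {"superhero": ["id", "race"]})
aug2 = split_column(schema, "patients", "name", ["first_name", "last_name"])
print(aug1)
print(aug2["patients"])
```

Pairing such perturbed schemas with unchanged (or correspondingly rewritten) gold SQL is what discourages the spurious "join every input table" pattern: the model now sees inputs where some tables must be ignored.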
This suggests that combining all perturbation data is necessary to train a more robust model.

Table 8: Irrelevant tables effect. "w/": the model is trained with all the perturbation types; "w/o": the model is only trained on the original training data; "w/o+": the model is only trained on the original training data, but for the input table schema we also add irrelevant tables.

Table 9: Intra-database Effect. This experiment emphasizes that the training and evaluation occur within the same database, instead of across databases.
Add Irrelevant Tables Effect
                     Table Match F1               Column Match F1
Perturbation Type    w/o     w/o+    w/          w/o     w/o+    w/
Original             89.77   87.65   90.42       80.66   79.24   81.64
Add Columns          89.73   86.35   90.27       78.26   75.31   80.27
Remove Columns       89.82   87.30   90.24       82.67   80.74   82.75
Rename Columns       85.28   81.90   85.07       76.50   73.28   76.94
Add Tables           57.88   88.01   89.50       63.81   79.51   81.14
Remove Tables        -       -       -           -       -       -
Rename Tables        88.84   86.84   90.32       79.60   78.47   80.91
Split Tables         71.99   67.27   81.55       75.30   70.39   78.45
Merge Tables         85.29   83.56   87.03       65.56   63.59   67.09

Intra-database Effect
                     Table Match F1    Column Match F1
Perturbation Type    w/o     w/        w/o     w/
Original             87.24   87.43     79.54   80.89
Add Columns          87.14   87.43     76.36   78.92
Remove Columns       87.29   87.27     81.14   81.29
Rename Columns       85.71   86.43     77.45   79.09
Add Tables           61.13   83.95     66.11   78.57
Remove Tables        -       -         -       -
Rename Tables        86.33   86.67     79.44   79.96
Split Tables         71.82   78.52     75.09   77.42
Merge Tables         85.11   87.44     71.43   74.72

6.6 Influence of Intra-DB and Cross-DB
We hypothesize that a model trained on the same databases may not only learn schema evolution patterns but also become familiar with specific table and column names. To test this, we split the BIRD training data into train/test sets to ensure that each database in the test set also appears in the training set. We use Code Llama as the base model. The results in Table 9 show that, for most perturbation types, the model's performance improves more compared to the cross-database scenario in Section 6.1, which verifies our hypothesis.

6.4 Influence of Out-of-scope Types
We evaluate both in-scope and out-of-scope scenarios. In in-scope settings, schema changes may or may not alter the gold SQL. Out-of-scope cases involve two special perturbations: (1) removing columns used in the gold SQL, and (2) removing tables used in the gold SQL. In both cases, the schema lacks critical information, and the model is expected to abstain from generating a query.
To assess their impact, we train a model on a combined dataset that includes both out-of-scope and in-scope perturbation types, along with the original training data. We compare this model to others trained only on the original or in-scope data. As shown in Table 7, incorporating out-of-scope types results in performance degradation across both the original and in-scope evaluation sets. Error analysis reveals that the model trained with out-of-scope data tends to make more conservative predictions, sometimes abstaining even when the gold SQL is valid. Further analysis shows that the false positive (FP) rate closely matches the performance drop between models with and without out-of-scope training, confirming that increased conservatism is the main cause.
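The abstention behaviour the out-of-scope data is meant to induce can be sketched as a simple schema check. This is a toy function with invented names; a real model must learn the behaviour from data rather than apply such a rule:

```python
def maybe_generate(required_columns: set, schema_columns: set):
    """Abstain when the schema no longer contains columns the query needs.

    Toy stand-in for the behaviour targeted by out-of-scope training:
    refuse to emit SQL instead of guessing a substitute column.
    """
    missing = required_columns - schema_columns
    if missing:
        return None  # abstain: the schema lacks critical information
    return "SELECT " + ", ".join(sorted(required_columns)) + " FROM t"

# Schema after an out-of-scope "remove columns" perturbation dropped 'label':
print(maybe_generate({"atom_id", "label"}, {"atom_id", "element"}))  # None
print(maybe_generate({"atom_id"}, {"atom_id", "element"}))
```

The ~16% of cases where the trained model still emits SQL on out-of-scope inputs correspond to this check failing to fire, typically because the model substitutes a superficially similar column instead of abstaining.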
Additionally, for the out-of-scope perturbations, the TP rate is only around 84%, which indicates that the model still has a roughly 16% chance of making a prediction even when there should not be an SQL.

6.5 Influence of Irrelevant Tables
We observed that the model trained with perturbation types demonstrates significant robustness to table-level perturbations, such as adding and splitting tables. Upon analyzing the errors, we found that models trained without perturbation types tend to join all available input tables, even irrelevant ones (see Table 8).

7 CONCLUSION
In conclusion, we formulate the critical challenge of schema evolution in adaptive text-to-SQL systems and introduce EvoSchema, a comprehensive, diverse and unique benchmark designed specifically to study and address this problem. We developed a structured taxonomy of schema evolution types, enabling the synthesis of realistic schema designs through column-level and table-level perturbations. Using this taxonomy, we construct an evaluation benchmark to rigorously assess model robustness under schema changes, and we also introduce a novel training paradigm that augments existing training triples with diverse schema designs to improve robustness against schema evolution.

ACKNOWLEDGMENTS
The authors would like to thank colleagues from the OSU NLP group for their insightful discussions and constructive suggestions, and all anonymous reviewers for their thoughtful comments.

REFERENCES
Psychometrika 12, 2 (1947), 153–157.", + "paper_id": "2603.10697", + "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution", + "authors": [ + "Tianshu Zhang", + "Kun Qian", + "Siddhartha Sahai", + "Yuan Tian", + "Shaddy Garg", + "Huan Sun", + "Yunyao Li" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10697v1", + "chunk_index": 29, + "total_chunks": 39, + "char_count": 447, + "word_count": 56, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "db459022-0f54-4d26-941c-a611e72258c1", + "text": "Ash, William Yang Wang, Zhiguo Wang, Vittorio Castelli, Patrick Ng, and [22] OpenAI. 2024. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL] https://arxiv. Dr.Spider: A Diagnostic Evaluation Benchmark towards org/abs/2303.08774\nText-to-SQL Robustness. In The Eleventh International Conference on Learning [23] Xinyu Pi, Bing Wang, Yan Gao, Jiaqi Guo, Zhoujun Li, and Jian-Guang Lou. 2022. Representations. https://openreview.net/forum?id=Wc5bmZZU9cy Towards Robustness of Text-to-SQL Models Against Natural and Realistic Ad-\n[3] Anthony Cleve, Maxime Gobert, Loup Meurice, Jerome Maes, and Jens Weber. versarial Table Perturbation. In Proceedings of the 60th Annual Meeting of the\n2015.", + "paper_id": "2603.10697", + "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution", + "authors": [ + "Tianshu Zhang", + "Kun Qian", + "Siddhartha Sahai", + "Yuan Tian", + "Shaddy Garg", + "Huan Sun", + "Yunyao Li" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10697v1", + "chunk_index": 30, + "total_chunks": 39, + "char_count": 685, + "word_count": 88, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "114bacfa-be9c-4d88-a5b1-f3e017b8afcf", + "text": "Understanding database schema evolution: A case study. 
Science of Association for Computational Linguistics (Volume 1: Long Papers), Smaranda\nComputer Programming 97 (2015), 113–121. Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for Com-\n[4] Daiga Deksne and Raivis Skadin, š. 2022. Virtual Assistant for Querying Databases putational Linguistics, Dublin, Ireland, 2007–2022. https://doi.org/10.18653/v1/\nin Natural Language. In Proceedings of the Future Technologies Conference. 2022.acl-long.142\nSpringer, 555–564. [24] Joaquin Quionero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D.\n[5] Julien Delplanque, Anne Etien, Nicolas Anquetil, and Olivier Auverlot. 2018. Dataset Shift in Machine Learning. The MIT Press.\nlational Database Schema Evolution: An Industrial Case Study. In 2018 IEEE [25] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, XiaoInternational Conference on Software Maintenance and Evolution (ICSME). qing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy\n635–644. https://doi.org/10.1109/ICSME.2018.00073 Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cris-\n[6] Xiang Deng, Ahmed Hassan Awadallah, Christopher Meek, Oleksandr Polozov, tian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade\nHuan Sun, and Matthew Richardson. 2021. Structure-Grounded Pretraining Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas\nfor Text-to-SQL. In Proceedings of the 2021 Conference of the North American Scialom, and Gabriel Synnaeve. 
2024.", + "paper_id": "2603.10697", + "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution", + "authors": [ + "Tianshu Zhang", + "Kun Qian", + "Siddhartha Sahai", + "Yuan Tian", + "Shaddy Garg", + "Huan Sun", + "Yunyao Li" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10697v1", + "chunk_index": 31, + "total_chunks": 39, + "char_count": 1593, + "word_count": 203, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "81925b0b-5680-4410-9def-c7c3564912d0", + "text": "Code Llama: Open Foundation Models for\nChapter of the Association for Computational Linguistics: Human Language Code. arXiv:2308.12950 [cs.CL] https://arxiv.org/abs/2308.12950\nTechnologies. Association for Computational Linguistics. https://doi.org/10. [26] Yewei Song, Saad Ezzini, Xunzhu Tang, Cedric Lothritz, Jacques Klein, Tegawendé\n18653/v1/2021.naacl-main.105 Bissyandé, Andrey Boytsov, Ulrick Ble, and Anne Goujon. 2024. Enhancing\n[7] Abhimanyu Dubey and et al. 2024. The Llama 3 Herd of Models.", + "paper_id": "2603.10697", + "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution", + "authors": [ + "Tianshu Zhang", + "Kun Qian", + "Siddhartha Sahai", + "Yuan Tian", + "Shaddy Garg", + "Huan Sun", + "Yunyao Li" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10697v1", + "chunk_index": 32, + "total_chunks": 39, + "char_count": 503, + "word_count": 61, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "41603bb4-a56f-43bb-99f4-60c1a0a562f4", + "text": "Text-to-SQL Translation for Financial System Design. In Proceedings of the\narXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783 46th International Conference on Software Engineering: Software Engineering\n[8] Jonathan Fürst, Catherine Kosten, Farhad Nooralahzadeh, Yi Zhang, and Kurt in Practice. 252–262. 
Evaluating the Data Model Robustness of Text-to-SQL Systems [27] Chang-Yu Tai, Ziru Chen, Tianshu Zhang, Xiang Deng, and Huan Sun. 2023. Based on Real User Queries. In EDBT. 158–170. https://doi.org/10.48786/edbt. Exploring Chain of Thought Style Prompting for Text-to-SQL. In Proceedings\n2025.13 of the 2023 Conference on Empirical Methods in Natural Language Processing,\n[9] Chang Gao, Bowen Li, Wenxuan Zhang, Wai Lam, Binhua Li, Fei Huang, Luo Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational\nSi, and Yongbin Li. 2022. Towards Generalizable and Robust Text-to-SQL Linguistics, Singapore, 5376–5393. https://doi.org/10.18653/v1/2023.emnlpParsing. In Findings of the Association for Computational Linguistics: EMNLP main.327\n2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association [28] Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, and\nfor Computational Linguistics, Abu Dhabi, United Arab Emirates, 2113–2125. CHESS: Contextual Harnessing for Efficient SQL Synthesis.\nhttps://doi.org/10.18653/v1/2022.findings-emnlp.155 arXiv:2405.16755 [cs.LG] https://arxiv.org/abs/2405.16755\n[10] Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and [29] Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew\nJingren Zhou. 2024. Text-to-SQL Empowered by Large Language Models: A Richardson. 2020. RAT-SQL: Relation-Aware Schema Encoding and Linking\nBenchmark Evaluation. 
Proceedings of the VLDB Endowment 17, 5 (2024), 1132– for Text-to-SQL Parsers.", + "paper_id": "2603.10697", + "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution", + "authors": [ + "Tianshu Zhang", + "Kun Qian", + "Siddhartha Sahai", + "Yuan Tian", + "Shaddy Garg", + "Huan Sun", + "Yunyao Li" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10697v1", + "chunk_index": 33, + "total_chunks": 39, + "char_count": 1864, + "word_count": 233, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "134a5566-ffd2-45c2-8130-dd2efd9d350b", + "text": "In Proceedings of the 58th Annual Meeting of the\n1145. Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie\n[11] Andrea Hillenbrand and Uta Störl. 2021.", + "paper_id": "2603.10697", + "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution", + "authors": [ + "Tianshu Zhang", + "Kun Qian", + "Siddhartha Sahai", + "Yuan Tian", + "Shaddy Garg", + "Huan Sun", + "Yunyao Li" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10697v1", + "chunk_index": 34, + "total_chunks": 39, + "char_count": 176, + "word_count": 26, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fef706d5-6653-45d1-891d-1c53c1944e44", + "text": "Managing Schema Migration in Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics,\nNoSQL Databases: Advisor Heuristics vs. Self-adaptive Schema Migration Strate- Online, 7567–7578. https://doi.org/10.18653/v1/2020.acl-main.677\ngies. In International Conference on Model-Driven Engineering and Software [30] Chenglong Wang, Kedar Tatwawadi, Marc Brockschmidt, Po-Sen Huang, Yi Mao,\nDevelopment. Oleksandr Polozov, and Rishabh Singh. 
2018.", + "paper_id": "2603.10697", + "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution", + "authors": [ + "Tianshu Zhang", + "Kun Qian", + "Siddhartha Sahai", + "Yuan Tian", + "Shaddy Garg", + "Huan Sun", + "Yunyao Li" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10697v1", + "chunk_index": 35, + "total_chunks": 39, + "char_count": 463, + "word_count": 52, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0280d25e-7b2c-4ab0-8e79-3390728ed220", + "text": "Robust Text-to-SQL Generation\n[12] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, De- with Execution-Guided Decoding. arXiv:1807.03100 [cs.CL] https://arxiv.org/\nvendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, abs/1807.03100\nGuillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, [31] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement DePierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, langue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funand William El Sayed. 2023. Mistral 7B. arXiv:2310.06825 [cs.CL] https: towicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jer-\n//arxiv.org/abs/2310.06825 nite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame,\n[13] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Quentin Lhoest, and Alexander M. HuggingFace's TransformZhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas ers: State-of-the-art Natural Language Processing. arXiv:1910.03771 [cs.CL]\nPhillips, Irena Gao, et al. 2021. Wilds: A benchmark of in-the-wild distribution https://arxiv.org/abs/1910.03771\nshifts. In International conference on machine learning. PMLR, 5637–5664. [32] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li,\n[14] Kunal Kumar and S. 
Database normalization design pattern.", + "paper_id": "2603.10697", + "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution", + "authors": [ + "Tianshu Zhang", + "Kun Qian", + "Siddhartha Sahai", + "Yuan Tian", + "Shaddy Garg", + "Huan Sun", + "Yunyao Li" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10697v1", + "chunk_index": 36, + "total_chunks": 39, + "char_count": 1431, + "word_count": 185, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "19995af7-e0f1-456b-b389-1b28adf0785c", + "text": "James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir\nIn 2017 4th IEEE Uttar Pradesh Section International Conference on Electrical, Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and\nComputer and Electronics (UPCON). 318–322. https://doi.org/10.1109/UPCON. Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the\n2017.8251067 2018 Conference on Empirical Methods in Natural Language Processing, Ellen\n[15] Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, and Nan Tang. 2024. The Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii (Eds.). Association\nDawn of Natural Language to SQL: Are We Fully Ready? arXiv preprint for Computational Linguistics, Brussels, Belgium, 3911–3921. https://doi.org/10.\narXiv:2406.01265 (2024). 18653/v1/D18-1425\n[16] Guoliang Li, Xuanhe Zhou, and Xinyang Zhao. 2024. LLM for Data Management. [33] Bin Zhang, Yuxiao Ye, Guoqing Du, Xiaoru Hu, Zhishuai Li, Sun Yang, Chi Harold\nProceedings of the VLDB Endowment 17, 12 (2024), 4213–4216. Liu, Rui Zhao, Ziyue Li, and Hangyu Mao. 2024. Benchmarking the Text-\n[17] Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie to-SQL Capability of Large Language Models: A Comprehensive Evaluation. Wei, Hongyan Pan, Cuiping Li, and Hong Chen. 
2024.",
    "paper_id": "2603.10697",
    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
    "authors": [
      "Tianshu Zhang",
      "Kun Qian",
      "Siddhartha Sahai",
      "Yuan Tian",
      "Shaddy Garg",
      "Huan Sun",
      "Yunyao Li"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
    "chunk_index": 37,
    "total_chunks": 39,
    "char_count": 1314,
    "word_count": 187,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f6fc7b44-710a-4f76-b783-173d8960e19f",
    "text": "Codes: Towards building open-source language models for text-to-sql. Proceedings of the ACM on Management of Data 2, 3 (2024), 1–28.\narXiv:2403.02951 [cs.CL] https://arxiv.org/abs/2403.02951\n[34] Chao Zhang, Yuren Mao, Yijiang Fan, Yu Mi, Yunjun Gao, Lu Chen, Dongfang Lou, and Jinshu Lin. 2024.",
    "paper_id": "2603.10697",
    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
    "authors": [
      "Tianshu Zhang",
      "Kun Qian",
      "Siddhartha Sahai",
      "Yuan Tian",
      "Shaddy Garg",
      "Huan Sun",
      "Yunyao Li"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
    "chunk_index": 38,
    "total_chunks": 39,
    "char_count": 297,
    "word_count": 43,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "235adc62-626b-426e-8290-93b0193fce03",
    "text": "FinSQL: Model-Agnostic LLMs-based Text-to-SQL Framework for Financial Analysis. arXiv preprint arXiv:2401.10506 (2024).\n[18] Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. 2024. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems 36 (2024).\n[35] Hanchong Zhang, Ruisheng Cao, Lu Chen, Hongshen Xu, and Kai Yu. 2023. ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated 
In The 2023 Conference on Empirical Methods in Natural Language Processing. https://openreview.net/forum?id=oeZiXoCHgq\n[19] Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang, and Yuyu Luo. 2024. A Survey of NL2SQL with Large Language Models: Where are we, and where are we going? arXiv preprint arXiv:2408.05109 (2024).\n[36] Tianshu Zhang, Changchang Liu, Wei-Han Lee, Yu Su, and Huan Sun. 2023. Federated Learning for Semantic Parsing: Task Formulation, Evaluation Setup, New Algorithms. arXiv:2305.17221 [cs.CL] https://arxiv.org/abs/2305.17221\n[37] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. 2023. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv:2304.11277 [cs.DC] https://arxiv.org/abs/2304.11277\n[38] Alex Zhuang, Ge Zhang, Tianyu Zheng, Xinrun Du, Junjie Wang, Weiming Ren, Stephen W Huang, Jie Fu, Xiang Yue, and Wenhu Chen. 2024. StructLM: Towards Building Generalist Models for Structured Knowledge Grounding. 
arXiv preprint",
    "paper_id": "2603.10697",
    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
    "authors": [
      "Tianshu Zhang",
      "Kun Qian",
      "Siddhartha Sahai",
      "Yuan Tian",
      "Shaddy Garg",
      "Huan Sun",
      "Yunyao Li"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
    "chunk_index": 39,
    "total_chunks": 39,
    "char_count": 1779,
    "word_count": 248,
    "chunking_strategy": "semantic"
  }
]
\ No newline at end of file
diff --git a/data/chunks/2603.10700_semantic.json b/data/chunks/2603.10700_semantic.json
new file mode 100644
index 0000000000000000000000000000000000000000..703703309087f960bb6b24f8882d61c295f5cd68
--- /dev/null
+++ b/data/chunks/2603.10700_semantic.json
@@ -0,0 +1,922 @@
[
  {
    "chunk_id": "3d6db29e-5789-4d74-a323-1414f29b0f59",
    "text": "Structured Linked Data as a Memory Layer\nfor Agent-Orchestrated Retrieval Andrea Volpini, Elie Raad, Beatrice Gamba, and David Riccitelli WordLift, Rome, Italy\n{andrea, elie, beatrice, david}@wordlift.io",
    "paper_id": "2603.10700",
    "title": "Structured Linked Data as a Memory Layer for Agent-Orchestrated Retrieval",
    "authors": [
      "Andrea Volpini",
      "Elie Raad",
      "Beatrice Gamba",
      "David Riccitelli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10700v1",
    "chunk_index": 0,
    "total_chunks": 46,
    "char_count": 203,
    "word_count": 26,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "60559971-773e-4bd4-8e17-1eed086fe80c",
    "text": "Retrieval-Augmented Generation (RAG) systems typically\ntreat documents as flat text, ignoring the structured metadata and linked\nrelationships that knowledge graphs provide. 
In this paper, we investigate whether structured linked data—specifically Schema.org markup\nand dereferenceable entity pages served by a Linked Data Platform—can improve retrieval accuracy and answer quality in both standard and\nagentic RAG systems. We conduct a controlled experiment across four domains (editorial, legal,\ntravel, e-commerce) using Vertex AI Vector Search 2.0 for retrieval and\nthe Google Agent Development Kit (ADK) for agentic reasoning. Our\nexperimental design tests seven conditions: three document representations (plain HTML, HTML with JSON-LD, and an enhanced agentic-optimized entity page) crossed with two retrieval modes (standard RAG\nand agentic RAG with multi-hop link traversal), plus an Enhanced+\ncondition that adds rich navigational affordances and entity interlinking. Our results reveal that while JSON-LD markup alone provides only modest improvements (∆ = +0.17, p_adj = 0.024), our enhanced entity page\nformat—incorporating llms.txt-style agent instructions, breadcrumbs,\nand neural search capabilities—achieves substantial gains: +29.6% accuracy improvement for standard RAG (p < 10^−21, d = 0.60) and +29.8%\nfor the full agentic pipeline (p < 10^−21, d = 0.61). The Enhanced+ variant, with richer navigational affordances, achieves the highest absolute\nscores (accuracy: 4.85/5, completeness: 4.55/5) though the incremental gain over the base enhanced format is not statistically significant\n(d = 0.08). We release our dataset, evaluation framework, and enhanced\nentity page templates to support reproducibility. Keywords: Retrieval-Augmented Generation · Knowledge Graphs · Linked\nData · Structured Data · Schema.org · Agentic AI · Vector Search\n1 Introduction The rise of Generative AI has fundamentally changed how users access information online. 
Search engines increasingly augment traditional results with AI-generated summaries—a paradigm exemplified by Google's AI Mode, which retrieves, reasons over, and synthesizes information from multiple web sources. Understanding and optimizing for this new retrieval paradigm is critical for content creators, marketers, and organizations that depend on search visibility. Retrieval-Augmented Generation (RAG) has emerged as the dominant architecture for grounding large language model (LLM) outputs in factual, up-to-date\ninformation [18]. However, most RAG implementations treat documents as unstructured text, discarding the rich structured metadata that many websites\nalready provide via Schema.org markup and knowledge graph representations. In this paper, we ask: Can structured linked data improve RAG performance, and does agentic link traversal unlock further gains? Our work is motivated by three observations:",
    "paper_id": "2603.10700",
    "title": "Structured Linked Data as a Memory Layer for Agent-Orchestrated Retrieval",
    "authors": [
      "Andrea Volpini",
      "Elie Raad",
      "Beatrice Gamba",
      "David Riccitelli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10700v1",
    "chunk_index": 1,
    "total_chunks": 46,
    "char_count": 2914,
    "word_count": 391,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d009e3fc-b3ea-4afd-a899-e9f49a5a1895",
    "text": "1. Websites increasingly embed Schema.org JSON-LD structured data, yet RAG\nsystems rarely exploit this metadata.\n2. Linked Data Platforms serve entity pages that support content negotiation,\nenabling programmatic traversal of knowledge graphs.\n3. Agentic AI systems (those capable of planning, tool use, and multi-step reasoning) can follow links and aggregate information across entity boundaries—\nmimicking the behavior of AI-powered search engines. 
We make the following contributions: – A controlled experimental framework comparing seven conditions (3 document formats × 2 retrieval modes + an Enhanced+ variant) across four\nindustry verticals, with 2,443 individual query evaluations.\n– An enhanced entity page format designed to maximize both human readability and agentic discoverability, incorporating llms.txt-style instructions\nand neural search capabilities, and an Enhanced+ variant with richer navigational affordances.\n– Empirical evidence showing that enhanced entity pages yield the strongest\nimprovements: +29.6% accuracy in standard RAG (d = 0.60) and +29.8% in\nthe agentic pipeline (d = 0.61), while JSON-LD markup alone provides only\nmarginal improvements. The Enhanced+ variant achieves the highest absolute scores but offers no statistically significant gain over the base enhanced\nformat.\n– A reusable dataset and evaluation harness released for reproducibility.", + "paper_id": "2603.10700", + "title": "Structured Linked Data as a Memory Layer for Agent-Orchestrated Retrieval", + "authors": [ + "Andrea Volpini", + "Elie Raad", + "Beatrice Gamba", + "David Riccitelli" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10700v1", + "chunk_index": 2, + "total_chunks": 46, + "char_count": 1383, + "word_count": 189, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e05544e2-5816-4f12-b1ac-559cf4e159d5", + "text": "2.1 Generative Engine Optimization Aggarwal et al. [2] introduced Generative Engine Optimization (GEO), demonstrating that content optimization strategies such as adding citations, statistics,\nand authoritative language can boost visibility in generative search engines by\nup to 40%. Our work extends GEO from visibility optimization to retrieval accuracy, focusing on structured data as the optimization lever. 
2.2 Retrieval-Augmented Generation RAG was formalized by Lewis et al. [18], who combined a pre-trained sequence-to-sequence model with a dense retriever to ground generation in retrieved passages. Subsequent work explored pre-training with retrieval objectives [16] and\nscaling retrieval corpora to trillions of tokens [8]. More recently, Self-RAG [5]\nintroduced self-reflection mechanisms for adaptive retrieval, enabling models to\ndecide when and what to retrieve. Trivedi et al. [24] demonstrated that interleaving retrieval with chain-of-thought reasoning significantly improves multi-step\nquestion answering.",
    "paper_id": "2603.10700",
    "title": "Structured Linked Data as a Memory Layer for Agent-Orchestrated Retrieval",
    "authors": [
      "Andrea Volpini",
      "Elie Raad",
      "Beatrice Gamba",
      "David Riccitelli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10700v1",
    "chunk_index": 3,
    "total_chunks": 46,
    "char_count": 1082,
    "word_count": 141,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "88a09bdf-fa99-477d-843e-587223ca2b53",
    "text": "Despite these advances, existing RAG systems predominantly operate on\nunstructured text. Our work bridges this gap by demonstrating that structured\nmetadata—specifically Schema.org JSON-LD—provides a complementary signal\nthat improves retrieval quality. 2.3 Knowledge Graphs and Structured Data on the Web The vision of a machine-readable web was articulated by Berners-Lee et al. [6]\nand operationalized through Linked Data principles [7]. Schema.org, launched\nin 2011 as a collaboration among major search engines, provides a shared vocabulary for structured data on the web [13,1]. Today, over 40% of web pages\ninclude Schema.org markup [13]. Knowledge graphs have become central to both academic research and industry applications [17,20]. 
Early efforts to bring structured data to content\nmanagement systems include WordLift [27], which introduced semantic annotation and entity-based navigation for WordPress sites, and MICO [3], which\ndeveloped linked-data pipelines for multimedia content enrichment. Recent surveys examine the unification of LLMs and knowledge graphs [21], while Graph\nRAG approaches explicitly leverage graph structure during retrieval [22]. Several recent systems construct retrieval graphs from documents to improve\nRAG. LightRAG [14] builds a graph index from document-extracted entities\nand relations, using dual-level retrieval (low-level for specific facts, high-level for\ntopics) to outperform traditional RAG on multi-hop questions. HippoRAG [15]\nmodels retrieval after the hippocampal memory indexing theory, constructing a\nknowledge graph from passages and using personalized PageRank for contextsensitive retrieval. Both systems demonstrate the value of graph structure for\nretrieval, but differ from our approach in a fundamental way: they construct\npurpose-built graphs at indexing time from unstructured text, whereas we leverage existing structured data already published on the web via Schema.org and\nLinked Data Platforms. 
Our approach requires no graph construction step—the\nknowledge graph is the publisher's source of truth, maintained independently of\nthe retrieval system, and accessible through dereferenceable URIs that support\ncontent negotiation.", + "paper_id": "2603.10700", + "title": "Structured Linked Data as a Memory Layer for Agent-Orchestrated Retrieval", + "authors": [ + "Andrea Volpini", + "Elie Raad", + "Beatrice Gamba", + "David Riccitelli" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10700v1", + "chunk_index": 4, + "total_chunks": 46, + "char_count": 2199, + "word_count": 298, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1ad07753-0f14-4ed0-a71b-68cb421e4c2f", + "text": "2.4 Agentic AI and Tool-Augmented LLMs Agentic AI systems extend LLMs with the ability to plan, use tools, and reason\nover multiple steps. Yao et al. [29] introduced ReAct, interleaving reasoning\ntraces with action steps. Schick et al. [23] demonstrated that LLMs can learn to\nuse external tools autonomously. The Google Agent Development Kit (ADK) [10]\nprovides a production framework for building multi-tool agents. Multi-hop question answering [19]—where answering requires combining information from multiple sources—is a natural application for agentic systems. The Model Context Protocol (MCP) [4] provides a standardized interface for\nLLM–tool integration. Our agentic RAG configuration enables link traversal\nacross entity boundaries, effectively mimicking the behavior of AI-powered search\nsystems that follow links to aggregate information. 3.1 System Architecture and AI Mode Parallel Our experimental system mirrors the emerging architecture of AI-powered search\nengines such as Google's AI Mode, which retrieves web pages, reasons over their\nstructured content, and synthesizes multi-source answers. 
Our pipeline reproduces this pattern using Google Cloud components for retrieval and reasoning,\ncombined with an independent knowledge graph for structured data: – Vertex AI Vector Search 2.0 [11] serves as the retrieval backbone. Unlike\ntraditional vector databases, Vector Search 2.0 is designed as a self-tuning,\nfully managed, AI-native search engine. It combines dense semantic search\n(via text-embedding-005 embeddings) with sparse keyword matching in a\nsingle hybrid query, automatically tuning retrieval parameters. This mirrors\nhow AI Mode identifies candidate web pages from a large corpus.\n– Google Agent Development Kit (ADK) [10] powers the agentic reasoning layer, providing a ReAct-style loop [29] with tool-use capabilities.", + "paper_id": "2603.10700", + "title": "Structured Linked Data as a Memory Layer for Agent-Orchestrated Retrieval", + "authors": [ + "Andrea Volpini", + "Elie Raad", + "Beatrice Gamba", + "David Riccitelli" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10700v1", + "chunk_index": 5, + "total_chunks": 46, + "char_count": 1853, + "word_count": 259, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3cbe4d6a-cd9d-45e2-bb9a-d83ba59c4c8f", + "text": "Like\nAI Mode's multi-step reasoning, our agent can plan a sequence of actions—\nsearch, follow links, search the knowledge graph—before synthesizing a final\nanswer.\n– WordLift Knowledge Graph [27], an independent Linked Data Platform\n(not a Google Cloud service), acts as the structured data layer. It provides Schema.org-typed entities with dereferenceable URIs that support content negotiation (application/ld+json, text/turtle, text/html). This is\nanalogous to how AI Mode leverages structured data already present in web\npages to enhance understanding. The key insight is that structured linked data functions as an external memory layer for the agent. 
Rather than relying solely on the vector\nstore's flat text chunks, the agent can follow typed relationships (schema:about,\nschema:author, schema:relatedLink) to discover contextually relevant information that would be invisible to embedding-based retrieval alone. Table 1: Experimental conditions. ID Document Format Retrieval Mode Hypotheses\nC1 Plain HTML Standard RAG H1 baseline\nC2 HTML + JSON-LD Standard RAG H1 treatment\nC3 Enhanced entity Standard RAG H3 baseline\nC4 Plain HTML Agentic RAG H2 baseline\nC5 HTML + JSON-LD Agentic RAG H2 treatment\nC6 Enhanced entity Agentic RAG H2+H3 treatment\nC6+ Enhanced+ entity Agentic RAG H4 treatment\nWe design a 3 × 2 factorial experiment crossing three document representations\nwith two retrieval modes, yielding six core experimental conditions, plus an\nEnhanced+ variant (Table 1). Our four hypotheses are:",
    "paper_id": "2603.10700",
    "title": "Structured Linked Data as a Memory Layer for Agent-Orchestrated Retrieval",
    "authors": [
      "Andrea Volpini",
      "Elie Raad",
      "Beatrice Gamba",
      "David Riccitelli"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10700v1",
    "chunk_index": 6,
    "total_chunks": 46,
    "char_count": 1567,
    "word_count": 226,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "55e482ad-cd72-47d0-81ff-bb9747bd7a3e",
    "text": "– H1: Adding Schema.org JSON-LD to HTML documents improves RAG accuracy and completeness (C2 vs. C1).\n– H2: Agentic RAG with link traversal outperforms standard RAG on the\nsame document format (C5 vs. C2).\n– H3: Enhanced entity pages, designed for agentic discoverability, yield the\nhighest overall performance (C6 vs. all other conditions).\n– H4: Enhanced+ entity pages—with richer navigational affordances and entity interlinking—further improve performance over the base enhanced format (C6+ vs. C6). 3.3 Document Representations Plain HTML (Baseline). 
Raw webpage content with all