Get trending papers in your email inbox once a day!
Get trending papers in your email inbox!
SubscribeLocal and adaptive mirror descents in extensive-form games
We study how to learn ε-optimal strategies in zero-sum imperfect information games (IIG) with trajectory feedback. In this setting, players update their policies sequentially based on their observations over a fixed number of episodes, denoted by T. Existing procedures suffer from high variance due to the use of importance sampling over sequences of actions (Steinberger et al., 2020; McAleer et al., 2022). To reduce this variance, we consider a fixed sampling approach, where players still update their policies over time, but with observations obtained through a given fixed sampling policy. Our approach is based on an adaptive Online Mirror Descent (OMD) algorithm that applies OMD locally to each information set, using individually decreasing learning rates and a regularized loss. We show that this approach guarantees a convergence rate of mathcal{O}(T^{-1/2}) with high probability and has a near-optimal dependence on the game parameters when applied with the best theoretical choices of learning rates and sampling policies. To achieve these results, we generalize the notion of OMD stabilization, allowing for time-varying regularization with convex increments.
GOPO: Policy Optimization using Ranked Rewards
Standard reinforcement learning from human feedback (RLHF) trains a reward model on pairwise preference data and then uses it for policy optimization. However, while reward models are optimized to capture relative preferences, existing policy optimization techniques rely on absolute reward magnitudes during training. In settings where the rewards are non-verifiable such as summarization, instruction following, and chat completion, this misalignment often leads to suboptimal performance. We introduce Group Ordinal Policy Optimization (GOPO), a policy optimization method that uses only the ranking of the rewards and discards their magnitudes. Our rank-based transformation of rewards provides several gains, compared to Group Relative Policy Optimization (GRPO), in settings with non-verifiable rewards: (1) consistently higher training/validation reward trajectories, (2) improved LLM-as-judge evaluations across most intermediate training steps, and (3) reaching a policy of comparable quality in substantially less training steps than GRPO. We demonstrate consistent improvements across a range of tasks and model sizes.
The Power of Learned Locally Linear Models for Nonlinear Policy Optimization
A common pipeline in learning-based control is to iteratively estimate a model of system dynamics, and apply a trajectory optimization algorithm - e.g.~iLQR - on the learned model to minimize a target cost. This paper conducts a rigorous analysis of a simplified variant of this strategy for general nonlinear systems. We analyze an algorithm which iterates between estimating local linear models of nonlinear system dynamics and performing iLQR-like policy updates. We demonstrate that this algorithm attains sample complexity polynomial in relevant problem parameters, and, by synthesizing locally stabilizing gains, overcomes exponential dependence in problem horizon. Experimental results validate the performance of our algorithm, and compare to natural deep-learning baselines.
Actor-Critic based Improper Reinforcement Learning
We consider an improper reinforcement learning setting where a learner is given M base controllers for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful in tuning across controllers, learnt possibly in mismatched or simulated environments, to obtain a good controller for a given target environment with relatively few trials. Towards this, we propose two algorithms: (1) a Policy Gradient-based approach; and (2) an algorithm that can switch between a simple Actor-Critic (AC) based scheme and a Natural Actor-Critic (NAC) scheme depending on the available information. Both algorithms operate over a class of improper mixtures of the given controllers. For the first case, we derive convergence rate guarantees assuming access to a gradient oracle. For the AC-based approach we provide convergence rate guarantees to a stationary point in the basic AC case and to a global optimum in the NAC case. Numerical results on (i) the standard control theoretic benchmark of stabilizing an cartpole; and (ii) a constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when the base policies at its disposal are unstable.
Identifying and bounding the probability of necessity for causes of effects with ordinal outcomes
Although the existing causal inference literature focuses on the forward-looking perspective by estimating effects of causes, the backward-looking perspective can provide insights into causes of effects. In backward-looking causal inference, the probability of necessity measures the probability that a certain event is caused by the treatment given the observed treatment and outcome. Most existing results focus on binary outcomes. Motivated by applications with ordinal outcomes, we propose a general definition of the probability of necessity. However, identifying the probability of necessity is challenging because it involves the joint distribution of the potential outcomes. We propose a novel assumption of monotonic incremental treatment effect to identify the probability of necessity with ordinal outcomes. We also discuss the testable implications of this key identification assumption. When it fails, we derive explicit formulas of the sharp large-sample bounds on the probability of necessity.
Banker Online Mirror Descent: A Universal Approach for Delayed Online Bandit Learning
We propose Banker Online Mirror Descent (Banker-OMD), a novel framework generalizing the classical Online Mirror Descent (OMD) technique in the online learning literature. The Banker-OMD framework almost completely decouples feedback delay handling and the task-specific OMD algorithm design, thus facilitating the design of new algorithms capable of efficiently and robustly handling feedback delays. Specifically, it offers a general methodology for achieving mathcal O(T + D)-style regret bounds in online bandit learning tasks with delayed feedback, where T is the number of rounds and D is the total feedback delay. We demonstrate the power of Banker-OMD by applications to two important bandit learning scenarios with delayed feedback, including delayed scale-free adversarial Multi-Armed Bandits (MAB) and delayed adversarial linear bandits. Banker-OMD leads to the first delayed scale-free adversarial MAB algorithm achieving mathcal O(KL(sqrt T+sqrt D)) regret and the first delayed adversarial linear bandit algorithm achieving mathcal O(poly(n)(T + D)) regret. As a corollary, the first application also implies mathcal O(KTL) regret for non-delayed scale-free adversarial MABs, which is the first to match the Omega(KTL) lower bound up to logarithmic factors and can be of independent interest.
Revisiting the Asymptotic Optimality of RRT^*
RRT* is one of the most widely used sampling-based algorithms for asymptotically-optimal motion planning. This algorithm laid the foundations for optimality in motion planning as a whole, and inspired the development of numerous new algorithms in the field, many of which build upon RRT* itself. In this paper, we first identify a logical gap in the optimality proof of RRT*, which was developed in Karaman and Frazzoli (2011). Then, we present an alternative and mathematically-rigorous proof for asymptotic optimality. Our proof suggests that the connection radius used by RRT* should be increased from γleft(log n{n}right)^{1/d} to γ' left(log n{n}right)^{1/(d+1)} in order to account for the additional dimension of time that dictates the samples' ordering. Here γ, γ', are constants, and n, d, are the number of samples and the dimension of the problem, respectively.
Learning Distributed Stabilizing Controllers for Multi-Agent Systems
We address the problem of model-free distributed stabilization of heterogeneous multi-agent systems using reinforcement learning (RL). Two algorithms are developed. The first algorithm solves a centralized linear quadratic regulator (LQR) problem without knowing any initial stabilizing gain in advance. The second algorithm builds upon the results of the first algorithm, and extends it to distributed stabilization of multi-agent systems with predefined interaction graphs. Rigorous proofs are provided to show that the proposed algorithms achieve guaranteed convergence if specific conditions hold. A simulation example is presented to demonstrate the theoretical results.
Relative Oscillation Theory for Jacobi Matrices Extended
We present a comprehensive treatment of relative oscillation theory for finite Jacobi matrices. We show that the difference of the number of eigenvalues of two Jacobi matrices in an interval equals the number of weighted sign-changes of the Wronskian of suitable solutions of the two underlying difference equations. Until now only the case of perturbations of the main diagonal was known. We extend the known results to arbitrary perturbations, allow any (half-)open and closed spectral intervals, simplify the proof, and establish the comparison theorem.
O(n)-invariant Riemannian metrics on SPD matrices
Symmetric Positive Definite (SPD) matrices are ubiquitous in data analysis under the form of covariance matrices or correlation matrices. Several O(n)-invariant Riemannian metrics were defined on the SPD cone, in particular the kernel metrics introduced by Hiai and Petz. The class of kernel metrics interpolates between many classical O(n)-invariant metrics and it satisfies key results of stability and completeness. However, it does not contain all the classical O(n)-invariant metrics. Therefore in this work, we investigate super-classes of kernel metrics and we study which key results remain true. We also introduce an additional key result called cometric-stability, a crucial property to implement geodesics with a Hamiltonian formulation. Our method to build intermediate embedded classes between O(n)-invariant metrics and kernel metrics is to give a characterization of the whole class of O(n)-invariant metrics on SPD matrices and to specify requirements on metrics one by one until we reach kernel metrics. As a secondary contribution, we synthesize the literature on the main O(n)-invariant metrics, we provide the complete formula of the sectional curvature of the affine-invariant metric and the formula of the geodesic parallel transport between commuting matrices for the Bures-Wasserstein metric.
Free dilations of families of C_{0}-semigroups and applications to evolution families
Commuting families of contractions or contractive C_{0}-semigroups on Hilbert spaces often fail to admit power dilations resp, simultaneous unitary dilations which are themselves commutative (see [45, 13, 15]). In the non-commutative setting, Sz.-Nagy [60] and Bożejko [5] provided means to dilate arbitrary families of contractions. The present work extends these discrete-time results to families {T_{i}}_{i in I} of contractive C_{0}-semigroups. We refer to these dilations as continuous-time free unitary dilations and present three distinct approaches to obtain them: 1) An explicit derivation applicable to semigroups that arise as interpolations; 2) A full proof with an explicit construction, via the theory of co-generators à la Słociński [54, 55]; and 3) A second full proof based on the abstract structure of semigroups, which admits a natural reformulation to semigroups defined over topological free products of R_{geq 0} and leads to various residuality results. In 2) a IInd free dilation theorem for topologised index sets is developed via a reformulation of the Trotter--Kato theorem for co-generators. As an application of this we demonstrate how evolution families can be reduced to continuously monitored processes subject to temporal change, à la the quantum Zeno effect [22, 23, 24, 30, 37].
Tameness of actions on finite rank median algebras
We prove that for (compact) finite-rank median algebras the geometric rank equals the independence number of all (continuous) median-preserving functions to [0,1]. Combined with Rosenthal's dichotomy, this yields a generalized Helly selection principle: for finite-rank median algebras, the space of all median-preserving functions to [0,1] is sequentially compact in the pointwise topology. Generalizing joint results with E. Glasner on dendrons (rank-1), we establish that every continuous action of a topological group G by median automorphisms on a finite-rank compact median algebra is Rosenthal representable, hence dynamically tame. As an application, the Roller-Fioravanti compactification of finite-rank topological median G-algebras with compact intervals is often a dynamically tame G-system.
Completeness of Randomized Kinodynamic Planners with State-based Steering
Probabilistic completeness is an important property in motion planning. Although it has been established with clear assumptions for geometric planners, the panorama of completeness results for kinodynamic planners is still incomplete, as most existing proofs rely on strong assumptions that are difficult, if not impossible, to verify on practical systems. In this paper, we focus on an important class of kinodynamic planners, namely those that interpolate trajectories in the state space. We provide a proof of probabilistic completeness for these planners under assumptions that can be readily verified from the system's equations of motion and the user-defined interpolation function. Our proof relies crucially on a property of interpolated trajectories, termed second-order continuity (SOC), which we show is tightly related to the ability of a planner to benefit from denser sampling. We analyze the impact of this property in simulations on a low-torque pendulum. Our results show that a simple RRT using a second-order continuous interpolation swiftly finds solution, while it is impossible for the same planner using standard Bezier curves (which are not SOC) to find any solution.
Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees
We consider reinforcement learning in an environment modeled by an episodic, finite, stage-dependent Markov decision process of horizon H with S states, and A actions. The performance of an agent is measured by the regret after interacting with the environment for T episodes. We propose an optimistic posterior sampling algorithm for reinforcement learning (OPSRL), a simple variant of posterior sampling that only needs a number of posterior samples logarithmic in H, S, A, and T per state-action pair. For OPSRL we guarantee a high-probability regret bound of order at most mathcal{O}(H^3SAT) ignoring polylog(HSAT) terms. The key novel technical ingredient is a new sharp anti-concentration inequality for linear forms which may be of independent interest. Specifically, we extend the normal approximation-based lower bound for Beta distributions by Alfers and Dinges [1984] to Dirichlet distributions. Our bound matches the lower bound of order Ω(H^3SAT), thereby answering the open problems raised by Agrawal and Jia [2017b] for the episodic setting.
Approximating the Convex Hull via Metric Space Magnitude
Magnitude of a finite metric space and the related notion of magnitude functions on metric spaces is an active area of research in algebraic topology. Magnitude originally arose in the context of biology, where it represents the number of effective species in an environment; when applied to a one-parameter family of metric spaces tX with scale parameter t, the magnitude captures much of the underlying geometry of the space. Prior work has mostly focussed on properties of magnitude in a global sense; in this paper we restrict the sets to finite subsets of Euclidean space and investigate its individual components. We give an explicit formula for the corrected inclusion-exclusion principle, and define a quantity associated with each point, called the moment which gives an intrinsic ordering to the points. We exploit this in order to form an algorithm which approximates the convex hull.
Extensions of Schoen--Simon--Yau and Schoen--Simon theorems via iteration à la De Giorgi
We give an alternative proof of the Schoen--Simon--Yau curvature estimates and associated Bernstein-type theorems (1975), and extend the original result by including the case of 6-dimensional (stable minimal) immersions. The key step is an ε-regularity theorem, that assumes smallness of the scale-invariant L^2 norm of the second fundamental form. Further, we obtain a graph description, in the Lipschitz multi-valued sense, for any stable minimal immersion of dimension ngeq 2, that may have a singular set Σ of locally finite H^{n-2}-measure, and that is weakly close to a hyperplane. (In fact, if H^{n-2}(Σ)=0, the conclusion is strengthened to a union of smooth graphs.) This follows directly from an ε-regularity theorem, that assumes smallness of the scale-invariant L^2 tilt-excess (verified when the hypersurface is weakly close to a hyperplane). Specialising the multi-valued decomposition to the case of embeddings, we recover the Schoen--Simon theorem (1981). In both ε-regularity theorems the relevant quantity (respectively, length of the second fundamental form and tilt function) solves a non-linear PDE on the immersed minimal hypersurface. The proof is carried out intrinsically (without linearising the PDE) by implementing an iteration method à la De Giorgi (from the linear De Giorgi--Nash--Moser theory). Stability implies estimates (intrinsic weak Caccioppoli inequalities) that make the iteration effective despite the non-linear framework. (In both ε-regularity theorems the method gives explicit constants that quantify the required smallness.)
On the Orthogonal Projections
For any {rm E}-rigid presentation e, we construct an orthogonal projection functor to {rm rep}(e^perp) left adjoint to the natural embedding. We establish a bijection between presentations in {rm rep}(e^perp) and presentations compatible with e. For quivers with potentials, we show that {rm rep}(e^perp) forms a module category of another quiver with potential. We derive mutation formulas for the delta-vectors of positive and negative complements and the dimension vectors of simple modules in {rm rep}(e^perp), enabling an algorithm to find the projected quiver with potential. Additionally, we introduce a modified projection for quivers with potentials that preserves general presentations. For applications to cluster algebras, we establish a connection to the stabilization functors.
Verifying Good Regulator Conditions for Hypergraph Observers: Natural Gradient Learning from Causal Invariance via Established Theorems
We verify that persistent observers in causally invariant hypergraph substrates satisfy the conditions of the Conant-Ashby Good Regulator Theorem. Building on Wolfram's hypergraph physics and Vanchurin's neural network cosmology, we formalize persistent observers as entities that minimize prediction error at their boundary with the environment. Applying a modern reformulation of the Conant-Ashby theorem, we demonstrate that hypergraph observers satisfy Good Regulator conditions, requiring them to maintain internal models. Once an internal model with loss function exists, the emergence of a Fisher information metric follows from standard information geometry. Invoking Amari's uniqueness theorem for reparameterization-invariant gradients, we show that natural gradient descent is the unique admissible learning rule. Under the ansatz M=F^2 for exponential family observers and one specific convergence time functional, we derive a closed-form formula for the regime parameter alpha in Vanchurin's Type II framework, with a quantum-classical threshold at kappa(F)=2. However, three alternative convergence models do not reproduce this result, so this prediction is strongly model-dependent. We further introduce the directional regime parameter alpha_{v_k} and the trace-free deviation tensor, showing that a single observer can simultaneously occupy different Vanchurin regimes along different eigendirections of the Fisher metric. This connects Wolfram and Vanchurin frameworks through established theorems, providing approximately 25-30% novel contribution.
Dilations of non-Markovian dynamical systems on graphs
To generalise evolution families we consider systems of contractions {varphi(u, v)}_{(u, v) in E} defined on the edges of a graph G = (Ω, E). In this setup the Markov property, or divisibility, can be modelled via varphi(u, v)varphi(v, w) = varphi(u, w) for edges (u, v), (v, w), (u, w) in E. We obtain results in three settings: 1) contractive Banach space operators; 2) positive unital maps on C^{ast}-algebras; and 3) CPTP-maps on trace class operators on a Hilbert space. In the discrete setting, we are able to dilate possibly indivisible families of contractions to divisible families of operators with 'nice' properties (viz. surjective isometries resp. C^{ast}-algebraic automorphisms resp. unitary representations). In the special case of linearly ordered graphs equipped with the order topology, we establish sufficient conditions for strongly continuous dilations of possibly indivisible families in the Banach space and C^{ast}-algebra contexts. To achieve these results we work with string-rewriting systems, and make use of and extend dilation theorems of Stroescu [44], Kraus [23, 24], and vom Ende--Dirr [50].
Extensions of Erdős's 1962 theorem on non-Hamiltonian graphs
For a positive integer k, a graph property H, and a graph parameter P, let ex_{P}(n, H; δgeq k) denote the maximum value of P over all n-vertex graphs with minimum degree at least k that do not possess the property H. The corresponding extremal families are denoted by EX_{P}(n, H; δgeq k). For two disjoint graphs H_1 and H_2, let H_1 cup H_2 denote their (disjoint) union, i.e., the graph with vertex set V(H_1) cup V(H_2) and edge set E(H_1) cup E(H_2); and let H_1 vee H_2 denote their join. In 1962, Erdős established a classical theorem on the maximum number of edges in a non-Hamiltonian graph of given order and minimum degree. Motivated by recent work on feasible graph parameters in Ai2023, we prove several extensions of Erdős's 1962 theorem on non-Hamiltonian graphs. The first result gives a common generalization of the extremal theorem due to Erdős and its spectral analogs. As direct applications, we obtain complete solutions to open problems raised in the literature since 2016, thereby improving nearly all related prior results in this direction. Our proof technique differs somewhat from those in MR3539577,MR3556876. We also prove an analog theorem for the Hamiltonian-connected property and obtain a result which extends the theorem of Füredi, Kostochka, and Luo MR3843180 on Hamilton cycles.
Interpolation and non-dilatable families of C_{0}-semigroups
We generalise a technique of Bhat and Skeide (2015) to interpolate commuting families {S_{i}}_{i in I} of contractions on a Hilbert space H, to commuting families {T_{i}}_{i in I} of contractive C_{0}-semigroups on L^{2}(prod_{i in I}T) otimes H. As an excursus, we provide applications of the interpolations to time-discretisation and the embedding problem. Applied to Parrott's construction (1970), we then demonstrate for d in N with d geq 3 the existence of commuting families {T_{i}}_{i=1}^{d} of contractive C_{0}-semigroups which admit no simultaneous unitary dilation. As an application of these counter-examples, we obtain the residuality wrt. the topology of uniform wot-convergence on compact subsets of R_{geq 0}^{d} of non-unitarily dilatable and non-unitarily approximable d-parameter contractive C_{0}-semigroups on separable infinite-dimensional Hilbert spaces for each d geq 3. Similar results are also developed for d-tuples of commuting contractions. And by building on the counter-examples of Varopoulos--Kaijser (1973--74), a 0--1-result is obtained for the von Neumann inequality. Finally, we discuss applications to rigidity as well as the embedding problem, viz. that `typical' pairs of commuting operators can be simultaneously embedded into commuting pairs of C_{0}-semigroups, which extends results of Eisner (2009--10).
Learning Control-Oriented Dynamical Structure from Data
Even for known nonlinear dynamical systems, feedback controller synthesis is a difficult problem that often requires leveraging the particular structure of the dynamics to induce a stable closed-loop system. For general nonlinear models, including those fit to data, there may not be enough known structure to reliably synthesize a stabilizing feedback controller. In this paper, we discuss a state-dependent nonlinear tracking controller formulation based on a state-dependent Riccati equation for general nonlinear control-affine systems. This formulation depends on a nonlinear factorization of the system of vector fields defining the control-affine dynamics, which always exists under mild smoothness assumptions. We propose a method for learning this factorization from a finite set of data. On a variety of simulated nonlinear dynamical systems, we empirically demonstrate the efficacy of learned versions of this controller in stable trajectory tracking. Alongside our learning method, we evaluate recent ideas in jointly learning a controller and stabilizability certificate for known dynamical systems; we show experimentally that such methods can be frail in comparison.
Dueling RL: Reinforcement Learning with Trajectory Preferences
We consider the problem of preference based reinforcement learning (PbRL), where, unlike traditional reinforcement learning, an agent receives feedback only in terms of a 1 bit (0/1) preference over a trajectory pair instead of absolute rewards for them. The success of the traditional RL framework crucially relies on the underlying agent-reward model, which, however, depends on how accurately a system designer can express an appropriate reward function and often a non-trivial task. The main novelty of our framework is the ability to learn from preference-based trajectory feedback that eliminates the need to hand-craft numeric reward models. This paper sets up a formal framework for the PbRL problem with non-markovian rewards, where the trajectory preferences are encoded by a generalized linear model of dimension d. Assuming the transition model is known, we then propose an algorithm with almost optimal regret guarantee of mathcal{O}left( SH d log (T / delta) T right). We further, extend the above algorithm to the case of unknown transition dynamics, and provide an algorithm with near optimal regret guarantee mathcal{O}((d + H^2 + |S|)dT +|mathcal{S||A|TH} ). To the best of our knowledge, our work is one of the first to give tight regret guarantees for preference based RL problems with trajectory preferences.
Abstract independence relations in neostability theory
We develop a framework, in the style of Adler, for interpreting the notion of "witnessing" that has appeared (usually as a variant of Kim's Lemma) in different areas of neostability theory as a binary relation between abstract independence relations. This involves extending the relativisations of Kim-independence and Conant-independence due to Mutchnik to arbitrary independence relations. After developing this framework, we show that several results from simplicity, NTP_2, NSOP_1, and beyond follow as instances of general theorems for abstract independence relations. In particular, we prove the equivalence between witnessing and symmetry and the implications from this notion to chain local character and the weak independence theorem, and recover some partial converses. Finally, we use this framework to prove a dichotomy between NSOP_1 and Kruckman and Ramsey's BTP that applies to most known NSOP_4 examples in the literature.
OmniQuality-R: Advancing Reward Models Through All-Encompassing Quality Assessment
Current visual evaluation approaches are typically constrained to a single task. To address this, we propose OmniQuality-R, a unified reward modeling framework that transforms multi-task quality reasoning into continuous and interpretable reward signals for policy optimization. Inspired by subjective experiments, where participants are given task-specific instructions outlining distinct assessment principles prior to evaluation, we propose OmniQuality-R, a structured reward modeling framework that transforms multi-dimensional reasoning into continuous and interpretable reward signals. To enable this, we construct a reasoning-enhanced reward modeling dataset by sampling informative plan-reason trajectories via rejection sampling, forming a reliable chain-of-thought (CoT) dataset for supervised fine-tuning (SFT). Building on this, we apply Group Relative Policy Optimization (GRPO) for post-training, using a Gaussian-based reward to support continuous score prediction. To further stabilize the training and improve downstream generalization, we incorporate standard deviation (STD) filtering and entropy gating mechanisms during reinforcement learning. These techniques suppress unstable updates and reduce variance in policy optimization. We evaluate OmniQuality-R on three key IQA tasks: aesthetic quality assessment, technical quality evaluation, and text-image alignment.
Random Rank: The One and Only Strategyproof and Proportionally Fair Randomized Facility Location Mechanism
Proportionality is an attractive fairness concept that has been applied to a range of problems including the facility location problem, a classic problem in social choice. In our work, we propose a concept called Strong Proportionality, which ensures that when there are two groups of agents at different locations, both groups incur the same total cost. We show that although Strong Proportionality is a well-motivated and basic axiom, there is no deterministic strategyproof mechanism satisfying the property. We then identify a randomized mechanism called Random Rank (which uniformly selects a number k between 1 to n and locates the facility at the k'th highest agent location) which satisfies Strong Proportionality in expectation. Our main theorem characterizes Random Rank as the unique mechanism that achieves universal truthfulness, universal anonymity, and Strong Proportionality in expectation among all randomized mechanisms. Finally, we show via the AverageOrRandomRank mechanism that even stronger ex-post fairness guarantees can be achieved by weakening universal truthfulness to strategyproofness in expectation.
Lipschitz Stability for an Inverse Problem of Biharmonic Wave Equations with Damping
This paper establishes Lipschitz stability for the simultaneous recovery of a variable density coefficient and the initial displacement in a damped biharmonic wave equation. The data consist of the boundary Cauchy data for the Laplacian of the solution, \(Δu |_{\partial Ω}\) and \( \partial_{n}(Δu)|_{\partial Ω}.\) We first prove that the associated system operator generates a contraction semigroup, which ensures the well-posedness of the forward problem. A key observability inequality is then derived via multiplier techniques. Building on this foundation, explicit stability estimates for the inverse problem are obtained. These estimates demonstrate that the biharmonic structure inherently enhances the stability of parameter identification, with the stability constants exhibiting an explicit dependence on the damping coefficient via the factor \( (1 + γ)^{1/2} \). This work provides a rigorous theoretical basis for applications in non-destructive testing and dynamic inversion.
Proportional Fairness in Obnoxious Facility Location
We consider the obnoxious facility location problem (in which agents prefer the facility location to be far from them) and propose a hierarchy of distance-based proportional fairness concepts for the problem. These fairness axioms ensure that groups of agents at the same location are guaranteed to be a distance from the facility proportional to their group size. We consider deterministic and randomized mechanisms, and compute tight bounds on the price of proportional fairness. In the deterministic setting, we show that our proportional fairness axioms are incompatible with strategyproofness, and prove asymptotically tight epsilon-price of anarchy and stability bounds for proportionally fair welfare-optimal mechanisms. In the randomized setting, we identify proportionally fair and strategyproof mechanisms that give an expected welfare within a constant factor of the optimal welfare. Finally, we prove existence results for two extensions to our model.
Deviation Dynamics in Cardinal Hedonic Games
Computing stable partitions in hedonic games is a challenging task because there exist games in which stable outcomes do not exist. Even more, these No-instances can often be leveraged to prove computational hardness results. We make this impression rigorous in a dynamic model of cardinal hedonic games by providing meta theorems. These imply hardness of deciding about the possible or necessary convergence of deviation dynamics based on the mere existence of No-instances. Our results hold for additively separable, fractional, and modified fractional hedonic games (ASHGs, FHGs, and MFHGs). Moreover, they encompass essentially all reasonable stability notions based on single-agent deviations. In addition, we propose dynamics as a method to find individually rational and contractually individual stable (CIS) partitions in ASHGs. In particular, we find that CIS dynamics from the singleton partition possibly converge after a linear number of deviations but may require an exponential number of deviations in the worst case.
When Your AI Deceives You: Challenges with Partial Observability of Human Evaluators in Reward Learning
Past analyses of reinforcement learning from human feedback (RLHF) assume that the human fully observes the environment. What happens when human feedback is based only on partial observations? We formally define two failure cases: deception and overjustification. Modeling the human as Boltzmann-rational w.r.t. a belief over trajectories, we prove conditions under which RLHF is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both. To help address these issues, we mathematically characterize how partial observability of the environment translates into (lack of) ambiguity in the learned return function. In some cases, accounting for partial observability makes it theoretically possible to recover the return function and thus the optimal policy, while in other cases, there is irreducible ambiguity. We caution against blindly applying RLHF in partially observable settings and propose research directions to help tackle these challenges.
Online learning with noisy side observations
We propose a new partial-observability model for online learning problems where the learner, besides its own loss, also observes some noisy feedback about the other actions, depending on the underlying structure of the problem. We represent this structure by a weighted directed graph, where the edge weights are related to the quality of the feedback shared by the connected nodes. Our main contribution is an efficient algorithm that guarantees a regret of O(α^* T) after T rounds, where α^* is a novel graph property that we call the effective independence number. Our algorithm is completely parameter-free and does not require knowledge (or even estimation) of α^*. For the special case of binary edge weights, our setting reduces to the partial-observability models of Mannor and Shamir (2011) and Alon et al. (2013) and our algorithm recovers the near-optimal regret bounds.
Derivative-Free & Order-Robust Optimisation
In this paper, we formalise order-robust optimisation as an instance of online learning minimising simple regret, and propose Vroom, a zero'th order optimisation algorithm capable of achieving vanishing regret in non-stationary environments, while recovering favorable rates under stochastic reward-generating processes. Our results are the first to target simple regret definitions in adversarial scenarios unveiling a challenge that has been rarely considered in prior work.
From open learners to open games
The categories of open learners (due to Fong, Spivak and Tuy\'eras) and open games (due to the present author, Ghani, Winschel and Zahn) bear a very striking and unexpected similarity. The purpose of this short note is to prove that there is a faithful symmetric monoidal functor from the former to the latter, which means that any supervised neural network (without feedback or other complicating features) can be seen as an open game in a canonical way. Roughly, each parameter is controlled by a different player, and the game's best response relation encodes the dynamics of gradient descent. We suggest paths for further work exploiting the link.
Reinforcement Learning with General Utilities: Simpler Variance Reduction and Large State-Action Space
We consider the reinforcement learning (RL) problem with general utilities which consists in maximizing a function of the state-action occupancy measure. Beyond the standard cumulative reward RL setting, this problem includes as particular cases constrained RL, pure exploration and learning from demonstrations among others. For this problem, we propose a simpler single-loop parameter-free normalized policy gradient algorithm. Implementing a recursive momentum variance reduction mechanism, our algorithm achieves mathcal{O}(epsilon^{-3}) and mathcal{O}(epsilon^{-2}) sample complexities for epsilon-first-order stationarity and epsilon-global optimality respectively, under adequate assumptions. We further address the setting of large finite state action spaces via linear function approximation of the occupancy measure and show a mathcal{O}(epsilon^{-4}) sample complexity for a simple policy gradient method with a linear regression subroutine.
Predictor-Feedback CACC for Vehicular Platoons with Actuation and Communication Delays Based on a Multiple-Predecessor-Following CTH Nominal Strategy
We develop a predictor-feedback cooperative adaptive cruise control (CACC) design relying on a multiple-predecessor-following (MPF) topology-based nominal delay-free CACC law. We consider vehicular platoons with heterogeneous vehicles, whose dynamics are described by a third-order linear system subject to actuation delay, along with vehicle-to-vehicle (V2V) communication delay. The design achieves individual vehicle stability, string stability, and zero, steady-state speed/spacing tracking errors, for any value of the actuation delay. The proofs of individual vehicle stability, string stability, and regulation rely on employment of an input-output approach on the frequency domain, capitalizing on the delay-compensating property of the design, which enables as to derive explicit string stability conditions on control and vehicle models parameters. The theoretical guarantees of string stability and the respective conditions on parameters are illustrated also numerically. We present consistent simulation results, for a ten-vehicle platoon, illustrating the potential of the design in traffic throughput improvement, as compared with a predictor-feedback CACC design in which, each ego vehicle's controller utilizes information only from a single preceding vehicle. We also present simulation results in a realistic scenario in which the leading vehicle's trajectory is obtained from NGSIM data.
Low-dimensional observer design for stable linear systems by model reduction
This paper presents a low-dimensional observer design for stable, single-input single-output, continuous-time linear time-invariant (LTI) systems. Leveraging the model reduction by moment matching technique, we approximate the system with a reduced-order model. Based on this reduced-order model, we design a low-dimensional observer that estimates the states of the original system. We show that this observer establishes exact asymptotic state reconstruction for a given class of inputs tied to the observer's dimension. Furthermore, we establish an exponential input-to-state stability property for generic inputs, ensuring a bounded estimation error. Numerical simulations confirm the effectiveness of the approach for a benchmark model reduction problem.
Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback
We study online reinforcement learning in linear Markov decision processes with adversarial losses and bandit feedback, without prior knowledge on transitions or access to simulators. We introduce two algorithms that achieve improved regret performance compared to existing approaches. The first algorithm, although computationally inefficient, ensures a regret of mathcal{O}left(Kright), where K is the number of episodes. This is the first result with the optimal K dependence in the considered setting. The second algorithm, which is based on the policy optimization framework, guarantees a regret of mathcal{O}left(K^{3{4}} right) and is computationally efficient. Both our results significantly improve over the state-of-the-art: a computationally inefficient algorithm by Kong et al. [2023] with mathcal{O}left(K^{4{5}}+polyleft(1{lambda_{min}}right) right) regret, for some problem-dependent constant lambda_{min} that can be arbitrarily close to zero, and a computationally efficient algorithm by Sherman et al. [2023b] with mathcal{O}left(K^{6{7}} right) regret.
NeoRL: Efficient Exploration for Nonepisodic RL
We study the problem of nonepisodic reinforcement learning (RL) for nonlinear dynamical systems, where the system dynamics are unknown and the RL agent has to learn from a single trajectory, i.e., without resets. We propose Nonepisodic Optimistic RL (NeoRL), an approach based on the principle of optimism in the face of uncertainty. NeoRL uses well-calibrated probabilistic models and plans optimistically w.r.t. the epistemic uncertainty about the unknown dynamics. Under continuity and bounded energy assumptions on the system, we provide a first-of-its-kind regret bound of O(Gamma_T T) for general nonlinear systems with Gaussian process dynamics. We compare NeoRL to other baselines on several deep RL environments and empirically demonstrate that NeoRL achieves the optimal average cost while incurring the least regret.
Understanding and Enforcing Weight Disentanglement in Task Arithmetic
Task arithmetic provides an efficient, training-free way to edit pre-trained models, yet lacks a fundamental theoretical explanation for its success. The existing concept of ``weight disentanglement" describes the ideal outcome of non-interfering task composition but does not reveal its underlying cause. Crucially, what intrinsic properties of the pre-trained model (θ_0) or the task vectors (τ_t) enable this disentanglement remains underexplored. In this paper, we introduce Task-Feature Specialization (TFS), a model's ability to allocate distinct internal features to different tasks, as the fundamental principle. We first prove that TFS is a sufficient condition for weight disentanglement. More importantly, we find that TFS also gives rise to an observable geometric consequence: weight vector orthogonality. This positions TFS as the common cause for both the desired functional outcome (disentanglement) and a measurable geometric property (orthogonality). This relationship provides the key insight for our method: since the abstract TFS property is intractable to enforce directly, we can instead promote weight disentanglement by shaping its concrete geometric consequence, orthogonality. Therefore, we propose OrthoReg, a simple and effective regularization method that actively enforces an internal orthogonal structure on weight updates (ΔW) that constitute τ_t during fine-tuning. And we theoretically prove that OrthoReg promotes disentanglement. Extensive experiments demonstrate that OrthoReg consistently and significantly enhances the performance of various task arithmetic methods. Code is available at https://github.com/RL-MIND/OrthoReg{https://github.com/RL-MIND/OrthoReg}.
On Enumerating Higher Bruhat Orders Through Deletion and Contraction
The higher Bruhat orders B(n,k) were introduced by Manin-Schechtman to study discriminantal hyperplane arrangements and subsequently studied by Ziegler, who connected B(n,k) to oriented matroids. In this paper, we consider the enumeration of B(n,k) and improve upon Balko's asymptotic lower and upper bounds on |B(n,k)| by a factor exponential in k. A proof of Ziegler's formula for |B(n,n-3)| is given and a bijection between a certain subset of B(n,n-4) and totally symmetric plane partitions is proved. Central to our proofs are deletion and contraction operations for the higher Bruhat orders, defined in analogy with matroids. Dual higher Bruhat orders are also introduced, and we construct isomorphisms relating the higher Bruhat orders and their duals. Additionally, weaving functions are introduced to generalize Felsner's encoding of elements in B(n,2) to all higher Bruhat orders B(n,k).
Learning-Based Shrinking Disturbance-Invariant Tubes for State- and Input-Dependent Uncertainty
We develop a learning-based framework for constructing shrinking disturbance-invariant tubes under state- and input-dependent uncertainty, intended as a building block for tube Model Predictive Control (MPC), and certify safety via a lifted, isotone (order-preserving) fixed-point map. Gaussian Process (GP) posteriors become (1-α) credible ellipsoids, then polytopic outer sets for deterministic set operations. A two-time-scale scheme separates learning epochs, where these polytopes are frozen, from an inner, outside-in iteration that converges to a compact fixed point Z^star!subseteq!mathcal G; its state projection is RPI for the plant. As data accumulate, disturbance polytopes tighten, and the associated tubes nest monotonically, resolving the circular dependence between the set to be verified and the disturbance model while preserving hard constraints. A double-integrator study illustrates shrinking tube cross-sections in data-rich regions while maintaining invariance.
Bagging Provides Assumption-free Stability
Bagging is an important technique for stabilizing machine learning models. In this paper, we derive a finite-sample guarantee on the stability of bagging for any model. Our result places no assumptions on the distribution of the data, on the properties of the base algorithm, or on the dimensionality of the covariates. Our guarantee applies to many variants of bagging and is optimal up to a constant. Empirical results validate our findings, showing that bagging successfully stabilizes even highly unstable base algorithms.
Physics Informed Viscous Value Representations
Offline goal-conditioned reinforcement learning (GCRL) learns goal-conditioned policies from static pre-collected datasets. However, accurate value estimation remains a challenge due to the limited coverage of the state-action space. Recent physics-informed approaches have sought to address this by imposing physical and geometric constraints on the value function through regularization defined over first-order partial differential equations (PDEs), such as the Eikonal equation. However, these formulations can often be ill-posed in complex, high-dimensional environments. In this work, we propose a physics-informed regularization derived from the viscosity solution of the Hamilton-Jacobi-Bellman (HJB) equation. By providing a physics-based inductive bias, our approach grounds the learning process in optimal control theory, explicitly regularizing and bounding updates during value iterations. Furthermore, we leverage the Feynman-Kac theorem to recast the PDE solution as an expectation, enabling a tractable Monte Carlo estimation of the objective that avoids numerical instability in higher-order gradients. Experiments demonstrate that our method improves geometric consistency, making it broadly applicable to navigation and high-dimensional, complex manipulation tasks. Open-source codes are available at https://github.com/HrishikeshVish/phys-fk-value-GCRL.
